(Cryptographic) Hash Functions

What is a Cryptographic Hash Function?

Before answering that, let’s answer what a hash function is:

So, can the operation mod10\mod 10 be a hash function? Yes! We (computer science people) usually denominate modulo operation with the symbol %. Coming back to our example, our examples can be of arbitrary lengths, but the output will always be 1 digit.

Examples:

  • 1 % 10 = 1
  • 234982309423890238402 % 10 = 2
  • 8947548923756284395723489052347589034750234957823 % 10 = 3

But this is not a good hash function, especially if we want to use it for cryptographic purposes.

Why? Because it does not fulfil the following criteria of cryptographic hash functions (and I will explain each criterion’s importance as well).


Cryptographic Hash Function Properties:

  1. Non-reversibility (one-way function): this one is pretty self-explanatory, if you read encryption schemes. We don’t want others to be able to retrieve the input from the output. We already know that this is possible from other examples of one-way functions (discrete logarithm problem, prime factorization, etc.)

  2. Diffusion (avalanche effect): A change in one bit of the original input should result in at least half the bits of the output. In plain English, a small change in input → huge change in output.

    A comparison using SHA1 hash algorithm:
    • input: “firetrucks are actually water trucks”
      output: “82B874F755C503089E4C430E116D0371209FCE68”

    • input: “Firetrucks are actually water trucks”
      output: “ADD3FAC90A524B938C6266E8FC5B1AFA4CE9CBF1”

    See? Same sentence, same letters… I only capitalized the first letter, and the resulting hash is completely different.

  3. Determinism: Same input → same output. More verbally, if you provide the same input to the algorithm, it will always generate the same output. If you are thinking “how could it be otherwise?”, it could. For example, in each call, the hash function could have been utilizing a random number generator to modify the input. This would result in different outputs for the same input per function call.

  4. Collision resistance: Our sharp-witted readers may have already figured out this, but it’s better to be clear. Remember our input space is infinite (arbitrary length inputs), but output space is limited (fixed-length outputs). This is the crux of the problem. We will not have enough outputs per input. Meaning that, some inputs HAVE TO share the same output. For math people, yes, this is the pigeon-hole principle. Although we cannot escape from this, we want it to be hard to find two different that are mapped to the same output. Why? Because so many protocols and algorithms in cryptology is depending on the security of hash functions. If it was easy to find another input that results in the same output, you could easily fool the people.

  5. Non-predictable: This is somewhat related to one-way property. Basically, without running the algorithm, we shouldn’t be able to predict what the output will be.

But how do hash functions work actually?

You probably don’t know the internals of the hash function. Neither do I :)

Unless you are someone specialized in hash functions, you don’t need to know how hash functions work. I tried once, it was horrible.

Think hash functions as black boxes, you gave them input, they gave you output. And we have defined the properties of these black boxes. Rest is superfluous.