Hash Functions Code Tutorial - Austin Vernon's Blog

Hash Functions in Python

Python has built-in hash functions. We will use them to learn about how hash functions work.

By default, Python's hash function adds a random number to hash inputs to prevent attackers from taking advantage of known collisions. Hashing the same phrase will not return the same value if you call the program at different times. When calling the python script in the command line, add PYTHONHASHSEED=0 before the command to disable this feature. Example: "PYTHONHASHSEED=0 python3 /users/austinvernon/desktop/scripts/python_hash_function.py"

General Hash Functions

Verify basic features of hash functions.

Hashing a given input always results in the same hash value.
Hash values are fast to compute.

Different inputs have different hash values.

import time

start = time.time()
print(hash("susan"))
print(hash("charlie"))
print(hash("susan"))
end = time.time()
print("elapsed time: ",end-start," seconds")

Output:

    ">-5124793730868099537"
    ">-7631605457190692542"
    ">-5124793730868099537"
    ">elapsed time: 6.99x10^-5" seconds

The hash values for "susan" are the same, while "charlie" is different. The elapsed time to run all three hashes was a fraction of a second.

Data Retrieval

Python dictionaries use hash tables to store values. We will compare a list lookup (linear search) vs. a dictionary lookup (hash table).

The following script has a short list and dictionary:

    import time

    dictionary = {"charlie":3, "susan":4, "beth":5 }
    list = ["sally", "bob", "joe", "charlie", "susan", "beth"]

    #lookups
    start = time.time()
    print(dictionary["beth"])
    end = time.time()
    print("Dictionary Lookup Time: ",end-start, " seconds")

    start = time.time()
    print(list.index("beth"))
    end = time.time()
    print("List Lookup Time: ",end-start, " seconds")

When I run the script several times, list lookup is faster. The overhead of hashing the input overwhelms the speed advantages of a hash table lookup for short data sets.

Let's try it again with a longer list. I added code to generate the new list (remember to indent the loop body if copy/paste breaks it):

    import time
    import random
    import string

    #set variables
    dictionary = {}
    list = []
    i=0

    #loop to create random list and dictionary
    while i < 10000:  #change list size here
        letters = string.ascii_lowercase
        phrase = ''.join(random.choice(letters) for j in range(10))
        list.append(phrase)
        dictionary[phrase] = i
        i += 1

    #append our targets to end of list and dictionary
    list.append("beth")
    dictionary["beth"] = i

    #lookups
    start = time.time()
    print(dictionary["beth"])
    end = time.time()
    print("Dictionary Lookup Time: ",end-start, " seconds")

    start = time.time()
    print(list.index("beth"))
    end = time.time()
    print("List Lookup Time: ",end-start, " seconds")

I increased the list size by 10x incrementally until I found a crossover at 10,000. As the list size increases, the lookup time for the dictionary remains the same while the list lookup time increases. Put in different x values for "i < x" to change the list size and explore the relationship.

Integrity

The basic idea of checking one file or phrase is inherent to general hash functions, as shown in the "General Hash Functions" section.

Merkle Trees

We will hash a set of data blocks into a Merkle Tree, then change one block to see how it propagates through the tree. For reference, the Merkle Tree structure:

Source: Wikipedia.com

The first tree:

    #Pair 1
    #data block 1
    hash_value_00 = hash('bob')
    print("Hash Value 00: ",hash_value_00)
    #data block 2
    hash_value_01 = hash('alice')
    print("Hash Value 01: ",hash_value_01)

    #Pair 2
    #data block 3
    hash_value_10 = hash('sam')
    print("Hash Value 10: ",hash_value_10)
    #data block 4
    hash_value_11 = hash('charlie')
    print("Hash Value 11: ",hash_value_11)

    #Level2 Hash Values
    hash_value_0 = hash(str(hash_value_00) + str(hash_value_01))
    print("Hash Value 0: ",hash_value_0)
    hash_value_1 = hash(str(hash_value_10) + str(hash_value_11))
    print("Hash Value 1: ",hash_value_1)

    #root hash/top hash/ Merkle Root
    top_hash_value = hash(str(hash_value_1) + str(hash_value_0))

    print("Top Hash Value: ",top_hash_value)

Output:

        Hash Value 00:  -4208795585541223758
        Hash Value 01:  8512114763159807471
        Hash Value 10:  1019084245346364989
        Hash Value 11:  -7631605457190692542
        Hash Value 0:  -4765654072729701387
        Hash Value 1:  -4212900652853811053
        Top Hash Value:  -8281923323636112442

Let's use the same code, except change the string in "data block 2" from "alice" to "alicia".

Output:

        Hash Value 00:  -4208795585541223758
        Hash Value 01:  -4059417228946341628
        Hash Value 10:  1019084245346364989
        Hash Value 11:  -7631605457190692542
        Hash Value 0:  2860753682179309060
        Hash Value 1:  -4212900652853811053
        Top Hash Value:  3683710842423746868

To find the bad data block given the top hash value, you request the next layer of the hash tree. Since only "hash value 0" differs, you can skip checking the "hash value 1" branch of the tree. Next, request the two hash values that formed "hash value 0". You will realize that "hash value 01" is the culprit. The Merkle Tree makes this hunt faster because you can skip parts of the tree. Searching a Merkle Tree is a variation of the binary search topic in computer science. Like using hash tables, the larger the original data set, the more advantage Merkle Trees have.

Crytographic Hash Functions

One of the most commonly used cryptographic hash functions is SHA-256, part of the SHA-2 family of hash functions. Python has a library for using SHA-2 functions, but it is not as user-friendly as calling the simple "hash()" function.

The SHA256 function requires inputs to be encoded since it does not directly hash plaintext. Outputs are converted back into a human-readable format.

    import hashlib

    input = "sally sells seashells"

    # encoding then hashing
    result = hashlib.sha256(input.encode())

    # To print, we have to convert the sha256 object to hexadecimal
    print(result.hexdigest())

Result:

      >bc1161cc7e1107e890299ebe4df1495c47d2eef68f46d8d304f68c16b785881b

Notice the hash value is much longer and contains characters other than numbers. The range of possible values helps prevent collisions.

Proof of Work

Let's execute a proof-of-work exercise similar to what Bitcoin uses. Bitcoin uses SHA-256 as its hash function. The idea is to find a hash value with a certain number of leading zeros. The input to the hash function is the hash value of all the previous blocks plus a nonce. A nonce is a random number used only once. Here is the code (remember to indent the loop body if copy/paste breaks it):

    import hashlib

    transaction = "Alice sends 10 AliceCoins to Bob"

    transaction_hash = hashlib.sha256(transaction.encode())

    print("transaction hash: ",transaction_hash.hexdigest())

    #loop to hash the transaction hash with nonces, using i as nonce
    i = 0
    while i < 11:
        pow_input = transaction_hash.hexdigest() + str(i)
        pow_hash = hashlib.sha256(pow_input.encode())
        print("PoW Hash ",i, ": ",pow_hash.hexdigest())
        i += 1

Result:

    transaction hash:  641018871210776bb36695ab958f1b3cb4a6a51666b68c4a8d3cf8b397299e52
    PoW Hash  0 :  8370f1ce2d2e164b64d12bfa735c09314a04e8d5a55f1a5228dc8b911da86a84
    PoW Hash  1 :  dd154ce63969648d33479c3a2ffc00b2c1e43a55e71d250d327abf2afb6a1045
    PoW Hash  2 :  9ae7d85bed099e77c55743e598c0cc56a0bf712262a4c57e857cd4b2f08f9ab8
    PoW Hash  3 :  494b0bab0cb4da190bc368d6cd6bcf0968ad472128148c93b0366a042615c20b
    PoW Hash  4 :  81c297207b278a53afd70515ecf1c5a924a26d5fb8cbbcbd87972c229f17aea3
    PoW Hash  5 :  96b246533115ac15684b86e6ee2fe7e8a429a41a127705d00af6439b6ed36ad4
    PoW Hash  6 :  1a7201a6e66100db307cc163afb2e0fa77ece8f0dcdf2006936a0b76072c2cd0
    PoW Hash  7 :  5dc5b0742de8ef551fd5656804148f40eebe8a52bde9663445ab687e9b211c8f
    PoW Hash  8 :  33f294872085ebc862b72875b94cf6a9cd7dad0e84e74794dfed1e9f6f91d4b5
    PoW Hash  9 :  7a2ee26384f27546cbb6ddd3a87de4307c74ce357db9b786ef4c46321217776c
    PoW Hash  10 :  00d5379946cba7b3a1735947f1a87d8bdf75bf47f85407c4b520bd5f74f0a4a4

We get our first leading zero using the nonce 10 for this particular input hash value. You can change the transaction, nonces, or loop iterations to see different results.