Hash Functions in Python
Python has built-in hash functions. We will use them to learn about how hash functions work.
By default, Python's hash function adds a random number to hash inputs to prevent attackers from taking advantage of known collisions. Hashing the same phrase will not return the same value if you call the program at different times. When calling the python script in the command line, add PYTHONHASHSEED=0 before the command to disable this feature. Example: "PYTHONHASHSEED=0 python3 /users/austinvernon/desktop/scripts/python_hash_function.py"
General Hash Functions
Verify basic features of hash functions.
-
Hashing a given input always results in the same hash value.
-
Hash values are fast to compute.
-
Different inputs have different hash values.
import time
start = time.time()
print(hash("susan"))
print(hash("charlie"))
print(hash("susan"))
end = time.time()
print("elapsed time: ",end-start," seconds")
Output:
">-5124793730868099537"
">-7631605457190692542"
">-5124793730868099537"
">elapsed time: 6.99x10^-5" seconds
The hash values for "susan" are the same, while "charlie" is different. The elapsed time to run all three hashes was a fraction of a second.
Data Retrieval
Python dictionaries use hash tables to store values. We will compare a list lookup (linear search) vs. a dictionary lookup (hash table).
The following script has a short list and dictionary:
import time
dictionary = {"charlie":3, "susan":4, "beth":5 }
list = ["sally", "bob", "joe", "charlie", "susan", "beth"]
#lookups
start = time.time()
print(dictionary["beth"])
end = time.time()
print("Dictionary Lookup Time: ",end-start, " seconds")
start = time.time()
print(list.index("beth"))
end = time.time()
print("List Lookup Time: ",end-start, " seconds")
When I run the script several times, list lookup is faster. The overhead of hashing the input overwhelms the speed advantages of a hash table lookup for short data sets.
Let's try it again with a longer list. I added code to generate the new list (remember to indent the loop body if copy/paste breaks it):
import time
import random
import string
#set variables
dictionary = {}
list = []
i=0
#loop to create random list and dictionary
while i < 10000: #change list size here
letters = string.ascii_lowercase
phrase = ''.join(random.choice(letters) for j in range(10))
list.append(phrase)
dictionary[phrase] = i
i += 1
#append our targets to end of list and dictionary
list.append("beth")
dictionary["beth"] = i
#lookups
start = time.time()
print(dictionary["beth"])
end = time.time()
print("Dictionary Lookup Time: ",end-start, " seconds")
start = time.time()
print(list.index("beth"))
end = time.time()
print("List Lookup Time: ",end-start, " seconds")
I increased the list size by 10x incrementally until I found a crossover at 10,000. As the list size increases, the lookup time for the dictionary remains the same while the list lookup time increases. Put in different x values for "i < x" to change the list size and explore the relationship.
Integrity
The basic idea of checking one file or phrase is inherent to general hash functions, as shown in the "General Hash Functions" section.
Merkle Trees
We will hash a set of data blocks into a Merkle Tree, then change one block to see how it propagates through the tree. For reference, the Merkle Tree structure:
Source: Wikipedia.com
The first tree:
#Pair 1
#data block 1
hash_value_00 = hash('bob')
print("Hash Value 00: ",hash_value_00)
#data block 2
hash_value_01 = hash('alice')
print("Hash Value 01: ",hash_value_01)
#Pair 2
#data block 3
hash_value_10 = hash('sam')
print("Hash Value 10: ",hash_value_10)
#data block 4
hash_value_11 = hash('charlie')
print("Hash Value 11: ",hash_value_11)
#Level2 Hash Values
hash_value_0 = hash(str(hash_value_00) + str(hash_value_01))
print("Hash Value 0: ",hash_value_0)
hash_value_1 = hash(str(hash_value_10) + str(hash_value_11))
print("Hash Value 1: ",hash_value_1)
#root hash/top hash/ Merkle Root
top_hash_value = hash(str(hash_value_1) + str(hash_value_0))
print("Top Hash Value: ",top_hash_value)
Output:
Hash Value 00: -4208795585541223758
Hash Value 01: 8512114763159807471
Hash Value 10: 1019084245346364989
Hash Value 11: -7631605457190692542
Hash Value 0: -4765654072729701387
Hash Value 1: -4212900652853811053
Top Hash Value: -8281923323636112442
Let's use the same code, except change the string in "data block 2" from "alice" to "alicia".
Output:
Hash Value 00: -4208795585541223758
Hash Value 01: -4059417228946341628
Hash Value 10: 1019084245346364989
Hash Value 11: -7631605457190692542
Hash Value 0: 2860753682179309060
Hash Value 1: -4212900652853811053
Top Hash Value: 3683710842423746868
To find the bad data block given the top hash value, you request the next layer of the hash tree. Since only "hash value 0" differs, you can skip checking the "hash value 1" branch of the tree. Next, request the two hash values that formed "hash value 0". You will realize that "hash value 01" is the culprit. The Merkle Tree makes this hunt faster because you can skip parts of the tree. Searching a Merkle Tree is a variation of the binary search topic in computer science. Like using hash tables, the larger the original data set, the more advantage Merkle Trees have.
Crytographic Hash Functions
One of the most commonly used cryptographic hash functions is SHA-256, part of the SHA-2 family of hash functions. Python has a library for using SHA-2 functions, but it is not as user-friendly as calling the simple "hash()" function.
The SHA256 function requires inputs to be encoded since it does not directly hash plaintext. Outputs are converted back into a human-readable format.
import hashlib
input = "sally sells seashells"
# encoding then hashing
result = hashlib.sha256(input.encode())
# To print, we have to convert the sha256 object to hexadecimal
print(result.hexdigest())
Result:
>bc1161cc7e1107e890299ebe4df1495c47d2eef68f46d8d304f68c16b785881b
Notice the hash value is much longer and contains characters other than numbers. The range of possible values helps prevent collisions.
Proof of Work
Let's execute a proof-of-work exercise similar to what Bitcoin uses. Bitcoin uses SHA-256 as its hash function. The idea is to find a hash value with a certain number of leading zeros. The input to the hash function is the hash value of all the previous blocks plus a nonce. A nonce is a random number used only once. Here is the code (remember to indent the loop body if copy/paste breaks it):
import hashlib
transaction = "Alice sends 10 AliceCoins to Bob"
transaction_hash = hashlib.sha256(transaction.encode())
print("transaction hash: ",transaction_hash.hexdigest())
#loop to hash the transaction hash with nonces, using i as nonce
i = 0
while i < 11:
pow_input = transaction_hash.hexdigest() + str(i)
pow_hash = hashlib.sha256(pow_input.encode())
print("PoW Hash ",i, ": ",pow_hash.hexdigest())
i += 1
Result:
transaction hash: 641018871210776bb36695ab958f1b3cb4a6a51666b68c4a8d3cf8b397299e52
PoW Hash 0 : 8370f1ce2d2e164b64d12bfa735c09314a04e8d5a55f1a5228dc8b911da86a84
PoW Hash 1 : dd154ce63969648d33479c3a2ffc00b2c1e43a55e71d250d327abf2afb6a1045
PoW Hash 2 : 9ae7d85bed099e77c55743e598c0cc56a0bf712262a4c57e857cd4b2f08f9ab8
PoW Hash 3 : 494b0bab0cb4da190bc368d6cd6bcf0968ad472128148c93b0366a042615c20b
PoW Hash 4 : 81c297207b278a53afd70515ecf1c5a924a26d5fb8cbbcbd87972c229f17aea3
PoW Hash 5 : 96b246533115ac15684b86e6ee2fe7e8a429a41a127705d00af6439b6ed36ad4
PoW Hash 6 : 1a7201a6e66100db307cc163afb2e0fa77ece8f0dcdf2006936a0b76072c2cd0
PoW Hash 7 : 5dc5b0742de8ef551fd5656804148f40eebe8a52bde9663445ab687e9b211c8f
PoW Hash 8 : 33f294872085ebc862b72875b94cf6a9cd7dad0e84e74794dfed1e9f6f91d4b5
PoW Hash 9 : 7a2ee26384f27546cbb6ddd3a87de4307c74ce357db9b786ef4c46321217776c
PoW Hash 10 : 00d5379946cba7b3a1735947f1a87d8bdf75bf47f85407c4b520bd5f74f0a4a4
We get our first leading zero using the nonce 10 for this particular input hash value. You can change the transaction, nonces, or loop iterations to see different results.
Hash Functions Code Tutorial
2021 June 10 Twitter Substack See all postsHash Functions in Python
Python has built-in hash functions. We will use them to learn about how hash functions work.
By default, Python's hash function adds a random number to hash inputs to prevent attackers from taking advantage of known collisions. Hashing the same phrase will not return the same value if you call the program at different times. When calling the python script in the command line, add PYTHONHASHSEED=0 before the command to disable this feature. Example: "PYTHONHASHSEED=0 python3 /users/austinvernon/desktop/scripts/python_hash_function.py"
General Hash Functions
Verify basic features of hash functions.
Hashing a given input always results in the same hash value.
Hash values are fast to compute.
Different inputs have different hash values.
Output:
The hash values for "susan" are the same, while "charlie" is different. The elapsed time to run all three hashes was a fraction of a second.
Data Retrieval
Python dictionaries use hash tables to store values. We will compare a list lookup (linear search) vs. a dictionary lookup (hash table).
The following script has a short list and dictionary:
When I run the script several times, list lookup is faster. The overhead of hashing the input overwhelms the speed advantages of a hash table lookup for short data sets.
Let's try it again with a longer list. I added code to generate the new list (remember to indent the loop body if copy/paste breaks it):
I increased the list size by 10x incrementally until I found a crossover at 10,000. As the list size increases, the lookup time for the dictionary remains the same while the list lookup time increases. Put in different x values for "i < x" to change the list size and explore the relationship.
Integrity
The basic idea of checking one file or phrase is inherent to general hash functions, as shown in the "General Hash Functions" section.
Merkle Trees
We will hash a set of data blocks into a Merkle Tree, then change one block to see how it propagates through the tree. For reference, the Merkle Tree structure:
Source: Wikipedia.com
The first tree:
Output:
Let's use the same code, except change the string in "data block 2" from "alice" to "alicia".
Output:
To find the bad data block given the top hash value, you request the next layer of the hash tree. Since only "hash value 0" differs, you can skip checking the "hash value 1" branch of the tree. Next, request the two hash values that formed "hash value 0". You will realize that "hash value 01" is the culprit. The Merkle Tree makes this hunt faster because you can skip parts of the tree. Searching a Merkle Tree is a variation of the binary search topic in computer science. Like using hash tables, the larger the original data set, the more advantage Merkle Trees have.
Crytographic Hash Functions
One of the most commonly used cryptographic hash functions is SHA-256, part of the SHA-2 family of hash functions. Python has a library for using SHA-2 functions, but it is not as user-friendly as calling the simple "hash()" function.
The SHA256 function requires inputs to be encoded since it does not directly hash plaintext. Outputs are converted back into a human-readable format.
Result:
Notice the hash value is much longer and contains characters other than numbers. The range of possible values helps prevent collisions.
Proof of Work
Let's execute a proof-of-work exercise similar to what Bitcoin uses. Bitcoin uses SHA-256 as its hash function. The idea is to find a hash value with a certain number of leading zeros. The input to the hash function is the hash value of all the previous blocks plus a nonce. A nonce is a random number used only once. Here is the code (remember to indent the loop body if copy/paste breaks it):
Result:
We get our first leading zero using the nonce 10 for this particular input hash value. You can change the transaction, nonces, or loop iterations to see different results.