comparison Algorithm for two hashes

Question

As an initial situation, I have a sha1 hash value. I want to compare this with a file full of hash values to see if the sha1 hash value is contained in the file with the hash values.

So more exactly:

f1=sha1 #value read in
fobj = open("Hashvalues.txt", "r") #open file with hash values
for f1 in fobj:
  print ("hash value found")
else:
  print("HashValue not found")
fobj.close()

The file is very large (11.1GB)

Is there a useful algorithm to perform the search as fast as possible? The hash values in the hash file are ordered by hashes. I think comparing this line by line won't be the fastest way, will it?

EDIT: I changed my Code as follows:

f1="9bc34549d565d9505b287de0cd20ac77be1d3f2c" #value read in
with open("pwned-passwords-sha1-ordered-by-hash-v5.txt") as f:
lineList = [line.rstrip('\n\r') for line in open("pwned-passwords-sha1- 
ordered-by-hash-v5.txt")]

def binarySearch(arr, l, r, x):

while l <= r:

    mid = l + (r - l)/2;

    # Check if x is present at mid
    if arr[mid] == x:
        return mid

    # If x is greater, ignore left half
    elif arr[mid] < x:
        l = mid + 1

    # If x is smaller, ignore right half
    else:
        r = mid - 1

# If we reach here, then the element
# was not present
return -1


# Test array
arr = lineList
x = "9bc34549d565d9505b287de0cd20ac77be1d3f2c" #value read in

# Function call
result = binarySearch(arr, 0, len(arr)-1, x)

if result != -1:
   print "Element is present at index % d" % result
else:
   print "Element is not present in array"

But it doesn't work as fast as i thought. Is my Implementation correct?

EDIT2:

def binarySearch (l, r, x):

# Check base case
if r >= l:

    mid = l + (r - l)/2

    # If element is present at the middle itself
    if getLineFromFile(mid) == x:
        return mid

    # If element is smaller than mid, then it
    # can only be present in left subarray
    elif getLineFromFile(mid) > x:
        return binarySearch(l, mid-1, x)

    # Else the element can only be present
    # in right subarray
    else:
        return binarySearch(mid + 1, r, x)

else:
    # Element is not present in the array
    return -1

x = '0000000A0E3B9F25FF41DE4B5AC238C2D545C7A8:15'

def getLineFromFile(lineNumber):
 with open('testfile.txt') as f:
  for i, line in enumerate(f):
    if i == lineNumber:
     return line
else:
 print('Not 7 lines in file')
 line = None

# get last Element of List
 def tail():
  for line in open('pwned.txt', 'r'):
    pass
  else:
    print line

 ausgabetail = tail()
 #print ausgabetail
 result = binarySearch( 0, ausgabetail, x)
 if result != -1:
    print "Element is present at index % d" % result
 else:
    print "Element is not present in array"

My problem now is to get the correct index for the right side for the binary search. I pass the function (l, r, x). The left side starts at the beginning with 0. The right side should be the end of the file so the last line. I try to get that but it doesn't work. I tried to get this with the Funktion tail(). But if I print r on testing, I get the value "None". Do you have another idea here?

A reasonable approach is a binary search that is explained in this post: https://stackoverflow.com/a/5219275/3768871 — OmG, Sep 17 '19 at 08:38
You can use `file.seek` to go to a certain position in the file in O(1), backtrack to the last line break before that, read the line, and in that way do binary search on the file. See e.g. [here](https://stackoverflow.com/q/54856167/1639625). — tobias_k, Sep 17 '19 at 08:41
@OmG that is only true if the hashes in the file are alphabetically sorted. — nikoksr, Sep 17 '19 at 13:02
My Problem is also, the file is too larger to load in the memory. Till now my Process got killed before i got an answer — Felix Grötz, Sep 17 '19 at 13:17

Ravi Chandak · Answer 1 · 2019-09-18T02:41:42.743

0

Looking at the code I see that you are still reading all the lines from the file, this indeed is the bottleneck. It's not the binary search. Assuming that the hashes are sorted You can just read the number of lines in the file. Then just perform the binary search. Instead of reading the entire file u can use Seek to reach a particular line in the file that way you will only be reading log(n) number of lines. That should increase the speed.

Example

def binarySearch(l, r, x): 
....
#change arr[mid] with getLineFromFile(mid)


....

def getLineFromFile(lineNumber):
with open('xxx.txt') as f:
for i, line in enumerate(f):
    if i == lineNumber:
        return line
else:
    print('Not 7 lines in file')
    line = None

edited Sep 18 '19 at 02:41

answered Sep 17 '19 at 14:48

Ravi Chandak

113
1
9

could you please make a little example? – Felix Grötz Sep 17 '19 at 19:15
Thanks that helps me a lot! :). But i got one other Problem: how do i set the right index? Before that I took the array length -1. How can I do that now? – Felix Grötz Sep 18 '19 at 05:04
That should not be a concern there are ways to count number of lines in a file in efficient manner https://stackoverflow.com/questions/9629179/python-counting-lines-in-a-huge-10gb-file-as-fast-as-possible – Ravi Chandak Sep 19 '19 at 03:01
thanks again. I've tried it a lot now, but it's not working. Did you give me an example code how it could work? Even with your link I can't get it right. It's frustrating for me, sorry :( – Felix Grötz Sep 24 '19 at 15:04

comparison Algorithm for two hashes

1 Answers1