I wrote a method that generates hashes and returns them in a list of dictionaries. It works well with a small number of records, for example 100, but it takes around 17 minutes to generate hashes for 10000 records.

How can the following code be improved to process 10000 records faster (in a couple of minutes)? Would multithreading help?

def generate_hashes(self, records):

       def get_year(date):
           return str(date.year)

       def create_hash(string):
           md5 = hashlib.md5()
           md5.update(string)
           return md5.hexdigest()

       result = []
       for rec in records:
           rec_dict = {}
           if rec.dob is not None and rec.priv_number is not None:
               org_hash = "{0}_{1}".format(create_hash(rec.priv_number), get_year(rec.dob))
               group_hash = create_hash("{0}_{1}".format(create_hash(org_hash), '144C5A0013EDE1B0ACF585'))
               rec_hash = group_hash
               print("Generate hash for %s rec." % rec.pub_number.pub_number)

           else:
               rec_hash = '0a'*16
               print("There are not enough data to create hash for rec %s." % rec.pub_number.pub_number)

           rec_dict.update({'hash': rec_hash, 'pub_number': rec.pub_number.pub_number})
           result.append(rec_dict)

       return result
srgbnd
  • You may consider looking at this post - http://softwareengineering.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed – Wasi Ahmad Nov 08 '16 at 21:57
  • `multithreading` won't help due to the GIL. I'm confused about `create_hash("{0}_{1}".format(create_hash(org_hash)` though. Is it right to hash `org_hash`, then to hash again? Especially since `org_hash` is the result of `create_hash(rec.priv_number)` – roganjosh Nov 08 '16 at 22:02
  • small improvements not related to hash computation: create hash directly: `md5 = hashlib.md5(string)`. And don't create `rec_dict` empty to update it afterwards. Just do `rec_dict = {'hash': rec_hash, 'pub_number': rec.pub_number.pub_number}` – Jean-François Fabre Nov 08 '16 at 22:02
  • @roganjosh I have a requirement to hash twice. An API I use requires it. – srgbnd Nov 08 '16 at 22:09
  • But you're hashing 3 times? `create_hash("{0}_{1}".format(create_hash(org_hash)`. `org_hash` is already hashed, then you hash it again when you are using `format`, then hash the whole string again. – roganjosh Nov 08 '16 at 22:12
  • @roganjosh sorry, three times. It is the requirement. – srgbnd Nov 08 '16 at 22:33
  • Depending on whether you need a list or just an iterable, you can use `yield rec_dict` instead of `result.append(rec_dict)`, so you need neither the list nor the return statement. Generators tend to be more efficient than appending to a list. [yield explained](http://stackoverflow.com/a/231855/1562285) [yield explained](http://stackoverflow.com/q/1756096/1562285) – Maarten Fabré Nov 08 '16 at 22:47
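
For reference, the suggestions from the comments above could be combined roughly as follows. This is only a sketch, not a confirmed fix for the 17 minutes: it assumes Python 3 and that `priv_number` is a text string (hence the UTF-8 encoding, which is my assumption), it drops the per-record `print` calls, and `Pub`, `Rec`, `SALT`, `PLACEHOLDER_HASH` and `iter_hashes` are hypothetical names standing in for the real record objects and method.

```python
import hashlib
from collections import namedtuple
from datetime import date

SALT = '144C5A0013EDE1B0ACF585'   # constant copied from the question
PLACEHOLDER_HASH = '0a' * 16      # fallback value used in the question


def create_hash(text):
    # Pass the data straight to the constructor instead of calling update().
    # Encoding as UTF-8 is an assumption; hashlib needs bytes on Python 3.
    return hashlib.md5(text.encode('utf-8')).hexdigest()


def iter_hashes(records):
    """Yield one dict per record instead of appending to a list."""
    for rec in records:
        if rec.dob is not None and rec.priv_number is not None:
            org_hash = "{0}_{1}".format(create_hash(rec.priv_number), rec.dob.year)
            rec_hash = create_hash("{0}_{1}".format(create_hash(org_hash), SALT))
        else:
            rec_hash = PLACEHOLDER_HASH
        # Build the dict in one step rather than creating it empty and updating it.
        yield {'hash': rec_hash, 'pub_number': rec.pub_number.pub_number}


# Hypothetical stand-ins for the real record objects.
Pub = namedtuple('Pub', 'pub_number')
Rec = namedtuple('Rec', 'dob priv_number pub_number')

sample = [
    Rec(date(1990, 5, 17), 'A123', Pub('P-1')),
    Rec(None, 'B456', Pub('P-2')),   # missing dob, so it gets the placeholder
]
hashes = list(iter_hashes(sample))   # wrap in list() only if a list is needed
```

Three MD5 calls on short strings per record are cheap, so it is probably worth timing the loop on plain in-memory objects like `sample` first; if that runs quickly, the time is likely being spent elsewhere, for example in fetching `rec.pub_number` or in the `print` calls.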
