
I have a file with over 4 million lines that I want to read in, line by line, and perform actions on the first "column" of data. There may be duplicated data, and I want to make sure that I only perform these actions once, and to be able to restart the program and have it pick up where it left off.

My solution was to parse each line, store the value in a list, perform the actions only "if x not in list", and then store that list as a pickle if/when the program dies/stops/is killed.

However, the script segfaults after running for several hours, and I assume it is because I have burned through all my RAM by storing a lot of MD5 sums in the checkedFile list.

Because the file is so large (4 million+ entries) I need something that is fast for lookups; hoping there is a simple solution. Thanks!

Data looks like this: 6e5f6c90e6bf31c31283afb8dca93da8|mouse.gif|10102017

My current code looks like this:

checkedFile = []

def readline(inFile):
    print("Opening file: %s\n" % inFile)
    try:
        with open(inFile, "r") as inputFile:
            for line in inputFile: # read in file line by line
                fileHash, garbage = line.split("|", 1)
                if fileHash not in checkedFile:
                    checkedFile.append(fileHash)
                    checkit(fileHash)
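
For reference, the pickle save/restore I describe above is not shown in the snippet; a minimal sketch of what I mean (the checked.pkl filename is just a placeholder) would be:

import os
import pickle

CHECKPOINT = "checked.pkl"   # placeholder path for the pickled list

def load_checked():
    # restore the list of already-processed hashes, if a previous run saved one
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            return pickle.load(f)
    return []

def save_checked(checked):
    # dump the list so the next run can pick up where this one left off
    with open(CHECKPOINT, "wb") as f:
        pickle.dump(checked, f)
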
Bill Swearingen
  • What does checkit do? If your file is around 4MB then it should not be any problem... lists and strings in Python have only one limitation, which is RAM size, as far as I recall. I'm expecting your RAM is more in the GB range :-) – studioj Jan 20 '18 at 13:33
  • The file reading is done as an iteration, so that's unlikely to cause the memory issue. So it's either that `checkedFile` gets too big, or that `checkit` is causing the issue (whatever that does?) – match Jan 20 '18 at 13:34
  • It takes fileHash and does a REST lookup against a locally running web service that returns JSON. My assumption is that "checkedFile" is getting too big. – Bill Swearingen Jan 20 '18 at 13:36
  • Well, a good practice of "single responsibility" is to first load the full inFile into checkedFile. After that, iterate over checkedFile and do a checkit on each element. That way you know which of the two is failing. Could you please also post the full error here? – studioj Jan 20 '18 at 13:40
  • Unfortunately I don't have the full error in front of me; I restarted it, but it will take a couple of hours before it dies. – Bill Swearingen Jan 20 '18 at 13:43
  • There are at least two problems: the segmentation fault and the inefficient treatment of your `checkedFile`. Membership testing in a list is very inefficient. Replace it with `checkedFile = set()` (to make a set) and replace `checkedFile.append(fileHash)` with `checkedFile.add(fileHash)`. That way at least you won't be burning a lot of time just on the bookkeeping. – DSM Jan 20 '18 at 13:46
  • Storing the md5sum as a binary digest instead of hexdigest could potentially halve the total size of `checkedFile` - something like `binascii.unhexlify()` might help here. – match Jan 20 '18 at 13:54
  • Use divide and conquer: split the contents into chunks and process the different chunks in threads, storing the data in another database; this will require more computing power and space. – Gahan Jan 20 '18 at 13:55
  • @match Or as an `int`. That's even a bit smaller for me (30 bytes for `int` vs 33 bytes for `bytes`). – Stefan Pochmann Jan 20 '18 at 22:10

1 Answer


One of the issues I imagine you are having is the data structure you have chosen. Since checkedFile is a list, checking whether a hash is in it has O(N) time complexity. My first suggestion is to use a Python set, which is backed by a hash table and gives O(1) average-case membership tests.

checkedFile = set()

def readline(inFile):
    print("Opening file: %s\n" % inFile)
    with open(inFile, "r") as inputFile:
        for line in inputFile: # read in file line by line
            fileHash, garbage = line.split("|",1) 
            if fileHash not in checkedFile:
                checkedFile.add(fileHash)
                checkit(fileHash)

Python only really segfaults for a couple of reasons, so you almost certainly ran out of memory. I do find it odd that you segfaulted with list storage, though: a raw MD5 digest is 16 bytes, so 4 million of them is about 64 MB (a few hundred MB once Python string overhead is included), which I would imagine is nowhere near the amount of RAM in your typical computer.
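
To get a feel for the per-entry cost, and for the savings the comments suggest from storing a binary digest or an int instead of the hex string, here is a quick sketch you can run yourself (the exact byte counts vary by Python version and platform):

import binascii
import sys

hex_hash = "6e5f6c90e6bf31c31283afb8dca93da8"   # sample hash from the question

as_str = hex_hash                                # 32-character hex string
as_bytes = binascii.unhexlify(hex_hash)          # 16-byte raw digest
as_int = int(hex_hash, 16)                       # single Python int

for label, value in [("str", as_str), ("bytes", as_bytes), ("int", as_int)]:
    # sys.getsizeof reports object overhead plus payload, not set/list overhead
    print(label, sys.getsizeof(value))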

Going the persistent storage route, I would suggest sqlite. This approach cannot run out of memory, since the hashes are stored on your hard drive rather than in RAM.

import sqlite3

connection = sqlite3.connect('cache.db')
CREATE_CACHE_TABLE = '''CREATE TABLE IF NOT EXISTS cache (
                            hash TEXT PRIMARY KEY
                        )'''
SELECT_STATEMENT = 'SELECT hash FROM cache WHERE hash = ?'
INSERT_STATEMENT = 'INSERT INTO cache VALUES (?)'

with connection:
    connection.execute(CREATE_CACHE_TABLE)
with open(inFile, "r") as inputFile:
    for line in inputFile:
        fileHash, garbage = line.split("|", 1)
        with connection:  # commits (or rolls back) the transaction on exit
            if not connection.execute(SELECT_STATEMENT, (fileHash,)).fetchone():
                connection.execute(INSERT_STATEMENT, (fileHash,))
                checkit(fileHash)

As an added bonus you can also cache the results from checkit, so that you can resume the calculation if it stops for some reason. This will also benefit you because a web request has a minimum response time of around 10 ms. You mentioned that the web request returns a JSON response. I have written about this before (how to store JSON in SQLite with Python). You will need to modify the function checkit to return the JSON data.

import sqlite3
import json

def adapt_json(data):
    return (json.dumps(data, sort_keys=True)).encode()

def convert_json(blob):
    return json.loads(blob.decode())

sqlite3.register_adapter(dict, adapt_json)
sqlite3.register_converter('JSON', convert_json)

connection = sqlite3.connect('cache.db', detect_types=sqlite3.PARSE_DECLTYPES)
CREATE_CACHE_TABLE = '''CREATE TABLE IF NOT EXISTS cache (
                         hash TEXT PRIMARY KEY,
                         check_results JSON
               )'''
SELECT_STATEMENT = 'SELECT hash FROM cache WHERE hash = ?'
INSERT_STATEMENT = 'INSERT INTO cache VALUES (?, ?)'

with connection:
    connection.execute(CREATE_CACHE_TABLE)
with open(inFile, "r") as inputFile:
    for line in inputFile:
        fileHash, garbage = line.split("|", 1)
        with connection:
            if not connection.execute(SELECT_STATEMENT, (fileHash,)).fetchone():
                json_data = checkit(fileHash)
                connection.execute(INSERT_STATEMENT, (fileHash, json_data))
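
For example, reading a cached result back later might look like this (a minimal sketch; the hash is just the sample value from the question). Because the connection uses detect_types=sqlite3.PARSE_DECLTYPES and the JSON converter is registered, the check_results column already comes back as a dict:

row = connection.execute(
    'SELECT check_results FROM cache WHERE hash = ?',
    ('6e5f6c90e6bf31c31283afb8dca93da8',)
).fetchone()
if row:
    cached_result = row[0]   # already a dict, thanks to convert_json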

If you use this final approach, you can run this program in parallel on different chunks of the file and it will still work. This approach should also scale well if you end up with an even larger file.
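
A minimal sketch of what such a chunked worker could look like, under a few assumptions of my own: the worker script name and the start/end line arguments are hypothetical, INSERT OR IGNORE is used so two workers racing on the same hash cost at most one redundant lookup rather than a crash, and checkit is assumed to be your web-lookup function, defined or imported elsewhere:

import itertools
import json
import sqlite3
import sys

# checkit() is the web-lookup function from the question, assumed to be defined/imported.
# Store the dicts it returns as JSON text, as in the block above.
sqlite3.register_adapter(dict, lambda d: json.dumps(d, sort_keys=True).encode())

def process_chunk(in_file, db_path, start, end):
    # Each worker gets its own connection; timeout waits out other workers' write locks.
    connection = sqlite3.connect(db_path, timeout=30)
    with open(in_file, "r") as input_file:
        # islice restricts this worker to lines [start, end) of the file.
        for line in itertools.islice(input_file, start, end):
            file_hash, _ = line.split("|", 1)
            with connection:
                done = connection.execute(
                    'SELECT hash FROM cache WHERE hash = ?', (file_hash,)
                ).fetchone()
            if done:
                continue
            json_data = checkit(file_hash)  # the slow web request
            with connection:
                # INSERT OR IGNORE: if another worker inserted the same hash in the
                # meantime, this becomes a no-op instead of a PRIMARY KEY error.
                connection.execute(
                    'INSERT OR IGNORE INTO cache VALUES (?, ?)',
                    (file_hash, json_data),
                )

if __name__ == "__main__":
    # e.g.: python worker.py hashes.txt cache.db 0 1000000
    process_chunk(sys.argv[1], sys.argv[2], int(sys.argv[3]), int(sys.argv[4]))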

costrouc
  • Hey, thanks for taking so much time to explain this. I really appreciate your response; it was very helpful and I will implement it this way. I always learn so much on this site! – Bill Swearingen Jan 20 '18 at 14:22
  • No problem! I am sorry that the persistent-storage route is more complicated. There may be a simpler solution, but this is the simplest and most performant option that I know of. You do get the benefit that you can run this program multiple times with different parts of the file in parallel. Your bottleneck, I would imagine, is the web requests: 10 ms * 4 million = about 11 hours. – costrouc Jan 20 '18 at 14:31
  • *"a web request will take a minimum response time of 10ms"*? Where did you get that idea? – Stefan Pochmann Jan 20 '18 at 21:43
  • Here is a good reference: https://www.techempower.com/benchmarks/. For Python this holds true, but 1-2 ms is an absolute minimum for all languages. – costrouc Jan 21 '18 at 03:41