
All, I am rather new and am looking for assistance. I need to perform a string search on a data set that is about 20 GB compressed. I have an eight-core Ubuntu box with 32 GB of RAM that I can use to crunch through this, but I am not able to determine or implement the best possible code for such a task. Would threading or multiprocessing be best here? Please provide code samples. Thank you. Please see my current code:

#!/usr/bin/python
import sys
logs = []
iplist = []

logs = open(sys.argv[1], 'r').readlines()
iplist = open(sys.argv[2], 'r').readlines()
print "+Loaded {0} entries for {1}".format(len(logs), sys.argv[1])
print "+Loaded {0} entries for {1}".format(len(iplist), sys.argv[2])

for a in logs:
    for b in iplist:
        if a.lower().strip() in b.lower().strip():
            print "Match! --> {0}".format(a.lower().strip())
noobie
    I'm not sure Python is the tool for your task. Why don't you just load those in sqlite or something? – Thomas Orozco Apr 20 '13 at 03:04
  • I am sorry but can you please explain your reasoning? I really thought this would be a good use case for Python. – noobie Apr 20 '13 at 03:49
  • Well, this would be easy to do in SQL, but certainly possible in Python. Are there only two giant files, or lots of smaller files? – reptilicus Apr 20 '13 at 04:03
  • Thanks! The huge file would be the "logs" file (currently gzipped). Decompressed, I am thinking it would be over 200 GB. The second file, "iplist", is a small txt file with about 200 IP addresses. I did not enter it in the code to keep it cleaner for this post, but currently the code opens a handle for the gzipped file as follows: `import sys, gzip` then `logs = gzip.open(sys.argv[1], 'rb').readlines()` – noobie Apr 20 '13 at 04:13
  • One single giant file, 20 GB? Readlines will read the entire file, so you either want to use a [size hint](http://docs.python.org/2/library/stdtypes.html?highlight=readlines#file.readlines), or `readline(size)` or `read(size)`. -- [For multiprocessing vs threading](http://stackoverflow.com/questions/3044580/multiprocessing-vs-threading-python/3046201#3046201) – ninMonkey Apr 20 '13 at 05:52
  • Do not use `readlines()`, simply iterate over the files: `logs = open(...); for a in logs:`. Also, you could call `lower().strip()` on all `b`s outside the loop (`iplist = open(...); iplist = [b.lower().strip() for b in iplist]`); this will halve the amount of work done in the inner loop. Regarding multithreading/multiprocessing, take a look at the [`Queue`](http://docs.python.org/2/library/queue.html) module. – Bakuriu Apr 20 '13 at 07:20

1 Answer


I'm not sure if multithreading can help you, but your code has a problem that is bad for performance: reading the logs in one go consumes incredible amounts of RAM and thrashes your cache. Instead, open the file and read it sequentially; after all, you are making a sequential scan, aren't you? Then, don't repeat any operations on the same data. In particular, the iplist doesn't change, but for every log entry you are repeatedly calling b.lower().strip(). Do that once, after reading the file with the IP addresses.

In short, it looks like this:

with open(..) as f:
    iplist = [l.lower().strip() for l in f]

with open(..) as f:
    for l in f:
        l = l.lower().strip()
        if l in iplist:
            print('match!')

You can improve performance even more by using a set for iplist, because lookups in a set are faster when there are many elements. That said, I'm assuming that the log file is huge, while iplist will remain relatively small.
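
For example, here is a minimal sketch of that combination, also streaming the gzipped log directly (as mentioned in the comments) rather than decompressing it first; the argument order follows the question's script, so adapt it as needed:

import gzip
import sys

# Sketch only: argv[1] is the gzipped log, argv[2] is the IP list, as in the question.
with open(sys.argv[2]) as f:
    ips = set(line.lower().strip() for line in f)   # normalize once; a set gives fast lookups

with gzip.open(sys.argv[1], 'rb') as f:             # stream the compressed log line by line
    for line in f:
        line = line.lower().strip()
        if line in ips:                              # whole-line match, as in the loop above
            print(line)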

BTW: You could improve performance with multiple CPUs by using one to read the file and the other to scan for matches, but I guess the above will already give you a sufficient performance boost.
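
If you do want to try that split, a rough multiprocessing sketch (untested, same hypothetical argument order as above) could look like this: one process decompresses and feeds lines into a queue, the other does the matching. Whether it actually helps depends on how much time is spent decompressing versus matching:

import gzip
import multiprocessing
import sys

def scanner(queue, ips):
    # Consume lines until the None sentinel arrives, printing any that match.
    for line in iter(queue.get, None):
        if line in ips:
            print(line)

if __name__ == '__main__':
    with open(sys.argv[2]) as f:
        ips = set(l.lower().strip() for l in f)

    queue = multiprocessing.Queue(maxsize=10000)     # bounded so the reader can't race too far ahead
    worker = multiprocessing.Process(target=scanner, args=(queue, ips))
    worker.start()

    with gzip.open(sys.argv[1], 'rb') as f:          # reader: decompress and feed the worker
        for line in f:
            queue.put(line.lower().strip())

    queue.put(None)                                  # tell the scanner to finish
    worker.join()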

Ulrich Eckhardt