
I'm parsing a large number of delimited text files (over 10k files, between 1GB and 20GB each) in Python and extracting the lines whose value in a particular column appears in a list of over 7k items. My current code looks something like this:

if columns[idx] in ['1234567','2234567'...(add another 7k items here)]:

The list of possible items is over 7k long and I've found no patterns that would make the membership test more efficient. For example, it would be more efficient if there were a rule such as 'if the first 2 characters are '12' then include the item', but I don't see any patterns like that. Does anyone know of a more efficient way of doing this? You may wonder why I don't import the data into a database. There are two reasons: one, I don't have a database with anywhere near enough space to hold the data, and two, the files use a non-standard column delimiter ('\x01').
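
A minimal sketch of the approach described above, just to make the discussion concrete (names such as `KEEP_IDS`, `COLUMN_IDX`, and the file paths are illustrative, not from my actual code):

```python
KEEP_IDS = ['1234567', '2234567']  # ... the real list has over 7k items

COLUMN_IDX = 3  # index of the column being tested

with open('input.txt', 'rb') as infile, open('matches.txt', 'wb') as outfile:
    for line in infile:
        columns = line.rstrip('\r\n').split('\x01')  # non-standard delimiter
        if columns[COLUMN_IDX] in KEEP_IDS:  # O(n) scan of the list on every line
            outfile.write(line)
```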

Not surprisingly, the PyCharm performance profiler shows this line as a major bottleneck in my code. I've tested removing the check and writing all the data out to a text file that I could then load piecemeal into a database (where I could filter it), but the `in` test filters out 90% of the data, and writing all of it out is slower than filtering with `in`.

I'm using Python 2.7.12 on Windows 10. I also have an Ubuntu 12.04.5 machine available to use (with Python 2.7.12).

Dwayne Driskill
  • Can you not just convert `['1234567','2234567'...(add another 7k items here)]` to a `set` before checking for membership? You needn't lose the initial list if you need duplicates; just bind the `set` to another name. [Set lookup is O(1)](https://stackoverflow.com/questions/3949310/how-is-set-implemented). – roganjosh Jan 18 '18 at 22:13
  • Regarding speed: what do the text files look like? If you can process them line-wise, you can of course use multiprocessing to work on chunks in parallel. You could also slice the list/array and dispatch it to a parallel process (see the sketch after these comments). – Stefan Falk Jan 18 '18 at 22:16
  • As @roganjosh suggests, the simplest thing is to convert from a list to a faster-lookup data structure such as a set or dict. If this weren't Python, or even when it is if the list/set/dict-key items are constant or very large, you might want to build a specialized lookup (e.g., a Bloom filter for O(1) probabilistic membership). – torek Jan 18 '18 at 22:17
  • I'll try converting the list to a set. I should have mentioned I'm relatively new to Python so I may be missing obvious solutions. – Dwayne Driskill Jan 18 '18 at 22:39
  • Say you initially had `my_list = ['1234567','2234567'...(add another 7k items here)]`. Add another line of code under that: `my_set = set(my_list)`. Then change your lookup to `if columns[idx] in my_set:` (see the sketch after these comments). The difference in speed will be _huge_. – roganjosh Jan 18 '18 at 22:41
  • Possible duplicate of [Tuple or list when using 'in' in an 'if' clause?](https://stackoverflow.com/questions/25368337/tuple-or-list-when-using-in-in-an-if-clause) – roganjosh Jan 18 '18 at 22:51
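
A sketch of the change roganjosh suggests in the comments above: build a `set` once before the parsing loop and test membership against it (the surrounding names and paths are illustrative, as in the sketch in the question):

```python
KEEP_IDS = ['1234567', '2234567']  # ... over 7k items, as before
KEEP_SET = set(KEEP_IDS)           # built once, before the parsing loop

COLUMN_IDX = 3

with open('input.txt', 'rb') as infile, open('matches.txt', 'wb') as outfile:
    for line in infile:
        columns = line.rstrip('\r\n').split('\x01')
        if columns[COLUMN_IDX] in KEEP_SET:  # O(1) average-case hash lookup
            outfile.write(line)
```

A list membership test compares the column value against up to ~7k strings on every line; the set does a single hash lookup on average, so over billions of lines the difference is dramatic.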
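If the set lookup alone isn't enough, below is a minimal sketch of the line-wise parallelism Stefan Falk mentions, assuming one input file per worker process (the glob pattern, the `filter_file` helper, and the output naming are all illustrative). On Windows the worker function must live at module top level and the pool must be created under an `if __name__ == '__main__':` guard:

```python
import glob
import multiprocessing

KEEP_SET = set(['1234567', '2234567'])  # ... over 7k items
COLUMN_IDX = 3

def filter_file(path):
    # Write the matching lines of one input file to a sibling '.filtered' file.
    out_path = path + '.filtered'
    with open(path, 'rb') as infile, open(out_path, 'wb') as outfile:
        for line in infile:
            columns = line.rstrip('\r\n').split('\x01')
            if columns[COLUMN_IDX] in KEEP_SET:
                outfile.write(line)
    return out_path

if __name__ == '__main__':  # required for multiprocessing on Windows
    pool = multiprocessing.Pool()  # defaults to one worker per CPU core
    for done in pool.imap_unordered(filter_file, glob.glob('*.txt')):
        print 'finished', done
    pool.close()
    pool.join()
```

Since the job is dominated by disk I/O, the gain depends on how many files the storage can stream at once; starting with a small pool (2-4 workers) and measuring is probably safer than defaulting to one worker per core.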

0 Answers