I'm parsing a large number of delimited text files (over 10k of them, between 1GB and 20GB each) in Python and extracting the lines where one column's string value appears in a list of over 7k items. My current code looks something like this:
if columns[idx] in ['1234567','2234567'...(add another 7k items here)]:
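For context, here is a simplified sketch of the surrounding loop. The function and path names are placeholders added for readability, and what I do with a matching line isn't the issue (shown here as a plain write); the real script hard-codes the list as a literal, but the structure is otherwise the same:

    VALID_IDS = ['1234567', '2234567']  # ...plus thousands more hard-coded literals

    def filter_file(in_path, out_path, idx):
        # Placeholder names for illustration; only the membership test is the real code.
        with open(in_path, 'r') as src, open(out_path, 'w') as dst:
            for line in src:
                columns = line.rstrip('\r\n').split('\x01')  # non-standard delimiter
                if columns[idx] in VALID_IDS:  # <-- the slow membership test
                    dst.write(line)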
The list of possible items is over 69k and I've found no patterns that would make the membership test cheaper. For example, it would be more efficient if there were a rule such as "if the first 2 characters are '12', include the item", but I don't see any pattern like that. Does anyone know of a more efficient way of doing this? You may wonder why I don't import the data into a database. There are two reasons: one, I don't have a database with anywhere near enough space to hold the data, and two, the files use a non-standard column delimiter, '\x01'.
Not surprisingly, the PyCharm profiler shows this line as a major bottleneck in my code. I've tested removing it and writing all the data out to a text file, which I could then load piecemeal into a database and filter there, but the 'in' check filters out 90% of the data, so writing everything out is slower than filtering with 'in'.
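For reference, the write-everything variant I timed was essentially just this (again with placeholder names):

    def dump_all(in_path, out_path):
        # Dump every line unfiltered, with the intention of filtering later in a database.
        # This ends up slower overall, because ~90% of lines would have been dropped
        # by the 'in' check and never written.
        with open(in_path, 'r') as src, open(out_path, 'w') as dst:
            for line in src:
                dst.write(line)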
I'm using Python 2.7.12 on Windows 10. I also have an Ubuntu 12.04.5 machine available to use (with Python 2.7.12).