How to extract specific lines from a huge text file (>16GB) where each line starts with a string specified in another input file?

Question

I have a huge text file (>16 GB size) where each line is of the form

22_0F3, 33_0F4, 0.87
28_0F3, 37_0F4, 0.79
.................... . . .
21_0F2, 32_2F1, 0.86

I have to extract all lines from this huge text file that start with the strings specified in another file as

22_0F3, 33_0F4
32_0F1, 21_2F2
.............. . .

The code below does the job, but it takes a very long time to finish.

huge = open('huge.txt')
lines= open('lines.txt')
output = open('output','w')


X=[]
l=[]

for line1 in lines:
    x1 = line1.split(',')[0].strip()
    x2 = line1.split(',')[1].strip()
    XX = [x1, x2]
    X.append(XX)

for line3 in huge:
    z1 = line3.split(',')[0].strip()
    z2 = line3.split(',')[1].strip()
    z3 = line3.split(',')[2].strip()
    ZX = [z1, z2]
    ZY = [z2, z1]
    if ZX in X or ZY in X:
        ZX.append(z3)
        l.append(ZX)
        print(ZX)

for i in l:
    output.write(str(i)[1:-1]+'\n')
output.close()


Expected output:
1. 22_0F3, 33_0F4, 0.87
2. 32_2F1, 21_0F2, 0.86

I'm a beginner in Python programming. Can anybody help me optimize this code so that it finishes faster?

Is there any faster method to get the same output?

A simple way to cut down the work is to loop once through the file that holds the things you are looking for and build a dictionary of keys. Then loop once through the file you are extracting from and look each line up in the dictionary, which is extremely fast compared to what you are doing now. — MyNameIsCaleb, May 17 '19 at 18:39
[This answer](https://stackoverflow.com/questions/513882/python-list-vs-dict-for-look-up-table) talks through the differences in list vs dict lookups. — MyNameIsCaleb, May 17 '19 at 18:39
@MyNameIsCaleb Could you please show me how I can rewrite this with dict lookups? I have no real experience with Python programming. — Sara S, May 17 '19 at 18:44
I added an answer to show the big changes. That should increase the speed by a lot. — MyNameIsCaleb, May 17 '19 at 18:49

MyNameIsCaleb · Accepted Answer · 2019-05-17T19:11:33.933

Change it to a dictionary lookup, similar to the code below. You may need to adjust the output format a little, because I can't fully test how it will look, but it should replicate the behavior of your code fairly well.

huge = open('huge.txt')
lines= open('lines.txt')
output = open('output','w')


lookup_from = {}
l=[]

for line1 in lines:   # if this is what you are referencing your lookups from
    x1 = line1.split(',')[0].strip()
    x2 = line1.split(',')[1].strip()
    XX = (x1, x2)   # must be a tuple to be a dictionary key instead of a list
    lookup_from[XX] = 0   # assign the key to the dictionary with an arbitrary 0 value

for line3 in huge:
    z1 = line3.split(',')[0].strip()
    z2 = line3.split(',')[1].strip()
    z3 = line3.split(',')[2].strip()
    ZX = (z1, z2)   # tuple again for dict
    ZY = (z2, z1)   # tuple
    if ZX in lookup_from or ZY in lookup_from:
        ZX = ZX + (z3,)
        l.append(ZX)
        print(ZX)

for i in l:
    output.write(', '.join(i)+'\n')   # join the fields; str(i)[1:-1] would leave quotes around each value
output.close()

Expected output:

1. 22_0F3, 33_0F4, 0.87
2. 32_2F1, 21_0F2, 0.86

To improve speed further, you could reduce the two lookups to one. Right now you are checking both (X, Y) and (Y, X); instead, you could always store your lookup keys in a fixed order (alphabetical, perhaps) and then always look up using that same order.
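The single-lookup idea could be sketched roughly as follows. This is only an illustration under a few assumptions: the helper names (`canonical_pair`, `load_wanted`, `extract`) are hypothetical, a `set` stands in for the dictionary because only membership matters, and the in-memory sample data stands in for `lines.txt` and `huge.txt`. It also writes each match as soon as it is found instead of collecting matches in a list, which may help with a file this large.

```python
import io


def canonical_pair(a, b):
    """Return the two IDs in sorted order so (a, b) and (b, a) give the same key."""
    return (a, b) if a <= b else (b, a)


def load_wanted(lines_file):
    """Build a set of canonical (id1, id2) keys from the lookup file."""
    wanted = set()
    for line in lines_file:
        parts = [p.strip() for p in line.split(',')]
        if len(parts) >= 2:
            wanted.add(canonical_pair(parts[0], parts[1]))
    return wanted


def extract(huge_file, wanted, out_file):
    """Stream through the big file, writing matching lines immediately."""
    count = 0
    for line in huge_file:
        parts = [p.strip() for p in line.split(',')]  # split once per line
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        if canonical_pair(parts[0], parts[1]) in wanted:  # one lookup only
            out_file.write(', '.join(parts[:3]) + '\n')
            count += 1
    return count


# Made-up sample data in place of the real files (assumption for the demo).
lines_file = io.StringIO("22_0F3, 33_0F4\n32_2F1, 21_0F2\n")
huge_file = io.StringIO(
    "22_0F3, 33_0F4, 0.87\n"
    "28_0F3, 37_0F4, 0.79\n"
    "21_0F2, 32_2F1, 0.86\n"
)
out_file = io.StringIO()

wanted = load_wanted(lines_file)
matched = extract(huge_file, wanted, out_file)
print(out_file.getvalue(), end='')
```

With real files, the `io.StringIO` objects would be replaced by `open('lines.txt')`, `open('huge.txt')` and `open('output', 'w')`, ideally inside `with` blocks so they are closed automatically. Note that matched lines are written in the order they appear in the huge file, not in the order of the lookup file.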

Thanks. But this shows the following error " ZX = ZX + z3 TypeError: can only concatenate tuple (not "str") to tuple " — Sara S, May 17 '19 at 18:55
Can you do a `print(type(ZX))` and a `print(type(z3))` at the line before you get the error and tell me what it says — MyNameIsCaleb, May 17 '19 at 19:01
Ah sorry, to make sure it's a tuple it needs a comma, try: `ZX = ZX + (z3,)` — MyNameIsCaleb, May 17 '19 at 19:11
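For anyone hitting the same error, here is a small sketch of why the trailing comma matters. The values are taken from the sample data in the question, and the variable names are only illustrative.

```python
pair = ("22_0F3", "33_0F4")
score = "0.87"

not_a_tuple = (score)   # parentheses alone do nothing: this is still a str
one_tuple = (score,)    # the trailing comma makes a one-element tuple

print(type(not_a_tuple).__name__)
print(type(one_tuple).__name__)

try:
    pair + not_a_tuple  # tuple + str is not allowed
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

row = pair + one_tuple  # tuple + tuple concatenates as intended
print(row)
```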
