
I have a huge text file (over 16 GB) where each line is of the form

  1. 22_0F3, 33_0F4, 0.87
  2. 28_0F3, 37_0F4, 0.79
  3. .................... . . .
  4. 21_0F2, 32_2F1, 0.86

I have to extract all lines from this huge text file that start with the strings specified in another file as

  1. 22_0F3, 33_0F4
  2. 32_0F1, 21_2F2
  3. .............. . .

The code below does the job, but it takes a very long time to finish.

huge = open('huge.txt')
lines= open('lines.txt')
output = open('output','w')


X=[]
l=[]

for line1 in lines:
    x1 = line1.split(',')[0].strip()
    x2 = line1.split(',')[1].strip()
    XX = [x1, x2]
    X.append(XX)

for line3 in huge:
    z1 = line3.split(',')[0].strip()
    z2 = line3.split(',')[1].strip()
    z3 = line3.split(',')[2].strip()
    ZX = [z1, z2]
    ZY = [z2, z1]
    if ZX in X or ZY in X:
        ZX.append(z3)
        l.append(ZX)
        print(ZX)

for i in l:
    output.write(str(i)[1:-1]+'\n')
output.close()


Expected output:
1. 22_0F3, 33_0F4, 0.87
2. 32_2F1, 21_0F2, 0.86


I'm a beginner in Python programming. Can anybody help me optimize this code so that it produces the result faster?

Is there any faster method to get the same output?

Sara S
  • A simple option to reduce the number of loops would be to loop once through the file that holds the things you are looking for and build up a dictionary of keys. Then the loop through the file you extract from is a single loop with one dictionary lookup per line, which is extremely fast compared to what you are doing now. – MyNameIsCaleb May 17 '19 at 18:39
  • [This answer](https://stackoverflow.com/questions/513882/python-list-vs-dict-for-look-up-table) talks through the differences in list vs dict lookups. – MyNameIsCaleb May 17 '19 at 18:39
  • @MyNameIsCaleb Could you please show me how I can rewrite this with dict lookups? I have no real experience with Python programming. – Sara S May 17 '19 at 18:44
  • I added an answer to show the big changes. That should increase the speed by a lot. – MyNameIsCaleb May 17 '19 at 18:49

1 Answer


Change it to a dictionary lookup, similar to the code below. You may need to adjust the output formatting a little, because I can't fully test how it will look, but it should replicate the behavior fairly well.

huge = open('huge.txt')
lines = open('lines.txt')
output = open('output', 'w')


lookup_from = {}
l = []

for line1 in lines:   # the file holding the pairs you are looking up
    parts = line1.split(',')   # split once and reuse the pieces
    x1 = parts[0].strip()
    x2 = parts[1].strip()
    XX = (x1, x2)   # must be a tuple, not a list, to be a dictionary key
    lookup_from[XX] = 0   # the value is an arbitrary 0; only the key matters

for line3 in huge:
    parts = line3.split(',')
    z1 = parts[0].strip()
    z2 = parts[1].strip()
    z3 = parts[2].strip()
    ZX = (z1, z2)   # tuple again for dict
    ZY = (z2, z1)   # the same pair in reverse order
    if ZX in lookup_from or ZY in lookup_from:
        ZX = ZX + (z3,)
        l.append(ZX)
        print(ZX)

for i in l:
    # join writes the fields without the quotes that str(i)[1:-1] would leave in
    output.write(', '.join(i) + '\n')
output.close()
huge.close()
lines.close()

Expected output:

1. 22_0F3, 33_0F4, 0.87
2. 32_2F1, 21_0F2, 0.86

To improve speed further, you could reduce the two lookups to one. Right now you are checking (X, Y) and (Y, X); instead, you could always store your lookup keys in a specific order (alphabetical, perhaps) and then always look up using that same order.
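
As a rough illustration of that single-lookup idea, here is a minimal sketch. It is not tested against the real files. The helper names (`pair_key`, `load_wanted`, `matching_rows`) and the in-memory sample data are made up for the example, and it uses a set instead of a dictionary with dummy values, which is a swap on my part. It also yields matches one at a time rather than collecting them in a list, on the assumption that memory matters with a 16 GB input.

```python
def pair_key(a, b):
    """Return the two IDs in sorted order so (a, b) and (b, a) give the same key."""
    return (a, b) if a <= b else (b, a)


def load_wanted(lines):
    """Build a set of order-independent keys from lines like '22_0F3, 33_0F4'."""
    wanted = set()
    for line in lines:
        parts = [p.strip() for p in line.split(',')]
        if len(parts) >= 2:  # skip blank or malformed lines
            wanted.add(pair_key(parts[0], parts[1]))
    return wanted


def matching_rows(huge_lines, wanted):
    """Yield 'id1, id2, score' for each row whose pair is in wanted, in either order."""
    for line in huge_lines:
        parts = [p.strip() for p in line.split(',')]
        # one set lookup per row, because the key is always built in sorted order
        if len(parts) >= 3 and pair_key(parts[0], parts[1]) in wanted:
            yield ', '.join(parts[:3])


# Illustrative sample data only; with real files, pass the open file objects instead.
wanted = load_wanted(["22_0F3, 33_0F4", "32_2F1, 21_0F2"])
sample = ["22_0F3, 33_0F4, 0.87", "28_0F3, 37_0F4, 0.79", "21_0F2, 32_2F1, 0.86"]
result = list(matching_rows(sample, wanted))
print(result)
```

With the actual files, you would presumably pass `open('lines.txt')` to `load_wanted` and `open('huge.txt')` to `matching_rows`, writing each yielded row plus `'\n'` to the output file as it arrives. Note that matched rows keep the field order they have in the huge file.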

MyNameIsCaleb