0

I have a huge text file (12 GB). The lines are tab-delimited and the first column contains an ID. For each ID I want to do something. My plan is therefore to start at the first line and walk through the first column line by line until the next ID is reached.

import linecache

num_lines = 377763316
b = start_line = 2

while b < num_lines:
  plasmid1 = linecache.getline("Result.txt", b - 1)
  plasmid1 = plasmid1.strip("\n")
  plasmid1 = plasmid1.split("\t")

  plasmid2 = linecache.getline("Result.txt", b)
  plasmid2 = plasmid2.strip("\n")
  plasmid2 = plasmid2.split("\t")

  if not str(plasmid1[0]) == str(plasmid2[0]):
    end_line = b
    #do something

  b += 1

The code works, but the problem is that linecache seems to reload the text file on every call. At this rate the code would run for several years, so I need to improve the performance.

I appreciate your help if you have a good idea how to solve the issue or know an alternative approach!

Thanks, Philipp

Philipp
  • 15
  • 7
  • Lines are tab-delimited? Sounds like columns to me? – RuDevel Feb 25 '17 at 18:19
  • Please, show all the code. What is `linecache` – eguaio Feb 25 '17 at 18:20
  • @eguaio: https://docs.python.org/3/library/linecache.html – cdarke Feb 25 '17 at 18:25
  • 1
    `linecache` is not designed for this. From the source code: "*Cache lines from Python source files*". Yes, from looking at the source code `linecache` does reopen the file each time. https://hg.python.org/cpython/file/3.6/Lib/linecache.py – cdarke Feb 25 '17 at 18:28

3 Answers

0

You should open the file just once, and iterate over the lines.

with open('Result.txt', 'r') as f:
    aline = f.next()
    currentid = aline.split('\t', 1)[0]
    for nextline in f:
        nextid = nextline.split('\t', 1)[0]
        if nextid != currentid:
            #do stuff
            currentid = nextid

You get the idea, just use plain Python. Only one line is read in each iteration. The extra `1` argument to `split` makes it split only at the first tab, increasing performance. You will not get better performance with any specialized library; only a plain C implementation could beat this approach.

If you get `AttributeError: '_io.TextIOWrapper' object has no attribute 'next'`, it is probably because you are using Python 3.x, where file objects no longer have a `next()` method (see question io-textiowrapper-object). Try this version instead:

with open('Result.txt', 'r') as f:
    aline = f.readline()
    currentid = aline.split('\t', 1)[0]
    while aline != '':
        aline = f.readline()
        if aline == '':
            break
        nextid = aline.split('\t', 1)[0]
        if nextid != currentid:
            #do stuff
            currentid = nextid
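For what it's worth, the same single-pass grouping can also be sketched with `itertools.groupby`, which collects consecutive lines sharing the same first column (this assumes, as in the question, that the file is sorted by ID; `io.StringIO` stands in for the real `Result.txt` here):

```python
import io
from itertools import groupby

# Tiny in-memory sample standing in for Result.txt (assumption).
sample = io.StringIO("id1\tA\nid1\tB\nid2\tC\n")

def first_column(line):
    # Split only at the first tab, same trick as above.
    return line.split('\t', 1)[0]

blocks = []
for current_id, lines in groupby(sample, key=first_column):
    # `lines` is a lazy iterator over this ID's block; the file is
    # still read one line at a time, never loaded fully into memory.
    blocks.append((current_id, [l.rstrip('\n') for l in lines]))

# blocks == [('id1', ['id1\tA', 'id1\tB']), ('id2', ['id2\tC'])]
```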
eguaio
  • 3,754
  • 1
  • 24
  • 38
0

I think numpy.loadtxt() is the way to go. It would also be nice to pass the usecols argument to specify which columns you actually need from the file. NumPy is a solid library written with high performance in mind.

After calling loadtxt() you will get ndarray back.
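A minimal sketch of that call, assuming column 0 holds the ID (here `io.StringIO` stands in for the real file):

```python
import io
import numpy as np

# In-memory sample standing in for the tab-delimited Result.txt.
sample = io.StringIO("id1\t10\nid1\t20\nid2\t30\n")

# usecols=(0,) reads only the first (ID) column; dtype=str keeps the
# IDs as strings instead of trying to parse them as floats.
ids = np.loadtxt(sample, delimiter='\t', usecols=(0,), dtype=str)
# ids is a 1-D ndarray: ['id1' 'id1' 'id2']
```

Note that `loadtxt` reads the whole input into memory, which may be a problem for a 12 GB file.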

Laszlowaty
  • 1,295
  • 2
  • 11
  • 19
0

You can use itertools:

from itertools import takewhile

class EqualityChecker(object):
    def __init__(self, id):
        self.id = id

    def __call__(self, current_line):
        current_id = current_line.split('\t')[0]
        return self.id == current_id


with open('hugefile.txt', 'r') as f:
    for id in ids:
        checker = EqualityChecker(id)
        for line in takewhile(checker, f):
            do_stuff(line)

In the outer loop, `id` can actually be obtained from the first line whose ID does not match the previous value.
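A sketch of that idea, discovering each new ID on the fly instead of knowing `ids` in advance (`io.StringIO` stands in for `hugefile.txt`):

```python
import io

# In-memory sample standing in for the real file (assumption).
f = io.StringIO("id1\tA\nid1\tB\nid2\tC\n")

seen = []
current_id = None
for line in f:
    line_id = line.split('\t', 1)[0]
    if line_id != current_id:
        # First line of a new ID block: record the new ID here
        # (or start whatever per-ID processing is needed).
        current_id = line_id
        seen.append(current_id)

# seen == ['id1', 'id2']
```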

mshrbkv
  • 309
  • 1
  • 5