0

I have a huge text file (12 GB). The lines are tab-delimited and the first column contains an ID. For each ID I want to do something. My plan is therefore to start at the first line and walk through the first column line by line until the next ID is reached.

import linecache

num_lines = 377763316
b = start_line = 2

while b < num_lines:
  plasmid1 = linecache.getline("Result.txt", b - 1)
  plasmid1 = plasmid1.strip("\n")
  plasmid1 = plasmid1.split("\t")

  plasmid2 = linecache.getline("Result.txt", b)
  plasmid2 = plasmid2.strip("\n")
  plasmid2 = plasmid2.split("\t")

  if not str(plasmid1[0]) == str(plasmid2[0]):
    end_line = b
    #do something

  b += 1

The code works, but the problem is that linecache seems to reload the text file on every call. At this rate the code would run for several years, so I need to improve the performance.

I appreciate your help if you have a good idea how to solve the issue or know an alternative approach!

Thanks, Philipp

Philipp
  • 15
  • 7
  • Lines are tab-delimited? Sounds like columns to me? – RuDevel Feb 25 '17 at 18:19
  • Please, show all the code. What is `linecache` – eguaio Feb 25 '17 at 18:20
  • @eguaio: https://docs.python.org/3/library/linecache.html – cdarke Feb 25 '17 at 18:25
  • 1
    `linecache` is not designed for this. From the source code: "*Cache lines from Python source files*". Yes, from looking at the source code `linecache` does reopen the file each time. https://hg.python.org/cpython/file/3.6/Lib/linecache.py – cdarke Feb 25 '17 at 18:28

3 Answers

0

You should open the file just once, and iterate over the lines.

with open('Result.txt', 'r') as f:
    aline = f.next()
    currentid = aline.split('\t', 1)[0]
    for nextline in f:
        nextid = nextline.split('\t', 1)[0]
        if nextid != currentid:
            #do stuff
            currentid = nextid

You get the idea, just use plain Python. Only one line is read in each iteration. The extra `1` argument to `split` makes it split only at the first tab, increasing performance. You will not get better performance with any specialized library; only a plain C implementation could beat this approach.

If you get `AttributeError: '_io.TextIOWrapper' object has no attribute 'next'`, it is probably because you are using Python 3.x, where file objects no longer have a `next()` method (see question io-textiowrapper-object). Try this version instead:

with open('Result.txt', 'r') as f:
    aline = f.readline()
    currentid = aline.split('\t', 1)[0]
    while aline != '':
        aline = f.readline()
        if aline == '':
            break
        nextid = aline.split('\t', 1)[0]
        if nextid != currentid:
            #do stuff
            currentid = nextid
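For what it's worth, the same single-pass grouping can also be sketched with `itertools.groupby`, which collects consecutive lines sharing the same first column (this assumes, as in the question, that the file is sorted by ID; `io.StringIO` stands in for the real `Result.txt` here):

```python
import io
from itertools import groupby

# Tiny in-memory sample standing in for Result.txt (assumption).
sample = io.StringIO("id1\tA\nid1\tB\nid2\tC\n")

def first_column(line):
    # Split only at the first tab, same trick as above.
    return line.split('\t', 1)[0]

blocks = []
for current_id, lines in groupby(sample, key=first_column):
    # `lines` is a lazy iterator over this ID's block; the file is
    # still read one line at a time, never loaded fully into memory.
    blocks.append((current_id, [l.rstrip('\n') for l in lines]))

# blocks == [('id1', ['id1\tA', 'id1\tB']), ('id2', ['id2\tC'])]
```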
eguaio
  • 3,754
  • 1
  • 24
  • 38
0

I think numpy.loadtxt() is the way to go. It would also be nice to pass the usecols argument to specify which columns you actually need from the file. NumPy is a solid library written with high performance in mind.

After calling loadtxt() you will get ndarray back.
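A minimal sketch of that call, assuming column 0 holds the ID (here `io.StringIO` stands in for the real file):

```python
import io
import numpy as np

# In-memory sample standing in for the tab-delimited Result.txt.
sample = io.StringIO("id1\t10\nid1\t20\nid2\t30\n")

# usecols=(0,) reads only the first (ID) column; dtype=str keeps the
# IDs as strings instead of trying to parse them as floats.
ids = np.loadtxt(sample, delimiter='\t', usecols=(0,), dtype=str)
# ids is a 1-D ndarray: ['id1' 'id1' 'id2']
```

Note that `loadtxt` reads the whole input into memory, which may be a problem for a 12 GB file.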

Laszlowaty
  • 1,295
  • 2
  • 11
  • 19
0

You can use itertools:

from itertools import takewhile

class EqualityChecker(object):
    def __init__(self, id):
        self.id = id

    def __call__(self, current_line):
        current_id = current_line.split('\t')[0]
        return self.id == current_id


with open('hugefile.txt', 'r') as f:
    for id in ids:
        checker = EqualityChecker(id)
        for line in takewhile(checker, f):
            do_stuff(line)

In the outer loop, `id` can actually be obtained from the first line whose ID does not match the previous value.
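A sketch of that idea, discovering each new ID on the fly instead of knowing `ids` in advance (`io.StringIO` stands in for `hugefile.txt`):

```python
import io

# In-memory sample standing in for the real file (assumption).
f = io.StringIO("id1\tA\nid1\tB\nid2\tC\n")

seen = []
current_id = None
for line in f:
    line_id = line.split('\t', 1)[0]
    if line_id != current_id:
        # First line of a new ID block: record the new ID here
        # (or start whatever per-ID processing is needed).
        current_id = line_id
        seen.append(current_id)

# seen == ['id1', 'id2']
```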

mshrbkv
  • 309
  • 1
  • 5