
I don't think this is possible, but I figured I would ask just in case. I am trying to write a memory-efficient Python program for parsing files that are typically 100+ GB in size. What I want to do is use a single for loop that reads a line, splits it on various characters multiple times, and writes the result, all within the same loop.

The trick is that the file has lines that start with "#". These are not important, except for the last line that starts with "#", which is the header of the file. I want to be able to pull information from that last line because it contains the sample names.

for line in seqfile:
    line = line.rstrip()
    if line.startswith("#"):
        # skip, unless it's the last line that starts with "#";
        # for that last line I want something like:
        #     SampleNames = lastline[8:-1]
        #     newheader.write(<new header with sample names>)
        continue
    else:
        columns = line.split("\t")
        # then do more splitting
        # then write

If this is not possible, the only alternative I can think of is to store the lines that start with # (which can still be 5 GB in size) and then go back and write to the beginning of the file. I believe that can't be done directly, but if there is a memory-efficient way to do it, that would be nice.

Any help would be greatly appreciated.

Thank you

MeeshCompBio

1 Answer


If you want the index of the last line starting with #, and all the # lines sit at the top of the file, read once using takewhile, consuming lines until you hit the first line that does not start with #, then seek back and use itertools.islice to get the line:

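from itertools import takewhile, islice

with open(file) as f:
    # count the leading "#" lines; the last one has index count - 1
    start = sum(1 for _ in takewhile(lambda x: x[0] == "#", f)) - 1
    f.seek(0)  # rewind, then skip straight to that index
    data = next(islice(f, start, start + 1))
    print(data)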

The first argument to takewhile is a predicate; takewhile keeps taking elements from the iterable passed as the second argument for as long as the predicate is True. Because a file object is its own iterator, consuming the takewhile object with sum advances the file itself: it ends up just past the first line that does not start with #, since takewhile has to read that line to know where to stop. From there it is just a matter of seeking back and getting the line with islice. Instead of seeking all the way to the start, you could also seek back a shorter distance and use islice to filter the few remaining lines until you reach the last one starting with #.

file:

###
##
# i am the header
blah
blah
blah

Output:

 # i am the header
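
The takewhile behaviour described above can be checked on a small in-memory file. This is only a sketch: io.StringIO stands in for the real file, and the numbered "blah" lines are made up so that the consumed line is easy to spot.

```python
import io
from itertools import islice, takewhile

# In-memory stand-in for the sample file (assumption: real code would use open()).
f = io.StringIO("###\n##\n# i am the header\nblah 1\nblah 2\nblah 3\n")

# Count the leading "#" lines; the last of them has index count - 1.
count = sum(1 for _ in takewhile(lambda line: line.startswith("#"), f))

# takewhile had to read "blah 1" to learn that the run of "#" lines ended,
# so the next line the file hands out is the second data line.
next_line = next(f)

# Rewind and pick out the header by its index.
f.seek(0)
header = next(islice(f, count - 1, count))

print(count, repr(next_line), repr(header))
```

This is why the answer seeks back before reading the header: after the counting pass the iterator is already past both the header and the first data line.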

If the line could be anywhere in the file, the only memory-efficient way I can think of is to read the file once, updating an index variable each time you see a line starting with #. You could then pass that index to islice as above, or use linecache.getline:

import linecache

with open(file) as f:
    index = None
    for ind, line in enumerate(f, 1):
        if line[0] == "#":
            index = ind
    data = linecache.getline(file, index)
    print(data)

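We use a starting index of 1 with enumerate because getline counts lines starting from 1. Be aware that linecache reads and caches the whole file on first access, so for files larger than memory the islice approach is the safer choice.

Or, if you only want that particular line and don't care about its position or the other lines, simply keep updating a variable data with each line that starts with #: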

with open(file) as f:
    data = None
    for line in f:
        if line[0] == "#":
            data = line
    print(data)  # will be the last occurrence of a line starting with `#`
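
Applied to the original question, the same idea allows a single pass with no seeking, assuming (as the asker confirms in the comments) that all the # lines come before the data. The sketch below is illustrative only: convert is a hypothetical helper, and the way the header is rebuilt (strip the leading #, split on tabs) is a placeholder for whatever the real header format needs.

```python
import io


def convert(src, dst):
    """Write one rebuilt header, then the processed data rows.

    Hypothetical helper; assumes every "#" line precedes the first data row.
    """
    last_hash_line = None
    header_written = False
    for line in src:
        line = line.rstrip("\n")
        if not header_written and line.startswith("#"):
            last_hash_line = line  # keep only the most recent "#" line
            continue
        if not header_written:
            # First data row reached, so last_hash_line is the real header.
            if last_hash_line is not None:
                sample_names = last_hash_line.lstrip("#").split("\t")
                dst.write("\t".join(sample_names) + "\n")
            header_written = True
        columns = line.split("\t")  # further splitting would go here
        dst.write("\t".join(columns) + "\n")


# Tiny demonstration with in-memory files standing in for the real ones.
src = io.StringIO("##meta\n#CHROM\tPOS\tS1\tS2\n1\t100\ta\tb\n2\t200\tc\td\n")
dst = io.StringIO()
convert(src, dst)
print(dst.getvalue())
```

Only one # line is held in memory at a time, so the size of the comment block does not matter. If the file contains no data rows at all, this sketch writes nothing.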

Or use file.tell, keeping track of the previous pointer location, then seek to it and call next on the file object to get the line or lines we want:

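with open(file) as f:
    # offset of the last "#" line seen so far, and of the line about to be read
    # (prev_tell starts at 0 so a "#" on the very first line is handled)
    curr_tell, prev_tell = None, 0
    # readline is used because tell() is disabled while iterating with next()
    for line in iter(f.readline, ""):
        if line[0] == "#":
            curr_tell = prev_tell
        prev_tell = f.tell()
    f.seek(curr_tell)
    data = next(f)
    print(data)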
    # i am the header

There is also the consume recipe from the itertools documentation, which you could use to advance the file iterator up to your header line index - 1 and then simply call next on the file object:

import collections
from itertools import islice

def consume(iterator, n=None):
    "Advance the iterator n-steps ahead. If n is None, consume entirely."
    # Use functions that consume iterators at C speed.
    if n is None:
        # feed the entire iterator into a zero-length deque
        collections.deque(iterator, maxlen=0)
    else:
        # advance to the empty slice starting at position n
        next(islice(iterator, n, n), None)
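
A possible way to use the recipe, shown on an in-memory file. This is a sketch: header_index is assumed to be the zero-based index of the last # line, found by an earlier counting pass, and io.StringIO stands in for the real file.

```python
import collections
import io
from itertools import islice


def consume(iterator, n=None):
    "Advance the iterator n-steps ahead. If n is None, consume entirely."
    if n is None:
        collections.deque(iterator, maxlen=0)
    else:
        next(islice(iterator, n, n), None)


f = io.StringIO("###\n##\n# i am the header\nblah\nblah\nblah\n")

header_index = 2          # assumption: zero-based index from an earlier pass
consume(f, header_index)  # skip the lines before the header
header = next(f)          # the file iterator now yields the header line
print(header)
```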
Padraic Cunningham
  • Yes, that is basically the issue I have; it's just continuous lines starting with #, then it switches to the actual data. How does the code change if you use your previous line example? – MeeshCompBio Jun 02 '15 at 19:20
  • @MeeshCompBio, then this will work. It only reads the lines that start with a `#` and gives you the count of those; taking one away from that sum gives you the index of the last line that started with a `#` – Padraic Cunningham Jun 02 '15 at 19:22
  • @MeeshCompBio, if the line could be anywhere then you would need to read every line, line by line using enumerate and updating a variable with the index, I will add it to my answer. – Padraic Cunningham Jun 02 '15 at 19:23
  • Thank you, this is super helpful! – MeeshCompBio Jun 02 '15 at 19:25
  • @MeeshCompBio, no worries. You can play around with the solutions, maybe using timeit to see which is best, but they are all memory efficient. – Padraic Cunningham Jun 02 '15 at 19:42