
I'm having trouble with large text files (about 1 GB) that I need to read and then use inside a while loop.

More specifically: First I start by doing some parsing on the lines of the file, in order to find e.g. all lines that start with "x". In doing so, I add the indices of the found lines to a list (say l). This is the pre-processing part.

Now, in a while loop, I pick a random index from l and want to read the corresponding line (or, say, 5 lines around it). I therefore need to keep the file in memory throughout the while loop, since I do not know in advance which lines I will end up reading (the line is picked at random from l).

The problem is that when I open the file before my main loop, the read succeeds during the first run of the loop, but from the second run onward the file seems to have vanished from memory. What I have tried:

The preprocess part:

l = []
with open(filename, 'r') as f:
    for i, line in enumerate(f):
        prep = ''.join(c for c in line if c.isalnum() or c.isspace())
        if 'x' in prep:
            l.append(i)

Now I have my list l. Opening the file before the main loop:

with open(filename, 'r') as f:
    while (some condition):
        random_index = random.randrange(len(l) - 1)
        output_file = open("out", "w")  # I will write the read line(s) here
        for i, line in enumerate(f):
            # (the lines to be read, starting from the given random index)
            if (i >= l[random_index]) and (i < l[random_index + 1]):
                output_file.write(line)
        output_file.close()

Things work properly only during the first run of the loop. Alternatively, I also tried:

f = open(filename)
while (some condition):
    random_index = ... #rest is same as above.

Same issue: only the first run works. One thing that did work was putting f = open(filename) inside the loop, so that the file is reopened on every run. But since the file is large, this is not a practical solution.

  • What am I doing wrong here?
  • How should such readings be done properly?

1 Answer


What am I doing wrong here?

This answer addresses the same problem: you can't read a file twice without rewinding it.

You open the file f outside the while loop and read it completely with for i, line in enumerate(f): during the first iteration of the while loop. On the second iteration there is nothing left to read, because the file object's read position is already at the end of the file. The file has not vanished; the iterator is simply exhausted.
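A minimal sketch of this behaviour, using a small temporary file as a stand-in for the large one. The helper name `count_lines_twice` and the sample contents are made up for illustration.

```python
import os
import tempfile

def count_lines_twice(path):
    """Iterate over the same open file object twice and return both counts."""
    with open(path, "r") as f:
        first = sum(1 for _ in f)   # consumes the file up to the end
        second = sum(1 for _ in f)  # read position is already at the end
    return first, second

# Throwaway file standing in for the real 1 GB file (assumed contents).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("a\nx1\nb\nx2\nc\n")
    demo_path = tmp.name

first_pass, second_pass = count_lines_twice(demo_path)
os.remove(demo_path)

print(first_pass, second_pass)  # 5 0
```

The second pass yields no lines at all, which is what happens on the second run of the while loop in the question.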

How should such readings be done properly?

As noted in the linked answer:

To answer your question directly, once a file has been read, with read() you can use seek(0) to return the read cursor to the start of the file (docs are here).

So, to solve your problem, you can add f.seek(0) at the end of the while loop to move the read position back to the start of the file after each iteration. The file can then be read from the start again on the next iteration.
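As a rough sketch of how this could look, assuming a small temporary file and a hand-written index list `l` in place of the real data. The helper `read_block` is hypothetical, and it rewinds before each read rather than at the end of the loop body, which has the same effect.

```python
import os
import tempfile

def read_block(f, start, stop):
    """Return the lines whose index is in [start, stop), rewinding f first."""
    f.seek(0)  # move the read position back to the start of the file
    return [line for i, line in enumerate(f) if start <= i < stop]

# Throwaway file standing in for the real one (assumed contents).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("x0\na\nb\nx1\nc\nx2\nd\n")
    demo_path = tmp.name

l = [0, 3, 5]  # indices of the lines containing "x", as in the pre-processing step

with open(demo_path, "r") as f:
    # Three reads from the same open file object; the third repeats the first.
    blocks = [read_block(f, l[k], l[k + 1]) for k in (0, 1, 0)]

os.remove(demo_path)

print(blocks[0])  # ['x0\n', 'a\n', 'b\n']
print(blocks[1])  # ['x1\n', 'c\n']
```

Note that each call still scans the file from the top, so seek(0) fixes the empty second read but does not avoid the cost of a linear pass per iteration.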

Yaroslav Admin
  • Thanks, I see now. The link you provided is useful, I'm going to try [Dan Lenski's suggestion](http://stackoverflow.com/questions/24312123/memory-efficent-way-to-iterate-over-part-of-a-large-file/24312242#24312242) and try using islice from itertools, in the loop. –  Oct 29 '15 at 14:27