
I have a very big file (~10GB) and I want to read it in its entirety. To do that, I cut it into chunks. However, I'm having trouble cutting the big file into usable pieces: I want thousands of lines at a time, without any of them being split in the middle. I found a function here on SO that I have adapted a bit:

def readPieces(file):
    while True:
        # Read a fixed number of bytes; this ignores line boundaries
        data = file.read(4096).strip()
        if not data:
            break
        yield data

with open('bigfile.txt', 'r') as f:
    for chunk in readPieces(f):
        print(chunk)

I can specify how many bytes to read (here 4096, i.e. 4 KB), but when I do, my lines get cut in the middle, and if I remove the size argument, it reads the whole big file at once, which causes the process to stop. How can I do this? Also, the lines in my file are not all the same length.

ooj-001
  • If you want to read the files by line, just use `for line in open('bigfile.txt'):`. It contains a lot of auto-magic. – Klaus D. Jun 05 '19 at 14:41
  • If you __really__ want to implement this yourself, you can (in `readPieces`) split the chunk on the last newline, keep the second part in a buffer and only yield the first part. Then on the next iteration, you add the new chunk to the buffer, rinse, lather, repeat (and do not forget to yield the remaining buffer - if not empty - once you've exhausted the file); see the sketch after these comments. But just using the built-in line-buffered read (as explained in the dup) is definitely simpler and more efficient xD – bruno desthuilliers Jun 05 '19 at 15:16
  • @brunodesthuilliers Yes, in fact this is what I was aiming for: read my huge file like a stream – ooj-001 Jun 05 '19 at 15:28
  • @ooj-001 well as explained here and in the dup: just iterate over the file, and you'll be done - it __does__ stream the file's content already. – bruno desthuilliers Jun 08 '19 at 09:30
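
For reference, a minimal sketch of the buffer-based approach described in the comment above (the function name `read_line_aligned_chunks` and the 4096-byte chunk size are illustrative, not from the original post):

def read_line_aligned_chunks(file, chunk_size=4096):
    buffer = ''
    while True:
        data = file.read(chunk_size)
        if not data:
            break
        buffer += data
        # Split on the last newline: yield the complete lines, keep the tail.
        head, sep, tail = buffer.rpartition('\n')
        if sep:
            yield head + sep
            buffer = tail
    if buffer:
        # Don't forget whatever is left once the file is exhausted.
        yield buffer

with open('bigfile.txt', 'r') as f:
    for chunk in read_line_aligned_chunks(f):
        print(chunk)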

1 Answer


The following code reads the file line by line; once you move on to the next line, the previous one can be garbage collected.

with open('bigfile.txt') as file:
    for line in file:  # iterates lazily, one line at a time
        print(line)
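
If you still need to hand the lines to the rest of your code in batches of a few thousand at a time, as the question describes, one way to build on the same line iteration is `itertools.islice`; a minimal sketch (the batch size is illustrative):

from itertools import islice

BATCH_SIZE = 5000  # illustrative; pick whatever "thousands of lines" means for you

with open('bigfile.txt') as file:
    while True:
        # Pull at most BATCH_SIZE complete lines from the file iterator.
        batch = list(islice(file, BATCH_SIZE))
        if not batch:
            break
        chunk = ''.join(batch)  # each batch ends on a line boundary
        print(chunk)

Each batch ends on a newline, so no line is ever cut in half.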
peter