5

I am trying to convert a file containing more than 1 billion bytes into integers. Obviously, my machine cannot do this all at once, so I need to chunk my code. I was able to decode the first 50,000,000 bytes, but I am wondering how to read the integers in the file that are between bytes 50,000,001 and 100,000,000, 150,000,000 and 200,000,000, etc. The following is what I have now; the range function is not working with this.

import struct
with open(x, "rb") as f:
    this_chunk = range(50000001, 100000000)
    data = f.read(this_chunk)
    ints1 = struct.unpack("I" * (this_chunk //4) , data)
    print(ints1)
cs95
A. Hartman
    `file.read()` already supports defining the size you need to read; `f.read(50000000)` will return `50000000` bytes at a time. Doesn't the above code throw an error? I would expect passing a `range()` object into `f.read()` to error. – AChampion Jul 19 '17 at 21:11
  • @AChampion Yes, it does. The problem is if I simply enter 50000000 into f.read it continually outputs the same numbers – A. Hartman Jul 19 '17 at 21:17
  • 1
    That's because you open the file each time. Only open the file once and use `f.read()` multiple times, i.e. stick it in a loop. – AChampion Jul 19 '17 at 21:18

2 Answers

7

You can use f.seek(offset) to set the file pointer to start reading from a certain offset.

In your case, you'd want to skip the first 50000000 bytes, so you'd call

f.seek(50000000)

At this point, you'd want to read another 50000000 bytes, so you'd call f.read(50000000).


This would be your complete code listing, implementing f.seek and reading the whole file:

with open(x, "rb") as f:
    f.seek(50000000) # omit if you don't want to skip this chunk
    data = f.read(50000000)
    while data:
        ... # do something 
        data = f.read(50000000)
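
Putting the pieces together with `struct.unpack`, here is a sketch of decoding each chunk after the initial seek. It uses `BytesIO` with a few sample ints in place of the real file, and assumes little-endian 4-byte unsigned ints and a file size that is a multiple of 4:

```python
import struct
from io import BytesIO

CHUNK = 50_000_000  # bytes per read; a multiple of 4 so no int is split

# BytesIO stands in for open(x, "rb"); four sample little-endian ints
f = BytesIO(struct.pack("<4I", 1, 2, 3, 4))

f.seek(8)  # skip the first 8 bytes (two ints); omit to start at the top
while True:
    data = f.read(CHUNK)
    if not data:
        break
    # len(data) may be smaller than CHUNK on the final read
    ints = struct.unpack("<%dI" % (len(data) // 4), data)
    print(ints)
```

Sizing the format string from `len(data)` rather than the chunk size keeps the last, possibly shorter, read from raising `struct.error`.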
cs95
  • 379,657
  • 97
  • 704
  • 746
  • The first byte is at position `0`, so I'm pretty sure seeking to `50000000` is more correct than going one byte further. – Blckknght Jul 19 '17 at 21:11
  • @Blckknght Sorry, and thanks for pointing that out. Corrected. – cs95 Jul 19 '17 at 21:12
  • 1
    Why `f.seek()`, can't the OP just keep reading `f.read(50000000)` and working on each `50000000` bytes at a time? Just stick it in a loop. – AChampion Jul 19 '17 at 21:13
  • @AChampion OP mentioned that they had already done this before, so they wanted to read the next chunk in this program. At least that's what I understood. – cs95 Jul 19 '17 at 21:14
  • @cᴏʟᴅsᴘᴇᴇᴅ Would the program run until the end of the file offsetting each time? – A. Hartman Jul 19 '17 at 21:15
  • @A.Hartman Do you just want to read one chunk here? Or do you want to read the whole file in chunks as of now? Depending on that, your code would change. – cs95 Jul 19 '17 at 21:16
  • You just need to seek once. After that `f.read` automatically increments the file pointer past the number of bytes most recently read. So the answer is, if you want to read all chunks starting from 50million, then you call `f.seek(50m)` once and then `f.read(50m)` as many times as needed. – cs95 Jul 19 '17 at 21:16
  • @cᴏʟᴅsᴘᴇᴇᴅ I want to read the whole file in chunks, and okay, that makes sense. – A. Hartman Jul 19 '17 at 21:19
  • @cᴏʟᴅsᴘᴇᴇᴅ Thank you for all of your help! It truly means a lot! – A. Hartman Jul 19 '17 at 21:27
3

Use f.read(50000000) in a loop and it will read the file in chunks of 50000000, e.g.:

In []:
from io import StringIO

s = '''hello'''
with StringIO(s) as f:
    while True:
        c = f.read(2)
        if not c:
            break
        print(c)

Out[]:
he
ll
o
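
The same loop carries over to the binary case in the question; a sketch using `BytesIO` (standing in for the real file) with a chunk size chosen as a multiple of 4 so each chunk unpacks cleanly:

```python
import struct
from io import BytesIO

# BytesIO stands in for open(x, "rb"); 24 bytes holding the ints 0..5
buf = BytesIO(struct.pack("<6I", *range(6)))

chunks = []
while True:
    chunk = buf.read(8)  # 8 bytes = two 4-byte unsigned ints per chunk
    if not chunk:
        break
    chunks.append(struct.unpack("<%dI" % (len(chunk) // 4), chunk))

print(chunks)  # [(0, 1), (2, 3), (4, 5)]
```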
AChampion