I am iterating over 80 million lines in a 2.5 GB file to build a list of the byte offset at which each line starts. Memory usage grows slowly, as expected, until I hit around line 40 million; it then jumps by roughly 1.5 GB in 3-5 seconds and the process exits because it runs out of memory.
After some investigation, I discovered that the blow-up occurs right around the point where the current offset (curr_offset) reaches about 2 billion, which is roughly my sys.maxint (2^31-1).
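For what it's worth, here is the kind of quick check I can run to see whether the integer object gets bigger once it crosses sys.maxint. This is a sketch for CPython; I'm not certain sys.getsizeof reports anything meaningful under IronPython.

import sys

# Size of the largest plain int versus the value one past the boundary
# (which becomes a long). The numbers depend on the interpreter and
# whether it is a 32-bit or 64-bit build.
print(sys.getsizeof(sys.maxint))
print(sys.getsizeof(sys.maxint + 1))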
My questions are:
- Do numbers greater than sys.maxint require substantially more memory to store? If so, why? If not, why would I be seeing this behavior?
- What factors (e.g. which Python implementation, which operating system) determine sys.maxint?
  - On my 2010 MacBook Pro using 64-bit Python, sys.maxint is 2^63-1.
  - On my Windows 7 laptop using 64-bit IronPython, sys.maxint is the smaller 2^31-1, and I get the same value with 32-bit Python. For various reasons, I can't get 64-bit Python on my Windows machine right now. (A quick way to read these values off a given interpreter is sketched after this list.)
- Is there a better way to create this list of offsets? (One alternative I've been wondering about is sketched after the code below.)
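This is the check I mean for reading off the interpreter, bitness, and sys.maxint on each machine; platform.python_implementation() and platform.architecture() are standard-library calls, though I haven't confirmed exactly what they report under IronPython:

import sys
import platform

# Which interpreter and bitness am I actually running, and what does it
# report for sys.maxint?
print(platform.python_implementation())
print(platform.architecture())
print(sys.maxint)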
The code in question:
f = open('some_file', 'rb')
curr_offset = 0
offsets = []
for line in f:
    # Record the byte offset at which this line started, then advance
    # by the length of the line just read.
    offsets.append(curr_offset)
    curr_offset += len(line)
f.close()
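On the last question, one alternative I've been considering is to store the offsets in an array.array of C doubles rather than a list of Python integer objects. This is only a sketch: I haven't verified that it avoids the blow-up, and I'm assuming the array module behaves the same under IronPython as under CPython. A double represents byte offsets exactly up to 2^53, which easily covers a 2.5 GB file.

from array import array

f = open('some_file', 'rb')
curr_offset = 0
offsets = array('d')  # C doubles: 8 bytes each, exact for integers up to 2**53
for line in f:
    offsets.append(curr_offset)
    curr_offset += len(line)
f.close()

# Offsets come back as floats, so convert before seeking, e.g.:
# f.seek(int(offsets[i]))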