Question (score: 3)

I am iterating over 80 million lines in a 2.5 GB file to build a list of offsets marking the start of each line. Memory usage grows slowly, as expected, until around line 40 million; it then climbs by roughly 1.5 GB within 3-5 seconds, and the process exits for lack of memory.

After some investigation, I discovered that the blow-up occurs around the time the current offset (`curr_offset`) reaches about 2 billion, which happens to be around my `sys.maxint` (2^31-1).

My questions are:

  • Do numbers greater than sys.maxint require substantially more memory to store? If so, why? If not, why would I be seeing this behavior?
  • What factors (e.g. which Python, which operating system) determine sys.maxint?
    • On my 2010 MacBook Pro using 64-bit Python, sys.maxint is 2^63-1.
    • On my Windows 7 laptop using 64-bit IronPython, sys.maxint is the smaller 2^31-1. Same with 32-bit Python. For various reasons, I can't get 64-bit Python on my Windows machine right now.
  • Is there a better way to create this list of offsets?

The code in question:

f = open('some_file', 'rb')
curr_offset = 0
offsets = []                    # byte offset of the start of each line
for line in f:
    offsets.append(curr_offset)
    curr_offset += len(line)
f.close()
  • Is this about a `Mersenne prime`? –  Jan 04 '15 at 04:32
  • I’m not exactly sure how the internals work, but numbers greater than `sys.maxint` are automatically stored as `long`s, which (theoretically) allow arbitrarily many digits. Those grow automatically in size, and they seem to be efficient enough that in Python 3 the old `int` was removed and all ints became longs instead. – poke Jan 04 '15 at 04:32
  • Are you sure this happens because of the size of `curr_offset`? To me, it would rather appear that the size of your `offsets` list hits your machine’s physical RAM limit with 40 million elements. How much RAM do you have, and how much is the program using at that point? – poke Jan 04 '15 at 04:38
  • user2357112, As far as I know, I'm just reading one line at a time and discarding it. – Henrik Lang Jan 04 '15 at 04:39
  • @MarkRansom: Oh, whoops, he's appending the offsets to the list; I thought he was appending the lines. – user2357112 Jan 04 '15 at 04:39
  • @HenrikLang: My mistake. I misread your code. – user2357112 Jan 04 '15 at 04:40
  • If the size of the `offsets` list turns out to be a problem, consider storing the offsets in a file instead if you need to process them later. – poke Jan 04 '15 at 04:41
  • Side note: you could use `with open(…) as f` for automatic closing of your file and better handling of exceptions. – Eric O. Lebigot Jan 04 '15 at 04:41

6 Answers

Answer (score: 2)

Integers larger than `sys.maxint` will require more memory as they're stored as `long`s. If your `sys.maxint` is only around 2 billion, you're using a 32-bit build -- download, install, and use a 64-bit build, and you'll avoid the problem. Your code looks fine!
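
For illustration, a quick Python 2 sketch (CPython assumed; the exact byte counts vary between builds) of the promotion to `long` and the extra space it takes:

import sys

print sys.maxint                      # e.g. 2147483647 on a 32-bit build
print type(sys.maxint)                # <type 'int'>
print type(sys.maxint + 1)            # <type 'long'>: automatic promotion
print sys.getsizeof(sys.maxint)       # fixed-size int object
print sys.getsizeof(sys.maxint + 1)   # slightly larger, variable-size long object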

– Alex Martelli (edited by Eric O. Lebigot)
  • For 64-bit Windows, `sys.maxint == 2**31 - 1`. A CPython [`PyIntObject`](https://hg.python.org/cpython/file/648dcafa7e5f/Include/intobject.h#l23) uses a C `long`, which is always 32-bit on Windows. – Eryk Sun Jan 04 '15 at 05:01
  • @eryksun On my Windows 7 machine with a 32-bit Python interpreter, `sys.maxint` is `2147483647` and `sys.maxint == 2**31-1` is `True`, but when I evaluate `2**31-1` I get `2147483647L`. Why is it of type `long` when it equals `sys.maxint`? Please explain this behaviour. – Tanveer Alam Jan 04 '15 at 12:52
  • @TanveerAlam, on all platforms, `sys.maxint + 1 - 1` is a `long` -- because so obviously is `sys.maxint + 1`, and a `long` operated with an `int` is a `long`. So, in your case, `2**31` is a `long`, and therefore so is the expression `2**31 - 1`. Incidentally, `sys.maxint - 1 + 1`, on any platform, is an `int`, as the subtraction is done first, so the intermediate result fits in an `int`. – Alex Martelli Jan 04 '15 at 15:33
  • @TanveerAlam, just to clarify, I referred to a C `long` being 32-bit, not a Python `long`. A CPython `int` is stored as a C `long`. Thus, since a C `long` is always 32-bit on Windows, the suggestion to use a 64-bit build won't help there. – Eryk Sun Jan 04 '15 at 17:33
  • @eryksun Thanks for the information that a C `long` is always 32-bit on Windows. – Tanveer Alam Jan 04 '15 at 20:39
  • @AlexMartelli Thanks for your wonderful explanation. I get that a `long` operated with an `int` results in a `long`. But take `sys.maxint + 1 - sys.maxint`, which gives `1L`: the result is within the range of `sys.maxint`, so why does it keep the type `long` rather than being cast back to an `int` dynamically? I know I can use `int(1L)` to get an integer value, but why doesn't that happen automatically? I'm just curious about that. – Tanveer Alam Jan 04 '15 at 20:44
  • @TanveerAlam, it's a simple design choice (in `longobject.c` in the CPython 2 sources): all results of `long` operations are returned via `long_normalize`, and that is of course much faster by not checking (and possibly having to allocate a new `int` `PyObject*` and copy the value over). `sys.getsizeof(1)` is 24, `sys.getsizeof(1L)` is 28; it's hard to argue that saving those 4 bytes is worth slowing down **every** operation on `long`s!-) – Alex Martelli Jan 04 '15 at 21:37
  • @AlexMartelli I totally get that converting a `long` back to a new `int` `PyObject*` in CPython isn't worth it to save 4 bytes. I have one last question, and it may be off topic, but I really need to know. I learned Python and never got a chance to learn about CPython. What is CPython, how is it different from Python, and where is it used? I work on Django and know that it's built on Python, but how would learning CPython help me in practical terms? I know there may be many answers available, but it would be really helpful to get an answer from you, given your experience and expertise. Thank you. – Tanveer Alam Jan 05 '15 at 02:36
  • @TanveerAlam, CPython is the standard and by far the most widespread implementation of Python -- that's what you most likely do know. There are other interesting implementations (PyPy, IronPython, Jython), but CPython (so called because its runtime is coded in C, rather than in Python, C#, and Java for the other three, in order) is still "the gold standard". – Alex Martelli Jan 05 '15 at 03:45
  • @AlexMartelli So where do we practically use CPython for building an application? Or can we use CPython in place of Python itself? Thanks a lot for your help. – Tanveer Alam Jan 05 '15 at 03:55
  • We're using CPython each and every time we run a program called `python` -- if we were using other implementations of Python they'd have different names, such as `ironpython` &c. – Alex Martelli Jan 05 '15 at 04:02
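
A quick interactive check of the evaluation-order point made in the comments above (Python 2, CPython assumed):

import sys

print type(sys.maxint + 1 - 1)   # <type 'long'>: sys.maxint + 1 overflows to long first
print type(sys.maxint - 1 + 1)   # <type 'int'>:  every intermediate result fits in an int
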
Answer (score: 2)

Here is a solution that works even with 32-bit Python versions: store the lengths of the lines (they are small), convert them into a NumPy array of 64-bit integers, and only then calculate the offsets:

import numpy
with open('some_file', 'rb') as input_file:
    lengths = numpy.array(map(len, input_file), dtype=numpy.uint64)
# cumsum() alone gives the offset just past each line; subtracting each line's
# own length turns this into the offset of the start of each line:
offsets = lengths.cumsum() - lengths

where cumsum() calculates the cumulative sum of the line lengths, and subtracting each line's own length shifts these end-of-line positions to start-of-line offsets, matching the original code (offsets[0] is 0). 80 M lines give a manageable 8*80 = 640 MB array of offsets.

The intermediate Python list built by map() can even be bypassed by building the array of lengths directly with numpy.fromiter():

import numpy
with open('some_file', 'rb') as input_file:
    lengths = numpy.fromiter((len(line) for line in input_file), dtype=numpy.uint64)
offsets = lengths.cumsum() - lengths

I suspect it would be hard to find a much faster method: using a single numeric type (64-bit integers) makes NumPy arrays both faster and far more compact than Python lists.
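
A brief usage sketch (the file name and the line index 50,000,000 are just placeholders): the resulting array lets you seek() straight to the start of any line later on:

import numpy

with open('some_file', 'rb') as input_file:
    lengths = numpy.fromiter((len(line) for line in input_file), dtype=numpy.uint64)
offsets = lengths.cumsum() - lengths

# Later: jump straight to the start of line 50,000,000 without rereading everything.
with open('some_file', 'rb') as input_file:
    input_file.seek(int(offsets[50000000]))
    print input_file.readline()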

– Eric O. Lebigot
  • Alternatively, do it in multiple chunks of numpy arrays and then join them together at the very end. Numpy arrays are much faster and more memory-efficient than ordinary lists. – Rufflewind Jan 04 '15 at 04:50
  • True, but a better option is to let NumPy do this for you: this is essentially what `fromiter()` does, I would guess. Your comment inspired me to add this solution. :) – Eric O. Lebigot Jan 04 '15 at 04:56
Answer (score: 1)

An offset into a 2.5 GB file should never need more than eight bytes. Indeed, a signed 64-bit integer has a maximum value of 9223372036854775807, much greater than 2.5G.

If you have 80 million lines, you should require no more than 640 MB to store an array of 80M offsets.

I would investigate whether something is buggy in your code or in Python, perhaps by using a different container (such as an explicit NumPy array of 64-bit integers) or a preinitialized list, or even a different language altogether to store and process your offsets, such as an `off_t` in C, with appropriate compile flags.
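
As one minimal sketch of the "different container" idea using only the standard library (the 'q' typecode requires Python 3.3+; on Python 2, a NumPy array as in the other answers plays the same role; the file name is a placeholder):

from array import array

# 'q' stores signed 8-byte integers packed in a C buffer, so each offset costs
# 8 bytes rather than a full Python object: 80 M offsets take roughly 640 MB.
offsets = array('q')
curr_offset = 0
with open('some_file', 'rb') as f:
    for line in f:
        offsets.append(curr_offset)
        curr_offset += len(line)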

(If you're curious and want to look at demo code, I wrote a program in C called sample that stores 64-bit offsets to newlines in an input file, to be able to do reservoir sampling on a scale larger than GNU sort.)

– Alex Reynolds
Answer (score: 0)

Yes. Above `sys.maxint`, Python represents integers as arbitrary-precision `long`s (bignums), and those take more space.

– Basile Starynkevitch
Answer (score: 0)

If you really can't use a 64-bit version of Python, you can hold your calculated offsets in a NumPy array of numpy.uint64 numbers (maximum value 2**64-1). This is a little inconvenient, because you have to grow the array dynamically when it reaches capacity, but it would solve your problem and run on any version of Python.
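
A minimal sketch of that growable-buffer approach (the initial capacity, the doubling policy, and the file name are illustrative choices, not prescribed by this answer):

import numpy

capacity = 1024 * 1024                       # initial size; doubled as needed
offsets = numpy.empty(capacity, dtype=numpy.uint64)
n = 0
curr_offset = 0
with open('some_file', 'rb') as f:
    for line in f:
        if n == capacity:                    # buffer full: double it
            capacity *= 2
            bigger = numpy.empty(capacity, dtype=numpy.uint64)
            bigger[:n] = offsets
            offsets = bigger
        offsets[n] = curr_offset
        n += 1
        curr_offset += len(line)
offsets = offsets[:n]                        # keep only the filled part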

PS: A more convenient NumPy-based solution, that does not require dynamically updating the size of the NumPy array of offsets, is given in my other answer.

– Eric O. Lebigot
Answer (score: 0)

Appending to a list reallocates the list's buffer once it outgrows the current capacity. I don't know specifically how Python does it, but a common method is to allocate 1.5x to 2x the previous buffer size. The growth is geometric, so it's normal to see the memory requirement jump sharply near the end. It might also be that the size of the list is simply too large overall; a quick test would be to replace `curr_offset += len(line)` with `curr_offset += 1` and see if you get the same behavior.
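
A small sketch (CPython assumed; the exact growth factor is an implementation detail) that makes the reallocation steps visible by printing the list's size whenever its underlying buffer grows:

import sys

offsets = []
last_size = sys.getsizeof(offsets)
for i in xrange(1, 1000001):
    offsets.append(i)
    size = sys.getsizeof(offsets)
    if size != last_size:        # the underlying buffer was just reallocated
        print "after %d appends the list object takes %d bytes" % (i, size)
        last_size = size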

– Mark Ransom
  • While it might be exponential, it shouldn't overestimate the size of the memory by more than a factor of two. – Rufflewind Jan 04 '15 at 04:48
  • The memory requirement should not increase rapidly near the end, as the throughput (of additions to the list of offsets) remains the same. It is true that it should make exponentially bigger jumps, though, but they also become exponentially rarer. – Eric O. Lebigot Jan 04 '15 at 04:50
  • @EOL if you're watching the memory usage it's going to seem like it takes a sudden jump, and you might not notice that the jumps are coming less often. – Mark Ransom Jan 04 '15 at 04:52
  • @Rufflewind I'd prefer to turn that statement around: the memory requirements might be overestimated by up to a factor of two. – Mark Ransom Jan 04 '15 at 04:55