
This is very odd

I'm reading some (admittedly very large: ~2GB each) binary files using numpy libraries in Python. I'm using the:

thingy = np.fromfile(fileObject, np.int16, 1)

method. This call sits right in the middle of a nested loop: it runs 4096 times per 'channel', the 'channel' loop runs 9 times for every 'receiver', and the 'receiver' loop runs 4 times (so 9 channels per receiver, and 4 receivers). All of this happens for every 'block', of which there are ~3600 per file.
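To illustrate, the loop looks roughly like this (a sketch only — the file name and loop variable names here are placeholders, not the real code):

import numpy as np

# Sketch of the loop structure described above (file path and names are assumed).
# That is 4096 * 9 * 4 = 147,456 separate np.fromfile() calls per 'block'.
with open("data.bin", "rb") as fileObject:
    for block in range(3600):
        for receiver in range(4):
            for channel in range(9):
                for sample in range(4096):
                    thingy = np.fromfile(fileObject, np.int16, 1)  # one int16 per call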

So as you can see, it's very iterative and I know it will take a long time, but it was taking a LOT longer than I expected: on average 8.5 seconds per 'block'.

I ran some benchmarks using time.clock() etc. and found everything going as fast as it should be, except for approximately 1 or 2 samples per 'block' (so 1 or 2 in 4096*9*4) where it would seem to get 'stuck' for a few seconds. Now this should be a case of returning a simple int16 from binary, not exactly something that should be taking seconds... why is it sticking?
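The per-sample timing looks roughly like this (a sketch only, assuming the already-open fileObject from the snippet above and an arbitrary 0.5 s threshold; time.clock() was current at the time, time.perf_counter() is its modern replacement and is used here):

import time
import numpy as np

slow_reads = []
for sample in range(4096):
    t0 = time.perf_counter()
    thingy = np.fromfile(fileObject, np.int16, 1)
    elapsed = time.perf_counter() - t0
    if elapsed > 0.5:                        # flag any single read that takes far too long
        slow_reads.append((sample, elapsed))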

From the benchmarking I found it was sticking in the SAME place every time (block 2, receiver 8, channel 3, sample 1085 was one of them, for the record!), and it would get stuck there for approximately the same amount of time each run.

Any ideas?!

Thanks,

Duncan

Duncan Tait
  • Counting starting from 0 I presume? – Craig McQueen Feb 15 '10 at 12:49
  • Yep, so receivers 0-3, channels 0-7, samples 0-4095 – Duncan Tait Feb 15 '10 at 13:54
  • The problem with something like `fromfile()` is that it can't know in advance how much space to allocate, so with really large files you might be screwed. See my answer and some of the following comments in http://stackoverflow.com/questions/1896674/python-how-to-read-huge-text-file-into-memory for possible ideas on how to handle this, and the underlying problem. – Peter Hansen Feb 16 '10 at 01:30
  • Peter - thanks for that, the thing is I'm not trying to store everything simultaneously in any of it. I'm just reading manageable blocks of data (~2MB max), calculating stuff with them, writing the result to file, then repeating that. It seems that maybe the ones I'm finished with aren't being disposed of/garbage collected. I'll try some of these solutions tomorrow when back at work. – Duncan Tait Feb 16 '10 at 20:56
  • ones I'm finished with aren't being disposed of -- try a `del xx` when done ? may gc sooner, may not – denis Feb 19 '10 at 17:29
  • @Duncan Tait, the reason I pointed out that issue is that using `fromfile()` means the array has to "grow" in some fashion, with lots of memory activity resulting. If you know in advance the size you need (which you appear to), you can pre-allocate, load much faster, and avoid the memory thrashing that seems to be your main problem. I think `fromfile()` might be, like `print` and `input()`, intended for simplistic situations. – Peter Hansen Feb 27 '10 at 15:58
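A minimal sketch of the pre-allocation idea from the comment above: read a whole channel (or a whole block) with one np.fromfile() call and an explicit count, instead of one int16 at a time. The shapes and names here are assumptions, not the original code:

import numpy as np

# Read all 4096 int16 samples for one channel in a single call...
samples = np.fromfile(fileObject, dtype=np.int16, count=4096)

# ...or pre-allocate the whole block up front and fill it channel by channel.
block = np.empty((4, 9, 4096), dtype=np.int16)   # receivers x channels x samples
for receiver in range(4):
    for channel in range(9):
        block[receiver, channel, :] = np.fromfile(fileObject, dtype=np.int16, count=4096)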

3 Answers


Although it's hard to say without some kind of reproducible sample, this sounds like a buffering problem. The first part of the file is buffered, so reads are fast until you reach the end of the buffer; then it slows down while the next buffer is filled, and so on.

Max Shawabkeh
  • Yes this sounds likely, do you know of a way I could test this? Or what the likely buffer size is? – Duncan Tait Feb 15 '10 at 13:53
  • Well, one thing to at least determine whether gnibbler or I am closer to the solution is to run it and instantly throw away the results. If the slowdown still occurs, it's more likely to be a buffering problem. Then perhaps see if reading manually instead of through `numpy` changes anything. – Max Shawabkeh Feb 15 '10 at 16:49
  • Sure, sorry to keep asking new questions, but how do you 'throw away' an object in Python? I've been trying to find out about dispose etc. for ages but can't find it anywhere. – Duncan Tait Feb 15 '10 at 20:13
  • You can use `gc.collect()` (http://docs.python.org/library/gc.html) to force a garbage collection, but what I meant was simply reading and not assigning the result to anything. – Max Shawabkeh Feb 15 '10 at 20:17
  • Another way to test if it is caused by file buffering is to change the buffer size in the file open() function. The default is to use the OS's default size. See if changing it changes where the pause happens. N.B. Setting the buffer size does not work on all OS - see the docs. – Dave Kirby Feb 15 '10 at 20:17
  • Max S: You may be right, the problem pretty much disappears when I don't assign the instance (created to contain all the data) to anything, even though it still does all the processing. Is there any way to see how big an object/list is in memory? I swear it shouldn't be that large, 2MB max, probably under 1MB, that really shouldn't be an issue should it... – Duncan Tait Feb 17 '10 at 16:18
  • Python has quite a large memory overhead when you have lots of small objects. On my 64-bit machine, an empty string takes up 40 bytes, but each extra character takes up only one more byte. See this answer for details: http://stackoverflow.com/questions/2211965/python-memory-usage-loading-large-dictionaries-in-memory/2212005#2212005 – Max Shawabkeh Feb 17 '10 at 16:29
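A minimal way to try Dave Kirby's suggestion above, assuming a hypothetical file name: open the file with an explicit buffer size and see whether the pause moves (the 1 MB size here is arbitrary, and buffer-size handling varies by OS, as noted in the comment):

import numpy as np

# Open with an explicit 1 MB buffer instead of the OS default buffer size.
with open("data.bin", "rb", 1024 * 1024) as fileObject:
    thingy = np.fromfile(fileObject, np.int16, 1)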

Where are you storing the results? When lists/dicts/whatever get very large there can be a noticeable delay when they need to be reallocated and resized.

John La Rooy
  • Well, essentially they're all stored in lists within lists, and the entire set of data (per 'block') is stored in an instance of a class, along with header info. This shouldn't be more than a megabyte really though... Unless Python isn't disposing of lists? Can I force it to do this? – Duncan Tait Feb 15 '10 at 15:41
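A minimal sketch of the suggestions made elsewhere in the thread (`del` the finished block and force a collection with `gc.collect()`); the nested-list structure here is hypothetical, and whether this helps depends on whether anything else still references the data:

import gc

# Hypothetical per-block results: lists within lists, as described above.
block_data = [[list(range(4096)) for _ in range(9)] for _ in range(4)]

# ... process block_data, write results to file ...

del block_data    # drop the reference once the block is finished with
gc.collect()      # force a collection pass (may or may not make a difference)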

Could it be that garbage collection is kicking in for the lists?

Added: is it funny data, or blockno? What happens if you read the blocks in random order, along the lines of:

import random

r = list(range(4096))
random.shuffle(r)  # shuffle in place
for blockno in r:
    file.seek( blockno * ... )  # seek to that block's offset
    ...
denis