
I have a very large data file (about 70 MB) and I want to process it. I have tried processing it in chunks, but it is still slow. I want to use numpy.frombuffer to queue the data and work with around 1 MB at a time, so that it won't fill up the memory.

I am getting this error:

buffer size must be a multiple of element size

A sample input would be like this: `array([0, 0, 0, 0, 0], dtype=int16)`.
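
Roughly what I am trying to do is something like this simplified sketch (the file name and chunk size are placeholders, not my real code); the error above appears when the chunk of bytes handed to np.frombuffer is not an exact multiple of the int16 item size:

import numpy as np

ITEMSIZE = np.dtype(np.int16).itemsize    # 2 bytes per int16
CHUNK_BYTES = 1024 * 1024                 # ~1 MB; a multiple of ITEMSIZE (and of 4, so I/Q pairs aren't split)

with open('sourcedata.I', 'rb') as f:     # placeholder file name
    while True:
        buf = f.read(CHUNK_BYTES)         # if this length weren't a multiple of ITEMSIZE,
        if not buf:                       # frombuffer would raise the error above
            break
        chunk = np.frombuffer(buf, dtype=np.int16)
        # ... process `chunk` here ...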

  • 70 MB doesn't seem like it would be enough to even come close to filling up your memory on most modern machines. What hardware are you running this on? – mgilson Jan 26 '16 at 19:19
  • *"...it is still slow."* *What* is slow? Reading the file? Processing the data? What are you doing with the data? Also, is the file in a binary format or text? More information is needed. – Warren Weckesser Jan 26 '16 at 19:26
  • if the matrix is too large to fit in core RAM, have you considered trying numpy's memmapped array structure? It supports convenient slicing like a 'regular' in-memory array, but reads only the accessed slice from disk. [link to numpy docs](http://docs.scipy.org/doc/numpy/reference/generated/numpy.memmap.html) – svohara Jan 26 '16 at 20:22
  • What kind of processing are you doing? And what's the format of the data file (csv, binary, np.save, ...)? – hpaulj Jan 26 '16 at 21:08
  • @mgilson I am using an Intel Core i5 with 4 GB of RAM. – B. Z. Jan 26 '16 at 21:30
  • It looks like 70 MB is ~1.75% of your memory. Why not pull it all in? – mgilson Jan 26 '16 at 21:33
  • @svohara Slow as in it takes about 20 minutes to give an error, and sometimes it kills the Python script. I am trying to input data of IQ values in int16 format and then perform some signal processing on that data, but if I input the whole data it won't even finish running the script; it just gets killed. You are right, I believe the matrix is too large to fit in core RAM, so I have to study how to use the memmapped array structure. – B. Z. Jan 26 '16 at 21:35
  • @hpaulj The input data is a series of IQ values, and I am doing some signal processing and then saving it to a new file. – B. Z. Jan 26 '16 at 21:36
  • @B.Z. - Reading in 70MB worth of `int16`s should be _very_ fast. Can you show some code? You may be doing things in a rather inefficient way. – Joe Kington Jan 26 '16 at 21:45
  • @JoeKington Here is the part where I input the file; this by itself takes about 10 minutes. `beginner = 0` `ender = beginner + 50000000` `lenData = ender - beginner` `dataShort = np.fromfile('/home/sourcedata.I', dtype=np.int16)` `data = np.array([np.complex(dataShort[i], dataShort[i+1]) for i in np.arange(beginner*2,(lenData*2+beginner*2)-1,2)])` The beginner and ender are just indicative of where exactly to begin when inputting the file. In this case, starting from the first byte, go for 50Mb. – B. Z. Jan 26 '16 at 22:17
  • @B.Z. - Well, there's your problem. Loading in the file is very fast. However, you're creating a big temporary list and iterating through numpy arrays, which is rather slow. Instead of the list comprehension, you want something like `data = dataShort[::2] + dataShort[1::2]*1j` – Joe Kington Jan 26 '16 at 22:18
  • @JoeKington Perfect, it finished in 4 seconds. Thank you so much. – B. Z. Jan 26 '16 at 22:23
  • @JoeKington Both methods, slicing and creating an array of zeros, work very well when I am reading the file. However, I get a memory error when I try to process the file. The size of the file I am trying to process is 645 MB, and the free memory I have the moment I try to execute the script is about 5 GB. After reading the file, free memory goes down to about 2.7 GB, which doesn't leave much for processing. I was wondering if there is a way not to create any other list or array and just modify the imported file by reference, or if I have to use a buffer to input part of the file at a time for processing. – B. Z. Jan 28 '16 at 17:05
  • @B.Z. - If the array is 645 MB on disk, the array it creates in memory will also be 645 MB. However, when you convert to `complex128` format, it will be ~2.5 GB. You'll need to be careful about the processing you're doing if you want to work with it in-memory. What exactly are you doing? Be sure to use in-place operations (e.g. `x += 1` instead of `x = x + 1`) if you want to modify it without needing more RAM. See http://stackoverflow.com/questions/4370745/view-onto-a-numpy-array/4371049#4371049 for advice, also. – Joe Kington Jan 28 '16 at 17:37
  • @JoeKington Thank you for your reply. I am trying to sample the data at a different sample rate. So after I have the complex format of data, I am inserting zeros and then running it through a low pass filter to get the sampled data. I wonder if there is a way to get rid of the original data after the sampling is done so that it can free up memory. – B. Z. Jan 28 '16 at 18:14
  • @B.Z. - That sounds like a possibly inefficient way to re-sample the data. Why not use one of the standard resampling methods? As far as deleting the original data goes, it will be garbage collected when it goes out of scope, or you can use the `del` statement. – Joe Kington Jan 28 '16 at 18:18
  • @JoeKington I tried the interpolation function from scipy, but it was slow. Maybe it was because of the way I read the file, so I will try it again. – B. Z. Jan 28 '16 at 18:27
  • @B.Z. - You're probably looking for `scipy.signal.resample` instead of the functions in `scipy.interpolate`. The latter is meant for irregularly sampled data and will be quite slow. `scipy.signal.resample` is meant for regularly-sampled data and operates in the Fourier domain. It will be fast, but it could be too memory-intensive for your case, as it needs to compute the full Fourier transform. – Joe Kington Jan 28 '16 at 18:30
  • @JoeKington `scipy.signal.resample` works well, but it still consumes too much memory, so I have to do it in blocks. I couldn't find a good thread about `numpy.frombuffer` or `Queue.deque()`. Please let me know if you know one or if there is a better solution. Thank you. – B. Z. Feb 08 '16 at 20:14
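
A rough sketch of the block-wise approach discussed in the last few comments, combining the memmapped-array suggestion with per-block `scipy.signal.resample` so that only one slice is in memory at a time. The file names, block size, and resampling factor below are placeholder assumptions, and resampling each block independently can introduce artifacts at the block boundaries:

import numpy as np
from scipy.signal import resample

BLOCK = 2000000        # int16 values per block; kept even so I/Q pairs aren't split
FACTOR = 2             # placeholder resampling ratio

# Nothing is read until a slice is accessed; assumes the file holds complete I/Q pairs.
raw = np.memmap('sourcedata.I', dtype=np.int16, mode='r')

with open('resampled.bin', 'wb') as out:                      # placeholder output file
    for start in range(0, raw.size, BLOCK):
        block = raw[start:start + BLOCK]                      # only this slice is pulled from disk
        iq = np.zeros(block.size // 2, dtype=np.complex64)
        iq.real, iq.imag = block[::2], block[1::2]
        resampled = resample(iq, FACTOR * iq.size)            # Fourier-domain resampling of one block
        resampled.astype(np.complex64).tofile(out)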

1 Answer


Based on your comment, it sounds like you're reading in an array of int16s and then "de-interleaving" them into complex numbers.

You're currently doing something like:

d_short = np.fromfile(filename, dtype=np.int16)
data = np.array([np.complex(d_short[i], d_short[i+1]) for i in np.arange(...)])

The slow part is the second line.

You're creating a big temporary list, and you're creating it by iterating through a numpy array element-by-element. Iterating through a numpy array in Python is much slower than iterating through a list, so avoid it wherever you can. Furthermore, the list comprehension produces a temporary list that takes up much more memory than the original array.

Instead of iterating through, use slicing. In this case, it's equivalent to:

data = d_short[::2] + d_short[1::2] * 1j
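
As a quick sanity check that the slicing expression matches the loop version (a tiny synthetic example; the numbers are arbitrary):

import numpy as np

d_short = np.array([0, 1, 2, 3, 4, 5], dtype=np.int16)

looped = np.array([complex(d_short[i], d_short[i + 1]) for i in range(0, d_short.size, 2)])
sliced = d_short[::2] + d_short[1::2] * 1j

assert np.allclose(looped, sliced)    # both give [0+1j, 2+3j, 4+5j]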

The slicing expression will create a temporary array, but it shouldn't be an issue. However, if you're really concerned about memory usage, you might consider something like:

data = np.zeros(d_short.size // 2, dtype=np.complex64)
data.real, data.imag = d_short[::2], d_short[1::2]

While this is considerably less readable, it does have some advantages.

  1. No temporary array is created, so we only need the amount of memory used by d_short and data.
  2. Instead of creating an np.complex128 array, we're creating an np.complex64 array (two 32-bit floats), which uses half the memory. Because you're inputting 16-bit ints, there's no loss of precision.
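
Putting the pieces together, a rough end-to-end sketch (the helper name and file path are placeholders, and the count/offset handling is just one way to pull in part of the file at a time):

import numpy as np

def load_iq(filename, count=-1, offset_samples=0):
    # Read interleaved int16 I/Q data and return a complex64 array.
    # `count` is the number of int16 values to read (-1 means the whole file);
    # `offset_samples` is how many int16 values to skip first.
    with open(filename, 'rb') as f:
        f.seek(offset_samples * np.dtype(np.int16).itemsize)
        d_short = np.fromfile(f, dtype=np.int16, count=count)

    data = np.zeros(d_short.size // 2, dtype=np.complex64)   # half the memory of complex128
    data.real, data.imag = d_short[::2], d_short[1::2]
    return data

# e.g. read the whole file:
#   data = load_iq('/home/sourcedata.I')
# or only part of it (2 int16 values per complex sample):
#   data = load_iq('/home/sourcedata.I', count=2 * 1000000)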