
My situation is like this:

  1. I have ~70 million integer values distributed across various files for ~10 categories of data (exact number not known)

  2. I read those files and create some Python object from that data. This obviously involves reading each file line by line and appending to the object, so I end up with an array of ~70 million sub-arrays, with 10 values in each (a rough sketch of this follows the list).

  3. I do some statistical processing on that data. This involves appending several values (say, a percentile rank) to each 'row' of data.

  4. I store this object in a database.
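In rough Python, the workflow above looks something like this (the file names and line parsing are just placeholders):

```python
# Rough sketch of the current approach; file names and parsing are placeholders.
data = []
for path in ["category_a.txt", "category_b.txt"]:   # several files in practice
    with open(path) as f:
        for line in f:
            data.append([int(x) for x in line.split()])  # 10 integer values per line

# Statistical processing then appends derived values (e.g. a percentile rank)
# to each row, and the result is written to a database.
```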

Now, I have never worked with data of this scale. My first instinct was to use NumPy for more memory-efficient arrays. But then I've heard that 'append' on NumPy arrays is discouraged as it's not efficient.

So what would you suggest I go with? Any general tips for working with data of this size? I can bring the data down to 20% of its size with random sampling if it's required.

EDIT: Edited for clarity about size and type of data.

user1265125
  • "around ~70 mil" - 70 what? 70 million bytes of data? 70 million data points? 70 of something that abbreviates to "mil"? Could you expand that point a bit? – user2357112 Aug 20 '14 at 07:38
  • Take a look at [pandas](http://pandas.pydata.org/) – Korem Aug 20 '14 at 07:54
  • @Korem how would using pandas solve the memory storage issue? Pandas uses numpy and there will be additional overhead for the pandas structures, so it will increase the memory usage. The idea here would be to create your structures once if possible; growing them a row at a time is inefficient. Your question is a bit broad, but this may help: http://stackoverflow.com/questions/14262433/large-data-work-flows-using-pandas – EdChum Aug 20 '14 at 08:00
  • Also related: http://stackoverflow.com/questions/16660617/what-is-the-advantage-of-pytables – EdChum Aug 20 '14 at 08:00
  • @EdChum I do not see anywhere that the OP is concerned about the actual size in memory. What I read from the question is that he is concerned about efficient loading **into** memory (talking about append, etc.). pandas has efficient methods for loading large amounts of data. As a side note, this was a pointer at a package the OP might not be familiar with. It was not an answer. – Korem Aug 20 '14 at 08:07
  • @Korem `My first instinct was to use Numpy for more efficient arrays w.r.t memory` my interpretation of this line is that memory efficient = size. Pandas would still suffer the same issues as numpy for appending row by row. Still this question lacks detail – EdChum Aug 20 '14 at 08:10
  • So mil means millions? – log0 Aug 20 '14 at 09:11
  • Yes, I'm not concerned with the actual size in memory because it's gonna be impossible to get around that. My main concern is the 'append' operation on a large numpy array. – user1265125 Aug 20 '14 at 09:18
  • Did you think about preallocating a numpy array then assigning values? `my_array = numpy.zeros(length); for i, line in enumerate(file): ... my_array[i] = ...` – log0 Aug 20 '14 at 09:23

1 Answer


If I understand your description correctly, your dataset will contain ~700 million integers. Even if you use 64-bit ints, that would still only come to about 6 GB. Depending on how much RAM you have and what you want to do in terms of statistical processing, your dataset sounds like it would be quite manageable as a normal numpy array living in core memory.
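For example, assuming the total number of rows is known (or can be counted cheaply) up front, you could preallocate the array and fill it row by row instead of appending; the file name here is a placeholder:

```python
import numpy as np

n_rows, n_cols = 70 * 10**6, 10                    # from your description
arr = np.empty((n_rows, n_cols), dtype=np.int64)   # ~5.6 GB of RAM for 64-bit ints

i = 0
with open("data_file.txt") as f:                   # placeholder; loop over all your files
    for line in f:
        arr[i] = [int(x) for x in line.split()]
        i += 1
```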

If the dataset is too large to fit in memory, a simple solution might be to use a memory-mapped array (numpy.memmap). In most respects, an np.memmap array behaves like a normal numpy array, but instead of storing the whole dataset in system memory, it will be dynamically read from/written to a file on disk as required.
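A minimal sketch of how that might look (the file name, dtype and shape are placeholders):

```python
import numpy as np

# Create a disk-backed array; it can be indexed and sliced like an ordinary ndarray.
mm = np.memmap("data.dat", dtype=np.int64, mode="w+", shape=(70 * 10**6, 10))
mm[0] = range(10)   # writes go to the backing file rather than being held in RAM
mm.flush()          # push any pending changes out to disk

# Reopen later without reading the whole file into memory:
mm_ro = np.memmap("data.dat", dtype=np.int64, mode="r", shape=(70 * 10**6, 10))
```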

Another option would be to store your data in an HDF5 file, for example using PyTables or H5py. HDF5 allows the data to be compressed on disk, and PyTables includes some very fast methods to perform mathematical operations on large disk-based arrays.
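As a rough h5py sketch (the file/dataset names, shape and compression settings are placeholders; PyTables exposes similar functionality through its own API):

```python
import numpy as np
import h5py

# Write a chunked, gzip-compressed dataset on disk.
with h5py.File("data.h5", "w") as f:
    dset = f.create_dataset("values", shape=(70 * 10**6, 10), dtype="int64",
                            chunks=True, compression="gzip")
    dset[:1000] = np.arange(10000).reshape(1000, 10)   # write one block at a time

# Later, read back only the slice you need:
with h5py.File("data.h5", "r") as f:
    block = f["values"][:1000]
```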

ali_m