
Suppose I have a very big np.array with N elements and I want to select only the values that pass S selections. The usual way is:

selected_items = original_array[selection1(original_array) & (original_array > 3)]

This is fine, but a lot of temporary memory is used. If I am correct, I need S boolean masks of size N, plus at least one more for the & result. Is there a better solution in terms of memory usage? For example, an explicit loop doesn't need this:

selected_items = []
tests = (selection1, lambda x: x > 3)
for x in original_items:
    if all(t(x) for t in tests):
        selected_items.append(x)

I like numpy, but its design is really memory-hungry, so it seems unsuitable for processing big data. On the other hand, an explicit loop in Python is not very performant.

Is there a solution with numpy?

Are there other Python-based frameworks for big data analysis?

Ruggero Turra
  • IMHO, if you have space for `N`, it doesn't seem a big problem to require space for `3*N`. We are not talking about something scaling in memory like `O(N^2)`. And [This link](http://stackoverflow.com/questions/367565/how-do-i-build-a-numpy-array-from-a-generator) may be of interest – gg349 Jan 12 '14 at 23:21
  • This is a simple example. In my real life, unfortunately, numpy (like Matlab, Mathematica, ...) doesn't seem to be the right tool, since they store all the values in memory. I have many entries, and every entry has hundreds of fields. – Ruggero Turra Jan 12 '14 at 23:32
  • @RuggeroTurra Numpy can work with arrays on disk through [`np.memmap`](http://docs.scipy.org/doc/numpy/reference/generated/numpy.memmap.html) or you could also check out [PyTables](http://www.pytables.org/moin). – jorgeca Jan 13 '14 at 01:57
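A minimal sketch of the `np.memmap` suggestion from the comment above (the file path, dtype, and sizes here are made up for illustration; the comparisons still build transient boolean masks, but the data itself stays on disk and is paged in on demand):

```python
import os
import tempfile

import numpy as np

# Create a small on-disk array (in practice the file would already exist).
path = os.path.join(tempfile.mkdtemp(), "data.dat")
mm = np.memmap(path, dtype=np.int64, mode="w+", shape=(1000,))
mm[:] = np.arange(1000)
mm.flush()

# Reopen read-only: pages are loaded lazily as they are touched.
data = np.memmap(path, dtype=np.int64, mode="r", shape=(1000,))
selected = data[(data > 3) & (data < 10)]
```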

4 Answers


Instead of looping over the items, you could build the selection "mask" in-place before using it to select the subset of data from the array. For example:

import numpy as np


x = np.arange(1, 100)

# x less than 75
selection = x < 75

# and greater than 35
selection &= x > 35

# and odd (cast to bool so the in-place &= doesn't fail on an int array).
selection &= (x & 1).astype(bool)

print(x[selection])

# [37 39 41 43 45 47 49 51 53 55 57 59 61 63 65 67 69 71 73]

It's not a perfect solution, but it might help.

Warren Weckesser

Bools are stored as one byte each; unless you are cramming your whole memory full of uint8s, it's unlikely to be that big of an issue, relatively speaking, especially if you make good use of in-place operators. But if your data barely fits into memory, it may be worth investigating on-disk storage that can efficiently perform queries of this kind. PyTables springs to mind, especially with regard to your more general question about Python frameworks for big data.
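One way to keep the temporaries bounded without going to disk (a sketch, not from this answer) is to process the array in fixed-size chunks, so that every boolean mask is only ever `chunk_size` long:

```python
import numpy as np

def select_chunked(arr, chunk_size, *tests):
    """Select the elements of arr that pass every test, one chunk at a time.

    Temporary boolean masks are at most chunk_size elements long.
    """
    picked = []
    for start in range(0, len(arr), chunk_size):
        chunk = arr[start:start + chunk_size]
        mask = tests[0](chunk)
        for test in tests[1:]:
            mask &= test(chunk)
        picked.append(chunk[mask])
    return np.concatenate(picked)

x = np.arange(100)
result = select_chunked(x, 16, lambda a: a > 35, lambda a: a < 75)
```

This trades a little speed (many small kernel launches instead of one big one) for a fixed memory footprint.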

Eelco Hoogendoorn

Another Python-based framework, aimed mainly at mathematics, is SAGE. It has lots of algorithms built in, including sorting and searching ones. I have recently been using it for RSA modelling, but maybe you should give it a try for your problem.

Jimx

A boolean selection mask takes one byte per value in RAM. If your data fits in RAM, chances are that a boolean mask will fit too.

You can accumulate the selection in a single boolean selection mask using in-place operations. In this way you can apply an arbitrary number of logical operations with a fixed RAM requirement of one selection mask.

To perform in-place boolean operations you can use the NumPy logic functions, which provide an `out` parameter. For example:

# mask = selection1(original_array) AND (original_array > 3)
mask = selection1(original_array)
np.logical_and(mask, original_array > 3, out=mask)

You can also perform in-place operations using the infix operators `&=` and `|=` (AND and OR respectively).

If you don't have enough RAM to keep even a single boolean mask around, and the number of selected elements is low, you can select elements by index instead. For example, numpy.nonzero() returns the index of every non-zero element.
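A minimal sketch of the index-based approach (the comparison still builds a transient mask, but only the small index array is kept afterwards):

```python
import numpy as np

x = np.arange(10)
idx = np.nonzero(x > 6)[0]  # indices of the elements that pass the test
selected = x[idx]           # fancy indexing with the stored indices
# idx is array([7, 8, 9]); selected is array([7, 8, 9])
```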

Finally, if your data does not fit in RAM, you can use PyTables, which lets you save and load data in slices. PyTables not only provides very fast I/O but can also run very fast (complex) queries on the on-disk dataset with a single command (see the Expr module in the PyTables documentation). However, PyTables can be a bit intimidating at first, so I don't suggest using it if you are of faint heart (and you don't absolutely need it).

user2304916