
I have a huge file where the first row is a string and the other rows contain integers. The number of columns is variable and depends on the row. I have one global list where I save my intermediate results: arr_of_arr is a list of lists of floats. The length of arr_of_arr is about 5000. Each element of arr_of_arr (again a list) has a length from 100,000 to 10,000,000. Because the maximum length of the elements varies, I cannot extend each element in advance when I create arr_of_arr.

After I have processed the whole file, I artificially pad the elements with zeros, compute the mean over the elements of the global list element-wise, and plot it. max_size_arr is the length of the longest element (I compute it while iterating over the lines of the file).

arr = [x+[0]*(max_size_arr - len(x)) for x in arr_of_arr]

I need to calculate the means across arrays element-wise. For example, [[1,2,3],[4,5,6],[0,2,10]] would result in [5/3, 9/3, 19/3] (mean of the first elements across arrays, mean of the second elements across arrays, etc.).

arr = np.mean(np.array(arr),axis=0)

However, this results in a huge memory consumption (around 100GB according to the cluster information). What would be a good data structure to reduce memory consumption? Would numpy arrays be lighter than normal Python lists?

Alina
  • you need to change the way you are accessing the elements: store the first element of each array in its own array, e.g. [[1,4,0],[2,5,2],...]. It sounds to me like your memory is thrashing because you access the elements out of order. – scrineym Dec 14 '15 at 13:56
  • Another solution would be the use of yielding ([see the post about yielding](http://stackoverflow.com/questions/231767/what-does-the-yield-keyword-do-in-python)); a minimal streaming sketch is shown after this comment thread. – Damian Dec 14 '15 at 14:26
  • `arr_of_arr` should be a list. Making it an object type array just complicates matters. – hpaulj Dec 14 '15 at 16:04
  • What do you mean by `normal arrays`? Python has lists. – hpaulj Dec 14 '15 at 16:08
  • @scrineym I do not access the elements out of order. I can use only 100GB in cluster and my code with the array_of_array exceeds that. – Alina Dec 14 '15 at 17:04
  • @hpaulj it is a list, sorry – Alina Dec 14 '15 at 17:05
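
Building on the yielding suggestion, here is a minimal streaming sketch: read the file line by line, keep only per-column running sums and counts, and never hold arr_of_arr in memory at all. The file path, the whitespace-separated format, and the function names are assumptions for illustration.

def rows(path):
    with open(path) as f:
        next(f)  # skip the header row (a string)
        for line in f:
            yield [float(v) for v in line.split()]

def column_means(path, pad_with_zeros=True):
    sums, counts, n_rows = [], [], 0
    for row in rows(path):
        n_rows += 1
        if len(row) > len(sums):
            # grow the running totals when a longer row appears
            extra = len(row) - len(sums)
            sums.extend([0.0] * extra)
            counts.extend([0] * extra)
        for i, v in enumerate(row):
            sums[i] += v
            counts[i] += 1
    # dividing by n_rows reproduces the zero-padding from the question;
    # dividing by counts[i] would ignore the missing values instead
    if pad_with_zeros:
        return [s / n_rows for s in sums]
    return [s / c for s, c in zip(sums, counts)]

Peak memory is then one row plus two lists of length max_size_arr, instead of the full set of padded rows.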

3 Answers


I think that the huge memory consumption is because you want to have the whole array in memory at once.

Why don't you use slices combined with numpy arrays? That way you can simulate batch processing of your data: give a function the batch size (1000 or 10000 arrays), calculate the means, and write the results into a dict or a file, recording each slice and its respective means.
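
A rough sketch of that idea, assuming arr_of_arr and max_size_arr already exist as in the question; batch_size is an illustrative parameter. Only one batch of padded rows is in memory at a time.

import numpy as np

def batched_column_means(arr_of_arr, max_size_arr, batch_size=1000):
    total = np.zeros(max_size_arr)
    n_rows = len(arr_of_arr)
    for start in range(0, n_rows, batch_size):
        batch = arr_of_arr[start:start + batch_size]
        # pad only this batch to the global maximum length
        padded = np.array([x + [0] * (max_size_arr - len(x)) for x in batch])
        total += padded.sum(axis=0)  # accumulate per-column sums batch by batch
    return total / n_rows  # same result as padding everything at once

Peak usage drops from roughly len(arr_of_arr) * max_size_arr floats to batch_size * max_size_arr.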

pazitos10
  • I need to calculate the mean column-wise, so I first read all the rows (some rows I need to extend with additional columns because rows can have different numbers of columns) and then calculate the mean. Calculating the mean of the first 1000 rows and then of the next 1000 rows is not appropriate for me. – Alina Dec 14 '15 at 17:02
  • But you can slice over [columns](http://stackoverflow.com/questions/8386675/extracting-specific-columns-in-numpy-array) too. – pazitos10 Dec 14 '15 at 22:38

Have you tried using the Numba package? It reduces computation time and memory usage with standard numpy arrays. http://numba.pydata.org
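
For instance, a minimal sketch combining Numba's @njit with a running per-column sum; the function and variable names are illustrative assumptions, and each row is fed in as a 1D numpy array.

import numpy as np
from numba import njit

@njit
def add_row(sums, row):
    # accumulate one (possibly shorter) row into the running per-column sums
    for i in range(row.shape[0]):
        sums[i] += row[i]

# usage sketch:
# sums = np.zeros(max_size_arr)
# for row in rows:            # each row as a 1D np.ndarray
#     add_row(sums, row)
# means = sums / n_rows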

Kunal Mathur

If the lines vary widely in the number of values, I'd stick with a list of lists, as long as that is practical. numpy arrays are best when the 'row' lengths are all the same.

To illustrate with a small example:

In [453]: list_of_lists=[[1,2,3],[4,5,6,7,8,9],[0],[1,2]]

In [454]: list_of_lists
Out[454]: [[1, 2, 3], [4, 5, 6, 7, 8, 9], [0], [1, 2]]

In [455]: [len(x) for x in list_of_lists]
Out[455]: [3, 6, 1, 2]

In [456]: [sum(x) for x in list_of_lists]
Out[456]: [6, 39, 0, 3]

In [458]: [sum(x)/float(len(x)) for x in list_of_lists]
Out[458]: [2.0, 6.5, 0.0, 1.5]

With your array approach to taking the mean, I get different numbers - because of all the padding 0s. Is that intentional?

In [460]: max_len=6

In [461]: arr=[x+[0]*(max_len-len(x)) for x in list_of_lists]

In [462]: arr
Out[462]: 
[[1, 2, 3, 0, 0, 0],
 [4, 5, 6, 7, 8, 9],
 [0, 0, 0, 0, 0, 0],
 [1, 2, 0, 0, 0, 0]]

mean along columns?

In [463]: np.mean(np.array(arr),axis=0)
Out[463]: array([ 1.5 ,  2.25,  2.25,  1.75,  2.  ,  2.25])

mean along rows

In [476]: np.mean(np.array(arr),axis=1)
Out[476]: array([ 1. ,  6.5,  0. ,  0.5])

list mean with max length:

In [477]: [sum(x)/float(max_len) for x in list_of_lists]
Out[477]: [1.0, 6.5, 0.0, 0.5]
hpaulj