I have a huge file whose first row is a string and whose other rows contain integers. The number of columns varies from row to row.
I have one global list, arr_of_arr, where I save my intermediate results; it is a list of lists of floats. The length of arr_of_arr
is about 5000, and each of its elements (again a list) has a length between 100,000 and 10,000,000. The maximum length of the elements varies, so I cannot preallocate each element when I create arr_of_arr.
After I have processed the whole file, I artificially pad the shorter elements with zeros, compute the element-wise mean over the global list, and plot it. max_size_arr
is the length of the longest element (I compute it while iterating over the lines of the file):
arr = [x+[0]*(max_size_arr - len(x)) for x in arr_of_arr]
I need to calculate the means across the arrays element-wise. For example, [[1,2,3],[4,5,6],[0,2,10]] would result in [5/3, 9/3, 19/3] (the mean of the first elements across arrays, the mean of the second elements across arrays, etc.):
arr = np.mean(np.array(arr),axis=0)
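To make the intended semantics concrete, here is a small self-contained version of the two steps above (padding, then np.mean along axis 0), using the toy example from the question:

```python
import numpy as np

# Toy version of the padding + mean steps from the question.
arr_of_arr = [[1, 2, 3], [4, 5, 6], [0, 2, 10]]
max_size_arr = max(len(x) for x in arr_of_arr)

# Pad every row with zeros up to the common length.
arr = [x + [0] * (max_size_arr - len(x)) for x in arr_of_arr]

# Element-wise mean across rows: 5/3, 9/3, 19/3.
means = np.mean(np.array(arr), axis=0)
```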
However, this results in huge memory consumption (around 100 GB according to the cluster monitoring). What would be a good data structure to reduce the memory consumption? Would numpy arrays be lighter than plain Python lists?
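For reference, one direction I have been considering (a sketch only, not necessarily the best option): since zero-padding means every missing position contributes 0, the mean can be computed from a single running-sum array plus a row counter, so memory stays O(longest row) instead of O(number of rows × longest row). The function name streaming_mean is hypothetical:

```python
import numpy as np

def streaming_mean(rows):
    """Element-wise mean over variable-length rows without storing them all.

    Keeps only one running-sum array and a row counter; a missing position
    simply contributes 0, which matches the zero-padding approach above.
    """
    sums = np.zeros(0, dtype=np.float64)
    n_rows = 0
    for row in rows:
        row = np.asarray(row, dtype=np.float64)
        if row.size > sums.size:
            # Grow the accumulator to the longest row seen so far.
            sums = np.concatenate([sums, np.zeros(row.size - sums.size)])
        sums[:row.size] += row
        n_rows += 1
    return sums / n_rows
```

On the toy example, streaming_mean([[1, 2, 3], [4, 5, 6], [0, 2, 10]]) should match the padded-matrix result. A float64 numpy array also costs 8 bytes per value, versus roughly 32 bytes or more per float stored in a Python list, so even without streaming, converting rows to numpy arrays early would shrink the footprint.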