Element wise mean of numpy arrays of different sizes

Question

I'm reading a CSV file and focusing on col3, whose rows hold lists of different lengths. The values were initially read as type str, which I fixed using pd.eval.

df = pd.read_csv('datafile.csv', converters={'col3': pd.eval})


row e.g. [0, 100, -200, 300, -150...]

There are many rows of different sizes and I want to calculate the element-wise average, for which I followed this solution. I first ran into the NumPy VisibleDeprecationWarning, which I fixed using this. But at the last step of the solution, using np.nanmean, I'm running into a new error:

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

My code looks like this so far:

import pandas as pd
import numpy as np
import itertools 

df = pd.read_csv('datafile.csv', converters={'col3': pd.eval})

datafile = df[(df['col1'] == 'Red') & (df['col2'] == Name) & ((df['col4'] == 'EX') | (df['col5'] == 'EX'))]
   
np.warnings.filterwarnings('ignore', category=np.VisibleDeprecationWarning) 
ar = np.array(list(itertools.zip_longest(df['col3'], fillvalue=np.nan)))
print(ar)
np.nanmean(ar,axis=1)

the arrays print like this

And the error is pointing towards the last line

As far as I can see, the error points towards the arrays being of type object, but I'm not sure how to fix it.

The warning that you choose to ignore is telling you that you have a 'ragged array', that will be `object` dtype. It is not a normal multidimensional array; Check the shape; it is probably 1d. `np.nanmean` works on a float array, replacing the `nan` with 0s. It can't operate on your array. — hpaulj, Jan 22 '23 at 19:36
Despite your use of `zip_longest`, it looks like your element arrays differ in length. Try `[a.shape for a in ar]` to see if that's true. Ignoring the warning does not force it to make a numeric dtype array. The warning tells you to explicitly specify `dtype=object`. — hpaulj, Jan 22 '23 at 19:38
I checked the lengths using `len(a) for a in ar`, since `shape` doesn't work on a tuple, and they were all 1 — ursula, Jan 22 '23 at 19:46
How would I create a float array? Do I have to change the way I read my csv file or is it something I add after — ursula, Jan 22 '23 at 19:47

hpaulj · Accepted Answer · 2023-01-22T19:50:44.367

Make a ragged array:

In [23]: arr = np.array([np.arange(5), np.ones(5),np.zeros(3)],object)
In [24]: arr
Out[24]: 
array([array([0, 1, 2, 3, 4]), array([1., 1., 1., 1., 1.]),
       array([0., 0., 0.])], dtype=object)

Note the shape and dtype.
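
As a minimal sketch of what to look for, the same example array can be inspected directly (the variable name `lengths` is only illustrative):

```python
import numpy as np

# Same ragged example as above: three 1-D arrays of lengths 5, 5 and 3.
arr = np.array([np.arange(5), np.ones(5), np.zeros(3)], dtype=object)

# A ragged array is 1-D: one slot per inner array, not a 2-D grid.
print(arr.shape)   # (3,)
print(arr.dtype)   # object

# Each slot still holds an ordinary numeric array with its own length.
lengths = [a.shape for a in arr]
print(lengths)     # [(5,), (5,), (3,)]
```

A shape of `(3,)` rather than `(3, 5)` is the sign that NumPy could not build a regular numeric array from the rows.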

Try to use mean on it:

In [25]: np.mean(arr)
Traceback (most recent call last):
  Input In [25] in <cell line: 1>
    np.mean(arr)
  File <__array_function__ internals>:180 in mean
  File /usr/local/lib/python3.10/dist-packages/numpy/core/fromnumeric.py:3432 in mean
    return _methods._mean(a, axis=axis, dtype=dtype,
  File /usr/local/lib/python3.10/dist-packages/numpy/core/_methods.py:180 in _mean
    ret = umr_sum(arr, axis, dtype, out, keepdims, where=where)
ValueError: operands could not be broadcast together with shapes (5,) (3,)

Applying mean to each element array works:

In [26]: [np.mean(a) for a in arr]
Out[26]: [2.0, 1.0, 0.0]

Trying to use zip_longest:

In [27]: import itertools
In [28]: list(itertools.zip_longest(arr))
Out[28]: 
[(array([0, 1, 2, 3, 4]),),
 (array([1., 1., 1., 1., 1.]),),
 (array([0., 0., 0.]),)]

No change. We can use it by unpacking the arr - but it has padded the arrays in the wrong way:

In [29]: list(itertools.zip_longest(*arr))
Out[29]: [(0, 1.0, 0.0), (1, 1.0, 0.0), (2, 1.0, 0.0), (3, 1.0, None), (4, 1.0, None)]

zip_longest can be used to pad lists, but it takes more thought than this.

If we make an array from that list:

In [35]: np.array(list(itertools.zip_longest(*arr,fillvalue=np.nan)))
Out[35]: 
array([[ 0.,  1.,  0.],
       [ 1.,  1.,  0.],
       [ 2.,  1.,  0.],
       [ 3.,  1., nan],
       [ 4.,  1., nan]])

and transpose it, we can take the nanmean:

In [39]: np.array(list(itertools.zip_longest(*arr,fillvalue=np.nan))).T
Out[39]: 
array([[ 0.,  1.,  2.,  3.,  4.],
       [ 1.,  1.,  1.,  1.,  1.],
       [ 0.,  0.,  0., nan, nan]])
In [40]: np.nanmean(_, axis=1)
Out[40]: array([2., 1., 0.])

Thanks for the help and the very thorough explanation. I was confused because the values didn't match what I had in Excel, but that was because I transposed it. If I skip the transposition I get what I want, since I want the average of the first elements of all arrays, then of the second elements, and so forth — ursula, Jan 22 '23 at 20:53
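
The position-wise variant described in the comment above could be sketched roughly as follows. The helper name `positionwise_nanmean` and the sample `rows` are assumptions made for illustration; they are not part of the original answer.

```python
import itertools
import numpy as np

def positionwise_nanmean(rows):
    """Mean of the i-th elements across all rows, skipping rows that are too short.

    Hypothetical helper for illustration only.
    """
    # zip_longest(*rows) groups the i-th element of every row together and
    # fills the gaps left by shorter rows with NaN.
    padded = np.array(
        list(itertools.zip_longest(*rows, fillvalue=np.nan)), dtype=float
    )
    # padded has shape (longest_row_length, number_of_rows), so axis=1
    # averages across rows for each position; no transpose is needed.
    return np.nanmean(padded, axis=1)

# Illustrative data matching the answer's example arrays.
rows = [[0, 1, 2, 3, 4], [1, 1, 1, 1, 1], [0, 0, 0]]
result = positionwise_nanmean(rows)
print(result)
```

With the question's data, something like `positionwise_nanmean(datafile['col3'])` would presumably be the equivalent call, assuming every entry of that column is a flat list of numbers.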
