
I'm reading a CSV file and focusing on col3, whose rows hold lists of values with different lengths. The column was initially read as type str, which I fixed by using pd.eval as a converter.

df = pd.read_csv('datafile.csv', converters={'col3': pd.eval})


An example row: [0, 100, -200, 300, -150...]

There are many rows of different sizes, and I want to calculate the element-wise average, for which I have followed this solution. I first ran into NumPy's VisibleDeprecationWarning, which I suppressed using this. But at the last step of the solution, which uses np.nanmean, I'm running into a new error:

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

My code looks like this so far:

import pandas as pd
import numpy as np
import itertools 

df = pd.read_csv('datafile.csv', converters={'col3': pd.eval})

datafile = df[(df['col1'] == 'Red') & (df['col2'] == Name) & ((df['col4'] == 'EX') | (df['col5'] == 'EX'))]
   
np.warnings.filterwarnings('ignore', category=np.VisibleDeprecationWarning) 
ar = np.array(list(itertools.zip_longest(df['col3'], fillvalue=np.nan)))
print(ar)
np.nanmean(ar,axis=1)

The arrays print like this: [image]

And the error points to the last line: [image]

From the error, I can see that it points to the arrays being of type object, but I'm not sure how to fix it.

ursula
  • The warning that you choose to ignore is telling you that you have a 'ragged array', that will be `object` dtype. It is not a normal multidimensional array; Check the shape; it is probably 1d. `np.nanmean` works on a float array, replacing the `nan` with 0s. It can't operate on your array. – hpaulj Jan 22 '23 at 19:36
  • Despite your use of `zip_longest`, it looks like your element arrays differ in length. Try `[a.shape for a in ar]` to see if that's true. Ignoring the warning does not force it to make a numeric dtype array. The warning tells you to explicitly specify `dtype=object`. – hpaulj Jan 22 '23 at 19:38
  • I checked using `[len(a) for a in ar]`, since `shape` doesn't work on a tuple, and every length was 1 – ursula Jan 22 '23 at 19:46
  • How would I create a float array? Do I have to change the way I read my csv file or is it something I add after – ursula Jan 22 '23 at 19:47

1 Answer

Make a ragged array:

In [23]: arr = np.array([np.arange(5), np.ones(5),np.zeros(3)],object)
In [24]: arr
Out[24]: 
array([array([0, 1, 2, 3, 4]), array([1., 1., 1., 1., 1.]),
       array([0., 0., 0.])], dtype=object)

Note the shape and dtype.
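As a minimal sketch of what to look for (using the same example arrays as above, not the asker's data), the shape, dtype, and per-element shapes can be inspected like this. The name `element_shapes` is only illustrative.

```python
import numpy as np

# Same ragged example as above: three 1-d arrays of lengths 5, 5 and 3.
arr = np.array([np.arange(5), np.ones(5), np.zeros(3)], dtype=object)

print(arr.shape)  # 1-d: one slot per element array, not a 2-d grid
print(arr.dtype)  # object, so numeric reductions cannot treat it as a float array

# Shapes of the individual element arrays; differing lengths mean "ragged".
element_shapes = [a.shape for a in arr]
print(element_shapes)
```

If the element shapes differ, the data has to be padded (or processed element by element) before functions like np.nanmean can work along an axis.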

Try to use mean on it:

In [25]: np.mean(arr)
Traceback (most recent call last):
  Input In [25] in <cell line: 1>
    np.mean(arr)
  File <__array_function__ internals>:180 in mean
  File /usr/local/lib/python3.10/dist-packages/numpy/core/fromnumeric.py:3432 in mean
    return _methods._mean(a, axis=axis, dtype=dtype,
  File /usr/local/lib/python3.10/dist-packages/numpy/core/_methods.py:180 in _mean
    ret = umr_sum(arr, axis, dtype, out, keepdims, where=where)
ValueError: operands could not be broadcast together with shapes (5,) (3,) 

Applying mean to each element array works:

In [26]: [np.mean(a) for a in arr]
Out[26]: [2.0, 1.0, 0.0]

Trying to use zip_longest:

In [27]: import itertools
In [28]: list(itertools.zip_longest(arr))
Out[28]: 
[(array([0, 1, 2, 3, 4]),),
 (array([1., 1., 1., 1., 1.]),),
 (array([0., 0., 0.]),)]

No change, because arr was passed as a single iterable. Unpacking arr makes zip_longest pair up the elements, but the result is grouped by position rather than by row, and padded with None:

In [29]: list(itertools.zip_longest(*arr))
Out[29]: [(0, 1.0, 0.0), (1, 1.0, 0.0), (2, 1.0, 0.0), (3, 1.0, None), (4, 1.0, None)]

zip_longest can be used to pad lists, but it takes more thought than this.

If we make an array from that list:

In [35]: np.array(list(itertools.zip_longest(*arr,fillvalue=np.nan)))
Out[35]: 
array([[ 0.,  1.,  0.],
       [ 1.,  1.,  0.],
       [ 2.,  1.,  0.],
       [ 3.,  1., nan],
       [ 4.,  1., nan]])

and transpose it, we can take the nanmean:

In [39]: np.array(list(itertools.zip_longest(*arr,fillvalue=np.nan))).T
Out[39]: 
array([[ 0.,  1.,  2.,  3.,  4.],
       [ 1.,  1.,  1.,  1.,  1.],
       [ 0.,  0.,  0., nan, nan]])
In [40]: np.nanmean(_, axis=1)
Out[40]: array([2., 1., 0.])
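
Putting the steps together, here is a rough sketch under the same example data. `pad_rows` is a hypothetical helper name, not a NumPy or pandas function, and it assumes every row is a 1-d sequence of numbers. With the padded array, `axis=1` gives one mean per row, while `axis=0` gives the mean of the first elements, the second elements, and so on across rows.

```python
import itertools
import numpy as np

def pad_rows(rows, fill=np.nan):
    """Hypothetical helper: pad ragged 1-d rows with `fill` into a 2-d float array."""
    # zip_longest(*rows) yields one tuple per position, so the result is
    # transposed back to get one padded row per input row.
    columns = list(itertools.zip_longest(*rows, fillvalue=fill))
    return np.array(columns, dtype=float).T

rows = [np.arange(5), np.ones(5), np.zeros(3)]
padded = pad_rows(rows)

row_means = np.nanmean(padded, axis=1)       # mean of each row, ignoring the padding
position_means = np.nanmean(padded, axis=0)  # mean of the i-th elements across rows
print(row_means)
print(position_means)
```

For a DataFrame column of lists, something like `pad_rows(df['col3'])` should work the same way, assuming each cell already holds a list or array of numbers.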
hpaulj
  • Thanks for the help and the very thorough explanation. I was confused because the values didn't match what I had in Excel, but that was because I transposed it. If I skip the transposition, I get what I want, since I want the average across the first element of all arrays, then the second, and so forth – ursula Jan 22 '23 at 20:53