Element-wise mean of a list containing lists of different lengths

Question

My code builds a list X by appending multiple lists of different lengths. For instance, the final value of X after a run can look like this:

X = [[0.6904056370258331, 0.6844439387321473, 0.668782365322113], 
     [0.7253621816635132, 0.6941058218479157, 0.6929935097694397, 0.6919471859931946, 0.6905447959899902]]

As you can see, X[0] has length 3 while X[1] has length 5. I want to take an element-wise (column-wise) mean of X to produce a single 1D array of means. If I try np.mean(X, axis=0), it raises an error because X[0] and X[1] have different lengths. Is there a way to get what I am looking for, i.e., a single 1D mean of X?

Thank you,

The problem is that Numpy does not work with non-rectangular lists. It simply here sees a 1d array with objects as elements. — Willem Van Onsem, Jul 08 '19 at 19:22
Does https://stackoverflow.com/questions/44301429/how-to-use-numpy-to-calculate-mean-and-standard-deviation-of-an-irregular-shaped help ? — Devesh Kumar Singh, Jul 08 '19 at 19:23
@DeveshKumarSingh Thank you for your help. I looked at it before posting this question, but it doesn't help. — Katherine, Jul 08 '19 at 19:25
@WillemVanOnsem Do you think it makes sense to pad the shorter list with zeros? I am not sure, to be honest. — Katherine, Jul 08 '19 at 19:26
@Katherine: no, since that would mean you will alter the average. — Willem Van Onsem, Jul 08 '19 at 19:26
Then how about `from statistics import mean; print([mean(lst) for lst in X])`, which gives you `[0.6812106470266978, 0.6989906990528106]` — Devesh Kumar Singh, Jul 08 '19 at 19:26
Do you want one number, the mean of all values, or a mean for each sublist? — hpaulj, Jul 08 '19 at 19:27
@DeveshKumarSingh This takes means of each sublist. I want the mean of each column in the whole X but the first row is shorter so I thought of padding zero or NaN — Katherine, Jul 08 '19 at 19:31
For this case a list comprehension is probably best. Alternatively you could use pandas: `pd.DataFrame(X).mean(axis=0)`. It will pad with `nan` for you and ignore them in the mean calculation — Brenlla, Jul 08 '19 at 19:31
If you are open to padding with NaNs or zeros, you can use - https://stackoverflow.com/questions/40569220/ and then use `sum` or `mean` along relevant axis. — Divakar, Jul 08 '19 at 19:33
I'm tempted to say there's no such thing as columns in your list. But you could transform it into a different list of lists. `itertools.zip_longest` is perhaps the handiest tool for doing that. — hpaulj, Jul 08 '19 at 19:34

hpaulj · Answer 1 · 2019-07-09T00:46:06.473

To do 'column' calculations we need to change this into a list of the columns.

In [475]: X = [[0.6904056370258331, 0.6844439387321473, 0.668782365322113],  
     ...:      [0.7253621816635132, 0.6941058218479157, 0.6929935097694397, 0.6919471859931946, 0.6905447959899902]]

zip_longest is a handy tool for 'transposing' irregular lists:

In [476]: import itertools
     ...: import numpy as np
In [477]: T = list(itertools.zip_longest(*X, fillvalue=np.nan))                                              
In [478]: T                                                                                                  
Out[478]: 
[(0.6904056370258331, 0.7253621816635132),
 (0.6844439387321473, 0.6941058218479157),
 (0.668782365322113, 0.6929935097694397),
 (nan, 0.6919471859931946),
 (nan, 0.6905447959899902)]

I chose np.nan as the fill because I can then use np.nanmean to take the mean, while ignoring the nan.

In [479]: [np.nanmean(i) for i in T]                                                                         
Out[479]: 
[0.7078839093446732,
 0.6892748802900315,
 0.6808879375457764,
 0.6919471859931946,
 0.6905447959899902]

For something like np.sum I could fill with 0s, but a mean is the sum divided by the count, so the padded zeros would be counted as values and pull the mean down.
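The sum-versus-count point can be made concrete with a small sketch. It is only an illustration: the names col_sums, col_counts and zero_padded_means are made up for this example, and it assumes the same X as above. Filling with 0 is safe for the sums, but the counts have to be tracked separately.

```python
import itertools
import numpy as np

X = [[0.6904056370258331, 0.6844439387321473, 0.668782365322113],
     [0.7253621816635132, 0.6941058218479157, 0.6929935097694397,
      0.6919471859931946, 0.6905447959899902]]

# Column sums: padding with 0 is harmless because 0 adds nothing to a sum.
col_sums = [sum(col) for col in itertools.zip_longest(*X, fillvalue=0)]

# Column counts: how many rows actually have a value in each column.
max_len = max(len(row) for row in X)
col_counts = [sum(len(row) > j for row in X) for j in range(max_len)]

# Correct column means: divide each sum by the number of real values.
col_means = [s / n for s, n in zip(col_sums, col_counts)]

# For comparison: a plain mean over zero-padded columns divides by the
# number of rows, so columns that only one row reaches come out too small.
zero_padded_means = [np.mean(col)
                     for col in itertools.zip_longest(*X, fillvalue=0)]

print(col_counts)
print(col_means)
print(zero_padded_means)
```

Here the last two entries of zero_padded_means are half of the corresponding entries of col_means, because only one of the two rows has values in those columns.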

Or without nanmean, fill with something that's easy to filter out:

In [480]: T = list(itertools.zip_longest(*X, fillvalue=None)) 
In [483]: [np.mean([i for i in row if i is not None]) for row in T]                                          
Out[483]: 
[0.7078839093446732,
 0.6892748802900315,
 0.6808879375457764,
 0.6919471859931946,
 0.6905447959899902]

zip_longest isn't the only option, but it's reasonably fast, and easy to remember and use.
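As one example of another option, here is a sketch that pads with a boolean mask in NumPy instead of zip_longest. This is not part of the session above; the variable names (lengths, mask, padded) are chosen for illustration, and it assumes X holds plain lists of floats as in the question.

```python
import numpy as np

X = [[0.6904056370258331, 0.6844439387321473, 0.668782365322113],
     [0.7253621816635132, 0.6941058218479157, 0.6929935097694397,
      0.6919471859931946, 0.6905447959899902]]

# Length of each row, as an array so it can be broadcast.
lengths = np.array([len(row) for row in X])

# mask[i, j] is True where row i actually has a value in column j.
mask = np.arange(lengths.max()) < lengths[:, None]

# Start from an all-NaN rectangle and copy the real values into the
# True positions; both sides are in row-major order, so they line up.
padded = np.full(mask.shape, np.nan)
padded[mask] = np.concatenate(X)

# Column means that ignore the NaN padding.
col_means = np.nanmean(padded, axis=0)
print(col_means)
```

This keeps everything in arrays, which may matter when X has many rows; for a handful of short lists the zip_longest version is probably easier to read.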

A Roebel · Answer 2 · 2019-07-08T20:32:35.440

How about this: first determine the maximum row length, then pad all rows to that length with NaNs, and then use nanmean with axis=0, as in the question.

import numpy as np
X = [[0.6904056370258331, 0.6844439387321473, 0.668782365322113], 
     [0.7253621816635132, 0.6941058218479157, 0.6929935097694397, 0.6919471859931946, 0.6905447959899902]]

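# Length of the longest row
max_row_len = max(len(row) for row in X)

# Pad each row with NaN up to max_row_len, then average each column, ignoring the NaNs
# (np.nan is used because the np.NaN alias was removed in NumPy 2.0)
cm = np.nanmean([list(row) + [np.nan] * (max_row_len - len(row)) for row in X], axis=0)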

print(cm)

will display

[0.70788391 0.68927488 0.68088794 0.69194719 0.6905448 ]
