0

Hi I want to create a dataframe that stores a unique variable and its average in every column. Currently I have a dataframe that has 2 columns. One has a list of names while the other has a single value. I want to associate that value with all of the names in the list and eventually find the average value for all the names This is the data I have:

Df1:
names_col                    cost_col
[milk, eggs, cookies]          3
[water, milk, yogurt]          5 
[cookies, diaper, yogurt]      7

This is what I want:

Df2:
names_col             avg_cost_col
milk                       4
eggs                       3
cookies                    5
water                      5
yogurt                     6
diaper                     7

I thought about doing an apply over all the rows somehow or using set() to remove duplicates out of every list but I am not sure. Any help would be appreicated

Leokins
  • 89
  • 11

3 Answers3

1

IIUC flatten your list (unnest)

pd.DataFrame(data=df.cost_col.repeat(df.names_col.str.len()).values,index=np.concatenate(df.names_col.values)).mean(level=0)
Out[221]: 
         0
milk     4
eggs     3
cookies  5
water    5
yogurt   6
diaper   7
BENY
  • 317,841
  • 20
  • 164
  • 234
1

Set() would be the easiest answer to remove duplicates.
The notation *list_name extends/unpack your list.

l = [["milk", "eggs", "cookies"], 
    ["water", "milk", "yogurt"], 
    ["cookies", "diaper", "yogurt"]]
c = set([*l[0], *l[1], *l[2]])
{'cookies', 'eggs', 'diaper', 'yogurt', 'water', 'milk'}

And use it for your pandas columns names. You can convert it back to a list if needed with list()

Sylvain
  • 679
  • 9
  • 13
1

Following @MaxU's suggestion of vectorized solution in this answer

ndf = pd.DataFrame({'a':np.repeat(df.cost_col.values, df.names_col.str.len()),
                    'b':np.concatenate(df.names_col.values)})

yields

    a   b
0   3   milk
1   3   eggs
2   3   cookies
3   5   water
4   5   milk
5   5   yogurt
6   7   cookies
7   7   diaper
8   7   yogurt

Then just plain groupy+mean

ndf.groupby('b').mean()

        a
b   
cookies 5
diaper  7
eggs    3
milk    4
water   5
yogurt  6
rafaelc
  • 57,686
  • 15
  • 58
  • 82