Python Pandas Create unique dataframe out of many lists

Question

Hi I want to create a dataframe that stores a unique variable and its average in every column. Currently I have a dataframe that has 2 columns. One has a list of names while the other has a single value. I want to associate that value with all of the names in the list and eventually find the average value for all the names This is the data I have:

Df1:
names_col                    cost_col
[milk, eggs, cookies]          3
[water, milk, yogurt]          5 
[cookies, diaper, yogurt]      7

This is what I want:

Df2:
names_col             avg_cost_col
milk                       4
eggs                       3
cookies                    5
water                      5
yogurt                     6
diaper                     7

I thought about doing an apply over all the rows somehow or using set() to remove duplicates out of every list but I am not sure. Any help would be appreicated

score 1 · Answer 1 · answered Aug 03 '18 at 02:30

1

IIUC flatten your list (unnest)

pd.DataFrame(data=df.cost_col.repeat(df.names_col.str.len()).values,index=np.concatenate(df.names_col.values)).mean(level=0)
Out[221]: 
         0
milk     4
eggs     3
cookies  5
water    5
yogurt   6
diaper   7

answered Aug 03 '18 at 02:30

BENY

317,841
20
164
234

Sylvain · Answer 2 · 2019-09-18T07:48:02.547

1

Set() would be the easiest answer to remove duplicates.
The notation *list_name extends/unpack your list.

l = [["milk", "eggs", "cookies"], 
    ["water", "milk", "yogurt"], 
    ["cookies", "diaper", "yogurt"]]
c = set([*l[0], *l[1], *l[2]])
{'cookies', 'eggs', 'diaper', 'yogurt', 'water', 'milk'}

And use it for your pandas columns names. You can convert it back to a list if needed with list()

edited Sep 18 '19 at 07:48

answered Aug 03 '18 at 02:38

Sylvain

679
9
13

This is not an answer to the question the OP want's to use pandas and this doesn't answer the whole question – U13-Forward Aug 03 '18 at 02:41

score 1 · Accepted Answer · answered Aug 03 '18 at 02:40

Following @MaxU's suggestion of vectorized solution in this answer

ndf = pd.DataFrame({'a':np.repeat(df.cost_col.values, df.names_col.str.len()),
                    'b':np.concatenate(df.names_col.values)})

yields

    a   b
0   3   milk
1   3   eggs
2   3   cookies
3   5   water
4   5   milk
5   5   yogurt
6   7   cookies
7   7   diaper
8   7   yogurt

Then just plain groupy+mean

ndf.groupby('b').mean()

        a
b   
cookies 5
diaper  7
eggs    3
milk    4
water   5
yogurt  6

Python Pandas Create unique dataframe out of many lists

3 Answers3