Nested lists in cells of pandas dataframe, how to extract?

Question

I'm having an issue where I have nested lists in multiple columns of my dataframe. Image below for reference -

df_final['Product Name'] = ('dr. jart+ mask heroes face savers', 
                       'moon juice beauty shroom exfoliating acid 
                        potion',
                       'laneige kiss and make up set')

df_final['Subreddit'] = (None, ['scacjdiscussion'], ['AsianBeauty', 
                   'AsianBeautyAdvice','SkincareAddiction', 
                   'abdiscussion'])

df_final['Times Mentioned'] = (None, [1], [4,1,1,1])

I want the cells in the 'Subreddit' column to become the new columns, and the cells in the 'Times Mentioned' column to fill in the new cells. The 'Product Name' column will be the new index.

I tried using pivot_table -

from pandas import pivot_table

table = pivot_table(df_final, values='Times Mentioned', index=['Product 
        Name'], columns=['Subreddit'], aggfunc='size')

This successfully turned all the nested lists of the 'Subreddit' column into new columns, but the 'Times Mentioned' just repeats the first number for every column (example below)

wrong cell fillers

This should be 4, 1, 1, 1 like in the original image. Does anyone know how to fix this?

Thank you in advance!

Please post code and example data as text instead of images, so we can copy-paste and experiment — Peter Leimbigler, Oct 22 '18 at 00:54
@PeterLeimbigler, code has been added! let me know if anything else is needed. — daisy_cheeks, Oct 22 '18 at 01:19

edesz · Answer 1 · 2018-10-22T04:01:53.663

There are some cells in the DF that contain a list

['AsianBeauty', 'AsianBeautyAdvice','SkincareAddiction', 'abdiscussion']

and this is a single cell which needs to be exploded into separate rows of the same column (Product Name). However, this has to be done while preserving the association between the Product Name column and each of the other 2 columns (which contain rows that must be exploded). I used this SO post to do exactly this, while keeping the association intact. Here is the approach I used, with comments in the code and top-level explanation shown separately

Raw DF from OP

import pandas as pd


df_final = pd.DataFrame()
df_final['Product Name'] = ('dr. jart+ mask heroes face savers', 
                           'moon juice beauty shroom exfoliating acid potion',
                           'laneige kiss and make up set')

df_final['Subreddit'] = (None, ['scacjdiscussion'], ['AsianBeauty', 
                       'AsianBeautyAdvice','SkincareAddiction', 
                       'abdiscussion'])

df_final['Times Mentioned'] = (None, [1], [4,1,1,1])
print(df_final)

Raw data (df_final)

                                       Product Name                                                          Subreddit Times Mentioned
0                 dr. jart+ mask heroes face savers                                                               None            None
1  moon juice beauty shroom exfoliating acid potion                                                  [scacjdiscussion]             [1]
2                      laneige kiss and make up set  [AsianBeauty, AsianBeautyAdvice, SkincareAddiction, abdiscussion]    [4, 1, 1, 1]

Raw data column dtypes

print(df_final.dtypes)
Product Name       object
Subreddit          object
Times Mentioned    object
dtype: object

Code to explode rows and create final DF

exploded_dfs = []
for _, row in df_final.iterrows():
    if all(row): # if a row does contain all non-None values
        # Put 1st pair of columns into single DF, exploding single
        # cell into multiple rows as needed
        df1 = pd.concat([pd.Series(row['Product Name'], row['Subreddit'][:])])\
                          .reset_index()
        # Assign column names
        df1.columns = ['Subreddit', 'Product Name']
        # Put 2nd pair of columns into single DF, exploding single
        # cell into multiple rows as needed
        df2 = pd.concat([pd.Series(row['Product Name'], row['Times Mentioned'][:])])\
                           .reset_index()
        # Assign column names
        df2.columns = ['Times Mentioned', 'Product Name']
        # Perform INNER JOIN on DFs with exploded row contents
        # & drop duplicated column
        merged = pd.concat([df1, df2], axis=1)
        merged = merged.loc[:,~merged.columns.duplicated()]
        # Swap 1st and 2nd columns
        cols = list(merged)
        cols.insert(0, cols.pop(cols.index('Product Name')))
        merged = merged.loc[:, cols]
    else: # if a row does not contain all non-None values
        # Create single row DF with no changes
        merged = pd.DataFrame(columns=['Product Name', 'Subreddit',
                                      'Times Mentioned'])
        # Append row to DF
        merged.loc[0] = row
    exploded_dfs.append(merged)

# Vertically concatenate DFs in list
print(pd.concat(exploded_dfs, axis=0).reset_index(drop=True))

Here is the output

                                       Product Name          Subreddit Times Mentioned
0                 dr. jart+ mask heroes face savers               None            None
1  moon juice beauty shroom exfoliating acid potion    scacjdiscussion               1
2                      laneige kiss and make up set        AsianBeauty               4
3                      laneige kiss and make up set  AsianBeautyAdvice               1
4                      laneige kiss and make up set  SkincareAddiction               1
5                      laneige kiss and make up set       abdiscussion               1

Brief explanation of steps

iterate over all rows
- note that if the row contains any None values, it will just be taken as is, so it is assumed that this row does not need cleaning: this row will just be appended to a single row DF
for first row in raw DF that does not contain all None
- explode cell from first column with list (Subreddit) into column, if necessary (explained in this question)
- horizontally concatenate exploded cell (now as multiple rows) with row from column without list (Product Name); this gives cleaned DF df1
- repeat last 2 steps above using second column with list (Times Mentioned); this gives cleaned DF df1
- horizontally concatenate 2 cleaned DFs into new DF named merged
- repeat above procedure for all rows in raw DF and append cleaned DF to blank list
- assemble final DF with vertical concatenation of all DFs in list

Package info

pandas==0.23.4

Python version

Python 2.7.15rc1

Thank you so much for this detailed answer! I keep getting the error "TypeError: Index(...) must be called with a collection of some kind, 1 was passed" on the "df1 = pd.concat" line. It should be noted that I do have an additional column in my dataframe (boolean, just 0s and 1s), but I don't see how adding that in would cause this error? — daisy_cheeks, Oct 22 '18 at 03:22
That is strange. I can't seem to duplicate it - I added this line `df_final['test'] = 1` to your code, but it does not affect my output....I still get the code to work. What version of Pandas are you using? I am using `pandas==0.23.4 ` (I'll add this to my answer). I agree that the extra column should not make a difference since it is not being used. Try to delete that extra column and re-run the code. — edesz, Oct 22 '18 at 03:40
Ah, that's probably it. I'm using Python 3.6.5. I'll do some digging to see what this version doesn't like about 2.7.15. Thank you again! — daisy_cheeks, Oct 22 '18 at 03:56
I got it working on `Python 3.6.6 :: Anaconda custom (64-bit)`, also with `pandas==0.23.4`, still without the `TypeError`. I also added the raw DF I am using to my answer, which shows all the columns and their datatypes (`object`), to compare to your dataframe....in case that helps. — edesz, Oct 22 '18 at 04:06

Nested lists in cells of pandas dataframe, how to extract?

1 Answers1