There are some cells in the DF that contain a list
['AsianBeauty', 'AsianBeautyAdvice','SkincareAddiction', 'abdiscussion']
and this is a single cell which needs to be exploded into separate rows of the same column (Product Name
). However, this has to be done while preserving the association between the Product Name
column and each of the other 2 columns (which contain rows that must be exploded). I used this SO post to do exactly this, while keeping the association intact. Here is the approach I used, with comments in the code and top-level explanation shown separately
Raw DF from OP
import pandas as pd
df_final = pd.DataFrame()
df_final['Product Name'] = ('dr. jart+ mask heroes face savers',
'moon juice beauty shroom exfoliating acid potion',
'laneige kiss and make up set')
df_final['Subreddit'] = (None, ['scacjdiscussion'], ['AsianBeauty',
'AsianBeautyAdvice','SkincareAddiction',
'abdiscussion'])
df_final['Times Mentioned'] = (None, [1], [4,1,1,1])
print(df_final)
Raw data (df_final
)
Product Name Subreddit Times Mentioned
0 dr. jart+ mask heroes face savers None None
1 moon juice beauty shroom exfoliating acid potion [scacjdiscussion] [1]
2 laneige kiss and make up set [AsianBeauty, AsianBeautyAdvice, SkincareAddiction, abdiscussion] [4, 1, 1, 1]
Raw data column dtypes
print(df_final.dtypes)
Product Name object
Subreddit object
Times Mentioned object
dtype: object
Code to explode rows and create final DF
exploded_dfs = []
for _, row in df_final.iterrows():
if all(row): # if a row does contain all non-None values
# Put 1st pair of columns into single DF, exploding single
# cell into multiple rows as needed
df1 = pd.concat([pd.Series(row['Product Name'], row['Subreddit'][:])])\
.reset_index()
# Assign column names
df1.columns = ['Subreddit', 'Product Name']
# Put 2nd pair of columns into single DF, exploding single
# cell into multiple rows as needed
df2 = pd.concat([pd.Series(row['Product Name'], row['Times Mentioned'][:])])\
.reset_index()
# Assign column names
df2.columns = ['Times Mentioned', 'Product Name']
# Perform INNER JOIN on DFs with exploded row contents
# & drop duplicated column
merged = pd.concat([df1, df2], axis=1)
merged = merged.loc[:,~merged.columns.duplicated()]
# Swap 1st and 2nd columns
cols = list(merged)
cols.insert(0, cols.pop(cols.index('Product Name')))
merged = merged.loc[:, cols]
else: # if a row does not contain all non-None values
# Create single row DF with no changes
merged = pd.DataFrame(columns=['Product Name', 'Subreddit',
'Times Mentioned'])
# Append row to DF
merged.loc[0] = row
exploded_dfs.append(merged)
# Vertically concatenate DFs in list
print(pd.concat(exploded_dfs, axis=0).reset_index(drop=True))
Here is the output
Product Name Subreddit Times Mentioned
0 dr. jart+ mask heroes face savers None None
1 moon juice beauty shroom exfoliating acid potion scacjdiscussion 1
2 laneige kiss and make up set AsianBeauty 4
3 laneige kiss and make up set AsianBeautyAdvice 1
4 laneige kiss and make up set SkincareAddiction 1
5 laneige kiss and make up set abdiscussion 1
Brief explanation of steps
- iterate over all rows
- note that if the row contains any
None
values, it will just be taken as is, so it is assumed that this row does not need cleaning: this row will just be appended to a single row DF
- for first row in raw DF that does not contain all
None
- explode cell from first column with list (
Subreddit
) into column, if necessary (explained in this question)
- horizontally concatenate exploded cell (now as multiple rows) with row from column without list (
Product Name
); this gives cleaned DF df1
- repeat last 2 steps above using second column with list (
Times Mentioned
); this gives cleaned DF df1
- horizontally concatenate 2 cleaned DFs into new DF named
merged
- repeat above procedure for all rows in raw DF and append cleaned DF to blank list
- assemble final DF with vertical concatenation of all DFs in list
Package info
pandas==0.23.4
Python version
Python 2.7.15rc1