I have a dictionary containing an unknown number of pandas dataframes. Each dataframe has one column that is always present (user_ID) and a set of columns that may or may not be present. All dataframes have the same number and order of rows. In the columns I am interested in, each cell holds a list.
A simplified example:
import pandas as pd

df = {}  # dictionary of dataframes, keyed by name
df['first'] = pd.DataFrame({'user_ID': [1, 2, 3],
                            'col1': [[1], [2, 3], [3]],
                            'col2': [[3], [3], [3, 1]],
                            'col3': [[], [1, 2, 3], [3, 1]]})
df['second'] = pd.DataFrame({'user_ID': [1, 2, 3],
                             'col1': [[1, 2], [3], [3]],
                             'col3': [[1], [2, 3], [3]],
                             'col4': [[3], [3], [3, 1]]})
df['last'] = pd.DataFrame({'user_ID': [1, 2, 3],
                           'col1': [[1], [2, 3], [3]],
                           'col2': [[3], [3], [3, 1]],
                           'col5': [[], [1, 2, 3], [3, 1]]})
They look like:
     col1    col2       col3  user_ID
0     [1]     [3]         []        1
1  [2, 3]     [3]  [1, 2, 3]        2
2     [3]  [3, 1]     [3, 1]        3

     col1    col3    col4  user_ID
0  [1, 2]     [1]     [3]        1
1     [3]  [2, 3]     [3]        2
2     [3]     [3]  [3, 1]        3

     col1    col2       col5  user_ID
0     [1]     [3]         []        1
1  [2, 3]     [3]  [1, 2, 3]        2
2     [3]  [3, 1]     [3, 1]        3
How can I merge all these dataframes into a single dataframe in which, for every column other than user_ID, the lists from same-named columns are concatenated row by row? A dataframe that lacks a given column simply contributes nothing to it.
The result should look like this (the order of elements within each list is irrelevant):
              col1          col2             col3    col4       col5  user_ID
0     [1, 1, 2, 1]        [3, 3]              [1]     [3]         []        1
1  [2, 3, 3, 2, 3]        [3, 3]  [1, 2, 3, 2, 3]     [3]  [1, 2, 3]        2
2        [3, 3, 3]  [3, 1, 3, 1]        [3, 1, 3]  [3, 1]     [3, 1]        3
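To pin down the semantics I am after, here is a plain-Python sketch that walks the union of the non-user_ID columns and concatenates the per-row lists. It is only an illustration, not necessarily the most efficient or idiomatic pandas approach. The helper name merge_list_columns and the dict name frames are made up for this sketch, and it assumes Python 3.7+ (ordered dicts), identical row order in every dataframe, and that a missing column contributes nothing.

```python
from itertools import chain

import pandas as pd

# Sample data from the question, held in a dict called `frames` (assumed name).
frames = {
    'first': pd.DataFrame({'user_ID': [1, 2, 3],
                           'col1': [[1], [2, 3], [3]],
                           'col2': [[3], [3], [3, 1]],
                           'col3': [[], [1, 2, 3], [3, 1]]}),
    'second': pd.DataFrame({'user_ID': [1, 2, 3],
                            'col1': [[1, 2], [3], [3]],
                            'col3': [[1], [2, 3], [3]],
                            'col4': [[3], [3], [3, 1]]}),
    'last': pd.DataFrame({'user_ID': [1, 2, 3],
                          'col1': [[1], [2, 3], [3]],
                          'col2': [[3], [3], [3, 1]],
                          'col5': [[], [1, 2, 3], [3, 1]]}),
}


def merge_list_columns(frames, key='user_ID'):
    """Hypothetical helper: concatenate same-named list columns row by row."""
    frame_list = list(frames.values())
    # Start from the key column of the first frame (all frames share row order).
    result = pd.DataFrame({key: frame_list[0][key]})
    # Union of all column names except the key, in sorted order.
    columns = sorted({c for f in frame_list for c in f.columns if c != key})
    for col in columns:
        # Only the frames that actually have this column take part.
        parts = [f[col] for f in frame_list if col in f.columns]
        # zip(*parts) yields, per row, one list from each participating frame.
        cells = [list(chain.from_iterable(row)) for row in zip(*parts)]
        result[col] = pd.Series(cells, index=result.index)
    return result


result = merge_list_columns(frames)
print(result)
```

In this sketch user_ID ends up as the first column rather than the last; the column order does not matter for my purposes.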
I managed to concatenate the dataframes, but I still need to merge the resulting columns.
for dfName in ['first', 'second', 'last']:
    df[dfName] = df[dfName].drop(['user_ID'], axis=1)
merged = pd.concat(df, axis=1, keys=['first', 'second', 'last'])
print(merged)
outputs:
    first                     second                    last          \
     col1    col2       col3    col1    col3    col4    col1    col2
0     [1]     [3]         []  [1, 2]     [1]     [3]     [1]     [3]
1  [2, 3]     [3]  [1, 2, 3]     [3]  [2, 3]     [3]  [2, 3]     [3]
2     [3]  [3, 1]     [3, 1]     [3]     [3]  [3, 1]     [3]  [3, 1]

        col5
0         []
1  [1, 2, 3]
2     [3, 1]
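Continuing from the concatenated frame, one direction that might work is to take each inner column name in turn, select every column carrying that name with xs, and flatten the lists row by row. This is a rough sketch rather than a confirmed solution. It assumes user_ID has already been dropped, as in the loop above (the sample data is rebuilt without it so the snippet runs on its own), and the names frames and collapsed are placeholders.

```python
import pandas as pd

# Sample data without user_ID, mirroring the state after the drop loop.
frames = {
    'first': pd.DataFrame({'col1': [[1], [2, 3], [3]],
                           'col2': [[3], [3], [3, 1]],
                           'col3': [[], [1, 2, 3], [3, 1]]}),
    'second': pd.DataFrame({'col1': [[1, 2], [3], [3]],
                            'col3': [[1], [2, 3], [3]],
                            'col4': [[3], [3], [3, 1]]}),
    'last': pd.DataFrame({'col1': [[1], [2, 3], [3]],
                          'col2': [[3], [3], [3, 1]],
                          'col5': [[], [1, 2, 3], [3, 1]]}),
}
merged = pd.concat(frames, axis=1, keys=['first', 'second', 'last'])

collapsed = pd.DataFrame(index=merged.index)
for col in sorted(set(merged.columns.get_level_values(1))):
    # All columns named `col`, one per source dataframe that has it.
    same_name = merged.xs(col, axis=1, level=1)
    # Flatten the lists found in each row into a single list.
    cells = [
        [item for cell in row for item in cell]
        for row in same_name.itertuples(index=False, name=None)
    ]
    collapsed[col] = pd.Series(cells, index=merged.index)

print(collapsed)
```

The user_ID column would then have to be re-attached separately, for example from a copy saved before the drop.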
Any ideas?