
I have a dictionary with an unknown number of pandas dataframes. Each dataframe contains a set of columns that are always present (user_ID) and a set of columns that may or may not be present. All dataframes have the same number and order of rows. The content of each cell is a list (for the columns I am interested in).

A simplified example:

import pandas as pd

df = {}

df['first']  = pd.DataFrame({'user_ID': [1, 2, 3],
                             'col1': [[1], [2, 3], [3]],
                             'col2': [[3], [3], [3, 1]],
                             'col3': [[], [1, 2, 3], [3, 1]]})

df['second'] = pd.DataFrame({'user_ID': [1, 2, 3],
                             'col1': [[1, 2], [3], [3]],
                             'col3': [[1], [2, 3], [3]],
                             'col4': [[3], [3], [3, 1]]})

df['last']   = pd.DataFrame({'user_ID': [1, 2, 3],
                             'col1': [[1], [2, 3], [3]],
                             'col2': [[3], [3], [3, 1]],
                             'col5': [[], [1, 2, 3], [3, 1]]})

They look like:

     col1    col2       col3  user_ID
0     [1]     [3]         []        1
1  [2, 3]     [3]  [1, 2, 3]        2
2     [3]  [3, 1]     [3, 1]        3

     col1    col3    col4  user_ID
0  [1, 2]     [1]     [3]        1
1     [3]  [2, 3]     [3]        2
2     [3]     [3]  [3, 1]        3

     col1    col2       col5  user_ID
0     [1]     [3]         []        1
1  [2, 3]     [3]  [1, 2, 3]        2
2     [3]  [3, 1]     [3, 1]        3

How can I merge all these dataframes into a single dataframe where all columns that are not user_ID are merged so the contents are appended to the list?

Result should look like (order of elements in each list is irrelevant):

              col1          col2             col3    col4       col5   user_ID
0     [1, 1, 2, 1]        [3, 3]              [1]     [3]         []         1
1  [2, 3, 3, 2, 3]        [3, 3]  [1, 2, 3, 2, 3]     [3]  [1, 2, 3]         2
2        [3, 3, 3]  [3, 1, 3, 1]        [3, 1, 3]  [3, 1]     [3, 1]         3

I managed to concatenate the dataframes, but I still need to merge the resulting columns.

for dfName in ['first', 'second', 'last']:
    df[dfName] = df[dfName].drop(['user_ID'], axis=1)

merged = pd.concat(df, axis=1, keys=['first', 'second', 'last'])
print(merged)

outputs:

    first                     second                    last          \
     col1    col2       col3    col1    col3    col4    col1    col2
0     [1]     [3]         []  [1, 2]     [1]     [3]     [1]     [3]
1  [2, 3]     [3]  [1, 2, 3]     [3]  [2, 3]     [3]  [2, 3]     [3]
2     [3]  [3, 1]     [3, 1]     [3]     [3]  [3, 1]     [3]  [3, 1]


        col5
0         []
1  [1, 2, 3]
2     [3, 1]

Any ideas?

fuyas

2 Answers


It's a little involved, but you will need df.groupby. First, use pd.concat to join the frames. Then replace the NaNs with empty lists using df.applymap, and finally groupby and sum.

In [673]: pd.concat([df1, df2, df3], axis=0)\
            .applymap(lambda x: [] if x != x else x)\
            .groupby('user_ID', as_index=False).sum()
Out[673]: 
   user_ID             col1          col2             col3    col4       col5
0        1     [1, 1, 2, 1]        [3, 3]              [1]     [3]         []
1        2  [2, 3, 3, 2, 3]        [3, 3]  [1, 2, 3, 2, 3]     [3]  [1, 2, 3]
2        3        [3, 3, 3]  [3, 1, 3, 1]        [3, 1, 3]  [3, 1]     [3, 1]

Slightly improved efficiency thanks to Maarten Fabré.
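
As an aside on the x != x test: NaN is the only common value that is not equal to itself, so the lambda replaces missing cells with empty lists, and groupby(...).sum() then concatenates the lists, because + on Python lists is concatenation. A minimal sketch of both facts:

nan = float('nan')
print(nan != nan)    # True: NaN is never equal to itself
print(3 != 3)        # False for ordinary values

# groupby(...).sum() on object columns reduces with +, which concatenates lists
print([1] + [2, 3])  # [1, 2, 3]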


If you have an unknown number of dataframes, you can put them in a list or dict and pass that to pd.concat:

merged = pd.concat(df_list, axis=0). ...
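
For instance, with the dict df from the question (a sketch; pd.concat accepts a mapping directly, and the mapping keys just become an extra index level that the groupby discards):

merged = (pd.concat(df, axis=0)
            .applymap(lambda x: [] if x != x else x)
            .groupby('user_ID', as_index=False)
            .sum())
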
cs95
  • nice trick comparing `NaN` to `NaN`. I forgot about that. I think you don't need to concat according to axis 1 and then transpose. You can just concat and then groupby `user_ID` – Maarten Fabré Aug 23 '17 at 11:59
  • I made a small alteration to cope with the fact that I don't know how many dataframes I have in the dict: `merged = pd.concat( df, axis=0, keys=list(df.keys()) ).applymap(lambda x: [] if x != x else x).groupby('user_ID', as_index=False).sum()` And voila! – fuyas Aug 23 '17 at 12:16
  • @fuyas Why a dict? What do the keys and values contain? (asking so I can add it to my answer) – cs95 Aug 23 '17 at 12:17
  • @COLDSPEED, the dataframes are in a dict. Each key is its name and the value is a dataframe extracted from a different source. – fuyas Aug 23 '17 at 12:19
  • if the dataframes are in a dict, I don't think you need this `keys=list(df_dict.keys())` part – Maarten Fabré Aug 23 '17 at 12:54

You could use df.groupby('user_ID').sum() if it weren't for the NaN values, which cause all columns apart from col1 to be dropped.

To get around this, you could use this rather ugly method:

pd.concat((df0, df1, df2)).fillna(-1).applymap(lambda x: x if x != -1 else []).groupby('user_ID').sum()

I had to resort to the fillna(-1).applymap(...) because you can't seem to assign [] directly to an item. If someone has a better suggestion, let me know.
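
One alternative that avoids the sentinel entirely: since every populated cell here is a list, you can test the cell's type instead of its value (a sketch, assuming all non-NaN cells are lists):

# anything that is not a list (i.e. the NaN fillers) becomes []
pd.concat((df0, df1, df2))\
    .applymap(lambda x: x if isinstance(x, list) else [])\
    .groupby('user_ID').sum()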

edit

Using @COLDSPEED's trick of comparing NaN to NaN,

pd.concat((df0, df1, df2)).applymap(lambda x: x if x == x else []).groupby('user_ID').sum()

works more cleanly.

If you want the user_ID as a column instead of an index, just add .reset_index().
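
For example (the same pipeline, with user_ID back as a regular column):

pd.concat((df0, df1, df2))\
    .applymap(lambda x: x if x == x else [])\
    .groupby('user_ID').sum()\
    .reset_index()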

Maarten Fabré