Concat python dataframes, drop duplicate and keep record of rows origin

Question

I have multiple dataframes, example data:

df1:

user_id    username firstname lastname 
 123         abc      abc       abc
 456         def      def       def 
 789         ghi      ghi       ghi

df2:

user_id     username  firstname lastname
 111         xyz       xyz       xyz
 456         def       def       def
 234         mnp       mnp       mnp

df3:

user_id     username  firstname lastname
 789         ghi       ghi       ghi       
 456         def       def       def
 222         uwv       uwv       uwv

I want to concatenate those dataframe, drop duplicate rows, and keep track of the rows' origins by adding more columns. Desired output:

df1:

user_id    username firstname lastname df1 df2 df3
 123         abc       abc       abc    1   0   0
 456         def       def       def    1   1   1
 789         ghi       ghi       ghi    1   0   1
 111         xyz       xyz       xyz    0   1   0
 234         mnp       mnp       mnp    0   1   0
 222         uwv       uwv       uwv    0   0   1

I can concatenate do the first step by using:

pd.concat([df1, df2, df3]).drop_duplicates('user_id').reset_index(drop=True)

How can I do the last step (making the origin columns)? I don't know how to do this without forloop, which is not practical for big dataframe. Thank you

shaik moeed · Accepted Answer · 2020-03-02T09:07:17.913

As you need the details of df's names after concatenating, you should not drop them.

You need to add df name as a column, so that after concatenating, we can know, which row is from which df.

>>> cols = df1.columns.to_list() # Store columns before adding `origin` column to use later in groupby
>>> df1['origin'] = 'df1'
>>> df2['origin'] = 'df2'
>>> df3['origin'] = 'df3'

Now, using groupby your task becomes easy.

>>> df = pd.concat([df1, df2, df3])\
    .reset_index(drop=True)\
    .groupby(by = cols, axis=0)\
    .apply(lambda x: x['origin'].values)\
    .to_frame('origin')\
    .reset_index()

Output:

>>> df

   user_id username firstname lastname           origin
0      111      xyz       xyz      xyz            [df2]
1      123      abc       abc      abc            [df1]
2      222      uwv       uwv      uwv            [df3]
3      234      mnp       mnp      mnp            [df2]
4      456      def       def      def  [df1, df2, df3]
5      789      ghi       ghi      ghi       [df1, df3]

You need to do one hot encoding to get the expected results like, (referenced from here)

Using sklearn.preprocessing.MultiLabelBinarizer (You can also use pd.get_dummies as well, check this)

>>> from sklearn.preprocessing import MultiLabelBinarizer
>>> mlb = MultiLabelBinarizer()
>>> expandedLabelData = mlb.fit_transform(df["origin"])
>>> labelClasses = mlb.classes_
>>> encoded_df = pd.DataFrame(expandedLabelData, columns=labelClasses)
   df1  df2  df3
0    0    1    0
1    1    0    0
2    0    0    1
3    0    1    0
4    1    1    1
5    1    0    1

Finally,

>>> pd.concat([df.drop('origin', axis=1), encoded_df], axis=1)
   user_id username firstname lastname  df1  df2  df3
0      111      xyz       xyz      xyz    0    1    0
1      123      abc       abc      abc    1    0    0
2      222      uwv       uwv      uwv    0    0    1
3      234      mnp       mnp      mnp    0    1    0
4      456      def       def      def    1    1    1
5      789      ghi       ghi      ghi    1    0    1

In the line `groupby(by = ['user_id', 'username', 'firstname', 'lastname'], axis=0)` I replace it with `groupby(by = df1.columns, axis=0)` but I got `ValueError: Grouper and axis must be same length` — Thanh Nguyen, Mar 02 '20 at 09:01
@ThanhNguyen Correct, you need to remove `origin` from `df1.columns` — shaik moeed, Mar 02 '20 at 09:03
@ThanhNguyen Check my updated answer to pass column names as variable. — shaik moeed, Mar 02 '20 at 09:08

score 1 · Answer 2 · answered Mar 02 '20 at 07:33

1

Create dictionary of DataFrames, so if use concat is created MultiIndex, convert it to column by DataFrame.reset_index and grouping by all columns with join, so last is possible use Series.str.get_dummies with reset_index:

dfs = {'df1': df1, 'df2': df2, 'df3': df3}
df = (pd.concat(dfs)
       .reset_index()
       .groupby(df1.columns.tolist())['level_0']
       .apply('|'.join)
       .str.get_dummies()
       .reset_index())
print (df)
   user_id username firstname lastname  df1  df2  df3
0      111      xyz       xyz      xyz    0    1    0
1      123      abc       abc      abc    1    0    0
2      222      uwv       uwv      uwv    0    0    1
3      234      mnp       mnp      mnp    0    1    0
4      456      def       def      def    1    1    1
5      789      ghi       ghi      ghi    1    0    1

answered Mar 02 '20 at 07:33

jezrael

822,522
95
1,334
1,252

Thank you for a great answer (as you always do), but this time shaik's answer helped me understand how to apply to other problems as well, so I will accept his answer for this question. – Thanh Nguyen Mar 06 '20 at 08:01
Yeah I think that both answers has similar principle, it's quite hard for me to accept one. Performance wise, your answer is better. I also agree that using MLB is a bit overkill, but somehow I feel it's also interesting. Thank you again, it's always fascinating to see your answer – Thanh Nguyen Mar 06 '20 at 08:09
1

@ThanhNguyen - Ya, understand, not easy decision ;) Happy coding ;) – jezrael Mar 06 '20 at 08:09

Concat python dataframes, drop duplicate and keep record of rows origin

2 Answers2