shape of the data is changed after merging in python

Question

I'm having two data frame having the shape df1(220545, 3) and for df2(462, 11). In this the intersection column is campaign ID, df1 has 63 unique id and df2 has 147 id's. while I tried to merge the df2 with df1 on campaign Id, the dimension is changing for (248949, 13). I want df1 shape for the new merged data frame also.

The code which I have used for the merging is

m=df1.merge(df2, on=['Campaign ID'],how='inner')
m=m.drop_duplicates(keep=False)

Please give some suggestion to get exact output.

Looks like there are duplicate rows in either/both `df1` and `df2`. This is why `merge` is producing more number of rows. Not sure what you mean by `I want df1 shape for the new merged data frame also`. — Mayank Porwal, Nov 29 '18 at 12:20

score 0 · Answer 1 · answered Nov 29 '18 at 12:40

If you want to keep the left Dataframe you have to do an left join (how='left'). Than you have to drop the possible duplicates on an unique row like idx. Otherwise with a inner join like you did you get the following dataframe:

df1 = pd.DataFrame([(f't{x}', f'a{x}') for x in range(0,5)], columns=['idx', 'a'])

    idx a
0   t0  a0
1   t1  a1
2   t2  a2
3   t3  a3
4   t4  a4

df2 = pd.DataFrame([(f't{x%2}', f'b{x}') for x in range(0,5)], columns=['idx', 'b'])

    idx b
0   t0  b0
1   t1  b1
2   t0  b2
3   t1  b3
4   t0  b4

df_result = pd.merge(df1, df2, on='idx)

    idx a   b
0   t0  a0  b0
1   t0  a0  b2
2   t0  a0  b4
3   t1  a1  b1
4   t1  a1  b3

If you do only drop_duplicates() this won't work because you do not have rows which are completly duplicated as in the example shown.

But is not giving the same shape of the df1 – Guru Prasanna Dec 03 '18 at 04:48 — Guru Prasanna, Dec 03 '18 at 04:48

score 0 · Answer 2 · answered Nov 29 '18 at 12:49

I assume when you say,

"I want df1 shape for the new merged data frame also."

you mean you want a dataframe with the same number of rows and columns as df1. If that is the case, why do you want to join to df2 if you aren't bringing in any new columns from df2?

Unless otherwise specified, in the merge statement all columns will be brought in from df1 and df2 in the resultant df.

So you can explicitly specify the columns you want to bring in from df1 and df2 as:

m=df1[['col1', 'col2', ....]].merge(df2[['col3','col4', ...]], on=['Campaign ID'],how='inner')

Which will bring in the columns you want since you specify the column names ('col1', 'col2') explicitly.

As for keeping the number of rows the same, since the campaign ID is not unique in either table you will have a Cartesian product, which means that one instance of a particular campaign ID in df1 could link to multiple rows in df2. If you want the number of rows to be the same you must ensure that campaign ID in df1 only matches to one instance of Campaign ID in df2. Also once you are sure that one Campaign ID in df1 only matches one Campaign ID in df2 you want to use a left join (not inner) to ensure that lines that have no Campaign in df2 are not lost.

Yes, you are right but I only want the CamaignID which is in df1 are to be merged with df2 and the new dataframe "m" want to contains the CampaignID which is present in df1 only — Guru Prasanna, Dec 03 '18 at 04:44
In that case you need to check the duplicates and use a left join as in @MisterMonk's answer above. — rich, Dec 03 '18 at 09:45

shape of the data is changed after merging in python

2 Answers2