pandas: merge two files based on multiple columns

Question

I have two files which have 3 common columns - Date, KeywordId, AdGroupId. I want to merge these two files based on these columns such that for each row with a particular date, keywordid, adgroupid in the first file, if there's corresponding row with same date, keywordid, adgroupid in the second file, then append the rest of the values present in the second file and if not just append null or - in the rest of the columns.

The first file (df1 here) has 59,00,000 rows. The second file has around 10,00,000 rows. I used the code below

 import pandas as pd

df1 = pd.read_csv(r"C:\Users\Rakshit Lal\Desktop\QVC Data\psnb_extract_daily\Final\cumulative_adwords_test.csv")
df2 = pd.read_csv(r"C:\Users\Rakshit Lal\Desktop\QVC Data\psnb_extract_daily\Final\Test_psnbfull.csv")

# Merge the two dataframes, using _ID column as key
df3 = pd.merge(df1, df2, on = ['Date', 'KeywordId', 'AdGroupId'])
df3.set_index('Date', inplace = True)

# Write it to a new CSV file
df3.to_csv('CSV3.csv')

# Write it to a new CSV file
df3.to_csv('CSV3.csv')

My final file - csv3 contains only 6,05,277 rows for some reason where it should contain 59,00,000 rows (as in file 1). I believe I'm making an error with the way I'm using the merge function. Can someone help me out on where I'm going wrong and what I should modify?

Perhaps you want to set the `how` keyword argument to `'outer'`? — 9769953, Jul 27 '20 at 14:13
Is it guaranteed that each row in df2 matches one or more rows in df1? That is how I read your question, but it's not entirely clear. — 9769953, Jul 27 '20 at 14:13
For more information, have a read through https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html . The figures may give you an idea of what to use in which case. — 9769953, Jul 27 '20 at 14:14
No. For any particular row in df2 there might not be a corresponding row in df1. But all I want is to have all the rows in df1 intact in my final csv and if there's a corresponding entry in df1, then add the values for those rows and if there isn't then just leave those extra columns for those rows blank or null — Rakshit, Jul 27 '20 at 14:30
Then `'left'` is indeed the better choice compared to `'outer'`. — 9769953, Jul 27 '20 at 16:26

score 2 · Accepted Answer · answered Jul 27 '20 at 14:14

2

If you don't specify how to merge it takes a inner join - but actually you want to do left join. You can use 'left' merge:

df3 = pd.merge(df1, df2, on = ['Date', 'KeywordId', 'AdGroupId'], how = 'left')

answered Jul 27 '20 at 14:14

blutab

181
8

I'll try this out and let you know if it works! Thanks. – Rakshit Jul 27 '20 at 14:36
Any idea on what I should include to have the rows sorted by the date column (descending)? – Rakshit Jul 28 '20 at 04:43
This could help you: https://stackoverflow.com/questions/28161356/sort-pandas-dataframe-by-date – blutab Jul 28 '20 at 05:49

pandas: merge two files based on multiple columns

1 Answers1