
I have two csv files that I want to read in as dataframes with pandas. I want to merge them, with the exception that showing times cannot be duplicated.

If a ShowingDateTime is a duplicate, I want to keep the row from the first dataframe and drop the one from the second. I am not sure of the best way to do this with pandas. Afterwards, I want to sort the result by ShowingDateTime in ascending order.

CSV1:

Address,City,State,ShowingDateTime
1234 Hodge Street,Brown,CA,1/4/17 12:00
9613 Llama Street,Downtown,CA,1/5/17 12:15
7836 Bob Street,Swamp,CA,1/5/17 12:30
2134 Cardinal Street,Ruler,CA,1/6/17 11:30

CSV2:

Address,City,State,ShowingDateTime
10234 Peek Street,Brown,CA,1/4/17 12:00
1122 Kara Street,Downtown,CA,1/5/17 12:30
1023 Solr Street,Swamp,CA,1/6/17 11:30
2234 Tempura Street,Ruler,CA,1/6/17 12:00

Expected Merged Result (written to a csv after df merge):

1234 Hodge Street,Brown,CA,1/4/17 12:00
9613 Llama Street,Downtown,CA,1/5/17 12:15
7836 Bob Street,Swamp,CA,1/5/17 12:30
2134 Cardinal Street,Ruler,CA,1/6/17 11:30
2234 Tempura Street,Ruler,CA,1/6/17 12:00

3 Answers

  import pandas as pd

  df1 = pd.read_csv('path_of_first_csv_file')
  df2 = pd.read_csv('path_of_second_csv_file')

  # Stack the two frames; df1's rows come first, so keep='first' prefers them
  df3 = pd.concat([df1, df2], ignore_index=True)
  df3 = df3.drop_duplicates(subset='ShowingDateTime', keep='first')
  print(df3)
  df3.to_csv('output.csv', index=False)  # index=False keeps the row index out of the file
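
This still doesn't apply the ascending sort the question asks for. A minimal follow-up sketch, assuming the timestamps are U.S.-style month/day/year (as the samples suggest), to run before the to_csv call:

  # Parse the strings so the sort is chronological rather than lexicographic
  df3['ShowingDateTime'] = pd.to_datetime(df3['ShowingDateTime'], format='%m/%d/%y %H:%M')
  df3 = df3.sort_values('ShowingDateTime')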

You want concat() instead of merge() here: you are stacking rows from two files, not joining on keys.

First, load each csv:

df1 = pd.read_csv('csv1.csv')
df2 = pd.read_csv('csv2.csv')

Then concat the two dfs. Note that pd.concat has no how argument (that belongs to merge); its join parameter already defaults to 'outer', so you can drop it entirely:

final_df = pd.concat([df1, df2], ignore_index=True)

Then drop duplicates of ShowingDateTime, keeping the df1 row whenever both files share a time. drop_duplicates returns a new DataFrame rather than modifying in place, so assign the result back:

final_df = final_df.drop_duplicates(subset=['ShowingDateTime'], keep='first')

Then save as a csv:

final_df.to_csv('final.csv', index=False)
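
As with the first answer, the rows are still unsorted at this point; the same pd.to_datetime / sort_values step sketched above can be applied to final_df before saving.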

I'd recommend searching other questions a bit; you'd find more efficient methods, like here. It outlines how df.index.duplicated(keep='first') is more efficient when handling large datasets. In your case it could be done as follows:

import os
import pandas as pd

directory = './records/'
all_files = sorted(os.listdir(directory))  # sort so the first csv is read before the second
# index_col=3 makes ShowingDateTime (the fourth column) the index
df = pd.concat(pd.read_csv(os.path.join(directory, f), index_col=3) for f in all_files)
df = df[~df.index.duplicated(keep='first')]  # keep only the first row for each index value
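
This stops at deduplication. A minimal sketch of the remaining sort-and-save step, again assuming month-first timestamps (note that parsing changes the datetime format written to the file):

df.index = pd.to_datetime(df.index, format='%m/%d/%y %H:%M')
df.sort_index().to_csv('final.csv')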