0

I have a set of CSV files in a folder and I'd like to merge them in one "super-csv". Some of the columns are available in all files, some not.

Fields in the output should just be empty, if it was not available in the source. If a columnname is the same over multiple CSV, it should fill the existing column (Name in the example)

File1.CSV

ID        Name       ContactNo
53        Vikas      9874563210

File2.CSV

ID     Name          Designation
23    MyShore    Software Engineer

Output Expected

ID        Name          ContactNo           Designation 
53        Vikas         9874563210
23        MyShore                          Software Engineer

I've already tried other solutions, but they cannot handle empty fields. eg. merge csv files with different column order remove duplicates

Thanks in advance

Michael

Community
  • 1
  • 1
Michael
  • 289
  • 2
  • 5
  • 14
  • Are the input files sorted by ID? I guess a solution in bash would be cumbersome. In python the best way would probably involve a mapping from a ID to a tuple consisting of the values. After you processed all input files and built your mapping you can use it to create a new file. – Rambo Ramon May 29 '15 at 11:55

1 Answers1

1

In python, you can use the pandas module that allows to fill a dataframe from a csv, merge dataframe and then save the merged dataframe into new csv file.

For example :

import pandas as pd
df1 = pd.DataFrame.from_csv("file1.csv", sep=",")
df2 = pd.DataFrame.from_csv("file2.csv", sep=",")
final_df = df1.reset_index().merge(df2.reset_index(), how="outer").set_index('ID')

final_df.to_csv("result.csv", sep=",")

which would produce

ID,Name,ContactNo,Designation
53,Vikas,9874563210.0, 
23,MyShore,,Software Engineer 

You would have to play with the sep argument to adapt to your files format.

stellasia
  • 5,372
  • 4
  • 23
  • 43
  • Thanks! For two files this is working perfectly! Any solution on how to apply this to more files? (dynamic amount of files in the same folder) – Michael May 29 '15 at 13:19
  • For an unknow number of files, you can use the os module as describe here http://stackoverflow.com/questions/3207219/how-to-list-all-files-of-a-directory-in-python – stellasia May 29 '15 at 13:22
  • yes I know, sorry i may expressed unclear - i meant on how to adapt the "final_df = df1.reset_index().merge(df2.reset_index(), how="outer").set_index('ID')" to more than 2 files? (or dynamically?) shall i create a for i to xx which just adds 1 and 2 , then 3 to the combination of 1 and 2 and so on? – Michael May 29 '15 at 13:33
  • Yes, you can loop on the csv files in your directory and declare before the loop : `final_df = pd.DataFrame(columns=["ID"]) ` – stellasia May 29 '15 at 13:47