
I have different sources (CSV files) of similar data sets which I want to merge into a single data set and write to my DB. Since the data comes from different sources, each CSV uses different headers; I want to merge the columns that have the same logical meaning.

So far, I have tried reading all the headers first and then re-reading the files to get all the data into a single data frame, then using if/else logic to merge the columns with the same meaning. Ideally I would like to create a mapping file listing all possible column names per canonical column and then read the CSVs using that mapping. The data is not ordered or sorted between files. The number of columns might differ too, but they all contain the columns I am interested in.

Sample data:
File 1:
id, name, total_amount...
1, "test", 123 ..

File 2:
member_id, tot_amnt, name
2, 1234, "test2" ..

I want this to look like:

id, name, total_amount...
1, "test", 123...
2, "test2", 1234...
...

I can't think of an elegant way to do this; it would be great to get some direction or help with it.

Thanks

juggernaut
  • If the schema is identical and the order is maintained, just assign a new list of headers to `df.columns`? – cs95 Apr 15 '19 at 23:04
  • You'd then be able to use `pd.merge` or `pd.concat`, see the "merging multiple DataFrames" section in [this post...](https://stackoverflow.com/a/53645883/4909087). – cs95 Apr 15 '19 at 23:05
  • schema is identical but the order changes from one file to the other. I think the post you pointed me to might work, checking if i can make it work. Thanks – juggernaut Apr 15 '19 at 23:07

2 Answers


Use skiprows=1 and header=None to skip the header row, names to specify your own list of column names, and pd.concat to merge into a single df, i.e.

import pandas as pd

pd.concat([
    pd.read_csv('file1.csv', skiprows=1, header=None, names=['a', 'b', 'c']),
    pd.read_csv('file2.csv', skiprows=1, header=None, names=['a', 'b', 'c'])
])

Edit: If the files differ only by column order, you can pass a different column order to names, and if you want to select a subset of columns, use usecols. But you need to do this mapping in advance, either by probing the file or by some other rule.

This requires mapping files to handlers somehow, i.e.

file1.csv

id, name, total_amount
1, "test", 123

file2.csv

member_id, tot_amnt, ignore, name
2, 1234, -1, "test2"

The following selects the 3 common columns and renames/reorders them.

import pandas as pd

pd.concat([
    pd.read_csv('file1.csv', skiprows=1, header=None, names=['id', 'name', 'value'], usecols=[0, 1, 2]),
    pd.read_csv('file2.csv', skiprows=1, header=None, names=['id', 'value', 'name'], usecols=[0, 1, 3])],
    sort=False
)

Edit 2:

And a nice way to apply this is to use lambdas and a mapping dict, i.e.

parsers = {
    "schema1": lambda f: pd.read_csv(f, skiprows=1, header=None, names=['id', 'name', 'value'], usecols=[0, 1, 2]),
    "schema2": lambda f: pd.read_csv(f, skiprows=1, header=None, names=['id', 'value', 'name'], usecols=[0, 1, 3]),
}

# map each file to its schema (named schemas to avoid shadowing the builtin map)
schemas = {
    "file2.csv": "schema2",
    "file1.csv": "schema1",
}

pd.concat([parsers[v](k) for k, v in schemas.items()], sort=False)
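For example, with two in-memory CSVs standing in for the real files, just to show the pattern end to end (in practice the keys would be file paths):

```python
import io
import pandas as pd

parsers = {
    "schema1": lambda f: pd.read_csv(f, skiprows=1, header=None,
                                     names=['id', 'name', 'value'], usecols=[0, 1, 2]),
    "schema2": lambda f: pd.read_csv(f, skiprows=1, header=None,
                                     names=['id', 'value', 'name'], usecols=[0, 1, 3]),
}

# in-memory stand-ins for file1.csv and file2.csv
files = {
    io.StringIO('id,name,total_amount\n1,test,123\n'): "schema1",
    io.StringIO('member_id,tot_amnt,ignore,name\n2,1234,-1,test2\n'): "schema2",
}

# concat aligns by column name, so both schemas land in id/name/value
df = pd.concat([parsers[v](k) for k, v in files.items()], sort=False)
```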
David Waterworth

This is what I ended up doing and found to be the cleanest solution. Thanks David for your help.

import pandas as pd

dict1 = {'member_number': 'id', 'full name': 'name', …}
dict2 = {'member_id': 'id', 'name': 'name', …}

parsers = {
    "schema1": lambda f, colmap: pd.read_csv(f, index_col=False, usecols=list(colmap.keys())),
    "schema2": lambda f, colmap: pd.read_csv(f, index_col=False, usecols=list(colmap.keys())),
}

# map each schema to its file and column mapping (named schemas to avoid shadowing the builtin map)
schemas = {
    'schema1': ('a_file.csv', dict1),
    'schema2': ('b_file.csv', dict2),
}

total = []
for k, (path, colmap) in schemas.items():
    d = parsers[k](path, colmap)
    d.rename(columns=colmap, inplace=True)
    total.append(d)
final_df = pd.concat(total, sort=False)
juggernaut