
I have different sources (CSV files) of similar data sets which I want to merge into a single data set and write to my DB. Since the data comes from different sources, each CSV uses different headers; I want to merge the columns that have the same logical meaning.

So far, I have tried reading all the headers first and then re-reading the files to get all the data into a single data frame, then using if/else logic to merge the columns with the same meaning. Ideally I would like to create a mapping file listing all possible column names per canonical column and then read the CSVs using that mapping. The data is not ordered or sorted between files. The number of columns might differ too, but they all contain the columns I am interested in.

Sample data:
File 1:
id, name, total_amount...
1, "test", 123 ..

File 2:
member_id, tot_amnt, name
2, 1234, "test2" ..

I want this to look like:

id, name, total_amount...
1, "test", 123...
2, "test2", 1234...
...

I can't think of an elegant way to do this; it would be great to get some direction or help with it.

Thanks

juggernaut
  • If the schema is identical and the order is maintained, just assign a new list of headers to `df.columns`? – cs95 Apr 15 '19 at 23:04
  • You'd then be able to use `pd.merge` or `pd.concat`, see the "merging multiple DataFrames" section in [this post...](https://stackoverflow.com/a/53645883/4909087). – cs95 Apr 15 '19 at 23:05
  • schema is identical but the order changes from one file to the other. I think the post you pointed me to might work, checking if i can make it work. Thanks – juggernaut Apr 15 '19 at 23:07

2 Answers


Use skiprows=1 and header=None to skip the header row, names to specify your own list of column names, and pd.concat to merge into a single df, i.e.

import pandas as pd

pd.concat([
    pd.read_csv('file1.csv', skiprows=1, header=None, names=['a', 'b', 'c']),
    pd.read_csv('file2.csv', skiprows=1, header=None, names=['a', 'b', 'c'])
])

Edit: If the files differ only by column order, you can pass a different column order to names, and if you want to select a subset of columns, use usecols. But you need to do this mapping in advance, either by probing the file or by some other rule.

This requires mapping files to handlers somehow, i.e.

file1.csv

id, name, total_amount
1, "test", 123

file2.csv

member_id, tot_amnt, ignore, name
2, 1234, -1, "test2"

The following selects the 3 common columns and renames/reorders them.

import pandas as pd

pd.concat([
    pd.read_csv('file1.csv', skiprows=1, header=None, names=['id', 'name', 'value'], usecols=[0, 1, 2]),
    pd.read_csv('file2.csv', skiprows=1, header=None, names=['id', 'value', 'name'], usecols=[0, 1, 3])],
    sort=False
)

Edit 2:

And a nice way to apply this is to use lambdas and a mapping dict, i.e.

parsers = {
    "schema1": lambda f: pd.read_csv(f, skiprows=1, header=None, names=['id', 'name', 'value'], usecols=[0, 1, 2]),
    "schema2": lambda f: pd.read_csv(f, skiprows=1, header=None, names=['id', 'value', 'name'], usecols=[0, 1, 3]),
}

# map each file to its schema (named schemas to avoid shadowing the builtin map)
schemas = {
    "file2.csv": "schema2",
    "file1.csv": "schema1",
}

pd.concat([parsers[v](k) for k, v in schemas.items()], sort=False)
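For example, with two in-memory CSVs standing in for the real files, just to show the pattern end to end (in practice the keys would be file paths):

```python
import io
import pandas as pd

parsers = {
    "schema1": lambda f: pd.read_csv(f, skiprows=1, header=None,
                                     names=['id', 'name', 'value'], usecols=[0, 1, 2]),
    "schema2": lambda f: pd.read_csv(f, skiprows=1, header=None,
                                     names=['id', 'value', 'name'], usecols=[0, 1, 3]),
}

# in-memory stand-ins for file1.csv and file2.csv
files = {
    io.StringIO('id,name,total_amount\n1,test,123\n'): "schema1",
    io.StringIO('member_id,tot_amnt,ignore,name\n2,1234,-1,test2\n'): "schema2",
}

# concat aligns by column name, so both schemas land in id/name/value
df = pd.concat([parsers[v](k) for k, v in files.items()], sort=False)
```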
David Waterworth

This is what I ended up doing and found to be the cleanest solution. Thanks David for your help.

import pandas as pd

dict1 = {'member_number': 'id', 'full name': 'name', …}
dict2 = {'member_id': 'id', 'name': 'name', …}

parsers = {
    "schema1": lambda f, colmap: pd.read_csv(f, index_col=False, usecols=list(colmap.keys())),
    "schema2": lambda f, colmap: pd.read_csv(f, index_col=False, usecols=list(colmap.keys())),
}

# map each schema to its file and column mapping (named schemas to avoid shadowing the builtin map)
schemas = {
    'schema1': ('a_file.csv', dict1),
    'schema2': ('b_file.csv', dict2),
}

total = []
for k, (path, colmap) in schemas.items():
    d = parsers[k](path, colmap)
    d.rename(columns=colmap, inplace=True)
    total.append(d)
final_df = pd.concat(total, sort=False)
juggernaut