Here is a link from a previous post. I am citing P.R.'s response below.
import pandas as pd
import glob
interesting_files = glob.glob("*.csv")
df_list = []
for filename in sorted(interesting_files):
    df_list.append(pd.read_csv(filename))
full_df = pd.concat(df_list)
full_df.to_csv('output.csv')
I am wondering how to modify the above using pandas. Specifically, I am trying to move recursively through a directory, concatenate all of the CSV headers and their respective row values, and write the result out to one file. P.R.'s approach results in all of the headers and their corresponding values being stacked on top of each other. My constraints are:
Write out the headers and their corresponding values without "stacking", essentially concatenated one after the other.
If the column headers in one file match those in another file, there should be no repetition. Only the values should be appended as they are written to the single CSV file.
Since each file has different column headers and a different number of them, all of the headers should be included. Nothing should be deleted.
I have tried the following as well:
import pandas as pd
import csv
import glob
import os
path = '.'
files_in_dir = [f for f in os.listdir(path) if f.endswith('csv')]
for filenames in files_in_dir:
    df = pd.read_csv(filenames)
    df.to_csv('out.csv', mode='a')
Here are two sample CSV files:
ID,Type,ACH,SH,LL,SS,LS,ISO,MID,Pass,TID,CID,TErrors
12821767,Query,,,,,,,,,,,
and
Type,ID,CC,CCD,Message,MemberIdentifier,NPass,UHB,UAP,NewAudioPIN,AType,ASuufix,Member,Share,Note,Flag,Card,MA,Preference,ETF,AutoT,RType,Locator,ISO,MID,Pass,TID,CID,Errors
UMember,12822909,True,10/31/2013 5:22:19 AM,,,,False,False,,,,,,,,,,,,,Member,,,,,,,
Based on the above two examples, the output should be something along the lines of:
ID,Type,ACH,SH,LL,SS,LS,ISO,MID,Pass,TID,CID,TErrors,CC,CCD,Message,MemberIdentifier,NPass,UHB,UAP,NewAudioPIN,AType,ASuufix,Member,Share,Note,Flag,Card,MA,Preference,ETF,AutoT,RType,Locator,Errors
12822909,UMember,,,,,,,,,,,,True,10/31/2013 5:22:19 AM,,,,False,False,,,,,,,,,,,,,Member,,
12821767,Query ,,,,,,,,,,,,,,,,,,,,,,,,, etc.
(For the second row, every column that exists only in the second sample should be left empty, i.e. filled in with just the ',' delimiter, since the first sample has no corresponding header.)
As one can see, the second sample has more column headers. Moreover, some of the headers are the same (but in a different order). I am trying to combine all of these, along with their values, following the above requirements. Is the best approach to merge, or to apply a custom function on top of one of pandas' built-in methods?
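To make the goal concrete, here is a rough sketch of the behavior I am after. It assumes that `pd.concat` aligning columns by name is acceptable for the "no repetition" requirement, and that `glob` with `recursive=True` is enough for the directory walk. The function name `combine_csvs` and its parameters are hypothetical, chosen only for illustration:

```python
import glob
import os

import pandas as pd


def combine_csvs(root_dir, output_path):
    """Concatenate every CSV under root_dir into one file (hypothetical helper)."""
    pattern = os.path.join(root_dir, "**", "*.csv")
    out_abs = os.path.abspath(output_path)
    # recursive=True lets "**" match nested subdirectories; the output file
    # itself is skipped in case it lives under root_dir from an earlier run.
    csv_paths = sorted(
        p for p in glob.glob(pattern, recursive=True)
        if os.path.abspath(p) != out_abs
    )
    if not csv_paths:
        raise ValueError("no CSV files found under " + root_dir)
    # dtype=str keeps every value as text, so IDs are not converted to floats
    # when a column ends up with missing entries.
    frames = [pd.read_csv(p, dtype=str) for p in csv_paths]
    # concat aligns columns by name: shared headers appear once, and columns
    # missing from a file are filled with NaN (written as empty fields).
    combined = pd.concat(frames, ignore_index=True, sort=False)
    combined.to_csv(output_path, index=False)
    return combined
```

With `sort=False`, the column order should follow the order in which headers are first seen, so which file is read first decides where the shared columns land. Calling `combine_csvs('.', 'output.csv')` would be the equivalent of the snippets above.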