
Here is a link from a previous post. I am citing P.R.'s response below.

    import pandas as pd
    import glob

    interesting_files = glob.glob("*.csv")
    df_list = []
    for filename in sorted(interesting_files):
        df_list.append(pd.read_csv(filename))
    full_df = pd.concat(df_list)

    full_df.to_csv('output.csv')

I am wondering how to modify the above, using pandas. Specifically, I am attempting to recursively move through a directory, concatenate all of the CSV headers and their respective row values, and then write everything out to one file. Using P.R.'s approach results in all of the headers and their corresponding values being stacked on top of each other. My constraints are:

  • Writing out the headers and their corresponding values (without "stacking"), essentially concatenated one after the other

  • If the column headers in one file match another file's, there should be no repetition; only the values should be appended as they are written to the one output CSV file.

  • Since each file may have different column headers, and a different number of them, these should all be added. Nothing should be deleted.

I have tried the following as well:

import pandas as pd
import csv
import glob
import os

path = '.'
files_in_dir = [f for f in os.listdir(path) if f.endswith('.csv')]

for filename in files_in_dir:
    df = pd.read_csv(filename)
    # Each append writes its own header row, so the headers end up stacked.
    df.to_csv('out.csv', mode='a')
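The stacking happens because each to_csv(..., mode='a') call writes its own header row. As a rough sketch (the output name 'out.csv' is just a placeholder), the repeated header can be suppressed by writing it only on the first append, although this still does not align the differing columns across files, which is the real problem:

import os
import pandas as pd

out_path = 'out.csv'
if os.path.exists(out_path):
    os.remove(out_path)  # start with a clean output file

for filename in [f for f in os.listdir('.') if f.endswith('.csv')]:
    df = pd.read_csv(filename)
    # Write the header only if the output file does not exist yet;
    # later iterations append rows only. Columns are still not aligned
    # across files with different headers.
    df.to_csv(out_path, mode='a', header=not os.path.exists(out_path), index=False)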

Here are two sample CSVs:

ID,Type,ACH,SH,LL,SS,LS,ISO,MID,Pass,TID,CID,TErrors
12821767,Query,,,,,,,,,,,

and

Type,ID,CC,CCD,Message,MemberIdentifier,NPass,UHB,UAP,NewAudioPIN,AType,ASuufix,Member,Share,Note,Flag,Card,MA,Preference,ETF,AutoT,RType,Locator,ISO,MID,Pass,TID,CID,Errors
UMember,12822909,True,10/31/2013 5:22:19 AM,,,,False,False,,,,,,,,,,,,,Member,,,,,,,

Based on the above two examples, the output should be something along the lines of:

ID,Type,ACH,SH,LL,SS,LS,ISO,MID,Pass,TID,CID,TErrors,CC,CCD,Message,MemberIdentifier,NPass,UHB,UAP,NewAudioPIN,AType,ASuufix,Member,Share,Note,Flag,Card,MA,Preference,ETF,AutoT,RType,Locator,Errors
12822909,UMember,,,,,,,,,,,,True,10/31/2013 5:22:19 AM,,,,False,False,,,,,,,,,,,,,Member,,
12821767,Query,,,,,,,,,,,,,,,,,,,,,,,,, etc.

(for the second row, all of the header columns that appear only in the second sample should be filled in with just the ',' delimiter, since there is no corresponding header in the first sample)

As one can see, the second sample has more column headers, and some of the headers are the same (but in a different order). I am trying to combine all of these, along with their values, following the above requirements. I am wondering whether the best method is a merge or a custom function built on one of pandas' built-in methods?
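For reference, here is a minimal sketch of the plain pd.concat route I have been experimenting with (file names are placeholders for the two samples above; depending on the pandas version the resulting columns may come out sorted, and index=False stops pandas writing the row index as an extra column):

import glob
import pandas as pd

frames = [pd.read_csv(name) for name in sorted(glob.glob('*.csv'))]

# concat takes the union of all columns; rows from a file that lacks a
# column get NaN there, which becomes an empty field in the CSV output.
combined = pd.concat(frames, ignore_index=True)

combined.to_csv('output.csv', index=False)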

    For the example you posted, what is the desired result? – unutbu Jul 09 '15 at 19:58
  • Is there any particular reason you want to use `pandas` here - I'd imagine it'd be easier without it... (interesting new name @HappyLeapSecond!) – Jon Clements Jul 09 '15 at 19:58
  • @HappyLeapSecond I have tried to make this clear with the bullet points - it is rather hard to explain without drawing it out. All of the headers should be in one "row"/"line" and the values in the subsequent "line". The common headers should not repeat, but the values should be added corresponding to each header. Order should be maintained and no header or value should be deleted. Any new headers should be automatically added (along with its value). – 890319ahlusar Jul 09 '15 at 20:09
  • @JonClements I thought that pandas would be the most efficient method. If you have alternative suggestion, I welcome your thoughts. – 890319ahlusar Jul 09 '15 at 20:10
  • @890319ahlusar is there just a single data row per CSV file? – Jon Clements Jul 09 '15 at 20:23
  • @JonClements Yes. One row for the headers and one row for the data – 890319ahlusar Jul 09 '15 at 20:38
  • Would a While loop work better than recursion? – AsheKetchum Feb 24 '17 at 14:47

2 Answers

1

A non-pandas approach that uses an OrderedDict and the csv module.

from glob import iglob
import csv
from collections import OrderedDict

files = sorted(iglob('*.csv'))
header = OrderedDict()
data = []
for filename in files:
    with open(filename, 'rb') as fin:
        csvin = csv.DictReader(fin)
        try:
            header.update(OrderedDict.fromkeys(csvin.fieldnames))
            data.append(next(csvin))
        except TypeError:
            print filename, 'was empty'
        except StopIteration:
            print filename, "didn't contain a row"

with open('output_filename.csv', 'wb') as fout:
    csvout = csv.DictWriter(fout, fieldnames=list(header))
    csvout.writeheader()
    csvout.writerows(data)

Given your example input, this gives you:

ID,Type,ACH,SH,LL,SS,LS,ISO,MID,Pass,TID,CID,TErrors,CC,CCD,Message,MemberIdentifier,NPass,UHB,UAP,NewAudioPIN,AType,ASuufix,Member,Share,Note,Flag,Card,MA,Preference,ETF,AutoT,RType,Locator,Errors
12821767,Query,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
12822909,UMember,,,,,,,,,,,,True,10/31/2013 5:22:19 AM,,,,False,False,,,,,,,,,,,,,Member,,
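The snippet above is written for Python 2 (print statements and 'rb'/'wb' file modes). A minimal sketch of the same approach adapted for Python 3 might look like this:

from glob import iglob
import csv
from collections import OrderedDict

files = sorted(iglob('*.csv'))
header = OrderedDict()   # preserves first-seen column order, no duplicates
data = []
for filename in files:
    # newline='' is what the csv module expects on Python 3
    with open(filename, newline='') as fin:
        csvin = csv.DictReader(fin)
        try:
            # fieldnames is None for an empty file, which raises TypeError
            header.update(OrderedDict.fromkeys(csvin.fieldnames))
            # next() raises StopIteration if the file has no data row
            data.append(next(csvin))
        except TypeError:
            print(filename, 'was empty')
        except StopIteration:
            print(filename, "didn't contain a row")

with open('output_filename.csv', 'w', newline='') as fout:
    csvout = csv.DictWriter(fout, fieldnames=list(header))
    csvout.writeheader()
    # DictWriter fills any missing keys with '' by default (restval)
    csvout.writerows(data)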
Jon Clements
  • I had a very similar solution to your above non-Pandas approach. The caveat is that all of the keys/column names that repeat in other csv files (and have different corresponding row values) are not appended sequentially. For example, in the above two samples, there are two different values for the column header "Type": UMember and Query. Ideally Query should be appended after UMember. If there are other similar column headers, then their values should follow the same rule. The other requirements are still controlled for. – 890319ahlusar Jul 10 '15 at 11:21
  • @890319ahlusar you're really going to have to spend a few moments updating your question with the exact expected output from your input examples then... – Jon Clements Jul 10 '15 at 11:24
  • I will do so promptly @JonClements – 890319ahlusar Jul 10 '15 at 11:29
  • @890319ahlusar had to pop out - updated with what I believe to be your requirements... – Jon Clements Jul 10 '15 at 13:15
  • @890319ahlusar going to disclose what that error is? :p – Jon Clements Jul 10 '15 at 13:25
  • I receive the following error (thank you, again, for your feedback): --------------------------------------------------------------------------- TypeError Traceback (most recent call last) ---> 11 header.update(OrderedDict.fromkeys(csvin.fieldnames)) 12 data.append(next(csvin)) __init__.py in fromkeys(cls, iterable, value) 220 ''' 221 self = cls() --> 222 for key in iterable: 223 self[key] = value 224 return self TypeError: 'NoneType' object is not iterable – 890319ahlusar Jul 10 '15 at 13:27
  • @890319ahlusar you probably have an empty file then? – Jon Clements Jul 10 '15 at 13:28
  • @890319ahlusar wrapped it in a try except so you can see if a file's empty or whether it only contained headers – Jon Clements Jul 10 '15 at 13:31
  • You are correct @JonClements - one of the 1134 files was empty! Cheers! – 890319ahlusar Jul 10 '15 at 13:32
-1

In pandas, you can both append column names and reorder the data frame easily. See this article on merging frames.

To append frames and re-order them, you could use the following. Re-indexing is as simple as using a list. There are more solutions here.

import os
import pandas as pd

path = '.'  # directory containing the CSV files

dfList = []
for filename in [os.path.join(path, x) for x in os.listdir(path) if x.endswith('.csv')]:
    dfList.append(pd.read_csv(filename))
df = pd.concat(dfList)
df.to_csv('out.csv', mode='w')

With a list comprehension (reusing path from above), this would be:

import os
import pandas as pd

pd.concat([pd.read_csv(os.path.join(path, x)) for x in os.listdir(path) if x.endswith('.csv')]).to_csv('out.csv', mode='w')

If you want to reindex anything, just use a list.

cols = sorted(df.columns)
df = df[cols]
# or, equivalently
df = df[sorted(df.columns)]
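As a usage sketch combining the pieces above (the path and file pattern are assumptions), the frames can be concatenated and then reindexed so that the output keeps the first file's columns first instead of sorted order:

import os
import pandas as pd

path = '.'  # assumed directory holding the CSV files
csv_files = sorted(os.path.join(path, x) for x in os.listdir(path) if x.endswith('.csv'))

df = pd.concat([pd.read_csv(f) for f in csv_files], ignore_index=True)

# Put the first file's columns first and append any columns that only
# appear in later files; reindexing with a list imposes that order.
first_cols = list(pd.read_csv(csv_files[0], nrows=0).columns)
remaining = [c for c in df.columns if c not in first_cols]
df = df[first_cols + remaining]

df.to_csv('out.csv', index=False)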
Andrew Scott Evans
  • It would be more performant to read df from each csv into a list and then concat all the dfs in the list rather than concatenating each one – EdChum Jul 09 '15 at 20:17
  • @EdChum - Is it possible for you to provide an example – 890319ahlusar Jul 09 '15 at 20:39
  • @890319ahlusar reading the article will also explain more complex functions but concat should cover everything – Andrew Scott Evans Jul 09 '15 at 20:42
  • @AndrewScottEvans Thanks for the reference materials. I have reviewed these thoroughly. However, it appears that one can only merge two data frames at a time. Is there a sub function that you would recommend that allows me to check for the same headers and then append the corresponding value of that header if the check is true? I want to understand the bowels of Pandas :) – 890319ahlusar Jul 09 '15 at 20:59
  • @890319ahlusar Append them in a list like so pandas.concat([df1,df2,df3]). The best example off the bat is at this post. I just saw the code. It also goes over append v. concat. – Andrew Scott Evans Jul 09 '15 at 21:04
  • @AndrewScottEvans I attempted this following your suggestion - it works! Cheers! – 890319ahlusar Jul 14 '15 at 11:57