18

I am a beginner with Python. I have multiple CSV files (more than 10), all with the same number of columns. I would like to merge all of them into a single CSV file, without the headers repeated.

So essentially I need just the first row with all the headers, and from then on I need all the rows from all the CSV files merged. How do I do this?

Here's what I tried so far.

import glob
import csv



with open('output.csv','wb') as fout:
    wout = csv.writer(fout,delimiter=',') 
    interesting_files = glob.glob("*.csv") 
    for filename in interesting_files: 
        print 'Processing',filename 
    # Open and process file
        h = True
        with open(filename,'rb') as fin:
                fin.next()#skip header
        for line in csv.reader(fin,delimiter=','):
                wout.writerow(line)
Jean-François Fabre
dotpy_novice

5 Answers

37

If you are on a Linux system:

head -1 directory/one_file.csv > output.csv   ## writing the header to the final file
tail -n +2 directory/*.csv >> output.csv      ## appending the content of every csv, starting at the second line, to the final file
Samiron
valentin
    I needed to add '-q' (--quiet) to the tail command to avoid extra text being included in the resulting file. – ElMesa May 11 '17 at 19:13
  • as per @ElMesa, I had to use "tail **-q** -n +2 *.csv" on Windows – sanj2sanj Apr 24 '18 at 13:32
  • @jean-françois-fabre you can tune the arguments of head and tail to use/skip x lines – valentin Jun 30 '19 at 19:20
  • yes, but I mean: the python csv module/pandas are able to properly detect a line. No need to know how many lines there are. I admit that if you want to do that with a known title, you can do it (even if it's not a generic solution) – Jean-François Fabre Jun 30 '19 at 19:25
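As the comments above note, GNU tail prints a "==> file <==" banner before each input when given several files, and -q suppresses it. A small self-contained sketch of the whole trick (file names and contents made up here):

```shell
# Build two toy CSVs in a subdirectory, then merge them:
mkdir -p demo
printf 'a,b\n1,2\n' > demo/one.csv
printf 'a,b\n3,4\n' > demo/two.csv

head -1 demo/one.csv > output.csv        # header once, from any one file
tail -q -n +2 demo/*.csv >> output.csv   # -q: no per-file banners; +2: skip each header
cat output.csv
```

Keeping output.csv outside the directory being globbed avoids accidentally feeding the output back in as an input.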
22

While I think that the best answer is the one from @valentin, you can do this without using the csv module at all:

import glob

interesting_files = glob.glob("*.csv") 

header_saved = False
with open('output.csv','w') as fout:  # 'wb' on Python 2
    for filename in interesting_files:
        with open(filename) as fin:
            header = next(fin)
            if not header_saved:
                fout.write(header)
                header_saved = True
            for line in fin:
                fout.write(line)
m.wasowski
  • @m.wasowski This one worked, but all that it simply does is to merge the files. Even the headers are merged. I have 10 CSV files all of which have same headers. So essentially I get the header rows 10 times in the output file with this code. Any modifications suggested? – dotpy_novice May 19 '15 at 21:55
  • 1
    my bad, I moved reading the header into the `if` block unnecessarily. See fixed version; @PadraicCunningham leftover from other idea, fixed, thanks. – m.wasowski May 19 '15 at 22:09
  • it doesn't work if csv files have title lines on more than one line – Jean-François Fabre Jun 26 '19 at 19:45
8

If you don't mind the overhead, you could use pandas, which ships with common Python distributions. If you plan to do more with spreadsheet-like tables, I recommend using pandas rather than trying to write your own routines.

import pandas as pd
import glob

interesting_files = glob.glob("*.csv")
df_list = []
for filename in sorted(interesting_files):
    df_list.append(pd.read_csv(filename))
full_df = pd.concat(df_list, ignore_index=True)

full_df.to_csv('output.csv', index=False)  # index=False: don't write the row index as an extra column

Just a little more on pandas. Because it is made to deal with spreadsheet-like data, it knows the first line is a header. When reading a CSV it separates the data from the header, which is kept as column metadata of the DataFrame, the standard datatype in pandas. If you concat several of these DataFrames, it concatenates the data parts and aligns the columns by header name; if the headers are not the same, the mismatch shows up as extra columns filled with NaN rather than rows silently merged in the wrong order. Probably a good thing in case your directory is polluted with CSV files from another source.
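Worth verifying on your own pandas version: in recent pandas, concat on frames with differing headers aligns by column name and fills the gaps with NaN. A toy sketch (frame contents made up here):

```python
import pandas as pd

# Two tiny frames that share column "a" but disagree on the rest:
df1 = pd.DataFrame({"a": [1], "b": [2]})
df2 = pd.DataFrame({"a": [3], "c": [4]})

# concat aligns columns by name; missing cells become NaN.
merged = pd.concat([df1, df2], ignore_index=True)
print(sorted(merged.columns))  # ['a', 'b', 'c']
```

If you want mismatched headers to be impossible rather than visible, you can compare the column sets of the frames before concatenating.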

Another thing: I just added sorted() around interesting_files. I assume your files are named in order and that order should be kept. I am not sure about glob, but the OS functions do not necessarily return files sorted by name.

P.R.
  • 1
    I tried this, but I get the following error: Import Error: No module named Pandas. – dotpy_novice May 19 '15 at 21:16
  • 1
    it seems pandas is not installed. Try in the terminal `pip install pandas`. If you are using a simple python installation, you might benefit from installing a python distribution such as anaconda (http://continuum.io/downloads) to start with the most common packages installed. – P.R. May 19 '15 at 21:17
  • thank you so much! I will try pandas. Looks like it should make my job easier. – dotpy_novice May 19 '15 at 21:30
  • Careful here, I just debugged a script for an hour because glob decided to format my strings as integers and dropped a zero-pad from a bunch of my strings. – Blairg23 Jan 05 '16 at 01:31
  • you mean you got the filenames in the wrong order from `sorted`? – P.R. Jan 05 '16 at 10:50
  • I think it was an issue with SublimeText, because I tried the same script today and it didn't cast them. Must have been a value in cache or something. Ignore my comment. – Blairg23 Jan 05 '16 at 18:23
1

Your attempt is almost working, but the issues are:

  • you open the file for reading, but the read loop sits outside the with block, so the file is closed before the rows are written.
  • you never write the header row. You have to write it once.
  • you also have to exclude output.csv from the glob, or else the output is also an input!

Here's the corrected code, passing the csv reader object directly to the writerows method for shorter and faster code, and writing the header from the first file to the output file.

import glob
import csv

output_file = 'output.csv'
header_written = False

with open(output_file,'w',newline="") as fout:  # just "wb" in Python 2
    wout = csv.writer(fout,delimiter=',')
    # filter out the output file
    interesting_files = [x for x in glob.glob("*.csv") if x != output_file]
    for filename in interesting_files:
        print('Processing {}'.format(filename))
        with open(filename,newline="") as fin:
            cr = csv.reader(fin,delimiter=",")
            header = next(cr)  # read (and consume) the header
            if not header_written:
                wout.writerow(header)
                header_written = True
            wout.writerows(cr)

Note that solutions using raw line-by-line processing miss an important point: if the header spans multiple lines (a quoted field containing a newline), they fail miserably, botching the title line or repeating part of it several times, effectively corrupting the file.

The csv module (and pandas, too) handles those cases gracefully.
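The multi-line header point is easy to see in a tiny sketch (data made up here): a quoted field may legally contain a newline, and csv.reader keeps it as one logical row, while naive line-by-line code would split it in two.

```python
import csv
import io

# A header whose second field contains an embedded newline:
data = 'col1,"col\n2"\n1,2\n'

# csv.reader parses quoted newlines, so this is two rows, not three.
rows = list(csv.reader(io.StringIO(data)))
print(len(rows))  # 2
```

A line-oriented script would treat the same input as three lines and either skip too little or duplicate part of the header.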

Jean-François Fabre
0

Your indentation is wrong: you need to put the loop inside the with block. You can also pass the csv reader object straight to writer.writerows.

import csv
import glob

with open('output.csv','wb') as fout:
    wout = csv.writer(fout)
    interesting_files = glob.glob("*.csv")
    for filename in interesting_files:
        print 'Processing',filename
        with open(filename,'rb') as fin:
            cr = csv.reader(fin)
            next(cr) # skip header
            wout.writerows(cr)
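On Python 3 the same fix might look like the sketch below (file names and contents made up here); the notable differences are text mode with newline="" and skipping the output file inside the glob.

```python
import csv
import glob
import os
import tempfile

# Build two toy input CSVs in a scratch directory:
workdir = tempfile.mkdtemp()
for name, rows in [("a.csv", ["h1,h2", "1,2"]), ("b.csv", ["h1,h2", "3,4"])]:
    with open(os.path.join(workdir, name), "w") as f:
        f.write("\n".join(rows) + "\n")

out_path = os.path.join(workdir, "output.csv")
with open(out_path, "w", newline="") as fout:  # newline="" per the csv docs
    wout = csv.writer(fout)
    first = True
    for filename in sorted(glob.glob(os.path.join(workdir, "*.csv"))):
        if os.path.basename(filename) == "output.csv":
            continue  # don't read the output back in as input
        with open(filename, newline="") as fin:
            cr = csv.reader(fin)
            header = next(cr)       # consume the header of every file
            if first:
                wout.writerow(header)  # ...but write it only once
                first = False
            wout.writerows(cr)
```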
Padraic Cunningham