
I have roughly 20 CSV files (all with headers) that I would like to merge into one CSV file.

Looking online, one way I found was to use the terminal command:

cat *.csv > file.csv

This worked just fine, but the problem is that, since every CSV file comes with a header, all of those headers also end up in the merged file.

Is there a terminal command or Python script with which I can merge all those CSV files into one and keep only a single header?

Thank you so much

Nayden Van
  • By which criterion can we detect the first CSV file, whose header should be copied, and distinguish it from the other files to be merged, whose headers should be ignored? Can you ensure that all CSV files contain the same columns (same order, names and number)? – albert Jul 28 '21 at 09:28

4 Answers


You can do this with awk:

awk '(NR == 1) || (FNR > 1)' *.csv > file.csv

FNR refers to the record number (typically the line number) in the current file and NR refers to the total record number. So the first line of the first file is accepted and the first lines of the subsequent files are ignored.

This does assume that all your csv files have the same number of columns in the same order.

Alex
  • I have 2 files that have 2 extra columns. This will cause the data not to be imported correctly, right? – Nayden Van Jul 28 '21 at 09:41
  • That will cause a problem; your best option is probably to use [pandas](https://pandas.pydata.org) and a small custom script (see the sketch after these comments). – Alex Jul 28 '21 at 09:44
  • Thank you so much, I will do that. Otherwise I can just split the files and merge them based on their columns. Thank you very much – Nayden Van Jul 28 '21 at 09:46
  • Or slightly extend your Awk script to add empty columns if the input line has too few fields. – tripleee Jul 28 '21 at 09:47
  • @tripleee that's a bit beyond my awk abilities, how would you do it? – Alex Jul 28 '21 at 09:48
  • `if (NF<5) { for (i=NF+1; i<=5; ++i) $i = "" }` – tripleee Jul 28 '21 at 09:55
  • This will only work if you know the number of fields ahead of time and process the exceptional files first? – Alex Jul 28 '21 at 10:17
  • If you can trust the first file to have the correct number of fields, go with that. Processing the first line in each file before going for a full merge is a good secondary option, but still technically requires two passes in Awk. – tripleee Jul 28 '21 at 13:37
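
If the pandas route suggested in the comments is of interest, here is a minimal sketch (assuming all the files are well-formed CSVs and that columns missing from a file should simply be left empty; the output name file.csv is just an example). pandas aligns columns by header name, so files with extra columns don't break the merge:

import glob
import pandas as pd

# Read every CSV in the current directory; pandas aligns columns by name,
# so files with extra columns just produce empty cells for the other files.
frames = [pd.read_csv(path) for path in sorted(glob.glob("*.csv"))]
merged = pd.concat(frames, ignore_index=True)
merged.to_csv("file.csv", index=False)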

My vote goes to the Awk solution, but since this question explicitly asks about Python, here is a solution for that.

import csv
import sys


writer = csv.writer(sys.stdout)

firstfile = True
for file in sys.argv[1:]:
    with open(file, 'r', newline='') as rawfile:  # newline='' is recommended for the csv module
        reader = csv.reader(rawfile)
        for idx, row in enumerate(reader):
            # enumerate() is zero-based by default; 0 is first line
            if idx == 0 and not firstfile:
                continue
            writer.writerow(row)
    firstfile = False

Usage: python script.py first.csv second.csv etc.csv >final.csv

This simple script doesn't really benefit from any Python features; but if you need to count the number of fields in non-trivial CSV files (i.e. with quoted fields which might contain a comma that isn't a separator), that's hard in Awk and trivial in Python, because the csv library already knows exactly how to handle that.
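
As a tiny illustration (hypothetical data, not part of the script above) of what the csv module buys you over a naive split:

import csv

line = 'Smith,"123 Main St, Apt 4",42'
print(line.split(','))           # naive split: 4 pieces, the quoted address is broken apart
print(next(csv.reader([line])))  # csv module: 3 fields, the comma inside the quotes is kept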

tripleee
  • This seems to be promising. Can I just ask you one thing: this handles the rows, but what about the columns? If one of the CSV files has an extra column, will it be written correctly? – Nayden Van Jul 28 '21 at 13:09
  • It just blindly copies input to output. If you want to force a specific number of columns, it's not hard to extend every row which is too short (`row.extend([''] * (desired_number_of_fields - len(row)))`), but perhaps that would be better handled as a new question with your requirements properly spelled out (see the sketch after these comments). – tripleee Jul 28 '21 at 13:35
  • Thank you so much for your time. I did open another question about this. – Nayden Van Jul 28 '21 at 18:25
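
For reference, a minimal sketch of that extension (with a hypothetical `desired_number_of_fields = 7`; adjust it to your data), padding every short row with empty fields before writing:

import csv
import sys

desired_number_of_fields = 7  # hypothetical target width; adjust to your data

writer = csv.writer(sys.stdout)

firstfile = True
for file in sys.argv[1:]:
    with open(file, 'r', newline='') as rawfile:
        reader = csv.reader(rawfile)
        for idx, row in enumerate(reader):
            if idx == 0 and not firstfile:
                continue  # skip the header of every file except the first
            if len(row) < desired_number_of_fields:
                # pad short rows so every output row has the same width
                row.extend([''] * (desired_number_of_fields - len(row)))
            writer.writerow(row)
    firstfile = False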

This command should work for you:

tail -qn +2 *.csv > file.csv

Do note, though, that each file needs to end with a newline; otherwise the last row of one file and the first row of the next get concatenated onto a single line (e.g. `1, 12, 2` instead of `1, 1` in row 1 and `2, 2` in row 2).
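
If you are not sure every file ends with a newline, here is a small sketch (assuming it is acceptable to modify the files in place) that appends one where it is missing before running the `tail` command:

import glob

# Append a final newline to any CSV file that does not already end with one,
# so that `tail -qn +2 *.csv` does not glue two rows together.
for path in glob.glob("*.csv"):
    with open(path, "rb+") as f:
        f.seek(0, 2)              # move to end of file
        if f.tell() == 0:
            continue              # empty file, nothing to do
        f.seek(-1, 2)             # look at the last byte
        if f.read(1) != b"\n":
            f.write(b"\n")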

veedata
  • The `-q` option is supported e.g. by GNU `tail` but is not POSIX. – tripleee Jul 28 '21 at 09:57
  • There is no "extra empty line"; a file which does not end with a newline is not a valid text file on POSIX. (Some `tail` implementations might add a newline if there isn't one already; I wouldn't rely on assumptions around this behavior.) – tripleee Jul 28 '21 at 12:37

The code below is what worked for me.

import csv
from datetime import datetime
import glob

# Timestamp used in the output file name, e.g. combined2021July28_0930.csv
Time = datetime.now().strftime("%Y%B%d_%H%M")

# Collect every CSV file in the current directory
inputFiles = glob.glob("*.csv")
print(inputFiles)

outputFile = "combined" + Time + ".csv"

# 'x' mode creates the output file and fails if it already exists
with open(outputFile, "x", newline="") as g:
    writer = csv.writer(g)
    for file in inputFiles:
        with open(file, "r", newline="") as f:
            reader = csv.reader(f)
            # copy every row (including each file's header) into the combined file
            for row in reader:
                writer.writerow(row)