I have 200 separate CSV files named from SH (1) to SH (200). I want to merge them into a single CSV file. How can I do it?
-
In what way would you merge them? (Concatenate lines, ...) – tur1ng Mar 25 '10 at 00:29
-
How do you want them merged? Each line in a CSV file is a row. So one simple option is to just concatenate all the files together. – Jon-Eric Mar 25 '10 at 00:31
-
Each file has two columns. I want to merge them into a single file with two columns consecutively. – Chuck Mar 25 '10 at 12:29
-
@Chuck: Howzabout taking all the responses in your comments (to the question, and to the answers) and updating your question? – tumultous_rooster Aug 17 '15 at 19:29
-
This question should be named "How to **concat**..." instead of "how to **merge**..." – colidyre Aug 05 '18 at 10:09
22 Answers
As ghostdog74 said, but this time with headers:
with open("out.csv", "ab") as fout:
    # First file:
    with open("sh1.csv", "rb") as f:
        fout.writelines(f)
    # Now the rest:
    for num in range(2, 201):
        with open("sh" + str(num) + ".csv", "rb") as f:
            next(f)  # Skip the header, portably
            fout.writelines(f)

-
Just a note: One can use the `with open` syntax and avoid manually `.close()`ing the files. – FatihAkici Jun 08 '18 at 18:18
-
What's the difference between `f.next()` and `f.__next__()`? When I use the former, I got `'_io.TextIOWrapper' object has no attribute 'next'` – Jia Gao Sep 08 '18 at 00:40
-
before ```fout.write(line)``` I would do: ```if line[-1] != '\n': line += '\n' ``` – shisui Oct 25 '18 at 00:48
-
@tsveti_iko: Portably, you'd do `next(f)`, which works on every version of Python from 2.6 onwards. – ShadowRanger Nov 14 '22 at 19:06
-
@JasonGoal: Python 3 changed the name of the special method for iterators from Py2's `.next` (which was not properly reserved) to `.__next__` (which, thanks to beginning and ending in `__`, can't be reused for other purposes by user code without violating language requirements). The portable approach is to use neither, and just call the top-level function `next` on the iterator, making it `next(f)`. – ShadowRanger Nov 14 '22 at 19:08
-
You can just use `sed 1d sh*.csv > merged.csv`.

Sometimes you don't even have to use Python!

-
Copy the header line from one file: `sed -n 1p some_file.csv > merged_file.csv`. Copy all but the first line from all other files: `sed 1d *.csv >> merged_file.csv` – behas Oct 11 '11 at 17:39
-
@blinsay It adds the header in each CSV file to the merged file as well though. – Mina May 02 '14 at 01:51
-
How do you use this command without copying the header information for each subsequent file after the first one? I seem to be getting the header info popping up repeatedly. – Joe Aug 27 '14 at 04:57
-
To remove the header, do it in a loop: `for f in mydir/*.csv; do sed 1d "$f" >> merged.csv; done` – Noumenon Jun 03 '19 at 16:14
-
I accidentally did `sed 1d *.csv > merged.csv` and this ran for a while before my computer crashed because of no storage space left! :( – Nanashi No Gombe Jul 18 '19 at 12:56
-
This command works for me: `sed 1d *.csv >> merged_file.csv`, but it includes the index. The result file is 130 MB, while merging 10 CSV files another way gives 14.5 MB. That is why there was a storage shortage sometimes. My question: how do I eliminate the index? – tursunWali Mar 21 '21 at 07:49
Use the accepted answer to create a list of the CSV files that you want to append, and then run this code:
import pandas as pd
combined_csv = pd.concat([pd.read_csv(f) for f in filenames])
And if you want to export it to a single CSV file, use this:
combined_csv.to_csv("combined_csv.csv", index=False)

-
@wisty,@Andy, suppose all files have titles for each row - some rows with different titles. No headers for the 2 columns in each file. How can one merge, such that for each file only a column is added. – Gathide Jan 06 '17 at 11:14
-
Add sort: `combined_csv = pd.concat([pd.read_csv(f) for f in filenames], sort=False)` – sailfish009 Sep 19 '19 at 04:28
-
For thousands of CSV files, it takes so much time and a lot of memory! – Learner Jul 09 '22 at 07:47
fout = open("out.csv", "a")
for num in range(1, 201):
    for line in open("sh" + str(num) + ".csv"):
        fout.write(line)
fout.close()

-
Why the magic number 201? Why isn't it off by one? Is it exclusive? It might be related the question's "200 separate CSV files". An explanation of the code would be in order. – Peter Mortensen Apr 13 '23 at 07:45
I'm just going to throw another code example into the basket:
from glob import glob

with open('singleDataFile.csv', 'a') as singleFile:
    for csvFile in glob('*.csv'):
        for line in open(csvFile, 'r'):
            singleFile.write(line)
-
@Andy I fail to see the difference between stackoverflow reminding me to vote up an answer and me reminding people to share their appreciation (by voting up) if they found my answer useful. I know that this is not Facebook and I'm not a like-hunter.. – Norfeldt May 01 '14 at 10:20
-
It has been [discussed](http://meta.stackexchange.com/a/63440/186281) [previously](http://meta.stackexchange.com/a/194063/186281), and each time it has been [deemed](http://meta.stackexchange.com/questions/167155/comments-asking-for-upvotes-accepts) unacceptable. – Andy May 01 '14 at 13:02
-
An explanation would be in order. What is the gist of it? The glob thing? What are the advantages and disadvantages? Does it work on Windows? Behavior on case-insensitive and case-sensitive file systems? How is it different from previous answers? (But *without* "Edit:", "Update:", or similar - the answer should appear as if it was written today.) – Peter Mortensen Apr 13 '23 at 07:04
-
[An answer claims](https://stackoverflow.com/questions/2512386/how-can-i-merge-200-csv-files-in-python/25889148#25889148) it doesn't actually work. – Peter Mortensen Apr 13 '23 at 07:04
It depends what you mean by "merging" - do they have the same columns? Do they have headers? For example, if they all have the same columns and no headers, simple concatenation is sufficient: open the destination file for writing, loop over the sources opening each for reading, use `shutil.copyfileobj` from the open-for-reading source into the open-for-writing destination, close the source, and keep looping - use the `with` statement to do the closing on your behalf. If they have the same columns but also headers, you'll need a `readline` on each source file except the first, after you open it for reading and before you copy it into the destination, to skip the header line.

If the CSV files don't all have the same columns, then you need to define in what sense you're "merging" them (like a SQL JOIN? Or "horizontally", if they all have the same number of lines? Etc.) - it's hard for us to guess what you mean in that case.
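The copy-and-skip-headers routine described above can be sketched like this; the helper name `concat_csvs` is my own illustration, not from the answer:

```python
import shutil

def concat_csvs(dest_path, source_paths):
    """Concatenate CSV files that share one header, keeping the header once."""
    with open(dest_path, "w", newline="") as dest:
        for i, path in enumerate(source_paths):
            with open(path, "r", newline="") as src:
                if i > 0:
                    src.readline()  # skip the header of every file after the first
                # Stream the rest of the file without parsing it
                shutil.copyfileobj(src, dest)
```

For the question's files this would be called as `concat_csvs("merged.csv", [f"sh{i}.csv" for i in range(1, 201)])`, assuming that naming.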

-
Each file has two columns with headers. I want to merge them into a single file with two columns consecutively. – Chuck Mar 25 '10 at 14:25
It is quite easy to combine all files in a directory and merge them:
import glob
import csv

# Open result file
with open('output.txt', 'wb') as fout:
    wout = csv.writer(fout, delimiter=',')
    interesting_files = glob.glob("*.csv")
    h = True
    for filename in interesting_files:
        print 'Processing', filename
        # Open and process file
        with open(filename, 'rb') as fin:
            if h:
                h = False
            else:
                fin.next()  # Skip header
            for line in csv.reader(fin, delimiter=','):
                wout.writerow(line)

-
An explanation would be in order. What is the gist of it? E.g., why is it necessary to open the files to get information in them? What is the idea? (But *without* "Edit:", "Update:", or similar - the answer should appear as if it was written today.) – Peter Mortensen Apr 13 '23 at 07:08
You can simply use the built-in `csv` library. This solution will work even if some of your CSV files have slightly different column names or headers, unlike the other top-voted answers.
import csv
import glob

filenames = glob.glob("SH*.csv")
header_keys = []
merged_rows = []

for filename in filenames:
    with open(filename) as f:
        reader = csv.DictReader(f)
        merged_rows.extend(list(reader))
        header_keys.extend([key for key in reader.fieldnames if key not in header_keys])

with open("combined.csv", "w") as f:
    w = csv.DictWriter(f, fieldnames=header_keys)
    w.writeheader()
    w.writerows(merged_rows)
The merged file will contain all possible columns (`header_keys`) that can be found in the files. Any column absent from a file is rendered as blank/empty (while preserving the rest of the file's data).

Note:
- This won't work if your CSV files have no headers. In that case you can still use the `csv` library, but instead of using `DictReader` & `DictWriter`, you'll have to work with the basic `reader` & `writer`.
- This may run into issues when you are dealing with massive data, since the entirety of the content is being stored in memory (the `merged_rows` list).

-
For the writer, `fieldnames=` can be any iterable, so a set or even a dict will do and you can drop the `keys.extend([... if ... not])` list comprehension in favor of `keys.update(reader.fieldnames)`. – Zach Young Aug 30 '22 at 18:42
If the merged CSV is going to be used in Python, then just use `glob` to get a list of the files to pass to `fileinput.input()` via the `files` argument, then use the `csv` module to read it all in one go.
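A minimal sketch of that approach (the function name and the `SH*.csv` glob pattern are my assumptions, not from the answer):

```python
import csv
import fileinput

def read_all_rows(filenames):
    """Read several CSVs as one stream, keeping only the first file's header."""
    rows = []
    with fileinput.input(files=filenames) as f:
        for row in csv.reader(f):
            # isfirstline() is true on the first line of each file; lineno()
            # is cumulative, so this skips every header except the very first.
            if f.isfirstline() and f.lineno() > 1:
                continue
            rows.append(row)
    return rows
```

Usage would be something like `rows = read_all_rows(sorted(glob("SH*.csv")))`.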

A slight change to Norfeldt's code, as it does not actually work correctly. It should be as follows:
from glob import glob

with open('main.csv', 'a') as singleFile:
    for csv in glob('*.csv'):
        if csv == 'main.csv':
            pass
        else:
            for line in open(csv, 'r'):
                singleFile.write(line)

-
Re *"does not actually work correctly"*: On what system (incl. versions) and under what conditions (e.g., the set of files) did that not work? Can you be more specific than *"does not actually work correctly"*? In what way did it not work correctly? – Peter Mortensen Apr 13 '23 at 07:05
If you are working on Linux or Mac, you can do this.
from subprocess import call

script = "cat *.csv > merge.csv"
call(script, shell=True)

Or, you could just do
cat sh*.csv > merged.csv

-
This is operating system / shell dependent. What is assumed? [Linux](https://en.wikipedia.org/wiki/Linux_Mint)? – Peter Mortensen Jun 27 '23 at 14:03
Building on the solution made by Adders and later improved by varun, I implemented a little improvement to leave the whole merged CSV with only the main header:
from glob import glob

filename = 'main.csv'
with open(filename, 'a') as singleFile:
    first_csv = True
    for csv in glob('*.csv'):
        if csv == filename:
            pass
        else:
            header = True
            for line in open(csv, 'r'):
                if first_csv and header:
                    singleFile.write(line)
                    first_csv = False
                    header = False
                elif header:
                    header = False
                else:
                    singleFile.write(line)

You could import the csv module then loop through all the CSV files reading them into a list. Then write the list back out to disk.
import csv

rows = []
for f in (file1, file2, ...):
    reader = csv.reader(open(f, "rb"))
    for row in reader:
        rows.append(row)

writer = csv.writer(open("some.csv", "wb"))
writer.writerows(rows)
The above is not very robust, as it doesn't have any error handling, nor does it close any open files. This should work whether or not the individual files have one or more rows of CSV data in them. Also, I did not run this code, but it should give you an idea of what to do.

I have done it by implementing a function that expects an output file and paths of the input files.
The function copies the file content of the first file into the output file and then does the same for the rest of input files, but without the header line.
def concat_files_with_header(output_file, *paths):
    for i, path in enumerate(paths):
        with open(path) as input_file:
            if i > 0:
                next(input_file)  # Skip header
            output_file.writelines(input_file)
Usage example of the function:
if __name__ == "__main__":
    paths = [f"sh{i}.csv" for i in range(1, 201)]
    with open("output.csv", "w") as output_file:
        concat_files_with_header(output_file, *paths)

I modified what wisty said to work with Python 3.x, for those of you who have an encoding problem. Also, I use the os module to avoid hardcoding the number of files.
import os

def merge_all():
    dir = os.chdir('C:\python\data\\')
    fout = open("merged_files.csv", "ab")

    # First file:
    for line in open("file_1.csv", 'rb'):
        fout.write(line)

    # Now the rest:
    list = os.listdir(dir)
    number_files = len(list)
    for num in range(2, number_files):
        f = open("file_" + str(num) + ".csv", 'rb')
        f.__next__()  # Skip the header
        for line in f:
            fout.write(line)
        f.close()  # Not really needed
    fout.close()

Here is a script:
- Concatenating CSV files named `SH1.csv` to `SH200.csv`
- Keeping the headers
import glob
import re

# Looking for filenames like 'SH1.csv' ... 'SH200.csv'
pattern = re.compile(r"^SH([1-9]|[1-9][0-9]|1[0-9][0-9]|200)\.csv$")
file_parts = [name for name in glob.glob('*.csv') if pattern.match(name)]

with open("file_merged.csv", "wb") as file_merged:
    for (i, name) in enumerate(file_parts):
        with open(name, "rb") as file_part:
            if i != 0:
                next(file_part)  # Skip headers if not the first file
            file_merged.write(file_part.read())

Updating wisty's answer for Python 3:
fout = open("out.csv", "a")

# First file:
for line in open("sh1.csv"):
    fout.write(line)

# Now the rest:
for num in range(2, 201):
    f = open("sh" + str(num) + ".csv")
    next(f)  # Skip the header
    for line in f:
        fout.write(line)
    f.close()  # Not really needed
fout.close()

-
What is the magic number "2"? To compensate because the first answer using the magic number 201 was off by one? Or for some other reason? An explanation of the code and the reason for choices in it would be in order. – Peter Mortensen Apr 13 '23 at 07:49
Let's say you have two CSV files like these:
File csv1.csv
id,name
1,Armin
2,Sven
File csv2.csv
id,place,year
1,Reykjavik,2017
2,Amsterdam,2018
3,Berlin,2019
And you want the result to be like this (file csv3.csv):
id,name,place,year
1,Armin,Reykjavik,2017
2,Sven,Amsterdam,2018
3,,Berlin,2019
Then you can use the following snippet to do that:
import csv
import pandas as pd

# The file names
f1 = "csv1.csv"
f2 = "csv2.csv"
out_f = "csv3.csv"

# Read the files
df1 = pd.read_csv(f1)
df2 = pd.read_csv(f2)

# Get the keys
keys1 = list(df1)
keys2 = list(df2)

# Merge both files
for idx, row in df2.iterrows():
    data = df1[df1['id'] == row['id']]

    # If a row with such an id does not exist, add the whole row
    if data.empty:
        next_idx = len(df1)
        for key in keys2:
            df1.at[next_idx, key] = df2.at[idx, key]

    # If a row with such an id exists, add only the missing keys with their values
    else:
        i = int(data.index[0])
        for key in keys2:
            if key not in keys1:
                df1.at[i, key] = df2.at[idx, key]

# Save the merged file
df1.to_csv(out_f, index=False, encoding='utf-8', quotechar="", quoting=csv.QUOTE_NONE)
With the help of a loop, you can achieve the same result for multiple files, as in your case (200 CSV files).
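As one hedged sketch of such a loop, pandas' own outer merge can replace the manual row copying above; the function name and the shared `id` key column are assumptions for illustration:

```python
from functools import reduce

import pandas as pd

def merge_all_on_id(filenames):
    """Outer-merge any number of CSV files on their shared 'id' column."""
    dfs = [pd.read_csv(f) for f in filenames]
    # Each successive outer merge keeps rows from both sides; cells a file
    # does not provide come out as NaN (blank once written back to CSV).
    return reduce(lambda left, right: pd.merge(left, right, on="id", how="outer"), dfs)
```

For the example files above, `merge_all_on_id(["csv1.csv", "csv2.csv"]).to_csv("csv3.csv", index=False)` would produce the shown result.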

If the files aren't numbered in order, take the hassle-free approach below:
Python 3.6 on a Windows machine:
import pandas as pd
from glob import glob

interesting_files = glob("C:/temp/*.csv")  # It grabs all the CSV files from
                                           # the directory you mention here
df_list = []
for filename in sorted(interesting_files):
    df_list.append(pd.read_csv(filename))
full_df = pd.concat(df_list)

# Save the final file in the same/a different directory:
full_df.to_csv("C:/temp/merged_pandas.csv", index=False)

-
This trivially doesn't work (indentation - `IndentationError: expected an indented block`). Where did you copy it from? – Peter Mortensen Apr 13 '23 at 08:10
An easy-to-use function:
def csv_merge(destination_path, *source_paths):
    '''
    Merges all CSV files in source_paths into destination_path.

    :param destination_path: Path of a single CSV file; doesn't need to exist
    :param source_paths: Paths of the CSV files to be merged; need to exist
    :return: None
    '''
    with open(destination_path, "a") as dest_file:
        # Copy the first file whole, including its header
        with open(source_paths[0]) as src_file:
            for src_line in src_file:
                dest_file.write(src_line)
        # Copy the rest, skipping each file's header line
        for path in source_paths[1:]:
            with open(path) as src_file:
                next(src_file)
                for src_line in src_file:
                    dest_file.write(src_line)

import pandas as pd
import os

df = pd.read_csv("e:\\data science\\kaggle assign\\monthly sales\\Pandas-Data-Science-Tasks-master\\SalesAnalysis\\Sales_Data\\Sales_April_2019.csv")

files = [file for file in os.listdir("e:\\data science\\kaggle assign\\monthly sales\\Pandas-Data-Science-Tasks-master\\SalesAnalysis\\Sales_Data")]
for file in files:
    print(file)

all_data = pd.DataFrame()
for file in files:
    df = pd.read_csv("e:\\data science\\kaggle assign\\monthly sales\\Pandas-Data-Science-Tasks-master\\SalesAnalysis\\Sales_Data\\" + file)
    all_data = pd.concat([all_data, df])
all_data.head()
