751

I would like to read several CSV files from a directory into pandas and concatenate them into one big DataFrame. I have not been able to figure it out though. Here is what I have so far:

import glob
import pandas as pd

# Get data file names
path = r'C:\DRO\DCL_rawdata_files'
filenames = glob.glob(path + "/*.csv")

dfs = []
for filename in filenames:
    dfs.append(pd.read_csv(filename))

# Concatenate all data into one DataFrame
big_frame = pd.concat(dfs, ignore_index=True)

I guess I need some help within the for loop?

Peter Mortensen
jonas

20 Answers

858

See pandas: IO tools for all of the available .read_ methods.

Try the following code if all of the CSV files have the same columns.

I have added header=0 so that the first row of each CSV file is used as the column names.

import pandas as pd
import glob
import os

path = r'C:\DRO\DCL_rawdata_files' # use your path
all_files = glob.glob(os.path.join(path, "*.csv"))  # no leading slash: os.path.join discards path if the next part is absolute

li = []

for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)

frame = pd.concat(li, axis=0, ignore_index=True)

Or, with attribution to a comment from Sid.

all_files = glob.glob(os.path.join(path, "*.csv"))

df = pd.concat((pd.read_csv(f) for f in all_files), ignore_index=True)

  • It's often necessary to identify each sample of data, which can be accomplished by adding a new column to the dataframe.
  • pathlib from the standard library will be used for this example. It treats paths as objects with methods, instead of strings to be sliced.

Imports and Setup

from pathlib import Path
import pandas as pd
import numpy as np

path = r'C:\DRO\DCL_rawdata_files'  # or unix / linux / mac path

# Get the files from the path provided in the OP
# list() keeps the result reusable; a bare generator is exhausted after one option
files = list(Path(path).glob('*.csv'))  # .rglob to get subdirectories

Option 1:

  • Add a new column with the file name
dfs = list()
for f in files:
    data = pd.read_csv(f)
    # .stem is a pathlib attribute that gives the filename w/o the extension
    data['file'] = f.stem
    dfs.append(data)

df = pd.concat(dfs, ignore_index=True)

Option 2:

  • Add a new column with a generic name using enumerate
dfs = list()
for i, f in enumerate(files):
    data = pd.read_csv(f)
    data['file'] = f'File {i}'
    dfs.append(data)

df = pd.concat(dfs, ignore_index=True)

Option 3:

  • Create the dataframes with a list comprehension, and then use np.repeat to add a new column.
    • [f'S{i}' for i in range(len(dfs))] creates a list of strings to name each dataframe.
    • [len(df) for df in dfs] creates a list of lengths
  • Attribution for this option goes to this plotting answer.
# Read the files into dataframes
dfs = [pd.read_csv(f) for f in files]

# Combine the list of dataframes
df = pd.concat(dfs, ignore_index=True)

# Add a new column
df['Source'] = np.repeat([f'S{i}' for i in range(len(dfs))], [len(df) for df in dfs])

Option 4:

  • One liners using .assign to create the new column, with attribution to a comment from C8H10N4O2
df = pd.concat((pd.read_csv(f).assign(filename=f.stem) for f in files), ignore_index=True)

or

df = pd.concat((pd.read_csv(f).assign(Source=f'S{i}') for i, f in enumerate(files)), ignore_index=True)
Trenton McKinney
Gaurav Singh
  • 447
    The same thing more concise, and perhaps faster as it doesn't use a list: `df = pd.concat((pd.read_csv(f) for f in all_files))` Also, one should perhaps use `os.path.join(path, "*.csv")` instead of `path + "/*.csv"`, which makes it OS independent. – Sid Jan 23 '16 at 00:41
  • This is an excellent answer! – Adam Jaamour Sep 05 '22 at 09:32
366

An alternative to darindaCoder's answer:

import glob
import os
import pandas as pd

path = r'C:\DRO\DCL_rawdata_files'                     # use your path
all_files = glob.glob(os.path.join(path, "*.csv"))     # advisable to use os.path.join as this makes concatenation OS independent

df_from_each_file = (pd.read_csv(f) for f in all_files)
concatenated_df   = pd.concat(df_from_each_file, ignore_index=True)
# doesn't create a list, nor does it append to one
Community
Sid
  • 5
    @Mike @Sid the final two lines can be replaced by: `pd.concat((pd.read_csv(f) for f in all_files), ignore_index=True)`. The inner brackets are required by Pandas version 0.18.1 – Dr Fabio Gori Oct 31 '16 at 15:27
  • 16
    I recommend using `glob.iglob` instead of `glob.glob`; the first one returns an [iterator (instead of a list)](https://docs.python.org/3/library/glob.html#glob.iglob). – toto_tico Aug 02 '17 at 12:52
117
import glob
import os
import pandas as pd   
df = pd.concat(map(pd.read_csv, glob.glob(os.path.join('', "my_files*.csv"))))
Asocia
Jose Antonio Martin H
  • 5
    Excellent one-liner, especially useful if no read_csv arguments are needed! – rafaelvalle Nov 09 '17 at 19:38
  • 26
    If, on the other hand, arguments are needed, this can be done with lambdas: `df = pd.concat(map(lambda file: pd.read_csv(file, delim_whitespace=True), data_files))` – fiedl Apr 11 '18 at 14:46
  • 2
    ^ or with `functools.partial`, to avoid lambdas – cs95 May 27 '19 at 05:10
91

Almost all of the answers here are either unnecessarily complex (glob pattern matching) or rely on additional third-party libraries. You can do this in two lines using everything Pandas and Python (all versions) already have built in.

For a few files - one-liner

df = pd.concat(map(pd.read_csv, ['d1.csv', 'd2.csv','d3.csv']))

For many files

import os

filepaths = [f for f in os.listdir(".") if f.endswith('.csv')]
df = pd.concat(map(pd.read_csv, filepaths))

For No Headers

If you have specific things you want to change with pd.read_csv (e.g., no headers), you can make a separate function and call that with your map:

def f(i):
    return pd.read_csv(i, header=None)

df = pd.concat(map(f, filepaths))

This pandas line, which sets df, relies on three things:

  1. Python's map(function, iterable) calls the function (pd.read_csv) on each item of the iterable (every CSV path in filepaths).
  2. pandas' read_csv() function reads each CSV file as normal.
  3. pandas' concat() combines all of the resulting frames into the single df variable.
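The three steps above can be sketched end to end on throwaway data. This is only an illustration and not part of the original answer: the temporary directory, the file names a.csv and b.csv, and their contents are all made up for the example.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data: two small CSV files with the same columns.
tmp_dir = tempfile.mkdtemp()
samples = {"a.csv": "x,y\n1,2\n3,4\n", "b.csv": "x,y\n5,6\n"}
for name, text in samples.items():
    with open(os.path.join(tmp_dir, name), "w") as fh:
        fh.write(text)

# Build the list of paths (sorted so the row order is predictable).
filepaths = sorted(
    os.path.join(tmp_dir, f) for f in os.listdir(tmp_dir) if f.endswith(".csv")
)

# Step 1: map calls pd.read_csv once per path, lazily.
frames = map(pd.read_csv, filepaths)

# Steps 2 and 3: pd.concat consumes the iterator and stacks the frames row-wise.
df = pd.concat(frames, ignore_index=True)
```

Because map is lazy, no file is read until pd.concat starts pulling items from it.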
Peter Mortensen
robmsmt
  • 5
    or just `df = pd.concat(map(pd.read_csv, glob.glob('data/*.csv')))` – muon Mar 01 '19 at 18:05
  • 1
    I tried the method prescribed by @muon, but I have multiple files with headers (the headers are common). I don't want them to be concatenated in the dataframe. Do you know how I can do that? I tried `df = pd.concat(map(pd.read_csv(header=0), glob.glob('data/*.csv')))` but it gave an error "parser_f() missing 1 required positional argument: 'filepath_or_buffer'" – cadip92 Mar 03 '20 at 13:14
  • It's a little while since you asked... but I updated my answer to include answers without headers (or if you want to pass any change to read_csv). – robmsmt Nov 05 '21 at 03:01
82

Easy and Fast

Import two or more CSV files without having to make a list of names.

import glob
import pandas as pd

df = pd.concat(map(pd.read_csv, glob.glob('data/*.csv')))
Peter Mortensen
MrFun
63

The Dask library can read a dataframe from multiple files:

>>> import dask.dataframe as dd
>>> df = dd.read_csv('data*.csv')

(Source: https://examples.dask.org/dataframes/01-data-access.html#Read-CSV-files)

The Dask dataframes implement a subset of the Pandas dataframe API. If all the data fits into memory, you can call df.compute() to convert the dataframe into a Pandas dataframe.

Jouni K. Seppänen
22

I googled my way into Gaurav Singh's answer.

However, I have lately found it faster to do the manipulation in NumPy and assign the result to a dataframe once, rather than manipulating the dataframe itself iteratively, and that seems to hold for this solution too.

I sincerely want anyone hitting this page to consider this approach, but I don't want to attach this huge piece of code as a comment and make it less readable.

You can leverage NumPy to really speed up the dataframe concatenation.

import os
import glob
import pandas as pd
import numpy as np

path = "my_dir_full_path"
allFiles = glob.glob(os.path.join(path,"*.csv"))


np_array_list = []
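for file_ in allFiles:
    df = pd.read_csv(file_, index_col=None, header=0)
    # DataFrame.as_matrix() was removed in pandas 1.0; to_numpy() replaces it
    np_array_list.append(df.to_numpy())

comb_np_array = np.vstack(np_array_list)
big_frame = pd.DataFrame(comb_np_array)

# Placeholder: list every column name of your data here
big_frame.columns = ["col1", "col2", ...]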

Timing statistics:

total files :192
avg lines per file :8492
--approach 1 without NumPy -- 8.248656988143921 seconds ---
total records old :1630571
--approach 2 with NumPy -- 2.289292573928833 seconds ---
Peter Mortensen
SKG
17

A one-liner using map, but if you'd like to specify additional arguments, you could do:

import pandas as pd
import glob
import functools

df = pd.concat(map(functools.partial(pd.read_csv, sep='|', compression=None),
                    glob.glob("data/*.csv")))

Note: map by itself does not let you supply additional arguments.

Peter Mortensen
muon
14

If you want to search recursively (Python 3.5 or above), you can do the following:

from glob import iglob
import pandas as pd

path = r'C:\user\your\path\**\*.csv'

all_rec = iglob(path, recursive=True)     
dataframes = (pd.read_csv(f) for f in all_rec)
big_dataframe = pd.concat(dataframes, ignore_index=True)

Note that the three last lines can be expressed in one single line:

df = pd.concat((pd.read_csv(f) for f in iglob(path, recursive=True)), ignore_index=True)

You can find the documentation of ** here. Also, I used iglob instead of glob, as it returns an iterator instead of a list.
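The effect of ** together with recursive=True can be checked on a small throwaway directory tree. This is a minimal sketch using only the standard library; the directory layout and file names (top.csv, sub/nested.csv, notes.txt) are invented for the illustration.

```python
import os
import tempfile
from glob import iglob

# Hypothetical directory tree: one CSV at the top level, one in a subfolder.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "sub"))
for rel in ("top.csv", os.path.join("sub", "nested.csv"), "notes.txt"):
    with open(os.path.join(root, rel), "w") as fh:
        fh.write("a,b\n1,2\n")

pattern = os.path.join(root, "**", "*.csv")

# Without recursive=True, "**" behaves like "*": exactly one directory level down.
flat = sorted(os.path.basename(p) for p in iglob(pattern))

# With recursive=True, "**" matches zero or more directory levels.
deep = sorted(os.path.basename(p) for p in iglob(pattern, recursive=True))
```

Here flat only picks up the file inside sub, while deep also includes the top-level file, which is why the recursive flag matters for this pattern.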



EDIT: Multiplatform recursive function:

You can wrap the above into a multiplatform function (Linux, Windows, Mac), so you can do:

df = read_df_rec(r'C:\user\your\path', '*.csv')

Here is the function:

from glob import iglob
from os.path import join
import pandas as pd

def read_df_rec(path, fn_regex=r'*.csv'):
    return pd.concat((pd.read_csv(f) for f in iglob(
        join(path, '**', fn_regex), recursive=True)), ignore_index=True)
toto_tico
9

Inspired by MrFun's answer:

import glob
import pandas as pd

list_of_csv_files = glob.glob(directory_path + '/*.csv')
list_of_csv_files.sort()

df = pd.concat(map(pd.read_csv, list_of_csv_files), ignore_index=True)

Notes:

  1. By default, the list of files generated by glob.glob is not sorted. In many scenarios, however, it needs to be sorted, e.g. when analyzing the number of sensor frame drops vs. timestamp.

  2. In the pd.concat command, if ignore_index=True is not specified, it preserves the original indices from each dataframe (i.e. each individual CSV file in the list), and the main dataframe looks like

        timestamp    id    valid_frame
    0
    1
    2
    .
    .
    .
    0
    1
    2
    .
    .
    .
    

    With ignore_index=True, it looks like:

        timestamp    id    valid_frame
    0
    1
    2
    .
    .
    .
    108
    109
    .
    .
    .
    

    IMO, this is helpful when one wants to manually create a histogram of the number of frame drops vs. one-minute (or any other duration) bins and base the calculation on the very first timestamp, e.g. begin_timestamp = df['timestamp'][0]

    Without ignore_index=True, df['timestamp'][0] returns a Series containing the first timestamp from each of the individual dataframes rather than a single value.
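The difference described in note 2 can be reproduced without any files. The sketch below uses two small in-memory dataframes as hypothetical stand-ins for CSV files that were already read; the column values are made up.

```python
import pandas as pd

# Hypothetical stand-ins for two CSV files that were already read.
first = pd.DataFrame({"timestamp": [100, 101], "valid_frame": [1, 1]})
second = pd.DataFrame({"timestamp": [200, 201], "valid_frame": [0, 1]})

kept = pd.concat([first, second])                      # index: 0, 1, 0, 1
reset = pd.concat([first, second], ignore_index=True)  # index: 0, 1, 2, 3

# Label 0 occurs twice in `kept`, so the lookup returns a Series.
duplicated_lookup = kept["timestamp"][0]

# Label 0 is unique in `reset`, so the lookup returns a single value.
begin_timestamp = reset["timestamp"][0]
```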

Milan
8

Another one-liner, using a list comprehension, which allows passing arguments to read_csv.

df = pd.concat([pd.read_csv(f'dir/{f}') for f in os.listdir('dir') if f.endswith('.csv')])
Peter Mortensen
mjspier
  • Perfect for me, since my csv filenames all ended with the same words, but my filenames started with a different datetimestamp – DaReal Oct 27 '22 at 20:24
7

If multiple CSV files are zipped, you may use zipfile to read all and concatenate as below:

import zipfile
import pandas as pd

ziptrain = zipfile.ZipFile('yourpath/yourfile.zip')

train = [pd.read_csv(ziptrain.open(f)) for f in ziptrain.namelist()]

df = pd.concat(train)
Peter Mortensen
Nim J
7

Alternative using the pathlib library (often preferred over os.path).

This method avoids iterative use of pandas concat()/append().

From the pandas documentation:
It is worth noting that concat() (and therefore append()) makes a full copy of the data, and that constantly reusing this function can create a significant performance hit. If you need to use the operation over several datasets, use a list comprehension.

import pandas as pd
from pathlib import Path

dir = Path("../relevant_directory")

df = (pd.read_csv(f) for f in dir.glob("*.csv"))
df = pd.concat(df)
Henrik
4

Based on Sid's good answer.

To identify issues of missing or unaligned columns

Before concatenating, you can load CSV files into an intermediate dictionary which gives access to each data set based on the file name (in the form dict_of_df['filename.csv']). Such a dictionary can help you identify issues with heterogeneous data formats, when column names are not aligned for example.

Import modules and locate file paths:

import os
import glob
import pandas
from collections import OrderedDict
path =r'C:\DRO\DCL_rawdata_files'
filenames = glob.glob(path + "/*.csv")

Note: OrderedDict is not necessary, but it'll keep the order of files which might be useful for analysis.

Load CSV files into a dictionary. Then concatenate:

dict_of_df = OrderedDict((f, pandas.read_csv(f)) for f in filenames)
pandas.concat(dict_of_df, sort=True)

Keys are file names f and values are the data frame content of CSV files.

Instead of using f as a dictionary key, you can also use os.path.basename(f) or other os.path methods to reduce the size of the key in the dictionary to only the smaller part that is relevant.
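As a rough sketch of how such a dictionary can expose misaligned columns, the example below reads two made-up CSV texts from memory instead of from disk. The paths, the column names, and the helper variables all_columns, shared and misaligned are assumptions introduced for this illustration only.

```python
import io
import os

import pandas as pd

# Hypothetical file contents; the second file spells one column differently.
raw = {
    "/data/jan.csv": "id,value\n1,10\n",
    "/data/feb.csv": "id,Value\n2,20\n",
}

# Key each frame by the base name only, as suggested above.
dict_of_df = {
    os.path.basename(path): pd.read_csv(io.StringIO(text))
    for path, text in raw.items()
}

# Columns that are not shared by every file point to misaligned headers.
all_columns = set().union(*(df.columns for df in dict_of_df.values()))
shared = set.intersection(*(set(df.columns) for df in dict_of_df.values()))
misaligned = sorted(all_columns - shared)

# Passing a dict to concat uses its keys as the outer index level.
combined = pd.concat(dict_of_df, sort=True)
```

Inspecting misaligned before concatenating shows which headers need renaming; otherwise concat silently creates separate columns filled with NaN.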

Peter Mortensen
Paul Rougieux
3
import os

os.system("awk '(NR == 1) || (FNR > 1)' file*.csv > merged.csv")

Here NR is the number of the line being processed, counted across all input files, while FNR is the number of the current line within each file.

NR == 1 includes the first line of the first file (the header), while FNR > 1 keeps every line except the first line of each file, so the headers of the subsequent files are skipped.
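For readers without awk available, the same NR/FNR logic can be approximated in plain Python. This is a sketch and not the answer's method: the function name merge_csv_keep_first_header and the sample files are hypothetical, and it assumes every input file ends with a newline.

```python
import os
import tempfile


def merge_csv_keep_first_header(paths, out_path):
    """Write every line of the first file, and all but line 1 of the rest."""
    nr = 0  # running line count across all files, like awk's NR
    with open(out_path, "w", newline="") as out:
        for path in paths:
            with open(path, newline="") as src:
                # fnr restarts at 1 for every file, like awk's FNR
                for fnr, line in enumerate(src, start=1):
                    nr += 1
                    if nr == 1 or fnr > 1:
                        out.write(line)


# Hypothetical sample files, made up for this illustration.
tmp_dir = tempfile.mkdtemp()
paths = []
for name, text in (("file1.csv", "a,b\n1,2\n"), ("file2.csv", "a,b\n3,4\n")):
    full = os.path.join(tmp_dir, name)
    with open(full, "w", newline="") as fh:
        fh.write(text)
    paths.append(full)

merged_path = os.path.join(tmp_dir, "merged.csv")
merge_csv_keep_first_header(paths, merged_path)

with open(merged_path, newline="") as fh:
    merged_text = fh.read()
```

Like the awk command, this works on raw lines and does not parse the CSV, so quoted fields containing line breaks in the header would not be handled.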

Gonçalo Peres
2

In case of an unnamed column issue, use this code for merging multiple CSV files along axis 0 (row-wise).

import glob
import os
import pandas as pd

merged_df = pd.concat([pd.read_csv(csv_file, index_col=0, header=0) for csv_file in glob.glob(
        os.path.join("data/", "*.csv"))], axis=0, ignore_index=True)

merged_df.to_csv("merged.csv")
Peter Mortensen
0

You can do it this way also:

import pandas as pd
import os

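frames = []
for root, dirs, files in os.walk(csv_folder_path):
    for file in files:
        # join with the directory being walked so files in subfolders resolve
        complete_file_path = os.path.join(root, file)
        frames.append(pd.read_csv(complete_file_path))

# DataFrame.append was removed in pandas 2.0; concatenate once instead
new_df = pd.concat(frames, ignore_index=True)

new_df.shape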
neha
0

Consider using convtools library, which provides lots of data processing primitives and generates simple ad hoc code under the hood. It is not supposed to be faster than pandas/polars, but sometimes it can be.

e.g. you could concat csv files into one for further reuse - here's the code:

import glob

from convtools import conversion as c
from convtools.contrib.tables import Table
import pandas as pd


def test_pandas():
    df = pd.concat(
        (
            pd.read_csv(filename, index_col=None, header=0)
            for filename in glob.glob("tmp/*.csv")
        ),
        axis=0,
        ignore_index=True,
    )
    df.to_csv("out.csv", index=False)
# took 20.9 s


def test_convtools():
    table = None
    for filename in glob.glob("tmp/*.csv"):
        table_ = Table.from_csv(filename, header=False)
        if table is None:
            table = table_
        else:
            table = table.chain(table_)

    table.into_csv("out_convtools.csv", include_header=False)
# took 15.8 s

Of course, if you just want to obtain a dataframe without writing a concatenated file, it will take 4.63 s and 10.9 s respectively (pandas is faster here because it doesn't need to zip columns for writing it back).

westandskif
-2
import pandas as pd
import glob

path = r'C:\DRO\DCL_rawdata_files' # use your path
file_path_list = glob.glob(path + "/*.csv")

file_iter = iter(file_path_list)

list_df_csv = []
list_df_csv.append(pd.read_csv(next(file_iter)))

for file in file_iter:
    list_df_csv.append(pd.read_csv(file, header=0))
df = pd.concat(list_df_csv, ignore_index=True)
YASH GUPTA
-2

This is how you can do it using Colaboratory on Google Drive:

import pandas as pd
import glob

path = r'/content/drive/My Drive/data/actual/comments_only' # Use your path
all_files = glob.glob(path + "/*.csv")

li = []

for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)

frame = pd.concat(li, axis=0, ignore_index=True, sort=True)
frame.to_csv('/content/drive/onefile.csv')
Peter Mortensen
Shaina Raza