creating pandas data frame from multiple files

Question

I am trying to create a pandas DataFrame and it works fine for a single file. If I need to build it for multiple files which have the same data structure. So instead of single file name I have a list of file names from which I would like to create the DataFrame.

Not sure what's the way to append to current DataFrame in pandas or is there a way for pandas to suck a list of files into a DataFrame.

zach · Accepted Answer · 2013-10-03T20:36:17.910

40

The pandas concat command is your friend here. Lets say you have all you files in a directory, targetdir. You can:

make a list of the files
load them as pandas dataframes
and concatenate them together

`

import os
import pandas as pd

#list the files
filelist = os.listdir(targetdir) 
#read them into pandas
df_list = [pd.read_table(file) for file in filelist]
#concatenate them together
big_df = pd.concat(df_list)

edited Oct 03 '13 at 20:36

answered Jan 23 '13 at 22:58

zach

29,475
16
67
88

Glob would be perfect for the first step. You can generate a list of files using wildcards. – thecircus Jan 27 '16 at 20:45
2

In my use case, I needed to do `pd.concat(df_list, ignore_index=True, axis=1)` since each csv file represented a column – zje Jun 23 '16 at 21:27

score 3 · Answer 2 · edited Jun 05 '16 at 02:51

3

Potentially horribly inefficient but...

Why not use read_csv, to build two (or more) dataframes, then use join to put them together?

That said, it would be easier to answer your question if you provide some data or some of the code you've used thus far.

edited Jun 05 '16 at 02:51

Yu Hao

119,891
44
235
294

answered May 12 '12 at 12:35

mrdevlar

123
6

score 1 · Answer 3 · edited Jun 05 '16 at 02:51

1

I might try to concatenate the files before feeding them to pandas. If you're in Linux or Mac you could use cat, otherwise a very simple Python function could do the job for you.

edited Jun 05 '16 at 02:51

Yu Hao

119,891
44
235
294

answered May 11 '12 at 06:17

Jose Blanca

105
1
6

Sure.. just curious if there is a way to do this without combining all the files. Thanks for your quick reply Jose – Abhi May 11 '12 at 06:22

score 0 · Answer 4 · answered May 11 '12 at 16:55

Are these files in a csv format. You could use the read_csv. http://pandas.sourceforge.net/io.html

Once you have read the files and save it in two dataframes, you could merge the two dataframes or add additional columns to one of the two dataframes( assuming common index). Pandas should be able to fill in missing rows.

score 0 · Answer 5 · answered Feb 04 '21 at 01:06

0

import os
import pandas as pd
data = []

thisdir = os.getcwd()

for r, d, f in os.walk(thisdir):
    for file in f:
        if ".docx" in file:
            data.append(file)

df = pd.DataFrame(data)

answered Feb 04 '21 at 01:06

Oscar Rangel

848
1
10
18

3

Thank you for contributing an answer. Would you kindly edit your answer to to include an explanation of your code? That will help future readers better understand what is going on, and especially those members of the community who are new to the language and struggling to understand the concepts. That’s especially important here where there are four other answers, including an accepted answer, competing for attention. When might your approach be preferable? Has the syntax changed since the original answer nine years ago? – Jeremy Caney Feb 04 '21 at 01:32

score 0 · Answer 6 · answered Feb 04 '22 at 19:37

Here is a simple solution that avoids using a list to hold all the data frames, if you don't need them in a list, it creates a dataframe for each file, you can then pd.concat them.

import fnmatch

# get the CSV files only
files = fnmatch.filter(os.listdir('.'), '*.csv')
files

Output which is now a list of the names:

['Feedback Form Submissions 1.21-1.25.22.csv',
 'Feedback Form Submissions 1.21.22.csv',
 'Feedback Form Submissions 1.25-1.31.22.csv']

Now create a simple list of new names to make working with them easier:

# use a simple format
names = []
for i in range(0,len(files)):
    names.append('data' + str(i))
names

['data0', 'data1', 'data2']

You can use any list of names that you want. The next step take the file names and the list of names and then assign them to the names.

# i is the incrementor for the list of names
i = 0

# iterate through the file names
for file in files:
    # make an empty dataframe
    df = pd.DataFrame()
    # load the first file in
    df = pd.read_csv(file, low_memory=False)
    # get the first name from the list, this will be a string
    new_name = names[i]
    # assign the string to the variable and assign it to the dataframe 
    locals()[new_name] = df.copy()
    # increment the list of names
    i = i + 1

You now have 3 separate dataframes named data0, data1, data2, and do commands like

data2.info()

creating pandas data frame from multiple files

6 Answers6

Linked