Appending pandas dataframes generated in a for loop

Question

I am accessing a series of Excel files in a for loop. I then read the data in the excel file to a pandas dataframe. I cant figure out how to append these dataframes together to then save the dataframe (now containing the data from all the files) as a new Excel file.

Here's what I tried:

for infile in glob.glob("*.xlsx"):
    data = pandas.read_excel(infile)
    appended_data = pandas.DataFrame.append(data) # requires at least two arguments
appended_data.to_excel("appended.xlsx")

Thanks!

score 289 · Accepted Answer · edited Jul 19 '19 at 21:48

289

Use pd.concat to merge a list of DataFrame into a single big DataFrame.

appended_data = []
for infile in glob.glob("*.xlsx"):
    data = pandas.read_excel(infile)
    # store DataFrame in list
    appended_data.append(data)
# see pd.concat documentation for more info
appended_data = pd.concat(appended_data)
# write DataFrame to an excel sheet 
appended_data.to_excel('appended.xlsx')

edited Jul 19 '19 at 21:48

Lauren Fitch

346
4
12

answered Feb 23 '15 at 09:12

biobirdman

4,060
1
17
15

1

Fantastic, thanks. Do you know if there is an easy way to add an identifier to each DataFrame to the final excel file? The purpose would be to be able to track which file the data came from. – El Confuso Feb 23 '15 at 10:27
7

Add a new column with the file name when you read the data. Could be something as simple as `data['filename'］ = infile`. – biobirdman Feb 23 '15 at 10:45
1

Great! Thanks a bunch. If anyone in the future wants to try this just replace the `]` with a non-superscript one :) – El Confuso Feb 23 '15 at 11:02
1

This does the job but you get an excel file where rows are not appended below existing ones; rather, the new dataframe is pasted right next to the existing one. – FaCoffee Apr 16 '18 at 16:36
7

@FaCoffee , remove the `axis=1` from the code to bind the dfs below each other. :) – anky Nov 30 '18 at 12:01
2

Unless otherwise specified, I think "append" means by rows, not columns, and would recommend removing `axis=1` from this answer. – Max Ghenis Dec 20 '18 at 00:00
Sorry, I dont understand, you have appended data, why need to use `concat`? – ah bon Jan 21 '20 at 09:57
@ahbon your dataframes are in a list called `appended_data`. `pd.concat` turns that list of dataframes into a single dataframe – Riebeckite Oct 15 '20 at 13:44

score 65 · Answer 2 · answered May 24 '16 at 02:01

65

you can try this.

data_you_need=pd.DataFrame()
for infile in glob.glob("*.xlsx"):
    data = pandas.read_excel(infile)
    data_you_need=data_you_need.append(data,ignore_index=True)

I hope it can help.

answered May 24 '16 at 02:01

ye jiawei

882
7
7

5

There is O(N^2) notation, see [http://stackoverflow.com/questions/37009287/using-pandas-append-within-for-loop](http://stackoverflow.com/questions/37009287/using-pandas-append-within-for-loop) – franchb Dec 28 '16 at 08:43
2

Hi @Ilya - assuming you are referring to alexander's post in the link - the slower performance referenced is because of using `DataFrame.append` operation instead of `list.append` operation - not because of the `DataFrame.append` operation instead of a `DataFrame.concat` operation. – Charlie Nov 09 '17 at 05:30
I'm not sure how python manages memory, but I would expect that the append operation would use smaller than or equal amounts of memory (because irrelevant information is garbage collected), and the concat operation probably uses append 'under the hood' - leading to little, possibly none performance improvement (and possibly poorer performance given the greater memory requirements). – Charlie Nov 09 '17 at 05:32
4

i tried this but my data_you_need is empty. dont know whats went wrong – duckman Apr 13 '18 at 02:40
2

Never grow a dataframe! Append to list instead. Check this out https://i.stack.imgur.com/Ag2NQ.png from https://stackoverflow.com/questions/10715965/add-one-row-to-pandas-dataframe – Abu Shoeb Sep 09 '20 at 01:54
The important part being `ignore_index=True`, otherwise the dataframe keeps being overwritten. – Skippy le Grand Gourou Dec 15 '20 at 12:43

score 4 · Answer 3 · edited Jan 10 '23 at 06:22

4

DataFrame.append() and Series.append() have been deprecated and will be removed in a future version. Use pandas.concat() instead (GH35407).

edited Jan 10 '23 at 06:22

ah bon

9,293
12
65
148

answered Mar 17 '22 at 00:51

BO-CHRISTOPHER REDFEARN

41
2

score 0 · Answer 4 · answered Aug 03 '23 at 22:01

Appending is now deprecated. Use concat instead. Here's an example for a project I'm currently working on. I wanted to update this with an example using concat.

import os
import pandas as pd
import glob

# reference sub floder in project folder
path = r'DataFiles'

# search for all excel files
filenames = glob.glob(path + "\*.xlsx")
# print(filenames)

# initialize an empty dataframe
df_raw = pd.DataFrame()

#loop over list of excel files
for file in filenames:
  
    # combining multiple excel worksheets 
    # into single data frames
    df = pd.concat(pd.read_excel(file, sheet_name=None),
                   ignore_index=True, sort=False)
    # print(df.shape)
    data = [df_raw, df]
    df_raw = pd.concat(data, ignore_index = True, sort=False)

Appending pandas dataframes generated in a for loop

4 Answers4

Linked

Related