107

I have the following for loop:

for i in links:
     data = urllib2.urlopen(str(i)).read()
     data = json.loads(data)
     data = pd.DataFrame(data.items())
     data = data.transpose()
     data.columns = data.iloc[0]
     data = data.drop(data.index[[0]])

Each dataframe created this way shares most of its columns with the others, but not all of them, and each has just one row. What I need is a single dataframe that contains every distinct column and one row for each dataframe produced by the for loop.

I tried pandas concatenate or similar but nothing seemed to work. Any idea? Thanks.

Blue Moon
  • since you don't know the columns beforehand, which seems to be what Pandas.DataFrame is designed for, you should probably generate a giant List of Lists, using `np.unique()` etc. to generate the full Columns list; at the end of the loop, create the DataFrame from the List. – Demis Mar 23 '21 at 03:59

5 Answers

113

Suppose your data looks like this:

import pandas as pd
import numpy as np

np.random.seed(2015)
df = pd.DataFrame([])
for i in range(5):
    data = dict(zip(np.random.choice(10, replace=False, size=5),
                    np.random.randint(10, size=5)))
    data = pd.DataFrame(data.items())
    data = data.transpose()
    data.columns = data.iloc[0]
    data = data.drop(data.index[[0]])
    df = df.append(data)
print('{}\n'.format(df))
# 0   0   1   2   3   4   5   6   7   8   9
# 1   6 NaN NaN   8   5 NaN NaN   7   0 NaN
# 1 NaN   9   6 NaN   2 NaN   1 NaN NaN   2
# 1 NaN   2   2   1   2 NaN   1 NaN NaN NaN
# 1   6 NaN   6 NaN   4   4   0 NaN NaN NaN
# 1 NaN   9 NaN   9 NaN   7   1   9 NaN NaN

Then it could be replaced with

np.random.seed(2015)
data = []
for i in range(5):
    data.append(dict(zip(np.random.choice(10, replace=False, size=5),
                         np.random.randint(10, size=5))))
df = pd.DataFrame(data)
print(df)

In other words, do not form a new DataFrame for each row. Instead, collect all the data in a list of dicts, and then call df = pd.DataFrame(data) once at the end, outside the loop.
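
Applied to the loop in the question, the same pattern might look like the sketch below. It assumes each response decodes to a flat JSON object. The `payloads` strings are made-up stand-ins for what `urlopen(...).read()` would return, so the example runs without network access.

```python
import json

import pandas as pd

# Hypothetical stand-ins for the JSON text fetched from each link.
payloads = [
    '{"id": 1, "name": "a", "score": 0.5}',
    '{"id": 2, "name": "b", "city": "Rome"}',
    '{"id": 3, "score": 0.9, "city": "Oslo"}',
]

records = []
for text in payloads:
    # One dict per response; no DataFrame is built inside the loop.
    records.append(json.loads(text))

# Build the frame once. The columns are the union of all keys,
# and a key missing from a record becomes NaN in that row.
df = pd.DataFrame(records)
print(df)
```

Because the constructor takes the union of the keys, the columns do not need to be known before the loop starts.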

Each call to df.append requires allocating space for a new DataFrame with one extra row, copying all the data from the original DataFrame into the new DataFrame, and then copying data into the new row. All that allocation and copying makes calling df.append in a loop very inefficient. The time cost of copying grows quadratically with the number of rows. Not only is the call-DataFrame-once code easier to write, its performance will be much better -- the time cost of copying grows linearly with the number of rows.
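
To make the quadratic-versus-linear claim concrete, here is a rough counting sketch. It is a simplified model, not a benchmark: it only counts how many rows get copied under each strategy and ignores allocation overhead and column count. The function names are made up for illustration.

```python
def rows_copied_by_appending(n_rows):
    """Rows copied when the whole frame is rebuilt after every new row."""
    copied = 0
    for current_size in range(n_rows):
        # The existing rows are copied into the new frame, plus the new row.
        copied += current_size + 1
    return copied


def rows_copied_by_single_build(n_rows):
    """Rows copied when the frame is built once from a list."""
    return n_rows


for n in (100, 1000):
    print(n, rows_copied_by_appending(n), rows_copied_by_single_build(n))
```

Under this model, ten times as many rows means roughly a hundred times as much copying for the append-in-a-loop approach, but only ten times as much for the build-once approach.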

unutbu
  • I was really surprised by the performance difference: adding 100 rows of 5 values takes 336 ms with dataframe.append (I tried dataframe.loc[i]; it seems to be the same). The same task with the proposed solution takes only 4.8 ms! – 2diabolos.com Aug 18 '16 at 14:44
  • Note: `.append` will be deprecated in future versions of pandas. Use `pd.concat` instead. – jonnyg23 Sep 09 '22 at 17:38
105

There are two reasons you may append rows in a loop: (1) to add to an existing df, and (2) to create a new df.

To create a new df, I think it's well documented that you should either create your data as a list and then create the data frame:

cols = ['c1', 'c2', 'c3']
lst = []
for a in range(2):
    lst.append([1, 2, 3])
df1 = pd.DataFrame(lst, columns=cols)
df1
Out[3]: 
   c1  c2  c3
0   1   2   3
1   1   2   3

Or create the dataframe with an index up front and then fill it in:

cols = ['c1', 'c2', 'c3']
df2 = pd.DataFrame(columns=cols, index=range(2))
for a in range(2):
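    # Index row and column in a single .loc call; chained assignment
    # such as df2.loc[a].c1 = 4 may write to a temporary copy instead.
    df2.loc[a, 'c1'] = 4
    df2.loc[a, 'c2'] = 5
    df2.loc[a, 'c3'] = 6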
df2
Out[4]: 
  c1 c2 c3
0  4  5  6
1  4  5  6

If you want to add to an existing dataframe, you could use either method above and then combine the dataframes (with or without the index):

# DataFrame.append was removed in pandas 2.0; pd.concat does the same job.
df3 = pd.concat([df2, df1], ignore_index=True)
df3
Out[6]: 
  c1 c2 c3
0  4  5  6
1  4  5  6
2  1  2  3
3  1  2  3

Or, you can also create a list of dictionary entries and append those as in the answer above.

lst_dict = []
for a in range(2):
    lst_dict.append({'c1':2, 'c2':2, 'c3': 3})
df4 = pd.concat([df1, pd.DataFrame(lst_dict)])
df4
Out[7]: 
   c1  c2  c3
0   1   2   3
1   1   2   3
0   2   2   3
1   2   2   3

Using dict(zip(cols, vals)):

lst_dict = []
for a in range(2):
    vals = [7, 8, 9]
    lst_dict.append(dict(zip(cols, vals)))
df5 = pd.concat([df1, pd.DataFrame(lst_dict)])

Including the idea from the comment below:

It turns out Pandas does have an effective way to append to a dataframe:

df.loc[ len(df) ] = [new, row, of, data] 

(this) will "append" to the end of a dataframe in-place. – Demis Mar 22 at 15:32
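
A minimal sketch of that idiom, reusing the `cols` names from above with made-up row values. It assumes the dataframe keeps its default 0, 1, 2, ... index, since `len(df)` is only guaranteed to be an unused label in that case; with any other index it could overwrite an existing row.

```python
import pandas as pd

cols = ['c1', 'c2', 'c3']
df = pd.DataFrame(columns=cols)

for a in range(3):
    # len(df) is the next unused label while the index is 0, 1, 2, ...
    df.loc[len(df)] = [a, a * 2, a * 3]

print(df)
```

This is convenient for a handful of rows. For many rows, collecting the rows in a list and building the dataframe once is probably still the safer choice.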

kztd
  • Thank you, I had not found it documented clearly that creating the DF dynamically during a loop is not recommended - it seemed the logical thing to create the final datastructure in-place during the loop rather than a temporary List first (which causes you to have made Two arrays of the same data). Why is appending to a dataframe discouraged, as opposed to generating the list first? If you had enormous datasets, it sounds like this would use twice the resources. – Demis Mar 17 '21 at 19:36
  • It's been a while, but if I remember correctly, when you append you end up copying the whole thing somehow. https://stackoverflow.com/questions/55967976/python-panda-append-dataframe-in-loop?noredirect=1&lq=1 – kztd Mar 18 '21 at 03:38
  • (I didn’t mean the method ‘append()’ specifically.) Pandas doesn’t have a clean way to add to a dataset in-place? Like the List.append method? I’d find it surprising that a data analysis package would require duplicating the full data set in order to create it. – Demis Mar 19 '21 at 04:10
  • sorry I didn't write it, I just use it. – kztd Mar 19 '21 at 15:36
  • It turns out Pandas does have an effective way to append to a dataframe: `df.loc[len(df)] = [new, row, of, data]` will "append" to the end of a dataframe in-place. – Demis Mar 22 '21 at 15:32
  • sounds great – kztd Mar 22 '21 at 23:47
  • I had issues with this - I started with an empty dataframe and tried to append using `df.loc[i-1]` as I was using a loop. And I kept getting errors... and it looked like it wanted to append multiple rows with one column, instead of multiple columns and one row. Eventually, figured out that I MUST create the empty dataset with `index=[0]` so that it stops ordering the columns when I use the `df.columns` function. – GenDemo Jun 02 '23 at 03:52
15

A more compact and efficient way would be perhaps:

cols = ['frame', 'count']
N = 4
dat = pd.DataFrame(columns=cols)
for i in range(N):
    dat = dat.append({'frame': str(i), 'count': i}, ignore_index=True)

output would be:

>>> dat
   frame count
0     0     0
1     1     1
2     2     2
3     3     3
Ayanava Sarkar
  • Why is this efficient? Efficient in memory, time, or amount of code? Looks like it'll use twice the memory since it has to overwrite the whole DF with the same DF each iteration. – Demis Mar 23 '21 at 03:56
4

I created a data frame in a for loop with the help of a temporary empty data frame, because each iteration of the loop creates a new data frame and so overwrites the contents of the previous iteration.

Hence the contents of each new data frame need to be moved into the empty data frame that was created beforehand. It's as simple as that: just use the .append function as shown below:

temp_df = pd.DataFrame() #Temporary empty dataframe
for sent in Sentences:
    New_df = pd.DataFrame({'words': sent.words}) #Creates a new dataframe and contains tokenized words of input sentences
    temp_df = temp_df.append(New_df, ignore_index=True) #Moving the contents of newly created dataframe to the temporary dataframe

Outside the for loop, you can copy the contents of the temporary data frame into the master data frame and then delete the temporary data frame if you no longer need it.
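
On pandas versions where `DataFrame.append` is no longer available, one possible equivalent is to collect the per-iteration data frames in a list and call `pd.concat` once after the loop. This is a sketch under assumptions: the `sentences` list of word lists is a hypothetical stand-in for the `Sentences` objects and their `.words` attribute used above.

```python
import pandas as pd

# Hypothetical stand-in for the tokenized sentences in the answer.
sentences = [['the', 'cat', 'sat'], ['on', 'the', 'mat']]

frames = []
for words in sentences:
    # Each iteration still builds its own small data frame ...
    frames.append(pd.DataFrame({'words': words}))

# ... but they are all combined in a single step after the loop.
master_df = pd.concat(frames, ignore_index=True)
print(master_df)
```

Concatenating once also avoids re-copying the accumulated rows on every iteration.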

JKC
4

First, create an empty DataFrame with column names. Then, inside the for loop, define a dictionary (one row) with the data to append:

df = pd.DataFrame(columns=['A'])
for i in range(5):
    df = df.append({'A': i}, ignore_index=True)
df
   A
0  0
1  1
2  2
3  3
4  4

If you want to add a row with more columns, the code will look like this:

df = pd.DataFrame(columns=['A','B','C'])
for i in range(5):
    df = df.append({'A': i, 'B': i * 2, 'C': i * 3}, ignore_index=True)
df
    A   B   C
0   0   0   0
1   1   2   3
2   2   4   6
3   3   6   9
4   4   8   12


ekz4