
I am trying to append a row (df_row) with every iteration to a parent dataframe (df_all). The parent dataframe has all the possible column values and every iteration produces a row with a unique set of columns which are a subset of the all possible columns. It looks something like this:

df_all

initially has all the possible column names:

Index A B C D E F G H

Iteration 1:

df_row1:

Index A C D E F
  ID1 1 2 3 5 1 

df_all=df_all.append(df_row1)

Now df_all looks as below:

df_all:

Index A  B  C  D  E  F  G  H 
  ID1 1  na 2  3  5  1 na na

Iteration 2:

df_row2:

Index A B D F G H
  ID2 0 8 3 5 1 4
df_all=df_all.append(df_row2)

Now df_all looks as below:

df_all:

Index A  B  C  D  E  F  G  H 
  ID1 1  na 2  3  5  1 na na
  ID2 0  8  na 3  na 5  1  4

And so on...

However, I have >20000 rows to add, and the time taken to append each row increases with every iteration. Is there a way to do this more efficiently, within a reasonable amount of time?

kdba

2 Answers


Notice that you can build a DataFrame from a list of Series or dicts:

In [186]: pd.DataFrame([pd.Series({'A':1,'B':2}), pd.Series({'A':2,'C':3})])
Out[186]: 
     A    B    C
0  1.0  2.0  NaN
1  2.0  NaN  3.0

In [187]: pd.DataFrame([{'A':1,'B':2}, {'A':2,'C':3}])
Out[187]: 
   A    B    C
0  1  2.0  NaN
1  2  NaN  3.0

Therefore, you could build your DataFrame like this:

data = []
for n in range(20000):
    df_row = pd.Series(...)
    data.append(df_row)

df = pd.DataFrame(data)

This is more efficient than calling df.append inside the for-loop, since that leads to quadratic copying.
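As a minimal self-contained sketch of this approach, using the ID1/ID2 rows from the question: giving each Series a `name` makes those names become the index of the resulting DataFrame, and a final `reindex` enforces the full column set known up front.

```python
import pandas as pd

# Build each row as a named Series; the names become the index labels.
rows = [
    pd.Series({'A': 1, 'C': 2, 'D': 3, 'E': 5, 'F': 1}, name='ID1'),
    pd.Series({'A': 0, 'B': 8, 'D': 3, 'F': 5, 'G': 1, 'H': 4}, name='ID2'),
]

# One constructor call at the end: columns are the union of all row
# columns, and missing entries are filled with NaN.
df = pd.DataFrame(rows)

# Enforce the full, ordered column set that df_all is known to have.
df = df.reindex(columns=list('ABCDEFGH'))
print(df)
```

Collecting plain Python objects in a list is O(1) per append, so the whole build is linear instead of quadratic.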

unutbu

I believe you are looking for the merge function!

Try it out as df_all.merge(df_row, how='outer'), that should do the job.
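For instance (a minimal sketch with made-up two-row frames; note that `merge` joins on the shared columns and resets the index, so the ID labels from the question would be dropped):

```python
import pandas as pd

# Toy frames mirroring the question's shapes (hypothetical values).
df_all = pd.DataFrame({'A': [1], 'C': [2]}, index=['ID1'])
df_row = pd.DataFrame({'A': [0], 'B': [8]}, index=['ID2'])

# An outer merge keeps all rows and the union of columns, filling NaN.
# Caveat: it joins on the shared column(s) -- 'A' here -- and resets
# the index, so the ID labels are lost.
df_all = df_all.merge(df_row, how='outer')
print(df_all)
```

Also note that calling `merge` once per row inside a loop copies the growing frame each time, so it has the same quadratic cost as `append`.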

hmhmmm