
I have multiple data frames that I saved in a concatenated list like below. Each df represents a matrix.

my_df = pd.concat([df1, df2, df3, .....])

How do I sum all these dfs (matrices) into one df (matrix)?

I found a discussion here, but it only answers how to add two data frames, by using code like below.

df_x.add(df_y, fill_value=0)

Should I use the code above in a loop, or is there a more concise way?

I tried print(my_df.sum()) but got a very confusing result (it suddenly turned into a single row instead of a two-dimensional matrix).

Thank you.

jpp
catris25

3 Answers


I believe you need functools.reduce, provided each DataFrame in the list has the same index and column labels:

import numpy as np
import pandas as pd

np.random.seed(2018)

df1 = pd.DataFrame(np.random.choice([1,np.nan,2], size=(3,3)), columns=list('abc'))
df2 = pd.DataFrame(np.random.choice([1,np.nan,3], size=(3,3)), columns=list('abc'))
df3 = pd.DataFrame(np.random.choice([1,np.nan,4], size=(3,3)), columns=list('abc'))
print (df1)
     a    b    c
0  2.0  2.0  2.0
1  NaN  NaN  1.0
2  1.0  2.0  NaN

print (df2)
     a    b    c
0  NaN  NaN  1.0
1  3.0  3.0  3.0
2  NaN  1.0  3.0

print (df3)
     a    b    c
0  4.0  NaN  NaN
1  4.0  1.0  1.0
2  4.0  NaN  1.0

from functools import reduce

my_df = [df1,df2, df3]

df = reduce(lambda x, y: x.add(y, fill_value=0), my_df)
print (df)
     a    b    c
0  6.0  2.0  3.0
1  7.0  4.0  5.0
2  5.0  3.0  4.0
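The reduce call is just a fold over the list; the same accumulation can be written as an explicit loop, which some find easier to read (same sample frames as above):

```python
from functools import reduce

import numpy as np
import pandas as pd

np.random.seed(2018)
df1 = pd.DataFrame(np.random.choice([1, np.nan, 2], size=(3, 3)), columns=list('abc'))
df2 = pd.DataFrame(np.random.choice([1, np.nan, 3], size=(3, 3)), columns=list('abc'))
df3 = pd.DataFrame(np.random.choice([1, np.nan, 4], size=(3, 3)), columns=list('abc'))
dfs = [df1, df2, df3]

# start from a copy so the first frame is not mutated
total = dfs[0].copy()
for other in dfs[1:]:
    # NaN in either operand is treated as 0 for the addition
    total = total.add(other, fill_value=0)
```

Both forms produce identical results; reduce is simply the more concise spelling of this loop.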
jezrael

I believe the idiomatic solution to this is to preserve the information about different DataFrames with the help of the keys parameter and then use sum on the innermost level:

dfs = [df1, df2, df3]
my_df = pd.concat(dfs, keys=['df{}'.format(i+1) for i in range(len(dfs))])
my_df.sum(level=1)

which yields

     a    b    c
0  6.0  2.0  3.0
1  7.0  4.0  5.0
2  5.0  3.0  4.0

with jezrael's sample DataFrames.
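Note that the level argument of sum was deprecated in pandas 1.3 and removed in pandas 2.0; on current versions the equivalent is an explicit groupby on the inner index level. A minimal sketch, using the same sample frames as above:

```python
import numpy as np
import pandas as pd

np.random.seed(2018)
df1 = pd.DataFrame(np.random.choice([1, np.nan, 2], size=(3, 3)), columns=list('abc'))
df2 = pd.DataFrame(np.random.choice([1, np.nan, 3], size=(3, 3)), columns=list('abc'))
df3 = pd.DataFrame(np.random.choice([1, np.nan, 4], size=(3, 3)), columns=list('abc'))

dfs = [df1, df2, df3]
my_df = pd.concat(dfs, keys=['df{}'.format(i + 1) for i in range(len(dfs))])
# group by the inner index level (the original row labels) and sum,
# skipping NaN within each group
result = my_df.groupby(level=1).sum()
```

This keeps the whole computation in pandas and preserves the row and column labels of the original frames.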

ayhan

One method is to use Python's built-in sum with a list of arrays. The output here will be an array rather than a dataframe.

This assumes you need to replace np.nan with 0:

res = sum([x.fillna(0).values for x in [df1, df2, df3]])

Alternatively, you can use numpy directly in a couple of different ways:

res_np1 = np.add.reduce([x.fillna(0).values for x in [df1, df2, df3]])

res_np2 = np.nansum([x.values for x in [df1, df2, df3]], axis=0)

numpy.nansum treats np.nan as zero for summing purposes.
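If the frames hold plain integers with no NaN, the fillna/nansum machinery is unnecessary and np.sum alone suffices. A minimal self-contained sketch (the small frames here are made up for illustration):

```python
import numpy as np
import pandas as pd

# hypothetical integer matrices with identical shape and labels
df1 = pd.DataFrame([[1, 2], [3, 4]], columns=list('ab'))
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('ab'))
df3 = pd.DataFrame([[9, 10], [11, 12]], columns=list('ab'))

# elementwise sum over the stacked arrays (axis=0 collapses the stack)
res = np.sum([x.values for x in [df1, df2, df3]], axis=0)

# wrap back in a DataFrame, reusing labels from the first frame
res_df = pd.DataFrame(res, index=df1.index, columns=df1.columns)
```

This assumes all frames share the same shape and row/column order, since numpy ignores pandas labels and aligns purely by position.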

jpp
  • My data are integers (and none of them are NaN, so it gets an error). Should I still use the `fillna(0)` part if there's no NaN? – catris25 Apr 23 '18 at 11:40
  • @AnnaRG, In this case, it's not necessary to use `fillna` or `nansum`. I recommend `np.sum([x.values for x in [df1, df2, df3]], axis=0)` for readability. – jpp Apr 23 '18 at 11:42
  • Thanks. So your answer doesn't require the dfs to be concatenated first, unlike @ayhan's answer. May I know why that is? – catris25 Apr 23 '18 at 11:47
  • My answer drops down to `numpy`, while ayhan's answer sticks to `pandas`. There is more than one solution to this problem. If performance is not an issue, *use what makes most sense to you*. – jpp Apr 23 '18 at 11:48
  • I will, thanks. Out of curiosity, performance-wise, which should be the fastest solution? – catris25 Apr 23 '18 at 11:54
  • @AnnaRG, I expect a `numpy` solution to be fastest. But you should test this with your data. – jpp Apr 23 '18 at 11:55