How to sum columns from two different size datasets in pandas

Question

I have two datasets. The first one (df1) contains more then 200.000 rows, and the second one (df2) only two. I need to create a new column df1['column_2'] which is the sum of df1['column_1'] and df2['column_1']

When I try to make df1['column_2'] = df1['column_1'] + df2['column_1'] I get an error "A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead"

How can I sum values of different datasets with different amount of rows? Will be thankful for any help! Screenshot of my notebook: https://prnt.sc/p1d6ze

What do you mean by *sum of df1['column_1'] and df2['column_1']*? What do you expect the first item on `df1['column_2']` to be? — Quang Hoang, Sep 03 '19 at 18:10
Unfourtunately, I can not add a screenshot, I need at least 10 reputation... I'll try to explain. I have a big dataset with clients. This dataframe has a column 'minutes_on_month' - total amount of minutes, which the clien spent during a month. The second dataframe has the describtion of tariff in the column 'minutes_included' it`s value, for example, 500. In fact I need to count extra_minutes for evety client — Stas, Sep 03 '19 at 18:44
Redirecting to https://stackoverflow.com/questions/11106823/adding-two-pandas-dataframes. You can refer the reindex function of pandas. — Lovleen Kaur, Sep 03 '19 at 18:46
@Stas but then, do you need to align the client's by the tariff before substracting? That is, clients `A,B,C` with tariff `x` should be subtracted by `m(x)` minutes, while clients `D,E` with tariff `y` should be subtracted by `m(y)`? — Quang Hoang, Sep 03 '19 at 18:52
This is a slice from common table. In df_1 all clients use one tariff. — Stas, Sep 03 '19 at 18:55
@LovleenKaur, and what about subtraction and other math actions? — Stas, Sep 03 '19 at 19:03

Valdi_Bo · Answer 1 · 2019-09-03T19:33:52.070

I tried your code and it works with no error, using Pandas 0.25.0 and Python 3.7.0. If you use older versions, consider upgrade.

For the test I used df1 with 10 rows (shorter):

   column_1
0        10
1        20
2        30
3        40
4        50
5        60
6        70
7        80
8        90
9       100

and df2 with 2 rows (just as in your post):

   column_1
0         3
1         5

Your instruction df1['column_2'] = df1['column_1'] + df2['column_1'] gives the following result:

   column_1  column_2
0        10      13.0
1        20      25.0
2        30       NaN
3        40       NaN
4        50       NaN
5        60       NaN
6        70       NaN
7        80       NaN
8        90       NaN
9       100       NaN

So that:

Elements with "overlapping" index values are summed.
Other elements (with no corresponding index in df2 are NaN.
Because of the presence of NaN values, this column is coerced to float.

Alternative form of this instruction, using .loc[...] is:

df1['column_2'] = df1.loc[:, 'column_1'] + df2.loc[:, 'column_1']

It works on my computer either.

Or maybe you want to "multiply" (replicate) df2 to the length of df1 before summing? If yes, run:

df1['column_2'] = df1.column_1 + df2.column_1.values.tolist() * 5

In this case 5 is the number of times df2 should be "multiplied".

This time no index alignment takes place and the result is:

   column_1  column_2
0        10        13
1        20        25
2        30        33
3        40        45
4        50        53
5        60        65
6        70        73
7        80        85
8        90        93
9       100       105

Can I get an error if df_1 column has .int , and df_2 - float? — Stas, Sep 03 '19 at 18:48

Lovleen Kaur · Answer 2 · 2019-09-03T19:51:55.527

Reindex is applied on the df which have less number of records compared to the other, For example here y

Subtraction:

import pandas as pd
import re
x = pd.DataFrame([(100,200),(300,400),(100,111)], columns=['a','b'])
y = pd.DataFrame([(1,2),(3,4)], columns=['a','b'])
z= x - y.reindex_like(x).fillna(0)

Addition

import pandas as pd
import re
x = pd.DataFrame([(100,200),(300,400),(100,111)], columns=['a','b'])
y = pd.DataFrame([(1,2),(3,4)], columns=['a','b'])
z= x + y.reindex_like(x).fillna(0)

Multiplication

import pandas as pd
import re
x = pd.DataFrame([(100,200),(300,400),(100,111)], columns=['a','b'])
y = pd.DataFrame([(1,2),(3,4)], columns=['a','b'])
z= x * y.reindex_like(x).fillna(1)

score 0 · Answer 3 · answered Sep 04 '19 at 08:08

I have discovered that I can not make df_1['column_3] = df_1['column_1] + df_1['column_2] if df_1 is a slice from original dataframe df. So, I have solved my question by writing a function:

def new_column(dataframe):
    if dataframe['column']=='value_1':
   dataframe['new_column'] =(dataframe['column_1']
   - df_2[df_2['column']=='value_1']
      ['column_1'].values[0])
    else:
       dataframe['new_column'] =(dataframe['column_1']
       - df_2[df_2['column']=='value_2']
          ['column_1'].values[0])
    return dataframe
dataframe=df_1.apply(new_column,axis=1)

How to sum columns from two different size datasets in pandas

3 Answers3