1

I have two datasets. The first one (df1) contains more then 200.000 rows, and the second one (df2) only two. I need to create a new column df1['column_2'] which is the sum of df1['column_1'] and df2['column_1']

When I try to make df1['column_2'] = df1['column_1'] + df2['column_1'] I get an error "A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead"

How can I sum values of different datasets with different amount of rows? Will be thankful for any help! Screenshot of my notebook: https://prnt.sc/p1d6ze

Jason Aller
  • 3,541
  • 28
  • 38
  • 38
Stas
  • 53
  • 7
  • 2
    What do you mean by *sum of df1['column_1'] and df2['column_1']*? What do you expect the first item on `df1['column_2']` to be? – Quang Hoang Sep 03 '19 at 18:10
  • Please show sample data and code. – Mad Physicist Sep 03 '19 at 18:11
  • Unfourtunately, I can not add a screenshot, I need at least 10 reputation... I'll try to explain. I have a big dataset with clients. This dataframe has a column 'minutes_on_month' - total amount of minutes, which the clien spent during a month. The second dataframe has the describtion of tariff in the column 'minutes_included' it`s value, for example, 500. In fact I need to count extra_minutes for evety client – Stas Sep 03 '19 at 18:44
  • Redirecting to https://stackoverflow.com/questions/11106823/adding-two-pandas-dataframes. You can refer the reindex function of pandas. – Lovleen Kaur Sep 03 '19 at 18:46
  • @Stas but then, do you need to align the client's by the tariff before substracting? That is, clients `A,B,C` with tariff `x` should be subtracted by `m(x)` minutes, while clients `D,E` with tariff `y` should be subtracted by `m(y)`? – Quang Hoang Sep 03 '19 at 18:52
  • This is a slice from common table. In df_1 all clients use one tariff. – Stas Sep 03 '19 at 18:55
  • @LovleenKaur, and what about subtraction and other math actions? – Stas Sep 03 '19 at 19:03

3 Answers3

0

I tried your code and it works with no error, using Pandas 0.25.0 and Python 3.7.0. If you use older versions, consider upgrade.

For the test I used df1 with 10 rows (shorter):

   column_1
0        10
1        20
2        30
3        40
4        50
5        60
6        70
7        80
8        90
9       100

and df2 with 2 rows (just as in your post):

   column_1
0         3
1         5

Your instruction df1['column_2'] = df1['column_1'] + df2['column_1'] gives the following result:

   column_1  column_2
0        10      13.0
1        20      25.0
2        30       NaN
3        40       NaN
4        50       NaN
5        60       NaN
6        70       NaN
7        80       NaN
8        90       NaN
9       100       NaN

So that:

  • Elements with "overlapping" index values are summed.
  • Other elements (with no corresponding index in df2 are NaN.
  • Because of the presence of NaN values, this column is coerced to float.

Alternative form of this instruction, using .loc[...] is:

df1['column_2'] = df1.loc[:, 'column_1'] + df2.loc[:, 'column_1']

It works on my computer either.

Or maybe you want to "multiply" (replicate) df2 to the length of df1 before summing? If yes, run:

df1['column_2'] = df1.column_1 + df2.column_1.values.tolist() * 5

In this case 5 is the number of times df2 should be "multiplied".

This time no index alignment takes place and the result is:

   column_1  column_2
0        10        13
1        20        25
2        30        33
3        40        45
4        50        53
5        60        65
6        70        73
7        80        85
8        90        93
9       100       105
Valdi_Bo
  • 30,023
  • 4
  • 23
  • 41
0

Reindex is applied on the df which have less number of records compared to the other, For example here y

Subtraction:

import pandas as pd
import re
x = pd.DataFrame([(100,200),(300,400),(100,111)], columns=['a','b'])
y = pd.DataFrame([(1,2),(3,4)], columns=['a','b'])
z= x - y.reindex_like(x).fillna(0)

Addition

import pandas as pd
import re
x = pd.DataFrame([(100,200),(300,400),(100,111)], columns=['a','b'])
y = pd.DataFrame([(1,2),(3,4)], columns=['a','b'])
z= x + y.reindex_like(x).fillna(0)

Multiplication

import pandas as pd
import re
x = pd.DataFrame([(100,200),(300,400),(100,111)], columns=['a','b'])
y = pd.DataFrame([(1,2),(3,4)], columns=['a','b'])
z= x * y.reindex_like(x).fillna(1)
Lovleen Kaur
  • 342
  • 2
  • 9
0

I have discovered that I can not make df_1['column_3] = df_1['column_1] + df_1['column_2] if df_1 is a slice from original dataframe df. So, I have solved my question by writing a function:

def new_column(dataframe):
    if dataframe['column']=='value_1':
   dataframe['new_column'] =(dataframe['column_1']
   - df_2[df_2['column']=='value_1']
      ['column_1'].values[0])
    else:
       dataframe['new_column'] =(dataframe['column_1']
       - df_2[df_2['column']=='value_2']
          ['column_1'].values[0])
    return dataframe
dataframe=df_1.apply(new_column,axis=1)
Stas
  • 53
  • 7