5

I have the following problem: in a df, I want to select specific rows and a specific column and in this selection take the first n elements and assign a new value to them. Naively, I thought that the following code should do the job:

import seaborn as sns
import pandas as pd

df = sns.load_dataset('tips')
df.loc[df.day=="Sun", "smoker"].iloc[:4] = "Yes"

Both of the loc and iloc should return a view into the df and the value should be overwritten. However, the dataframe does not change. Why?

I know how to go around it -- creating a new df first just with the loc, then changing the value using iloc and updating back the original df (as below).

But a) I do not think it's optimal, and b) I would like to know why the top solution does not work. Why does it return a copy and not a view of a view?

The alternative solution:

df = sns.load_dataset('tips')
tmp = df.loc[df.day=="Sun", "smoker"]
tmp.iloc[:4] = "Yes"
df.loc[df.day=="Sun", "smoker"] = tmp

Note: I have read the docs, this really great post and this question but they don't explain this. Their concern is the difference between df.loc[mask,"z] and the chained df["z"][mask].

Henry Ecker
  • 34,399
  • 18
  • 41
  • 57
My Work
  • 2,143
  • 2
  • 19
  • 47
  • have you tried with: ```df.loc[df.day=="Sun", "smoker"].iloc[:4] = ["Yes"] * 4```? – 99_m4n Jun 10 '21 at 07:08
  • Yes, that is not the problem. It does not matter if you add a list of values which has the right length or if you use a single string (the list would allow you to set different values to different rows, the string sets it the same for all rows). – My Work Jun 10 '21 at 07:15

2 Answers2

4

I believe df.loc[].iloc[] is a chained assignment case and pandas doesn't guarantee that you will get a view at the end. From the docs:

Whether a copy or a reference is returned for a setting operation, may depend on the context. This is sometimes called chained assignment and should be avoided.

Since you have a filtering condition in loc, pandas will create a new pd.Series and than will apply an assignment to it. For example the following will work because you'll get the same series as df["smoker"]:

df.loc[:, "smoker"].iloc[:4] = 'Yes'

But you will get SettingWithCopyWarning warning.

You need to rewrite your code so that pandas handles this as a single loc entity.

Another possible workaround:

df.loc[df[df.day=="Sun"].index[:4], "smoker"] = 'Yes'
Henry Ecker
  • 34,399
  • 18
  • 41
  • 57
Alexander Volkovsky
  • 2,588
  • 7
  • 13
  • Hi, thanks for the answer. However, I am almost positive that this does not work in general. When you use the `df.index`, it uses the entire dataframe, not the selection made by the loc. So although in this example it does not make a difference (sorry for choosing a bad one..), in general you would need the index order, not the index itself. – My Work Jun 10 '21 at 08:00
  • Yep, that workaround works. It's nice, I like it. I'd still wait with accepting your answer as the answer if someone would come with an explanation for why it's happening, if they'd have more detailed then yours. But thanks! – My Work Jun 10 '21 at 08:20
  • added some explanation – Alexander Volkovsky Jun 10 '21 at 08:43
0

In your case, you can define the columns to impute

Let's suppose the following dataset

df = pd.DataFrame(data={'State':[1,2,3,4,5,6, 7, 8, 9, 10], 
                         'Sno Center': ["Guntur", "Nellore", "Visakhapatnam", "Biswanath", "Nellore", "Guwahati", "Nellore", "Numaligarh", "Sibsagar", "Munger-Jamalpu"], 
                         'Mar-21': [121, 118.8, 131.6, 123.7, 127.8, 125.9, 114.2, 114.2, 117.7, 117.7],
                         'Apr-21': [121.1, 118.3, 131.5, 124.5, 128.2, 128.2, 115.4, 115.1, 117.3, 118.3]})
df
    State   Sno Center      Mar-21  Apr-21
0   1       Guntur          121.0   121.1
1   2       Nellore         118.8   118.3
2   3       Visakhapatnam   131.6   131.5
3   4       Biswanath       123.7   124.5
4   5       Nellore         127.8   128.2
5   6       Guwahati        125.9   128.2
6   7       Nellore         114.2   115.4
7   8       Numaligarh      114.2   115.1
8   9       Sibsagar        117.7   117.3
9   10      Munger-Jamalpu  117.7   118.3

So, I would like to change to 0 all dates where Sno Center is equals to Nellore

mask = df["Sno Center"] == "Nellore"
df.loc[mask, ["Mar-21", "Apr-21"]] = 0

The result

df
State   Sno Center      Mar-21  Apr-21
0   1   Guntur          121.0   121.1
1   2   Nellore         0.0     0.0
2   3   Visakhapatnam   131.6   131.5
3   4   Biswanath       123.7   124.5
4   5   Nellore         0.0     0.0
5   6   Guwahati        125.9   128.2
6   7   Nellore         0.0     0.0
7   8   Numaligarh      114.2   115.1
8   9   Sibsagar        117.7   117.3
9   10  Munger-Jamalpu  117.7   118.3

Other option is to define the columns as a list

COLS = ["Mar-21", "Apr-21"]
df.loc[mask, COLS] = 0

Other options using iloc

COLS = df.iloc[:, 2:4].columns.tolist()
df.loc[mask, COLS] = 0

Or

df.loc[mask, df.iloc[:, 2:4].columns.tolist()] = 0
Samir Hinojosa
  • 825
  • 7
  • 24
  • Hi, thanks for your answer. This is, however, not answering my problem. A) it solves a different problem (you're not selecting the first `n` dates in the df where `Sno Center` is `Nellore`, B) I'm still missing the explanation for why the chained assignment in my post returns a copy and not a view. – My Work Jun 10 '21 at 08:13
  • That would not work either, it will return the entire df. There is a fix for that, `df.loc[mask.index, ["Mar-21", "Apr-21"]] = 0` ie to use the index, but then it's just a more wordy solution of @Alexander Volkovsky. – My Work Jun 10 '21 at 08:27