I am doing a Kaggle tutorial for Titanic using the Datacamp platform.

I understand the use of .loc within Pandas - to select values by row using column labels...

My confusion comes from the fact that in the Datacamp tutorial, we want to locate all the "male" entries within the "Sex" column and replace them with the value 0. They use the following piece of code to do it:

titanic.loc[titanic["Sex"] == "male", "Sex"] = 0

Can someone please explain how this works? I thought .loc took inputs of row and column, so what is the == for?

Shouldn't it be:

titanic.loc["male", "Sex"] = 0

Thanks!

fashioncoder

1 Answer

The assignment sets column Sex to 0 only in rows where the following condition is True; all other values are left untouched:

titanic["Sex"] == "male"

Sample:

titanic = pd.DataFrame({'Sex':['male','female', 'male']})
print (titanic)
      Sex
0    male
1  female
2    male

print (titanic["Sex"] == "male")
0     True
1    False
2     True
Name: Sex, dtype: bool

titanic.loc[titanic["Sex"] == "male", "Sex"] = 0
print (titanic)
      Sex
0       0
1  female
2       0

This is boolean indexing with loc: it selects only the values of column Sex in rows where the condition is True:

print (titanic.loc[titanic["Sex"] == "male", "Sex"])
0    male
2    male
Name: Sex, dtype: object
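
To connect this with the "row and column" view of loc from the question, here is a small sketch showing that the boolean mask picks the same rows as an explicit list of index labels. The variable names (mask, labels, by_mask, by_labels) are just for illustration, and dtype=object is an assumption added here so the column holds plain Python strings on any pandas version:

```python
import pandas as pd

# Same toy frame as in the sample; dtype=object is an assumption for this sketch.
titanic = pd.DataFrame({"Sex": ["male", "female", "male"]}, dtype=object)

mask = titanic["Sex"] == "male"   # boolean Series, one True/False per row
labels = titanic.index[mask]      # index labels of the rows where mask is True

by_mask = titanic.loc[mask, "Sex"]      # rows chosen by the boolean mask
by_labels = titanic.loc[labels, "Sex"]  # rows chosen by explicit labels

print(list(labels))
print(by_mask)
print(by_labels)
```

So the == does not replace the row argument of loc; it builds the row selector that loc then uses.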

But I think map is the better choice here if only the male and female values need to be converted to other values:

titanic = pd.DataFrame({'Sex':['male','female', 'male']})
titanic["Sex"] = titanic["Sex"].map({'male':0, 'female':1})
print (titanic)
   Sex
0    0
1    1
2    0
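
One thing to keep in mind with map, shown as a minimal sketch: any value that has no key in the dictionary becomes NaN. The value "unknown" below is invented for the example and is not part of the Titanic data:

```python
import pandas as pd

# Hypothetical data: "unknown" has no entry in the mapping below.
sex = pd.Series(["male", "female", "unknown"], name="Sex")

mapped = sex.map({"male": 0, "female": 1})
print(mapped)

# Values missing from the dict come back as NaN.
n_unmapped = int(mapped.isna().sum())
print(n_unmapped)
```

If the column can contain other values, it is worth checking for NaN after mapping.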

EDIT:

Primarily, loc is used to set new values by index and column labels:

titanic = pd.DataFrame({'Sex':['male','female', 'male']}, index=['a','b','c'])
print (titanic)
      Sex
a    male
b  female
c    male

titanic.loc["a", "Sex"] = 0
print (titanic)
      Sex
a       0
b  female
c    male

titanic.loc[["a", "b"], "Sex"] = 0
print (titanic)
    Sex
a     0
b     0
c  male
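
For the version proposed in the question, titanic.loc["male", "Sex"], here is a rough sketch of what happens as far as I know: loc looks for "male" among the index labels, not among the values of the column. The lookup_failed flag is only a helper name for this example, and dtype=object is an assumption added so the column can hold both strings and integers:

```python
import pandas as pd

# Default integer index 0, 1, 2; dtype=object is an assumption for this sketch.
titanic = pd.DataFrame({"Sex": ["male", "female", "male"]}, dtype=object)

# Reading: "male" is searched in the index, where it does not exist.
try:
    titanic.loc["male", "Sex"]
    lookup_failed = False
except KeyError:
    lookup_failed = True
print(lookup_failed)

# Writing: the unknown label is treated as a new row to add,
# so the existing "male" rows are not changed.
titanic.loc["male", "Sex"] = 0
print(titanic)
```

That is why the condition titanic["Sex"] == "male" is needed as the row selector.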
jezrael