
I see that there are many questions regarding str.contains and np.where, so sorry if this is a duplicate. I have just lost track of them.

Why does str.contains inside np.where produce a positive result when it is applied to np.NaN? (I get a 1, as if the string contained the search term.)

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['Mouse', 'dog', 'cat', '23', np.nan]})
df['B'] = np.where(df.A.str.contains('og'), 1, 0)
print(df)

        A  B
 0  Mouse  0
 1    dog  1
 2    cat  0
 3     23  0
 4    NaN  1

I know that I can get the right result by passing na=False as an argument to str.contains. I am just wondering about the behavior and want to understand why the result comes out like that.
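For reference, the na=False workaround mentioned above can be sketched as follows. The variable names (s, mask, flags) are only illustrative, and dtype=object is set explicitly as an assumption so that the column behaves like the one in the question.

```python
import numpy as np
import pandas as pd

# Same values as column A in the question; object dtype is an assumption
# made here so the missing entry is a plain float NaN.
s = pd.Series(['Mouse', 'dog', 'cat', '23', np.nan], dtype=object)

# na=False fills the missing entry with False instead of leaving NaN,
# so the mask contains only real booleans.
mask = s.str.contains('og', na=False)

flags = np.where(mask, 1, 0)
print(flags.tolist())
```

With the NaN replaced by False in the mask, the last row should get a 0 rather than a 1.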

ZetDen

1 Answer


The reason is that df.A.str.contains('og') evaluates to np.nan for the NaN entry, and np.nan is Trueish (truthy): it counts as True in a boolean context. You can try that like

if np.nan:
    print("This gets printed")

As np.where returns 1 in your case whenever the given condition evaluates to True, you get back a 1 where you have a NaN in the input.
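A small sketch of the same effect without pandas, assuming the mask is an object array holding False, True and NaN (the names truthy, equal_to_true, cond and result are made up for this illustration):

```python
import numpy as np

nan = float('nan')

# NaN is a nonzero float, so it is truthy in a boolean context ...
truthy = bool(nan)
# ... but any equality comparison involving NaN is False.
equal_to_true = (nan == True)

# An object array shaped like the mask from the question.
cond = np.array([False, True, nan], dtype=object)

# np.where picks 1 wherever the element is truthy, including the NaN.
result = np.where(cond, 1, 0)
print(truthy, equal_to_true, result.tolist())
```

The NaN element ends up in the "true" branch because np.where looks at truthiness, not at equality with True.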

Simon Hawe
  • I don't totally agree with your answer. Why does `df.A.str.contains('og') == True` return False for the `NaN` row? – Corralien Jan 25 '22 at 16:58
  • Because it is not equal. `"aa" == True` also returns False, whereas `if "aa": ...` evaluates to true. That's why I wrote "is Trueish" and not "is True". – Simon Hawe Jan 25 '22 at 16:59
  • @Corralien [this QA](https://stackoverflow.com/questions/15686318/why-do-not-a-number-values-equal-true-when-cast-as-boolean-in-python-numpy) gives a bit more info on this topic, I think. Also, `df.A.str.contains('og').astype(bool)` gives `True` for the `NaN` value, which is consistent with the result of `np.where`: it casts the `condition` (here `df.A.str.contains('og')`) to `bool` rather than evaluating `condition==True`. – Ben.T Jan 25 '22 at 17:19
  • Thanks for the link @Ben.T. – Corralien Jan 25 '22 at 17:21
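The difference discussed in the comments (casting to bool versus comparing with True) can be checked with a short sketch like this one. It assumes an object-dtype column, as in the question, so that the mask keeps NaN for the missing row; the names mask, as_bool and eq_true are only for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series(['Mouse', 'dog', 'cat', '23', np.nan], dtype=object)

# Without na=..., the mask for an object column holds False/True/NaN.
mask = s.str.contains('og')

# Casting uses truthiness, so the NaN row becomes True.
as_bool = mask.astype(bool)

# Comparing tests equality, so the NaN row becomes False.
eq_true = (mask == True)

print(as_bool.tolist())
print(eq_true.tolist())
```

Only the last row should differ between the two lists, which matches what np.where does with the uncast mask.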