
Why does

>>> import pandas as pd
>>> import numpy as np

>>> list(pd.Series([np.nan, np.nan, 2, np.nan, 2])) == [np.nan, np.nan, 2, np.nan, 2]

return False? I get the same result with pd.Series([np.nan, np.nan, 2, np.nan, 2]).tolist(). I was trying to count the most common element in a pandas groupby object (so, essentially, a pandas Series) using the following function:

from collections import Counter

def get_most_common(srs):
    """
    Returns the most common value in a Series. For ties, it returns whatever
    value collections.Counter.most_common(1) gives.
    """
    x = list(srs)
    my_counter = Counter(x)
    most_common_value = my_counter.most_common(1)[0][0]

    return most_common_value

and just realized that I get wrong counts for multiple NaNs even though I have the step x = list(srs).

EDIT: Just to clarify why this is an issue for me:

>>> from collections import Counter
>>> Counter(pd.Series([np.nan, np.nan, np.nan, 2, 2, 1, 5]).tolist())
Counter({2.0: 2, nan: 1, nan: 1, nan: 1, 1.0: 1, 5.0: 1}) # each nan is counted separately
>>> Counter([np.nan, np.nan, np.nan, 2, 2, 1, 5])
Counter({nan: 3, 2: 2, 1: 1, 5: 1}) # nan count of 3 is correct
irene
  • Not sure if I understood your question. Is it not faster to do `df.groupby('some col').count()` instead? – r.ook Apr 08 '20 at 14:01
  • Hi @r.ook I edited my question above to clarify why this is an issue for me – irene Apr 08 '20 at 14:05
  • I understood *that part* of the question - and that's already answered. `nan != nan` so you can't quite handle it that way. As for the example in your clarification, try using `list(map(id, ...))` instead of `Counter`, you'll see why. The object reference is the same in the `list`, but when the `pd.Series` is created the `np.nan` are treated as different objects. What I am curious about is what are you actually trying to accomplish, because right now it sounds like an X-Y problem to me. – r.ook Apr 08 '20 at 14:13
  • @r.ook so what pandas does is to have a different reference for each NaN, while for lists, it's the same reference? Is that correct? Which is why converting the Series to a list doesn't convert all the NaNs to the same reference? – irene Apr 08 '20 at 14:17
  • That's right. If you run `pd.Series([np.nan, np.nan, np.nan, 2, 2, 1, 5]).apply(id)` you'll see the reference is already different inside the `Series`, so converting using `to_list()` will carry the difference. It seems the reason is that the built-in `list` uses the imported `np.nan` object reference directly, whereas pandas creates its own copies of the objects for its `Series` - which makes sense for them to be different since `nan != nan`. Hence why it's more important to understand what you're trying to do instead of pigeonholing on the `nan` comparisons. – r.ook Apr 08 '20 at 14:23
  • @r.ook if you can write this as an answer, I'd be very happy to accept it. Thanks. – irene Apr 08 '20 at 14:35
  • Sure, I'll summarize my comments into an answer, make for easier reading. :) – r.ook Apr 08 '20 at 14:45

2 Answers


The root issue, as @emilaz already stated, is that nan != nan in all cases. However, the object reference is what matters in your observation.

Observe the object references in a list versus a pd.Series:

>>> s = pd.Series([np.nan, np.nan, np.nan, 2, 2, 1, 5])
>>> s.apply(id)
0    149706480
1    202463472
2    202462336
3    149706912
4    149706288
5    149708784
6    149707200
dtype: int64

>>> l = [np.nan, np.nan, np.nan, 2, 2, 1, 5]
>>> list(map(id, l))
[68634768, 68634768, 68634768, 1389126848, 1389126848, 1389126832, 1389126896]

In the list, every np.nan shares the reference of the single imported np.nan object, whereas pandas creates a new object for each NaN in the Series (which makes sense for pandas usage).
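
This is also why Counter miscounts: dict lookup (which Counter relies on) checks identity before equality, so repeats of the same nan object collapse into one key, while distinct nan objects stay separate because nan != nan. A minimal illustration with plain floats, no pandas involved:

>>> from collections import Counter
>>> a = float("nan")  # one nan object
>>> b = float("nan")  # a different nan object
>>> Counter([a, a, a])  # same object: the identity check matches, one key
Counter({nan: 3})
>>> Counter([a, b])  # distinct objects: nan != nan, so separate keys
Counter({nan: 1, nan: 1})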

The answer, therefore, is not to compare nan in such fashion. pandas has its own ways of dealing with nan, so depending on your actual goal, there may be a much simpler answer (e.g. df.groupby('some col').count()) than you envisioned. One such route is sketched below.
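
For the counting task in the question, a minimal sketch of one NaN-aware route (Series.mode with dropna=False, also raised in the comments below); this is one possibility, not the only one:

>>> import numpy as np
>>> import pandas as pd
>>> s = pd.Series([np.nan, np.nan, np.nan, 2, 2, 1, 5])
>>> s.mode(dropna=False)  # NaN appears 3 times, so NaN is the mode
0   NaN
dtype: float64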

r.ook
  • Is it not faster to use `srs.mode(dropna=False)`? I got `nan` as expected, and it's much simpler. – r.ook Apr 08 '20 at 15:51
  • I deleted my previous comment. It was wrong. But pd.Series.mode is too slow for me. Hmm. – irene Apr 08 '20 at 15:57
  • How slow is too slow though? I'm not sure if there's a faster method since it's already vectorized IIRC. – r.ook Apr 08 '20 at 15:59
  • I'm trying to have an if-else condition where I use `srs.mode` when NaN is present and `Counter` otherwise. I've been unsuccessful though, even `srs.isnull().values.any()` is taking too much time. – irene Apr 08 '20 at 16:19
  • That's why I said it's an X-Y problem. It's better to just post what you are trying to do as a new question and get that answered there. Let others weigh in how you should take care of `nan`, because it doesn't seem your approach is getting the result you want. – r.ook Apr 08 '20 at 16:28
  • Also to be more specific - with `pandas`, you should always try to use the relevant pandas methods instead of relying on external modules/methods as the former will almost always be faster. – r.ook Apr 08 '20 at 16:30
  • Thanks. Maybe it deserves another question indeed. – irene Apr 08 '20 at 16:32
  • I posted another question plus my attempt. https://stackoverflow.com/questions/61105953/fastest-way-to-get-the-mode-of-a-pandas-series-with-nan – irene Apr 08 '20 at 16:57

In Python, comparing nan for equality always returns False, so the following behavior is expected:

>>> import numpy as np
>>> np.nan == np.nan
False

This is why your list comparison returns False.

A possible workaround would be this:

>>> import pandas as pd
>>> import numpy as np

>>> foo = list(pd.Series([np.nan, np.nan, 2, np.nan, 2]))
>>> bar = [np.nan, np.nan, 2, np.nan, 2]

>>> np.allclose(foo, bar, equal_nan=True)
True

This might interest you: comparing numpy arrays containing NaN.

For finding the most common element, I'd suggest using pandas and the value_counts() method:

>>> pd.Series([np.nan, np.nan, 2, np.nan, 2]).value_counts()
2.0    2
dtype: int64

If you care about nan counts, you can simply pass dropna=False to the method:

>>> pd.Series([np.nan, np.nan, 2, np.nan, 2]).value_counts(dropna=False)
NaN    3
2.0    2
dtype: int64
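
Putting it together, here is a minimal sketch of the question's get_most_common rewritten on top of value_counts (ties resolve to whichever value value_counts happens to list first):

import numpy as np
import pandas as pd

def get_most_common(srs):
    # dropna=False keeps NaN as a single, correctly counted key;
    # value_counts sorts by count in descending order by default,
    # so the first index entry is the most common value.
    return srs.value_counts(dropna=False).index[0]

get_most_common(pd.Series([np.nan, np.nan, np.nan, 2, 2]))  # returns nan
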
emilaz
  • Thanks. My concern though is that it causes a bug in my function. I've rewritten it to be `return srs.value_counts(dropna=False, sort=True, ascending=False).index[0]`, though I'm not sure if this is the fastest way to do it. It does give the correct results, though. – irene Apr 08 '20 at 13:58
  • I had previously tested this function on `[np.nan, np.nan, np.nan, 2, 2]` and it gives the right results (my fault, but still...it's annoying). – irene Apr 08 '20 at 13:59
  • I'd personally go with the np.allclose method, as it's the most readable and intuitive option for readers. If you come back to your code in two months (or someone else does, for that matter), you won't know what that line of code does exactly and whether it's working correctly. – emilaz Apr 08 '20 at 13:59
  • see my edit for an answer to your counting question. – emilaz Apr 08 '20 at 14:06
  • I added an edit to clarify why this is an issue for me. Basically, converting the Series to a list is not equivalent to having a list. The `==` vs. `np.allclose` isn't that important: it's obvious from my bug that the np.nan behaves differently somehow from each other in the Series. The question though is why using Counter with a list is not the same as using Counter on a Series converted to a list. – irene Apr 08 '20 at 14:06