
I am trying to filter out the rows in my dataframe (> 400,000 rows) where the values in one column are None. The goal is to leave my dataframe with only the rows whose value in the 'Column' column is a float. I plan on doing this by passing in an array of booleans, except that I can't construct my array of booleans properly (they all come back True).

When I run the following operation, given a value of i within the df range, the comparison works:

df.loc[i, 'Column'] != None 

The rows that have a value of None in 'Column' give the result False.

But when I run this operation:

df.loc[0:len(df), 'Column'] != None 

The boolean array comes back as all True.

Why is this? Is this a pandas bug? An edge case? Intended behaviour for reasons I don't understand?

I can think of other ways to construct my boolean array, though this seems the most efficient. But it bothers me that this is the result I am getting.
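For concreteness, the pattern I'm aiming for looks roughly like this (a small stand-in for my real dataframe):

import pandas as pd

# Stand-in for my real dataframe; 'Column' mixes floats and None.
df = pd.DataFrame({'Column': [1.5, None, 2.3, None]})

# The plan: build a boolean mask and keep only the rows where it is True.
mask = df.loc[0:len(df), 'Column'] != None
filtered = df[mask]  # but the mask comes back all True, so nothing gets filtered out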

David
  • I think the simplest solution is `df.Column.notnull()`, as `None` is a null value recognized by pandas. Though I'm unsure why the element-wise comparison with `None` fails here. Using `.values` works: `df.Column.values != None` – ALollz Nov 12 '18 at 17:23
  • Can't you make use of `dropna()`? – hootnot Nov 12 '18 at 17:39
  • @hootnot That is what I am using now :) – David Nov 26 '18 at 02:17
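A minimal sketch of the `notnull()` / `dropna()` alternatives suggested in the comments above (the dataframe is made up for illustration):

import pandas as pd

df = pd.DataFrame({'Column': [1.5, None, 2.3, None]})

# Keep rows where 'Column' is not null; pandas treats None as a null value.
kept = df[df['Column'].notnull()]

# Equivalent here: drop rows that have a null in 'Column'.
also_kept = df.dropna(subset=['Column'])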

1 Answer


Here's a reproducible example of what you're seeing:

import pandas as pd

x = pd.Series([1, None, 3, None, None])

print(x != None)

0    True
1    True
2    True
3    True
4    True
dtype: bool

What's not obvious is that, behind the scenes, Pandas converts your series to a numeric dtype and converts those None values to np.nan:

print(x)

0    1.0
1    NaN
2    3.0
3    NaN
4    NaN
dtype: float64

The NumPy array underlying the series can then be held in a contiguous memory block and support vectorised operations. Since np.nan != np.nan is True by design, your Boolean series will contain only True values, even if you were to test against np.nan instead of None.
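To see that in action, here's a quick sketch using the same series as above:

import numpy as np
import pandas as pd

x = pd.Series([1, None, 3, None, None])  # stored as float64, with None converted to NaN

# NaN never compares equal to anything, including itself.
print(np.nan != np.nan)  # True

# So comparing the series against np.nan is just as unhelpful as comparing against None:
print(x != np.nan)       # all True, same as x != None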

For efficiency and correctness, you should use pd.to_numeric with isnull / notnull for checking null values:

print(pd.to_numeric(x, errors='coerce').notnull())

0     True
1    False
2     True
3    False
4    False
dtype: bool
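Applied to the original filtering problem, a minimal sketch might look like this (assuming a dataframe df with a column named 'Column', as in the question):

import pandas as pd

# Small stand-in for the question's dataframe.
df = pd.DataFrame({'Column': [1.5, None, 2.3, None]})

# Keep only the rows whose 'Column' value parses as a number.
mask = pd.to_numeric(df['Column'], errors='coerce').notnull()
df_clean = df[mask]
print(df_clean)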
jpp
  • Though, even if the `Series` isn't initially converted, e.g. `x = pd.Series(['1', None, 'hello', None, None])`, there must still be some conversion happening during the comparison? – ALollz Nov 12 '18 at 17:34
  • @ALollz, Yup, seems so. I haven't dug into the source for `pd.Series.__eq__`; my instinct is there's custom logic & edge cases [which also explains why using a NumPy array for comparisons is faster]. Best to avoid all this and use `pd.to_numeric`. – jpp Nov 12 '18 at 17:36
  • @jpp Thank you very much! This is working well for me :) Regarding pd.to_numeric, why do you use that with 'coerce' rather than 'raise'? Is it because you expect, and are okay with, a few bad values in your rows (but don't want the analysis/application to throw an exception)? If correctness is your main concern, wouldn't you want to use 'raise', since NaN rows will affect your analysis without making you aware of it? – David Nov 19 '18 at 22:38
  • @David, It's up to you. I make the assumption your data is, or should be, clean. If it's not, you can use `'raise'`. – jpp Nov 19 '18 at 23:58
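For reference, a small sketch of the difference between 'coerce' and 'raise' discussed in the comments above (the values here are made up):

import pandas as pd

s = pd.Series(['1', 'not a number', '3'])

# 'coerce' silently turns unparseable values into NaN.
print(pd.to_numeric(s, errors='coerce'))  # 1.0, NaN, 3.0

# 'raise' (the default) throws a ValueError on the first bad value instead.
try:
    pd.to_numeric(s, errors='raise')
except ValueError as exc:
    print('ValueError:', exc)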