Given a pandas Series, I want to fill the NaNs with zero, but only if all of the values are NaN or all of the values are either zero or NaN.

For example, I would want to fill the NaNs in the following Series with zeroes.

0       0
1       0
2       NaN
3       NaN
4       NaN
5       NaN
6       NaN
7       NaN
8       NaN

But I would not want to call fillna(0) on the following Series:

0       0
1       0
2       2
3       0
4       NaN
5       NaN
6       NaN
7       NaN
8       NaN

I was looking at the documentation and it seems like I could use pandas.Series.value_counts to verify that the values are only 0 and NaN, and then simply call fillna(0). In other words, I want to check whether set(s.unique().astype(str)).issubset(['0.0','nan']); if so, call fillna(0), otherwise leave the Series as it is.

Considering how powerful pandas is, it seemed like there might be a better way to do this. Does anyone have suggestions for doing this cleanly and efficiently?

Potential solution, thanks to cᴏʟᴅsᴘᴇᴇᴅ:

if s.dropna().eq(0).all():
    s = s.fillna(0)
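
For reference, here is that check applied to the two example Series above - a quick sketch, assuming pandas and numpy are imported as pd and np:

import numpy as np
import pandas as pd

s1 = pd.Series([0, 0] + [np.nan] * 7)        # all values are 0 or NaN
s2 = pd.Series([0, 0, 2, 0] + [np.nan] * 5)  # contains a non-zero value

if s1.dropna().eq(0).all():
    s1 = s1.fillna(0)    # condition is True, so the NaNs become 0

if s2.dropna().eq(0).all():
    s2 = s2.fillna(0)    # condition is False, so s2 is left untouched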
Lien

2 Answers

You can check whether every value is either 0 or NaN by combining a comparison with 0 and isna, and then replace the Series with zeros:

if ((s == 0) | (s.isna())).all():
    s = pd.Series(0, index=s.index)

Or compare unique values:

if pd.Series(s.unique()).fillna(0).eq(0).all():
    s = pd.Series(0, index=s.index)

@cᴏʟᴅsᴘᴇᴇᴅ's solution (thank you) - compare the Series with the NaNs dropped, using dropna:

if s.dropna().eq(0).all():
    s = pd.Series(0, index=s.index)

The solution from the question - the values need to be converted to strings, because NaN does not compare equal to anything, including itself, so comparing the raw floats is unreliable:

if set(s.unique().astype(str)).issubset(['0.0','nan']):
    s = pd.Series(0, index=s.index)
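
A quick illustration of why the raw float values are awkward to compare directly, assuming numpy is imported as np:

import numpy as np

print(np.nan == np.nan)  # False - NaN never compares equal, not even to itself
print(str(np.nan))       # 'nan' - as strings, every NaN maps to the same token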

Timings:

import numpy as np
import pandas as pd

s = pd.Series(np.random.choice([0, np.nan], size=10000))

In [68]: %timeit ((s == 0) | (s.isna())).all()
The slowest run took 4.85 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 574 µs per loop

In [69]: %timeit pd.Series(s.unique()).fillna(0).eq(0).all()
1000 loops, best of 3: 587 µs per loop

In [70]: %timeit s.dropna().eq(0).all()
The slowest run took 4.65 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 774 µs per loop

In [71]: %timeit set(s.unique().astype(str)).issubset(['0.0','nan'])
The slowest run took 5.78 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 157 µs per loop
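
If you need this check in more than one place, you can wrap it in a small helper - the function name here is only illustrative:

import pandas as pd

def fill_if_only_zero_or_nan(s):
    # Fill NaN with 0 only when every non-NaN value is 0
    # (this also covers an all-NaN Series); otherwise return s unchanged.
    if s.dropna().eq(0).all():
        return s.fillna(0)
    return s

s = fill_if_only_zero_or_nan(s)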
jezrael
  • I was looking to only fillna(0) if ALL the values in the series are either NaN or 0. – Lien Apr 23 '18 at 05:27
  • @Lien - Yes, so `fillna(0)` is what you need. But I'm not sure I understand - what should happen if a Series contains values other than `0` and `NaN`? – jezrael Apr 23 '18 at 05:28
  • I think he's trying to say that if s.unique() == [nan, 0], THEN fillna(0), otherwise no – cs95 Apr 23 '18 at 05:29
  • 1
    @Lien you may try something like `if s.dropna().eq(0).all(): s = s.fillna(0)` or something along those lines – cs95 Apr 23 '18 at 05:31

Create a boolean mask of the null values. Check whether all of the values are null (in which case the series is all NaN or empty) or whether the non-nulls are all equal to zero. If so, create a new series of zero values using the original index from the series.

nulls = s.isnull()
if nulls.all() or s[~nulls].eq(0).all():
    s = pd.Series(0, index=s.index)
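
For example, both branches of the condition on small illustrative Series, assuming numpy is imported as np:

import numpy as np
import pandas as pd

s_all_nan = pd.Series([np.nan] * 3)
print(s_all_nan.isnull().all())      # True -> replace with zeros

s_mixed = pd.Series([0, 2, np.nan])
nulls = s_mixed.isnull()
print(s_mixed[~nulls].eq(0).all())   # False -> leave the Series alone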

TIMINGS

%%timeit s_ = pd.concat([s] * 100000)
nulls = s_.isnull()
if len(nulls) == len(s_) or s_[~nulls].eq(0).all():
    s_ = pd.Series(0, index=s_.index)
# 100 loops, best of 3: 2.33 ms per loop

# OP's solution:
%%timeit s_ = pd.concat([s] * 100000)
if s_.dropna().eq(0).all():
    s_ = s_.fillna(0)
# 10 loops, best of 3: 19.7 ms per loop

# @Jezrael's fastest solution:
%%timeit s_ = pd.concat([s] * 100000)
if set(s_.unique().astype(str)).issubset(['0.0','nan']):
    s_ = pd.Series(0, index=s_.index)
# 1000 loops, best of 3: 4.58 ms per loop
Alexander