remove entries with nan values in python dictionary

Question

I have the following dictionary in Python:

OrderedDict([(30, ('A1', 55.0)), (31, ('A2', 125.0)), (32, ('A3', 180.0)), (43, ('A4', nan))])

Is there a way to remove the entries where any of the values is NaN? I tried this:

{k: dict_cg[k] for k in dict_cg.values() if not np.isnan(k)}

It would be great if the solution works for both Python 2 and Python 3.

@Aemyl, removing the `.values()` does not help. It does not remove the offending entries — user308827, Jun 26 '18 at 05:33
I have to admit that I didn't really read the definition of `dict_cg` and assumed it was just a dict with value `nan` in it — Aemyl, Jun 26 '18 at 05:35
@RoadRunner, in this case the nan is in the value of the dictionary. and the value is a tuple — user308827, Jun 26 '18 at 05:36
@RoadRunner why should this be impossible? I mean maybe you can't use `nan` as a key but certainly as a value — Aemyl, Jun 26 '18 at 05:37

cs95 · Accepted Answer · 2018-06-26T05:40:01.663

Since you have pandas, you can leverage pandas' pd.Series.notnull function here, which works with mixed dtypes.

>>> import pandas as pd
>>> {k: v for k, v in dict_cg.items() if pd.Series(v).notna().all()}
{30: ('A1', 55.0), 31: ('A2', 125.0), 32: ('A3', 180.0)}

This is not part of the answer, but may help you understand how I've arrived at the solution. I came across some weird behaviour when trying to solve this question, using pd.notnull directly.

Take dict_cg[43].

>>> dict_cg[43]
('A4', nan)

pd.notnull does not work.

>>> pd.notnull(dict_cg[43])
True

It treats the tuple as a single value (rather than an iterable of values). Furthermore, converting this to a list and then testing also gives an incorrect answer.

>>> pd.notnull(list(dict_cg[43]))
array([ True,  True])

Since the second value is nan, the result I'm looking for should be [True, False]. It finally works when you pre-convert to a Series:

>>> pd.Series(dict_cg[43]).notnull() 
0     True
1    False
dtype: bool

So, the solution is to Series-ify it and then test the values.

Along similar lines, another (admittedly roundabout) solution is to pre-convert to an object dtype numpy array, and pd.notnull will work directly:

>>> pd.notnull(np.array(dict_cg[43], dtype=object))
array([ True, False])

I imagine that pd.notnull directly converts dict_cg[43] to a string array under the covers, rendering the NaN as a string "nan", so it is no longer a "null" value.
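
The conjecture above is at least consistent with how NumPy itself coerces mixed input. The following is a small sketch that assumes only that NumPy is installed; it does not inspect pandas internals, and the variable names are illustrative.

```python
import math
import numpy as np

mixed = ['A4', float('nan')]

# Without an explicit dtype, NumPy picks a string dtype for str + float input
as_default = np.array(mixed)
default_kind = as_default.dtype.kind      # 'U' (unicode string)
second_item = str(as_default[1])          # the text 'nan', no longer a float

# With dtype=object the original Python objects are kept as they are
as_object = np.array(mixed, dtype=object)
still_nan = math.isnan(as_object[1])      # True
```

Once the NaN has become the text 'nan', a null check has nothing left to detect, which would explain why the object-dtype route behaves differently.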

This is a clean answer, although the original code had Numpy, but not Pandas. For a situation where only Numpy is required, this is a bit heavy. — Grismar, Jun 26 '18 at 06:10
@Grismar I see where you're coming from, and yes, a clean answer requires some sacrifices. The alternative would've been nested iteration, as shown by the answer below. — cs95, Jun 26 '18 at 06:19

Ash Sharma · Answer 2 · 2018-06-28T04:25:51.950

4

This should work:

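import numpy as np  # dict_cg is the OrderedDict from the question

# Iterate over a copy of the items so popping is safe in Python 2 and 3
for k, v in list(dict_cg.items()):
    if np.isnan(v[1]):
        dict_cg.pop(k)
print(dict_cg)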

Output:

OrderedDict([(30, ('A1', 55.0)), (31, ('A2', 125.0)), (32, ('A3', 180.0))])

edited Jun 28 '18 at 04:25

answered Jun 26 '18 at 05:44

It checks for `nan` values in the dictionary and `pop`s the key out. – Ash Sharma Jun 26 '18 at 05:47
No, it doesn't. At least, it isn't guaranteed to. – cs95 Jun 26 '18 at 05:48
I know it'll work. But it isn't guaranteed to work. Doing an in check with nan isn't always guaranteed because the comparison is on the id, not the value. Two NaN values do not necessarily have to have the same id. – cs95 Jun 26 '18 at 05:50
I tried it on my machine and it did. I have posted the output as well. Could you please post the output from your attempt.? – Ash Sharma Jun 26 '18 at 05:50
Read my comment above. I said it will work, but it isn't guaranteed to work all the time. All I'm trying to say is using NaN for performing in checks is _bad form_. – cs95 Jun 26 '18 at 05:51
@coldspeed, "NaN for performing in checks is bad form" any reference to that please? just curious to learn. – Sufiyan Ghori Jun 26 '18 at 05:59
@SufiyanGhori check this : https://stackoverflow.com/questions/20320022/why-in-numpy-nan-nan-is-false-while-nan-in-nan-is-true – Ash Sharma Jun 26 '18 at 06:00
@coldspeed thank you for that and educating me about comparison with `NaNs` – Ash Sharma Jun 26 '18 at 06:03
Ash: A little code review: the OD value is a (label, value) tuple and you are checking both for NaN - doesn't make sense and doesn't actually work. If I copy this into a 2.7.15 python (2.x indicated by your `print` stmt), I get: `TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''`. If I change from `numpy.isnan` to `math.isnan`, I get: `TypeError: a float is required`. – Steve Tarver Jun 26 '18 at 15:39
@SteveTarver I have modified the code to remove this error. At this point, your answer is more pythonic :) – Ash Sharma Jun 27 '18 at 12:08
@Ash Sharma, your answer is the only one that actually answers the user's question: to remove offending entries. One usually chooses an OrderedDict for a reason and both coldspeed and I convert that OrderedDict to a dict during processing. Btw, your indention is off. – Steve Tarver Jun 27 '18 at 15:10
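
The point raised in the comments above about `in` checks with NaN can be illustrated with a small sketch using only the standard library. It relies on the documented behaviour of built-in containers, where `in` checks identity before equality; the variable names are illustrative.

```python
import math

nan_a = float("nan")
nan_b = float("nan")  # a second, distinct NaN object

# NaN never compares equal, not even to itself
equal_to_itself = (nan_a == nan_a)        # False

# `in` tries identity first, so the very same object is found...
same_object_found = (nan_a in [nan_a])    # True

# ...but a different NaN object is not
other_object_found = (nan_b in [nan_a])   # False

# math.isnan tests the value itself and works for any NaN object
detected = [math.isnan(x) for x in (nan_a, nan_b)]  # [True, True]
```

This is why an explicit `isnan` test is the safer choice than a membership test when filtering NaN values.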

Steve Tarver · Answer 3 · 2018-06-29T16:06:23.623

user308827,

The code in your question seems to confuse keys and values and ignore the fact that your values are tuples. Here's a one-liner using the standard library and a dict comprehension that works in Python 2 and 3:

from collections import OrderedDict
import math

od = OrderedDict([(30, ('A1', 55.0)), (31, ('A2', 125.0)), (32, ('A3', 180.0)), (43, ('A4', float('Nan')))])

no_nans = OrderedDict({k:v for k, v in od.items() if not math.isnan(v[1])})
# OrderedDict([(30, ('A1', 55.0)), (31, ('A2', 125.0)), (32, ('A3', 180.0))])
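
One caveat: on Python versions before 3.7, a plain dict does not guarantee insertion order, so routing the entries through a dict comprehension may lose the original ordering. A minimal sketch of an order-preserving variant, assuming the same data as above, passes a generator of `(key, value)` pairs to `OrderedDict` instead. The names `od` and `no_nans` follow the snippet above.

```python
from collections import OrderedDict
import math

od = OrderedDict([(30, ('A1', 55.0)), (31, ('A2', 125.0)),
                  (32, ('A3', 180.0)), (43, ('A4', float('nan')))])

# Feed (key, value) pairs straight into OrderedDict; no intermediate dict is built
no_nans = OrderedDict((k, v) for k, v in od.items() if not math.isnan(v[1]))
```

Like the original, this builds a new OrderedDict and leaves `od` untouched.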

score 1 · Answer 4 · answered Jun 26 '18 at 06:04

Your original code didn't actually have pandas and importing it just to filter for NaN seems excessive. However, your code was using numpy (np).

Assuming your first line should read:

dict_cg = OrderedDict([(30, ('A1', 55.0)), (31, ('A2', 125.0)), (32, ('A3', 180.0)), (43, ('A4', np.nan))])

This line is close to what you had and works, although it requires importing the standard-library module `numbers`:

OrderedDict([(k, vs) for k, vs in dict_cg.items() if not any(isinstance(v, numbers.Number) and np.isnan(v) for v in vs)])

This way, you don't need pandas, your result is still an OrderedDict (as you had before), and you don't run into problems with the strings in the tuples, since the operands of `and` are evaluated left to right and `np.isnan` is skipped for non-numbers.
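
If NumPy is not otherwise needed, the same short-circuit idea can be sketched with the standard library alone. This swaps `np.isnan` for `math.isnan` and checks against `numbers.Real` rather than `numbers.Number` (both are substitutions, not part of the answer above); the helper name `has_nan` is hypothetical.

```python
from collections import OrderedDict
import math
import numbers

def has_nan(values):
    """Return True if any real-number element of `values` is NaN."""
    # isinstance runs first, so math.isnan is never called on a string
    return any(isinstance(v, numbers.Real) and math.isnan(v) for v in values)

dict_cg = OrderedDict([(30, ('A1', 55.0)), (31, ('A2', 125.0)),
                       (32, ('A3', 180.0)), (43, ('A4', float('nan')))])

# Keep only entries whose tuple contains no NaN, preserving order
filtered = OrderedDict((k, vs) for k, vs in dict_cg.items() if not has_nan(vs))
```

`numbers.Real` is used here because `math.isnan` does not accept complex numbers.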

Note that in Python 2, `.items()` also works but builds a full list; `.iteritems()` avoids that copy. — Grismar, Jun 26 '18 at 06:05
