7

For each row that contains NaN values, I am trying to retrieve the indices of the corresponding columns.

import numpy as np
import pandas as pd

d = [[11.4, 1.3, 2.0, np.nan], [11.4, 1.3, np.nan, np.nan],
     [11.4, 1.3, 2.8, 0.7], [np.nan, np.nan, 2.8, 0.7]]
df = pd.DataFrame(data=d, columns=['A', 'B', 'C', 'D'])
print(df)

      A    B    C    D
0  11.4  1.3  2.0  NaN
1  11.4  1.3  NaN  NaN
2  11.4  1.3  2.8  0.7
3   NaN  NaN  2.8  0.7

I've already done the following (sketched below):

  • add a column with the count of NaN for each row
  • get the indices of each row containing NaN values
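
A minimal sketch of those two steps (the variable names are just illustrative):

nan_counts = df.isnull().sum(axis=1)           # count of NaN values per row
nan_rows = df[df.isnull().any(axis=1)].index   # indices of the rows containing NaN
print(list(nan_rows))                          # [0, 1, 3]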

What I want (ideally the column names) is a list like this:

[['D'], ['C', 'D'], ['A', 'B']]

I hope I can find a way that avoids testing each column of each row, like:

if pd.isnull(df.loc[i, column]):

I'm looking for a pandas way that can handle my huge dataset.

Thanks in advance.

dooms
  • I already found the indices of each row containing NaN values. What I want is the indices of the columns for each row. Sorry if the description was unclear. – dooms Nov 10 '15 at 22:59

5 Answers

5

It should be efficient to use a scipy coordinate-format sparse matrix to retrieve the coordinates of the null values:

import scipy.sparse as sp

x, y = sp.coo_matrix(df.isnull()).nonzero()
print(list(zip(x, y)))

[(0, 3), (1, 2), (1, 3), (3, 0), (3, 1)]

Note that I call the nonzero method just to output the coordinates of the nonzero entries in the underlying sparse matrix, since I don't care about the actual values (they are all True).
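
If you then want the column names per row, as in the question, here is one possible way to map the coordinates back through df.columns (the names dict is just an illustrative helper):

from collections import defaultdict

names = defaultdict(list)
for i, j in zip(x, y):
    names[i].append(df.columns[j])
print([names[i] for i in sorted(names)])   # [['D'], ['C', 'D'], ['A', 'B']]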

maxymoo
  • This solution is 2 times faster (than @Alexander's), but I don't know how to navigate this kind of data; it's not as easy as a list. – dooms Nov 10 '15 at 23:29
3

Another way: extract the entries that are NaN:

In [11]: df_null = df.isnull().unstack()

In [12]: t = df_null[df_null]

In [13]: t
Out[13]:
A  3    True
B  3    True
C  1    True
D  0    True
   1    True
dtype: bool

This gets you most of the way and may be enough. It may be easier, though, to work with it as a Series:

In [14]: s = pd.Series(t.index.get_level_values(0), t.index.get_level_values(1)).sort_index()

In [15]: s
Out[15]:
0    D
1    C
1    D
3    A
3    B
dtype: object

e.g. if you wanted the lists (though I don't think you would need them):

In [16]: s.groupby(level=0).apply(list)
Out[16]:
0       [D]
1    [C, D]
3    [A, B]
dtype: object
Andy Hayden
3

Another, simpler way is:

>>> df.isnull().any(axis=1)
0     True
1     True
2    False
3     True
dtype: bool

To subset:

>>> bool_idx = df.isnull().any(axis=1)
>>> df[bool_idx]
      A    B    C    D
0  11.4  1.3  2.0  NaN
1  11.4  1.3  NaN  NaN
3   NaN  NaN  2.8  0.7

To get the integer index:

>>> df[bool_idx].index
Int64Index([0, 1, 3], dtype='int64')
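
If you also need the column names per row, a possible extension of this approach (not part of the original answer) is:

nan_cols = df[bool_idx].apply(lambda row: row[row.isnull()].index.tolist(), axis=1)
print(nan_cols.tolist())   # [['D'], ['C', 'D'], ['A', 'B']]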
muon
1

You can iterate through each row in the dataframe, create a mask of its null values, and output their index (i.e. the column names).

lst = []
for _, row in df.iterrows():
    mask = row.isnull()                    # boolean mask of the NaN positions in this row
    lst.append(row[mask].index.tolist())   # column names where the mask is True

>>> lst
[['D'], ['C', 'D'], [], ['A', 'B']]
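
Note that this keeps an empty list for rows without NaN (row 2 here); to match the exact output in the question, you could filter those out:

lst = [cols for cols in lst if cols]   # [['D'], ['C', 'D'], ['A', 'B']]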
Alexander
0

Try using:

s = df.isna().any()

This returns a Boolean Series indicating which columns contain NaN values; its index is the column names.

Then you can retrieve the NaN column names with:

s[s].index
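
Applied to the question's DataFrame, every column is flagged, since any() works per column (axis=0) here, not per row:

print(s[s].index.tolist())   # ['A', 'B', 'C', 'D']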
MoussabOr