Drop row if column entry contains NaN

Question

I have a series s whose entries are lists, for example [1, 2, 3, NaN, NaN] or [4, 5]. These lists may contain NaNs as the last few elements, and I want to drop every entry in the series that contains a NaN. So far I have used s.transform(lambda x: np.nan if np.isnan(x).any() else x).dropna(), but this takes over a minute on just 21 million rows, and I eventually plan to do this with tens of billions of rows, so I need something fast. Thank you!

To emphasize, each entry in the series is a list, so I cannot just use s.dropna(): no entry is itself NaN, since the entries are lists. I want to delete the lists (entries) that CONTAIN NaN. This is what the series s might look like: pd.Series([[1, 2, 3, NaN, NaN], [4, 5], ...]).

Added an updated solution after your update to the question. Does that answer your question? – Naveed, Jun 20 '22 at 20:00

score 1 · Accepted Answer · answered Jun 20 '22 at 20:06

You can explode the series, collect the index labels of the elements that are NaN, and then keep only the rows of the original series whose label is not among them:

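import numpy as np
import pandas as pd

ser = pd.DataFrame(data={"col": [[1, 2, 3, np.nan, np.nan], [3, 4, 5], [3, 9], [np.nan, 10]]})['col']

# Explode to one row per list element, then keep rows whose label has no NaN element
ser_exploded = ser.explode()
ser[~ser.index.isin(np.unique(ser_exploded[ser_exploded.isna()].index))]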

--------------------------------------
1    [3, 4, 5]
2       [3, 9]
Name: col, dtype: object
--------------------------------------
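The same idea can be wrapped in a small helper. This is only a sketch: the function name `drop_lists_with_nan` is made up for illustration, and it assumes the index labels are unique so that a label identifies exactly one list. Note that `explode` turns an empty list into a single NaN row, so empty lists would be dropped as well.

```python
import numpy as np
import pandas as pd


def drop_lists_with_nan(ser: pd.Series) -> pd.Series:
    """Hypothetical helper: keep only entries whose list contains no NaN."""
    # One row per list element; each element keeps the label of its source list
    exploded = ser.explode()
    # Labels of the lists that hold at least one NaN element
    bad_labels = exploded.index[exploded.isna().to_numpy()].unique()
    # Keep the rows whose label is not among the bad ones
    return ser[~ser.index.isin(bad_labels)]


ser = pd.Series([[1, 2, 3, np.nan, np.nan], [3, 4, 5], [3, 9], [np.nan, 10]], name="col")
result = drop_lists_with_nan(ser)
print(result)
```

With the example data above, only the rows labelled 1 and 2 should remain, matching the output shown in the answer.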

Corralien · Answer 2 · 2022-06-20T20:18:58.117

An alternative with multiprocessing:

import pandas as pd
import numpy as np
import multiprocessing as mp
import time

def check_nan(s):
    return s.explode().isna().groupby(level=0).max()

if __name__ == '__main__':  # Do not remove this line! Mandatory
    # Setup a minimal reproducible example
    N = 10_000_000
    s = pd.Series([[1, 2, 3, np.nan, np.nan], [4, 5]]).repeat(N)  # np.NaN was removed in NumPy 2.0
    s = s.sample(frac=1, ignore_index=True)

    CHUNKSIZE = 10_000
    start = time.time()
    with mp.Pool(mp.cpu_count() - 1) as p:
        results = p.map(check_nan, (s[i:i+CHUNKSIZE] for i in range(0, len(s), CHUNKSIZE)))
    m = pd.concat(results)
    s = s[~m]
    end = time.time()
    print(f"Elapsed time: {end - start:.2f} seconds")

For 20,000,000 records:

[...]$ python mp.py
Elapsed time: 1.58 seconds

Note: without mp, the execution time is 6.07 seconds:

start = time.time()
m = s.explode().isna().groupby(level=0).max()
s1 = s[~m]
end = time.time()
print(f"Elapsed time: {end - start:.2f} seconds")

How would I use the multiprocessed version in a Jupyter notebook? Can't seem to get the `if __name__` part to work in a cell where the variables are all already loaded in. — Tanishq Kumar, Jun 23 '22 at 20:56
Refer to this [answer](https://stackoverflow.com/a/62935690/15239951). Replace `if __name__ == '__main__'` with a function. – Corralien, Jun 24 '22 at 03:39
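A rough sketch of what "a function instead of `if __name__ == '__main__'`" could look like follows. The wrapper name `drop_nan_lists` and its `processes` parameter are assumptions for illustration, not part of the original answer. On platforms where multiprocessing starts workers with the spawn method, a worker function defined directly in a notebook cell usually cannot be found by the worker processes, so saving these two functions to a module (say, a hypothetical `nan_filter.py`) and importing them into the notebook is likely to be necessary. The default `processes=1` runs serially so that the sketch also works without any worker processes.

```python
import multiprocessing as mp

import numpy as np
import pandas as pd


def check_nan(chunk: pd.Series) -> pd.Series:
    """Boolean mask per entry: True if the list contains at least one NaN."""
    return chunk.explode().isna().groupby(level=0).max()


def drop_nan_lists(s: pd.Series, chunksize: int = 10_000, processes: int = 1) -> pd.Series:
    """Hypothetical wrapper: filter `s` chunk by chunk, optionally in parallel.

    Assumes `s` has a unique index, as in the original answer.
    """
    chunks = [s.iloc[i:i + chunksize] for i in range(0, len(s), chunksize)]
    if not chunks:
        return s  # nothing to filter
    if processes > 1:
        with mp.Pool(processes) as pool:
            masks = pool.map(check_nan, chunks)
    else:
        masks = [check_nan(c) for c in chunks]  # serial path, no worker processes
    mask = pd.concat(masks).astype(bool)
    return s[~mask]


s = pd.Series([[1, 2, 3, np.nan, np.nan], [4, 5]]).repeat(3).reset_index(drop=True)
result = drop_nan_lists(s, chunksize=2)
print(result)
```

In a notebook, the call would then be something like `drop_nan_lists(s, processes=mp.cpu_count() - 1)` after importing the functions from the module.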

score 0 · Naveed · Answer 3 · 2022-06-20T19:59:05.317

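You can convert the series into a DataFrame, explode it, drop all NaN values, and finally aggregate the remaining elements back into lists, as in the original series. Note that this removes the NaN elements from each list rather than dropping the whole entry: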

s.to_frame().reset_index().explode(0).dropna().groupby('index')[0].agg(list)

INPUT

0    [1, 2, 3, nan, nan]
1                 [4, 5]

RESULT

0    [1, 2, 3]
1       [4, 5]

edited Jun 20 '22 at 19:59

answered Jun 20 '22 at 19:37

Naveed


Updated question to be clearer. – Tanishq Kumar Jun 20 '22 at 19:38
@TanishqKumar, I updated the response. Does it work for you? – Naveed Jun 20 '22 at 20:01
@TanishqKumar, do you need to drop the NaN values or the whole entry? – Naveed Jun 20 '22 at 20:04
The whole entry! – Tanishq Kumar Jun 23 '22 at 19:43

score 0 · Answer 4 · 2022-06-20T20:14:53.423

If I understood your question correctly, filtering the rows with a mask that tests each list for NaN values should work:

import pandas as pd
from numpy import nan as NaN
s = pd.Series([[1, 2, 3, NaN, NaN], [4, 5]])
s = s[~s.apply(lambda list1: any(pd.isna(x) for x in list1))]
print(s)

1    [4, 5]
dtype: object
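The question states that NaNs only appear as the last few elements of a list. If that really holds for every entry (an assumption taken from the question, not something the code verifies), it would be enough to inspect only the last element of each list instead of scanning all of them. A minimal sketch, with the helper name `ends_with_nan` invented for illustration:

```python
import math

import numpy as np
import pandas as pd


def ends_with_nan(lst) -> bool:
    """True if the last element is a float NaN; empty lists count as clean."""
    if len(lst) == 0:
        return False
    last = lst[-1]
    # Assumes NaN is stored as a Python float or np.float64 (both pass isinstance float)
    return isinstance(last, float) and math.isnan(last)


s = pd.Series([[1, 2, 3, np.nan, np.nan], [4, 5], [], [7.0]])
# Build the boolean mask in one pass, without creating intermediate Series objects
keep = np.fromiter((not ends_with_nan(x) for x in s), dtype=bool, count=len(s))
clean = s[keep]
print(clean)
```

Unlike the `explode`-based approaches, this variant keeps empty lists, and it would miss a NaN that sits anywhere other than the end of a list.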

edited Jun 20 '22 at 20:14

answered Jun 20 '22 at 19:54

score 0 · Answer 5 · answered Jun 20 '22 at 19:56

Assuming this input:

import pandas as pd
from numpy import nan as NaN  # numpy.NaN was removed in NumPy 2.0
s = pd.Series([[1, 2, 3, NaN, NaN], [4, 5]])

You can use:

s2 = s[s.explode().notna().groupby(level=0).all()]

Or, with a list comprehension:

s2 = s[[pd.Series(x).notna().all() for x in s]]

output:

1    [4, 5]
dtype: object

mozway


The first solution gives the error `IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match)`. – Tanishq Kumar Jun 21 '22 at 12:43
@TanishqKumar can you provide a reproducible example that raises this error? – mozway Jun 21 '22 at 14:38
