
I have a DataFrame containing many NaN values. I want to delete rows that contain too many NaN values; specifically: 7 or more.

I tried using the dropna function several ways but it seems clear that it greedily deletes columns or rows that contain any NaN values.

This question (Slice Pandas DataFrame by Row) shows me that if I can compile a list of the rows that have too many NaN values, I can delete them all with a simple

df.drop(rows)

I know I can count non-null values using the count function, which I could then subtract from the total to get the NaN count (Is there a direct way to count NaN values in a row?). But even so, I am not sure how to write a loop that goes through a DataFrame row by row.

Here's some pseudo-code that I think is on the right track:

### LOOP FOR ADDRESSING EACH row:
    m = total - row.count()
    if (m >= 7):
        df.drop(row)
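
For reference, the counting idea in the pseudo-code can be sketched without an explicit loop. This is a minimal sketch, assuming the goal is to drop rows with 7 or more NaN values; the 9-column frame and the names `nan_counts` and `filtered` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical 9-column frame; the values are illustrative only.
df = pd.DataFrame(np.arange(27, dtype=float).reshape(3, 9),
                  columns=list("abcdefghi"))
df.iloc[0, :7] = np.nan  # row 0: 7 NaN values
df.iloc[1, :6] = np.nan  # row 1: 6 NaN values
# row 2 keeps all 9 values

# Count NaN values per row (axis=1 sums across the columns of each row).
nan_counts = df.isna().sum(axis=1)

# Keep only the rows with fewer than 7 NaN values.
filtered = df[nan_counts < 7]
print(filtered.index.tolist())
```

Boolean indexing returns a new DataFrame and leaves `df` unchanged, so there is no need to collect row labels for `df.drop`.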

I am still new to Pandas so I'm very open to other ways of solving this problem; whether they're simpler or more complex.

smci
Slavatron
  • There is a `thresh` param to specify the minimum number of non-NA values: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html have you tried this? – EdChum Aug 05 '14 at 19:07
  • I had not noticed that, thank you. It suits my needs perfectly. – Slavatron Aug 05 '14 at 19:12
  • df.dropna(thresh=3) was all I needed (there are 9 columns in the dataframe) – Slavatron Aug 05 '14 at 19:25
  • I thought I'd put a dynamic method in my answer for the case where you don't know the number of columns, glad I could help – EdChum Aug 05 '14 at 19:26

2 Answers


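Basically, the way to do this is to determine the number of columns, set the minimum number of non-NaN values, and drop the rows that don't meet this criterion. Dropping rows with 7 or more NaN values means keeping rows with at least len(df.columns) - 6 non-NaN values:

df.dropna(thresh=len(df.columns) - 6)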

See the docs
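
As a quick check of how `thresh` behaves, here is a minimal sketch using a made-up 9-column frame (the column count mentioned in the comments on the question). The name `max_nans` is hypothetical and only stands for the largest number of NaN values a kept row may have.

```python
import numpy as np
import pandas as pd

# Hypothetical 9-column frame filled with ones.
df = pd.DataFrame(np.ones((3, 9)), columns=list("abcdefghi"))
df.iloc[0, :7] = np.nan  # row 0: 7 NaN, 2 non-NaN
df.iloc[1, :6] = np.nan  # row 1: 6 NaN, 3 non-NaN
# row 2: 0 NaN, 9 non-NaN

max_nans = 6  # assumption: rows with 7 or more NaN values should go
thresh = len(df.columns) - max_nans  # minimum non-NaN values to keep a row

# dropna returns a new frame; rows with fewer than `thresh` non-NaN values are dropped.
result = df.dropna(thresh=thresh)
print(thresh, result.index.tolist())
```

With 9 columns this works out to `thresh=3`, so the row with 7 NaN values is dropped and the row with 6 is kept.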

EdChum
  • I had to use len(df.columns) instead of len(df). Worked like a charm. – thecircus Sep 01 '15 at 15:26
  • Doesn't axis=1 tell it to drop columns? At least in my case columns get deleted when I choose axis=1 – xkcd Feb 22 '16 at 17:35
  • @xkcd it depends on the function, in this case it's the opposite – EdChum Feb 22 '16 at 17:48
  • `axis=1` will drop the columns, not the rows. "{0 or ‘index’, 1 or ‘columns’}" straight from the docs. – Paul English Jul 14 '16 at 19:07
  • @PaulEnglish You're correct, I'm not sure if this was due to an error in the docs historically or if I was confusing this with `drop` which does flip the expected meaning of `axis`, will update and thanks for pointing this out – EdChum Jul 15 '16 at 08:46
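
The `axis` behaviour discussed in the comments above can be checked on a small frame. This is only an illustrative sketch with made-up data: in `DataFrame.dropna`, `axis=0` (the default) drops rows and `axis=1` drops columns, and `thresh` then counts non-NaN values along whichever one is being dropped.

```python
import numpy as np
import pandas as pd

# Hypothetical 3x3 frame: column "a" has one value, "b" has two, "c" has three.
df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [1.0, 2.0, np.nan],
    "c": [1.0, 2.0, 3.0],
})

# axis=0: drop ROWS that have fewer than 2 non-NaN values.
rows_kept = df.dropna(axis=0, thresh=2)

# axis=1: drop COLUMNS that have fewer than 2 non-NaN values.
cols_kept = df.dropna(axis=1, thresh=2)

print(rows_kept.index.tolist())    # row labels that survive
print(cols_kept.columns.tolist())  # column labels that survive
```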

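The optional thresh argument of df.dropna lets you specify the minimum number of non-NA values a row must have in order to be kept. To drop rows with 7 or more NaN values, keep the rows that have at least df.shape[1] - 6 non-NA values:

df.dropna(thresh=df.shape[1] - 6)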
Roger Fan
  • `df.dropna(thresh=2, inplace=True) # drop extra lines w/o 2 valid values` this was a little simpler and worked perfectly for my application. – Jeff Bluemel May 09 '19 at 20:36