
I have a large DataFrame in pandas that, apart from the column used as the index, is supposed to contain only numeric values:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 'bad', 5],
                   'b': [0.1, 0.2, 0.3, 0.4, 0.5],
                   'item': ['a', 'b', 'c', 'd', 'e']})
df = df.set_index('item')

How can I find the row of the dataframe df that has a non-numeric value in it?

In this example it's the fourth row in the dataframe, which has the string 'bad' in column 'a'. How can this row be found programmatically?
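A quick hint that something is wrong: the stray string forces the whole 'a' column to object dtype, which df.dtypes makes visible:

df.dtypes
# a     object
# b    float64
# dtype: object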

– Zero

7 Answers


You could use np.isreal to check the type of each element (applymap applies a function to each element in the DataFrame):

In [11]: df.applymap(np.isreal)
Out[11]:
          a     b
item
a      True  True
b      True  True
c      True  True
d     False  True
e      True  True

If all the values in a row are True, then they are all numeric:

In [12]: df.applymap(np.isreal).all(1)
Out[12]:
item
a        True
b        True
c        True
d       False
e        True
dtype: bool

So, to get the sub-DataFrame of rogues (note: the negation, ~, of the above finds the rows which have at least one non-numeric value):

In [13]: df[~df.applymap(np.isreal).all(1)]
Out[13]:
        a    b
item
d     bad  0.4

To find the location of the first offender, you could use argmin:

In [14]: np.argmin(df.applymap(np.isreal).all(1))
Out[14]: 'd'
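Note that in newer pandas versions, np.argmin on a Series returns the integer position rather than the label; Series.idxmin gives the label directly:

df.applymap(np.isreal).all(1).idxmin()  # 'd'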

As @CTZhu points out, it may be slightly faster to check whether it's an instance of either int or float (there is some additional overhead with np.isreal):

df.applymap(lambda x: isinstance(x, (int, float)))
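A minimal end-to-end sketch of the isinstance variant, assuming a recent pandas (2.1+) where applymap is deprecated in favour of the elementwise DataFrame.map:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 'bad', 5],
                   'b': [0.1, 0.2, 0.3, 0.4, 0.5],
                   'item': ['a', 'b', 'c', 'd', 'e']}).set_index('item')

# DataFrame.map applies the check to every element (same idea as applymap above)
mask = df.map(lambda x: isinstance(x, (int, float)))
print(df[~mask.all(axis=1)])
#         a    b
# item
# d     bad  0.4

(If an object column holds NumPy integer scalars, isinstance(x, numbers.Number) is the more robust check, since np.int64 is not a subclass of the built-in int.)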
– Andy Hayden
  • Great! Only that `df.applymap(lambda x: isinstance(x, (int, float)))` will be about 18% faster. Using `lambda` with a built-in method is sometimes reasonably fast. – CT Zhu Feb 14 '14 at 06:28
  • @CTZhu Good point. I guess that's not so surprising (since np.isreal does some other stuff too, e.g. dealing with arrays, not just single values). – Andy Hayden Feb 14 '14 at 06:30
  • Don't forget that if you have a field (like a date or group name), you'll want to do this only on the relevant subset of columns of the dataframe. – pashute Nov 07 '17 at 19:06
  • It's worth noting that `np.isreal(None) is not True` – MattY Apr 20 '18 at 07:55
  • How would you only check one column for values that aren't numbers? – Superdooperhero Jul 24 '19 at 17:59
  • @Superdooperhero use .apply on the column rather than .applymap on the DataFrame, i.e. `df["a"].apply(np.isreal)` and `df[~df["a"].apply(np.isreal)]` – Andy Hayden Jul 24 '19 at 18:37
  • 1) Careful: np.isreal reports False for strings, e.g. `a = pd.Series(['1', 2, 3, 'bad', 5])` gives False, True, True, False, True. 2) Careful: `a.str.isnumeric()` returns NaN for non-string elements: True, NaN, NaN, False, NaN. 3) For a Series/column, use `a.apply(np.isreal)` (.applymap will not work). – Alexey K. Jan 05 '21 at 18:27

There are already some great answers to this question; however, here is a nice snippet that I use regularly to drop rows that have non-numeric values in some columns:

# Eliminate invalid data from dataframe (see Example below for more context)

num_df = (df.drop(data_columns, axis=1)
            .join(df[data_columns].apply(pd.to_numeric, errors='coerce')))

num_df = num_df[num_df[data_columns].notnull().all(axis=1)]

The way this works is that we first drop the data_columns from df, and then use a join to put them back in after passing them through pd.to_numeric (with errors='coerce', so that all non-numeric entries are converted to NaN). The result is saved to num_df.

On the second line we use a filter that keeps only rows where all values are not null.

Note that pd.to_numeric is coercing to NaN everything that cannot be converted to a numeric value, so strings that represent numeric values will not be removed. For example '1.25' will be recognized as the numeric value 1.25.
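For instance, a quick illustration of the coercion behaviour:

pd.to_numeric(pd.Series(['1.25', 'bad']), errors='coerce')
# 0    1.25
# 1     NaN
# dtype: float64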

Disclaimer: pd.to_numeric was introduced in pandas version 0.17.0

Example:

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({"item": ["a", "b", "c", "d", "e"],
   ...:                    "a": [1,2,3,"bad",5],
   ...:                    "b":[0.1,0.2,0.3,0.4,0.5]})

In [3]: df
Out[3]: 
     a    b item
0    1  0.1    a
1    2  0.2    b
2    3  0.3    c
3  bad  0.4    d
4    5  0.5    e

In [4]: data_columns = ['a', 'b']

In [5]: num_df = (df
   ...:           .drop(data_columns, axis=1)
   ...:           .join(df[data_columns].apply(pd.to_numeric, errors='coerce')))

In [6]: num_df
Out[6]: 
  item   a    b
0    a   1  0.1
1    b   2  0.2
2    c   3  0.3
3    d NaN  0.4
4    e   5  0.5

In [7]: num_df[num_df[data_columns].notnull().all(axis=1)]
Out[7]: 
  item  a    b
0    a  1  0.1
1    b  2  0.2
2    c  3  0.3
4    e  5  0.5
– Pedro M Duarte
  • This is what I use; for some reason the top answer doesn't always work for me, and `convert_objects` is deprecated. – skd Sep 17 '18 at 11:43
  • I do something similar, but like this: `for col in ['a', 'b', 'c']: df[col] = df[col].apply(pd.to_numeric, errors='coerce').fillna(0)`. This is particular to cases where I want the NaNs to be zero, not NaN. You could also leave out the .fillna(0) bit. – bart cubrich Dec 19 '18 at 23:16
  • Does this drop/join logic scale to millions of rows? – ericOnline Jun 16 '20 at 01:12
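A runnable sketch of the per-column variant from the comment above, where the coerced NaNs are then replaced with 0 (only appropriate when zero is an acceptable stand-in for bad values):

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 'bad'], 'b': [0.1, 'x', 0.3]})
for col in ['a', 'b']:
    # coerce non-numeric entries to NaN, then replace the NaNs with 0
    df[col] = pd.to_numeric(df[col], errors='coerce').fillna(0)
print(df)
#      a    b
# 0  1.0  0.1
# 1  2.0  0.0
# 2  0.0  0.3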
Start from the original DataFrame in the question:

# Original code
df = pd.DataFrame({'a': [1, 2, 3, 'bad', 5],
                   'b': [0.1, 0.2, 0.3, 0.4, 0.5],
                   'item': ['a', 'b', 'c', 'd', 'e']})
df = df.set_index('item')

Convert to numeric, using errors='coerce' to fill bad values with NaN:

a = pd.to_numeric(df.a, errors='coerce')

Use isna to return a boolean index:

idx = a.isna()

Apply that index to the data frame:

df[idx]

Output: the row with the bad data in it:

        a    b
item          
d     bad  0.4
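Or, compressed into a single expression equivalent to the steps above:

df[pd.to_numeric(df.a, errors='coerce').isna()]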
– matiu

Sorry about the confusion, this should be the correct approach. Do you want to capture only 'bad', not things like 'good', or just any non-numerical value?

In[15]:
np.where(np.any(np.isnan(df.convert_objects(convert_numeric=True)), axis=1))
Out[15]:
(array([3]),)
– CT Zhu
  • This will pick up `'bad'`, but won't pick up a rogue string number, because the conversion will succeed. – DSM Feb 14 '14 at 05:26
  • Or get into problems with coded strings representing valid numbers, like 'over_the_chart' standing for the maximum allowed value of a variable. Probably not a safe way; I would rather approach the problem from how the dataframe is built in the real-world case. – CT Zhu Feb 14 '14 at 05:30
  • This depends on whether you consider "4" bad or not; my gut feeling would be yes :S – Andy Hayden Feb 14 '14 at 06:16
  • `convert_objects` is deprecated. – skd Sep 17 '18 at 11:49
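Since convert_objects has been removed from modern pandas, here is a sketch of the equivalent check using pd.to_numeric instead:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 'bad', 5],
                   'b': [0.1, 0.2, 0.3, 0.4, 0.5],
                   'item': ['a', 'b', 'c', 'd', 'e']}).set_index('item')

# coerce every column to numeric, then locate rows containing any NaN
coerced = df.apply(pd.to_numeric, errors='coerce')
print(np.where(coerced.isna().any(axis=1)))  # (array([3]),)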

I'm thinking of something like the following, just to give an idea: convert the column to string, since working with strings is easier. Note, however, that this does not work for strings containing digits, like 'bad123'. The ~ takes the complement of the selection.

df['a'] = df['a'].astype(str)
df[~df['a'].str.contains('0|1|2|3|4|5|6|7|8|9')]
df['a'] = df['a'].astype(object)

Use '|'.join([str(i) for i in range(10)]) to generate the pattern '0|1|...|8|9'.

Or use the np.isreal() function, as in the most voted answer:

df[~df['a'].apply(np.isreal)]
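A slightly simpler way to write the same string-based idea, using the regex digit class \d instead of joining the digits by hand:

df[~df['a'].astype(str).str.contains(r'\d')]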
– dyang

In case you are working with a column of string values, you can use the very useful Series.str.isnumeric() function, like:

a = pd.Series(['hi','hola','2.31','288','312','1312', '0,21', '0.23'])

What I do is copy that column to a new column, apply str.replace('.', '') and str.replace(',', ''), and then select the numeric values:

a = a.str.replace('.', '', regex=False)  # drop decimal points: '2.31' -> '231'
a = a.str.replace(',', '', regex=False)  # drop decimal commas: '0,21' -> '021'
a.str.isnumeric()

Out[15]:
0    False
1    False
2     True
3     True
4     True
5     True
6     True
7     True
dtype: bool

Good luck all!

– Borja_042
  • Why is `2.31` not numeric? – toto_tico May 26 '17 at 11:36
  • I think it is because it has a dot. What I do is copy that column to a new column, apply str.replace('.', ''), and then select the numeric values. I have recently faced that problem. – Borja_042 May 26 '17 at 13:00
  • @toto_tico It is because of the dot. Per https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.isnumeric.html, it "Check[s] whether all characters in each string in the Series/Index are numeric". – kekert Jul 04 '17 at 21:47
  • @kekert Thanks, I forgot that, and there is actually no `isreal` (or similar) in str... @fra44, I think you should add the `str.replace` clarification directly in the answer, because it can be misleading as it is... – toto_tico Jul 05 '17 at 07:29

Did you convert your data using .astype()?

All the great answers above should solve 99% of cases, but if you are still in trouble, also check whether you converted your data type.

Sometimes I force the data to type float16 to save memory. Using:

df[col] = df[col].astype(np.float16)

But this might silently break your code. So if you did any kind of data type transformation, double-check for overflows; disable the conversion and try again.
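For example, a quick sketch of the silent overflow (float16 tops out around 65504):

import numpy as np
import pandas as pd

s = pd.Series([1.0, 70000.0])
print(s.astype(np.float16))
# 0    1.0
# 1    inf
# dtype: float16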

It worked for me!

– Andre Goulart