
I checked this post: finding non-numeric rows in dataframe in pandas? but it doesn't really answer my question.

my sample data:

import pandas as pd


d = {
 'unit': ['UD', 'UD', 'UD', 'UD', 'UD','UD'],
 'N-D': [ 'Q1', 'Q2', 'Q3', 'Q4','Q5','Q6'],
 'num' : [ -1.48, 1.7, -6.18, 0.25, 'sum(d)', 0.25]

}
df = pd.DataFrame(d)

it looks like this:

  N-D   num   unit
0  Q1  -1.48   UD
1  Q2   1.70   UD
2  Q3  -6.18   UD
3  Q4   0.25   UD
4  Q5   sum(d) UD
5  Q6   0.25   UD

I want to filter out only the rows in column 'num' that are NON-NUMERIC: that is, I want all of the columns, but only the rows whose value in 'num' is non-numeric.

desired output:

  N-D   num   unit
4  Q5   sum(d) UD

my attempts (these assume `import numpy as np`):

nonnumeric = df[~df.applymap(np.isreal).all(1)]  # didn't work: it pulled out everything; besides, I want the condition to check only column 'num'

nonnumeric = df['num'][~df.applymap(np.isreal).all(1)]  # didn't work: it pulled out all the rows, but for column 'num' only

5 Answers


Use boolean indexing with a mask created by `to_numeric` + `isnull`.

Note: this solution does not find or filter numbers saved as strings, like '1' or '22'.

print (pd.to_numeric(df['num'], errors='coerce'))
0   -1.48
1    1.70
2   -6.18
3    0.25
4     NaN
5    0.25
Name: num, dtype: float64

print (pd.to_numeric(df['num'], errors='coerce').isnull())
0    False
1    False
2    False
3    False
4     True
5    False
Name: num, dtype: bool

print (df[pd.to_numeric(df['num'], errors='coerce').isnull()])
  N-D     num unit
4  Q5  sum(d)   UD

Another solution with isinstance and apply:

print (df[df['num'].apply(lambda x: isinstance(x, str))])
  N-D     num unit
4  Q5  sum(d)   UD
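The two solutions differ when numeric values are stored as strings. A sketch on a hypothetical mixed column (not from the question): the `isinstance` check flags every string, including numeric-looking ones like `'1.5'`, while `to_numeric` parses those successfully.

```python
import pandas as pd

# Hypothetical mixed column: real floats, a numeric string, and text
df = pd.DataFrame({'num': [-1.48, '1.5', 'sum(d)', 0.25]})

# isinstance flags every str, including the numeric-looking '1.5'
by_type = df[df['num'].apply(lambda x: isinstance(x, str))]

# to_numeric parses '1.5', so only genuinely unparseable text remains
by_parse = df[pd.to_numeric(df['num'], errors='coerce').isnull()]

print(by_type['num'].tolist())   # ['1.5', 'sum(d)']
print(by_parse['num'].tolist())  # ['sum(d)']
```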

Old topic, but if the numbers have been converted to str, type(x) == str no longer works.

Instead, it's better to use isnumeric() or isdigit():

df = df[df['num'].apply(lambda x: not x.isnumeric())]
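A quick sketch of what str.isnumeric actually accepts: only strings of digit characters, so decimals, signs, and arbitrary text all come back False.

```python
# str.isnumeric only recognizes strings of digit characters, so
# decimals, signs, and arbitrary text all come back False.
vals = ['12', '4.5', '-1', 'sum(d)']
flags = [s.isnumeric() for s in vals]
print(flags)  # [True, False, False, False]
```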

I tested all three approaches on my own dataframe with 200k+ rows, assuming numbers have been converted to 'str' by pd.read_csv().

def f1():
    df[pd.to_numeric(df['num'], errors='coerce').isnull()]

def f2():
    df[~df.num.str.match(r'^\-?(\d*\.?\d+|\d+\.?\d*)$')]

def f3():
    df[df['num'].apply(lambda x: not x.isnumeric())]

I got the following execution times by running each function 10 times.

timeit.timeit(f1, number=10)
1.04128568888882

timeit.timeit(f2, number=10)
1.959099448888992

timeit.timeit(f3, number=10)
0.48741375999998127

Conclusion: the fastest method is isnumeric(); the slowest is the regular-expression method.

=========================================

Edit: As @set92 commented, isnumeric() works for integers only. So the fastest generally applicable function is pd.to_numeric(), which gives a universal solution that works for any type of numerical value.

It is possible to define an isfloat() function in Python, but it will be slower than built-in functions, especially for big DataFrames.

import timeit

import pandas as pd

def isfloat(value):
    # float() accepts ints, floats, and numeric strings like '4.5' or '-1'
    try:
        float(value)
        return True
    except ValueError:
        return False

tmp = ['4.0', '4', '4.5', '1', 'test'] * 200000
df = pd.DataFrame(data=tmp, columns=['num'])


def f1():
    df[pd.to_numeric(df['num'], errors='coerce').isnull()]

def f2():
    df[df['num'].apply(lambda x: not isfloat(x))]

def f3():
    df[~df.num.str.match(r'^\-?(\d*\.?\d+|\d+\.?\d*)$')]


print('to_numeric:', timeit.timeit(f1, number=10))
print('isfloat:', timeit.timeit(f2, number=10))
print('regular exp:', timeit.timeit(f3, number=10))

Results:

to_numeric: 8.303612694763615
isfloat: 9.972200270603594
regular exp: 11.420604273894583
  • It shouldn't work: df['num'] is float, and isnumeric() and isdigit() only work for integers. I was searching for a built-in function, but it seems the only solution available is to create a function ```def isfloat(value): try: float(value) return True except ValueError: return False``` and with it change your solution to ```df['num'].loc[df['num'].apply(lambda x: not isfloat(x))]``` – set92 Jul 12 '20 at 12:32
  • But pd.to_numeric() works without any problem. It's a built-in function and fast. No need to define a function! – Mehdi Jul 13 '20 at 15:09
  • With pd.to_numeric I got an error on the strings (```ValueError: Unable to parse string "Unrated" at position 32```). If you want to try, the dataset was the ramen ratings from [kaggle](https://www.kaggle.com/residentmario/ramen-ratings). The problem was that the feature Stars has float numbers and 3 cells with "Unrated", so I couldn't find a one-liner to apply it and find how many rows were not numbers. – set92 Jul 13 '20 at 18:39
  • You should use `pd.to_numeric(..., errors='coerce')` to skip errors – Mehdi Jul 15 '20 at 15:31
  • Only `df[pd.to_numeric(df['num'], errors='coerce').isnull()]` could match floats – matt91t Mar 03 '23 at 21:41

I used

df = df[df['num'].apply(lambda x: type(x) == str)]

and now df is

  N-D     num unit
4  Q5  sum(d)   UD

Assuming these are strings, you can filter based on a regular expression match of a floating point number.

df[~df.num.str.match(r'^\-?(\d*\.?\d+|\d+\.?\d*)$')]

  N-D     num unit
4  Q5  sum(d)   UD
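A sketch of what that pattern accepts and rejects (written as a raw string to avoid invalid-escape warnings; `str.match` anchors at the start, and the `$` handles the end):

```python
import pandas as pd

s = pd.Series(['-1.48', '.5', '3.', '1e5', 'sum(d)'])
print(s.str.match(r'^\-?(\d*\.?\d+|\d+\.?\d*)$').tolist())
# [True, True, True, False, False] -- note that scientific
# notation such as '1e5' is not covered by this pattern.
```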

There are many ways to detect non-numeric values in a column of a pandas DataFrame; here is one. Note that it assumes every value in 'num' is a string: on the mixed-type example above, floats have no .isnumeric method, so this raises AttributeError.

df[~df['num'].map(lambda x: x.isnumeric())]