
Let's say there is a pandas DataFrame like this: {a: [1, 2, 3, 4], b: [1, 2, 3, ?]}. Assume the series actually contain more than a thousand values, and we do not yet know that there is a '?' somewhere in series 'b'; as a result, column 'b' keeps the object dtype.

How can we find out at which rows non-float (non-integer) values exist?

Brick

3 Answers


You could use something like this:

import numpy as np
import pandas as pd

def make_float(v):
    # Convert to float; return NaN for values that cannot be converted
    try:
        return float(v)
    except (TypeError, ValueError):
        return np.nan

df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [1, 2, 3, '?']})

df_float = df.applymap(make_float)
# or just df_float = df.apply(pd.to_numeric, errors='coerce')

After this, df_float will have float dtype, with NaN wherever an entry could not be converted. Note that this also converts valid numeric strings (e.g., '0.7') to floats; you have to decide whether that's a good thing.
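As a quick illustration of that conversion behavior, here is a minimal sketch using pd.to_numeric on a small Series:

```python
import pandas as pd

# Valid numeric strings are converted; anything else becomes NaN
s = pd.Series(['1', '0.7', '?'])
converted = pd.to_numeric(s, errors='coerce')
print(converted.tolist())  # [1.0, 0.7, nan]
```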

You can then find the location of the NaN values (formerly the non-convertible entries in df) with this code (from https://stackoverflow.com/a/33641639/3830997):

df_nan = df_float.unstack()
df_nan = df_nan[df_nan.isnull()]
df_nan
# b  3    NaN
Matthias Fripp

You can easily achieve this using pandas:

df.apply(pd.to_numeric, errors='coerce').isnull().any()
Out[795]: 
a    False
b     True
dtype: bool

Data Input

df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [1, 2, 3, '?']})
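The boolean Series above reports which columns contain bad values; since the question asks for the rows, the same coercion can also serve as a row mask (a small sketch with this df):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [1, 2, 3, '?']})

# Boolean mask: True for rows where any column failed numeric conversion
mask = df.apply(pd.to_numeric, errors='coerce').isnull().any(axis=1)
bad_rows = df[mask]
print(bad_rows.index.tolist())  # [3]
```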
BENY

Say you have multiple rows in the same column that are not numbers:

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6], 'b': ['1', '2', '3', '?', '?', 4]})

You can get the positional indices of all those non-numbers using:

pd.isnull(pd.to_numeric(df['b'], errors='coerce')).to_numpy().nonzero()[0]

You get

array([3, 4])

If you need to do this over multiple columns like in this df,

df = pd.DataFrame({'a': [1, '?', 3, 4, 5, 6], 'b': ['1', '2', '3', '?', '?', 4]})

Try

pd.isnull(df.apply(pd.to_numeric, errors='coerce')).any(axis=1).to_numpy().nonzero()[0]

And you get

array([1, 3, 4])
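If you prefer index labels over positional arrays, the same mask can be fed to df.index instead of nonzero (a sketch with the same two-column df):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, '?', 3, 4, 5, 6], 'b': ['1', '2', '3', '?', '?', 4]})

# Index labels of rows with at least one non-numeric entry
mask = df.apply(pd.to_numeric, errors='coerce').isnull().any(axis=1)
bad_index = df.index[mask]
print(list(bad_index))  # [1, 3, 4]
```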
Vaishali