
I have a dataframe like the one below:

    a   b
0   1   26190
1   5   python
2   5   580

I want column b to hold only integers, but as you can see python is not convertible to an int, so I want to delete the row at index 1. My expected output should look like this:

    a   b
0   1   26190
1   5   580

How can I filter and remove such rows using pandas in Python?

  • I found a workaround here http://stackoverflow.com/questions/15891038/pandas-change-data-type-of-columns. But this only replaces the bad data with NaN – billboard Sep 26 '16 at 04:04
  • Use it: convert to int then dropna against the column to get rid of the rows – Zeugma Sep 26 '16 at 04:13

2 Answers


You can use to_numeric with notnull and filter by boolean indexing:

print (pd.to_numeric(df.b, errors='coerce'))
0    26190.0
1        NaN
2      580.0
Name: b, dtype: float64

print (pd.to_numeric(df.b, errors='coerce').notnull())
0     True
1    False
2     True
Name: b, dtype: bool

df = df[pd.to_numeric(df.b, errors='coerce').notnull()]
print (df)

   a      b
0  1  26190
2  5    580

Another solution, from Boud's comment: use to_numeric, then dropna, and finally convert to int with astype:

df.b = pd.to_numeric(df.b, errors='coerce')
df = df.dropna(subset=['b'])
df.b = df.b.astype(int)
print (df)
   a      b
0  1  26190
2  5    580

If you need to check which rows contain bad data, use isnull to filter the rows where to_numeric returns NaN:

print (pd.to_numeric(df.b, errors='coerce').isnull())
0    False
1     True
2    False
Name: b, dtype: bool

print (df[pd.to_numeric(df.b, errors='coerce').isnull()])
   a       b
1  5  python
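
Putting the pieces above together, a self-contained sketch that keeps the good rows, reports the bad ones, and ends with an integer column b (the frame construction just mirrors the question's sample data):

import pandas as pd

df = pd.DataFrame({'a': [1, 5, 5],
                   'b': [26190, 'python', 580]})

# compute the numeric mask once and reuse it
numeric = pd.to_numeric(df.b, errors='coerce')

bad = df[numeric.isnull()]         # rows that will be removed
df = df[numeric.notnull()].copy()  # keep only convertible rows
df.b = df.b.astype(int)            # column b now holds integers

print (bad)
print (df)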
– jezrael

This should work

import pandas as pd
import numpy as np

df = pd.DataFrame({'a' : [1, 5, 5],
                   'b' : [26190, 'python', 580]})
df
   a       b
0  1   26190
1  5  python
2  5     580

# non-string entries yield NaN from .str.contains, so compare against True explicitly
df['b'] = np.where(df.b.str.contains('[a-z]') == True, np.NaN, df.b)
df
   a      b
0  1  26190
1  5    NaN
2  5    580

df = df.dropna()
df
   a      b
0  1  26190
2  5    580

The regex identifies the values that contain letters, np.where converts those to np.NaN, and df.dropna() then drops the affected rows from the df.
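
Note that after dropna, column b will typically still have object dtype, because np.where returned an object array; if you want b to hold actual integers, as the question asks, a final astype finishes the job (a small sketch continuing from the frame above):

df.b = df.b.astype(int)
print (df.dtypes)
a    int64
b    int64
dtype: object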

– conor