Remove non-numeric rows in one column with pandas

Question

There is a dataframe like the following, and it has one unclean column 'id' which it sholud be numeric column

id, name
1,  A
2,  B
3,  C
tt, D
4,  E
5,  F
de, G

Is there a concise way to remove the rows because tt and de are not numeric values

tt,D
de,G

to make the dataframe clean?

id, name
1,  A
2,  B
3,  C
4,  E
5,  F

Your **example column has str type everywhere**, not mixed-type like (say) `pd.Series([1, 2.2, True, 'ignore'])`. Some solutions here rely on assuming str type or methods. To not break on more general mixed-type column, you'll need`s[s.str.isnumeric() == True]` since `s.str.isnumeric() ` returns `Nan` for int and float. — smci, Jun 27 '22 at 20:27

score 122 · Answer 1 · answered Oct 04 '17 at 20:08

122

Using pd.to_numeric

In [1079]: df[pd.to_numeric(df['id'], errors='coerce').notnull()]
Out[1079]:
  id  name
0  1     A
1  2     B
2  3     C
4  4     E
5  5     F

answered Oct 04 '17 at 20:08

Zero

74,117
18
147
154

If you want to retain the column as a number and not an object, you might need to call `pd.to_numeric` separately. – Tulio Casagrande Jan 04 '18 at 13:19
1

No `apply` :D ! – Little Bobby Tables Feb 12 '18 at 16:21
(It's not inplace) – Super Mario Dec 03 '20 at 00:36
1

Is there a way to do this if NaN is a valid value for the column but not a non-numeric like "abc"? This would filter out both nulls and non-numerics – Stan Ostrovskii Apr 19 '22 at 20:28

Anton Protopopov · Accepted Answer · 2018-10-20T09:25:40.903

You could use standard method of strings isnumeric and apply it to each value in your id column:

import pandas as pd
from io import StringIO

data = """
id,name
1,A
2,B
3,C
tt,D
4,E
5,F
de,G
"""

df = pd.read_csv(StringIO(data))

In [55]: df
Out[55]: 
   id name
0   1    A
1   2    B
2   3    C
3  tt    D
4   4    E
5   5    F
6  de    G

In [56]: df[df.id.apply(lambda x: x.isnumeric())]
Out[56]: 
  id name
0  1    A
1  2    B
2  3    C
4  4    E
5  5    F

Or if you want to use id as index you could do:

In [61]: df[df.id.apply(lambda x: x.isnumeric())].set_index('id')
Out[61]: 
   name
id     
1     A
2     B
3     C
4     E
5     F

Edit. Add timings

Although case with pd.to_numeric is not using apply method it is almost two times slower than with applying np.isnumeric for str columns. Also I add option with using pandas str.isnumeric which is less typing and still faster then using pd.to_numeric. But pd.to_numeric is more general because it could work with any data types (not only strings).

df_big = pd.concat([df]*10000)

In [3]: df_big = pd.concat([df]*10000)

In [4]: df_big.shape
Out[4]: (70000, 2)

In [5]: %timeit df_big[df_big.id.apply(lambda x: x.isnumeric())]
15.3 ms ± 2.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [6]: %timeit df_big[df_big.id.str.isnumeric()]
20.3 ms ± 171 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [7]: %timeit df_big[pd.to_numeric(df_big['id'], errors='coerce').notnull()]
29.9 ms ± 682 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

I tested it on python 2, but `pd.to_numeric` is faster than `isdigit()`. `%timeit subx1=sub[sub.BAN.apply(lambda x: str(x).isdigit())]` is 5.67 sec. Compare to 4.46 sec for `%timeit subx2=sub[pd.to_numeric(sub.BAN, errors='coerce').notnull()]`. Python 2 `str` does not has `isnumeric()` — notilas, May 28 '19 at 18:53

DeepSpace · Answer 3 · 2015-11-27T16:15:20.567

Given that df is your dataframe,

import numpy as np
df[df['id'].apply(lambda x: isinstance(x, (int, np.int64)))]

What it does is passing each value in the id column to the isinstance function and checks if it's an int. Then it returns a boolean array, and finally returning only the rows where there is True.

If you also need to account for float values, another option is:

import numpy as np
df[df['id'].apply(lambda x: type(x) in [int, np.int64, float, np.float64])]

Note that either way is not inplace, so you will need to reassign it to your original df, or create a new one:

df = df[df['id'].apply(lambda x: type(x) in [int, np.int64, float, np.float64])]
# or
new_df = df[df['id'].apply(lambda x: type(x) in [int, np.int64, float, np.float64])]

The code in the first block empty completed my dataframe. – RobyB Apr 01 '23 at 17:25 — RobyB, Apr 01 '23 at 17:25

score 5 · Answer 4 · answered Oct 01 '19 at 08:24

x.isnumeric() does not test return True when x is of type float.

One way to filter out values which can be converted to float:

df[df['id'].apply(lambda x: is_float(x))]

def is_float(x):
    try:
        float(x)
    except ValueError:
        return False
    return True

yeiichi · Answer 5 · 2021-06-22T01:31:26.277

How about this? The .str accessor is one of my favorites :)

import pandas as pd


df = pd.DataFrame(
    {
        'id':   {0: '1', 1: '2', 2: '3', 3: 'tt', 4: '4', 5: '5', 6: 'de'},
        'name': {0: 'A', 1: 'B', 2: 'C', 3: 'D',  4: 'E', 5: 'F', 6: 'G'}
    }
)

df_clean = df[df.id.str.isnumeric()]

Supplement (2021-06-22)

If the id contains some kind of headache-makers (such as float, None, nan), you can forcefully cast them to the str data type using astype('str').

import numpy as np
import pandas as pd


df = pd.DataFrame(
    {
        'id':   {0: '1', 1: '2', 2: '3', 3: 3.14, 4: '4', 5: '5', 6: None, 7: np.nan},
        'name': {0: 'A', 1: 'B', 2: 'C', 3: 'D',  4: 'E', 5: 'F', 6: 'G',  7: 'H'}
    }
)

df_clean = df[df.id.astype('str').str.isnumeric()]

Primitive, but it works anyway.

score 1 · Answer 6 · answered May 24 '22 at 19:28

1

Another alternative is to use the query method:

In [5]: df.query('id.str.isnumeric()')
Out[5]: 
  id  name
0  1     A
1  2     B
2  3     C
4  4     E
5  5     F

answered May 24 '22 at 19:28

rachwa

1,805
1
14
17

score 0 · Answer 7 · answered Sep 28 '21 at 15:45

This is a dynamic way to do it, this only works for int64 and float 64, if you have other numeric data types in your dataframe make sure you add them to the if statement

# make dataframe of column data types
col_types = df.dtypes.to_frame()
col_types.columns = ['dtype']

#make list of zeros
drop_it = [0]*col_types.shape[0]
k = 0

#make it a one if the data isn't numeric
#if you have other numeric types you need to add them to if statement
for t in col_types.dtype:
    if t != 'int64' and t != 'float64':
        drop_it[k] = 1
    k = k + 1

#delete types from drop list that aren't numeric
col_types['drop_it'] = drop_it
col_types = col_types.loc[col_types["drop_it"] == 1]

#finally drop columns that are in drop list
for col_to_drop in col_types.index.values.tolist():
    df = df.drop([col_to_drop], axis = 1)

score 0 · Answer 8 · answered Mar 25 '23 at 15:20

0

System.ObjectDisposedException: 'Cannot access a disposed object. Object name: 'IServiceProvider'.'

Try to delete the bin folder and rebuild it again. It works for me.

answered Mar 25 '23 at 15:20

Balaji

1

This answer is not related to the question. Could be another question. – ss301 Mar 30 '23 at 14:26
As it’s currently written, your answer is unclear. Please [edit] to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community Mar 30 '23 at 14:26

Remove non-numeric rows in one column with pandas

8 Answers8

Edit. Add timings

Linked

Related