I am using Pandas to analyze data from a CSV file. The DataFrame looks like this:

   tech_nbr  door_age  service_spend  service_calls
0         2    -7,987              1              3
1         3    -7,987              1              3
2    231561    -7,987              1              3
3   2531885        13              1              3
4   A451349         9              1              3

Now I want to filter out all the rows with a negative door_age, such as rows 0 and 1, using the following command:

df_filtered = df.filter(df.door_age > 0)

However, I got this error:

TypeError: '>' not supported between instances of 'str' and 'int'

I guess some door_age values are not numeric, so I added the following line to drop rows with a non-numeric door_age, based on "Remove non-numeric rows in one column with pandas":

df[df.door_age.apply(lambda x: x.isnumeric())]

It did seem to remove a lot of rows, but I still got the same error. So I also filtered out rows with null values for door_age using `df = df.dropna(subset=['door_age'])`. However, it did not help either.

Why am I still getting this error?

ddd
  • Can you *explicitly check* the `dtype` of your numeric column before and after your attempt to remove non-numeric rows? You can use `df.dtypes` or `series.dtype` for this. – jpp Mar 30 '18 at 21:10
  • @jpp it is `object` before and after. Should I change the whole column type at the beginning then? – ddd Mar 30 '18 at 21:17
  • Yes, use `df['col'] = pd.to_numeric(df['col'], errors='coerce')`. Non-numeric values will become `np.nan`. – jpp Mar 30 '18 at 21:19
  • @jpp it works now. thanks – ddd Mar 30 '18 at 21:25

1 Answer

You need to ensure your series is of numeric type before you attempt any numeric calculations.

In this case, you can coerce non-numeric values to np.nan:

df['door_age'] = pd.to_numeric(df['door_age'], errors='coerce')
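
As a rough end-to-end sketch of the above, assuming the column holds strings like those in the question: the sample frame below is hypothetical and only mirrors the posted rows. Two assumptions go beyond the original answer. First, `-7,987` is treated as the number -7987 with a thousands separator, so the commas are stripped before conversion; without that step `errors='coerce'` would turn those values into NaN. Second, the row selection uses boolean indexing, since `DataFrame.filter` selects by index or column labels rather than by row values.

```python
import pandas as pd

# Hypothetical sample mirroring the question; door_age arrives as strings.
df = pd.DataFrame({
    "tech_nbr": ["2", "3", "231561", "2531885", "A451349"],
    "door_age": ["-7,987", "-7,987", "-7,987", "13", "9"],
    "service_spend": [1, 1, 1, 1, 1],
    "service_calls": [3, 3, 3, 3, 3],
})

# Assumption: the comma is a thousands separator, so remove it first.
# Anything that still cannot be parsed becomes NaN via errors="coerce".
df["door_age"] = pd.to_numeric(
    df["door_age"].str.replace(",", "", regex=False), errors="coerce"
)

# Boolean indexing keeps rows with a positive door_age.
# NaN > 0 evaluates to False, so coerced values are dropped as well.
df_filtered = df[df["door_age"] > 0]
```

Checking `df["door_age"].dtype` after the conversion should show a numeric dtype instead of `object`, which is what lets the `>` comparison work.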
jpp