4

I have the following DataFrame

A       B       C
1.0     abc     1.0
abc     1.0     abc
-1.11   abc     abc

I have mixed datatypes (float and str). How can I drop values <= -1 in column A?

I get an error if I do the following, because of the mixed datatypes:

df['A'] = (df['A'] != "abc") & (df['A'] > -1)
TypeError: '>' not supported between instances of 'str' and 'int'

How can I change my object so that `abc` is a str and `1.0` is a float, letting me run:

(df['A'] != "abc") & (df['A'] > -1)

print(df['A'].dtype)
    -> object

I would like the expected output

df = 

A       B       C
1.0     abc     1.0
abc     1.0     abc
NaN     abc     abc
jpp
satoshi
  • What do you mean "I get an error"? Please give us a [mcve] that shows us the actual exception traceback and the actual code (and reproducible sample data) that produces it, or we can't debug it. – abarnert Jun 04 '18 at 19:57
  • As a wild guess, I think you're looking for a short-circuiting operator here, which you can't do "externally" over a series, but worst-case scenario you can always `apply` or `map` something like `lambda x: x != "abc" and x > -1`. Or, maybe more simply, you can filter it in two steps instead of one: first remove the strings, then you have a series of floats so you can remove the too-small floats. – abarnert Jun 04 '18 at 19:58
  • Also, if your only non-float value is the string `'abc`', do you really want dtype object for that, instead of using dtype `float` and converting that `'abc'` into a `NaN` float value? – abarnert Jun 04 '18 at 19:59
  • Do you want values <= -1 to be converted to `NaN`? – user3483203 Jun 04 '18 at 19:59
  • I want to keep `abc` in my dataframe but remove any `float < -1.0` – satoshi Jun 04 '18 at 20:28
  • @Tai ugh sorry I always forget to add my expected output. – satoshi Jun 04 '18 at 20:39
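The `apply`-based mask suggested in the comments can be sketched as follows. Note the keep-condition here is my adaptation of the commenter's lambda (using `isinstance` instead of `x != "abc"`) so that the strings survive, matching the expected output:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': [1.0, 'abc', -1.11],
    'B': ['abc', 1.0, 'abc'],
    'C': [1.0, 'abc', 'abc']})

# Keep strings untouched; for floats, keep only values > -1.
# `or` short-circuits, so the x > -1 comparison never sees a str.
keep = df['A'].apply(lambda x: isinstance(x, str) or x > -1)
df.loc[~keep, 'A'] = np.nan
print(df)
```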

3 Answers

3

There are at least a couple of different approaches to this problem.

loc + pd.to_numeric

pd.DataFrame.loc accepts a Boolean series, so you can compute a mask via pd.to_numeric and feed it to the loc setter.

Note there is no need to test df['A'] != 'abc' explicitly: pd.to_numeric with errors='coerce' converts those strings to NaN, and NaN < -1 evaluates to False, so they are excluded from the mask automatically.

mask = pd.to_numeric(df['A'], errors='coerce') < -1
df.loc[mask, 'A'] = np.nan

print(df)

     A    B    C
0    1  abc    1
1  abc    1  abc
2  NaN  abc  abc
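To see why the explicit `!= 'abc'` check is unnecessary, here is a quick illustrative look (on a hand-built series mirroring column A) at what `pd.to_numeric(..., errors='coerce')` produces:

```python
import pandas as pd

s = pd.Series([1.0, 'abc', -1.11], dtype=object)
coerced = pd.to_numeric(s, errors='coerce')  # 'abc' -> NaN
mask = coerced < -1                          # NaN < -1 is False
print(mask.tolist())  # [False, False, True]
```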

try / except

See @Jan's solution. This solution is preferable if you expect values to be numeric and are only looking for alternative treatment in edge cases.

jpp
  • Thank you! I then included `df['A'] = df['A'].apply(pd.to_numeric, errors='ignore', downcast='float')` because when I open the file in Excel it stores it as text. Please let me know if this seems efficient to you. – satoshi Jun 04 '18 at 21:38
  • 1
    @g_altobelli, That seems fine. Note it's not a requirement to set `downcast='float'`, but it should do no harm. – jpp Jun 04 '18 at 21:41
  • 1
    Thank you for all your help. – satoshi Jun 04 '18 at 21:43
2

You could use a little function converting the values to numbers:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': [1.0, 'abc', -1.11], 
    'B': ['abc', 1.0, 'abc'], 
    'C': [1.0, 'abc', 'abc']})

def change(value):
    try:
        v = float(value)
        if v < -1:
            value = np.nan
    except ValueError:
        pass
    return value

df['A'] = df['A'].apply(change)
print(df)

This yields

     A    B    C
0    1  abc    1
1  abc    1  abc
2  NaN  abc  abc
Jan
  • Also, I see no reason you need to convert to `float`. If it's not possible, it should be caught by `ValueError`. – jpp Jun 04 '18 at 20:32
  • This works but why do I have to do `try` and `except`? Why cant I just tell `pandas` to keep `str` and `drop any float < -1.0`? – satoshi Jun 04 '18 at 20:41
  • @jpp: You'll get a type error if you don't convert it (TypeError: '<' not supported between instances of 'str' and 'int'). – Jan Jun 05 '18 at 05:24
  • You can include a tuple of exceptions :) – jpp Jun 05 '18 at 07:06
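A sketch of the variant discussed in the comments above, comparing without a `float()` conversion; the tuple of exceptions (my illustration, not code from the answer) catches both the `TypeError` raised by e.g. `'abc' < -1` and a `ValueError` if a conversion is reintroduced later:

```python
import numpy as np

def change(value):
    try:
        if value < -1:               # no float() conversion needed
            return np.nan
    except (ValueError, TypeError):  # 'abc' < -1 raises TypeError
        pass
    return value

print(change(-1.11), change('abc'), change(1.0))
```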
0

You need the indices of the rows where your condition is met. Convert the series to numeric (strings become NaN), find the rows matching your condition, and drop them. Note that the NaN-bearing series is never saved back to the dataframe; it is only used to locate the rows to remove.

ser = pd.to_numeric(df['A'], errors='coerce')
df.loc[df['A'] != 'abc'].drop(ser[ser <= -1].index)

Output:

    A   B   C
0  1.0  abc 1.0
harvpan