4

I have the following DataFrame

A       B       C
1.0     abc     1.0
abc     1.0     abc
-1.11   abc     abc

I have mixed datatypes (float and str). How can I drop values <= -1 in column A?

I get an error if I do the following, because of the mixed datatypes:

df['A'] = (df['A'] != "abc") & (df['A'] > -1)
TypeError: '>' not supported between instances of 'str' and 'int'

How can I change my object so that `abc` is a str and `1.0` is a float, letting me run:

(df['A'] != "abc") & (df['A'] > -1)

print(df['A'].dtype)
    -> object

I would like the expected output

df = 

A       B       C
1.0     abc     1.0
abc     1.0     abc
NaN     abc     abc
jpp
satoshi
  • What do you mean "I get an error"? Please give us a [mcve] that shows us the actual exception traceback and the actual code (and reproducible sample data) that produces it, or we can't debug it. – abarnert Jun 04 '18 at 19:57
  • As a wild guess, I think you're looking for a short-circuiting operator here, which you can't do "externally" over a series, but worst-case scenario you can always `apply` or `map` something like `lambda x: x != "abc" and x > -1`. Or, maybe more simply, you can filter it in two steps instead of one: first remove the strings, then you have a series of floats so you can remove the too-small floats. – abarnert Jun 04 '18 at 19:58
  • Also, if your only non-float value is the string `'abc`', do you really want dtype object for that, instead of using dtype `float` and converting that `'abc'` into a `NaN` float value? – abarnert Jun 04 '18 at 19:59
  • Do you want values <= -1 to be converted to `NaN`? – user3483203 Jun 04 '18 at 19:59
  • I want to keep `abc` in my dataframe but remove any `float < -1.0` – satoshi Jun 04 '18 at 20:28
  • @Tai ugh sorry I always forget to add my expected output. – satoshi Jun 04 '18 at 20:39
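The `apply`-based mask suggested in the comments can be sketched as follows. Note the keep-condition here is my adaptation of the commenter's lambda (using `isinstance` instead of `x != "abc"`) so that the strings survive, matching the expected output:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': [1.0, 'abc', -1.11],
    'B': ['abc', 1.0, 'abc'],
    'C': [1.0, 'abc', 'abc']})

# Keep strings untouched; for floats, keep only values > -1.
# `or` short-circuits, so the x > -1 comparison never sees a str.
keep = df['A'].apply(lambda x: isinstance(x, str) or x > -1)
df.loc[~keep, 'A'] = np.nan
print(df)
```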

3 Answers

3

There are at least a couple of different approaches to this problem.

loc + pd.to_numeric

pd.DataFrame.loc accepts a Boolean series, so you can compute a mask via pd.to_numeric and feed it to the loc setter.

Note there is no need to test df['A'] != 'abc' explicitly: pd.to_numeric with errors='coerce' converts those strings to NaN, and NaN < -1 evaluates to False, so they are excluded from the mask automatically.

mask = pd.to_numeric(df['A'], errors='coerce') < -1
df.loc[mask, 'A'] = np.nan

print(df)

     A    B    C
0    1  abc    1
1  abc    1  abc
2  NaN  abc  abc
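To see why the explicit `!= 'abc'` check is unnecessary, here is a quick illustrative look (on a hand-built series mirroring column A) at what `pd.to_numeric(..., errors='coerce')` produces:

```python
import pandas as pd

s = pd.Series([1.0, 'abc', -1.11], dtype=object)
coerced = pd.to_numeric(s, errors='coerce')  # 'abc' -> NaN
mask = coerced < -1                          # NaN < -1 is False
print(mask.tolist())  # [False, False, True]
```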

try / except

See @Jan's solution. This solution is preferable if you expect values to be numeric and are only looking for alternative treatment in edge cases.

jpp
  • Thank you! I then included `df['A'] = df['A'].apply(pd.to_numeric, errors='ignore', downcast='float')` because when I open the file in Excel it stores it as text. Please let me know if this seems efficient to you. – satoshi Jun 04 '18 at 21:38
  • 1
    @g_altobelli, That seems fine. Note it's not a requirement to set `downcast='float'`, but it should do no harm. – jpp Jun 04 '18 at 21:41
  • 1
    Thank you for all your help. – satoshi Jun 04 '18 at 21:43
2

You could use a little function converting the values to numbers:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': [1.0, 'abc', -1.11], 
    'B': ['abc', 1.0, 'abc'], 
    'C': [1.0, 'abc', 'abc']})

def change(value):
    try:
        v = float(value)
        if v < -1:
            value = np.nan
    except ValueError:
        pass
    return value

df['A'] = df['A'].apply(change)
print(df)

This yields

     A    B    C
0    1  abc    1
1  abc    1  abc
2  NaN  abc  abc
Jan
  • Also, I see no reason you need to convert to `float`. If it's not possible, it should be caught by `ValueError`. – jpp Jun 04 '18 at 20:32
  • This works but why do I have to do `try` and `except`? Why cant I just tell `pandas` to keep `str` and `drop any float < -1.0`? – satoshi Jun 04 '18 at 20:41
  • @jpp: You'll get a type error if you don't convert it (TypeError: '<' not supported between instances of 'str' and 'int'). – Jan Jun 05 '18 at 05:24
  • You can include a tuple of exceptions :) – jpp Jun 05 '18 at 07:06
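A sketch of the variant discussed in the comments above, comparing without a `float()` conversion; the tuple of exceptions (my illustration, not code from the answer) catches both the `TypeError` raised by e.g. `'abc' < -1` and a `ValueError` if a conversion is reintroduced later:

```python
import numpy as np

def change(value):
    try:
        if value < -1:               # no float() conversion needed
            return np.nan
    except (ValueError, TypeError):  # 'abc' < -1 raises TypeError
        pass
    return value

print(change(-1.11), change('abc'), change(1.0))
```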
0

You need the indices of the rows where your condition is met. Convert the series to numeric (strings become NaN), find the rows matching your condition, and drop them. Note that the NaN-bearing series is never saved back to the dataframe; it is only used to locate the rows to remove.

ser = pd.to_numeric(df['A'], errors='coerce')
df.loc[df['A'] != 'abc'].drop(ser[ser <= -1].index)

Output:

    A   B   C
0  1.0  abc 1.0
harvpan