13

I've got a pandas dataframe with a column 'cap'. This column mostly consists of floats but has a few strings in it, for instance at index 1.

df =
    cap
0    5.2
1    na
2    2.2
3    7.6
4    7.5
5    3.0
...

I import my data from a csv file like so:

df = DataFrame(pd.read_csv(myfile.file))

Unfortunately, when I do this, the column 'cap' is imported entirely as strings. I would like floats to be identified as floats and strings as strings. Trying to convert this using:

df['cap'] = df['cap'].astype(float)

throws up an error:

could not convert string to float: na

Is there any way to make all the numbers into floats but keep the 'na' as a string?

natsuki_2002
  • I think this might hurt performance. Mixed types in a column force `dtype=object`, while you might otherwise get along with `dtype=float`. There's a special "not a number" float value (`numpy.nan`) that is good for indicating missing data, and [Pandas understands it very well.](http://pandas.pydata.org/pandas-docs/dev/missing_data.html) – Kos Nov 08 '13 at 17:13
  • You just need to specify ``na_values=['na']`` in the ``read_csv`` call and it will work, see [here](http://pandas.pydata.org/pandas-docs/dev/io.html#na-values) – Jeff Nov 08 '13 at 17:15
  • Alternatively you can use ``df.convert_objects(convert_numeric='force')`` to force non-numeric values to ``nan`` (implemented in Cython, so it's very fast) – Jeff Nov 08 '13 at 17:29
  • You might consider accepting this [answer](http://stackoverflow.com/a/19866269/2087463). – tmthydvnprt Mar 20 '16 at 17:58
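
A minimal sketch of the dtype point from the first comment, using the values from the question: a column mixing floats and strings ends up as object, while one using `numpy.nan` for the missing entry stays float64.

import numpy as np
import pandas as pd

pd.Series([5.2, 'na', 2.2]).dtype     # dtype('O') -- mixed types force object
pd.Series([5.2, np.nan, 2.2]).dtype   # dtype('float64') -- NaN keeps the column numeric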

4 Answers

21

Calculations with columns of float64 dtype (rather than object) are much more efficient, and a numeric dtype also lets you use the rest of pandas' numeric machinery, so it is usually preferred. Because of this, it is recommended to use NaN for missing data (rather than your own placeholder or None).

Is this really the behaviour you want? (The second example below uses the to_number function from another answer here.)

In [11]: df.sum()  # all strings
Out[11]: 
cap    5.2na2.27.67.53.0
dtype: object

In [12]: df.apply(lambda f: to_number(f[0]), axis=1).sum()  # floats and 'na' strings
TypeError: unsupported operand type(s) for +: 'float' and 'str'

You should use convert_numeric to coerce to floats:

In [21]: df.convert_objects(convert_numeric=True)
Out[21]: 
   cap
0  5.2
1  NaN
2  2.2
3  7.6
4  7.5
5  3.0
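
Note that convert_objects was deprecated in later pandas versions; on pandas 0.17 or newer the equivalent coercion for this column is pd.to_numeric (a short sketch of the same idea):

import pandas as pd

# anything that does not parse as a number (like 'na') becomes NaN
df['cap'] = pd.to_numeric(df['cap'], errors='coerce')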

Or read the csv in directly, adding 'na' to the list of values to be treated as NaN:

In [22]: pd.read_csv(myfile.file, na_values=['na'])
Out[22]: 
   cap
0  5.2
1  NaN
2  2.2
3  7.6
4  7.5
5  3.0
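
A quick way to confirm the coercion worked (assuming df now holds the result of either approach above):

df['cap'].dtype            # dtype('float64')
df['cap'].isnull()         # True only at index 1, where 'na' used to be
df['cap'].dropna().mean()  # 5.1, the mean of the five numeric values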

In either case, sum (and many other pandas functions) will now work:

In [23]: df.sum()
Out[23]:
cap    25.5
dtype: float64

As Jeff advises:

repeat 3 times fast: object==bad, float==good

Andy Hayden
2

First of all, the way you import your CSV is redundant. Instead of doing:

df = DataFrame(pd.read_csv(myfile.file))

you can simply do:

df = pd.read_csv(myfile.file)

Then, to convert the column to float and turn whatever is not a number into NaN:

df['cap'] = pd.to_numeric(df['cap'], errors='coerce')
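
If several columns need the same treatment, the coercion can be applied column-wise (a sketch; 'cap2' is a hypothetical second column name):

df[['cap', 'cap2']] = df[['cap', 'cap2']].apply(pd.to_numeric, errors='coerce')
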
1

Here is a possible workaround.

First, define a function that converts a value to float only when possible:

def to_number(s):
    # return a float when the value parses as a number, otherwise keep the original string
    try:
        return float(s)
    except ValueError:
        return s

Then you apply it row by row.

Example:

given

 df 
     0
  0  a
  1  2

where both 'a' and '2' are strings, we do the conversion via

converted = df.apply(lambda f: to_number(f[0]), axis=1)

 converted
 0    a
 1    2

A direct check on the types:

type(converted.iloc[0])
str

type(converted.iloc[1])
float
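
The same element-wise conversion can also be written with Series.map instead of a row-wise apply (just a sketch of an equivalent spelling):

converted = df[0].map(to_number)   # floats where possible, original strings otherwise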
Acorbe
1

I tried an alternative to the above:

import numpy as np

for num, item in enumerate(data['col']):
    try:
        float(item)
    except ValueError:
        # replace values that cannot be parsed as numbers
        # (assumes the default integer index, so position == label)
        data.loc[num, 'col'] = np.nan
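
Note that the column still has object dtype after this loop; if you want actual floats in the end, you can convert it afterwards (a sketch, assuming everything left is a numeric string or NaN):

data['col'] = data['col'].astype(float)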
reabow