4

I have a column Column1 in a pandas dataframe which is of type str, values which are in the following form:

import pandas as pd
df = pd.read_table("filename.dat")
type(df["Column1"].ix[0])   #outputs 'str'
print(df["Column1"].ix[0])

which outputs '1/350'. So, this is currently a string. I would like to convert it into a float.

I tried this:

df["Column1"] = df["Column1"].astype('float64', raise_on_error = False)

But this didn't change the values into floats.

This also failed:

df["Column1"] = df["Column1"].convert_objects(convert_numeric=True)

And this failed:

df["Column1"] = df["Column1"].apply(pd.to_numeric, args=('coerce',))

How do I convert all the values of column "Column1" into floats? Could I somehow use regex to remove the parentheses?

EDIT:

The line

df["Meth"] = df["Meth"].apply(eval)

works, but only if I use it twice, i.e.

df["Meth"] = df["Meth"].apply(eval)
df["Meth"] = df["Meth"].apply(eval)

Why would this be?

piRSquared
  • 285,575
  • 57
  • 475
  • 624
ShanZhengYang
  • 16,511
  • 49
  • 132
  • 234

3 Answers3

4

You need to evaluate the expression (e.g. '1/350') in order to get the result, for which you can use Python's eval() function.

By wrapping Panda's apply() function around it, you can then execute the eval() function on every value in your column. Example:

df["Column1"].apply(eval)

As you're interpreting literals, you can also use the ast.literal_eval function as noted in the docs. Update: This won't work, as the use of literal_eval() is still restricted to additions and subtractions (source).

Remark: as mentioned in other answers and comments on this question, the use of eval() is not without risks, as you're basically executing whatever input is passed in. In other words, if your input contains malicious code, you're giving it a free pass.

Alternative option:

# Define a custom div function
def div(a,b):
    return int(a)/int(b)

# Split each string and pass the values to div
df_floats = df['col1'].apply(lambda x: div(*x.split('/')))

Second alternative in case of unclean data:

By using regular expressions, we can remove any non-digits appearing resp. before the numerator and after the denominator.

# Define a custom div function (unchanged)
def div(a,b):
    return int(a)/int(b)

# We'll import the re module and define a precompiled pattern
import re
regex = re.compile('\D*(\d+)/(\d+)\D*')

df_floats = df['col1'].apply(lambda x: div(*regex.findall(x)[0]))

We'll lose a bit of performance, but the upside is that even with input like '!erefdfs?^dfsdf1/350dqsd qsd qs d', we still end up with the value of 1/350.

Performance:

When timing both options on a dataframe with 100.000 rows, the second option (using the user defined div function) clearly wins:

  • using eval: 1 loop, best of 3: 1.41 s per loop
  • using div: 10 loops, best of 3: 159 ms per loop
  • using re: 1 loop, best of 3: 275 ms per loop
Community
  • 1
  • 1
DocZerø
  • 8,037
  • 11
  • 38
  • 66
  • "this is currently a string. I would like to convert it into a float" – ShanZhengYang Aug 01 '16 at 20:22
  • No problem :) Thank you for the help! – ShanZhengYang Aug 01 '16 at 20:25
  • 1
    +1 for the benchmark there, I was surprised at first that `eval` was slower, but maybe it makes sense. `eval` needs to account for whatever code might have been passed in, while your alternative explicitly works on the `/` – Jeff Aug 01 '16 at 20:59
  • The error I get for the above `div` function (which I like) is alueError: invalid literal for int() with base 10: `"'9"` I'm not sure how to fix this – ShanZhengYang Aug 01 '16 at 22:21
  • Looks like there's some errors in the formatting of the data then, with an extra single quote. That may mean there are other issues, but in this case, in this answer, change it to `return int(a.strip("'")/int(b.strip("'"))` - that is a double-quote, single-quote, double-quote. Should fix it. – Jeff Aug 01 '16 at 23:25
  • Updated the answer with an alternative based on regex to remove any unwanted characters at the beginning or the end of the string (non-digits). – DocZerø Aug 02 '16 at 06:07
3

I hate advocating for the use of eval. I didn't want to spend time on this answer but I was compelled because I don't want you to use eval.

So I wrote this function that works on a pd.Series

def do_math_in_string(s):
    op_map = {'/': '__div__', '*': '__mul__', '+': '__add__', '-': '__sub__'}
    df = s.str.extract(r'(\d+)(\D+)(\d+)', expand=True)
    df = df.stack().str.strip().unstack()
    df.iloc[:, 0] = pd.to_numeric(df.iloc[:, 0]).astype(float)
    df.iloc[:, 2] = pd.to_numeric(df.iloc[:, 2]).astype(float)
    def do_op(x):
        return getattr(x[0], op_map[x[1]])(x[2])
    return df.T.apply(do_op)

Demonstration

s = pd.Series(['1/2', '3/4', '4/5'])

do_math_in_string(s)

0    0.50
1    0.75
2    0.80
dtype: float64

do_math_in_string(pd.Series(['1/2', '3/4', '4/5', '6+5', '11-7', '9*10']))

0     0.50
1     0.75
2     0.80
3    11.00
4     4.00
5    90.00
dtype: float64

Please don't use eval.

piRSquared
  • 285,575
  • 57
  • 475
  • 624
  • "You'll have better self esteem later in life if you avoid it now." Why? – ShanZhengYang Aug 01 '16 at 20:46
  • 1
    That comment is me being comedically dramatic. But in practice, using `eval` is a recipe for code that is difficult to manage, maintain, debug, and keep secure. It's very generally regarded as bad practice. – piRSquared Aug 01 '16 at 20:49
  • If he's not dealing with user input, how would using `eval` here be a problem? – Jeff Aug 01 '16 at 20:57
  • I'll say a couple more things: One, an input file can very well be user input. Two, it still makes it more difficult to track bugs and maintain (my belief). Three, all this is coding philosophy and doesn't really matter as the OP can do what he wants. My answer still does what it does, regardless of opinions on `eval`. – piRSquared Aug 01 '16 at 21:01
  • I was actually just curious if there was anything more to it than the problem that it evaluates whatever a user types in. It's a good thing to note for the OP though, that `eval` can be dangerous if the assumption that this isn't somehow involving user input is wrong. – Jeff Aug 01 '16 at 21:04
  • @JeffL. @piRSquared I have found odd behavior using `eval`. See the EDIT above. I wish your function worked :/ – ShanZhengYang Aug 01 '16 at 22:49
2

You can do it by applying eval to the column:

data = {'one':['1/20', '2/30']}
df = pd.DataFrame(data)

In [8]: df['one'].apply(eval)
Out[8]:
0    0.050000
1    0.066667
Name: one, dtype: float64
Jeff
  • 2,158
  • 1
  • 16
  • 29