ValueError: could not convert string to float: " " (empty string?)

Question

How do I go about removing an empty string or at least having regex ignore it?

I have some data that looks like this

EIV (5.11 gCO₂/t·nm)

I'm trying to extract the numbers only. I have done the following:

df['new column'] = df['column containing that value'].str.extract(r'((\d+.\d*)|(\d+)|(\.\d+)|(\d+[eE][+]?\d*)?)').astype('float')

since the numbers can be floats or integers, and I think there is one value in exponent notation, 4E+1.

However, when I run it I get the error in the title, which I presume comes from an empty string.

What am I missing here to allow the code to run?

I think you should use a single capture group `([-+]?[0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?)` — The fourth bird, Oct 19 '21 at 12:39

score 0 · Answer 1 · answered Oct 19 '21 at 12:42

Try this

import re

c = "EIV (5.11 gCO₂/t·nm)"

# Raw string so that \. reaches the regex engine unchanged
x = re.findall(r"[0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?", c)
print(x)

Will give

['5.11']
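
Since re.findall returns a list, a row with no number gives an empty list rather than an empty string. A minimal sketch of turning that into a float-or-None helper follows; the name first_float and the second and third sample strings are my own assumptions, not from the question.

```python
import re

# Same pattern as above, compiled once
NUMBER = re.compile(r"[0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?")

def first_float(text):
    """Return the first number found in text as a float, or None if there is none."""
    matches = NUMBER.findall(text)
    return float(matches[0]) if matches else None

print(first_float("EIV (5.11 gCO₂/t·nm)"))  # value from the question
print(first_float("EIV (4E+1 gCO₂/t·nm)"))  # hypothetical exponent value
print(first_float("EIV (n/a)"))             # hypothetical row without a number
```

Returning None for rows without a number avoids calling float() on an empty string, which is what raises the ValueError in the title.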

score 0 · Answer 2 · answered Oct 19 '21 at 12:45

The problem is not only the number of groups, but also the fact that the last alternative in your regex is optional (see the ? added right after it), so the pattern can match an empty string. Since Series.str.extract returns the first match, your regex matches and returns the empty string at the start of the string whenever the number is not at the string start position.
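
To make the empty match visible, here is a small sketch using plain re with the pattern from the question. The variable names are mine, and the second pattern (the same one with the trailing ? removed) is only there for contrast, not as a recommended fix.

```python
import re

# Pattern from the question: the trailing ? makes the last alternative optional,
# so the outer group is allowed to match nothing at all.
pattern = re.compile(r'((\d+.\d*)|(\d+)|(\.\d+)|(\d+[eE][+]?\d*)?)')

text = "EIV (5.11 gCO₂/t·nm)"

first = pattern.search(text)
# Show what the outer group captured and where the match sits in the string
print(repr(first.group(1)), first.span())

# Same pattern without the trailing ?: an empty match is no longer possible,
# so the search has to move on to the first real number.
no_optional = re.compile(r'((\d+.\d*)|(\d+)|(\.\d+)|(\d+[eE][+]?\d*))')
print(no_optional.search(text).group(1))
```

The first search succeeds immediately at index 0 with an empty capture, and it is that empty string that astype('float') cannot convert.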

It is best to use the well-known single alternative patterns to match any numbers with a single capturing group, e.g.

df['col'].str.extract(r'((?:(?:\b[0-9]+)?\.)?\b[0-9]+(?:[eE][-+]?[0-9]+)?)\b').astype(float)

See Example Regexes to Match Common Programming Language Constructs.

Pandas test:

import pandas as pd
df = pd.DataFrame({'col':['EIV (5.11 gCO₂/t·nm)', 'EIV (5.11E+12 gCO₂/t·nm)']})
df['col'].str.extract(r'((?:(?:\b[0-9]+)?\.)?\b[0-9]+(?:[eE][-+]?[0-9]+)?)\b').astype(float)
# =>               0
#    0  5.110000e+00
#    1  5.110000e+12

There are also quite a lot of other such regex variations at Parsing scientific notation sensibly?, and you may also use r"([-+]?[0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?)", r"(-?\d+(?:\.\d*)?(?:[eE][+-]?\d+)?)", r"([+-]?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?)", etc.
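
These variations are not interchangeable, so it may be worth checking them against your own data before picking one. A rough sketch with plain re follows; the dictionary labels and the second and third sample strings are assumptions of mine, only the first sample comes from the question.

```python
import re

# The three patterns quoted above, keyed by labels chosen for this sketch
patterns = {
    "optional-int": r"([-+]?[0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?)",
    "int-required": r"(-?\d+(?:\.\d*)?(?:[eE][+-]?\d+)?)",
    "no-leading-zeros": r"([+-]?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?)",
}

samples = [
    "EIV (5.11 gCO₂/t·nm)",  # from the question
    "EIV (4E+1 gCO₂/t·nm)",  # hypothetical exponent value
    "EIV (.5 gCO₂/t·nm)",    # hypothetical value with no integer part
]

def first_number(pattern, text):
    """Return the text captured by the first match, or None if nothing matches."""
    m = re.search(pattern, text)
    return m.group(1) if m else None

results = {
    label: [first_number(p, s) for s in samples]
    for label, p in patterns.items()
}
for label, found in results.items():
    print(label, found)
```

On these samples all three agree on 5.11 and 4E+1, but only the first pattern keeps the leading dot of .5; the other two require an integer part and capture just 5.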

score 0 · Answer 3 · answered Oct 19 '21 at 13:07

If your column consists of data in the same format as you posted, EIV (5.11 gCO₂/t·nm), then this will work:

import pandas as pd
# Raw string so that \d and \. are passed to the regex engine unchanged
df['new_extracted_column'] = df['column containing that value'].str.extract(r'(\d+(?:\.\d+)?)')
df

5.11

KReEd
