I have some concatenated text data in a Pandas series which I want to split out into 3 columns.
The string in each "cell" within the Series is made up of 3 parts, like this:
[pesticide_name][amount_detected_(mg/kg)][MRL]
I have explored using series.str.split(...), but I think that s.str.extract(...) followed by regex capture groups will be more effective. However, I am new to regex and this is proving to be a great challenge.
The solution I am trying to work with is here.
Here's a sample of the series:
df['pesticide_residues_found_in_mg/kg_(mrl)'].head(20)
# 0 Spirotetramat (partial sum) 0.03 (MRL = 2)
# 1 n/a
# 2 n/a
# 3 n/a
# 4 n/a
# 5 n/a
# 6 n/a
# 7 fluopyram 0.01 (MRL = 0.9)
# 8 fenpyrazamine 0.02 (MRL = 3)
# 9 fluopyram 0.05 (MRL = 0.9)
# 10 acetamiprid 0.03 (MRL = 0.2)
# 11 cyprodinil 0.04 (MRL = 1.5)
# 12 fludioxonil 0.02 (MRL = 0.4)
# 13 fenpyrazamine 0.07 (MRL = 3)
# 14 thiacloprid 0.02 (MRL = 0.7)
# 15 acetamiprid 0.04 (MRL = 0.2)
# 16 chlorothalonil 0.03 (MRL = 6)
# 17 cyprodinil 0.1 (MRL = 1.5)
# 18 fludioxonil 0.03 (MRL = 0.4)
# 19 pyrimethanil 0.09 (MRL = 1)
# Name: pesticide_residues_found_in_mg/kg_(mrl), dtype: object
The information I would like to extract from this series is:
1) the pesticide name, which is the first word.
2) the amount detected, which is a decimal (float), usually expressed to one or two decimal places.
3) the MRL; however, I would like to capture only the number, not the parentheses or "MRL = ".
Notes:
* Pesticide names: sometimes the name is a two-part word which is hyphenated, e.g. "lambda-cyhalothrin".
* Pesticide names: sometimes the name is followed by extra information in parentheses, such as "(sum)" or "(partial sum)".
* Amounts detected: while the number is usually expressed to one or two decimal places, it is conceivable that the amount detected will be a whole number, e.g. '4' or '20'.
Code I have tried:
df['pesticide_residues_found_in_mg/kg_(mrl)'].str.extract(r'(?P<mrl>\(MRL = \d.?\d+?\))')
# This works, but it also captures the "MRL = " text. If I remove that part of the pattern, it tends to capture the amount detected instead, so "MRL = " identifies the correct number even though it is junk I do not want.
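One possible way to keep "MRL = " as an anchor without capturing it is to leave the literal text outside the capture group, so that only the number ends up in the named group. This is a sketch, assuming every MRL is written exactly as "(MRL = number)"; the small series `s` is a stand-in for the real column.

```python
import pandas as pd

# Stand-in for df['pesticide_residues_found_in_mg/kg_(mrl)'] (rows taken from the sample above).
s = pd.Series([
    "Spirotetramat (partial sum) 0.03 (MRL = 2)",
    "n/a",
    "fluopyram 0.01 (MRL = 0.9)",
])

# "\(MRL = " and "\)" must match but sit outside the group, so they are not captured.
# \d+(?:\.\d+)? accepts whole numbers ("2") as well as decimals ("0.9").
mrl = s.str.extract(r"\(MRL = (?P<mrl>\d+(?:\.\d+)?)\)")
print(mrl)
```

Rows that do not contain the pattern, such as "n/a", come back as NaN.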
df['pesticide_residues_found_in_mg/kg_(mrl)'].str.extract(r'(?P<mrl>\d+\.?\d+?)')
# This doesn't work: it captures the amount detected part of the string instead, and only to one decimal place too!
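A likely explanation for both symptoms, shown with the standard `re` module: the regex engine returns the first match from the left, which is the amount detected, and the lazy quantifier `\d+?` stops after a single digit. The alternative pattern below is only one suggested variant, not the only option.

```python
import re

sample = "fluopyram 0.01 (MRL = 0.9)"  # row 7 from the sample above

# \d+? is lazy: it takes as few digits as possible, so the match ends after one decimal digit.
lazy = re.search(r"\d+\.?\d+?", sample).group()

# Greedy digits with an optional decimal part take the whole number.
greedy = re.search(r"\d+(?:\.\d+)?", sample).group()

print(lazy)    # 0.0
print(greedy)  # 0.01
```

Both patterns still hit the amount detected first, because nothing in them ties the match to the MRL part of the string.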
An attempt at a regex capture group for the amount detected might be: (?P\d+.?\d{1,2}?)
I have also attempted to use markers such as \b, ^ and $ in order to mark word boundaries and the string's beginning and end, but couldn't seem to make this work either.
An example of the new columns in my df that I would like to achieve:
index - chem_name - amount_detected - mrl
0 - chlorothalonil - 0.03 - 0.1
1 - fenpyrazamine - 0.1 - 3
2 - ddt (sum) - 2.45 - 0
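A minimal sketch of how all three columns might be pulled out in one `str.extract` call, using the column names from the desired output. It assumes each non-empty cell ends with "(MRL = number)" and that the amount detected is the number immediately before it. The hyphenated row uses made-up values for illustration only.

```python
import pandas as pd

# Stand-in for the real column; the last row's numbers are hypothetical.
s = pd.Series([
    "Spirotetramat (partial sum) 0.03 (MRL = 2)",
    "n/a",
    "lambda-cyhalothrin 0.02 (MRL = 0.2)",
])

pattern = (
    r"^(?P<chem_name>.+?)\s+"                    # name, incl. hyphens and "(sum)"-style suffixes
    r"(?P<amount_detected>\d+(?:\.\d+)?)\s+"     # whole number or decimal
    r"\(MRL = (?P<mrl>\d+(?:\.\d+)?)\)$"         # only the number is captured
)

# Strip stray whitespace so the ^ and $ anchors line up with the text.
parts = s.str.strip().str.extract(pattern)

# The extracted values are strings; convert the two numeric columns to floats.
num_cols = ["amount_detected", "mrl"]
parts[num_cols] = parts[num_cols].apply(pd.to_numeric)
print(parts)
```

Because the name group is lazy (`.+?`) and must be followed by a number and then the MRL block, it keeps trailing text such as "(partial sum)" with the name. If only the first word is wanted, `\S+` could be used for the name instead, with the parenthetical matched outside the group. The result can be attached to the original frame with `df.join(parts)`.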