Python: Using Regex to Extract Information from Pandas Series

Question

I have some concatenated text data in a Pandas series which I want to split out into 3 columns.

The string in each "cell" within the Series is made up in 3 parts like this:

[pesticide_name][amount_detected_(mg/kg)][MRL]

I have explored using series.str.split(... but I think that s.str.extract(... with regex capture groups will be more effective. However, I am new to regex and this is proving to be a great challenge.

The solution I am trying to work with is here.

Here's a sample of the series:

df['pesticide_residues_found_in_mg/kg_(mrl)'].head(20)
# 0     Spirotetramat (partial sum) 0.03 (MRL = 2)
# 1                                            n/a
# 2                                            n/a
# 3                                            n/a
# 4                                            n/a
# 5                                            n/a
# 6                                            n/a
# 7                     fluopyram 0.01 (MRL = 0.9)
# 8                   fenpyrazamine 0.02 (MRL = 3)
# 9                     fluopyram 0.05 (MRL = 0.9)
# 10                  acetamiprid 0.03 (MRL = 0.2)
# 11                   cyprodinil 0.04 (MRL = 1.5)
# 12                  fludioxonil 0.02 (MRL = 0.4)
# 13                  fenpyrazamine 0.07 (MRL = 3)
# 14                  thiacloprid 0.02 (MRL = 0.7)
# 15                  acetamiprid 0.04 (MRL = 0.2)
# 16                 chlorothalonil 0.03 (MRL = 6)
# 17                    cyprodinil 0.1 (MRL = 1.5)
# 18                  fludioxonil 0.03 (MRL = 0.4)
# 19                   pyrimethanil 0.09 (MRL = 1)
# Name: pesticide_residues_found_in_mg/kg_(mrl), dtype: object

The information I would like to extract from this series is:
1) the pesticide name, which is the first word.
2) the amount detected, which is a decimal or float, expressed to one or two decimal places.
3) the MRL; here I would like to capture only the number, not the parentheses or "MRL = ".

Notes:
* Pesticide names: the name is sometimes a two-part hyphenated word, e.g. "lambda-cyhalothrin".
* Pesticide names: the name is sometimes followed by extra information in parentheses, such as "(sum)" or "(partial sum)".
* Amounts detected: while the number is usually expressed to one or two decimal places, it is conceivable that the amount detected will be a whole number, e.g. '4' or '20'.
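To make the notes above concrete, here is a small sketch that checks one candidate pattern against those edge cases using plain `re`. The pattern, the group names, and the numeric values in the hyphenated and whole-number samples are my own assumptions for illustration; only the pesticide names and the "(MRL = ...)" layout come from the data described above.

```python
import re

# Hypothetical pattern (an assumption, not taken from the data source):
PATTERN = re.compile(
    r'(?P<chem_name>.*?)\s+'                   # name, may contain hyphens or "(sum)"
    r'(?P<amount_detected>\d+(?:\.\d+)?)\s+'   # whole number or decimal
    r'\(MRL\s*=\s*(?P<mrl>\d+(?:\.\d+)?)\)'    # literal "(MRL = ...)", only the number is captured
)

samples = [
    "lambda-cyhalothrin 0.02 (MRL = 0.2)",         # hyphenated name, made-up values
    "Spirotetramat (partial sum) 0.03 (MRL = 2)",  # name with extra info, from the sample
    "ddt (sum) 4 (MRL = 20)",                      # whole-number amount, made-up values
]

# One dict of named groups per sample string
results = [PATTERN.search(text).groupdict() for text in samples]
for row in results:
    print(row)
```

The lazy `.*?` lets the name absorb spaces and parentheses, while the amount is pinned down by requiring the literal "(MRL" to follow it.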

Code I have tried:

df['pesticide_residues_found_in_mg/kg_(mrl)'].str.extract(r'(?P<mrl>\(MRL = \d.?\d+?\))') 
# This works but captures "MRL = " but if I remove this, it tends to capture the amount detected instead, so "MRL = " identifies the correct number although it's junk I do not want. 

df['pesticide_residues_found_in_mg/kg_(mrl)'].str.extract(r'(?P<mrl>\d+\.?\d+?)')
# This doesn't work: it captures the amount detected part of the string instead, and only to one decimal place too!
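A possible explanation for the second attempt, sketched with plain `re` on one string from the sample: `str.extract` returns the first match in the string, which is the amount, and the lazy `\d+?` stops after a single digit. The greedy and MRL-anchored alternatives below are my own suggestions, not part of the original attempts.

```python
import re

text = "fluopyram 0.01 (MRL = 0.9)"

# \d+? is lazy, so it takes as few digits as it can (one) after the dot.
lazy = re.search(r'\d+\.?\d+?', text).group()

# A greedy \d* takes every remaining digit of the first number.
greedy = re.search(r'\d+\.?\d*', text).group()

# Anchoring on the "MRL =" text skips the amount; only the number is in the group.
mrl = re.search(r'MRL\s*=\s*(\d+\.?\d*)', text).group(1)

print(lazy, greedy, mrl)
```

So the uncaptured "MRL = " text is what steers the match to the right number; it does not have to be inside the capture group.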

An attempt at a regex capture group for the amount detected might be: (?P<amount_detected>\d+.?\d{1,2}?)

I have also attempted to use markers such as \b , ^ and $ in order to mark word boundaries and the string's beginning and end, but couldn't seem to make this work either.

An example of the new series in my df I would like to achieve:

index | chem_name      | amount_detected | mrl
0     | chlorothalonil | 0.03            | 0.1
1     | fenpyrazamine  | 0.1             | 3
2     | ddt (sum)      | 2.45            | 0

Maybe `.str.extract(r'(.*)\s(\d[\d.]*)\s+\(MRL\s*=\s*(\d[\d.]*)\)')`? Or `(?P<chem_name>.*)\s(?P<amount_detected>\d[\d.]*)\s+\(MRL\s*=\s*(?P<mrl>\d[\d.]*)\)` if you want named groups. — Wiktor Stribiżew, Aug 20 '19 at 13:35
That's very clear. Thank you. So I think \(MRL\s*=\s*(?P<mrl>\d[\d.]*)\) means that the named group is nested inside the ignored "MRL" group. Is that correct? I will have a play around with that. — Westworld, Aug 20 '19 at 15:46
Ah! Yes, that opening parenthesis bracket is escaped! So the regex can be used to identify a pattern that identifies part of the string, but is not captured. That makes a lot of sense. — Westworld, Aug 21 '19 at 10:27
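A short sketch of the point made in the comments above, on one string from the sample: an escaped `\(` matches a literal bracket and does not open a group, so the surrounding "MRL = " text is matched but not captured. The group name `mrl` follows the question's own attempt; the `(?:...)` variant is an extra illustration I added.

```python
import re

text = "cyprodinil 0.04 (MRL = 1.5)"

# \( and \) are literal brackets; (?P<mrl>...) is the only capture group.
m = re.search(r'\(MRL\s*=\s*(?P<mrl>\d[\d.]*)\)', text)
whole_match = m.group(0)    # everything the pattern matched
captured = m.group('mrl')   # only what the named group captured
group_count = m.re.groups   # number of capture groups in the pattern

# (?:...) groups part of a pattern without capturing it.
non_capturing = re.search(r'(?:MRL\s*=\s*)(\d[\d.]*)', text).groups()

print(whole_match, captured, group_count, non_capturing)
```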

score 2 · Accepted Answer · answered Aug 20 '19 at 14:08

Here you go. Let me know what you think. Just use str.extract and rename the columns to match. Code here:

import pandas as pd
from io import StringIO  # pandas.compat.StringIO is not available in recent pandas versions


RawData="""
id;pesticide_residues_found_in_mg/kg_(mrl)
0;Spirotetramat (partial sum) 0.03 (MRL = 2)
1;n/a
2;n/a
3;n/a
4;n/a
5;n/a
6;n/a
7;fluopyram 0.01 (MRL = 0.9)
8;fenpyrazamine 0.02 (MRL = 3)
9;fluopyram 0.05 (MRL = 0.9)
10;acetamiprid 0.03 (MRL = 0.2)
11;cyprodinil 0.04 (MRL = 1.5)
12;fludioxonil 0.02 (MRL = 0.4)
13;fenpyrazamine 0.07 (MRL = 3)
14;thiacloprid 0.02 (MRL = 0.7)
15;acetamiprid 0.04 (MRL = 0.2)
16;chlorothalonil 0.03 (MRL = 6)
17;cyprodinil 0.1 (MRL = 1.5)
18;fludioxonil 0.03 (MRL = 0.4)
19;pyrimethanil 0.09 (MRL = 1)

"""
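# Parse the semicolon-separated sample into a DataFrame
df = pd.read_csv(StringIO(RawData), sep=";")
# Capture the name, the amount, and the number inside the literal "(MRL = ...)"
df = df['pesticide_residues_found_in_mg/kg_(mrl)'].str.extract(r'(.*)\s(\d[\d.]*)\s+\(MRL\s*=\s*(\d[\d.]*)\)')
df.rename(columns={0: 'pesticide name', 1: 'amount detected', 2: 'MRL'}, inplace=True)
# "n/a" rows do not match the pattern and come back as NaN, so drop them
df.dropna()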

Result below:

    pesticide name               amount detected  MRL
0   Spirotetramat (partial sum)  0.03             2
7   fluopyram                    0.01             0.9
8   fenpyrazamine                0.02             3
9   fluopyram                    0.05             0.9
10  acetamiprid                  0.03             0.2
11  cyprodinil                   0.04             1.5
12  fludioxonil                  0.02             0.4
13  fenpyrazamine                0.07             3
14  thiacloprid                  0.02             0.7
15  acetamiprid                  0.04             0.2
16  chlorothalonil               0.03             6
17  cyprodinil                   0.1              1.5
18  fludioxonil                  0.03             0.4
19  pyrimethanil                 0.09             1
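As a possible follow-up, here is a sketch of the named-group variant suggested in the comments, which makes the rename step unnecessary because `str.extract` uses the group names as column names. Note that the extracted values are strings, so this sketch also converts the two numeric columns to floats. The column names `chem_name`, `amount_detected` and `mrl` are taken from the desired output in the question; the four-row series is a shortened copy of the sample data.

```python
import pandas as pd

# Shortened copy of the sample series from the question
s = pd.Series([
    "Spirotetramat (partial sum) 0.03 (MRL = 2)",
    "n/a",
    "fluopyram 0.01 (MRL = 0.9)",
    "cyprodinil 0.1 (MRL = 1.5)",
])

# Named groups become the column names of the extracted DataFrame
pattern = (
    r'(?P<chem_name>.*)\s'
    r'(?P<amount_detected>\d[\d.]*)\s+'
    r'\(MRL\s*=\s*(?P<mrl>\d[\d.]*)\)'
)

out = (
    s.str.extract(pattern)
     .dropna()                                         # "n/a" rows do not match
     .astype({'amount_detected': float, 'mrl': float}) # strings to numbers
)
print(out)
```

With real data it may be safer to use `pd.to_numeric(..., errors='coerce')` instead of `astype`, since `\d[\d.]*` would also accept a malformed value such as "1.2.3".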
