Applying regex to pandas column based on different pos of same character

Question

I have a dataframe like as shown below

tdf = pd.DataFrame({'text_1':['value: 1.25MG - OM - PO/TUBE - ashaf', 'value:2.5 MG - OM - PO/TUBE -test','value: 18 UNITS(S)','value: 850 MG - TDS AFTER FOOD - SC (SUBCUTANEOUS) -had', 'value: 75 MG - OM - PO/TUBE']})

I would like to apply regex and create two columns based on rules given below

col val should store all text after value: and before first hyphen

col Adm should store all text after third hyphen

I tried the below but it doesn't work accurately

tdf['text_1'].str.findall('[.0-9]+\s*[mgMG/lLcCUNIT]+')

RavinderSingh13 · Answer 1 · 2021-04-12T12:38:08.957

With your shown samples, could you please try following.

tdf[["val", "Adm"]] = tdf["text_1"].str.extract(r'^value:\s?(\S+(?:\s[^-]+)?)(?:\s-\s.*?-([^-]*)(?:-.*)?)?$', expand=True)
tdf

Online demo for above regex

Output will be as follows.

                                                    text_1          val                  Adm
0                     value: 1.25MG - OM - PO/TUBE - ashaf       1.25MG             PO/TUBE 
1                        value:2.5 MG - OM - PO/TUBE -test       2.5 MG             PO/TUBE 
2                                       value: 18 UNITS(S)  18 UNITS(S)                  NaN
3  value: 850 MG - TDS AFTER FOOD - SC (SUBCUTANEOUS) -had       850 MG   SC (SUBCUTANEOUS) 
4                              value: 75 MG - OM - PO/TUBE        75 MG              PO/TUBE

Explanation: Adding detailed explanation for above.

^value:\s?       ##Checking if value starts from value: space is optional here.
(\S+             ##Starting 1st capturing group from here and matching all non space here.
  (?:\s[^-]+)?   ##In a non-capturing group matching space till - comes keeping it optional.
)                ##Closing 1st capturing group here.
(?:\s-\s.*?-     ##In a non-capturing group matching space-space till - first occurrence.
  ([^-]*)        ##Creating 2nd capturing group which has values till next - here.
  (?:-.*)?       ##In a non capturing group from - till end of value keeping it optional.
)?$              ##Closing non-capturing group at the end of the value here.

@TheGreat, could you please do let me know for which samples it doesn't work. With your shown samples its working fine for me. — RavinderSingh13, Apr 12 '21 at 11:27
Oh sorry, I thought you read my comment on the previous answer. I updated my sample dataframe now. — The Great, Apr 12 '21 at 11:27
@TheGreat, with your latest changes(df) when I test `tdf[["val", "Adm"]] = tdf["text_1"].str.extract(r'^value:\s?(\d+(?:\.\d+)?\s?(?:\S+)?)(?:\s-[^-]*-(.*))?$', expand=True)` my bonus solution it works fine for me, let me know on same please. — RavinderSingh13, Apr 12 '21 at 11:29

score 9 · Answer 2 · answered Apr 12 '21 at 16:35

`Series.str.extract`

tdf['text_1'].str.extract(r'^value:\s?([^-]+)(?:\s-.*?-\s)?([^-]*)(?:\s|$)')

             0                  1
0       1.25MG            PO/TUBE
1       2.5 MG            PO/TUBE
2  18 UNITS(S)                   
3       850 MG  SC (SUBCUTANEOUS)
4        75 MG            PO/TUBE

Regex details:

^ : Assert position at start of line
value: : Matches character sequence value:
\s?: Matches any whitespace character between zero and one time
([^-]+) : First capturing group matches any character except - one or more times
(?:\s-.*?-\s)? : Non capturing group match between zero and one time
- \s: Matches single whitespace character
- - : Matches character -
- .*? : Matches any character between zero and unlimited times but as few times as possible
- - : Matches character -
- \s : Matches single whitespace character
([^-]*) : Second capturing group matches any character except - zero or more times
(?:\s|$) : Non capturing group
- \s- : Matches single whitespace character
- | : Or switch
- $ : Assert position at the end of line

See the online Regex demo

score 5 · Accepted Answer · answered Apr 12 '21 at 10:10

5

You can use

tdf[["val", "Adm"]] = tdf["text_1"].str.extract(r'^val:\s*([^-]*?)(?:\s*-[^-]*-\s*(.*))?$', expand=True)
# => >>> tdf
                                             text_1          val  \
0                        val: 1.25MG - OM - PO/TUBE       1.25MG   
1                         val:2.5 MG - OM - PO/TUBE       2.5 MG   
2                                  val: 18 UNITS(S)  18 UNITS(S)   
3  val: 850 MG - TDS AFTER FOOD - SC (SUBCUTANEOUS)       850 MG   
4                         val: 75 MG - OM - PO/TUBE        75 MG   


0            PO/TUBE  
1            PO/TUBE  
2                NaN  
3  SC (SUBCUTANEOUS)  
4            PO/TUBE

See the regex demo.

Details:

^val: - val: at the start of string (if val: is not always at the start of the string, remove ^ anchor)
\s* - zero or more whitespaces
([^-]*?) - Group 1: any chars other than - as few as possible
(?:\s*-[^-]*-\s*(.*))? - an optional sequence of
- \s* - zero or more whitespaces
- -[^-]*- - a -, any zero or more chars other than -, and then a -
- \s* - zero or more whitespaces
- (.*) - Group 2: the rest of the line
$ - end of string.

answered Apr 12 '21 at 10:10

Wiktor Stribiżew

607,720
39
448
563

one minor question. Let's say if I wish to reorder the columns. Meaning, `Adm` should come first and `val` as the last column. Would the regex remain the same? – The Great Apr 12 '21 at 10:16
Sorry, couldn't try as I am away from my desk – The Great Apr 12 '21 at 10:16
@TheGreat After extracting, you can [reorder the columns](https://stackoverflow.com/questions/13148429/how-to-change-the-order-of-dataframe-columns), add the `tdf = tdf[['text_1', 'Adm', 'val']]` line. – Wiktor Stribiżew Apr 12 '21 at 10:55
Thanks, one last question. I am trying to change your regex to pick all text after `3rd hyphen` but before `4th hyphen`... So, I write the below `tdf["text_1"].str.extract(r'^value:\s*([^-]*?)(?:\s*-[^-]*-\s*(.*))?[^-]*', expand=True)` but that seems to give incorrect output. would you be able to help? – The Great Apr 12 '21 at 11:08
1

@TheGreat That will be `^val:\s*([^-]*?)(?:\s*-[^-]*-\s*([^-]*))`, see [demo](https://regex101.com/r/ZTTzb3/2). – Wiktor Stribiżew Apr 12 '21 at 11:33

Applying regex to pandas column based on different pos of same character

3 Answers3

Series.str.extract

`Series.str.extract`