10

I have a dataframe like as shown below

tdf = pd.DataFrame({'text_1':['value: 1.25MG - OM - PO/TUBE - ashaf', 'value:2.5 MG - OM - PO/TUBE -test','value: 18 UNITS(S)','value: 850 MG - TDS AFTER FOOD - SC (SUBCUTANEOUS) -had', 'value: 75 MG - OM - PO/TUBE']})

I would like to apply regex and create two columns based on rules given below

col val should store all text after value: and before first hyphen

col Adm should store all text after third hyphen

I tried the below but it doesn't work accurately

tdf['text_1'].str.findall('[.0-9]+\s*[mgMG/lLcCUNIT]+')

enter image description here

The Great
  • 7,215
  • 7
  • 40
  • 128

3 Answers3

9

With your shown samples, could you please try following.

tdf[["val", "Adm"]] = tdf["text_1"].str.extract(r'^value:\s?(\S+(?:\s[^-]+)?)(?:\s-\s.*?-([^-]*)(?:-.*)?)?$', expand=True)
tdf

Online demo for above regex

Output will be as follows.

                                                    text_1          val                  Adm
0                     value: 1.25MG - OM - PO/TUBE - ashaf       1.25MG             PO/TUBE 
1                        value:2.5 MG - OM - PO/TUBE -test       2.5 MG             PO/TUBE 
2                                       value: 18 UNITS(S)  18 UNITS(S)                  NaN
3  value: 850 MG - TDS AFTER FOOD - SC (SUBCUTANEOUS) -had       850 MG   SC (SUBCUTANEOUS) 
4                              value: 75 MG - OM - PO/TUBE        75 MG              PO/TUBE

Explanation: Adding detailed explanation for above.

^value:\s?       ##Checking if value starts from value: space is optional here.
(\S+             ##Starting 1st capturing group from here and matching all non space here.
  (?:\s[^-]+)?   ##In a non-capturing group matching space till - comes keeping it optional.
)                ##Closing 1st capturing group here.
(?:\s-\s.*?-     ##In a non-capturing group matching space-space till - first occurrence.
  ([^-]*)        ##Creating 2nd capturing group which has values till next - here.
  (?:-.*)?       ##In a non capturing group from - till end of value keeping it optional.
)?$              ##Closing non-capturing group at the end of the value here.
RavinderSingh13
  • 130,504
  • 14
  • 57
  • 93
  • @TheGreat, could you please do let me know for which samples it doesn't work. With your shown samples its working fine for me. – RavinderSingh13 Apr 12 '21 at 11:27
  • Oh sorry, I thought you read my comment on the previous answer. I updated my sample dataframe now. – The Great Apr 12 '21 at 11:27
  • @TheGreat, with your latest changes(df) when I test `tdf[["val", "Adm"]] = tdf["text_1"].str.extract(r'^value:\s?(\d+(?:\.\d+)?\s?(?:\S+)?)(?:\s-[^-]*-(.*))?$', expand=True)` my bonus solution it works fine for me, let me know on same please. – RavinderSingh13 Apr 12 '21 at 11:29
  • Sorry, bonus solution doesn't work with the updated sample – The Great Apr 12 '21 at 11:32
9

Series.str.extract

tdf['text_1'].str.extract(r'^value:\s?([^-]+)(?:\s-.*?-\s)?([^-]*)(?:\s|$)')

             0                  1
0       1.25MG            PO/TUBE
1       2.5 MG            PO/TUBE
2  18 UNITS(S)                   
3       850 MG  SC (SUBCUTANEOUS)
4        75 MG            PO/TUBE

Regex details:

  • ^ : Assert position at start of line
  • value: : Matches character sequence value:
  • \s?: Matches any whitespace character between zero and one time
  • ([^-]+) : First capturing group matches any character except - one or more times
  • (?:\s-.*?-\s)? : Non capturing group match between zero and one time
    • \s: Matches single whitespace character
    • - : Matches character -
    • .*? : Matches any character between zero and unlimited times but as few times as possible
    • - : Matches character -
    • \s : Matches single whitespace character
  • ([^-]*) : Second capturing group matches any character except - zero or more times
  • (?:\s|$) : Non capturing group
    • \s- : Matches single whitespace character
    • | : Or switch
    • $ : Assert position at the end of line

See the online Regex demo

Shubham Sharma
  • 68,127
  • 6
  • 24
  • 53
5

You can use

tdf[["val", "Adm"]] = tdf["text_1"].str.extract(r'^val:\s*([^-]*?)(?:\s*-[^-]*-\s*(.*))?$', expand=True)
# => >>> tdf
                                             text_1          val  \
0                        val: 1.25MG - OM - PO/TUBE       1.25MG   
1                         val:2.5 MG - OM - PO/TUBE       2.5 MG   
2                                  val: 18 UNITS(S)  18 UNITS(S)   
3  val: 850 MG - TDS AFTER FOOD - SC (SUBCUTANEOUS)       850 MG   
4                         val: 75 MG - OM - PO/TUBE        75 MG   


0            PO/TUBE  
1            PO/TUBE  
2                NaN  
3  SC (SUBCUTANEOUS)  
4            PO/TUBE  

See the regex demo.

Details:

  • ^val: - val: at the start of string (if val: is not always at the start of the string, remove ^ anchor)
  • \s* - zero or more whitespaces
  • ([^-]*?) - Group 1: any chars other than - as few as possible
  • (?:\s*-[^-]*-\s*(.*))? - an optional sequence of
    • \s* - zero or more whitespaces
    • -[^-]*- - a -, any zero or more chars other than -, and then a -
    • \s* - zero or more whitespaces
    • (.*) - Group 2: the rest of the line
  • $ - end of string.
Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563
  • one minor question. Let's say if I wish to reorder the columns. Meaning, `Adm` should come first and `val` as the last column. Would the regex remain the same? – The Great Apr 12 '21 at 10:16
  • Sorry, couldn't try as I am away from my desk – The Great Apr 12 '21 at 10:16
  • @TheGreat After extracting, you can [reorder the columns](https://stackoverflow.com/questions/13148429/how-to-change-the-order-of-dataframe-columns), add the `tdf = tdf[['text_1', 'Adm', 'val']]` line. – Wiktor Stribiżew Apr 12 '21 at 10:55
  • Thanks, one last question. I am trying to change your regex to pick all text after `3rd hyphen` but before `4th hyphen`... So, I write the below `tdf["text_1"].str.extract(r'^value:\s*([^-]*?)(?:\s*-[^-]*-\s*(.*))?[^-]*', expand=True)` but that seems to give incorrect output. would you be able to help? – The Great Apr 12 '21 at 11:08
  • 1
    @TheGreat That will be `^val:\s*([^-]*?)(?:\s*-[^-]*-\s*([^-]*))`, see [demo](https://regex101.com/r/ZTTzb3/2). – Wiktor Stribiżew Apr 12 '21 at 11:33