2

I have a data frame like as shown below

df = pd.DataFrame({'person_id': [11,11,11,11,11,11,11,11,11,11],
                   'text':['inJECTable 1234 Eprex DOSE 4000 units on NONd',
                           'department 6789 DOSE 8000 units on DIALYSIS days  -  IV Interm',
                           'inJECTable 4321 Eprex DOSE - 3 times/wk on NONdialysis day',
                           'insulin MixTARD  30/70 - inJECTable 46 units',
                           'insulin ISOPHANE -- InsulaTARD  Vial -  inJECTable 56 units  SC SubCutaneous',
                           '1-alfacalcidol DOSE 1 mcg  - 3 times a week  -  IV Intermittent',
                           'jevity liquid - FEEDS PO  Jevity  -  237 mL  -  1 times per day',
                           '1-alfacalcidol DOSE 1 mcg  - 3 times per week  -  IV Intermittent',
                           '1-supported DOSE 1 mcg  - 1 time/day  -  IV Intermittent',
                           '1-testpackage DOSE 1 mcg  - 1 time a day  -  IV Intermittent']})

I would like to remove the words/strings which follow patterns such as 46 units, 3 times a week, 3 times per week, 1 time/day etc.

I was reading about positive and negative look ahead and behind.

So, was trying something like below

[^([0-9\s]*(?=units))]  #to remove terms like `46 units` from the string
[^[0-9\s]*(?=times)(times a day)] # don't know how to make this work for all time variants

time variants ex: 3 times a day, 3 time/wk, 3 times per day, 3 times a month, 3 times/month etc.

Basically, I expect my output to be something like below (remove terms like xx units, xx time a day, xx times per week, xx time/day, xx time/wk, xx time/week, xx times per week, etc)

enter image description here

The Great
  • 7,215
  • 7
  • 40
  • 128

1 Answers1

3

You can consider a pattern like

\s*\d+\s*(?:units?|times?(?:\s+(?:a|per)\s+|\s*/\s*)(?:d(?:ay)?|w(?:ee)?k|month|y(?:ea)?r?))

See the regex demo

NOTE: the \d+ matches one or more digits. If you need to match any number, please consider using other patterns for a number in the format you expect, see regular expression for finding decimal/float numbers?, for example.

Pattern details

  • \s* - zero or more whitespace chars
  • \d+ - one or more digits
  • \s* - zero or more whitespaces
  • (?:units?|times?(?:\s+(?:a|per)\s+|\s*/\s*)(?:d(?:ay)?|w(?:ee)?k|month|y(?:ea)?r?)) - a non-capturing group matching:
    • units? - unit or units
    • | - or
    • times? - time or times
    • (?:\s+(?:a|per)\s+|\s*/\s*) - a or per enclosed with 1+ whitespaces, or / enclosed with 0+ whitespaces
    • (?:d(?:ay)?|w(?:ee)?k|month|y(?:ea)?r?) - d or day, or wk or week, or month, or y/yea/yr

If you need to match whole words only, use word boundaries, \b:

\s*\b\d+\s*(?:units?|times?(?:\s+(?:a|per)\s+|\s*/\s*)(?:d(?:ay)?|w(?:ee)?k|month|y(?:ea)?r?))\b

In Pandas, use

df['text'] = df['text'].str.replace(r'\s*\b\d+\s*(?:units?|times?(?:\s+(?:a|per)\s+|\s*/\s*)(?:d(?:ay)?|w(?:ee)?k|month|y(?:ea)?r?))\b', '')
Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563