My goal is to identify the abbreviation that appears right after @PROG$ and change it to @PROG$ (e.g. (ALI) -> @PROG$).

Input

s = "Background (UNASSIGNED): Previous study of ours showed that @PROG$ (ALI) and C-reactive protein (CRP) are independent significant prognostic factors in operable non-small cell lung cancer (NSCLC) patients."

Output

"Background (UNASSIGNED): Previous study of ours showed that @PROG$ @PROG$ and C-reactive protein (CRP) are independent significant prognostic factors in operable non-small cell lung cancer (NSCLC) patients."

I tried something like `re.findall('(\(.*?\))', s)`, which gave me all the abbreviations. How do I go on from here? What do I need to fix?

Wiktor Stribiżew
JJ_K
  • So you've tried to find all text in parentheses. What have you tried to replace or `.sub`stitute them? Your example matches anything, what have you tried to match something specific? How would you get a regex to lookbehind itself and match only the one you need? – Grismar Dec 21 '20 at 21:38

1 Answer

You can use a `re.sub` solution like this:

import re
s = "Background (UNASSIGNED): Previous study of ours showed that @PROG$ (ALI) and C-reactive protein (CRP) are independent significant prognostic factors in operable non-small cell lung cancer (NSCLC) patients."
print( re.sub(r'(@PROG\$\s+)\([A-Z]+\)', r'\1@PROG$', s) )
# => Background (UNASSIGNED): Previous study of ours showed that @PROG$ @PROG$ and C-reactive protein (CRP) are independent significant prognostic factors in operable non-small cell lung cancer (NSCLC) patients.

The regex is

(@PROG\$\s+)\([A-Z]+\)

Details:

  • (@PROG\$\s+) - Group 1 (\1 refers to this group value from the replacement pattern): @PROG$ and one or more whitespaces
  • \( - a ( char
  • [A-Z]+ - one or more uppercase ASCII letters (replace with [^()]* to match anything in between parentheses except for ( and ))
  • \) - a ) char.
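
The `[^()]*` variant mentioned in the list above can be sketched like this. The sample sentence is made up for illustration and is not from the question; it only shows that the parenthesized text no longer has to be all uppercase letters.

```python
import re

# Hypothetical input: the parenthesized text after @PROG$ is not a plain
# uppercase abbreviation, so [A-Z]+ would not match it.
s = "Levels of @PROG$ (C-reactive protein, CRP) and albumin (ALB) were recorded."

# [^()]* matches any run of characters that are not parentheses.
result = re.sub(r'(@PROG\$\s+)\([^()]*\)', r'\1@PROG$', s)
print(result)
# => Levels of @PROG$ @PROG$ and albumin (ALB) were recorded.
```

Only the parentheses that directly follow `@PROG$` are replaced; `(ALB)` is left alone because it is not preceded by `@PROG$`.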
Wiktor Stribiżew
  • thanks for helping out. So my string actually has ALI throughout the sentence. Is there a way to identify "ALI" and turn this into @PROG$ as well? – JJ_K Dec 21 '20 at 21:44
  • eg. s = "Background (UNASSIGNED): Previous study of ours showed that @PROG$ (ALI) and C-reactive protein (CRP) are independent significant prognostic factors in operable non-small cell lung cancer (NSCLC) patients. Since both ALI and CRP are markers of inflammation, the aim of this study was to examine whether the combination of ALI and CRP is a prognostic indicator of resected NSCLC or not." – JJ_K Dec 21 '20 at 21:45
  • @JJ_K You can add `|\bALI\b` to the regex. See [this regex demo](https://regex101.com/r/5FCleQ/4). – Wiktor Stribiżew Dec 21 '20 at 21:48
  • so I want to use a formalized regex & apply to different sentences so having "ALI" can't be applied to other sentences. For eg. I have other string such as s2 = "Based on the combination of @PROG$ (CRP), patients low CRP were assigned an CRP score of 0." – JJ_K Dec 21 '20 at 22:05
  • so i want one regular expression that can identify the abbreviation word right after @PROG$ and find that word throughout the string and label those as @PROG$ as well. Is this feasible? – JJ_K Dec 21 '20 at 22:06
  • @JJ_K It is not. Do it in two steps: 1) extract the abbrev inside parentheses after `@PROG$`, then 2) replace all its occurrences. – Wiktor Stribiżew Dec 21 '20 at 22:20
  • I don't quite get how to do step 1. Using `(@PROG\$\s+)\([A-Z]+\)`, do I save this as a variable and replace it? I've tried `m = re.findall('(@PROG\$\s+)\([A-Z]+\)',s)` and `re.sub()`, but it doesn't really work. – JJ_K Dec 21 '20 at 22:29
  • @JJ_K I am somewhat unaware of what the actual requirements are, but just guessing there can be any amount of such abbreviations in the string and that you do not care about cases where the abbrev can appear with a single parenthesis, either on the left or on the right, you might use [this Python solution](https://ideone.com/1iHxBl). – Wiktor Stribiżew Dec 21 '20 at 22:37
  • Thank you for helping! just a quick q, what does "fr" mean in pattern = fr'\(?\b(?:{"|".join(matches)})\b\)?' – JJ_K Dec 21 '20 at 22:49
  • @JJ_K Please see [this question](https://stackoverflow.com/questions/12871066) and [this one](https://stackoverflow.com/questions/43123408/f-strings-vs-str-format). These are raw string literals allowing variable "expansion", or interpolation. – Wiktor Stribiżew Dec 21 '20 at 22:52
  • Thanks for the explanation. I'm encountering a case where there are no parentheses next to @PROG$, but the regex changes the whole string to @PROG$ and generates a mess. Can you help me understand why? When there is no () after @PROG$, I just want to skip the string and go to the next one. – JJ_K Dec 22 '20 at 01:36
  • this is what I tried: s = "The objective of this study was to assess the prognostic significance of @PROG$ in non-small cell lung cancer (NSCLC)." matches = re.findall(r'@PROG\$\s*\(([A-Z]+)\)', s) pattern = fr'\(?\b(?:{"|".join(matches)})\b\)?' sentences = re.sub(pattern, '@PROG$', s) – JJ_K Dec 22 '20 at 01:36
  • @JJ_K I see, just check if there are any matches, see [this Python demo](https://ideone.com/JaN5Gq). – Wiktor Stribiżew Dec 22 '20 at 09:43
  • I just tried the solution you provided; however, if I change the string to s = "The objective of this study was to assess the prognostic significance of @PROG$ (ALI)in non-small cell lung cancer (NSCLC)." it doesn't change (ALI) to @PROG$. – JJ_K Dec 22 '20 at 17:11
  • @JJ_K Because I copied the code you pasted into the comment. SO removes backslashes before parentheses if you do not enclose code in backticks. [Here is the correct code](https://ideone.com/aEXCw9). – Wiktor Stribiżew Dec 22 '20 at 18:43
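
The two-step approach discussed in the comments can be sketched as follows. This is a reconstruction from the snippets quoted in the comments, not the code behind the linked demos. The function name `label_abbreviations` is hypothetical, and the `re.escape` call is an added precaution. The early return covers the case with no parenthesized abbreviation after `@PROG$`: with an empty list of matches, the pattern's alternation would be empty and match at every word boundary, which is what produced the "mess".

```python
import re

def label_abbreviations(s):
    """Replace the abbreviation following @PROG$, and all of its other
    occurrences in the string, with @PROG$ (hypothetical helper)."""
    # Step 1: collect the abbreviations in parentheses right after @PROG$.
    matches = re.findall(r'@PROG\$\s*\(([A-Z]+)\)', s)
    if not matches:
        # Nothing to label: return the string unchanged instead of
        # building a pattern with an empty alternation.
        return s
    # Step 2: replace every occurrence, with or without parentheses.
    alternation = "|".join(re.escape(m) for m in matches)
    pattern = r'\(?\b(?:' + alternation + r')\b\)?'
    return re.sub(pattern, '@PROG$', s)

s2 = "Based on the combination of @PROG$ (CRP), patients low CRP were assigned an CRP score of 0."
print(label_abbreviations(s2))
# => Based on the combination of @PROG$ @PROG$, patients low @PROG$ were assigned an @PROG$ score of 0.

s3 = "The objective of this study was to assess the prognostic significance of @PROG$ in non-small cell lung cancer (NSCLC)."
print(label_abbreviations(s3) == s3)
# => True
```

Because both parentheses are optional in the step 2 pattern, an abbreviation with only one parenthesis, such as `(ALI` or `ALI)`, is replaced as well, which matches the caveat stated in the comments.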