I am trying to capture a GPA with a regex in Python. For example, I want to capture just 3.75 from both of the lines below. The word GPA can appear either before or after the number.

GPA 3.75 / 4.0
3.75 / 4.0 GPA

((?:gpa[ :]+)(\d\.?\d{0,2})(?:[/\d\. ]{0,6})?)|((\d\.?\d{0,2})(?:[/\d\. ]{0,6})? ?(?:gpa))

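Here is my attempt. It works, but I would like to know whether there is a way to avoid repeating the parts highlighted in yellow, i.e. the number sub-pattern `(\d\.?\d{0,2})(?:[/\d\. ]{0,6})?`, which is exactly the same in both alternatives.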
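To make the duplication concrete, here is a minimal sketch that runs the pattern above on the two sample lines. It assumes case-insensitive matching via `re.I`, which the question does not state explicitly; the `results` dictionary is only an illustrative name. Because the number sub-pattern appears once per alternative, the value lands in a different group depending on which branch matched.

```python
import re

# The pattern from the question, compiled with re.I (an assumption; the
# question does not say which flags were used).
rx = re.compile(
    r'((?:gpa[ :]+)(\d\.?\d{0,2})(?:[/\d\. ]{0,6})?)'
    r'|((\d\.?\d{0,2})(?:[/\d\. ]{0,6})? ?(?:gpa))',
    re.I,
)

results = {}
for text in ['GPA 3.75 / 4.0', '3.75 / 4.0 GPA']:
    m = rx.search(text)
    # The value is in group 2 for the "GPA first" branch and in group 4 otherwise.
    results[text] = m.group(2) or m.group(4)
    print(text, '=>', results[text])
```

Having to check two groups is a direct consequence of writing the number sub-pattern twice.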

You can see more examples here. Thanks!

Ismael Padilla
E.K.
  • You can use `re.search(r'(gpa[ :]+)?(\d+(?:\.\d+)?)(?(1)| *gpa)', text, re.I).group(2)`, see [the regex demo](https://regex101.com/r/0T2ZW3/2). – Wiktor Stribiżew Feb 24 '21 at 18:54
  • Thanks, I am looking for something like this. But this incorrectly captures `4.0` instead of `3.75` for `3.75 / 4.0 GPA`. – E.K. Feb 24 '21 at 18:58
  • So, `r'(\bgpa[ :]+)?(\d+(?:\.\d+)?(?:[/\d\. ]{0,6})?)(?(1)| *gpa\b)'`? Or, `r'(\bgpa[ :]+)?(\d+(?:\.\d+)?)(?:[/\d\. ]{0,6})?(?(1)| *gpa\b)'`? – Wiktor Stribiżew Feb 24 '21 at 18:58
  • Oh, the second one seems to work! Would you like to post it as an answer so that I can accept it and resolve the question? – E.K. Feb 24 '21 at 19:00
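
The difference between the first suggestion and the second follow-up variant in the comments above can be checked with a short sketch. The variable names are illustrative only; the patterns are copied from the comments.

```python
import re

text = "3.75 / 4.0 GPA"

# Pattern from the first comment: nothing may sit between the number and "gpa".
first = re.compile(r'(gpa[ :]+)?(\d+(?:\.\d+)?)(?(1)| *gpa)', re.I)
# Second variant from the follow-up comment: up to six filler characters
# (slash, digits, dots, spaces) are allowed after the captured number.
second = re.compile(
    r'(\bgpa[ :]+)?(\d+(?:\.\d+)?)(?:[/\d\. ]{0,6})?(?(1)| *gpa\b)', re.I
)

first_value = first.search(text).group(2)
second_value = second.search(text).group(2)
print(first_value)   # 4.0
print(second_value)  # 3.75
```

With the first pattern, the only number immediately followed by `GPA` is `4.0`, so the search skips past `3.75`. The filler class lets the match start at `3.75` and still reach the trailing `GPA`.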

1 Answer

You can use a conditional construct, `(?(1)...|...)`, so that the number pattern is written only once:

import re

text = '3.75 / 4.0 GPA'  # sample input taken from the question
m = re.search(r'(\bgpa[ :]+)?(\d+(?:\.\d+)?)[/\d. ]{0,6}(?(1)| *gpa\b)', text, re.I)
if m:
    print(m.group(2))  # prints 3.75

See the regex demo.

Details:

  • `(\bgpa[ :]+)?` - an optional capturing group (Group 1): the word `gpa` preceded by a word boundary (the search is case insensitive because of `re.I`), followed by one or more spaces or colons
  • `(\d+(?:\.\d+)?)` - Capturing group 2: one or more digits, optionally followed by a dot and one or more digits
  • `[/\d. ]{0,6}` - zero to six characters, each a `/`, a digit, a dot or a space
  • `(?(1)| *gpa\b)` - a conditional: if Group 1 matched, match an empty string; otherwise, match zero or more spaces and the whole word `gpa`.
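
As a sketch of how the parts listed above fit together, the same pattern can be written with `re.VERBOSE` so that each part carries its own comment. The names `GPA_RX` and `extract_gpa` are hypothetical helpers introduced only for this illustration. Note that verbose mode ignores literal spaces outside character classes, so ` *` is written as `[ ]*` here.

```python
import re

# The same pattern, spelled out with re.VERBOSE. Literal spaces outside a
# character class are ignored in verbose mode, so " *" is written as "[ ]*".
GPA_RX = re.compile(r"""
    (\bgpa[ :]+)?        # group 1 (optional): "gpa" plus spaces or colons
    (\d+(?:\.\d+)?)      # group 2: the GPA value, e.g. 3.75 or 3
    [/\d. ]{0,6}         # filler such as " / 4.0"
    (?(1)|[ ]*gpa\b)     # if group 1 did not match, "gpa" must follow instead
""", re.I | re.X)

def extract_gpa(text):
    """Return the GPA as a string, or None when no GPA is found (hypothetical helper)."""
    m = GPA_RX.search(text)
    return m.group(2) if m else None

leading = GPA_RX.search("GPA 3.75 / 4.0")
trailing = GPA_RX.search("3.75 / 4.0 GPA")
print(leading.group(1), leading.group(2))    # group 1 is 'GPA '
print(trailing.group(1), trailing.group(2))  # group 1 is None
```

When `GPA` comes first, Group 1 is set and the conditional matches nothing. When it comes last, Group 1 is `None` and the conditional consumes the trailing `GPA`.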

See a Python demo:

import re

# Sample inputs with the word GPA before or after the value
texts = ['GPA 3.75/4.00','GPA 3.75/4.0','GPA 3.75 /4.0','GPA 3.75 / 4.0','GPA 3.7 / 4      aaa','Major GPA: 3.6/4.0','GPA 3.1','GPA 3','some text GPA 3','3.25/4.0 GPA','some text 3.26 / 4.0 GPA','Minor in Art and Technology - 3.5/4.0 GPA aaaa','Minor in Art and Technology - 3.5 / 4.3 GPA aaaa']
rx = re.compile(r'(\bgpa[ :]+)?(\d+(?:\.\d+)?)[/\d. ]{0,6}(?(1)| *gpa\b)', re.I)
for text in texts:
    m = rx.search(text)
    if m:
        # Group 2 holds the GPA value
        print(text, "=>", m.group(2))

This outputs:

GPA 3.75/4.00 => 3.75
GPA 3.75/4.0 => 3.75
GPA 3.75 /4.0 => 3.75
GPA 3.75 / 4.0 => 3.75
GPA 3.7 / 4      aaa => 3.7
Major GPA: 3.6/4.0 => 3.6
GPA 3.1 => 3.1
GPA 3 => 3
some text GPA 3 => 3
3.25/4.0 GPA => 3.25
some text 3.26 / 4.0 GPA => 3.26
Minor in Art and Technology - 3.5/4.0 GPA aaaa => 3.5
Minor in Art and Technology - 3.5 / 4.3 GPA aaaa => 3.5
Wiktor Stribiżew
  • Thank you, for the capturing group 2, `(\d+(?:\.\d+)?)`, why do I need `?:`? – E.K. Feb 24 '21 at 19:17
  • @E.K. `(?:...)` is a [non-capturing group](https://stackoverflow.com/questions/3512471/what-is-a-non-capturing-group-in-regular-expressions) used only to group a sequence of patterns. – Wiktor Stribiżew Feb 24 '21 at 19:19
  • Yeah, for the case of `3.75`, I think `\d+` matches `3` and `(?:\.\d+)` captures `.75`, which is also a part I want to capture. So why non-capturing group for `.75`? – E.K. Feb 24 '21 at 19:24
  • @E.K. Since the `.75` part is not needed as a separate submatch in the output (it is already part of Group 2), there is no need to allocate memory for it. – Wiktor Stribiżew Feb 25 '21 at 10:16
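
The point made in the comments above can be illustrated with a small sketch comparing the two group styles. The variable names are illustrative, and the patterns are shortened to the prefix and the number for clarity.

```python
import re

text = "GPA 3.75 / 4.0"

# Non-capturing fraction: the only numbered groups are 1 (prefix) and 2 (value).
non_capturing = re.search(r'(\bgpa[ :]+)?(\d+(?:\.\d+)?)', text, re.I)
# Capturing fraction: the ".75" part becomes an extra group 3.
capturing = re.search(r'(\bgpa[ :]+)?(\d+(\.\d+)?)', text, re.I)

print(non_capturing.groups())  # ('GPA ', '3.75')
print(capturing.groups())      # ('GPA ', '3.75', '.75')
```

Group 2 contains `3.75` either way; the capturing version only adds a third group that the answer never reads.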