Python Regex to Capture Proceeding Text - mixing case insensitivity in group

Question

RegEx Group returning issue:

(?P<qa_type>(Q|A|Mr[\.|:]? [a-z]+|Mrs[\.|:]? [a-z]+|Ms[\.|:]? [a-z]+|Miss[\.|:]? [a-z]+|Dr[\.|:]? [a-z]+))?([\.|:|\s]+)?

Objective: To extract text from proceeding transcript pdfs for each question/answer/speaker type.

Using Python: iterate through the pages of text extracted from the PDF and group the Question/Answer text.

Desired Results = qa_type, page_start, page_end, line_num_start, line_num_end, qa_text

ISSUE: For the [Q|A] designators, I only want upper case, but for the speaker titles (Mr, Mrs., Dr., etc.) case insensitivity is required. Both Q|A and the speaker salutation go into a single 'qa_type' group.

Request: How do I prevent 'qa_type' from capturing 'a' or 'q'? See lines 2 and 17 on page 275.

Example bad extract - line 17 'a'

regex = r"(^(?P<line_num>[1-9]|1[0-9]|2[0-2])\b +)(?P<qa_type>(Q|A|Mr[\.|:]? [a-z]+|Mrs[\.|:]? [a-z]+|Ms[\.|:]? [a-z]+|Miss[\.|:]? [a-z]+|Dr[\.|:]? [a-z]+))?([\.|:|\s]+)?(?P<type_text>\b.*)|page (?P<page_num>\d{1,3})"

That final RegEx seems very complicated. Big RegExs are hard to maintain when changes are needed. I suggest using several simpler RegExs. The character classes seem strange; note that `[.|:|\s]` is the same as `[|.:\s]`. There are several uses of `[.|:]`, are the three characters allowed, or is the `|` intended to mean `.` or `:`? A `|` means itself within a character class (i.e. between square brackets). — AdrianHHH, Mar 19 '23 at 09:12
@AdrianHHH, thank you for the feedback as this is very helpful. I will refactor this once I get things working. Please recommend any other ideas for a more manageable approach, as I will be using this on millions of PDFs, so there will be variations in parsing syntax. — rnwtenor, Mar 19 '23 at 10:09
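To illustrate the point about character classes raised in the comments above, here is a minimal sketch. The sample strings are made up for the demonstration; only the `[.|:]` class comes from the question.

```python
import re

# Inside [...], "|" is an ordinary character, not an alternation operator,
# so [.|:] accepts ".", "|" or ":".
with_pipe = re.compile(r"Mr[.|:]? [a-z]+", re.IGNORECASE)
# Assumption: only "." and ":" were intended after the title.
without_pipe = re.compile(r"Mr[.:]? [a-z]+", re.IGNORECASE)

print(bool(with_pipe.fullmatch("Mr| Smith")))     # True: the literal "|" is accepted
print(bool(without_pipe.fullmatch("Mr| Smith")))  # False
print(bool(without_pipe.fullmatch("Mr. Smith")))  # True
```

If a stray `|` after a title should never count as part of the speaker name, `[.:]` is probably the class that was meant.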

score 0 · Accepted Answer · answered Mar 19 '23 at 09:06

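This sounds pretty similar to this question. Unfortunately, Python's `re` module does not let you switch a flag on and off mid-pattern with `(?i)` ... `(?-i)`: a global inline flag that is not at the start of the pattern has been deprecated since Python 3.6 and raises an error from Python 3.11. In a regex flavor that does support inline toggles (PCRE, for example), your regex would look like this (without the global case-insensitive flag):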

(^(?P<line_num>[1-9]|1[0-9]|2[0-2])\b +)(?P<qa_type>(Q|A|(?i)Mr[.|:]? [a-z]+|Mrs[.|:]? [a-z]+|Ms[.|:]? [a-z]+|Miss[.|:]? [a-z]+|Dr[.|:]? [a-z]+(?-i)))?([.|:|\s]+)?(?P<type_text>\b.*)|(?i)page(?-i) (?P<page_num>\d{1,3})
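For Python specifically, a scoped flag group of the form `(?i:...)`, available since Python 3.6, may be the closest equivalent to the toggle approach. The following is a sketch rather than a drop-in replacement: it assumes `[.:]` was meant instead of `[.|:]`, and the `parse` helper and the sample lines are hypothetical.

```python
import re

# (?i:...) makes only the enclosed part case-insensitive; the Q and A
# alternatives sit outside it, so they must still be upper case.
# Assumption: [.:] replaces the original [.|:], which also matched a literal "|".
PATTERN = re.compile(
    r"^(?P<line_num>[1-9]|1[0-9]|2[0-2])\b +"
    r"(?P<qa_type>Q|A|(?i:(?:Mrs|Mr|Ms|Miss|Dr)[.:]? [a-z]+))?"
    r"[.:\s]*"
    r"(?P<type_text>\b.*)"
    r"|(?i:page) (?P<page_num>\d{1,3})"
)

def parse(line):
    """Return the named groups for one transcript line, or None if nothing matches."""
    m = PATTERN.search(line)
    return m.groupdict() if m else None

print(parse("2 Q. Where were you?")["qa_type"])           # Q
print(parse("17 a lot of people were there")["qa_type"])  # None
print(parse("9 mr. Smith: Objection.")["qa_type"])        # mr. Smith
print(parse("Page 275")["page_num"])                      # 275
```

Because the lowercase `a` on the second sample line is outside the scoped group, it falls through to `type_text` instead of being captured as `qa_type`.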

The alternative is to just specify both the lowercase and uppercase characters every time you want a case-insensitive letter (again, without the global case-insensitive flag):

(^(?P<line_num>[1-9]|1[0-9]|2[0-2])\b +)(?P<qa_type>(Q|A|[mM][rR][.|:]? [a-zA-Z]+|[mM][rR][sS][.|:]? [a-zA-Z]+|[mM][sS][.|:]? [a-zA-Z]+|[mM][iI][sS][sS][.|:]? [a-zA-Z]+|[dD][rR][.|:]? [a-zA-Z]+))?([.|:|\s]+)?(?P<type_text>\b.*)|[pP][aA][gG][eE] (?P<page_num>\d{1,3})
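Writing the bracketed pairs by hand gets error-prone as the list of titles grows, so one option is to generate them. The sketch below does that; `either_case`, `TITLES` and the sample lines are hypothetical names and data introduced for illustration, and `[.:]` is assumed in place of `[.|:]`.

```python
import re

def either_case(word):
    """Return a pattern matching `word` in any letter case, e.g. 'Mr' -> '[mM][rR]'."""
    return "".join(
        f"[{ch.lower()}{ch.upper()}]" if ch.isalpha() else re.escape(ch)
        for ch in word
    )

# Hypothetical list of speaker titles; extend as new variants turn up.
TITLES = ["Mrs", "Mr", "Ms", "Miss", "Dr"]
speaker = "|".join(rf"{either_case(t)}[.:]? [a-zA-Z]+" for t in TITLES)

# No re.IGNORECASE here: Q and A stay upper-case only.
PATTERN = re.compile(
    r"^(?P<line_num>[1-9]|1[0-9]|2[0-2])\b +"
    r"(?P<qa_type>Q|A|" + speaker + r")?"
    r"[.:\s]*"
    r"(?P<type_text>\b.*)"
    r"|" + either_case("page") + r" (?P<page_num>\d{1,3})"
)

print(either_case("Mr"))                                   # [mM][rR]
print(PATTERN.search("3 dr. Jones: Yes.")["qa_type"])      # dr. Jones
print(PATTERN.search("17 a lot of people")["qa_type"])     # None
print(PATTERN.search("PAGE 12")["page_num"])               # 12
```

Keeping the titles in a list means a new salutation is a one-word change rather than another hand-written run of character classes.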

Updated regex101 link

Thank you, this worked for me. I saw the answer you referenced, but I did not extract the example you provided in the alternative above. It appears that Python 3.11 has removed the inline case-insensitive flag. I look to refactor this to a simpler, more pythonic approach after I get things working. — rnwtenor, Mar 19 '23 at 10:07
