
Example Link

RegEx Group returning issue:

(?P<qa_type>(Q|A|Mr[\.|:]? [a-z]+|Mrs[\.|:]? [a-z]+|Ms[\.|:]? [a-z]+|Miss[\.|:]? [a-z]+|Dr[\.|:]? [a-z]+))?([\.|:|\s]+)?

Objective: Extract the text of each question, answer, and speaker turn from proceeding-transcript PDFs.

Using Python: iterate through the pages of text extracted from the PDF and group the Question/Answer text.

Desired Results = qa_type, page_start, page_end, line_num_start, line_num_end, qa_text

ISSUE: The [Q|A] designators must match upper case only, but the speaker titles (Mr, Mrs., Dr., etc.) must match case-insensitively. Both the Q|A designators and the speaker salutations belong to a single 'qa_type' group.

Request: How do I prevent 'qa_type' from capturing 'a' or 'q'? See lines 2 and 17 on p. 275.

Example bad extract - line 17 'a'

regex = r"(^(?P<line_num>[1-9]|1[0-9]|2[0-2])\b +)(?P<qa_type>(Q|A|Mr[\.|:]? [a-z]+|Mrs[\.|:]? [a-z]+|Ms[\.|:]? [a-z]+|Miss[\.|:]? [a-z]+|Dr[\.|:]? [a-z]+))?([\.|:|\s]+)?(?P<type_text>\b.*)|page (?P<page_num>\d{1,3})"
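For context, the grouping step described in the objective could be sketched roughly as follows. This is only an illustration under stated assumptions: `group_records` is a hypothetical helper, and it assumes each transcript line has already been parsed into a `(page, line_num, qa_type, text)` tuple, with `qa_type` set to `None` for continuation lines.

```python
def group_records(parsed_lines):
    """Group parsed transcript lines into one record per Q/A/speaker turn.

    parsed_lines: iterable of (page, line_num, qa_type, text) tuples, where
    qa_type is None for lines that continue the previous turn (assumption).
    """
    records = []
    current = None
    for page, line_num, qa_type, text in parsed_lines:
        if qa_type is not None:
            # A designator or speaker title starts a new record.
            current = {
                "qa_type": qa_type,
                "page_start": page,
                "page_end": page,
                "line_num_start": line_num,
                "line_num_end": line_num,
                "qa_text": text,
            }
            records.append(current)
        elif current is not None:
            # A continuation line extends the record that is still open,
            # possibly across a page break.
            current["page_end"] = page
            current["line_num_end"] = line_num
            current["qa_text"] += " " + text
    return records
```

Lines that appear before the first designator are dropped in this sketch; whether that is acceptable depends on how the transcripts begin.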
Roshin Raphel
rnwtenor
  • That final RegEx seems very complicated. Big RegExs are hard to maintain when changes are needed. I suggest using several simpler RegExs. The character classes seem strange; note that `[.|:|\s]` is the same as `[|.:\s]`. There are several uses of `[.|:]`: are all three characters allowed, or is the `|` intended to mean `.` or `:`? A `|` means itself within a character class (i.e. between square brackets). – AdrianHHH Mar 19 '23 at 09:12
  • @AdrianHHH, thank you for the feedback as this is very helpful. I will refactor this once I get things working. Please recommend any other ideas for a more manageable approach, as I will be using this on millions of PDFs, so there will be variations in parsing syntax. – rnwtenor Mar 19 '23 at 10:09
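One way the "several simpler RegExs" suggestion might look in Python is sketched below. The names `LINE_NUM`, `QA`, `SPEAKER`, and `parse_line` are illustrative only, and the sketch assumes the designator is always followed by a separator or the end of the line. Because each pattern is compiled separately, only the speaker pattern gets `re.IGNORECASE`, and the classes use `[.:]` since `|` inside brackets is a literal pipe.

```python
import re

# Line numbers 1-22 at the start of the line, as in the question.
LINE_NUM = re.compile(r"^(?P<line_num>[1-9]|1[0-9]|2[0-2])\b +")
# Q/A designators: compiled without flags, so upper case only.
QA = re.compile(r"(?P<qa_type>[QA])(?:[.:\s]+|$)")
# Speaker titles: the only pattern compiled case-insensitively.
SPEAKER = re.compile(
    r"(?P<qa_type>(?:Mr|Mrs|Ms|Miss|Dr)[.:]? [a-z]+)(?:[.:\s]+|$)",
    re.IGNORECASE,
)


def parse_line(line):
    """Return (line_num, qa_type, text), or None for unnumbered lines."""
    m = LINE_NUM.match(line)
    if m is None:
        return None
    line_num = int(m.group("line_num"))
    rest = line[m.end():]
    for pattern in (QA, SPEAKER):
        t = pattern.match(rest)
        if t:
            return line_num, t.group("qa_type"), rest[t.end():]
    # No designator: treat the whole remainder as continuation text.
    return line_num, None, rest
```

With this split, a line such as `17  a lot of things` yields no `qa_type`, because the lowercase `a` is only ever tested against the case-sensitive `QA` pattern.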

1 Answer


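This sounds pretty similar to this question. Unfortunately, Python has deprecated inline flag modifiers that appear anywhere other than the start of the pattern (Python 3.11 rejects them outright), and its `re` module does not accept a bare `(?-i)` to switch a flag back off. In a regex flavor that does support mid-pattern modifiers, your regex would look like this (without the global case-insensitive flag):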

(^(?P<line_num>[1-9]|1[0-9]|2[0-2])\b +)(?P<qa_type>(Q|A|(?i)Mr[.|:]? [a-z]+|Mrs[.|:]? [a-z]+|Ms[.|:]? [a-z]+|Miss[.|:]? [a-z]+|Dr[.|:]? [a-z]+(?-i)))?([.|:|\s]+)?(?P<type_text>\b.*)|(?i)page(?-i) (?P<page_num>\d{1,3})

The alternative is to just specify both the lowercase and uppercase characters every time you want a case-insensitive letter (again, without the global case-insensitive flag):

(^(?P<line_num>[1-9]|1[0-9]|2[0-2])\b +)(?P<qa_type>(Q|A|[mM][rR][.|:]? [a-zA-Z]+|[mM][rR][sS][.|:]? [a-zA-Z]+|[mM][sS][.|:]? [a-zA-Z]+|[mM][iI][sS][sS][.|:]? [a-zA-Z]+|[dD][rR][.|:]? [a-zA-Z]+))?([.|:|\s]+)?(?P<type_text>\b.*)|[pP][aA][gG][eE] (?P<page_num>\d{1,3})

Updated regex101 link
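A third option that may be worth trying on Python 3.6 or later is a scoped inline flag, `(?i:...)`, which applies case-insensitivity only to the enclosed group. The sketch below is a simplified version of the `qa_type` part only; `QA_TYPE` and `qa_type` are illustrative names, and the character classes are written as `[.:]` on the assumption that the `|` was not meant literally.

```python
import re

# Q and A stay case-sensitive; only the titles inside (?i:...) ignore case.
QA_TYPE = re.compile(
    r"(?P<qa_type>Q|A|(?i:(?:Mr|Mrs|Ms|Miss|Dr)[.:]? [a-z]+))"
)


def qa_type(text):
    """Return the designator or speaker at the start of text, else None."""
    m = QA_TYPE.match(text)
    return m.group("qa_type") if m else None


print(qa_type("Q. Where were you?"))    # Q
print(qa_type("a lot of things"))       # None
print(qa_type("MR. SMITH: Objection"))  # MR. SMITH
```

The scoped group avoids spelling out every letter twice while keeping the rest of the pattern case-sensitive.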

rpm
  • Thank you, this worked for me. I saw the answer you referenced, but I did not extract the example you provided in the alternative above. It appears that Python 3.11 has removed the inline case-insensitive flag. I look to refactor this to a simpler, more pythonic approach after I get things working. – rnwtenor Mar 19 '23 at 10:07