RegEx Group returning issue:
(?P<qa_type>(Q|A|Mr[\.|:]? [a-z]+|Mrs[\.|:]? [a-z]+|Ms[\.|:]? [a-z]+|Miss[\.|:]? [a-z]+|Dr[\.|:]? [a-z]+))?([\.|:|\s]+)?
Objective: To extract text from proceeding transcript pdfs for each question/answer/speaker type.
Using Python: interage through pages in PDF extracted text and group Qestion/Answer text.
Desired Results = qa_type, page_start, page_end, line_num_start, line_num_end, qa_text
ISSUE: For the [Q|A] designators, I only want upper case, but for the speaker Titles (Mr, Mrs., Dr., etc.) case insensitive is required, both Q|A and spearker salutation a single 'qa_type' group.
Request: How do I prevent 'qa_type' from captureing 'a' or 'q'? See lines 2 and 17 on pp 275.
Example bad extract - line 17 'a'
regex = r"(^(?P<line_num>[1-9]|1[0-9]|2[0-2])\b +)(?P<qa_type>(Q|A|Mr[\.|:]? [a-z]+|Mrs[\.|:]? [a-z]+|Ms[\.|:]? [a-z]+|Miss[\.|:]? [a-z]+|Dr[\.|:]? [a-z]+))?([\.|:|\s]+)?(?P<type_text>\b.*)|page (?P<page_num>\d{1,3})"