
The | symbol in regular expressions seems to split the entire pattern, but I need it to split only part of the pattern. I want to find a match that starts with either "Q: " or "A: " and ends just before the next "Q: " or "A: ". Anything, including newlines, can appear in between.

My attempt:

import re

string = "Q: This is a question. \nQ: This is a 2nd question \non two lines. \n\nA: This is an answer. \nA: This is a 2nd answer \non two lines.\nQ: Here's another question. \nA: And another answer."

pattern = re.compile(r"(A: |Q: )[\w\W]*(A: |Q: |$)")

matches = pattern.finditer(string)
for match in matches:
    print('-', match.group(0))

The regex I am using is (A: |Q: )[\w\W]*(A: |Q: |$).

Here is the same string over multiple lines, just for reference:

Q: This is a question. 
Q: This is a 2nd question 
on two lines. 

A: This is an answer. 
A: This is a 2nd answer 
on two lines.
Q: Here's another question. 
A: And another answer.

So I was hoping the parentheses would isolate the two possible patterns at the start and the three at the end, but instead it treats them like 4 separate patterns. The match would also include the next A: or Q: at the end, but hopefully you can see what I was going for; I was planning to just ignore that group.

If it's helpful, this is for a simple study program that grabs the questions and answers from a text file to quiz the user. I was able to make it with the questions and answers being only one line each, but I'm having trouble getting an "A: " or "Q: " that has multiple lines.

rv.kvetch
  • Do you need to map each question to the right answer? Are they all following in set order? – Wiktor Stribiżew Oct 28 '21 at 14:39
  • @WiktorStribiżew Originally (when each Q: and A: was one line) I would go through and get all the Qs first into a list, then all the As. So the correct Qs and As would all have matching index numbers. – Adriel Bradley Oct 28 '21 at 14:53
  • Then use two separate regexps and then zip the outputs. Use my regex I shared in the comments. I could provide an answer but I am on a mobile now. – Wiktor Stribiżew Oct 28 '21 at 15:01

2 Answers


I suggest just using a for-loop for this, as I find it easier. To answer your question: why not match up to the period rather than up to the next A: or Q:? Otherwise you'd probably have to use lookaheads.

(A: |Q: )[\s\S]*?\.

[\s\S] matches any character, including newlines (it is the conventional choice, though [\w\W] works as well).

*? is a lazy quantifier: it matches as few characters as it can. If we had just (A: |Q: )[\s\S]*?, it would match only the (A: |Q: ) part, but the trailing \. forces the match to extend to the first period.

\. matches a literal period.
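If the text might not contain a period, the same lazy quantifier can be paired with a lookahead so that the match stops just before the next "Q: " or "A: " without consuming it. This is only a sketch under a few assumptions: the name ENTRY is mine, the sample text is the one from the question, and a "Q: " or "A: " appearing in the middle of a line (for example inside "FAQ: ") would also end a match.

import re

text = (
    "Q: This is a question. \nQ: This is a 2nd question \non two lines. \n\n"
    "A: This is an answer. \nA: This is a 2nd answer \non two lines.\n"
    "Q: Here's another question. \nA: And another answer."
)

# (?:...) groups the alternation without capturing it.
# [\s\S]*? lazily consumes any characters, including newlines.
# (?=...) is a lookahead: it checks for the next marker (or the end of the
# string, \Z) without making it part of the match.
ENTRY = re.compile(r"(?:Q: |A: )[\s\S]*?(?=Q: |A: |\Z)")

# strip() drops the trailing spaces and newlines left before the next marker.
matches = [m.group(0).strip() for m in ENTRY.finditer(text)]
for entry in matches:
    print("-", entry)

Because the lookahead does not consume the next marker, the following match can start right at it, which is what the extra group at the end of the original attempt could not do.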

For the for-loop:

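questions_and_answers = []
for line in string.splitlines():
    if line.startswith(("Q: ", "A: ")):
        # A new question or answer begins on this line.
        questions_and_answers.append(line)
    elif questions_and_answers:
        # Continuation line: append it to the most recent entry.
        # Lines before the first "Q: " or "A: " are skipped rather than raising IndexError.
        questions_and_answers[-1] += line

# ['Q: This is a question. ', 'Q: This is a 2nd question on two lines. ', 'A: This is an answer. ', 'A: This is a 2nd answer on two lines.', "Q: Here's another question. ", 'A: And another answer.']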
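The loop above joins continuation lines without any separator, so the original line breaks are lost. If line breaks and blank lines inside an entry should be kept, one possible variant is to split with keepends=True and trim only the end of each entry. This is a sketch, not a drop-in replacement; the function name split_entries is hypothetical and the sample text is the one from the question.

def split_entries(text):
    """Group lines into entries that start with 'Q: ' or 'A: ', keeping inner line breaks."""
    entries = []
    for line in text.splitlines(keepends=True):
        if line.startswith(("Q: ", "A: ")):
            # A new question or answer begins on this line.
            entries.append(line)
        elif entries:
            # Continuation (possibly blank) line: keep it, newline included.
            entries[-1] += line
    # Trailing whitespace and blank lines at the end of an entry are dropped.
    return [entry.rstrip() for entry in entries]

text = (
    "Q: This is a question. \nQ: This is a 2nd question \non two lines. \n\n"
    "A: This is an answer. \nA: This is a 2nd answer \non two lines.\n"
    "Q: Here's another question. \nA: And another answer."
)

entries = split_entries(text)
for entry in entries:
    print(repr(entry))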
HelixAchaos
  • Unfortunately I can't use a period because the text might not include a period. But according to you and the other answer it seems like lookaheads is what I was looking for, so thank you. – Adriel Bradley Oct 28 '21 at 14:59
  • Actually your alternative without re is very good! I'll give it a try. I want to upvote your answer but unfortunately I don't have enough points apparently. – Adriel Bradley Oct 28 '21 at 15:48

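One approach is to use a negative lookahead, (?!...), so that a newline is consumed only when it is not followed by A: or Q:, as follows (in Python, compile with re.MULTILINE so that ^ matches at the start of each line):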

^([AQ]):(?:.|\n(?![AQ]:))+

You can also try it out here on the Regex Demo.

Here's another approach suggested by @Wiktor that should be a little faster:

^[AQ]:.*(?:\n+(?![AQ]:).+)*

A slight modification that uses \n with .* in place of \n+ with .+ (but note that this also captures blank lines at the end):

^[AQ]:.*(?:\n(?![AQ]:).*)*
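For reference, here is a minimal sketch of how the second pattern might be used from Python, assuming the sample text from the question and assuming that questions and answers pair up by position, as described in the comments under the question. The names ENTRY, questions, answers and pairs are only illustrative.

import re

text = (
    "Q: This is a question. \nQ: This is a 2nd question \non two lines. \n\n"
    "A: This is an answer. \nA: This is a 2nd answer \non two lines.\n"
    "Q: Here's another question. \nA: And another answer."
)

# re.MULTILINE makes ^ match at the start of every line, not only at the
# start of the string. The lookahead (?![AQ]:) stops the match before a
# line that begins a new question or answer.
ENTRY = re.compile(r"^[AQ]:.*(?:\n+(?![AQ]:).+)*", re.MULTILINE)

entries = [m.group(0).strip() for m in ENTRY.finditer(text)]

# Collect questions and answers separately, then pair them by index.
questions = [e for e in entries if e.startswith("Q:")]
answers = [e for e in entries if e.startswith("A:")]
pairs = list(zip(questions, answers))

for question, answer in pairs:
    print(question)
    print(answer)
    print()

Note that zip stops at the shorter list, so an unanswered question at the end of the file would be silently dropped.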
rv.kvetch
  • I think `(?m)^[AQ]:.*(?:\n(?![AQ]:).+)*` would be a [much faster](https://regex101.com/r/jWpoog/2) pattern for what you tried to achieve with yours. – Wiktor Stribiżew Oct 28 '21 at 14:37
  • @WiktorStribiżew I updated the test string I was using but I noticed that lines with a newline in between don't seem to be matched with that approach; otherwise it does work very well. I'm not sure how the OP wanted to handle such cases though. – rv.kvetch Oct 28 '21 at 14:41
  • You should never use an alternation of `.` and whitespace/line break patterns. There are too [many issues](https://stackoverflow.com/a/31294276/3832970) related to that pattern. Simply use `re.DOTALL` to make `.` match line breaks. – Wiktor Stribiżew Oct 28 '21 at 14:42
  • @rv.kvetch Thank you, I think this is it! Although I'm confused why it's matching blank lines too? I guess I need to learn about negative lookahead now, because the pattern makes no sense to me and I don't know how to edit it. lol Anyway, thank you! – Adriel Bradley Oct 28 '21 at 14:56
  • @AdrielBradley actually it's not matching blank lines by itself, but rather only when there's new lines between a multi-line Q/A for example. I wasn't sure how we would need to handle cases like those. – rv.kvetch Oct 28 '21 at 14:59
  • @WiktorStribiżew I figured it out! If you need to match a Q/A with blank lines in between, you'd need to change `\n` in the above to `\n+` to handle those cases. – rv.kvetch Oct 28 '21 at 15:01
  • Maybe. In my regex, to match empty lines, you need to replace `.+` with `.*`. – Wiktor Stribiżew Oct 28 '21 at 15:04
  • @WiktorStribiżew rv.kvetch Great! You've both been very helpful because I would definitely like to save blank lines if possible. Trailing ones don't matter but there might be some in the middle of a Q/A – Adriel Bradley Oct 28 '21 at 15:19
  • @AdrielBradley glad to know it helped! also, you can use the 2nd approach above (provided by Wiktor) which definitely seems to exclude trailing newlines in the result. – rv.kvetch Oct 28 '21 at 15:23
  • I have just published a [YT video](https://www.youtube.com/watch?v=SEobSs-ZCSE) about the evil `(?:\s|.)*` pattern. – Wiktor Stribiżew Oct 28 '21 at 16:00