Python regex does not match string as intended for some reason

Question

I have the content of an English dictionary at hand and I want to find the definition for a specific example sentence.

For example, I want to find the definition for "example sentence 2b". In my opinion, the code may look like this:

re.search(r'\d\. ([^\n]*?)\n(?!.*\d\. ).*?example sentence 2b', content, flags=re.DOTALL)

Here, the "content" is as follows:

1. definition1
example sentence 1a
example sentence 1b
2. definition2
example sentence 2a
example sentence 2b
3. definition3
example sentence 3a
example sentence 3b

Live test here - https://regex101.com/r/UOz6DA/1/

As you can see in the live test, I didn't get the desired match, "definition2". I really don't know why.

PS: I used (?!.*\d\. ).* based on this post - regex how to exclude specific characters or string anywhere

By default, [dot '.' doesn't match newline](https://docs.python.org/3/library/re.html), hence `'.*?'` won't match the newline before *"example sentence 2b"*. Either use `re.DOTALL` flag, or put explicit `\n`'s in your regex wherever newlines can occur. There are many existing Q&A on SO about this. — smci, Jun 09 '21 at 00:04
@smci But I did use this flag, which is indicated by the "s" to the right of the regex on regex101.com. I have got my answer down below, though. — wbzy00, Oct 09 '21 at 07:05

41686d6564 stands w. Palestine · Accepted Answer · 2021-10-08T02:07:09.477

2

You may use the following pattern without the re.DOTALL flag (in Python, pass re.MULTILINE so that ^ matches at the start of every line):

^\d+\. (.*)(?:\n(?!\d+\. ).*)*\nexample sentence 2b

Regex demo.

Breakdown:

^ - Beginning of line.
\d+\. - Match one or more digits, then a dot, and a space character.
(.*) - Match zero or more characters and capture them in group 1.
(?: - Beginning of a non-capturing group.
- \n(?!\d+\. ) - Match a line-break that is not followed by a "definition line".
- .* - Match zero or more characters.
) - Close the non-capturing group.
* - Match the previous group between zero and unlimited times (greedy; the lazy *? works as well).
\nexample sentence 2b - Match a linebreak character followed by the target sentence.
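
The breakdown above can be tried out in Python roughly as follows. This is only a sketch: the helper name find_definition is made up for illustration, the sample content is copied from the question, and re.MULTILINE and re.escape are additions that the pattern above does not spell out.

```python
import re

# Sample dictionary content, copied from the question.
content = """1. definition1
example sentence 1a
example sentence 1b
2. definition2
example sentence 2a
example sentence 2b
3. definition3
example sentence 3a
example sentence 3b"""


def find_definition(content, sentence):
    """Hypothetical helper: return the definition that owns `sentence`, or None."""
    pattern = (
        r'^\d+\. (.*)'            # a definition line; its text is captured in group 1
        r'(?:\n(?!\d+\. ).*)*'    # any following lines that are not definition lines
        r'\n' + re.escape(sentence)  # the target sentence at the start of a line
    )
    # re.MULTILINE makes ^ match at the start of every line; re.DOTALL is not used.
    match = re.search(pattern, content, flags=re.MULTILINE)
    return match.group(1) if match else None


print(find_definition(content, "example sentence 2b"))
print(find_definition(content, "example sentence 2a"))
```

Because the negative lookahead is checked only right after each line break, the match can never run past the next numbered definition line, so the captured group is always the definition directly above the sentence.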

edited Oct 08 '21 at 02:07

answered Jun 05 '21 at 13:21


It only works for the second example sentence. For example, for "example sentence 2a", it won't work anymore. That's the reason I find it necessary to use the "s" flag (re.DOTALL). – wbzy00 Oct 08 '21 at 01:27

@wbzy00 `re.DOTALL` is still irrelevant. If you need to search for any example sentence (not necessarily the second one), then you could just add a `*` or a `*?` after the non-capturing group. See [this demo](https://regex101.com/r/EkdFFU/1). – 41686d6564 stands w. Palestine Oct 08 '21 at 02:03
Thank you! You are a hero. After putting so much effort into this problem recently, I finally find the answer. – wbzy00 Oct 09 '21 at 06:42

score 0 · Answer 2 · answered Jun 05 '21 at 13:21


You are missing the \n character to match the line break.

answered Jun 05 '21 at 13:21

Mark Rofail



score 0 · Answer 3 · answered Nov 20 '21 at 08:51

The reason it won't match is due to the existence of "3. ", even though this substring is after "example sentence 2b".

For a simpler example, if you use the "s" flag in this live demo, the second line won't match any more because of the "chocolate" substring in the third line.
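
To see this effect with the data from the question, here is a small sketch. The variable names are made up for illustration, and cutting the content off before "3. " is just one way to show that the later definition line is what blocks the match.

```python
import re

# Sample dictionary content, copied from the question.
content = """1. definition1
example sentence 1a
example sentence 1b
2. definition2
example sentence 2a
example sentence 2b
3. definition3
example sentence 3a
example sentence 3b"""

# The pattern from the question, unchanged.
original = r'\d\. ([^\n]*?)\n(?!.*\d\. ).*?example sentence 2b'

# With re.DOTALL, the .* inside the lookahead can cross line breaks, so the
# lookahead scans the whole rest of the text and is rejected by the later "3. ".
full_match = re.search(original, content, flags=re.DOTALL)

# Illustration only: drop everything from "3. " onwards and try again.
truncated = content.split("3. ")[0]
truncated_match = re.search(original, truncated, flags=re.DOTALL)

print(full_match)                # None: "3. " is found somewhere after "2. definition2"
print(truncated_match.group(1))  # definition2
```

So the lookahead does not mean "no numbered line before the target sentence"; it means "no numbered line anywhere after this point", which is only true for the last definition.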
