
I have the content of an English dictionary at hand and I want to find the definition for a specific example sentence.

For example, I want to find the definition for "example sentence 2b". I think the code should look something like this:

re.search(r'\d\. ([^\n]*?)\n(?!.*\d\. ).*?example sentence 2b', content, flags=re.DOTALL)

Here, the "content" is as follows:

1. definition1
example sentence 1a
example sentence 1b
2. definition2
example sentence 2a
example sentence 2b
3. definition3
example sentence 3a
example sentence 3b

Live test here - https://regex101.com/r/UOz6DA/1/

As you can see in the live test, I didn't get the desired match, "definition2", and I don't understand why.

PS: I used (?!.*\d\. ).* based on this post - regex how to exclude specific characters or string anywhere

wbzy00
  • By default, [dot '.' doesn't match newline](https://docs.python.org/3/library/re.html), hence `'.*?'` won't match the newline before *"example sentence 2b"*. Either use `re.DOTALL` flag, or put explicit `\n`'s in your regex wherever newlines can occur. There are many existing Q&A on SO about this. – smci Jun 09 '21 at 00:04
  • @smci But I did use this flag, which is indicated by the "s" to the right of the regex on regex101.com. I have got my answer down below, though. – wbzy00 Oct 09 '21 at 07:05

3 Answers


You may use the following pattern without the re.DOTALL flag:

^\d+\. (.*)(?:\n(?!\d+\. ).*)*\nexample sentence 2b

Regex demo.

Breakdown:

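  • ^ - Beginning of line (in Python, this requires the re.MULTILINE flag; otherwise ^ only matches at the start of the string).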
  • \d+\. - Match one or more digits, then a dot, and a space character.
  • (.*) - Match zero or more characters and capture them in group 1.
  • (?: - Beginning of a non-capturing group.
    • \n(?!\d+\. ) - Match a line-break that is not followed by a "definition line".
    • .* - Match zero or more characters.
  • ) - Close the non-capturing group.
  • * - Match the previous group between zero and unlimited times (a lazy *? also works here).
  • \nexample sentence 2b - Match a linebreak character followed by the target sentence.
  • It only works for the second example sentence. For example, for "example sentence 2a", it won't work anymore. That's the reason I find it necessary to use the "s" flag (re.DOTALL). – wbzy00 Oct 08 '21 at 01:27
  • @wbzy00 `re.DOTALL` is still irrelevant. If you need to search for any example sentence (not necessarily the second one), then you could just add a `*` or a `*?` after the non-capturing group. See [this demo](https://regex101.com/r/EkdFFU/1). – 41686d6564 stands w. Palestine Oct 08 '21 at 02:03
  • Thank you! You are a hero. After putting so much effort into this problem recently, I finally find the answer. – wbzy00 Oct 09 '21 at 06:42
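
For reference, here is a minimal Python sketch of the pattern discussed in this answer and its comments. The helper name `find_definition` is hypothetical (it is not from the original posts), and the sketch assumes the dictionary text has exactly the layout shown in the question. It uses `re.MULTILINE` so that `^` matches at each line start, and `re.escape` so the sentence is treated as literal text.

```python
import re

content = """\
1. definition1
example sentence 1a
example sentence 1b
2. definition2
example sentence 2a
example sentence 2b
3. definition3
example sentence 3a
example sentence 3b
"""


def find_definition(text, sentence):
    """Hypothetical helper: return the definition whose block contains `sentence`."""
    pattern = (
        r"^\d+\. (.*)"           # a definition line; its text is captured in group 1
        r"(?:\n(?!\d+\. ).*)*?"  # any following lines that are not definition lines
        r"\n" + re.escape(sentence)  # the target sentence at the start of a line
    )
    # re.MULTILINE lets ^ match at every line start; re.DOTALL is not used,
    # so each .* stays within a single line.
    match = re.search(pattern, text, flags=re.MULTILINE)
    return match.group(1) if match else None


print(find_definition(content, "example sentence 2b"))
print(find_definition(content, "example sentence 2a"))
```

Because the lookahead only rejects lines that start a new numbered definition, the search cannot cross from one entry into the next, so a sentence is only ever paired with the definition directly above its block.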

You are missing the \n character to match the line break.

Mark Rofail
  • It only works for the second example sentence. For example, for "example sentence 2a", it won't work anymore. That's the reason I find it necessary to use the "s" flag (re.DOTALL). – wbzy00 Oct 08 '21 at 01:27

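The reason it won't match is the existence of "3. ", even though this substring comes after "example sentence 2b". With the "s" flag (re.DOTALL), the .* inside the negative lookahead (?!.*\d\. ) can cross newlines and scan to the end of the content, so the lookahead fails whenever any later "digit, dot, space" sequence exists anywhere after the current position.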

For a simpler example, if you use the "s" flag in this live demo, the second line won't match any more because of the "chocolate" substring in the third line.
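
The effect can be reproduced in Python with a small sketch. It reuses the regex and content from the question; the variable names and the idea of cutting the content off before "3. " are only there for illustration.

```python
import re

content = """\
1. definition1
example sentence 1a
example sentence 1b
2. definition2
example sentence 2a
example sentence 2b
3. definition3
example sentence 3a
example sentence 3b
"""

# The regex from the question, unchanged.
original = r'\d\. ([^\n]*?)\n(?!.*\d\. ).*?example sentence 2b'

# On the full content, the lookahead after "2. definition2" sees the later "3. "
# (because DOTALL lets .* cross newlines), so the search finds nothing.
full_match = re.search(original, content, flags=re.DOTALL)

# Cut the content off just before "3. ": nothing after entry 2 trips the lookahead.
truncated = content[:content.index("3. ")]
truncated_match = re.search(original, truncated, flags=re.DOTALL)

print(full_match)                # None
print(truncated_match.group(1))  # definition2
```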

wbzy00