
I am basically trying to extract Section references from a long document.

The following code does so quite well:

import re

example1 = 'Sections 21(1), 54(2), 78(1) of Harry Potter'
res = re.search(r'Sections?\W+(\w+)(\(\w+\))?(, (\w+)(\(\w+\))?)*', example1)
res.group(0)

Output: 'Sections 21(1), 54(2), 78(1)'

However, frequently the sections refer to outside books and I would like to either indicate those or exclude them. Generally, the section reference is followed by an "of" if it refers to another book (example below):

example2 = 'Sections 21(1), 54(2), 78(1) of Harry Potter'

So in this case, I would like to exclude these sections because they refer to Harry Potter and not to sections within the document. The following should achieve this but it doesn't work.

example2 = 'Sections 21(1), 54(2), 78(1) of Harry Potter'
res = re.search(r'Sections?(\W+(\w+)(\(\w+\))?(, (\w+)(\(\w+\))?)*)(?!\s+of)', example2)
res.group(0)

Actual output: 'Sections 21(1), 54(2), 78'. The (?!\s+of) only removes the (1) after 78 instead of rejecting the entire reference.

baduker
Mia

2 Answers


You can emulate atomic groups with capturing groups and lookahead:

(?=(?P<section>Sections?\W+(\w+)(\(\w+\))?(, (\w+)(\(\w+\))?)*))(?P=section)(?! of)

Demo

Long story short:

  • in the positive lookahead you create a capturing group called section that finds a section pattern
  • then you match the group's contents with (?P=section)
  • then in the negative lookahead you check that no " of" follows

Here is a really good answer that explains that technique.
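As a rough check of the technique above, the following sketch runs the pattern against one external and one internal reference. The `internal` sentence is a made-up example, not from the question.

```python
import re

# Emulated atomic group: the lookahead captures the whole reference as
# "section", the backreference consumes exactly that text, and the
# negative lookahead then rejects the match if " of" follows.
PATTERN = re.compile(
    r'(?=(?P<section>Sections?\W+(\w+)(\(\w+\))?(, (\w+)(\(\w+\))?)*))'
    r'(?P=section)(?! of)'
)

internal = 'Sections 21(1), 54(2), 78(1) apply here'  # hypothetical internal reference
external = 'Sections 21(1), 54(2), 78(1) of Harry Potter'

internal_match = PATTERN.search(internal)
external_match = PATTERN.search(external)

print(internal_match.group('section'))  # the full reference
print(external_match)                   # None: the reference is rejected as a whole
```

Because Python does not backtrack into a lookahead once it has succeeded, the engine cannot shorten the captured reference to get past the `(?! of)` check.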

mrzasa
  • I am trying to use this with re.findall (since there are multiple references per section) but somehow it duplicates the answer in strange ways. example1 = 'Sections 23, 24 and 5' res = re.findall(r'(?=(?P<section>Sections?\W+(\w+)(\(\w+\))?(, (\w+)(\(\w+\))?)*))(?P=section)(?! of)', example1) Output: [('Sections 23, 24', '23', '', ', 24', '24', '')]
    – Mia Mar 12 '18 at 13:57
  • There are multiple capturing groups in the regex; change them to non-capturing groups (use `(?:...)` instead of `(...)`) – mrzasa Mar 12 '18 at 14:01
  • I don't use "..." anywhere, right? Where exactly would I add (?:...)? – Mia Mar 12 '18 at 14:06
  • Yes, replace `(...)` with `(?:...)` – mrzasa Mar 12 '18 at 14:12
  • I meant in the regex. I [tried](https://regex101.com/r/rEzaso/5) but it does not work too well. You can also just choose the first element of the tuple that you have in the output (that's the full match). – mrzasa Mar 12 '18 at 14:25
  • One very last question (sorry for the bother): How would I accommodate an "and", for example 'Sections 26(1), 51(1) and 62(1)'? – Mia Mar 12 '18 at 14:51
  • You can add it as an alternative to the comma. Please use regex101 to try it, I'm sure you'll manage :) – mrzasa Mar 12 '18 at 14:53
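Pulling the comment thread together, here is one possible sketch for `re.findall`: every group except the named one is made non-capturing, so `findall` returns plain strings, and " and " is accepted as an alternative to ", ". The name `SECTION_RE` and the sample text are assumptions for illustration.

```python
import re

# Only the named group captures, so findall returns one string per match.
# (?:, | and ) accepts either separator between section numbers.
SECTION_RE = re.compile(
    r'(?=(?P<section>Sections?\W+\w+(?:\(\w+\))?'
    r'(?:(?:, | and )\w+(?:\(\w+\))?)*))'
    r'(?P=section)(?! of)'
)

# Hypothetical sample text mixing internal and external references.
text = ('Sections 26(1), 51(1) and 62(1) apply. '
        'Sections 3 and 4 of Harry Potter do not. '
        'Section 7 applies.')

references = SECTION_RE.findall(text)
print(references)
```

The reference followed by "of Harry Potter" is dropped entirely, while the other two are returned in full.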

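This is because, after (?!\s+of) fails at the position following 78(1), the engine backtracks into the optional (\(\w+\))? and lets it match nothing. The match then ends right after 78, where the next characters are (1) rather than " of", so the negative lookahead succeeds.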
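A small sketch that makes the backtracking visible: it runs the question's pattern unchanged and prints both the match and the text immediately after it.

```python
import re

# The pattern from the question, unchanged.
pattern = r'Sections?(\W+(\w+)(\(\w+\))?(, (\w+)(\(\w+\))?)*)(?!\s+of)'
text = 'Sections 21(1), 54(2), 78(1) of Harry Potter'

match = re.search(pattern, text)
matched = match.group(0)
rest = text[match.end():]

print(repr(matched))  # 'Sections 21(1), 54(2), 78'
print(repr(rest))     # '(1) of Harry Potter'
```

The remainder starts with `(1)`, not with whitespace plus "of", which is why the lookahead is satisfied at that shorter position.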

An atomic group would solve this in other regex engines, but Python's re module did not support atomic groups before Python 3.11.

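Another solution is to put a possessive quantifier + after the ? of the optional part (this needs Python 3.11+ or the third-party regex module):

r'Sections?(\W+(\w+)(\(\w+\))?(, (\w+)(\(\w+\))?+)*)(?!\s+of)'

Note the + after the ?. This only stops the optional (\(\w+\))? from giving back the (1); the preceding \w+ and the outer * can still backtrack, so they may need to be possessive as well.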

Nahuel Fouilleul