I am trying to extract section references from a long document.
The following code does so quite well:
import re
example1 = 'Sections 21(1), 54(2), 78(1) of Harry Potter'
res = re.search(r'Sections?\W+(\w+)(\(\w+\))?(, (\w+)(\(\w+\))?)*', example1)
res.group(0)
Output: 'Sections 21(1), 54(2), 78(1)'
However, the sections frequently refer to other books, and I would like to either flag those references or exclude them. Generally, a section reference is followed by "of" if it refers to another book (example below):
example2 = 'Sections 21(1), 54(2), 78(1) of Harry Potter'
So in this case, I would like to exclude these sections because they refer to Harry Potter and not to sections within the document. The following should achieve this, but it doesn't work:
example2 = 'Sections 21(1), 54(2), 78(1) of Harry Potter'
res = re.search(r'Sections?(\W+(\w+)(\(\w+\))?(, (\w+)(\(\w+\))?)*)(?!\s+of)', example2)
res.group(0)
Actual output: 'Sections 21(1), 54(2), 78'
--> (?!\s+of) removes the (1) behind 78, but not the entire reference.
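The behaviour described above comes from backtracking: when the lookahead fails after the full match, the regex engine gives up the optional trailing (1), and at that position the next characters are "(1) of", so (?!\s+of) succeeds. One possible workaround, sketched below, is to match the reference without a lookahead and then check separately whether "of" follows the end of the match. The names SECTION_RE, OF_RE and find_section_refs, and the "apply here" sample sentence, are made up for illustration, and the sketch assumes that "of" directly after the reference is a reliable marker of an external book.

```python
import re

# Same pattern as in the question, with non-capturing groups.
SECTION_RE = re.compile(r'Sections?\W+\w+(?:\(\w+\))?(?:, \w+(?:\(\w+\))?)*')
# Checked separately, so the section pattern cannot backtrack around it.
OF_RE = re.compile(r'\s+of\b')


def find_section_refs(text):
    """Return (reference, is_external) pairs for every section reference.

    A reference counts as external when whitespace plus the word "of"
    starts exactly where the match ends (an assumption about the input).
    """
    refs = []
    for m in SECTION_RE.finditer(text):
        is_external = OF_RE.match(text, m.end()) is not None
        refs.append((m.group(0), is_external))
    return refs


internal = 'Sections 21(1), 54(2), 78(1) apply here'
external = 'Sections 21(1), 54(2), 78(1) of Harry Potter'

print(find_section_refs(internal))  # flagged as internal (False)
print(find_section_refs(external))  # flagged as external (True)
```

Keeping the flag instead of dropping the match covers both goals: filter on is_external to exclude outside references, or keep it to mark them.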