Parse text using regex to extract valid passage

Question

How can I parse text in Python using regex to extract a valid passage from input like

near accomodation\n\nNear accomodation is one case of accomodation. By changing the shape of the lens, accomodation adjusts the refractory power to the distance of an object under observation. The issue is

I want to extract

Near accomodation is one case of accomodation. By changing the shape of the lens, accomodation adjusts the refractory power to the distance of an object under observation.

That means the valid text should end on a period, dropping unfinished sentences like "The issue is" as well as anything that comes before characters like \n.

Another example would be

<p>The level of dopamine available in nerve terminals is controlled by the enzyme monoamineoxidase, which inactivates the neurotransmitter in the presynapse. </p>\n\n</body></html>

Which should extract

The level of dopamine available in nerve terminals is controlled by the enzyme monoamineoxidase, which inactivates the neurotransmitter in the presynapse.

So any HTML tags should be removed as well.

In short, I need clean passages that end in a period, without any newline characters or HTML tags before or after the relevant passage. All passages are more or less like the examples I provided.

You need to create regular expressions for the different cases you find, then apply them. Start by reading https://docs.python.org/3/library/re.html, then move your text over to http://www.regex101.com and try out your regexes. Iterate until you are satisfied. – Patrick Artner, Jun 02 '18 at 17:14
By the way, your first example captures two sentences; it does not stop at the first `.`. For your second example, you should probably get rid of the HTML _first_ and then extract. HTML is too complex to be handled easily with regex, and some will even tell you to use an HTML parser instead (which is a good idea). Your first example will also have more text after it. Regex is pattern analysis, not text comprehension, so it will find patterns but not understand which text belongs together. – Patrick Artner, Jun 02 '18 at 17:17
@PatrickArtner Thanks for the explanations. Yeah, the period shouldn't be the first one, you're correct. But the fact that regex is greedy, as the answers explained, makes it perfect for getting the longest passage. Also, I already applied a text comprehension procedure before this, so the first example is already accurate enough, but "The issue is" should be left out. That's why I need everything to stop on a period, to make sure it at least finishes on a complete sentence. – Atirag, Jun 02 '18 at 18:14

score 1 · Accepted Answer · answered Jun 02 '18 at 17:18

I propose separating the removal of HTML tags (which you should not do with regex) from the main task, for example by running the text through a dedicated HTML parser first.
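As a rough sketch of that first step, the standard library's `html.parser` can collect just the text nodes. The names `TextExtractor` and `strip_tags` are made up for this illustration, and this is only one of several ways to do it:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text between tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for each run of text found between tags.
        self.parts.append(data)


def strip_tags(html):
    """Return the text content of an HTML fragment, tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


sample = ("<p>The level of dopamine available in nerve terminals is controlled "
          "by the enzyme monoamineoxidase, which inactivates the "
          "neurotransmitter in the presynapse. </p>\n\n</body></html>")
print(strip_tags(sample).strip())
```

The result still contains the stray whitespace and newlines from the input, which is why the sketch calls `.strip()` before printing.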

The rest of the task can then be solved with the following regex:

(?:^|\n|\.)(.*\.)

We first match either the beginning of the text (^), a newline (\n), or a literal dot (\.). The ?: just makes this group non-capturing. Then we collect everything up to a dot in a greedy fashion, meaning we get the biggest possible match: because . does not match a newline by default, the capture runs to the last dot on that line.

You could use it like this:

import re
m = re.findall(r"(?:^|\n|\.)(.*\.)", your_string)
if m:
    print(m[0].strip())

answered Jun 02 '18 at 17:18

L3viathan

Worked great! Thanks – Atirag Jun 02 '18 at 17:35
I just found a new example that starts like "are small brain cells with a life-long division. The types of glial cells are astrocytes,". The regex above takes everything, but it would be better if it took only the part from the beginning of a new sentence, so from "The types of glial cells are astrocytes," on. How could I modify the expression to look for patterns that start with a capital letter? – Atirag Jun 05 '18 at 09:02
Change the regex to `(?:^|\n|\. )(?=[A-Z])(.*\.)` in that case, see [this demo](https://regex101.com/r/VmW2oY/1). – L3viathan Jun 05 '18 at 10:36
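
A small sketch of how that modified pattern behaves. The function name `first_capitalized_passage` and the `fragment` text are invented for illustration, and `[A-Z]` only covers ASCII capitals:

```python
import re

# Pattern from the comment above: a start anchor (text start, newline,
# or ". "), then a lookahead requiring a capital letter, then a greedy
# capture up to the last period on that line.
PATTERN = re.compile(r"(?:^|\n|\. )(?=[A-Z])(.*\.)")


def first_capitalized_passage(text):
    """Return the first captured passage, or None if nothing matches."""
    match = PATTERN.search(text)
    return match.group(1) if match else None


# Made-up input: a lower-case leftover, two full sentences, a cut-off tail.
fragment = ("is a lower-case fragment. This sentence is complete. "
            "So is this one. This one is")
print(first_capitalized_passage(fragment))
# This sentence is complete. So is this one.
```

The lower-case leftover at the start is skipped because the lookahead fails there, and the cut-off tail is dropped because the capture has to end on a period.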

score 1 · Answer 2 · answered Jun 02 '18 at 17:51

The key is to be able to precisely state the conditions that:

1. Start the match
2. Continue the match
3. End the match

In your case, these seem to be

1. An upper-case letter: `[A-Z]`
2. A run of characters that are not '\n', '<', '>' and so on: `[^\n<>]+`
3. A full stop: `\.`

Since regexes are greedy by default, the ending condition applies to the longest possible match, so the result can span multiple sentences as long as none of them contains an excluded character. This gives the regex `[A-Z][^\n<>]+\.`:

>>> import re
>>> matcher = re.compile(r'[A-Z][^\n<>]+\.')  # raw string avoids the invalid '\.' string escape

Using what you provided:

>>> matcher.findall('''<p>The level of dopamine available in nerve terminals is controlled by the enzyme monoamineoxidase, which inactivates the neurotransmitter in the presynapse. </p>\n\n</body></html>''')[0]
'The level of dopamine available in nerve terminals is controlled by the enzyme monoamineoxidase, which inactivates the neurotransmitter in the presynapse.'
>>> matcher.findall('''near accomodation\n\nNear accomodation is one case of accomodation. By changing the shape of the lens, accomodation adjusts the refractory power to the distance of an object under observation. The issue is''')[0]
'Near accomodation is one case of accomodation. By changing the shape of the lens, accomodation adjusts the refractory power to the distance of an object under observation.'

Feel free to adapt as needed.
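
One possible adaptation, as a sketch: wrap the pattern in a helper that returns `None` instead of raising `IndexError` when nothing matches. The name `extract_passage` and both test strings are made up here. Note that the match starts at the first capital letter anywhere in the text, even mid-sentence:

```python
import re

# Same pattern as above, written as a raw string.
SENTENCES = re.compile(r"[A-Z][^\n<>]+\.")


def extract_passage(text):
    """Return the first matching passage, or None when there is none."""
    match = SENTENCES.search(text)
    return match.group(0) if match else None


print(extract_passage("no capital letter and no period"))
# None
print(extract_passage("the enzyme MAO is active. The issue is"))
# MAO is active.
```

If a start in the middle of a sentence is a problem for your data, the start condition would need to be tightened, for example by requiring a preceding newline or ". ".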
