
I am parsing some information from Wikipedia, and the text in the dumps includes special annotations for links and images of the form {{content}} or [[content]]. I want to split the text into sentences, but the problem arises when a period is followed not by a space but by one of those markers.

So, in general, it must split whenever '. ', '.{{', or '.[[' occurs.

Example:

prueba = 'Anarchism does not offer a fixed body of doctrine from a single particular worldview.{{sfn|Marshall|1993|pp=14–17}} Many types and traditions of anarchism exist, not all of which are mutually exclusive.[[Sylvan|2007|p=262]] [[Anarchist schools of thought]] can differ fundamentally, supporting anything from extreme [[individualism]] to complete [[collectivism]].'

sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', prueba)

Here is the text again for easier reading:

Anarchism does not offer a fixed body of doctrine from a single particular worldview.{{sfn|Marshall|1993|pp=14–17}} Many types and traditions of anarchism exist, not all of which are mutually exclusive.[[Sylvan|2007|p=262]] [[Anarchist schools of thought]] can differ fundamentally, supporting anything from extreme [[individualism]] to complete [[collectivism]].

The output of this code is a list with only one item containing the whole text:

['Anarchism does not offer a fixed body of doctrine from a single particular worldview.{{sfn|Marshall|1993|pp=14–17}} Many types and traditions of anarchism exist, not all of which are mutually exclusive.[[Sylvan|2007|p=262]] [[Anarchist schools of thought]] can differ fundamentally, supporting anything from extreme [[individualism]] to complete [[collectivism]].']

But I need to get a list with three items like this:

['Anarchism does not offer a fixed body of doctrine from a single particular worldview.', '{{sfn|Marshall|1993|pp=14–17}} Many types and traditions of anarchism exist, not all of which are mutually exclusive.', '[[Sylvan|2007|p=262]] [[Anarchist schools of thought]] can differ fundamentally, supporting anything from extreme [[individualism]] to complete [[collectivism]].']

How can I fix my regex code? I tried different solutions but I couldn't get the desired result.

Thanks in advance.

Javi Rando
  • Why does this need to be a regex? Why not just use `prueba.split('.{{')`? – Triggernometry May 09 '19 at 20:33
  • @Triggernometry it should also be able to work with '. ' and '.[[' – Javi Rando May 09 '19 at 20:37
  • You may want to edit in an example of a string that contains `'.[['`, and explain why splitting on a period isn't sufficient? From what I can see, it seems like you're making this too complicated, so I assume there must be some odd cases that aren't in this post? – Triggernometry May 09 '19 at 20:41
  • @Triggernometry The text above contains a `.{{` and a `.[[`, which are the ones that aren't being split. – Javi Rando May 09 '19 at 21:02

1 Answer

Since it seems you are trying to preserve the delimiter, you probably want re.findall() rather than re.split(). See this answer, https://stackoverflow.com/a/44244698/11199887, which is reproduced below and then adapted to your situation. With re.findall() you match each sentence up to and including its terminator, so you don't have to distinguish between '. ', '.{{' and '.[['.

import re

s = """You! Are you Tom? I am Danny."""
re.findall(r'.*?[.!?]', s)
# ['You!', ' Are you Tom?', ' I am Danny.']

The example above captures not only periods but also the question marks and exclamation points that end sentences. There are probably not many sentences ending with exclamation points or question marks on Wikipedia, but I haven't spent time looking for examples.
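As the output above shows, each match after the first keeps its leading space, and any text after the last terminator is not matched at all. The following is only a sketch of one way to handle both; the alternative pattern and the names `raw` and `cleaned` are my own illustration, not part of the linked answer.

```python
import re

s = "You! Are you Tom? I am Danny. And a trailing fragment"

# The lazy pattern keeps the leading space of each sentence and
# drops whatever follows the last terminator.
raw = re.findall(r'.*?[.!?]', s)

# Hypothetical variant: also accept a final run with no terminator
# (the `$` alternative), then strip surrounding whitespace.
cleaned = [m.strip() for m in re.findall(r'.+?(?:[.!?]|$)', s)]

print(raw)
print(cleaned)
```

Whether the unterminated tail should be kept depends on your data; for a Wikipedia dump it may well be markup you want to discard anyway.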

For your case it looks like this:

prueba = 'Anarchism does not offer a fixed body of doctrine from a single particular worldview.{{sfn|Marshall|1993|pp=14–17}} Many types and traditions of anarchism exist, not all of which are mutually exclusive.[[Sylvan|2007|p=262]] [[Anarchist schools of thought]] can differ fundamentally, supporting anything from extreme [[individualism]] to complete [[collectivism]].'

sentences = re.findall(r'.*?[.!?]', prueba)

Or, if you really want to split only on periods:

sentences = re.findall(r'.*?[.]', prueba)

Output from print(sentences) is:

['Anarchism does not offer a fixed body of doctrine from a single particular worldview.',
 '{{sfn|Marshall|1993|pp=14–17}} Many types and traditions of anarchism exist, not all of which are mutually exclusive.',
 '[[Sylvan|2007|p=262]] [[Anarchist schools of thought]] can differ fundamentally, supporting anything from extreme [[individualism]] to complete [[collectivism]].']
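If you would rather stay with re.split(), as in the question, a possible alternative is to split on a boundary that is either whitespace or an empty position right before `{{` or `[[`. This is a minimal sketch, assuming Python 3.7 or later (where re.split() can split on empty matches); `SENTENCE_BOUNDARY` and `split_sentences` are names made up for this example, and the abbreviation lookbehinds from the question are left out for brevity.

```python
import re

# Splitting on a zero-width match requires Python 3.7+.
SENTENCE_BOUNDARY = re.compile(
    r'(?<=[.?!])'        # the previous character ends a sentence
    r'(?:\s+'            # then either whitespace, which is consumed...
    r'|(?=\{\{|\[\[))'   # ...or an empty match right before {{ or [[
)

def split_sentences(text):
    """Split after '.', '?' or '!' when whitespace, '{{' or '[[' follows."""
    return [part for part in SENTENCE_BOUNDARY.split(text) if part]

prueba = 'Anarchism does not offer a fixed body of doctrine from a single particular worldview.{{sfn|Marshall|1993|pp=14–17}} Many types and traditions of anarchism exist, not all of which are mutually exclusive.[[Sylvan|2007|p=262]] [[Anarchist schools of thought]] can differ fundamentally, supporting anything from extreme [[individualism]] to complete [[collectivism]].'

sentences = split_sentences(prueba)
print(len(sentences))
```

Unlike the findall() version, this only splits where one of the three cases from the question ('. ', '.{{', '.[[') occurs, so a period in the middle of a token such as "3.14" is left alone.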
kcontr