I am working parsing some information from Wikipedia and text in the dumps include special annotations for links and images in the shape of {{content}}
or [[content]]
. I want to separate the text into sentences but the problem arises when the point is not followed by a space but by one of the previous symbols.
So, in general, it must split when '. ', '.{{', '.[['
happen.
Example:
prueba = 'Anarchism does not offer a fixed body of doctrine from a single particular worldview.{{sfn|Marshall|1993|pp=14–17}} Many types and traditions of anarchism exist, not all of which are mutually exclusive.[[Sylvan|2007|p=262]] [[Anarchist schools of thought]] can differ fundamentally, supporting anything from extreme [[individualism]] to complete [[collectivism]].'
sentences = re.split('(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', prueba)
I paste the text here again to ease the reading
Anarchism does not offer a fixed body of doctrine from a single particular worldview.{{sfn|Marshall|1993|pp=14–17}} Many types and traditions of anarchism exist, not all of which are mutually exclusive.[[Sylvan|2007|p=262]] [[Anarchist schools of thought]] can differ fundamentally, supporting anything from extreme [[individualism]] to complete [[collectivism]].
The output of this code is a list with only one item containing the whole text:
['Anarchism does not offer a fixed body of doctrine from a single particular worldview.{{sfn|Marshall|1993|pp=14–17}} Many types and traditions of anarchism exist, not all of which are mutually exclusive.[[sfn|Sylvan|2007|p=262]] [[Anarchist schools of thought]] can differ fundamentally, supporting anything from extreme [[individualism]] to complete [[collectivism]].']
But I need to get a list with three items like this:
['Anarchism does not offer a fixed body of doctrine from a single particular worldview.', '{{sfn|Marshall|1993|pp=14–17}} Many types and traditions of anarchism exist, not all of which are mutually exclusive.', '[[Sylvan|2007|p=262]] [[Anarchist schools of thought]] can differ fundamentally, supporting anything from extreme [[individualism]] to complete [[collectivism]].']
How can I fix my regex code? I tried different solutions but I couldn't get the desired result.
Thanks in advance.