I'm trying to split a text into sentences wherever a terminal punctuation mark ('.', '!', '?') appears. For instance, if I have the following text:

Recognizing the rising opportunity Jerusalem Venture Partners opened up their Cyber Labs incubator, giving a home to many of the city’s promising young companies. International corporates like EMC have also established major centers in the park, leading the way for others to follow! On a visit last June, the park had already grown to two buildings with the ground being broken for the construction of more in the near future. this is really interesting! what do you think?

This should be split into 5 sentences, one for each terminal punctuation mark in the text above.

Here's my code:

# split on: '.+'
splitted_article_content = []
# article_content contains all the article's paragraphs
for element in article_content:
    splitted_article_content = splitted_article_content + re.split(".(?='.'+)", element)

# split on: '?+'
splitted_article_content_2 = []
for element in splitted_article_content:
    splitted_article_content_2 = splitted_article_content_2 + re.split(".(?='?'+)", element)

# split on: '!+'
splitted_article_content_3 = []
for element in splitted_article_content_2:
    splitted_article_content_3 = splitted_article_content_3 + re.split(".(?='!'+)", element)

My question is: is there a more efficient way to do this without using any external libraries?

Thanks for the help guys.

Jay
  • But... you _aren't_ using any external libraries, as `re` is part of Python's Standard Library. – ForceBru Mar 28 '17 at 16:48
  • Couldn't you just do `re.split(r'[\.!?] ', article)`? – gen_Eric Mar 28 '17 at 16:49
  • @RocketHazmat article_content is a list of paragraphs. Will split work on a list? – Jay Mar 28 '17 at 16:51
  • Then try: `splitted_article_content = [re.split(r'[\.!?] ', element) for element in article_content]`. Point is, you only need *one* regex, not three. Or is there any specific reason you need to split it up into 3 lists? Like does it matter if the sentence ended in a `.`, `?` or `!`? – gen_Eric Mar 28 '17 at 16:53
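
The single-pattern suggestion in the comments above can be sketched as follows. This is a minimal illustration, not the commenter's exact code: the two sample paragraphs and the names `pattern`, `nested`, and `flat` are assumptions made for the example. Note that the list comprehension produces one list per paragraph, so a flattening step is needed to get a single list of sentences, and that this pattern consumes the punctuation mark it splits on.

```python
import re
from itertools import chain

# Hypothetical sample data: a list of paragraphs, as described in the question.
article_content = [
    "This is really interesting! What do you think?",
    "Pi is 3.14. It is irrational.",
]

# One character class covers all three terminal marks, followed by a space.
pattern = re.compile(r"[.!?] ")

# The list comprehension from the comment yields one list per paragraph.
nested = [pattern.split(paragraph) for paragraph in article_content]

# Flatten the per-paragraph lists into a single list of sentences.
flat = list(chain.from_iterable(nested))

print(nested)
print(flat)
```

Because the mark and the space are both part of the match, every sentence except the last one in each paragraph loses its closing punctuation. If the punctuation has to be kept, the pattern needs to match only the whitespace.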

1 Answer

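I see this as a job for a lookbehind rather than a lookahead: split on the whitespace that follows a terminal punctuation mark, so the mark stays attached to its sentence.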

import re

# article_content contains all the article's paragraphs
# in this case, a single paragraph.

article_content = ["""Recognizing the rising opportunity Jerusalem Venture Partners opened up their Cyber Labs incubator, giving a home to many of the city’s promising young companies. International corporates like EMC have also established major centers in the park, leading the way for others to follow! On a visit last June, the park had already grown to two buildings with the ground being broken for the construction of more in the near future. This is really interesting! What do you think?"""]

split_article_content = []

for element in article_content:
    split_article_content += re.split(r"(?<=[.!?])\s+", element)

print(*split_article_content, sep='\n\n')

OUTPUT

% python3 test.py
Recognizing the rising opportunity Jerusalem Venture Partners opened up their Cyber Labs incubator, giving a home to many of the city’s promising young companies.

International corporates like EMC have also established major centers in the park, leading the way for others to follow!

On a visit last June, the park had already grown to two buildings with the ground being broken for the construction of more in the near future.

This is really interesting!

What do you think?
% 
cdlane
  • I need to write this input to a file, how can I do that ? – Jay Mar 28 '17 at 17:11
  • @JayMar, please be a little more specific, do you want to input the paragraphs from a file or do you want to output the sentences to a file? If this is a separate question from the one you asked, you might consider posting a new question with your accumulated code. – cdlane Mar 28 '17 at 17:49
  • that's ok I have managed to output the sentences to a file. Thanks for the help. – Jay Mar 28 '17 at 19:18
  • Can I exclude dots between digits? E.g. "Pi is 3.14." should not be split into two sentences but kept as one. – Jay Mar 29 '17 at 03:51
  • Did you try it? The code I provided works fine for this example, as it doesn't split on periods; it splits on whitespace preceded by a period. So no problem. However, a number of the strange form "I walked 5. feet backwards." is a problem. Your problem. – cdlane Mar 29 '17 at 06:50
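
The two cases discussed in the last comments can be checked with a short sketch. It uses the same lookbehind pattern as the answer; the names `SENTENCE_BOUNDARY`, `samples`, and `results` are assumptions introduced for this illustration.

```python
import re

# Same pattern as the answer: whitespace preceded by '.', '!' or '?'.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")

samples = [
    "Pi is 3.14. It is irrational.",   # period between digits, no whitespace after it
    "I walked 5. feet backwards.",     # period after a digit, followed by whitespace
]

# Split each sample on its own so the two cases can be compared.
results = [SENTENCE_BOUNDARY.split(text) for text in samples]

for text, sentences in zip(samples, results):
    print(repr(text), "->", sentences)
```

The period inside "3.14" is followed by a digit rather than whitespace, so no split happens there. The period in "5. feet" is followed by a space, so the pattern treats it as a sentence boundary, which is the problem case the comment points out.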