
I'm trying to split a sample piece of text into a list of sentences, with no delimiters and no trailing spaces at the end of each sentence.

Sample text:

The first time you see The Second Renaissance it may look boring. Look at it at least twice and definitely watch part 2. It will change your view of the matrix. Are the human people the ones who started the war? Is AI a bad thing?

Into this (desired output):

['The first time you see The Second Renaissance it may look boring', 'Look at it at least twice and definitely watch part 2', 'It will change your view of the matrix', 'Are the human people the ones who started the war', 'Is AI a bad thing']

My code is currently:

import re

def sent_tokenize(text):
    sentences = re.split(r"[.!?]", text)
    sentences = [sent.strip(" ") for sent in sentences]
    return sentences

However this outputs (current output):

['The first time you see The Second Renaissance it may look boring', 'Look at it at least twice and definitely watch part 2', 'It will change your view of the matrix', 'Are the human people the ones who started the war', 'Is AI a bad thing', '']

Notice the extra '' on the end.

Any ideas on how to remove the extra '' at the end of my current output?

Daniel Bourke

4 Answers


nltk's sent_tokenize

If you're in the business of NLP, I'd strongly recommend sent_tokenize from the nltk package.

>>> from nltk.tokenize import sent_tokenize
>>> sent_tokenize(text)
[
    'The first time you see The Second Renaissance it may look boring.',
    'Look at it at least twice and definitely watch part 2.',
    'It will change your view of the matrix.',
    'Are the human people the ones who started the war?',
    'Is AI a bad thing?'
] 

It's a lot more robust than regex and provides a lot of options to get the job done. More info can be found in the official documentation.
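For example, the pretrained punkt model behind sent_tokenize knows about common abbreviations, so it won't split after "Mr." the way a bare [.!?] split would. A quick illustration (my example, not from the answer; you may need to run nltk.download('punkt') once first):

>>> from nltk.tokenize import sent_tokenize
>>> sent_tokenize("Mr. Anderson took the red pill. He never looked back.")
['Mr. Anderson took the red pill.', 'He never looked back.']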

If you are picky about the trailing delimiters, you can use nltk.tokenize.RegexpTokenizer with a slightly different pattern:

>>> from nltk.tokenize import RegexpTokenizer
>>> tokenizer = RegexpTokenizer(r'[^.?!]+')
>>> list(map(str.strip, tokenizer.tokenize(text)))    
[
    'The first time you see The Second Renaissance it may look boring',
    'Look at it at least twice and definitely watch part 2',
    'It will change your view of the matrix',
    'Are the human people the ones who started the war',
    'Is AI a bad thing'
]

Regex-based re.split

If you must use regex, then you'll need to modify your pattern by adding a negative lookahead -

>>> list(map(str.strip, re.split(r"[.!?](?!$)", text)))
[
    'The first time you see The Second Renaissance it may look boring',
    'Look at it at least twice and definitely watch part 2',
    'It will change your view of the matrix',
    'Are the human people the ones who started the war',
    'Is AI a bad thing?'
]

The added (?!$) specifies that you split only when you have not yet reached the end of the string. Unfortunately, I am not sure the trailing delimiter on the last sentence can be reasonably removed without doing something like result[-1] = result[-1][:-1].
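Putting the two pieces together, here is a minimal regex-only sketch; the rstrip call is my addition, one way to drop the delimiter left on the last sentence:

import re

def sent_tokenize(text):
    # split on any delimiter that is not at the very end of the string
    sentences = [s.strip() for s in re.split(r"[.!?](?!$)", text)]
    # the final sentence keeps its delimiter; trim it explicitly
    sentences[-1] = sentences[-1].rstrip(".!?")
    return sentences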

cs95
  • "without delimiters". Look at the desired output. – whackamadoodle3000 Feb 20 '18 at 07:24
  • @ᴡʜᴀᴄᴋᴀᴍᴀᴅᴏᴏᴅʟᴇ3000 imo, that is a minor detail. I'll see what I can do though. – cs95 Feb 20 '18 at 07:25
  • @ᴡʜᴀᴄᴋᴀᴍᴀᴅᴏᴏᴅʟᴇ3000 I've added an option with RegexpTokenizer which addresses this concern. Hope it's alright now! – cs95 Feb 20 '18 at 07:34
  • I have to use regex, didn't have access to nltk package. The regex answer worked but left '?' at the end of the final sentence. – Daniel Bourke Feb 20 '18 at 07:41
  • @DanielBourke A shame, really. But I can respect that. Good luck! – cs95 Feb 20 '18 at 07:44
  • @DanielBourke, try this: `listy[-1]=listy[-1][:-1]` to remove the ? – whackamadoodle3000 Feb 20 '18 at 07:46
  • @ᴡʜᴀᴄᴋᴀᴍᴀᴅᴏᴏᴅʟᴇ3000 While that would solve the problem, it would also mean that two operations are still required to get OP's output. In that case, one may as well stick to what OP is currently doing, and then use your answer. Maybe improve upon it with `del result[-1]` (basically a more efficient version of your answer) to do the same thing. By the way, if you want to add `del result[-1]` in your answer, feel free to. – cs95 Feb 20 '18 at 07:48
  • Sure, thanks for the tip. – whackamadoodle3000 Feb 20 '18 at 07:50
  • 1
    `(?<!$)` is a lookbehind (see Option 2 description). You might want to use a lookahead, but `(?<!$)` = `(?!$)` since `$` is a zero-width assertion. – Wiktor Stribiżew Feb 20 '18 at 08:04
  • @WiktorStribiżew Thanks for the input, appreciated. That was an oversight on my part. – cs95 Feb 20 '18 at 08:05

You can use filter to remove the empty elements

Ex:

import re

text = """The first time you see The Second Renaissance it may look boring. Look at it at least twice and definitely watch part 2. It will change your view of the matrix. Are the human people the ones who started the war? Is AI a bad thing?"""

def sent_tokenize(text):
    sentences = re.split(r"[.!?]", text)
    sentences = [sent.strip(" ") for sent in sentences]
    # filter(None, ...) drops falsy values, including empty strings
    return list(filter(None, sentences))

print(sent_tokenize(text))
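Equivalently (my note, not part of the answer), a list comprehension with a truthiness check performs the same filtering:

sentences = [sent for sent in sentences if sent]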
Rakesh

Any ideas on how to remove the extra '' at the end of my current output?

You could remove it by doing this:

sentences = sentences[:-1]

Or, faster (suggested by ᴄᴏʟᴅsᴘᴇᴇᴅ):

del sentences[-1]

Output:

['The first time you see The Second Renaissance it may look boring', 'Look at it at least twice and definitely watch part 2', 'It will change your view of the matrix', 'Are the human people the ones who started the war', 'Is AI a bad thing']
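One caveat worth adding: both snippets drop the last element unconditionally, so if the input ever ends without a delimiter you would lose a real sentence. A guarded version (my sketch) only deletes a trailing empty string:

if sentences and sentences[-1] == '':
    del sentences[-1]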
whackamadoodle3000

You could either strip the trailing delimiter from your paragraph before splitting it, or filter the empty strings out of the result.
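For example, a minimal sketch of the strip-first approach (my illustration, not code from the answer):

import re

def sent_tokenize(text):
    # removing the trailing delimiter up front means the split
    # never produces an empty final element
    text = text.strip().rstrip(".!?")
    return [sent.strip() for sent in re.split(r"[.!?]", text)]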

Tung