
I want to split a text into subsentences. How can I do that?

For example:

text = "Hi, this is an apple. Hi, that is pineapple."

The result should be:

['Hi,',
 'this is an apple.',
 'Hi,',
 'that is pineapple.']

(P.S. I tried `re.split(r'[,.]', text)`, but it removes the separators.)

xirururu
  • So you just want to split by any punctuation? – user3483203 Dec 24 '17 at 22:26
  • @chris I think it should split on "," in addition to a sentence tokenizer – xirururu Dec 24 '17 at 22:29
  • What about an array of arrays where each sentence is broken down into its tokens? – user3483203 Dec 24 '17 at 22:30
  • @chris what does "an array of arrays" mean? Can you give me a detailed example? – xirururu Dec 24 '17 at 22:33
  • Per your example: `[['Hi,', 'this is an apple.'], ['Hi,', 'that is pineapple.']]` I am not saying there is anything at all wrong with what you are trying to do, just suggesting something that would allow you to keep track of sentences as well as tokens within each of those sentences. – user3483203 Dec 24 '17 at 22:34
  • @chris I think that is also OK. The most important thing is that I want to keep the "," and "." – xirururu Dec 24 '17 at 22:39
  • OK. If you want to do that you can just do `[h.split(',') for h in text.split('.') if h != '']`; otherwise the given answers will help you! – user3483203 Dec 24 '17 at 22:40
  • @chris thanks, I think this is also a good way to keep track of sentences. :D – xirururu Dec 24 '17 at 22:51

5 Answers


Maybe this could work too:

text.replace(', ', ',, ').replace('. ', '., ').split(', ')

Results in:

['Hi,', 'this is an apple.', 'Hi,', 'that is pineapple.']
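As a runnable sketch (note the assumption this trick relies on: every separator inside the text is followed by a single space):

```python
text = "Hi, this is an apple. Hi, that is pineapple."

# Insert an extra ", " marker after every separator that is followed by
# a space, then split on ", ". The final '.' has no trailing space, so
# it needs no marker.
parts = text.replace(', ', ',, ').replace('. ', '., ').split(', ')
print(parts)  # ['Hi,', 'this is an apple.', 'Hi,', 'that is pineapple.']
```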
Anton vBR

Related question

The Natural Language Toolkit provides a tokenizer that you can use to split sentences. For example:

>>> import nltk
>>> nltk.download('punkt')

>>> import nltk.data
>>> tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
>>> data = "Hi, this is an apple. Hi, that is pineapple."
>>> data = data.replace(',', '.')
>>> tokenizer.tokenize(data)
['Hi.', 'this is an apple.', 'Hi.', 'that is pineapple.']

Details of the punkt tokenizer are documented in the NLTK API reference (nltk.tokenize.punkt).

James Lim

You could split on whitespace \s+ with a zero-length look-behind assertion (?<=[,.]) for the punctuation.

import re

text = "Hi, this is an apple. Hi, that is pineapple."
subsentence = re.compile(r'(?<=[,.])\s+')

print(subsentence.split(text))

['Hi,', 'this is an apple.', 'Hi,', 'that is pineapple.']
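For comparison, a capturing group in re.split keeps the separators as their own list items rather than dropping them (which is what happened in the question's attempt); they can then be re-attached. A sketch:

```python
import re

text = "Hi, this is an apple. Hi, that is pineapple."

# A capturing group makes re.split return the punctuation as separate
# items instead of discarding it (with a trailing '' at the end).
pieces = re.split(r'([,.])\s*', text)
# Re-attach each separator to the clause that precedes it.
parts = [clause + sep for clause, sep in zip(pieces[::2], pieces[1::2])]
print(parts)  # ['Hi,', 'this is an apple.', 'Hi,', 'that is pineapple.']
```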

Ryan Stein

Here is another possible solution using re.finditer():

import re

text = "Hi, this is an apple. Hi, that is pineapple."

punct_locs = [0] + [i.start() + 1 for i in re.finditer(r'[,.]', text)]

sentences = [text[start:end].strip() for start, end in zip(punct_locs[:-1], punct_locs[1:])]

print(sentences)

Which outputs:

['Hi,', 'this is an apple.', 'Hi,', 'that is pineapple.']
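A variant of the same idea that matches the subsentences directly with re.finditer() instead of computing punctuation indices:

```python
import re

text = "Hi, this is an apple. Hi, that is pineapple."

# Each match is a run of non-punctuation characters followed by a
# single comma or period; strip the leading space from each match.
parts = [m.group().strip() for m in re.finditer(r'[^,.]+[,.]', text)]
print(parts)  # ['Hi,', 'this is an apple.', 'Hi,', 'that is pineapple.']
```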
RoadRunner

Why make it so complex by importing heavy modules? Just go with a simple and clean method without importing any module:

text = "Hi, this is an apple. Hi, that is pineapple."
for i in text.split('.'):
    if i:
        print(i.strip().split(','))

output:

['Hi', ' this is an apple']
['Hi', ' that is pineapple']

You can do it in one line:

text = "Hi, this is an apple. Hi, that is pineapple."
print([i.strip().split(',') for i in text.split('.') if i])

output:

[['Hi', ' this is an apple'], ['Hi', ' that is pineapple']]
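Note that this drops the "," and "." the question wants to keep. A sketch of the same no-import approach with the separators re-appended:

```python
text = "Hi, this is an apple. Hi, that is pineapple."

result = []
for sentence in text.split('.'):
    sentence = sentence.strip()
    if not sentence:
        continue
    clauses = sentence.split(',')
    # Re-attach a comma to every clause but the last, and a period to the last.
    result.append([c.strip() + ',' for c in clauses[:-1]]
                  + [clauses[-1].strip() + '.'])
print(result)  # [['Hi,', 'this is an apple.'], ['Hi,', 'that is pineapple.']]
```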
Aaditya Ura