
I want to split a text into subsentences. How can I do that?

For example:

text = "Hi, this is an apple. Hi, that is pineapple."

The result should be:

['Hi,',
 'this is an apple.',
 'Hi,',
 'that is pineapple.']

(P.S. I tried `re.split(r'[,.]', text)`, but it removes the separators.)

xirururu
  • So you just want to split by any punctuation? – user3483203 Dec 24 '17 at 22:26
  • @chris I think it should split on "," in addition to a sentence tokenizer – xirururu Dec 24 '17 at 22:29
  • What about an array of arrays where each sentence is broken down into its tokens? – user3483203 Dec 24 '17 at 22:30
  • @chris what does "an array of arrays" mean? Can you give me a detailed example? – xirururu Dec 24 '17 at 22:33
  • Per your example: `[['Hi,', 'this is an apple.'], ['Hi,', 'that is pineapple.']]` I am not saying there is anything at all wrong with what you are trying to do, just suggesting something that would allow you to keep track of sentences as well as tokens within each of those sentences. – user3483203 Dec 24 '17 at 22:34
  • @chris I think that is also OK. The most important thing is that I want to keep the "," and "." – xirururu Dec 24 '17 at 22:39
  • OK. If you want to do that you can just do `[h.split(',') for h in text.split('.') if h != '']`; otherwise the given answers will help you! – user3483203 Dec 24 '17 at 22:40
  • @chris thanks, I think this is also a good way to keep track of sentences. :D – xirururu Dec 24 '17 at 22:51

5 Answers


Maybe this could work too:

text.replace(', ', ',, ').replace('. ', '., ').split(', ')

Results in:

['Hi,', 'this is an apple.', 'Hi,', 'that is pineapple.']
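As a runnable sketch (note the assumption this trick relies on: every separator inside the text is followed by a single space):

```python
text = "Hi, this is an apple. Hi, that is pineapple."

# Insert an extra ", " marker after every separator that is followed by
# a space, then split on ", ". The final '.' has no trailing space, so
# it needs no marker.
parts = text.replace(', ', ',, ').replace('. ', '., ').split(', ')
print(parts)  # ['Hi,', 'this is an apple.', 'Hi,', 'that is pineapple.']
```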
Anton vBR

Related question

The Natural Language Toolkit provides a tokenizer that you can use to split sentences. For example:

>>> import nltk
>>> nltk.download('punkt')

>>> import nltk.data
>>> tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
>>> data = "Hi, this is an apple. Hi, that is pineapple."
>>> data = data.replace(',', '.')
>>> tokenizer.tokenize(data)
['Hi.', 'this is an apple.', 'Hi.', 'that is pineapple.']

Details of the punkt tokenizer are documented in the NLTK API reference (nltk.tokenize.punkt).

James Lim

You could split on whitespace \s+ with a zero-length look-behind assertion (?<=[,.]) for the punctuation.

import re

text = "Hi, this is an apple. Hi, that is pineapple."
subsentence = re.compile(r'(?<=[,.])\s+')

print(subsentence.split(text))

['Hi,', 'this is an apple.', 'Hi,', 'that is pineapple.']
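For comparison, a capturing group in re.split keeps the separators as their own list items rather than dropping them (which is what happened in the question's attempt); they can then be re-attached. A sketch:

```python
import re

text = "Hi, this is an apple. Hi, that is pineapple."

# A capturing group makes re.split return the punctuation as separate
# items instead of discarding it (with a trailing '' at the end).
pieces = re.split(r'([,.])\s*', text)
# Re-attach each separator to the clause that precedes it.
parts = [clause + sep for clause, sep in zip(pieces[::2], pieces[1::2])]
print(parts)  # ['Hi,', 'this is an apple.', 'Hi,', 'that is pineapple.']
```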

Ryan Stein

Here is another possible solution using re.finditer():

import re

text = "Hi, this is an apple. Hi, that is pineapple."

punct_locs = [0] + [i.start() + 1 for i in re.finditer(r'[,.]', text)]

sentences = [text[start:end].strip() for start, end in zip(punct_locs[:-1], punct_locs[1:])]

print(sentences)

Which outputs:

['Hi,', 'this is an apple.', 'Hi,', 'that is pineapple.']
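A variant of the same idea that matches the subsentences directly with re.finditer() instead of computing punctuation indices:

```python
import re

text = "Hi, this is an apple. Hi, that is pineapple."

# Each match is a run of non-punctuation characters followed by a
# single comma or period; strip the leading space from each match.
parts = [m.group().strip() for m in re.finditer(r'[^,.]+[,.]', text)]
print(parts)  # ['Hi,', 'this is an apple.', 'Hi,', 'that is pineapple.']
```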
RoadRunner

Why make it so complex by importing heavy modules? Just go with a simple and clean method without importing any module:

text = "Hi, this is an apple. Hi, that is pineapple."
for i in text.split('.'):
    if i:
        print(i.strip().split(','))

output:

['Hi', ' this is an apple']
['Hi', ' that is pineapple']

You can do it in one line:

text = "Hi, this is an apple. Hi, that is pineapple."
print([i.strip().split(',') for i in text.split('.') if i])

output:

[['Hi', ' this is an apple'], ['Hi', ' that is pineapple']]
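Note that this drops the "," and "." the question wants to keep. A sketch of the same no-import approach with the separators re-appended:

```python
text = "Hi, this is an apple. Hi, that is pineapple."

result = []
for sentence in text.split('.'):
    sentence = sentence.strip()
    if not sentence:
        continue
    clauses = sentence.split(',')
    # Re-attach a comma to every clause but the last, and a period to the last.
    result.append([c.strip() + ',' for c in clauses[:-1]]
                  + [clauses[-1].strip() + '.'])
print(result)  # [['Hi,', 'this is an apple.'], ['Hi,', 'that is pineapple.']]
```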
Aaditya Ura