I'm programming a Telegram bot that helps me learn German.

Instead of translating an entire paragraph at once, I would like to translate it sentence by sentence, with each translation immediately following its sentence, so that I can compare the words and learn without scrolling up and down.

I'm a regex newbie, and I would like to know whether a regex exists that can split text into sentences like this.

The text to split into sentences could look like this:

This is a sentence.
This is another. And here one another, same line, starting with space.
this sentence starts with lowercase letter.
Here is a site you may know: google.com.

I would like to get an array containing something like this (one array element per line):

This is a sentence.
This is another.
And here one another, same line, starting with space.
this sentence starts with lowercase letter.
Here is a site you may know: google.com.

Thanks in advance.

Jacquelyn.Marquardt
  • Of course this has been asked before. Does [this](http://stackoverflow.com/questions/25735644/python-regex-for-splitting-text-into-sentences-sentence-tokenizing) help? – jakeehoffmann Apr 02 '17 at 21:20
  • Even natural language parsers have trouble finding sentence boundaries, so this is not something a regex can do reliably. The reason: a regex matches characters, not words, phrases, or sentence structure, and it knows nothing about a language or its usage. –  Apr 02 '17 at 22:33

1 Answer

This is very likely better handled with nltk (provided it and its tokenizer data are installed correctly):

from nltk.tokenize import sent_tokenize

string = "This is a sentence. This is another. And here one another, same line, starting with space. this sentence starts with lowercase letter. Here is a site you may know: google.com."

sent_tokenize_list = sent_tokenize(string)
print(sent_tokenize_list)
# ['This is a sentence.', 'This is another.', 'And here one another, same line, starting with space.', 'this sentence starts with lowercase letter.', 'Here is a site you may know: google.com.']
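
If nltk is not an option, a plain regex gets reasonably close on the sample text from the question. The following is only a minimal sketch, not a general solution: it splits wherever whitespace follows `.`, `!` or `?`, so the dot inside `google.com` is left alone, but abbreviations (for instance German "z. B.") would be split incorrectly. The names `split_sentences` and `interleave` are made up for illustration, and `translate` stands in for whatever translation function the bot uses.

```python
import re

# Split on whitespace that follows '.', '!' or '?'.
# A dot followed directly by a letter (as in "google.com") is not a split point.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    """Naive regex splitter; abbreviations such as "z. B." will be split wrongly."""
    return [s for s in _SENTENCE_END.split(text.strip()) if s]


def interleave(sentences, translate):
    """Put each sentence directly above its translation.

    `translate` is a hypothetical callable: str -> str.
    """
    lines = []
    for sentence in sentences:
        lines.append(sentence)
        lines.append(translate(sentence))
    return "\n".join(lines)


text = (
    "This is a sentence.\n"
    "This is another. And here one another, same line, starting with space.\n"
    "this sentence starts with lowercase letter.\n"
    "Here is a site you may know: google.com."
)

sentences = split_sentences(text)
for sentence in sentences:
    print(sentence)
```

On this input the sketch yields the five sentences listed in the question. For real German text the nltk route is likely more robust; `sent_tokenize` also accepts a `language` argument (for example `language='german'`).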
Jan