As some of the comments hinted at, what you want to use is the Punkt sentence segmenter/tokenizer.
NLTK or language-specific?
Neither. As you have realized, you cannot simply split on every period. NLTK comes with several Punkt segmenters trained on different languages. However, if you're having issues with those, your best bet is to give the Punkt tokenizer a larger training corpus to learn from.
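For reference, loading one of the pre-trained segmenters looks roughly like this (a minimal sketch: it assumes you have fetched the punkt data package, and whether a Russian model ships with it depends on your NLTK data version, so English stands in here):

import nltk
import nltk.data

nltk.download('punkt')  # one-time fetch of the pre-trained Punkt models

# Load a bundled model; swap the path for another language if available.
sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
print('\n'.join(sent_detector.tokenize("Dr. Smith went home. He was tired.")))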
Documentation Links
The relevant pieces are the PunktTrainer and PunktSentenceTokenizer classes in the nltk.tokenize.punkt module; both are covered in the NLTK API documentation.
Sample Implementation
Below is part of the code to point you in the right direction. You should be able to do the same for yourself by supplying Russian text files. One source for those could be the Russian version of a Wikipedia database dump, but I leave that as a secondary problem for you.
import logging
try:
    import cPickle as pickle
except ImportError:
    import pickle

import nltk
def create_punkt_sent_detector(fnames, punkt_fname, progress_count=None):
    """Makes a pass through the corpus to train a Punkt sentence segmenter.

    Args:
        fnames: List of filenames to be used for training.
        punkt_fname: Filename to save the trained Punkt sentence segmenter.
        progress_count: Display a progress count every integer number of pages.
    """
    logger = logging.getLogger('create_punkt_sent_detector')

    punkt = nltk.tokenize.punkt.PunktTrainer()

    logger.info("Training punkt sentence detector")

    doc_count = 0
    try:
        for fname in fnames:
            with open(fname, mode='rb') as f:
                # Punkt operates on unicode text, so decode the raw bytes.
                punkt.train(f.read().decode('utf-8'),
                            finalize=False, verbose=False)
            doc_count += 1
            if progress_count and doc_count % progress_count == 0:
                logger.debug('Pages processed: %i', doc_count)
    except KeyboardInterrupt:
        print('KeyboardInterrupt: Stopping the reading of the dump early!')

    logger.info('Now finalizing Punkt training.')

    punkt.finalize_training(verbose=True)
    learned = punkt.get_params()
    sbd = nltk.tokenize.punkt.PunktSentenceTokenizer(learned)

    # Save the trained tokenizer so it can be reloaded without retraining.
    with open(punkt_fname, mode='wb') as f:
        pickle.dump(sbd, f, protocol=pickle.HIGHEST_PROTOCOL)

    return sbd
if __name__ == '__main__':
    punkt_fname = 'punkt_russian.pickle'

    # Reuse a previously trained tokenizer if one has been saved.
    try:
        with open(punkt_fname, mode='rb') as f:
            sent_detector = pickle.load(f)
    except (IOError, pickle.UnpicklingError):
        sent_detector = None

    if sent_detector is None:
        corpora = ['russian-1.txt', 'russian-2.txt']
        sent_detector = create_punkt_sent_detector(fnames=corpora,
                                                   punkt_fname=punkt_fname)

    tokenized_text = sent_detector.tokenize("some russian text.",
                                            realign_boundaries=True)
    print('\n'.join(tokenized_text))
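As a side note, if your training corpus is small enough to fit in memory, you can skip the incremental PunktTrainer pass entirely: the PunktSentenceTokenizer constructor also accepts raw training text directly (the file name below is just the one from the example above):

import nltk

with open('russian-1.txt', mode='rb') as f:
    text = f.read().decode('utf-8')
sent_detector = nltk.tokenize.punkt.PunktSentenceTokenizer(text)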