
I am trying to use NLTK to split Russian text into sentences, but it does not handle abbreviations and initials like А. И. Манташева and Я. Вышинский.

Instead, it splits the text like this:

организовывал забастовки и демонстрации, поднимал рабочих на бакинских предприятиях А.

И.

Манташева.

It did the same when I used russian.pickle from https://github.com/mhq/train_punkt .
Is this a general NLTK limitation or something language-specific?

  • `pickle` is just a serialization module from Python stdlib. It doesn't split your text on sentence boundaries. – jfs Dec 30 '12 at 06:22
  • I thought it was for training purposes. Anyway, where does the issue come from? I mean, why aren't the initials split correctly? All of the other sentences are split correctly. – user1870840 Dec 30 '12 at 06:29
  • Your question is incredibly unclear. Like @J.F. said, `pickle` is just serialization--it just spits out whatever was put in. – jdotjdot Dec 30 '12 at 06:47
  • related: [can NLTK/pyNLTK work “per language” (i.e. non-english), and how?](http://stackoverflow.com/q/1795410). It seems you just need to provide an appropriate training set if PunktSentenceTokenizer can understand initials at all. – jfs Dec 30 '12 at 06:58
  • @J.F.Sebastian How can I give it training data in Russian? I was under the impression that NLTK saved the Russian-trained Punkt model as a pickle. – user1870840 Dec 30 '12 at 14:33
  • @user1870840: I don't know where to get training data for Russian. There is no `russian.pickle` available via `nltk.download()`, but, for example, `t = nltk.data.load('tokenizers/punkt/english.pickle')` works, i.e., it returns a PunktSentenceTokenizer. By the way, that tokenizer also fails on some initials. [The link that you provided](https://github.com/mhq/train_punkt) shows how you could train it on your own data. – jfs Dec 30 '12 at 15:16

2 Answers


As some of the comments hinted, what you want is the Punkt sentence segmenter/tokenizer.

NLTK or language-specific?

Neither. As you have realized, you cannot simply split on every period. NLTK comes with Punkt segmenters trained on several different languages. However, if you're having issues, your best bet is to use a larger training corpus for the Punkt tokenizer to learn from.
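
If retraining is more than you need, a lighter-weight workaround is to tell Punkt about the abbreviations you already know. The snippet below is only a minimal sketch, assuming the problem is limited to a fixed set of single-letter initials (Punkt stores abbreviations lowercased and without the trailing period):

from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

# Declare the single-letter initials from the question as known abbreviations.
params = PunktParameters()
params.abbrev_types = {'а', 'и', 'я'}  # extend with any other initials you expect

tokenizer = PunktSentenceTokenizer(params)
text = ("организовывал забастовки и демонстрации, "
        "поднимал рабочих на бакинских предприятиях А. И. Манташева.")
for sentence in tokenizer.tokenize(text):
    print(sentence)

A tokenizer built this way has no learned statistics at all, so for anything beyond a known list of initials you will still want to train on a real corpus, as in the Sample Implementation below.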

Documentation Links

Sample Implementation

Below is part of the code to point you in the right direction. You should be able to do the same for yourself by supplying your own Russian text files. One possible source is a Russian-language Wikipedia database dump, but I leave that as a secondary problem for you.

import logging
try:
    import cPickle as pickle
except ImportError:
    import pickle
import nltk


def create_punkt_sent_detector(fnames, punkt_fname, progress_count=None):
    """Makes a pass through the corpus to train a Punkt sentence segmenter.

    Args:
        fnames: List of filenames to be used for training.
        punkt_fname: Filename to save the trained Punkt sentence segmenter.
        progress_count: If given, log progress every `progress_count` documents.
    """
    logger = logging.getLogger('create_punkt_sent_detector')

    punkt = nltk.tokenize.punkt.PunktTrainer()

    logger.info("Training punkt sentence detector")

    doc_count = 0
    try:
        for fname in fnames:
            with open(fname, mode='rb') as f:
                # Punkt expects unicode text, so decode the UTF-8 bytes.
                punkt.train(f.read().decode('utf-8'), finalize=False, verbose=False)
                doc_count += 1
                if progress_count and doc_count % progress_count == 0:
                    logger.debug('Pages processed: %i', doc_count)
    except KeyboardInterrupt:
        print('KeyboardInterrupt: Stopping the reading of the dump early!')

    logger.info('Now finalizing Punkt training.')

    punkt.finalize_training(verbose=True)
    learned = punkt.get_params()
    sbd = nltk.tokenize.punkt.PunktSentenceTokenizer(learned)
    with open(punkt_fname, mode='wb') as f:
        pickle.dump(sbd, f, protocol=pickle.HIGHEST_PROTOCOL)

    return sbd


if __name__ == '__main__':
    punkt_fname = 'punkt_russian.pickle'
    try:
        with open(punkt_fname, mode='rb') as f:
            sent_detector = pickle.load(f)
    except (IOError, pickle.UnpicklingError):
        sent_detector = None

    if sent_detector is None:
        corpora = ['russian-1.txt', 'russian-2.txt']
        sent_detector = create_punkt_sent_detector(fnames=corpora,
                                                   punkt_fname=punkt_fname)

    tokenized_text = sent_detector.tokenize("some russian text.",
                                            realign_boundaries=True)
    print('\n'.join(tokenized_text))
Wesley Baugh
  • Thanks for the great script on modifying the `PunktSentTokenize`, do you know how I could (1) add abbr to the Punkt abbr parameter and (2) train a word_tokenizer like how you've trained the sent_tokenizer? – alvas Nov 07 '13 at 11:30

You can use the pre-trained Russian sentence tokenizer from https://github.com/Mottl/ru_punkt , which can deal with Russian name initials and abbreviations.

text = ("организовывал забастовки и демонстрации, ",
        "поднимал рабочих на бакинских предприятиях А.И. Манташева.")
print(tokenizer.tokenize(text))

Output:

['организовывал забастовки и демонстрации, поднимал рабочих на бакинских предприятиях А.И. Манташева.']
Dmitry Mottl