
I am trying to split financial documents into sentences. I have ~50,000 documents containing plain English text; the total file size is ~2.6 GB.

I am using NLTK's PunktSentenceTokenizer with the standard English pickle file. I additionally tweaked it by providing extra abbreviations, but the results are still not accurate enough.
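
For reference, this is roughly how I added the abbreviations (the abbreviation list here is only illustrative):

import nltk

# Load the pre-trained English Punkt model shipped with NLTK.
sent_tokenizer = nltk.data.load("tokenizers/punkt/english.pickle")

# Punkt keeps abbreviations lowercased and without the trailing period.
sent_tokenizer._params.abbrev_types.update(["approx", "etc", "e.g", "i.e"])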

Since NLTK's PunktSentenceTokenizer is based on the unsupervised algorithm by Kiss & Strunk (2006), I am trying to train the sentence tokenizer on my own documents, following the training data format for NLTK Punkt.

import nltk.tokenize.punkt
import pickle
import codecs

# Train a fresh Punkt model on the concatenated corpus and pickle it for reuse.
tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer()
text = codecs.open("someplain.txt", "r", "utf8").read()
tokenizer.train(text)
out = open("someplain.pk", "wb")
pickle.dump(tokenizer, out)
out.close()

Unfortunately, when running the code, I got an error saying that there is not sufficient memory. (Mainly because I had first concatenated all the files into one big file.)

Now my questions are:

  1. How can I train the algorithm batchwise, and would that lead to lower memory consumption?
  2. Can I use the standard English pickle file and continue training that already trained object?

I am using Python 3.6 (Anaconda 5.2) on Windows 10, on a machine with a Core i7-2600K and 16 GB of RAM.

colidyre
JumpinMD
  • Besides: it's better to use a `with` statement to open files; see the [docs](https://docs.python.org/3/tutorial/inputoutput.html#reading-and-writing-files). – colidyre Sep 03 '18 at 12:56

2 Answers


I found this question after running into this problem myself. I figured out how to train the tokenizer batchwise and am leaving this answer for anyone else looking to do this. I was able to train a PunktSentenceTokenizer on roughly 200 GB of biomedical text in around 12 hours, with a memory footprint never exceeding 20 GB. Nevertheless, I'd like to second @colidyre's recommendation to prefer other tools over the PunktSentenceTokenizer in most situations.

There is a class, PunktTrainer, which you can use to train the PunktSentenceTokenizer in a batchwise fashion:

from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

Suppose we have a generator that yields a stream of training texts:

texts = text_stream()

In my case, each iteration of the generator queries a database for 100,000 texts at a time, then yields all of these texts concatenated together.
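
If your corpus is a folder of plain-text files rather than a database, a generator along these lines should do the same job (the glob pattern and batch size below are only placeholders):

import glob

def text_stream(pattern="corpus/*.txt", batch_size=1000):
    """Yield batches of concatenated file contents instead of one huge string."""
    paths = sorted(glob.glob(pattern))
    for i in range(0, len(paths), batch_size):
        batch = []
        for path in paths[i:i + batch_size]:
            with open(path, "r", encoding="utf8") as f:
                batch.append(f.read())
        yield "\n\n".join(batch)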

We can instantiate a PunktTrainer and then begin training:

trainer = PunktTrainer()
for text in texts:
    trainer.train(text)
    trainer.freq_threshold()

Notice the call to the freq_threshold method after processing each text. This reduces the memory footprint by cleaning up information about rare tokens that are unlikely to influence future training.

Once this is complete, call the finalize_training method. Then you can instantiate a new tokenizer using the parameters found during training.

trainer.finalize_training()
tokenizer = PunktSentenceTokenizer(trainer.get_params())
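
If you want to use the result as a drop-in replacement for the standard pickle from the question, you can persist it the same way and then call tokenize() as usual (the file name and sample text below are only illustrative):

import pickle

# Save the trained tokenizer so it can be loaded later instead of retraining.
with open("punkt_financial.pk", "wb") as out:
    pickle.dump(tokenizer, out)

sentences = tokenizer.tokenize("Q3 revenue was approx. 1.2 bn USD. Costs fell by 3 pct.")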

@colidyre recommended using spaCy with added abbreviations. However, it can be difficult to know in advance which abbreviations will appear in your text domain. To get the best of both worlds, you can add the abbreviations found by Punkt. You can get a set of these abbreviations in the following way:

params = trainer.get_params()
abbreviations = params.abbrev_types
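
As a rough sketch (the model name and the way the abbreviations are fed to spaCy are my assumptions, not something Punkt gives you), you could register them as tokenizer special cases so spaCy keeps the trailing period attached, which may help its sentence boundary detection:

import spacy

nlp = spacy.load("en_core_web_sm")  # assumes this pipeline is installed

# Punkt stores abbreviations lowercased and without the final period,
# e.g. "approx" for "approx.", so re-append the period for spaCy.
# Note: special cases are case-sensitive; add capitalised variants if needed.
for abbrev in abbreviations:
    token = abbrev + "."
    nlp.tokenizer.add_special_case(token, [{"ORTH": token}])

doc = nlp("Revenue grew approx. 5% vs. the prior year. Margins held steady.")
print([sent.text for sent in doc.sents])
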
Albert Steppi

As described in the source code:

Punkt Sentence Tokenizer

This tokenizer divides a text into a list of sentences by using an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences. It must be trained on a large collection of plaintext in the target language before it can be used.

It is not very clear what a large collection really means. The paper gives no information about learning curves, i.e. when it is sufficient to stop the learning process because enough data has been seen. The Wall Street Journal corpus is mentioned there (it has approximately 30 million words). So it is unclear whether you can simply trim your training corpus and get away with a smaller memory footprint.

There is also an open issue on this topic that mentions needing 200 GB of RAM and more. As you can see there, NLTK probably does not have a good implementation of the algorithm presented by Kiss & Strunk (2006).

I cannot see how to batch it, as you can see from the signature of the train() method (NLTK version 3.3):

def train(self, train_text, verbose=False):
    """
    Derives parameters from a given training text, or uses the parameters
    given. Repeated calls to this method destroy previous parameters. For
    incremental training, instantiate a separate PunktTrainer instance.
    """

But there are probably more issues. For example, if you compare this signature of the released version 3.3 with the git-tagged version 3.3, there is a new parameter finalize which might be helpful and which hints at a possible batch process or a possible merge with an already trained model:

def train(self, text, verbose=False, finalize=True):
    """
    Collects training data from a given text. If finalize is True, it
    will determine all the parameters for sentence boundary detection. If
    not, this will be delayed until get_params() or finalize_training() is
    called. If verbose is True, abbreviations found will be listed.
    """

Anyway, I would strongly recommend against using NLTK's Punkt sentence tokenizer if you want to do sentence tokenization beyond the playground level. If you nevertheless want to stick with that tokenizer, I would simply recommend using the provided models and not training new models unless you have a server with a huge amount of RAM.

colidyre
  • Thanks for your suggestion. I wasn't aware of the finalize opportunity. I will give it a try. As you do not recommend NLTK for sentence segmentation, do you have any recommendation on how to accomplish this task? Can you recommend any libraries or methods? – JumpinMD Sep 05 '18 at 06:22
  • @JumpinMD You can take [spaCy](https://spacy.io/), for example, if you want an easy Python solution that gives you a lot of power and many more features without much effort. Simply load the corpus with a (dependency) parser and then you can iterate over the automatically detected sentences. That's only one of many solutions and could be a bit overpowered for your needs, but it's fast and production ready. If you want to do sentence tokenization only as a preprocessing step, you can use e.g. Stanford's tokenizer (but there are tons of other tokenizers out there -> search engine). – colidyre Sep 05 '18 at 11:00
  • I have already tried spaCy but wasn't able to add additional abbreviations. Without that, spaCy was quite bad. Do you know a way to do that? – JumpinMD Sep 05 '18 at 12:47
  • This is going off topic... But I think you can adjust your [vocabulary](https://spacy.io/api/vocab) (including abbreviations) for the [tokenization task](https://spacy.io/usage/linguistic-features#section-tokenization). [This](https://spacy.io/usage/linguistic-features#sbd-manual) might also be interesting for you. – colidyre Sep 05 '18 at 13:08
  • @JumpinMD did you find any resolution to this? – echan00 Nov 08 '18 at 23:32