
I'm trying to tokenize a 32 MB file. I'm doing this with Google Colab and Visual Studio Code. It worked with a smaller file, but I would like to know how to do it in a feasible way with a bigger file (it has been running for more than an hour).

My code in Google Colab:

import nltk
nltk.download('punkt')
from nltk import word_tokenize
from google.colab import drive
drive.mount('/drive')

raw = open('../drive/MyDrive/NLTK/data.txt').read()

tokens = word_tokenize(raw)

Am I doing something wrong?

user799825
    you could try looping line by line like mentioned here - https://stackoverflow.com/a/48511849/12702651 – arjunsiva Sep 01 '22 at 12:58
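A minimal sketch of that line-by-line approach, assuming the same Drive path as in the question:

import nltk
nltk.download('punkt')
from nltk import word_tokenize

tokens = []
# Tokenize one line at a time instead of passing the whole 32 MB string
# to word_tokenize in a single call.
with open('../drive/MyDrive/NLTK/data.txt') as f:
    for line in f:
        tokens.extend(word_tokenize(line))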

1 Answer


You are reading and passing the entire file as a single string; it's not nltk's fault. In fact, nltk has (somewhat hidden) support for incremental reads, so you should be able to work with iterators.

Use a PlaintextCorpusReader to read your data, then retrieve it incrementally by looping over mycorpus.words(). This way you can even retrieve sentences (with sents()), which do span multiple lines. Do not collect a list of all the tokens unless you really need one. Iterating like this should do the right thing.
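A minimal sketch of that approach, assuming the Drive path from the question (note that by default PlaintextCorpusReader splits words with a regex-based WordPunctTokenizer rather than punkt's word_tokenize):

from nltk.corpus.reader import PlaintextCorpusReader

# Point the reader at the directory containing the file (path assumed from the question).
mycorpus = PlaintextCorpusReader('/drive/MyDrive/NLTK', ['data.txt'])

# words() and sents() return lazy corpus views, so tokens are produced
# incrementally instead of materializing the whole file in memory.
for token in mycorpus.words('data.txt'):
    pass  # process one token at a time

for sentence in mycorpus.sents('data.txt'):
    pass  # each sentence is a list of tokens and may span multiple lines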

alexis