I have ~2 million rows of text. Each row can be one or more sentences. I need to POS tag the entire corpus. The corpus is a list of strings, for example:
corpus = ["I am awesome. I really am.",
          "The earth is round.",
          "What is the name of our planet? Is it Earth?"]
The corpus has around 2 million strings, which I'm reading out of a database.
My ideal code for POS tagging looks like this, where I tokenise sentences, then tokenise words, and then POS tag:
from nltk import word_tokenize, sent_tokenize
from nltk.tag.perceptron import PerceptronTagger

# Load the averaged-perceptron tagger once, up front, rather than per call.
tagger = PerceptronTagger()

for item in corpus:
    # Split each row into sentences, tokenise each sentence, then tag it.
    for sentence in sent_tokenize(item):
        tags = tagger.tag(word_tokenize(sentence))
This, however, is extremely slow. I read about pos_tag_sents(), but I'm guessing that will take forever on 2 million data points in one shot. Is there a faster way to do this? My main objective is to capture the major word forms (nouns, verbs, question words, etc.), so I'm open to other POS taggers, provided they speed things up by at least 2-3x.
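For reference, the pos_tag_sents() version I had in mind is something like this (an untested sketch: it flattens the corpus into a list of tokenised sentences and tags them in a single call):

from nltk import word_tokenize, sent_tokenize, pos_tag_sents

# Flatten all rows into one list of tokenised sentences.
sentences = [word_tokenize(sentence)
             for item in corpus
             for sentence in sent_tokenize(item)]

# Tag everything in one shot -- this is the call I worry won't scale.
tagged = pos_tag_sents(sentences)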