I have ~2 million rows of text. Each row can be one or more sentences. I need to POS-tag the entire corpus. The corpus is a list of strings, for example:

corpus = ["I am awesome. I really am.", "The earth is round.", \
"What is the name of our planet? Is it Earth?"]

The corpus has around 2 million strings, which I'm reading out of a database.

Ideally, my POS-tagging code would look like this: tokenise each row into sentences, tokenise each sentence into words, and then tag the tokens:

from nltk import word_tokenize, sent_tokenize
from nltk.tag.perceptron import PerceptronTagger

# Load the averaged-perceptron tagger once and reuse it for every sentence
tagger = PerceptronTagger()

for item in corpus:
    # Split each row into sentences, tokenise each sentence, then tag the tokens
    for sentence in sent_tokenize(item):
        tags = tagger.tag(word_tokenize(sentence))

This, however, is extremely slow. I read about pos_tag_sents(), but I'm guessing that will take forever on 2 million data points in one shot. Is there any other, faster way of doing this? My main objective is to capture major word forms (nouns, verbs, question words, etc.), so I'm open to other POS taggers, provided they speed things up by at least 2-3x.
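For reference, this is roughly how I understand the pos_tag_sents() route would work: tokenise everything up front and hand the tagger one big batch. I haven't benchmarked this on the full corpus:

from nltk import word_tokenize, sent_tokenize, pos_tag_sents

# Tokenise every row into sentences and every sentence into words,
# then tag the whole batch in a single call
sentences = [word_tokenize(sentence)
             for item in corpus
             for sentence in sent_tokenize(item)]
tagged_sentences = pos_tag_sents(sentences)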

Prateek Dewan
  • I think `sent_tokenize` and `word_tokenize` take a lot of the time; maybe [this SO answer](https://stackoverflow.com/a/35348340/3210415) helps. – Nicolae Jun 01 '20 at 17:22
  • I've explored `split()` already. The issue there is that in a lot of cases the part-of-speech tagger doesn't work well when words have trailing punctuation. My corpus is user-entered text, so it is likely to contain all sorts of junk. I can explore `split()` with `rstrip()`; it might be faster. But I really want to believe there is a better way! – Prateek Dewan Jun 01 '20 at 17:54
  • Did you try first cleaning the sentences of any punctuation? e.g. `for i in range(len(corpus)): tokenizer = nltk.RegexpTokenizer(r"\w+") corpus[i] = " ".join(tokenizer.tokenize(corpus[i]))` – Nicolae Jun 01 '20 at 18:01
  • I did not... Let me try this out. Thanks! – Prateek Dewan Jun 01 '20 at 18:13
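
Spelling out the cleanup suggested in the comments above, with the `RegexpTokenizer` created once outside the loop since the pattern never changes:

from nltk.tokenize import RegexpTokenizer

# Keep only word characters, stripping the trailing punctuation
# that trips up the tagger on user-entered text
tokenizer = RegexpTokenizer(r"\w+")
corpus = [" ".join(tokenizer.tokenize(text)) for text in corpus]

One caveat: stripping punctuation this way also removes the sentence boundaries that `sent_tokenize` relies on.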

0 Answers