
I want to create, for each row in my pandas df, a list containing only alphanumeric, non-stopword tokens. Each element in df.text.iteritems() is a complete text. I ran the code below, but it is taking much longer than I expected. Any clever solutions? Thanks in advance

import pandas as pd
from nltk import tokenize
from nltk.corpus import stopwords

tokens_linhas = []
for index, value in df.text.iteritems():
    tokens = tokenize.word_tokenize(value)
    tokens_sem_pont = [token for token in tokens if token.isalnum() and (token not in stopwords.words('english'))]
    tokens_linhas.append(tokens_sem_pont)

EDIT: The code just finished, and it failed** when I assign df['tokens'] = tokens_linhas. The version below was working just fine.

for index, value in df.text.iteritems():
    tokens = tokenize.word_tokenize(value)
    tokens_sem_pont = [token for token in tokens if token.isalnum()]
    tokens_linhas.append(tokens_sem_pont)

** ValueError: Length of values (508) does not match length of index (251)

  • Does [How to apply NLTK word_tokenize library on a Pandas dataframe for Twitter data?](https://stackoverflow.com/questions/44173624/how-to-apply-nltk-word-tokenize-library-on-a-pandas-dataframe-for-twitter-data) answer your question? – wwii May 19 '21 at 15:15
  • It doesn't solve the problem, unless I can somehow use the apply method to pass isalnum() and check whether each token is a stopword. But I'll use apply for the tokenizer prior to the loop anyway, which still saves some time. – Pedro Kaneto Suzuki May 19 '21 at 16:40

0 Answers