I want to create, for each row in my pandas df, a list containing only the alphanumeric, non-stopword tokens. Each element yielded by df.text.iteritems() is a complete text. I ran the code below, but it is taking way longer than I expected. Any clever solutions? Thanks in advance.
import pandas as pd
from nltk import tokenize
from nltk.corpus import stopwords

tokens_linhas = []
for index, value in df.text.iteritems():
    tokens = tokenize.word_tokenize(value)
    tokens_sem_pont = [token for token in tokens if token.isalnum() and (token not in stopwords.words('english'))]
    tokens_linhas.append(tokens_sem_pont)
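The main cost in the loop above is that stopwords.words('english') rebuilds the whole stopword list for every single token; building a set once outside the loop makes the membership test O(1). A minimal sketch of that idea, using a tiny hardcoded stopword set and str.split as stand-ins for NLTK's corpus and word_tokenize (so it runs without NLTK downloads):

```python
import pandas as pd

# Stand-in stopword set. With NLTK you would build this ONCE, outside any loop:
#   STOPWORDS = set(stopwords.words('english'))
STOPWORDS = {"the", "a", "is", "in", "and"}

def clean_tokens(text):
    # str.split stands in for tokenize.word_tokenize here
    return [t for t in text.split() if t.isalnum() and t not in STOPWORDS]

df = pd.DataFrame({"text": ["the cat is in the hat", "a dog and a bone!"]})
# apply builds one list per row, so the result aligns with the index
df["tokens"] = df["text"].apply(clean_tokens)
```

Using apply also sidesteps the manual tokens_linhas bookkeeping, since the result is guaranteed to have one entry per row.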
EDIT:
The code just finished, and it failed** when I ran df['tokens'] = tokens_linhas.
The version below was working just fine:
for index, value in df.text.iteritems():
    tokens = tokenize.word_tokenize(value)
    tokens_sem_pont = [token for token in tokens if token.isalnum()]
    tokens_linhas.append(tokens_sem_pont)
** ValueError: Length of values (508) does not match length of index (251)
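A likely cause of that ValueError (an assumption, since the session state isn't shown) is that tokens_linhas was created once and the loop was then run more than once, e.g. in a notebook, so the list accumulated roughly two passes' worth of rows (508 vs. 251). A sketch of the defensive pattern, with plain str.split standing in for word_tokenize:

```python
import pandas as pd

df = pd.DataFrame({"text": ["first row here", "second row here"]})

tokens_linhas = []  # re-initialise right before the loop so reruns don't accumulate
for index, value in df["text"].items():  # items() replaces the deprecated iteritems()
    tokens_linhas.append([t for t in value.split() if t.isalnum()])

assert len(tokens_linhas) == len(df)  # lengths must match before assignment
df["tokens"] = tokens_linhas
```

Keeping the initialisation and the loop in the same cell (or the same function) guarantees one appended list per row, so the assignment's length check passes.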