I have a df with several thousand rows of text data. I'm using spaCy to do some NLP on a single column of that df and and am trying to remove proper nouns, stop words, and punctuation from my text data using the following:
tokens = []
lemma = []
pos = []
for doc in nlp.pipe(df['TIP_all_txt'].astype('unicode').values, batch_size=9845,
n_threads=3):
if doc.is_parsed:
tokens.append([n.text for n in doc if not n.is_punct and not n.is_stop and not n.is_space and not n.is_propn])
lemma.append([n.lemma_ for n in doc if not n.is_punct and not n.is_stop and not n.is_space and not n.is_propn])
pos.append([n.pos_ for n in doc if not n.is_punct and not n.is_stop and not n.is_space and not n.is_propn])
else:
tokens.append(None)
lemma.append(None)
pos.append(None)
df['s_tokens_all_txt'] = tokens
df['s_lemmas_all_txt'] = lemma
df['s_pos_all_txt'] = pos
df.head()
But I get this error and I'm not sure why:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-34-73578fd46847> in <module>()
6 n_threads=3):
7 if doc.is_parsed:
----> 8 tokens.append([n.text for n in doc if not n.is_punct and not n.is_stop and not n.is_space and not n.is_propn])
9 lemma.append([n.lemma_ for n in doc if not n.is_punct and not n.is_stop and not n.is_space and not n.is_propn])
10 pos.append([n.pos_ for n in doc if not n.is_punct and not n.is_stop and not n.is_space and not n.is_propn])
<ipython-input-34-73578fd46847> in <listcomp>(.0)
6 n_threads=3):
7 if doc.is_parsed:
----> 8 tokens.append([n.text for n in doc if not n.is_punct and not n.is_stop and not n.is_space and not n.is_propn])
9 lemma.append([n.lemma_ for n in doc if not n.is_punct and not n.is_stop and not n.is_space and not n.is_propn])
10 pos.append([n.pos_ for n in doc if not n.is_punct and not n.is_stop and not n.is_space and not n.is_propn])
AttributeError: 'spacy.tokens.token.Token' object has no attribute 'is_propn'
If I take out the not n.is_propn the code runs as expected. I've googled around and read the spaCy documentation, but haven't been able to find an answer thus far.