Spacy Stopwords based on Frequency

Question

I'm currently looking for an easy way to add custom stopwords to spaCy. These stopwords should be determined based on the absolute frequency of the words in the whole corpus. For example, in my domain-specific texts, the term "patient" should be considered a stopword because it occurs in 70% of all documents.

My first idea was to implement this with pandas apply, but that would require writing my own tokenizing function. Is there a way to customize spaCy instead?

Thank you for any advice

score 1 · Accepted Answer · answered Mar 25 '18 at 20:45

To add custom stopwords to spaCy, you can follow the solution given here: Add/remove stop words with spacy. To get a list of recommended stopwords automatically, you can use the NLTK package to calculate term frequency and document frequency (tf-idf), then define a threshold and treat every word above it as a stopword.
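As a rough illustration of the idea, here is a minimal sketch. It swaps NLTK for a plain collections.Counter to compute document frequency, and it assumes a 70% document-ratio threshold taken from the question. The helper names (frequent_terms, add_stop_words, tokenize_with_spacy) are made up for this example, and the spaCy part assumes that setting is_stop on the vocabulary entry and adding to nlp.Defaults.stop_words is the approach described in the linked answer.

```python
from collections import Counter


def frequent_terms(tokenized_docs, min_doc_ratio=0.7):
    """Return the terms that occur in at least min_doc_ratio of the documents."""
    doc_freq = Counter()
    for tokens in tokenized_docs:
        # Use a set so each term is counted once per document.
        doc_freq.update(set(tokens))
    n_docs = len(tokenized_docs)
    return {term for term, count in doc_freq.items()
            if count / n_docs >= min_doc_ratio}


def add_stop_words(nlp, words):
    """Flag the given words as stop words on a spaCy pipeline (hypothetical helper)."""
    for word in words:
        nlp.Defaults.stop_words.add(word)
        nlp.vocab[word].is_stop = True


def tokenize_with_spacy(texts):
    """Tokenize texts with a blank English pipeline; requires spaCy to be installed."""
    import spacy  # imported here so the functions above work without spaCy

    nlp = spacy.blank("en")
    docs = [[tok.lower_ for tok in doc if tok.is_alpha] for doc in nlp.pipe(texts)]
    return nlp, docs
```

With spaCy installed, usage could look like: nlp, docs = tokenize_with_spacy(texts), then add_stop_words(nlp, frequent_terms(docs, 0.7)). Lowercasing the tokens is a design choice of this sketch; drop it if case matters in your domain.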

If you have any doubts, don't hesitate to comment.

Good luck!
