Related to this question, I am working on a program to extract the introductions of Wikipedia entities. As you can read in the link above, I have already succeeded in querying the API and am now focusing on processing the XML returned by the API call. I use NLTK for the processing, along these lines:
import nltk
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
wikiwords = nltk.word_tokenize(introtext)
for wikiword in wikiwords:
    wikiword = lemmatizer.lemmatize(wikiword.lower())
    ...
But with this I end up recording tokens like `</`, `/p`, `<`, and so on. Since I am not using the structure of the XML, simply ignoring all XML markup should work, I guess. Is there an NLTK tool for this, or is there a stopword list of such tokens available? I would just like to know what the best practice is.