I want to do some text mining on tweets. Is there a more specific stop word list for tweets, one that also removes terms like "lol" and Twitter smileys?
3 Answers
I guess you should merge an ordinary stop word list, like this one or that, with a specific acronym dictionary, e.g. this slang dictionary, or that, or that, or that (the last one seems to be the easiest to parse; see the comments here for the idea).
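The merge itself is just a set union. Here is a minimal sketch; the two small sets are hypothetical stand-ins for whichever general stop word list and slang dictionary you end up downloading, and `remove_stop_words` is a name I made up for illustration:

```python
# Hypothetical, tiny stand-ins for a real general-purpose stop word list
# and a real Twitter slang/acronym dictionary.
GENERAL_STOP_WORDS = {"the", "a", "is", "to", "and", "i"}
TWITTER_SLANG = {"lol", "omg", "rt", "brb"}

# Merge both sources into a single lookup set.
STOP_WORDS = GENERAL_STOP_WORDS | TWITTER_SLANG


def remove_stop_words(tokens):
    """Drop every token whose lowercase form is in the merged list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


print(remove_stop_words("LOL the match is amazing".split()))
# ['match', 'amazing']
```

Lowercasing before the lookup matters on Twitter, where "LOL", "Lol" and "lol" all occur.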

I'm not aware of a specific stop word list, but you can get a list of the most frequent single words here: http://clic.cimec.unitn.it/amac/twitter_ngram/ (download en.1grams.gz)
To detect and then ignore smileys, use: https://github.com/brendano/tweetmotif
You may also find these tools useful: https://github.com/willf/segment (if you want to segment hashtags) and https://github.com/amacinho/Rovereto-Twitter-Tokenizer (if you don't).
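If you only need to drop the most common Western-style smileys and don't want an extra dependency, a small regular expression may be enough. This is a rough sketch of my own, not how tweetmotif does it; the pattern and the name `strip_emoticons` are assumptions, and it only covers simple forms such as `:-)`, `;P` or `=(`:

```python
import re

# Simplified emoticon pattern (an assumption, far from complete):
# eyes (: ; =), an optional nose (- or '), then a mouth character.
EMOTICON_RE = re.compile(r"[:;=][-']?[)(\]\[DPp/\\|]")


def strip_emoticons(tokens):
    """Remove tokens that consist entirely of one simple emoticon."""
    return [t for t in tokens if not EMOTICON_RE.fullmatch(t)]


print(strip_emoticons("great game :-) see you ;P".split()))
# ['great', 'game', 'see', 'you']
```

Using `fullmatch` on whitespace-separated tokens keeps URLs like `http://...` intact, since they merely contain `:/` rather than consisting of it.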

I'm not aware of a Twitter-specific stop word list, but it is common practice to simply remove the n most frequent words from your analyses, where n could be 100, for example. Depending on what you would like to do, smileys may actually provide very relevant information.
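As a rough sketch of that frequency-based approach, assuming the tweets are already tokenized (the function names `top_n_stop_words` and `filter_tweets` are just illustrative):

```python
from collections import Counter


def top_n_stop_words(tokenized_tweets, n=100):
    """Return the n most frequent lowercase tokens across all tweets."""
    counts = Counter(
        tok.lower() for tweet in tokenized_tweets for tok in tweet
    )
    return {word for word, _ in counts.most_common(n)}


def filter_tweets(tokenized_tweets, stop_words):
    """Remove every token whose lowercase form is in stop_words."""
    return [
        [tok for tok in tweet if tok.lower() not in stop_words]
        for tweet in tokenized_tweets
    ]


# Toy corpus; with real data n would be much larger, e.g. 100.
tweets = [["lol", "the", "game"], ["lol", "the", "rain"], ["lol", "sun"]]
stop_words = top_n_stop_words(tweets, n=2)
print(filter_tweets(tweets, stop_words))
# [['game'], ['rain'], ['sun']]
```

Because the list is derived from your own corpus, it picks up Twitter-specific noise such as "lol" or "rt" automatically, but it will also drop any smiley or keyword that happens to be very frequent, so check the top n by eye before removing them.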

I am doing some retrieval on tweet data. I think smileys are meaningless for my retrieval job. – 陈家泽 Apr 30 '15 at 09:11