7

I want to do some text mining on tweets. Is there a stop word list specific to tweets, one that also removes tokens such as "lol" and Twitter smileys?

Brian Tompsett - 汤莱恩
陈家泽

3 Answers

5

I would merge an ordinary stop word list, like this one or that, with a dictionary of Twitter-specific acronyms and slang, e.g. this slang dictionary, or that, or that, or that. The last one seems to be the easiest to parse; see the comments here for the idea.
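As a minimal sketch of the merge, assuming both lists have already been loaded into Python sets (the words below are hypothetical placeholders, not taken from any of the linked lists):

```python
# Hypothetical, tiny word lists; in practice load them from the files you chose.
GENERAL_STOP_WORDS = {"a", "an", "the", "is", "to", "and", "i", "on"}
TWITTER_SLANG = {"lol", "omg", "rt", "lmao", "brb"}

# The merged stop word list is simply the union of both sets.
TWEET_STOP_WORDS = GENERAL_STOP_WORDS | TWITTER_SLANG


def remove_stop_words(tweet, stop_words=TWEET_STOP_WORDS):
    """Lowercase, split on whitespace, and drop tokens found in stop_words."""
    tokens = tweet.lower().split()
    return [t for t in tokens if t not in stop_words]


print(remove_stop_words("LOL the match is on"))
```

Whitespace splitting is only a stand-in here; a tweet-aware tokenizer would handle punctuation, mentions and hashtags better.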

Nikita Astrakhantsev
3

I'm not aware of a Twitter-specific stop word list, but you can get a list of the most frequent single words (unigrams) here: http://clic.cimec.unitn.it/amac/twitter_ngram/ (download en.1grams.gz)

To detect and then ignore smilies use: https://github.com/brendano/tweetmotif

You may also find these tools useful: https://github.com/willf/segment (if you want to segment hashtags) and https://github.com/amacinho/Rovereto-Twitter-Tokenizer (if you don't).
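If you only need a rough filter for smileys, something like the following might be enough. This is not how tweetmotif works; it is a simplified regular expression I am assuming for illustration, and it only covers common western emoticons such as `:)`, `;-)` and `:D`.

```python
import re

# Assumed pattern: eyes, an optional nose, then a mouth.
# It deliberately ignores rarer forms (e.g. reversed or Unicode emoticons).
EMOTICON = re.compile(r"[:;=8][-o*']?[)(\[\]dDpP/\\|]")


def strip_emoticons(tweet):
    """Split on whitespace and drop tokens that consist solely of an emoticon."""
    return [tok for tok in tweet.split() if not EMOTICON.fullmatch(tok)]


print(strip_emoticons("great game :-) see you :D"))
```

Because `fullmatch` is used, a token is dropped only when the whole token is an emoticon, so ordinary words containing a colon are left alone.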

zelandiya
0

I'm not aware of a Twitter-specific stop word list, but it is common practice to simply remove the n most frequent words from your analyses, where n could be 100, for example. Depending on what you would like to do, smileys may actually provide very relevant information.
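A minimal sketch of that frequency-based approach, assuming whitespace tokenization and a toy corpus (the function name `top_n_words` and the example tweets are made up for illustration):

```python
from collections import Counter


def top_n_words(tweets, n):
    """Return the set of the n most frequent whitespace-delimited tokens."""
    counts = Counter(tok for tweet in tweets for tok in tweet.lower().split())
    return {word for word, _ in counts.most_common(n)}


tweets = [
    "the game was great",
    "the weather is nice today",
    "the bus is late",
]

# Treat the n most frequent tokens in the corpus as stop words.
frequent = top_n_words(tweets, n=2)
filtered = [[t for t in tw.lower().split() if t not in frequent] for tw in tweets]
print(filtered)
```

On a real corpus n would be much larger (such as the 100 suggested above), and it is worth inspecting the removed words first, since smileys or other informative tokens may end up among them.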

yvespeirsman