
I'm working on a Twitter sentiment analysis tool in C++. So far I fetch the tweets from Twitter and preprocess them a bit (lowercase them, remove "RT", remove # and URLs).

The next step is to remove emoticons and other special characters. How does one do that? Before you jump on me: I already looked at other similar questions, but none of them deal with C++. They are mostly about R, Python and PHP.

I was thinking of using regex, but I can't get it to work. I tried it for removing hashtags and URLs and gave up. I ended up using plain `std::string::find` and `find_first_of`.

Is there any library or method available to get rid of those emoticons and special characters?

Thanks

1 Answer


I would recommend using regular expressions for this. You have two options: you can either extract only the characters you are interested in (if you are working with English tweets, this would probably be A-Z, a-z, digits and maybe some symbols, depending on your needs), or you can match the invalid characters (emoticons) and replace them with an empty string.
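Both options could be sketched with the standard library's `<regex>` roughly like this. The function names `keep_allowed` and `strip_disallowed` and the allowed set (ASCII letters, digits and spaces) are just assumptions for illustration. Note that `std::regex` over `std::string` works byte by byte, so every byte of a UTF-8 emoji falls outside the allowed set and is dropped, but so are accented letters.

```cpp
#include <regex>
#include <string>

// Option 1: extract only the runs of characters we consider valid.
// The allowed set here (ASCII letters, digits, space) is an assumption.
std::string keep_allowed(const std::string& tweet) {
    static const std::regex allowed("[A-Za-z0-9 ]+");
    std::string out;
    for (std::sregex_iterator it(tweet.begin(), tweet.end(), allowed), end;
         it != end; ++it) {
        out += it->str();  // append each valid run, skipping what lies between
    }
    return out;
}

// Option 2: match everything outside the allowed set and replace it
// with an empty string.
std::string strip_disallowed(const std::string& tweet) {
    static const std::regex disallowed("[^A-Za-z0-9 ]+");
    return std::regex_replace(tweet, disallowed, "");
}
```

For the same allowed set, both functions should give the same result, so `strip_disallowed("so happy :)")` and `keep_allowed("so happy :)")` both leave only the letters and spaces. Option 2 is usually the shorter one to write.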

I only have experience with Qt's regular expression engine (`QRegularExpression`), but the C++ standard library has regex support as well (although I'm not sure how well it handles Unicode), and ICU provides a regex library too.

*I'd provide more links but I don't have enough reputation yet :/

Nicholas
  • Thanks. The problem with C++ regex is that I don't have any experience with it and I find it hard to work with. Plus, I don't know regex, especially for complicated stuff like emoticons and weird characters. – Alexandru Lucian Susma Jun 02 '16 at 14:04
  • What types of characters do you want to extract? If you just want alphanumeric characters, whitespace, and the @ and # symbols, a pattern like `[\s\w@#]` could work. I like to use [Rubular](http://rubular.com) or [Regex 101](https://regex101.com/) to test my regular expressions. – Nicholas Jun 03 '16 at 07:23
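
As a rough sketch of how the `[\s\w@#]` pattern from the last comment might be applied with `std::regex`: negating the class matches everything that should go, and `std::regex_replace` deletes it. The helper name `keep_words_mentions_tags` is made up for this example, and it assumes the default "C" locale, where `\w` covers only ASCII letters, digits and the underscore.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: removes every character that is NOT whitespace,
// a word character (letter, digit, underscore), '@' or '#'.
std::string keep_words_mentions_tags(const std::string& tweet) {
    // The raw string literal avoids having to double the backslashes.
    static const std::regex unwanted(R"([^\s\w@#]+)");
    return std::regex_replace(tweet, unwanted, "");
}
```

Calling `keep_words_mentions_tags("hi @bob!! #cpp")` should leave mentions and hashtags intact while dropping the punctuation.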