I am using the Twitter Streaming API to get real-time tweets, and I am checking lang. I extract hashtags from those tweets, but the problem is that when I extract them from the tweet text I get both English and non-English hashtags. Is there any way to extract only the English hashtags from a particular tweet text? This is my code for extracting hashtags once I have the tweet text:

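import java.util.HashSet;
import java.util.Set;

// Collects every whitespace-separated token that starts with '#'.
private String getHashTag(String tweetText) {
    // Split on any run of whitespace so tabs and newlines also separate words.
    String[] words = tweetText.split("\\s+");
    Set<String> hashtags = new HashSet<String>();
    for (String word : words) {
        if (word.startsWith("#")) {
            hashtags.add(word);
        }
    }
    return hashtags.toString();
}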
Anji

2 Answers

You should use Apache Tika and its API for language detection. This is an example:

import org.apache.tika.language.LanguageIdentifier;

LanguageIdentifier identifier = new LanguageIdentifier(word);
String language = identifier.getLanguage();

With this approach you can detect the language and therefore consider only English tweets.
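If a full language detector turns out to be too heavy, a cheaper approximation is to filter hashtags by script instead of by language. The following is only a minimal sketch, assuming that "English" can be approximated by hashtags made of ASCII letters, digits, and underscores. It will also accept unaccented hashtags from other Latin-script languages, so it is not real language detection. The class name EnglishHashtagFilter and the method extractAsciiHashtags are hypothetical names chosen for illustration.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EnglishHashtagFilter {

    // Assumption: an "English" hashtag is '#' followed only by ASCII
    // letters, digits, or underscores. The negative lookahead rejects
    // tags that continue with any other letter, mark, or digit, so a
    // tag such as "#café" is dropped rather than truncated to "#caf".
    private static final Pattern ASCII_HASHTAG =
            Pattern.compile("#[A-Za-z0-9_]++(?![\\p{L}\\p{M}\\p{N}_])");

    // Returns the ASCII-only hashtags in order of first appearance.
    public static Set<String> extractAsciiHashtags(String tweetText) {
        Set<String> hashtags = new LinkedHashSet<>();
        Matcher matcher = ASCII_HASHTAG.matcher(tweetText);
        while (matcher.find()) {
            hashtags.add(matcher.group());
        }
        return hashtags;
    }

    public static void main(String[] args) {
        String tweet = "Loving #Java and #日本語 and #spring_boot!";
        System.out.println(extractAsciiHashtags(tweet));
    }
}
```

Because this is a regular-expression scan rather than a split on spaces, trailing punctuation such as "!" is not included in the hashtag. Where accuracy matters more than speed, the surviving hashtags could still be passed to a language detector afterwards.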

giograno
  • But I think that if we use the Apache Tika API we will face performance issues, if I'm not wrong. – Anji Jan 13 '16 at 10:26
  • Language detection is not a lightweight task! I think this solution could be OK if your data set is not too massive. However, you could directly mine your tweets, selecting only the English ones. – giograno Jan 13 '16 at 10:32
What you want is to detect the language of a string. See this post: How to detect language of user entered text?

Community