How to check the language of parsed data in Java

Question

I am parsing a Google news service that returns data in multiple languages, but I want data in English only. How can I make sure the language is English? Please suggest.

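// Needs java.net.URL, java.net.URLConnection, java.io.BufferedReader,
// java.io.InputStreamReader, java.nio.charset.StandardCharsets and a JSON library.
String url = "https://newsapi.org/v2/top-headlines?sources=google-news&apiKey=89c8009165774e0fad3742f78b50c6da";

URL url1 = new URL(url);
URLConnection uc = url1.openConnection();

// Decode explicitly as UTF-8 so non-Latin text is not garbled by the platform default charset.
InputStreamReader input = new InputStreamReader(uc.getInputStream(), StandardCharsets.UTF_8);
BufferedReader in = new BufferedReader(input);

String inputLine;
StringBuilder fullline = new StringBuilder();

// Read the whole response body into one string.
while ((inputLine = in.readLine()) != null) {
    fullline.append(inputLine);
}
in.close();

JSONObject rootObject = new JSONObject(fullline.toString());

JSONArray rows1 = (JSONArray) rootObject.get("articles");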

Sample data is:

{
  "status": "ok",
  "totalResults": 1100,
  "articles": [
    {
      "source": {
        "id": null,
        "name": "Ua-football.com"
      },
      "author": "Спорт.ua",
      "title": "Шаран може покинути Олександрію після закінчення цього сезону",
      "description": "Контракт наставника закінчується влітку",
      "url": "https://www.ua-football.com/ua/ukrainian/high/1521800049-sharan-mozhe-pokinuti-oleksandriyu-pislya-zakinchennya-cogo-sezonu.html",
      "urlToImage": "https://static.ua-football.com/img/upload/18/24f537.jpeg",
      "publishedAt": "2018-03-23T10:22:21Z"
    },
    {
      "source": {
        "id": null,
        "name": "Nikkansports.com"
      },
      "author": null,
      "title": "西武菊池雄星、開幕へ万全 ＯＰ戦ラス投５回無失点",
      "description": null,
      "url": "https://www.nikkansports.com/baseball/news/201803230000734.html",
      "urlToImage": null,
      "publishedAt": "2018-03-23T10:20:46Z"
    },
    {
      "source": {
        "id": null,
        "name": "Siol.net"
      },
      "author": null,
      "title": "Picomat na Koroškem je postal prava atrakcija #video",
      "description": "Slovenj Gradec se je pred kratkim obogatil s pridobitvijo, s katero se lahko pohvalita tudi Dubaj in Dunaj. Na Koroškem je za pravo revolucijo poskrbel picomat, ki je postal pravi magnet za odrasle in mladino. Uporabniki morajo samo pritisniti na gumb in svež…",
      "url": "https://siol.net/trendi/kulinarika/picomat-na-koroskem-je-postal-prava-atrakcija-video-463111",
      "urlToImage": "https://siol.net/media/img/9a/c1/8b62129ba4efcbf0faf9-picomat.jpeg",
      "publishedAt": "2018-03-23T10:19:56Z"
    },
    {
      "source": {
        "id": null,
        "name": "Nikkansports.com"
      },
      "author": null,
      "title": "明秀日立・金沢監督「勝ちに不思議な勝ちあり」"
    }
  ]
}

You can call `String#matches(".*\\b\\w+\\b.*")` and filter out a result if the regex doesn't match. That would at least get rid of non-Latin strings such as Chinese and Japanese. – Impulse The Fox, Mar 23 '18 at 11:00
Thanks for your reply. But I don't want any language other than English; with this regex, French and other languages similar to English won't be filtered out. – Tanu Garg, Mar 23 '18 at 11:06
Maybe this could be interesting for you: https://cloud.google.com/translate/docs/detecting-language#translate-detect-language-java – Impulse The Fox, Mar 23 '18 at 11:13
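
The script-based filtering idea from the first comment can also be written without a regex. The following is only a minimal sketch using the JDK's `Character.UnicodeScript`; the class and method names (`ScriptFilter`, `isLatinScript`) are made up for illustration. As the follow-up comment points out, this only removes text in other scripts: Latin-script languages such as Slovenian or French still pass.

```java
import java.lang.Character.UnicodeScript;

class ScriptFilter {
    /**
     * Returns true if the text contains at least one letter and
     * every letter in it belongs to the Latin script.
     */
    static boolean isLatinScript(String text) {
        boolean sawLetter = false;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (!Character.isLetter(cp)) {
                continue; // ignore digits, punctuation and whitespace
            }
            sawLetter = true;
            if (UnicodeScript.of(cp) != UnicodeScript.LATIN) {
                return false; // Cyrillic, Han, Hiragana, ...
            }
        }
        return sawLetter;
    }

    public static void main(String[] args) {
        // Titles taken from the sample data in the question.
        System.out.println(isLatinScript("Шаран може покинути Олександрію")); // false (Cyrillic)
        System.out.println(isLatinScript("明秀日立・金沢監督"));                  // false (Han)
        System.out.println(isLatinScript("Picomat na Koroškem je postal prava atrakcija")); // true, although Slovenian
    }
}
```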

score 1 · Answer 1 · answered Mar 23 '18 at 12:10

You are looking for a way to identify the language of a text, which is a hard problem to solve.

You will most likely need to integrate a library or rely on a third-party API.

There are useful links here. You may also use the IBM Watson API.

Andrew


score 1 · Answer 2 · answered Mar 23 '18 at 12:43

Work with word frequencies. Take the most frequent English words, find out what percentage of a normal text they make up, and check whether the given text reaches that percentage.

public boolean isEnglish(String text) {
    Set<String> mostFrequentWords = new HashSet<>();
    Collections.addAll(mostFrequentWords,
        "the", "of", "and", "a", "to", "in", "is", "be", "that", "was", "he", "for",
        "it", "with", "as", "his", "i", "on", "have", "at", "by", "not", "they",
        "this", "had", "are", "but", "from", "or", "she", "an", "which", "you", "one",
        "we", "all", "were", "her", "would", "there", "their", "will", "when", "who",
        "him", "been", "has", "more", "if", "no", "out", "do", "so", "can", "what",
        "up", "said", "about", "other", "into", "than", "its", "time", "only", "could",
        "new", "them", "man", "some", "these", "then", "two", "first", "may", "any",
        "like", "now", "my");

    int wordCount = 0;
    int hits = 0;

    Pattern wordPattern = Pattern.compile("\\b\\p{L}+\\b");
    Matcher m = wordPattern.matcher(text);
    while (m.find() && wordCount < 100) {
        String word = m.group().toLowerCase(Locale.ENGLISH);
        ++wordCount;
        if (mostFrequentWords.contains(word)) {
            ++hits;
        }
    }
    // At least 30 percent; a text without any words is not English (avoids division by zero)
    return wordCount > 0 && hits * 100 / wordCount >= 30;
}

Non-Latin text can also be detected like this:

String ascii = text.replaceAll("\\P{ASCII}", "");
if (!text.isEmpty() && (text.length() - ascii.length()) * 100 / text.length() > 10) {
    return false; // More than 10% non-ASCII
}

Note that some punctuation, such as typographic quotes, bullet points, and dashes, is not ASCII. Neither are loanwords like mañana and façade.
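
As a rough usage sketch, the two checks above could be combined and applied to a list of titles. Everything named here (`EnglishFilter`, `looksEnglish`, `keepEnglish`, the shortened word list, and the English example sentence) is an assumption made for illustration, not part of the original answer. Keep in mind that the 30 percent threshold is aimed at running prose; short headlines with few function words may be rejected even when they are English.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class EnglishFilter {
    // Shortened word list for illustration; the answer's code has the longer one.
    private static final Set<String> COMMON = new HashSet<>(Arrays.asList(
        "the", "of", "and", "a", "to", "in", "is", "be", "that", "was", "he", "for",
        "it", "with", "as", "on", "have", "at", "by", "not", "this", "are", "from",
        "will", "said"));
    private static final Pattern WORD = Pattern.compile("\\p{L}+");

    static boolean looksEnglish(String text) {
        if (text == null || text.isEmpty()) {
            return false;
        }
        // Cheap pre-check: reject text with more than 10% non-ASCII characters.
        String ascii = text.replaceAll("\\P{ASCII}", "");
        if ((text.length() - ascii.length()) * 100 / text.length() > 10) {
            return false;
        }
        // Count how many of the first 100 words are very common English words.
        int words = 0;
        int hits = 0;
        Matcher m = WORD.matcher(text);
        while (m.find() && words < 100) {
            words++;
            if (COMMON.contains(m.group().toLowerCase(Locale.ENGLISH))) {
                hits++;
            }
        }
        return words > 0 && hits * 100 / words >= 30;
    }

    static List<String> keepEnglish(List<String> titles) {
        return titles.stream().filter(EnglishFilter::looksEnglish).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> titles = Arrays.asList(
            "Шаран може покинути Олександрію після закінчення цього сезону",
            "Picomat na Koroškem je postal prava atrakcija #video",
            "The president said that he will not sign the bill"); // made-up English example
        System.out.println(keepEnglish(titles)); // only the last title remains
    }
}
```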

score -1 · Answer 3 · answered Mar 23 '18 at 19:04

I found a solution that works as expected.

private static boolean isEnglish(String text) {
    CharsetEncoder asciiEncoder = Charset.forName("US-ASCII").newEncoder();
    CharsetEncoder isoEncoder = Charset.forName("ISO-8859-1").newEncoder();
    return asciiEncoder.canEncode(text) || isoEncoder.canEncode(text);
}

Tanu Garg


That likely wouldn't work as you can have English in `UTF-8`, which tends to be a standard code page. – Simon O'Doherty Mar 24 '18 at 05:22
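
A small experiment suggests where the encoder check misclassifies text. This is only a sketch, and the class and method names (`EncoderCheckDemo`, `fitsLatin1`) are hypothetical. Since US-ASCII is a subset of ISO-8859-1, the answer's test reduces to "can the text be encoded as ISO-8859-1", which is a statement about characters rather than about language.

```java
import java.nio.charset.StandardCharsets;

class EncoderCheckDemo {
    // Equivalent to the answer's check: every ASCII string is also encodable as ISO-8859-1.
    static boolean fitsLatin1(String text) {
        return StandardCharsets.ISO_8859_1.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        // French, but every character exists in ISO-8859-1, so it is accepted.
        System.out.println(fitsLatin1("Le café est fermé"));            // true
        // English, but the curly quotes (U+201C, U+201D) are outside ISO-8859-1.
        System.out.println(fitsLatin1("\u201CHello,\u201D she said"));   // false
    }
}
```

So the check passes Western European languages and rejects English text that contains typographic punctuation.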
