How to tell if a string is in a different language. (Not ASCII)

Question

I am using JSoup to grab some information from a different site. The information is in a different language and uses Arabic characters such as کور. I'm not 100% sure, but I think those are not ASCII characters. How can I tell whether a string is not ASCII (if I'm correct that it is not) and then grab that string?

EDIT: After using the Guava library and the piece of code from the answer, I get the following output.

Home New 215

Add Words

Statistics

About Us

Feedback

اردلی

انرکه

خونه

سرای

سرپناه

کور

ګمرک

The problem is that although the non-ASCII strings such as "کور" are being printed, ASCII strings such as " Feedback" are being printed as well.

Here is the code that I'm using.

import java.io.IOException;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import com.google.common.base.CharMatcher;

public class GrabLinks {

    public static void main(String[] args) {

        Document doc;
        PrintStream out = null;
        try {
            out = new PrintStream(System.out, true, "UTF-8");
        } catch (UnsupportedEncodingException e1) {
            // TODO Auto-generated catch block
            e1.printStackTrace();
        }

        try {
            // need http protocol
            doc = Jsoup.connect("http://thepashto.com/word.php?pashto=&english=house").get();

            // get page title
            String title = doc.title();
            //System.out.println("title : " + title);

            // get all links
            Elements links = doc.select("a[href]");
            for (Element link : links) {

                // get the value from href attribute
                //System.out.println("\nlink : " + link.attr("href"));
                //System.out.println("text : " + link.text());

                if (!CharMatcher.ASCII.matchesAllOf(link.text())) {

                    out.println(link.text());
                }
            }

        } catch (IOException e) {
            e.printStackTrace();
        }

    }
}

http://stackoverflow.com/questions/3585053/in-java-is-it-possible-to-check-if-a-string-is-only-ascii — raffian, Aug 27 '13 at 20:29
If you already have a `String`, it's either too late or irrelevant. What are you trying to do? Show us your code. — SLaks, Aug 27 '13 at 20:30
For reference, you are correct that these characters are not ASCII. They're part of [UTF-8](http://en.wikipedia.org/wiki/UTF-8). — Luke Willis, Aug 27 '13 at 20:36
Confusing question. It is possible to write text in a language that is not English using only ASCII characters. Similarly, evidence that text contains extended characters is not proof that the language is not English. Do you want an assessment of the language that is used? Or are you trying to find out which character sets or code pages are being used? — scottb, Aug 27 '13 at 20:38
I'd just use something that understands UTF-8. Where are you outputting the text? Console? File? Html? Anyhow, all of these should handle UTF-8 just fine. — Erik A. Brandstadmoen, Aug 27 '13 at 20:46
I'm reading it from a website using JSoup and then printing it out to the Console. — user2612619, Aug 27 '13 at 20:50
[Relevant information here](http://www.joelonsoftware.com/articles/Unicode.html). — Henry Keiter, Aug 27 '13 at 20:55
At best you can only make an educated guess. But you need to understand what ASCII is and isn't. It's the 7-bit subset that's at the heart of most "Roman" character encodings. It can be extended any number of ways, into "double-byte character" encodings using shift-in/shift-out characters, or into UTF-8 (8-bit Unicode) using a somewhat more complex encoding scheme. There are dozens of double-byte character sets (DBCS), but basically only one UTF-8. Since the above data is printing nicely without you having to set a code page it is most likely UTF-8. — Hot Licks, Aug 27 '13 at 21:16

Martin Seeler · Answer 1 · 2013-08-27T20:51:17.403

0

If you use Google's Guava library, you can check whether a String is ASCII with the CharMatcher.ASCII matcher (newer Guava versions expose it as CharMatcher.ascii()).

Here is an example of how to use it:

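import com.google.common.base.CharMatcher;

public class AsciiExample {

    public static void main(String[] args) {
        System.out.println(isASCIIString("کور")); // false
        System.out.println(isASCIIString("Hi")); // true
    }

    // True only if every character of pString is in the 7-bit ASCII range.
    public static boolean isASCIIString( String pString ) {
        return CharMatcher.ASCII.matchesAllOf(pString);
    }
}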
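
If pulling in Guava is not an option, roughly the same check can be written with the standard library alone. This is only a sketch, assuming Java 8 or later; the class name AsciiCheck and the method isAscii are made up for illustration. The \u escapes in main spell the word کور, so the source file compiles regardless of its encoding.

```java
class AsciiCheck {

    // True only if every UTF-16 code unit is below 0x80, i.e. 7-bit ASCII.
    // An empty string counts as ASCII here.
    static boolean isAscii(String text) {
        return text.chars().allMatch(c -> c < 0x80);
    }

    public static void main(String[] args) {
        System.out.println(isAscii("Hi"));                 // true
        System.out.println(isAscii("\u06A9\u0648\u0631")); // false
    }
}
```

As with the Guava version, a single non-ASCII character anywhere in the string, including an invisible one such as a non-breaking space (U+00A0), makes the check return false.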

EDIT:

This code only checks whether a string is ASCII. What appears in your terminal is a separate matter: System.out encodes text with the platform's default charset, which is not necessarily UTF-8 (for example, MacRoman on some Mac setups), so non-ASCII characters may show up as "?". To print your characters, this could help:

PrintStream out = new PrintStream(System.out, true, "UTF-8");
out.println(yourString);

edited Aug 27 '13 at 20:51

answered Aug 27 '13 at 20:30

Martin Seeler


You should probably mention how to use this as it isn't standard library. – DrYap Aug 27 '13 at 20:31
For some reason that prints the non ASCII and ASCII strings. I've updated the OP. – user2612619 Aug 27 '13 at 20:42
Sorry if I wasn't clear. I wasn't too concerned about the "?". If you look at the quoted output, "Feedback", "Home" and the like are being printed even though they are ASCII characters. I need only the non-ASCII strings to be printed. – user2612619 Aug 27 '13 at 20:54
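
One possible explanation, which nothing in this thread confirms, is that link texts such as " Feedback" contain a non-ASCII character that is hard to see, for example a non-breaking space (U+00A0). That alone would make an "is everything ASCII?" test fail. A stricter filter is to ask whether the text contains at least one Arabic-script character. The sketch below assumes Java 8 or later; the class name ArabicScriptFilter and the method containsArabicScript are hypothetical names chosen for illustration. The \u escapes spell کور.

```java
class ArabicScriptFilter {

    // True if at least one code point belongs to the Arabic script
    // (the script that Pashto letters such as U+06A9 and U+06AB fall under).
    static boolean containsArabicScript(String text) {
        return text.codePoints()
                .anyMatch(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.ARABIC);
    }

    public static void main(String[] args) {
        System.out.println(containsArabicScript("\u00A0Feedback"));     // false
        System.out.println(containsArabicScript("About Us"));           // false
        System.out.println(containsArabicScript("\u06A9\u0648\u0631")); // true
    }
}
```

In the loop from the question, the condition `!CharMatcher.ASCII.matchesAllOf(link.text())` could then be swapped for `containsArabicScript(link.text())`, which skips menu entries regardless of any stray whitespace characters they carry.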
