I am using JSoup to grab some information from another site. The information is in a different language and is written in Arabic characters, such as کور. I'm not 100% sure, but I think those are not ASCII characters. How can I tell whether a string is non-ASCII (if I'm correct that it is not) and then grab only those strings?
EDIT: After using the Guava library and the piece of code, I get the following output:
Home New 215
Add Words
Statistics
About Us
Feedback
اردلی
انرکه
خونه
سرای
سرپناه
کور
ګمرک
The problem is that although the non-ASCII strings such as "کور" are being printed, strings that look like plain ASCII, such as " Feedback", are being printed as well.
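One way to narrow this down is sketched below, under an assumption I have not confirmed: the leading blank in " Feedback" may not be a plain space but something like a non-breaking space (U+00A0), which lies outside ASCII and would therefore make an "all ASCII" test fail. The class `AsciiProbe` and its two methods are hypothetical helpers, not part of JSoup or Guava; they use only the Java 8+ standard library and list which code points in a string are non-ASCII.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical diagnostic helper; not part of JSoup or Guava.
class AsciiProbe {

    // True when every code point is in the ASCII range U+0000..U+007F.
    static boolean isAscii(String s) {
        return s.codePoints().allMatch(cp -> cp < 0x80);
    }

    // Describes each non-ASCII code point as "U+XXXX (Unicode name)".
    static List<String> describeNonAscii(String s) {
        List<String> found = new ArrayList<>();
        s.codePoints()
         .filter(cp -> cp >= 0x80)
         .forEach(cp -> found.add(
                 String.format("U+%04X (%s)", cp, Character.getName(cp))));
        return found;
    }

    public static void main(String[] args) {
        // Assumption: the link text begins with U+00A0 rather than a plain space.
        String suspect = "\u00A0Feedback";
        System.out.println(isAscii(suspect));
        System.out.println(describeNonAscii(suspect));
    }
}
```

Calling `AsciiProbe.describeNonAscii(link.text())` inside the loop for each printed link would show whether a hidden character is what lets the English menu entries through.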
Here is the full code that I'm using:
import java.io.IOException;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import com.google.common.base.CharMatcher;

public class GrabLinks {

    public static void main(String[] args) {
        Document doc;
        PrintStream out = null;
        try {
            out = new PrintStream(System.out, true, "UTF-8");
        } catch (UnsupportedEncodingException e1) {
            // TODO Auto-generated catch block
            e1.printStackTrace();
        }
        try {
            // need http protocol
            doc = Jsoup.connect("http://thepashto.com/word.php?pashto=&english=house").get();
            // get page title
            String title = doc.title();
            //System.out.println("title : " + title);
            // get all links
            Elements links = doc.select("a[href]");
            for (Element link : links) {
                // get the value from href attribute
                //System.out.println("\nlink : " + link.attr("href"));
                //System.out.println("text : " + link.text());
                if (!CharMatcher.ASCII.matchesAllOf(link.text())) {
                    out.println(link.text());
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
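A possible alternative to the "not all ASCII" test in the loop above, offered as a sketch only: check for the Arabic script directly with the standard `Character.UnicodeScript` API, so that a stray non-ASCII character in an otherwise English link text does not count. `ScriptFilter` and `containsArabicScript` are hypothetical names, and the sample strings are assumptions based on the output shown in the question (the words are written as `\u` escapes to keep the source file ASCII-only).

```java
// Hypothetical helper; not part of JSoup or Guava.
class ScriptFilter {

    // True when at least one code point belongs to the Unicode Arabic script.
    static boolean containsArabicScript(String s) {
        return s.codePoints()
                .anyMatch(cp -> Character.UnicodeScript.of(cp)
                        == Character.UnicodeScript.ARABIC);
    }

    public static void main(String[] args) {
        String[] samples = {
            "\u00A0Feedback",        // assumed: English text with a non-breaking space
            "About Us",              // plain ASCII
            "\u06A9\u0648\u0631",    // Arabic-script word from the question's output
        };
        for (String text : samples) {
            // Only strings with Arabic-script letters pass the filter.
            if (containsArabicScript(text)) {
                System.out.println(text);
            }
        }
    }
}
```

In the loop, the condition would then read `if (ScriptFilter.containsArabicScript(link.text()))` in place of the `CharMatcher` test. This assumes the goal is to keep only link texts that contain Arabic-script letters, which may be narrower than "any non-ASCII string".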