Non-English characters (more specifically, Chinese and other East Asian strings) keep messing up my comparisons between two hashtables. I'm using Jsoup to load the link text of the a (anchor) tags from specified websites into hashtables, and the comparisons between them work for English text. However, when '??' is compared with '??', the two are marked as different, which I don't want. I would rather not load these strings into the hashtable at all, but they keep getting added.
What I've tried so far:
1. While loading the values from the site, I replace all question marks with blanks, and I have a condition that skips empty strings.
2. I check whether the string consists of Latin characters, but for some reason a lot of my inputs fail that check even though they read as English.
Code right now:
    private static void loadHTML(String url, Hashtable<String, Integer> updatedHash) {
        try {
            Document doc = Jsoup.connect(url).get();
            Elements containers = doc.select("a");
            for (Element c : containers) {
                // get text in the a href attribute tag
                String value = c.text().toLowerCase();
                // boolean valid = value.matches("\\p{L}+");
                value = value.replaceAll("\\?", " ");
                if (!value.isEmpty() && (value.length() < 30)) {
                    System.out.println(value);
                    // method for putting values in a specified hashtable
                    incrementValues(value, updatedHash);
                }
            }
        } catch (Exception e) {
            System.out.println(e);
            System.exit(1);
        }
    }
What happens right now: say containers has [hello, wow, cool, chineseCharacter, hello]. The program prints hello, wow, cool, ??, hello and then still adds ?? to the hashtable.
What I want: with containers: [hello, wow, cool, chineseCharacter, l], System.out.println(value) should print hello, wow, cool, , l and the hashtable keys must be only: hello, wow, cool, l
Simplified question: how can I determine whether a given string isn't English?
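To make the simplified question concrete, here is a minimal sketch of the kind of check I'm after. It assumes that "English" can be approximated as "non-empty and made up only of printable ASCII characters", which would also reject accented Latin text such as "café". The class and method names (EnglishFilter, isPlainAscii, keepPlainAscii) are hypothetical and not part of my real code. It also looks at the actual characters rather than at '?', since the '??' on my console may just be how the terminal renders characters it cannot display.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class EnglishFilter {
    // Hypothetical helper. Assumption: "English" means non-empty and
    // every char is printable ASCII (0x20 through 0x7E).
    static boolean isPlainAscii(String s) {
        if (s.isEmpty()) {
            return false;
        }
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            if (ch < 0x20 || ch > 0x7E) {
                return false;
            }
        }
        return true;
    }

    // Keeps only the values that pass the check, preserving their order.
    static List<String> keepPlainAscii(List<String> values) {
        List<String> kept = new ArrayList<>();
        for (String v : values) {
            if (isPlainAscii(v)) {
                kept.add(v);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // "\u4f60\u597d" stands in for the Chinese link text.
        List<String> containers = Arrays.asList("hello", "wow", "cool", "\u4f60\u597d", "l");
        System.out.println(keepPlainAscii(containers)); // prints [hello, wow, cool, l]
    }
}
```

In loadHTML, the idea would be to call something like isPlainAscii(value) in the if condition before incrementValues, instead of relying on the question-mark replacement.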
Thanks!