Non-English characters (more specifically, Chinese and other East Asian strings) keep messing up my comparisons between two hashtables. I'm using Jsoup to load the text of HTML anchor (a href) tags from specified websites into hashtables, and the comparisons between them work for English text. However, when one '??' is compared with another '??', they are marked as different, which I don't want. I would rather not load such strings into the hashtable at all, but they keep getting added.

What I've tried so far:

1. When loading the values from the site, I replace all question marks with blanks, and I have a condition that skips empty strings.
2. Checking whether the string consists only of Latin letters, but for some reason many of my inputs fail that check even though they read as English.

Code right now:

private static void loadHTML(String url, Hashtable<String, Integer> updatedHash){
    try{
        Document doc = Jsoup.connect(url).get(); 
        Elements containers = doc.select("a");
        for (Element c: containers){
            //get text in the a href attribute tag
            String value = c.text().toLowerCase();

            //boolean valid = value.matches("\\p{L}+");

            value = value.replaceAll("\\?", " ");
            if(!value.isEmpty() && (value.length() < 30)){
                System.out.println(value);
                //method for putting values in a specified hashtable
                incrementValues(value,updatedHash);
            }
        }
    } catch (Exception e){
        System.out.println(e);
        System.exit(1);
    }
}

What happens right now: say containers holds [hello, wow, cool, chineseCharacter, hello]. The program prints: hello, wow, cool, ??, hello, and then still adds ?? to the hashtable.

What I want: for containers [hello, wow, cool, chineseCharacter, l], System.out.println(value) should print: hello, wow, cool, , l, and the hashtable keys must only be: hello, wow, cool, l.

Simplified question: how can I determine whether a given string isn't English?

Thanks!

Maxxy
  • Welcome to the wonderful world of [string encoding](https://www.w3.org/International/questions/qa-what-is-encoding) – litelite Jul 26 '17 at 20:01
  • My guess is that the text isn't *really* "??", but it's just "non-ASCII characters" that end up being displayed as "??" however you're displaying them (which we don't know). – Jon Skeet Jul 26 '17 at 20:02
  • I would concentrate on separating out the hash map part from the "reading the right text to start with" part. – Jon Skeet Jul 26 '17 at 20:02
  • Is the charset specified in the offending web page? If not, from what I can tell, jsoup will default to the platform encoding (UTF-8, I'm guessing), so if the web page is encoded in something other than UTF-8 you're going to have problems. See [this](https://stackoverflow.com/questions/7703434/jsoup-character-encoding-issue) – JJF Jul 26 '17 at 20:10
  • Find `"[\\x{100}-\\x{10FFFF}]+"` replace `""` with Unicode option. –  Jul 26 '17 at 20:16
  • @sln wow, that worked, can you please explain further how that works, Thanks! – Maxxy Jul 26 '17 at 20:37
  • That's just the Unicode code point range U+0100 to U+10FFFF. It matches every character in that range, which covers the rest of the _basic_ plane and all the _supplemental_ planes. You'd try to avoid these characters as keys. –  Jul 26 '17 at 20:50
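
For anyone landing here later, the find-and-replace from the comments can be sketched in plain Java roughly as follows. This is only a minimal sketch, not the asker's final code: the class name `NonLatinFilter` and the helpers `stripAboveLatin1` and `isUsableKey` are made up for illustration, and the length limit of 30 is copied from the question's own condition. Note that the range starts at U+0100, so ASCII and Latin-1 characters (including accented letters such as é) are kept.

```java
import java.util.regex.Pattern;

final class NonLatinFilter {
    // Code points U+0100 through U+10FFFF, i.e. everything above ASCII and Latin-1.
    // Java's Pattern matches by code point, so supplementary characters are covered too.
    private static final Pattern ABOVE_LATIN1 =
            Pattern.compile("[\\x{100}-\\x{10FFFF}]+");

    /** Removes every code point above U+00FF, then trims surrounding whitespace. */
    static String stripAboveLatin1(String text) {
        return ABOVE_LATIN1.matcher(text).replaceAll("").trim();
    }

    /** Hypothetical helper: true if the cleaned text is worth storing as a key. */
    static boolean isUsableKey(String text) {
        String cleaned = stripAboveLatin1(text);
        return !cleaned.isEmpty() && cleaned.length() < 30;
    }

    public static void main(String[] args) {
        // Sample link texts; the escapes spell two short Chinese strings.
        String[] samples = {"hello", "wow", "\u4f60\u597d", "cool \u4e16\u754c"};
        for (String s : samples) {
            if (isUsableKey(s)) {
                System.out.println(stripAboveLatin1(s)); // prints hello, wow, cool
            }
        }
    }
}
```

In the question's `loadHTML`, the cleaned value would be computed before the `isEmpty()` check, so that a link consisting only of Chinese text becomes an empty string and is skipped.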

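
As a footnote to Jon Skeet's comment, here is a small sketch of why replacing question marks probably had no effect. It assumes (we can't know from the question) that the console was using a charset that can't represent Chinese characters; US-ASCII stands in for that charset here, and the class name `QuestionMarkDemo` and helper `asShownBy` are invented for the example. The string itself never contains a '?'; the question marks only appear when the text is encoded for output.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class QuestionMarkDemo {
    /**
     * Round-trips text through the given charset, which roughly mimics what a
     * console using that charset would display. Unmappable characters are
     * replaced with the charset's replacement byte ('?' for US-ASCII).
     */
    static String asShownBy(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        String value = "\u4f60\u597d"; // two Chinese characters

        // The in-memory string has no question marks at all.
        System.out.println(value.contains("?")); // prints false

        // So the replacement from the question leaves it unchanged.
        System.out.println(value.replaceAll("\\?", " ").equals(value)); // prints true

        // The "??" only shows up once the text is pushed through a limited charset.
        System.out.println(asShownBy(value, StandardCharsets.US_ASCII)); // prints ??
    }
}
```

If that is what was happening, the two '??' strings in the hashtables were really two different pieces of Chinese text, which would explain why they compared as different.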