HashMap does not behave as expected for Chinese characters

Question

China-中国,CN
Angola-安哥拉,AO
Afghanistan-阿富汗,AF
Albania-阿尔巴尼亚,AL
Algeria-阿尔及利亚,DZ
Andorra-安道尔共和国,AD
Anguilla-安圭拉岛,AI

In Java, I'm reading the above text from a file and creating a map where the keys will be the part before the comma and the values will be the region code after the comma.

Here is the code:

public static void main(String[] args) {

    BufferedReader br;
    Map<String,String>  mymap = new HashMap<String,String>();
    try {
        br = new BufferedReader(new InputStreamReader(new FileInputStream("C:/Users/IBM_ADMIN/Desktop/region_code_abbreviations_Chinese.csv"), "UTF-8"));
        String line;
        while ((line = br.readLine()) != null) {
           //System.out.println(line);
           String[] arr= line.split(",");
           mymap.put(arr[0], arr[1]);
        }

        br.close();
    } catch (IOException e) {
        System.out.println("Failed to read users file.");
    } finally {}

    for(String s: mymap.keySet()){
        System.out.println(s);
        if(s.equals("China-中国")){
            System.out.println("Got it");
            break;
        }
    }

    System.out.println("----------------");
    System.out.println("Returned from map  "+ mymap.get("China-中国"));

    mymap = new HashMap<String,String>();
    mymap.put("China-中国","Explicitly Put");
    System.out.println(mymap.get("China-中国"));
    System.out.println("done");
}

The output:

:
:
Egypt-埃及
Guyana-圭亚那
New Zealand-新西兰
China-中国
Indonesia-印度尼西亚
Laos-老挝
Chad-乍得
Korea-韩国
:
:
Returned from map  null
Explicitly Put
done

The map is loaded correctly, but when I look up "China-中国" in it, I get null instead of the value.

If I explicitly put "China-中国" into the map, the lookup returns a value. Why is this happening?

Please clarify. The output you get can't possibly come from the code you posted. — JB Nizet, Dec 23 '16 at 07:06
Why does System.out.println("Returned from map "+ mymap.get("China-中国")); print null? – Kaushik Lele, Dec 23 '16 at 07:07

score 1 · Answer 1 · edited May 23 '17 at 11:46

Since you are having a problem with the first value, I would check to see if the file starts with a BOM (Byte Order Mark).

If so, try stripping the BOM before processing.

See: Byte order mark screws up file reading in Java
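
As a minimal sketch of that stripping step (not necessarily what the linked question recommends), the first line can be checked for a leading U+FEFF before it is split. The class name BomStripDemo and the helpers stripBom and load are hypothetical, and the file is simulated with an in-memory byte array. The Chinese characters are written as Unicode escapes so the example does not depend on the encoding of the source file.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class BomStripDemo {

    // Hypothetical helper: removes a single leading U+FEFF (the decoded BOM), if present.
    static String stripBom(String line) {
        if (line != null && !line.isEmpty() && line.charAt(0) == '\uFEFF') {
            return line.substring(1);
        }
        return line;
    }

    // Reads "key,value" lines into a map, stripping a BOM from the first line only.
    static Map<String, String> load(BufferedReader br) throws IOException {
        Map<String, String> map = new HashMap<>();
        String line;
        boolean first = true;
        while ((line = br.readLine()) != null) {
            if (first) {
                line = stripBom(line);
                first = false;
            }
            String[] parts = line.split(",", 2);
            if (parts.length == 2) {   // skip lines without a comma
                map.put(parts[0], parts[1]);
            }
        }
        return map;
    }

    public static void main(String[] args) throws IOException {
        // Simulated file: a UTF-8 BOM followed by the first two rows from the question.
        String content = "\uFEFFChina-\u4E2D\u56FD,CN\nAngola-\u5B89\u54E5\u62C9,AO\n";
        byte[] data = content.getBytes(StandardCharsets.UTF_8);
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(data), StandardCharsets.UTF_8))) {
            Map<String, String> map = load(br);
            System.out.println(map.get("China-\u4E2D\u56FD")); // prints CN
        }
    }
}
```

Without the stripBom call, the first key would be stored with the invisible U+FEFF in front, and the lookup would return null as in the question.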

answered Dec 23 '16 at 07:13

Patrick Parker

Yes, the lengths are different. Check whether the key read from the file has the same length as the literal in your code. If not, re-save the file as UTF-8 without BOM, and it will work. – Tom Grylls Dec 23 '16 at 07:28

wumpz · Accepted Answer · 2016-12-23T10:42:40.783

Check whether your resource file is something other than plain UTF-8, e.g. UTF-8Y (UTF-8 with BOM bytes at the start). A BOM would only interfere with the first value, though. If you change the test to a key from the middle of the file, do you get a value? If not, then the BOM is not the problem.

A second possibility is that your source code file is not UTF-8. In that case the literal "China-中国" in your source code does not end up equal to the string read from your resource file, and you will not get a match. If you put the value into the map using the literal from the source code, though, the same literal will find it.

In fact this is not a problem with HashMap but with character or file encoding.
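
One way to check for such invisible characters, sketched here under the assumption that the first line of the file really does start with a BOM, is to compare lengths and print the code points of the key. The class name BomDiagnosis and the helper codePoints are hypothetical. The string fromFile stands in for what readLine() would return for the first line.

```java
import java.util.HashMap;
import java.util.Map;

public class BomDiagnosis {

    // Hypothetical helper: renders each code point as U+XXXX so invisible characters show up.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // Stand-in for the first key as read from a BOM-prefixed UTF-8 file.
        String fromFile = "\uFEFFChina-\u4E2D\u56FD";
        // The key as written in the source code.
        String literal = "China-\u4E2D\u56FD";

        Map<String, String> map = new HashMap<>();
        map.put(fromFile, "CN");

        // Both strings look identical when printed, but they differ by one char.
        System.out.println(fromFile.length() + " vs " + literal.length()); // prints 9 vs 8
        System.out.println(fromFile.equals(literal));                      // prints false
        System.out.println(map.get(literal));                              // prints null
        System.out.println(codePoints(fromFile));                          // output starts with U+FEFF
    }
}
```

If the code point dump of a key read from the file begins with U+FEFF, the file has a BOM. If it shows unexpected code points where the Chinese characters should be, the mismatch is more likely an encoding problem in the source file or the reader.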

edited Dec 23 '16 at 10:42

answered Dec 23 '16 at 07:32

wumpz

Bang on! Yes, I searched for a key other than the first one and it worked. Then I added a dummy word as the first line, and it worked well from the second row onwards. How can I identify whether the file has unwanted characters at the beginning? – Kaushik Lele Dec 23 '16 at 08:40
@KaushikLele 1. first possibility mentioned by wumpz is a duplicate of my answer, also I provided a link to the recommended solutions. (would be nice if the first person to post the answer gets credit...) 2. second possibility mentioned by wumpz is not even possible, based on the given output. – Patrick Parker Dec 23 '16 at 14:10
Oops. It was not my intention to duplicate. Sorry. – wumpz Dec 23 '16 at 14:35
@PatrickParker Your answer is helpful and informative. But the above answer appeared before yours in the default sort order. Also, because it is written in layman's terms, it helped me more in finding a quick workaround for my problem. Hence it was the best answer for me. I have +1'd you, and you will surely get more credit from subsequent viewers. As you must have understood, a better/simpler explanation enhances the acceptability of an answer. – Kaushik Lele Dec 24 '16 at 05:46

score 0 · Answer 3 · answered Dec 23 '16 at 07:58

You can use org.apache.commons.io.input.BOMInputStream from Apache Commons IO, which by default detects and skips a UTF-8 BOM.

BufferedReader br = new BufferedReader(new InputStreamReader(new BOMInputStream(new FileInputStream("filepath")), "UTF-8"));
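
If adding Commons IO is not an option, a similar effect can be approximated with the JDK alone. The following is a rough sketch, not the BOMInputStream implementation: the class Utf8BomSkipper and its method skipUtf8Bom are hypothetical names, and only the UTF-8 BOM (bytes EF BB BF) is handled.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

public class Utf8BomSkipper {

    // Hypothetical helper: returns a stream positioned after a UTF-8 BOM, if one is present.
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pin = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        // Read up to three bytes; a single read() call may return fewer.
        while (n < 3) {
            int r = pin.read(head, n, 3 - n);
            if (r < 0) {
                break; // end of stream
            }
            n += r;
        }
        boolean hasBom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!hasBom && n > 0) {
            pin.unread(head, 0, n); // not a BOM: push the bytes back so nothing is lost
        }
        return pin;
    }

    public static void main(String[] args) throws IOException {
        // Simulated file content: UTF-8 BOM followed by the first row from the question.
        byte[] data = "\uFEFFChina-\u4E2D\u56FD,CN\n".getBytes(StandardCharsets.UTF_8);
        try (BufferedReader br = new BufferedReader(new InputStreamReader(
                skipUtf8Bom(new ByteArrayInputStream(data)), StandardCharsets.UTF_8))) {
            String line = br.readLine();
            System.out.println(line.startsWith("China")); // prints true
        }
    }
}
```

In the question's code, the FileInputStream would be wrapped with skipUtf8Bom before being passed to the InputStreamReader.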

answered Dec 23 '16 at 07:58

Tom Grylls

The first line's value starts with the BOM. You can also remove it with replace: line = line.replace("\uFEFF", ""). But this is only for UTF-8 encoding. – Tom Grylls Dec 23 '16 at 08:44
