
I have a JSON file containing country names, languages, and so on. I want to strip it down to just the information I need. For example, I would like to turn

[{
    "name": {
        "common": "Afghanistan",
        "official": "Islamic Republic of Afghanistan",
        "native": {
            "common": "\u0627\u0641\u063a\u0627\u0646\u0633\u062a\u0627\u0646",
            "official": "\u062f \u0627\u0641\u063a\u0627\u0646\u0633\u062a\u0627\u0646 \u0627\u0633\u0644\u0627\u0645\u064a \u062c\u0645\u0647\u0648\u0631\u06cc\u062a"
        }
    },
    "tld": [".af"],
    "cca2": "AF",
    "ccn3": "004",
    "cca3": "AFG",
    "currency": ["AFN"],
    "callingCode": ["93"],
    "capital": "Kabul",
    "altSpellings": ["AF", "Af\u0121\u0101nist\u0101n"],
    "relevance": "0",
    "region": "Asia",
    "subregion": "Southern Asia",
    "nativeLanguage": "pus",
    "languages": {
        "prs": "Dari",
        "pus": "Pashto",
        "tuk": "Turkmen"
    },
    "translations": {
        "cym": "Affganistan",
        "deu": "Afghanistan",
        "fra": "Afghanistan",
        "hrv": "Afganistan",
        "ita": "Afghanistan",
        "jpn": "\u30a2\u30d5\u30ac\u30cb\u30b9\u30bf\u30f3",
        "nld": "Afghanistan",
        "rus": "\u0410\u0444\u0433\u0430\u043d\u0438\u0441\u0442\u0430\u043d",
        "spa": "Afganist\u00e1n"
    },
    "latlng": [33, 65],
    "demonym": "Afghan",
    "borders": ["IRN", "PAK", "TKM", "UZB", "TJK", "CHN"],
    "area": 652230
}, ...

Into

[{
    "name": {
        "common": "Afghanistan",
        "native": {
            "common": "\u0627\u0641\u063a\u0627\u0646\u0633\u062a\u0627\u0646"
        }
    },
    "cca2": "AF"
}, ...

But when I try I get

[{
    "name": {
        "common": "Afghanistan",
        "native": {
            "common": "?????????"   <-- NOT WHAT I WANT
        }
    },
    "cca2": "AF"
},

Here is the important code I used to strip out what I don't want.

byte[] encoded = Files.readAllBytes(Paths.get("countries.json"));
String JSONString =  new String(encoded, Charset.forName("US-ASCII"));
...
Writer writer = new OutputStreamWriter(new FileOutputStream("countriesBetter.json"), "US-ASCII");
writer.write(javaObject.toString());
writer.close();

I cannot figure out why it turns the text into question marks. I have tried several character sets to no avail. When I use UTF-8 I get ا�غانستان

Please help me. Thank you.

J Blaz
  • `new String(encoded, Charset.forName("US-ASCII"));` what do you expect this to do? – njzk2 Oct 06 '16 at 21:40
  • `When I use UTF-8 i get اÙ�غانستان` The problem here is how you read it. The file is probably fine. – njzk2 Oct 06 '16 at 21:40
  • Give me a string that is the bytes given. – J Blaz Oct 06 '16 at 21:41
  • And what exactly do you expect ``\u0627\u0641\u063a\u0627\u0646\u0633\u062a\u0627\u0646`` to look like in a file? – f1sh Oct 06 '16 at 21:41
  • `\u0627\u0641\u063a\u0627\u0646\u0633\u062a\u0627\u0646` The two stripped-down blocks are, respectively, how I want it to look when I open it in Notepad++ and how it actually looks. – J Blaz Oct 06 '16 at 21:42
  • If you want to process JSON text files, use a **JSON parser/generator**. It will know how to write the JSON back out correctly. – Andreas Oct 06 '16 at 22:23

2 Answers


\u0627 is a Unicode escape for an Arabic letter, not an ASCII character. Arabic characters cannot be represented in US-ASCII, so the encoder writes a ? in place of each one. For the differences between the UTF formats, see "Difference between UTF-8 and UTF-16?"
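The substitution described above can be reproduced in a few lines. This is a minimal sketch, not the asker's program: the class name `AsciiLossDemo` and the helper `throughAscii` are made up for illustration. It encodes a string to US-ASCII bytes and decodes it again, which is roughly what happens when the output writer is opened with "US-ASCII".

```java
import java.nio.charset.StandardCharsets;

public class AsciiLossDemo {
    // Hypothetical helper: encode to US-ASCII bytes, then decode them again.
    // String.getBytes(Charset) replaces every unmappable character with the
    // charset's replacement byte, which is '?' for US-ASCII.
    static String throughAscii(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.US_ASCII);
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // The native common name from the question, written with Java escapes.
        String nativeName = "\u0627\u0641\u063a\u0627\u0646\u0633\u062a\u0627\u0646";

        System.out.println(throughAscii("Afghanistan")); // prints Afghanistan
        System.out.println(throughAscii(nativeName));    // prints ?????????
    }
}
```

Plain ASCII text survives unchanged, while each of the nine Arabic letters comes back as a single question mark, matching the output shown in the question.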

When you write the file as UTF-8, it must also be read as UTF-8, so that the editor (the "notepad") knows how to interpret the bytes it has. If you read it back into Java using that same encoding, the text will be unaltered.
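To illustrate the point about matching encodings, here is a hedged sketch. The class `Utf8RoundTripDemo` and its two helpers are hypothetical names, and ISO-8859-1 is only a stand-in for whatever wrong encoding a viewer might assume. It writes the text to a temporary file as UTF-8, reads it back as UTF-8, and separately decodes the same bytes with the wrong charset.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class Utf8RoundTripDemo {
    // Hypothetical helper: write text to a temp file as UTF-8 and read it
    // back with the same charset.
    static String writeAndRead(String text) {
        try {
            Path file = Files.createTempFile("countries", ".json");
            try {
                Files.write(file, text.getBytes(StandardCharsets.UTF_8));
                return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Hypothetical helper: decode UTF-8 bytes with a different charset,
    // as a viewer set to the wrong encoding would.
    static String misread(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String nativeName = "\u0627\u0641\u063a\u0627\u0646\u0633\u062a\u0627\u0646";

        // Same encoding on both sides: the text comes back unaltered.
        System.out.println(nativeName.equals(writeAndRead(nativeName))); // prints true

        // Wrong encoding on the reading side: each Arabic letter occupies two
        // bytes in UTF-8, so the nine letters turn into eighteen characters.
        System.out.println(misread(nativeName).length()); // prints 18
    }
}
```

The bytes on disk are identical in both cases; only the charset used to interpret them differs, which is why a correctly written file can still look garbled in an editor or console.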

stevegal

You will need to change the console encoding to see this.

Go to Run > Run Configurations.

A pop-up will open. Select the Common tab. In the Encoding section, select Other and choose UTF-8 from the dropdown.

Now run the program. I got the result below:

[ {
  "name" : {
    "common" : "Afghanistan",
    "official" : "Islamic Republic of Afghanistan",
    "natives" : {
      "common" : "افغانستان",
      "official" : "د افغانستان اسلامي جمهوریت"
    }
  },
  "tld" : [ ".af" ],
  "cca2" : "AF",
  "ccn3" : "004",
  "cca3" : "AFG",
  "currency" : [ "AFN" ],
  "callingCode" : [ "93" ],
  "capital" : "Kabul",
  "altSpellings" : [ "AF", "Afġānistān" ],
  "relevance" : "0",
  "region" : "Asia",
  "subregion" : "Southern Asia",
  "nativeLanguage" : "pus",
  "languages" : {
    "prs" : "Dari",
    "pus" : "Pashto",
    "tuk" : "Turkmen"
  },
  "translations" : {
    "cym" : "Affganistan",
    "deu" : "Afghanistan",
    "fra" : "Afghanistan",
    "hrv" : "Afganistan",
    "ita" : "Afghanistan",
    "jpn" : "アフガニスタン",
    "nld" : "Afghanistan",
    "rus" : "Афганистан",
    "spa" : "Afganistán"
  },
  "latlng" : [ 33, 65 ],
  "demonym" : "Afghan",
  "borders" : [ "IRN", "PAK", "TKM", "UZB", "TJK", "CHN" ],
  "area" : 652230
} ]
HARDI