I have a Python script that is meant to convert Unicode escape sequences to their actual characters, like this:

import json

def convert_unicode_to_hangul(data):
    if isinstance(data, dict):
        return {key: convert_unicode_to_hangul(value) for key, value in data.items()}
    elif isinstance(data, list):
        return [convert_unicode_to_hangul(item) for item in data]
    elif isinstance(data, str):
        return data.encode('euc_kr').decode('cp949')
    else:
        return data

# Open the file and read its contents
with open('script.json', 'r', encoding="utf-8") as f:
    input_data = json.load(f)

output_data = convert_unicode_to_hangul(input_data)

# Name of the output file
output_filename = "script_out.json"

# Write the output data to a new file
with open(output_filename, 'w', encoding="utf-8") as f:
    json.dump(output_data, f, ensure_ascii=False, indent=2)

The problem is that although it works for some characters, others are not converted correctly. I don't know what I am missing here.

Sample input:

{
  "ScriptMethod": [
    {
      "Address": 25744128,
      "Name": "\uAC04\uAC28_TypeInfo",
      "Signature": "___505_c*"
    },
    {
      "Address": 25744184,
      "Name": "System.Xml.AttributePSVIInfo_TypeInfo",
      "Signature": "System_Xml_AttributePSVIInfo_c*"
    },
    {
      "Address": 25744216,
      "Name": "\uAC04\uAC2C_TypeInfo",
      "Signature": "___509_c*"
    },
    {
      "Address": 19592624,
      "Name": "\uAC03\uAC29.\uAC03\uAC28$$\uAC08\uAC00",
      "Signature": "void _________ (______39_o* __this, const MethodInfo* method);",
      "TypeSignature": "vii"
    },
    {
      "Address": 2331904,
      "Name": "\uAC03\uAC29.\uAC03\uAC28$$\uAC02\uAC11\uAC3D",
      "Signature": "Il2CppObject* __________ (______39_o* __this, const MethodInfo* method);",
      "TypeSignature": "iii"
    },
    {
      "Address": 2331904,
      "Name": "\uAC03\uAC29.\uAC03\uAC28$$\uAC02\uAC02\uAC21",
      "Signature": "Il2CppObject* __________ (______39_o* __this, const MethodInfo* method);",
      "TypeSignature": "iii"
    },
    {
      "Address": 19590352,
      "Name": "\uAC03\uAC29.\uAC03\uAC28$$MoveNext",
      "Signature": "bool _______MoveNext (______39_o* __this, const MethodInfo* method);",
      "TypeSignature": "iii"
    }
  ]
}

Desired result:

{
  "ScriptMethod": [
    {
      "Address": 25744128,
      "Name": "간갨_TypeInfo",
      "Signature": "___505_c*"
    },
    {
      "Address": 25744184,
      "Name": "System.Xml.AttributePSVIInfo_TypeInfo",
      "Signature": "System_Xml_AttributePSVIInfo_c*"
    },
    {
      "Address": 25744216,
      "Name": "간갬_TypeInfo",
      "Signature": "___509_c*"
    },
    {
      "Address": 19592624,
      "Name": "갃갩.갃갨$$갈가",
      "Signature": "void _________ (______39_o* __this, const MethodInfo* method);",
      "TypeSignature": "vii"
    },
    {
      "Address": 2331904,
      "Name": "갃갩.갃갨$$갂갑갽",
      "Signature": "Il2CppObject* __________ (______39_o* __this, const MethodInfo* method);",
      "TypeSignature": "iii"
    },
    {
      "Address": 2331904,
      "Name": "갃갩.갃갨$$갂갂갡",
      "Signature": "Il2CppObject* __________ (______39_o* __this, const MethodInfo* method);",
      "TypeSignature": "iii"
    },
    {
      "Address": 19590352,
      "Name": "갃갩.갃갨$$MoveNext",
      "Signature": "bool _______MoveNext (______39_o* __this, const MethodInfo* method);",
      "TypeSignature": "iii"
    }
  ]
}

Current output:

{
  "ScriptMethod": [
    {
      "Address": 25744128,
      "Name": "간ㅤㄱㅐㄽ_TypeInfo",
      "Signature": "___505_c*"
    },
    {
      "Address": 25744184,
      "Name": "System.Xml.AttributePSVIInfo_TypeInfo",
      "Signature": "System_Xml_AttributePSVIInfo_c*"
    },
    {
      "Address": 25744216,
      "Name": "간갬_TypeInfo",
      "Signature": "___509_c*"
    },
    {
      "Address": 19592624,
      "Name": "ㅤㄱㅏㄳㅤㄱㅐㄾ.ㅤㄱㅏㄳㅤㄱㅐㄽ$$갈가",
      "Signature": "void _________ (______39_o* __this, const MethodInfo* method);",
      "TypeSignature": "vii"
    },
    {
      "Address": 2331904,
      "Name": "ㅤㄱㅏㄳㅤㄱㅐㄾ.ㅤㄱㅏㄳㅤㄱㅐㄽ$$ㅤㄱㅏㄲ갑ㅤㄱㅑㄵ",
      "Signature": "Il2CppObject* __________ (______39_o* __this, const MethodInfo* method);",
      "TypeSignature": "iii"
    },
    {
      "Address": 2331904,
      "Name": "ㅤㄱㅏㄳㅤㄱㅐㄾ.ㅤㄱㅏㄳㅤㄱㅐㄽ$$ㅤㄱㅏㄲㅤㄱㅏㄲㅤㄱㅐㄵ",
      "Signature": "Il2CppObject* __________ (______39_o* __this, const MethodInfo* method);",
      "TypeSignature": "iii"
    },
    {
      "Address": 19590352,
      "Name": "ㅤㄱㅏㄳㅤㄱㅐㄾ.ㅤㄱㅏㄳㅤㄱㅐㄽ$$MoveNext",
      "Signature": "bool _______MoveNext (______39_o* __this, const MethodInfo* method);",
      "TypeSignature": "iii"
    }
  ]
}

What should I change to get the expected result?

Abdullah Akçam
  • Obligatory background reading: [The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)](https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/) – Brian61354270 Mar 26 '23 at 21:32
  • `data.encode('euc_kr').decode('cp949')` makes no sense. Think about what that is saying: take a sequence of unicode codepoints, encode them into a sequence of bytes using the `'euc_kr'` encoding, then re-interpret that sequence of bytes as `'cp949'` encoded text and extract the corresponding sequence of unicode codepoints. – Brian61354270 Mar 26 '23 at 21:34
  • What is the expected encoding of the input and output files? You're reading and writing them both as UTF-8 in your question. Where do `'euc_kr'` and `'cp949'` come from? – Brian61354270 Mar 26 '23 at 21:39
  • I got the idea from here: https://stackoverflow.com/questions/46769520/python-encodes-korean-characters-in-an-unexpected-way-with-euc-kr-encoding-co and putting UTF-8 fixed some errors. – Abdullah Akçam Mar 26 '23 at 21:43
  • It seems that that question is about how the character `탇` doesn't have a direct encoding in `euc_kr`, and how instead a special decomposed byte sequence is used. The author is using `cp949` (a superset of `euc_kr` which _does_ have a direct encoding for `탇`) to side-step the decomposition logic the module is using for `탇` in `euc_kr` to show the literal interpretation of the decomposition. I'm not sure why that's connected to your problem. Doing that process yourself only serves to break up codepoints that can't be represented in `euc_kr` – Brian61354270 Mar 26 '23 at 22:00
  • Please [edit] your question to clarify what the actual encoding of the input data is and what encoding you want for the output data. Unless you have a special reason to do otherwise, the safest option is just to stick with UTF-8 everywhere. – Brian61354270 Mar 26 '23 at 22:05
  • @Brian61354270 If I use UTF-8 as you said, it translates the characters to gibberish like this "ê°ê°¨ – Abdullah Akçam Mar 26 '23 at 22:25
  • @Brian61354270 The problem is that I don't know what the encoding is. It should work with UTF-8, if I am not mistaken, but it just doesn't write the actual characters. – Abdullah Akçam Mar 26 '23 at 22:29
  • Related: [How to determine the encoding of text](https://stackoverflow.com/q/436220/11082165) – Brian61354270 Mar 26 '23 at 22:33
  • If you don't know what the encoding of the input file is, there's not much that can be done. Either we know what characters the bytes are supposed to represent, or we don't. – Brian61354270 Mar 26 '23 at 22:34
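The decomposition described in the comments can be reproduced on a single character. This is a minimal sketch, not part of the original thread: `euc_kr_round_trip` is a hypothetical helper that mirrors the `encode('euc_kr').decode('cp949')` step from the question, applied to two characters taken from the sample input.

```python
# Hypothetical helper: the same round trip the question's script applies to every string.
def euc_kr_round_trip(text: str) -> str:
    return text.encode("euc_kr").decode("cp949")

direct = "\uac04"      # 간: survives the round trip in the question's output
decomposed = "\uac28"  # 갨: comes out broken in the question's output

for ch in (direct, decomposed):
    encoded = ch.encode("euc_kr")
    result = euc_kr_round_trip(ch)
    # Show how many bytes the euc_kr codec produced and which code points come back.
    print(f"U+{ord(ch):04X} -> {len(encoded)} bytes -> "
          + " ".join(f"U+{ord(c):04X}" for c in result))
```

A character that the `euc_kr` codec can encode directly comes back unchanged, while one it cannot is returned as a filler character followed by separate jamo, which is the pattern visible in the "Current output" above.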

1 Answer

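Your input was written as pure ASCII with the default `ensure_ascii=True`, which generates the `\uXXXX` Unicode escape codes. `json.load` already decodes those escapes into the real characters, so no extra conversion step is needed. Simply rewrite the JSON with `ensure_ascii=False` and it will be readable. The result matches the desired output: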

import json

with open('input.json', encoding='ascii') as f:
    data = json.load(f)

with open('output.json', 'w', encoding='utf8') as f:
    json.dump(data, f, ensure_ascii=False, indent=2)

Output:

{
  "ScriptMethod": [
    {
      "Address": 25744128,
      "Name": "간갨_TypeInfo",
      "Signature": "___505_c*"
    },
    {
      "Address": 25744184,
      "Name": "System.Xml.AttributePSVIInfo_TypeInfo",
      "Signature": "System_Xml_AttributePSVIInfo_c*"
    },
    {
      "Address": 25744216,
      "Name": "간갬_TypeInfo",
      "Signature": "___509_c*"
    },
    {
      "Address": 19592624,
      "Name": "갃갩.갃갨$$갈가",
      "Signature": "void _________ (______39_o* __this, const MethodInfo* method);",
      "TypeSignature": "vii"
    },
    {
      "Address": 2331904,
      "Name": "갃갩.갃갨$$갂갑갽",
      "Signature": "Il2CppObject* __________ (______39_o* __this, const MethodInfo* method);",
      "TypeSignature": "iii"
    },
    {
      "Address": 2331904,
      "Name": "갃갩.갃갨$$갂갂갡",
      "Signature": "Il2CppObject* __________ (______39_o* __this, const MethodInfo* method);",
      "TypeSignature": "iii"
    },
    {
      "Address": 19590352,
      "Name": "갃갩.갃갨$$MoveNext",
      "Signature": "bool _______MoveNext (______39_o* __this, const MethodInfo* method);",
      "TypeSignature": "iii"
    }
  ]
}
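
To see the effect of `ensure_ascii` without touching any files, here is a small in-memory sketch. The one-entry `raw` string is an illustrative stand-in built from the first "Name" value in the sample input, not the real file.

```python
import json

# Illustrative stand-in for the input file: plain ASCII with \uXXXX escapes.
raw = '{"Name": "\\uAC04\\uAC28_TypeInfo"}'

# json.loads turns the escapes into real characters; no manual decoding is needed.
data = json.loads(raw)
print(data["Name"])

# Default ensure_ascii=True writes the non-ASCII characters back out as escapes.
escaped = json.dumps(data)
print(escaped)

# ensure_ascii=False writes the characters themselves.
readable = json.dumps(data, ensure_ascii=False)
print(readable)
```

Both `escaped` and `readable` parse back to the same data; they differ only in how the text is written out.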
Mark Tolonen