
The Problem

I'm working on cleaning up some old Korean code, and there are some sections of code that used to be Korean that I would like to translate to English. However, there seems to have been an encoding issue, and the text is no longer Korean. Instead, it's a garbled mess.

I would like to go from the broken string to an English translation.

My plan is to start with the broken string, encode it to binary using the codec that was used to decode the broken string on my computer, decode that binary to Korean using a Korean codec, and google translate that Korean into English. The issue is I have no idea how to decode this mess into readable Korean.
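To make the plan concrete, here is a sketch of what that chain would look like. The codec names are pure guesses at this point (cp1252 because I'm on Windows, cp949 because that's my assumption about the original Korean):

```python
# Illustrative only: 'cp1252' and 'cp949' are guesses, not confirmed.
broken = 'À¯ÇÑ »ùÇÃ'           # garbled text copied from the code
raw = broken.encode('cp1252')   # step 1: recover the original bytes
korean = raw.decode('cp949')    # step 2: decode those bytes with a Korean codec
print(korean)                   # step 3: paste the result into Google Translate
```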

What I've tried

I started writing some Python3 code to work on translating this, but I keep getting hit with encoding errors, and honestly, I don't know where to start. This code was written with the assumption that the Korean used the cp949 codec, which I don't know for sure.

fileIn = open('Broken_Korean.txt', 'r', encoding='cp949')
fileOut = open('Fixed_Korean.txt', 'w', encoding='utf-8')

for line in fileIn:
    fileOut.write(str(line.encode('cp949')))  # writes the bytes repr, e.g. b'...'
    fileOut.write('\n')
    fileOut.write(line.encode('cp949').decode('utf-8'))  # this is where it errors out

fileIn.close()
fileOut.close()

I've also researched this issue, but I haven't found anything groundbreaking. I believe the codec used to display the broken strings is UTF-8, but I could be mistaken. I don't know how the original Korean was written, except that it was written using a "multi-byte encoding scheme (MBCS)". For context, the program this was written in is LabVIEW 2015. Presumably, they used a Korean version when they wrote the initial code.

Some examples of the broken strings:

ÆÄÀÏ ´ëÈ­ »óÀÚ5

ÆÄÀÏ ´ëÈ­ »óÀÚ6

ÆÄÀÏ ´ëÈ­ »óÀÚ

Luckily, some of the encoding errors happened on enums, so I was able to find the English translation. Using that translation, I can guess what the Korean might have been, but I'm not certain. I think this might help me deduce the codecs used, but I don't know how to do it.

À¯ÇÑ »ùÇà = Finite Samples > 유한 샘플

¿¬¼Ó »ùÇà = Continuous Samples > 연속 샘플

Çϵå¿þ¾î ŸÀֿ̹¡ ÀÇÇÑ ´ÜÀÏ Æ÷ÀÎÆ® = Hardware Timed Single Point > 하드웨어 타이밍 단일 포인트
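One idea for using these known pairs: brute-force combinations of a single-byte "display" codec and a Korean codec, keeping the pairs that reproduce the guessed Korean. The candidate codec lists below are my guesses, not an exhaustive set:

```python
# Sketch: try every (display codec, Korean codec) pair against a known sample.
broken = 'À¯ÇÑ »ùÇÃ'      # garbled string from the code
expected = '유한 샘플'      # Korean guessed from the English enum name

display_codecs = ['latin1', 'cp1252', 'cp1254']
korean_codecs = ['cp949', 'euc_kr', 'johab', 'iso2022_kr']

matches = []
for dc in display_codecs:
    for kc in korean_codecs:
        try:
            if broken.encode(dc).decode(kc) == expected:
                matches.append((dc, kc))
        except UnicodeError:
            pass  # this pair can't even round-trip the sample; skip it

print(matches)
```

Any pair that survives for all three known samples would be a strong candidate for the real codecs.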

Any help on working with encoding or tips on how to solve this would be greatly appreciated!! I'm very lost right now.

Edit: Here is a hex dump of some of the broken strings:

Broken_Korean.txt

ÆÄÀÏ ´ëÈ­ »óÀÚ5
ÆÄÀÏ ´ëÈ­ »óÀÚ6
ÆÄÀÏ ´ëÈ­ »óÀÚ
À¯ÇÑ »ùÇÃ
¿¬¼Ó »ùÇÃ
Çϵå¿þ¾î ŸÀֿ̹¡ ÀÇÇÑ ´ÜÀÏ Æ÷ÀÎÆ®
hexdump -C Broken_Korean.txt                                       
000000  c3 86 c3 84 c3 80 c3 8f 20 c2 b4 c3 ab c3 88 c2  ........ .......                                               
000010  ad 20 c2 bb c3 b3 c3 80 c3 9a 35 0d 0a c3 86 c3  . ........5.....                                               
000020  84 c3 80 c3 8f 20 c2 b4 c3 ab c3 88 c2 ad 20 c2  ..... ........ .                                               
000030  bb c3 b3 c3 80 c3 9a 36 0d 0a c3 86 c3 84 c3 80  .......6........                                               
000040  c3 8f 20 c2 b4 c3 ab c3 88 c2 ad 20 c2 bb c3 b3  .. ........ ....                                               
000050  c3 80 c3 9a 0d 0a c3 80 c2 af c3 87 c3 91 20 c2  .............. .                                               
000060  bb c3 b9 c3 87 c3 83 0d 0a c2 bf c2 ac c2 bc c3  ................                                               
000070  93 20 c2 bb c3 b9 c3 87 c3 83 0d 0a c3 87 c3 8f  . ..............                                               
000080  c2 b5 c3 a5 c2 bf c3 be c2 be c3 ae 20 c3 85 c2  ............ ...                                               
000090  b8 c3 80 c3 8c c2 b9 c3 96 c2 bf c2 a1 20 c3 80  ............. ..                                               
0000a0  c3 87 c3 87 c3 91 20 c2 b4 c3 9c c3 80 c3 8f 20  ...... ........                                                
0000b0  c3 86 c3 b7 c3 80 c3 8e c3 86 c2 ae              ............     

Zico
    Check out the `chardet` library, it can help you identify the original encoding, or [bs4's UnicodeDammit](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#unicode-dammit) to automatically guess an encoding and convert to utf8. – SuperStormer Mar 13 '21 at 18:25
  • 1
    I tried decoding `ÆÄÀÏ ´ëÈ­ »óÀÚ5` and I get `UnicodeEncodeError: 'cp949' codec can't encode character '\xc7' in position 0: illegal multibyte sequence`. Do you face the same issue? (happens with other strings too) – Razzle Shazl Mar 13 '21 at 18:30
  • @RazzleShazl I did get that error, along with other ones. I assumed that was because my code was bad. – Zico Mar 13 '21 at 18:42
  • @SuperStormer Those are some good tips! I tried using chardet to detect the encoding, but it just tells me utf-8. I'm about to try UnicodeDammit and I'll report back when I do! – Zico Mar 13 '21 at 18:47
  • 3
    Can you post a hexdump of some sample text? – SuperStormer Mar 13 '21 at 18:48
  • 1
    I know DPRK created their own operating system *Red Star OS* and had a lot of in-house development. Is it possible they came up with their own archaic encoding which you found? Just guessing. – Razzle Shazl Mar 13 '21 at 18:58
  • @SuperStormer Unfortunately, UnicodeDammit keeps trying to tell me this is ascii... idk why. – Zico Mar 13 '21 at 19:03
  • 1
    Someone might correct me: The Korean comments have been saved to file into a byte sequence. The choice of bytes was determined by the codec used at the time (e.g. '연속 샘플'.encode('magic_encoding')). What you have shown us are the strings created from decoding the bytes using a different encoding. I re-encoded those strings using utf-8 (to get bytes again) and tried decoding to ['iso2022_kr', 'euc_kr'] but both failed. I think you need a hex dump of the sample text to share with us as @SuperStormer just mentioned – Razzle Shazl Mar 13 '21 at 19:07
  • @SuperStormer I edited my question to include the hex dump. – Zico Mar 13 '21 at 19:07
  • @RazzleShazl I edited my question to include the hex dump, and fortunately, this isn't some DPRK code haha! It's KR (South Korean) code. – Zico Mar 13 '21 at 19:08
  • 2
    I think you need a hex dump of the file. What you have here is hexdump of the bytes from decoding your Korean comments (from source file) using incorrect codec, then re-saved to a new file `Broken_Korean.txt`, Someone might correct me. – Razzle Shazl Mar 13 '21 at 19:08
  • @RazzleShazl What I posted was a hexdump of a file with 6 of the broken strings. I didn't encode or decode those strings, those are directly copy-pasted from the code. I would just hexdump the code itself and give you snippets of that, but it's LabVIEW code, so that wouldn't be very helpful unfortunately. – Zico Mar 13 '21 at 19:11
  • 2
    Do it anyway; it should be more helpful than a hexdump of an incorrect re-encoding. – SuperStormer Mar 13 '21 at 23:22

1 Answer


The data in the hexdump was likely read as ISO-8859-1 (a.k.a. Latin-1) and re-saved as UTF-8. To reverse this, decode as UTF-8 to obtain the original cp949 byte values, now held in a Unicode string as Unicode code points. The latin1 codec occupies the first 256 code points, so encoding with it produces a byte string with exactly those byte values. Then the correct Korean codec can be applied to decode back to a Unicode string:

data = bytes.fromhex('''
c3 86 c3 84 c3 80 c3 8f 20 c2 b4 c3 ab c3 88 c2
ad 20 c2 bb c3 b3 c3 80 c3 9a 35 0d 0a c3 86 c3
84 c3 80 c3 8f 20 c2 b4 c3 ab c3 88 c2 ad 20 c2
bb c3 b3 c3 80 c3 9a 36 0d 0a c3 86 c3 84 c3 80
c3 8f 20 c2 b4 c3 ab c3 88 c2 ad 20 c2 bb c3 b3
c3 80 c3 9a 0d 0a c3 80 c2 af c3 87 c3 91 20 c2
bb c3 b9 c3 87 c3 83 0d 0a c2 bf c2 ac c2 bc c3
93 20 c2 bb c3 b9 c3 87 c3 83 0d 0a c3 87 c3 8f
c2 b5 c3 a5 c2 bf c3 be c2 be c3 ae 20 c3 85 c2
b8 c3 80 c3 8c c2 b9 c3 96 c2 bf c2 a1 20 c3 80
c3 87 c3 87 c3 91 20 c2 b4 c3 9c c3 80 c3 8f 20
c3 86 c3 b7 c3 80 c3 8e c3 86 c2 ae
''')

fixed = data.decode('utf8').encode('latin1').decode('cp949')
print(fixed)

Output:

파일 대화 상자5
파일 대화 상자6
파일 대화 상자
유한 샘플
연속 샘플
하드웨어 타이밍에 의한 단일 포인트

Translation (Google Translate):

File Dialog 5
File Dialog 6
File dialog
Finite sample
Continuous sample
Single point by hardware timing

If starting from a file, read the file as UTF-8, apply the fix, and write it back as (correct) UTF-8:

with open('Broken_Korean.txt', 'r', encoding='utf8') as f:
    data = f.read().encode('latin1').decode('cp949')

with open('Fixed_Korean.txt', 'w', encoding='utf8') as f:
    f.write(data)
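If the file might mix already-correct text with mojibake, a defensive variant (my addition, not part of the fix above) can apply the repair line by line and leave anything that cannot round-trip untouched:

```python
# Sketch: repair only the lines that actually round-trip through latin1/cp949.
def fix_line(line):
    try:
        return line.encode('latin1').decode('cp949')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return line  # already correct (or not fixable this way); leave as-is

print(fix_line('À¯ÇÑ »ùÇÃ'))       # mojibake: gets repaired
print(fix_line('already English'))  # plain ASCII: passes through unchanged
```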
Mark Tolonen
  • Wow!! How did you determine that it was Latin-1 in the middle?? – Zico Mar 14 '21 at 15:14
  • 2
    @Zico It's not necessarily. It's just that after decode, you have a Unicode string, but it needs to be a bytes string to decode again, and `.encode('latin1')` translates code point X (where X is U+0000 to U+00FF) to byte X (0x00 to 0xFF). The first 256 Unicode code points are the Latin-1 character set. It's often a first try because it can decode anything. Sometimes it needs to be `cp1252` (the ANSI default on Windows). – Mark Tolonen Mar 14 '21 at 18:33
  • 1
    This is amazing. I used this to find the name of a song on an [internet radio station](http://wowccm.iptime.org:8000) that doesn't understand encoding. The string `¹Ú¼öÁø - ´Ã »ì¾Æ°è½Ã³×` converts to `박수진 - 늘 살아계시네`. _Crazy_. – WEBjuju Jun 14 '23 at 17:35