I am using Python 2.7 on Windows 10 and am working with Korean text. My ultimate goal is to be able to import some Korean text, modify it and then write the new text to a file.

However, any Korean text I attempt to print to the terminal or write to a file ends up as a series of question marks.

For example, if I do the following

>>> print u'가다'

I get

??

I have tried printing as both '가다' and u'가다'. I have also tried two different encodings using sys.setdefaultencoding(ENCODING NAME). The encodings I have tried are "utf-8" and "iso 8859-15".

I also tried `print u'가다'.encode('utf-8')` and `print '가다'.encode('utf-8')`.

To see at what point the information is being lost, I used ord and got the following:

>>> ord(u'가')
63

ord('가') and ord(u'가') both return 63, which is the same as ord('?'), so whatever the problem is, it seems to happen the moment I hit Enter, before Python does anything with the text. The same happens if I save '가' or u'가' to a variable and take the ord of that variable.
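One way to check whether the loss really happens on input is to write the character as a `\u` escape, so that only ASCII is typed into the console. This is a minimal sketch; the variable names `typed` and `escaped` are hypothetical, and `typed` is only a stand-in for what the interpreter appears to receive in the session above.

```python
# -*- coding: utf-8 -*-
# Stand-in for what the interpreter received in the session above:
# the console had already turned the typed character into '?'.
typed = u"?"

# The same syllable written as an ASCII-only escape (U+AC00 is 가),
# so the console's input conversion cannot alter it.
escaped = u"\uac00"

print(ord(typed))         # 63
print(ord(escaped))       # 44032
print(hex(ord(escaped)))  # 0xac00
```

If the escaped form gives 44032 while the typed form gives 63, the string inside Python is intact and the damage is being done by the console when the text is entered.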

I have no problem getting Korean text to work in Python 3, but I am using a Korean language processing library that doesn't work in Python 3, so switching isn't an option in this situation. Any help would be much appreciated. Thank you in advance.

vulhar
  • `print u'가다'` works fine for me, using python 2.7 on windows 10. Also works for me on OSX and Ubuntu. – user3483203 Apr 07 '18 at 19:35
  • Are you sure this is a python issue and not a windows issue? Might be missing a language pack. – user3483203 Apr 07 '18 at 19:36
  • @chrisz, are you using IDLE? If not, what codepage do you have for CommandPrompt/PowerShell? What does `> chcp` print? – Terry Jan Reedy Apr 07 '18 at 19:50
  • Possible duplicate of [Python, Unicode, and the Windows console](https://stackoverflow.com/questions/5419/python-unicode-and-the-windows-console) – phuclv Apr 08 '18 at 00:38
  • [Python 2.7: output utf-8 in Windows console](https://stackoverflow.com/q/7078232/995714), [How to display utf-8 in windows console](https://stackoverflow.com/q/3578685/995714), [How to Output Unicode Strings on the Windows Console](https://stackoverflow.com/q/3130979/995714), [Proper way to print unicode characters to the console in Python when using inline scripts](https://stackoverflow.com/q/29316116/995714) – phuclv Apr 08 '18 at 00:40

1 Answer

On Windows, including the newest Windows 10, both Command Prompt and PowerShell restrict the characters they print to a 'codepage', which is usually a set of 256 of the more than 100,000 currently defined Unicode characters. By default, your Windows is set to the codepage for the country where you bought it.
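As a rough illustration of why a codepage mismatch produces question marks, the sketch below encodes the two syllables of 가다 with two codecs that ship with Python: `cp437` (a US code page with no Hangul) and `cp949` (the Korean code page). The helper name `encode_for_console` is hypothetical, and the `"replace"` error handler only approximates the lossy conversion a console performs; it is not what the console literally calls.

```python
# -*- coding: utf-8 -*-
import sys

# The two syllables of 가다, written as escapes so the source stays ASCII.
text = u"\uac00\ub2e4"


def encode_for_console(s, codepage):
    """Encode s for a given code page, substituting '?' for unmappable characters."""
    return s.encode(codepage, "replace")


lossy = encode_for_console(text, "cp437")   # no Hangul in this code page
korean = encode_for_console(text, "cp949")  # Hangul is representable here

print(repr(lossy))    # each syllable has been replaced by '?'
print(repr(korean))   # two bytes per syllable, no replacement
print(sys.stdout.encoding)  # the encoding Python believes the console uses
```

Running `chcp` in the console shows the active codepage. If it is not one that contains Hangul, the replacement in the first case is the kind of loss to expect.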

There exists a codepage for UTF-8 (65001), but it is buggy and Microsoft refuses to fix it.

For Python 2.7, use IDLE to run your code and Korean characters will print fine, because the Tcl/Tk Text widgets used by tkinter, and hence by IDLE, support all of the first 2**16 Unicode characters (the Basic Multilingual Plane), which includes Hangul.

Korean text works on Windows in Python 3.6+ because Python's interface to Command Prompt was rewritten so that it no longer uses the codepage setting. Urge the authors of the library you are using to produce a 3.6+ compatible version.
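The goal of reading Korean text, modifying it, and writing it back does not have to go through the console at all. A minimal sketch, assuming the files are UTF-8 encoded: `io.open` accepts an `encoding` argument in both Python 2.7 and Python 3 and returns unicode strings. The function name `rewrite_text`, the file names, and the sample transformation are hypothetical.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def rewrite_text(src_path, dst_path, transform):
    """Read UTF-8 text from src_path, apply transform, write UTF-8 to dst_path."""
    with io.open(src_path, "r", encoding="utf-8") as src:
        text = src.read()  # a unicode string, not raw bytes
    with io.open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(transform(text))


# Demonstration with temporary files (paths are illustrative only).
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "input.txt")
dst_path = os.path.join(workdir, "output.txt")

# Create a sample input file containing 가다, written as escapes.
with io.open(src_path, "w", encoding="utf-8") as f:
    f.write(u"\uac00\ub2e4")

# Sample modification: append one more syllable (가).
rewrite_text(src_path, dst_path, lambda s: s + u"\uac00")

with io.open(dst_path, "r", encoding="utf-8") as f:
    result = f.read()
```

Opening the output file in an editor that understands UTF-8 is then a more reliable check than printing to a codepage-limited console.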

Terry Jan Reedy