While working through the NLTK code examples from a book, running Python directly in PowerShell, I found that some characters are not printed. I am using Python 3.6.0, so the default encoding is UTF-8, as needed. The problem seems to be that UTF-8 text written to the command line is not displayed, probably because the console uses a different encoding.

I saw a similar post asking about Russian letters, but it was specific to Java and Linux. It gave me the idea to look for the console encoding setting and change it to UTF-8, but I am unable to find that setting.
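To see which encodings Python itself reports for the session, something like the following could be run in the same PowerShell window. This is only a minimal sketch: the function name `console_encoding_report` is made up for illustration, and it reports what Python believes, not what the console font can actually draw.

```python
import locale
import sys


def console_encoding_report():
    """Collect the encodings Python reports for the current session.

    Hypothetical helper for illustration; it is not part of NLTK.
    """
    return {
        # Encoding used when printing to the console or a pipe.
        "stdout": sys.stdout.encoding,
        # Default encoding for str/bytes conversions.
        "default": sys.getdefaultencoding(),
        # Encoding suggested by the operating system locale.
        "preferred": locale.getpreferredencoding(False),
    }


if __name__ == "__main__":
    for name, value in console_encoding_report().items():
        print(f"{name}: {value}")
```

If `stdout` already reports a UTF-8 encoding, the encoding is probably not the cause and the font is the more likely suspect.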

>>> import nltk
>>> nltk.download('cess_esp')
>>> nltk.corpus.cess_esp.words()
['El', 'grupo', 'estatal', 'Electricité_de_France', ...]
>>> nltk.download('indian')
>>> nltk.corpus.indian.words()
['মহিষের', 'সন্তান', ':', 'তোড়া', 'উপজাতি', '৷', ...]

As shown in the code, I print words from two corpora: Spanish and Indian (the sample shown above is in Bengali script). Only the Spanish output is displayed correctly; for the Indian corpus, the console shows blank boxes/squares in place of the letters. However, when I copy and paste that 'blank boxes' output into the Chrome address bar, or into this post, the characters appear correctly.
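Since the copied text renders correctly elsewhere, the strings themselves are presumably intact. One way to confirm that without relying on the console font is to print only the code points and Unicode names, which are plain ASCII. A minimal sketch, where `describe` is a hypothetical helper and `sample` is the first word from the output above:

```python
import unicodedata


def describe(text):
    """Return (character, code point, Unicode name) for each character."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in text
    ]


sample = "মহিষের"  # first word returned by nltk.corpus.indian.words()

# Print only ASCII (code point and name) so the result is readable
# even when the console font has no glyphs for the script.
for _, code_point, name in describe(sample):
    print(code_point, name)
```

The names show which Unicode block the characters belong to, which in turn indicates what script the console font would need to cover.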

Edit: The suggested possible duplicate (Displaying Unicode in Powershell) deals with the same problem, but the font it suggests is one that works for Arabic, Chinese, Japanese, and Russian characters. I tried that font in my case as well, hoping to get lucky. Unfortunately, it didn't work.

Chholak
  • This is unrelated to Powershell (the Console host for Windows is actually what's causing this and since Powershell uses the console host....) – bluuf Jan 23 '19 at 11:11
  • Possible duplicate of [Displaying Unicode in Powershell](https://stackoverflow.com/questions/49476326/displaying-unicode-in-powershell) – marsze Jan 23 '19 at 11:12
  • @marsze It is indeed a similar issue, thanks. However, the method suggested there is for Russian characters and, unfortunately, doesn't work for Indian characters. They suggest changing the font to SimSun-ExtB, but it still doesn't display Indian characters. – Chholak Jan 24 '19 at 08:22
  • @Chholak Unicode is Unicode, no matter if Russian or Indian. What have you tried? – marsze Jan 24 '19 at 08:24
  • @marsze I understand that Unicode doesn't differentiate between Russian and Indian, but the issue seems to be the font and not the Unicode encoding. I changed the font to SimSun-ExtB, but it didn't work. I tried all the other fonts listed there as well, and none of them worked. Is there a place where I can download a font specific to Indian scripts? – Chholak Jan 24 '19 at 08:43

0 Answers