I have been following this answer on how to enable Python (2.7) to correctly receive file names from the Windows command line when they contain non-ASCII characters, such as 'canção.pdf', '조선.pdf' or 'मान.pdf'.

My bat file (which is in shell:sendto) is as follows (as advised here):

@echo off
@chcp 65001 > nul
@set PYTHONIOENCODING=utf-8
python "D:\Dropbox\Python\print_file_name.py" %1
pause

My Python script at the moment just tries to print these file names:

sys.argv = win32_unicode_argv()
file_name = sys.argv[1].encode(sys.stdout.encoding)
print file_name

win32_unicode_argv() is a function described here.

Even though I am able to print 'canção.pdf' correctly, I'm still not able to print either '조선.pdf' or 'मान.pdf'. Any advice on how to tackle this issue?

Augusto
  • [`win32_unicode_argv()` from @Craig McQueen's answer](http://stackoverflow.com/a/846931/145400) looks correct (and it uses the right approach, `CommandLineToArgvW`). The `chcp 65001` and `PYTHONIOENCODING=utf-8` lines in the bat file try to fix Unicode output to the Windows console (you also need to set a font that can show the characters you are interested in). The most successful approach that I know is `print(unicode_text)` (don't encode in your script). Use [`WriteConsoleW` if you want to print arbitrary Unicode characters to Windows console](http://stackoverflow.com/a/3259271/4279). – jfs May 14 '15 at 20:03
  • [continued] Configure the font but leave `chcp` alone. Use `PYTHONIOENCODING` if you want to redirect the output to a file. – jfs May 14 '15 at 20:05
  • Yeah `chcp 65001` breaks a lot of apps (including Python) due to bugs in the MS C runtime. Avoid it. Printing and reading unicode from the console in Windows is a notorious problem for this reason. `WriteConsoleW` via `ctypes` is the most reliable solution if you really *must* but obviously it isn't portable. – bobince May 14 '15 at 20:13
  • @bobince, the problems with codepage 65001 aren't in the C runtime. They're in the console host process (conhost.exe). Its design assumes an ANSI or DBCS codepage, which for a variable encoding such as UTF-8 can lead to an unchecked encoding failure for `ReadConsoleA` (even in Windows 10 -- which I just checked with a debugger attached to conhost.exe and breakpoint on `WideCharToMultiByte`) and a misreported written count for `WriteConsoleA` (fixed in Windows 8 and 10). – Eryk Sun May 15 '15 at 08:22
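
For reference, the `WriteConsoleW` via `ctypes` approach mentioned in these comments could look roughly like the sketch below. It is not taken from the linked answers: the names `write_console` and `utf16_units` are made up for illustration, it assumes `text` is a unicode string (`unicode` on Python 2, `str` on Python 3), and it assumes stdout is a real console rather than a redirected file. On other platforms it simply falls back to `sys.stdout.write`.

```python
import ctypes
import sys

STD_OUTPUT_HANDLE = -11  # documented Win32 constant for the stdout handle


def utf16_units(text):
    """Count UTF-16 code units, which is the length WriteConsoleW expects."""
    return len(text.encode("utf-16-le")) // 2


def write_console(text):
    """Write unicode text straight to the Windows console (hypothetical helper)."""
    if sys.platform != "win32":
        # Assumption: elsewhere the terminal already handles Unicode.
        sys.stdout.write(text)
        return len(text)

    from ctypes import wintypes

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.GetStdHandle.argtypes = [wintypes.DWORD]
    kernel32.GetStdHandle.restype = wintypes.HANDLE
    kernel32.WriteConsoleW.argtypes = [
        wintypes.HANDLE,                 # console output handle
        wintypes.LPCWSTR,                # text as a wide string
        wintypes.DWORD,                  # number of UTF-16 units to write
        ctypes.POINTER(wintypes.DWORD),  # receives the number written
        wintypes.LPVOID,                 # reserved, must be NULL
    ]
    kernel32.WriteConsoleW.restype = wintypes.BOOL

    handle = kernel32.GetStdHandle(STD_OUTPUT_HANDLE)
    written = wintypes.DWORD(0)
    ok = kernel32.WriteConsoleW(
        handle, text, utf16_units(text), ctypes.byref(written), None
    )
    if not ok:
        # Fails when stdout is redirected to a file or pipe, for example.
        raise ctypes.WinError(ctypes.get_last_error())
    return written.value
```

Even with this call, the console only shows glyphs that the configured console font contains, and very long strings may need to be written in chunks.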

1 Answer

The file name is being received correctly. You can verify this by encoding sys.argv[1] as UTF-8 and writing it to a file (opened in binary mode) and then opening the file in a text editor that supports UTF-8.
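
A minimal sketch of that check, assuming the name is already a unicode string (on Python 2.7 that means taking it from `win32_unicode_argv()` rather than the plain `sys.argv`). The helper name `dump_name_as_utf8` and the output path are only illustrative:

```python
def dump_name_as_utf8(name, out_path):
    """Encode a unicode file name as UTF-8 and write the raw bytes to out_path."""
    data = name.encode("utf-8")
    # Binary mode, so no newline translation or re-encoding touches the bytes.
    with open(out_path, "wb") as f:
        f.write(data)
    return data

# Example call from the script (assumes sys.argv was replaced as in the question):
#     dump_name_as_utf8(sys.argv[1], "test.txt")
```

If the resulting file shows the right characters in an editor that reads it as UTF-8, the argument reached Python intact and only the console display is at fault.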

The Windows command prompt is unable to display the characters correctly despite the 'chcp' command changing the codepage to UTF-8 because the terminal font does not contain those characters. The command prompt is unable to substitute characters from other fonts.

codewarrior
  • I've followed your advice and was almost successful. I've: removed `@set PYTHONIOENCODING=utf-8` from my bat file; changed `full_path = sys.argv[1].encode(sys.stdout.encoding)` to `full_path = sys.argv[1].encode('utf-8')`; and added `f = open('test.txt', 'wb')` and `f.write(full_path)`. I can now print both **'canção.pdf'** and **'조선.pdf'**, but not **'मान.pdf'**. I am opening _test.txt_ with Notepad++ using UTF-8 encoding. – Augusto May 14 '15 at 15:53
  • When you open *test.txt* with Notepad++, does it say "OEM 866" in the status bar? – codewarrior May 15 '15 at 00:03
  • I was able to open the file containing "मान.pdf" in PSPad and Notepad where it displayed correctly both times. It also displays correctly in the "Preview" panel of a Choose File dialog in Notepad++. However, Notepad++ always wants to open the file as OEM 866 and doesn't seem to have an option that says "interpret this file as UTF-8 encoding". – codewarrior May 15 '15 at 00:05
  • Disabling the "Autodetect Character Encoding" option in Notepad++ will cause the file to be opened as UTF-8 by default. However, there is still no "Open File With Encoding" command or any way to explicitly tell Notepad++ which encoding to use when reading the file - the options under the Encoding menu only seem to affect what it does when writing the file. – codewarrior May 15 '15 at 00:11