Running Tika through tika-python in Windows produces encoding errors

Question

I have Python code that extracts text from PDF files using Tika Server through tika-python and stores the output in individual JSON files.

The command I run to execute my script is

python extraction.py <full path to some local directory>

I'm using Python 3.5.

It works perfectly on several MacBook Pro computers. It doesn't work as expected on Windows, even on an up-to-date Windows 10.

Some PDF files are processed; others produce an error such as:

'charmap' codec can't encode characters in position 3648-3649: character maps to <undefined>

Based on other questions posted on Stack Overflow, I have tried changing the code page to 65001 and the console font to Lucida Console. Those questions include 388490 (Unicode characters in Windows command line - how?), 14109024 (How to make Unicode charset in cmd.exe by default?), and 1259084 (What encoding/code page is cmd.exe using?).

I also tried installing ConEmu (http://conemu.github.io/en/UnicodeSupport.html) and changing the default encoding for all consoles.

Other references mention win_unicode_console (https://github.com/Drekin/win-unicode-console), but the recommended instructions for patching Python are not working on my machine.

I use Anaconda as my Python distribution.

I would like to know how to run my Python code on Windows without these encoding problems. From what I have read, this is a problem with neither my Python code nor Tika Server, but rather a Windows encoding issue.

Thank you all,

German

Please provide the complete traceback so that the context of the error can be determined. It may have nothing to do with printing to the console. — Eryk Sun, Aug 24 '16 at 04:59
I normally run my code inside try/except blocks, so it keeps executing; it just generates lots of useless JSON files. Without the try/except, the traceback looks like the picture at this link: (http://i.stack.imgur.com/Z68hh.png) — gamunozbravo, Aug 25 '16 at 13:59
The problem is unrelated to the console. Text files default to the system's preferred encoding. As of Python 3.5, `locale.getpreferredencoding()` returns the ANSI encoding from the user's Windows locale setting (e.g. codepage 1252). You need to modify how `out` is opened in extraction.py to add the parameter `encoding='utf-8'`. (For 3.6 or 3.7 a proposal to change the preferred encoding on Windows to UTF-8 is currently being debated.) — Eryk Sun, Aug 25 '16 at 18:03
Your comment solved my problem. It is now working as expected. Also, I thought that having a `coding: utf-8` comment at the top of a script was supposed to specify the encoding of the files it writes, but evidently I was wrong. — gamunozbravo, Aug 26 '16 at 22:46
The coding spec on the first line of a script is to specify the encoding of the script file itself. For Python 3 the default encoding for scripts is UTF-8, so the coding spec isn't required for a UTF-8 file. — Eryk Sun, Aug 27 '16 at 01:37
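For readers who land here with the same error, the file-encoding fix from the comments above can be sketched roughly as follows. The original extraction.py is not shown in the question, so the function name `write_json_utf8`, the sample record, and the temporary output path are all hypothetical; the only part taken from the thread is passing `encoding='utf-8'` to `open()`.

```python
import json
import locale
import os
import tempfile


def write_json_utf8(path, record):
    """Write `record` as JSON, forcing UTF-8 regardless of the system locale.

    Hypothetical helper: the real extraction.py is not shown in the question.
    """
    # Without encoding=..., open() uses locale.getpreferredencoding(False),
    # which on Windows is typically an ANSI code page such as cp1252.
    with open(path, "w", encoding="utf-8") as out:
        # ensure_ascii=False keeps non-ASCII text readable in the file;
        # this is safe here because the file itself is UTF-8.
        json.dump(record, out, ensure_ascii=False)


if __name__ == "__main__":
    # Shows which encoding open() would have picked by default.
    print("Preferred encoding:", locale.getpreferredencoding(False))

    # Made-up sample containing characters that cp1252 cannot encode.
    sample = {"content": "arrow \u2192 and CJK \u4e2d\u6587"}
    target = os.path.join(tempfile.mkdtemp(), "sample.json")
    write_json_utf8(target, sample)

    # Read back with the same explicit encoding to confirm the round trip.
    with open(target, encoding="utf-8") as f:
        assert json.load(f) == sample
```

Any code that later reads these JSON files needs the same explicit `encoding='utf-8'`, otherwise the mismatch just moves from writing to reading.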
I declared success too soon: hours later I got another UnicodeEncodeError, this time from the print call that I include in my try/except blocks to check whether a file was processed correctly. — gamunozbravo, Aug 27 '16 at 13:17
The new error is due to the console codepage. Enable `win_unicode_console` to solve that problem. — Eryk Sun, Aug 27 '16 at 17:51
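If installing `win_unicode_console` is not an option, one stdlib-only workaround for the status messages is to escape whatever the console encoding cannot represent before printing. This is a different technique from the one recommended in the comment above: it does not make the console display Unicode, it only stops `print` from raising. The helper names `encodable` and `safe_print` are hypothetical, and the sketch assumes that seeing escaped characters in a progress message is acceptable.

```python
import sys


def encodable(text, encoding, errors="backslashreplace"):
    """Return `text` with characters that `encoding` cannot represent escaped."""
    # Encoding with backslashreplace turns unsupported characters into
    # \uXXXX escapes; decoding again yields a str that is safe to print.
    return text.encode(encoding, errors=errors).decode(encoding)


def safe_print(text, stream=None):
    """Print `text` without raising UnicodeEncodeError on a limited console."""
    stream = stream or sys.stdout
    # Some streams report no encoding (None); fall back to UTF-8 then.
    encoding = getattr(stream, "encoding", None) or "utf-8"
    print(encodable(text, encoding), file=stream)


if __name__ == "__main__":
    # Hypothetical status message of the kind printed after each file.
    safe_print("processed: report \u2192 final.pdf")
```

On a console using cp1252 the arrow would appear as the literal text `\u2192`; on a UTF-8 terminal the message is printed unchanged.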
