I have Python code that extracts text from PDF files using Tika Server through tika-python. It then stores the resulting output in individual JSON files.
The command I run to execute my script is
python extraction.py <full path to some local directory>
I'm using Python 3.5.
It works perfectly on several MacBook Pro computers. It doesn't work as expected on Windows, even on an up-to-date Windows 10.
Some pdf files are processed, others produce an error such as:
'charmap' codec can't encode characters in position 3648-3649: character maps to <undefined>
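For illustration only, here is a minimal sketch that produces the same kind of message without Tika. It rests on an assumption that my actual script does not confirm: that the failure occurs when text is encoded with a legacy Windows code page such as cp1252, which `open()` commonly picks as its default on Windows when no `encoding` is given. The sample string, the `content` key, and the file name `sample.json` are made up for the example.

```python
import json
import os
import tempfile

# Hypothetical sample text: the two Greek letters have no mapping in cp1252.
text = "Greek: \u03b1\u03b2"

# Encoding with a legacy code page fails with a 'charmap' error.
try:
    text.encode("cp1252")
    error_message = None
except UnicodeEncodeError as exc:
    error_message = str(exc)

print(error_message)

# Writing the same text with an explicit UTF-8 encoding does not depend
# on the console code page or the system locale.
path = os.path.join(tempfile.mkdtemp(), "sample.json")
with open(path, "w", encoding="utf-8") as fh:
    json.dump({"content": text}, fh, ensure_ascii=False)

# Read it back to confirm the characters survived the round trip.
with open(path, encoding="utf-8") as fh:
    round_tripped = json.load(fh)["content"]

print(round_tripped == text)
```

If this assumption holds, the error would come from the encoding used when writing or printing the text rather than from the console font.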
I have tried changing the Code Page to 65001 and changing console font to Lucida Console, based on other questions posted on Stack Overflow, including 388490 (Unicode characters in Windows command line - how?) and 14109024 (How to make Unicode charset in cmd.exe by default?) and 1259084 (What encoding/code page is cmd.exe using?).
I also tried installing ConEmu (http://conemu.github.io/en/UnicodeSupport.html) and changing the default encoding for all consoles.
Other references mention win_unicode_console (https://github.com/Drekin/win-unicode-console), but the recommended instructions for patching Python are not working on my machine.
I use Anaconda as my Python distribution.
I would like to know how to run my Python code on Windows without these encoding problems. From what I have read, this is not a problem with my Python code or with Tika Server, but rather a Windows encoding issue.
Thank you all,
German