I am aware that there are plenty of discussions of the "UTF-8" encoding issue in Python 2, but I have not found a solution to my problem so far. I am writing a script that gets the name of a file and hyperlinks it with xlwt, so that the file can be opened by clicking it in the spreadsheet. The problem is that some of the file names include non-ASCII characters.

Question 1

I used the following line to retrieve the name of the file. There is only one file in the folder by the way.

>>f = filter(os.path.isfile, os.listdir(tmp_path))[0]

And then

>>print f
'521001ldrAvisoAcionistas(Retifica\xe7\xe3o)_doc'
>>print sys.stdout.encoding
'UTF-8'
>>f.decode("UTF-8")
*** UnicodeDecodeError: 'utf8' codec can't decode byte 0xe7 in position 76: invalid continuation byte

From browsing the discussions here, I realized that "\xe7\xe3o" is not valid "UTF-8". Running the following line seems to back this point.

>>f.decode("latin-1")
u'521001ldrAvisoAcionistas(Retifica\xe7\xe3o)_doc'

My question is then, why is the variable f being encoded in "latin-1" when the system encoding is set to "UTF-8"?

Question 2

While f.decode("latin-1") gives me the output that I want, I am still unable to supply the variable to the hyperlink function in the spreadsheet.

 >>data.append(["File", xlwt.Formula('HYPERLINK("%s";"%s")' % (os.path.join(dl_path,f.decode("latin-1")),f.decode("latin-1")))])
*** FormulaParseException: can't parse formula HYPERLINK("u'H:\\Mad Lab\\SE Doc Crawler\\bovespa\\download\\521001ldrAvisoAcionistas(Retifica\xe7\xe3o)_doc's;"u'521001ldrAvisoAcionistas(Retifica\xe7\xe3o)_doc's)

Apparently, the closing double quote got eaten up and was replaced by a " 's" suffix. Can somebody help to explain what's going on here? 0.0

Oh and if someone can suggest a solution to Question 2 above then I will be very grateful - for you would have saved my weekend from misery!

Thanks in advance all!

kerwei
  • The filesystem encoding can always differ from the locale (stdout and stdin encoding). Look at `sys.getfilesystemencoding()` instead. – Martijn Pieters Aug 19 '16 at 12:22
  • Can you tell me what operating system and if you're using the console or an IDE like IDLE, PyCharm, Intellij or Eclipse, so I can give you a specific answer? – Alastair McCormack Aug 21 '16 at 11:51
  • @MartijnPieters sys.getfilesystemencoding() returns "mbcs" – kerwei Aug 22 '16 at 02:17
  • @AlastairMcCormack I'm using Windows 7, running from the PyCharm IDE. – kerwei Aug 22 '16 at 02:18
  • @kerwei: `mbcs` is the multi-byte (ANSI code page) encoding used by Windows, and it is **not** UTF-8, see [Difference between MBCS and UTF-8 on Windows](http://stackoverflow.com/q/3298569). Just use the `'mbcs'` codec provided by Python. – Martijn Pieters Aug 22 '16 at 06:37

2 Answers

Welcome to the confusing world of encoding! There's at least file encoding, terminal encoding and filename encoding to deal with, and all three could be different.

In Python 2.x, the goal is to get a Unicode string (a different type from str) by decoding an encoded str. The problem is that you don't always know which encoding was used for the str, so it's difficult to decode it.

When using listdir() to get filenames, there's a documented but often overlooked quirk: if you pass a str to listdir(), you get encoded strs back. These will be encoded according to your locale. On Windows this will be an 8-bit character set, like windows-1252, not UTF-8. One option is to decode each name yourself with the filesystem encoding reported by sys.getfilesystemencoding().
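As a rough sketch of what is happening to the name from the question: the two accented letters are stored as single bytes, which is not how UTF-8 represents them. The code below assumes the name was encoded with Windows code page 1252 (on Windows you would normally use the 'mbcs' codec or sys.getfilesystemencoding() instead of hard-coding it). The variable names are only for illustration.

```python
# The byte string from the question, as listdir() returned it.
raw = b'521001ldrAvisoAcionistas(Retifica\xe7\xe3o)_doc'

# In UTF-8, 0xe7 announces a three-byte sequence, but the next byte (0xe3)
# is not a continuation byte, so decoding as UTF-8 fails.
try:
    raw.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Assumption: the name was encoded with Windows code page 1252.
name = raw.decode('cp1252')

# For comparison, the same name encoded as UTF-8 needs two bytes
# for each accented letter.
as_utf8 = name.encode('utf-8')

print(utf8_ok)
print(repr(as_utf8))
```

For these particular bytes, decoding with latin-1 gives the same result as cp1252, which is why f.decode("latin-1") appeared to work in the question.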

Alternatively, pass listdir() a Unicode string instead.

E.g.

os.listdir(u'C:\\mydir')

Note the u prefix

This will return properly decoded Unicode filenames. On Windows and OS X, this is pretty reliable as long as your environment locale hasn't been messed with.
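When the folder path lives in a variable rather than a literal, the same effect can be had by decoding the path once before calling listdir(). The following is a minimal sketch, not a drop-in fix: the scratch folder and the file name are made up for the demonstration, and it assumes the path was encoded with the filesystem encoding.

```python
import os
import shutil
import sys
import tempfile

# Hypothetical setup: a scratch folder holding one file with a non-ASCII name.
tmp_path = tempfile.mkdtemp()
fs_encoding = sys.getfilesystemencoding()

# Make sure the path is text (unicode), even if it arrived as a byte string.
if isinstance(tmp_path, bytes):
    tmp_path = tmp_path.decode(fs_encoding)

file_name = u'Retifica\xe7\xe3o_doc'
open(os.path.join(tmp_path, file_name), 'w').close()

try:
    # Text path in, text names out: no manual decoding needed.
    text_names = os.listdir(tmp_path)
    # Byte path in, byte names out: these still have to be decoded.
    byte_names = os.listdir(tmp_path.encode(fs_encoding))
    decoded_names = [n.decode(fs_encoding) for n in byte_names]
finally:
    # Clean up the scratch folder.
    shutil.rmtree(tmp_path)
```

Both routes should end with the same text names; passing a text path simply saves the decoding step and avoids guessing the encoding.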

In your case, listdir() would return:

u'521001ldrAvisoAcionistas(Retifica\xe7\xe3o)_doc'

Again, note the u prefix. You can now print this to your PyCharm console with no modification.

E.g.

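# tmp_path must be a unicode object; join it so isfile() checks the right folder
files = [n for n in os.listdir(tmp_path) if os.path.isfile(os.path.join(tmp_path, n))]
f = files[0]
print f

Alastair McCormack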
  • Thanks for the explanation! I'm reading files from the folder so I don't really have the choice of using the 'u' prefix with a string literal. Regardless, your answer pointed me in the right direction and I think I have a slightly better understanding of the encoding topic now thanks to [link](http://nedbatchelder.com/text/unipain.html). I will mark this as the solution. – kerwei Aug 26 '16 at 08:38
  • No problem. The point is, the very first time you pass the directory you want to list to `listdir()`, use a `u''` literal or a Unicode object created by other means. The results will then be Unicode. – Alastair McCormack Aug 26 '16 at 08:41

As for Question 2, I did not investigate further because of time constraints; I just wrote the output as unicode strings rather than xlwt objects. I am able to continue with the project, though without understanding what went wrong here. In that sense, both questions above have been dealt with.

kerwei