I am aware that there are plenty of discussions on the "UTF-8" encoding issue on Python 2 but I was unable to find a solution to my problem so far. I am currently creating a script to get the name of a file and hyperlink it in xlwt, so that the file can be accessed by clicks in the spreadsheet. Problem is, some of the names of these files include non-ASCII characters.
Question 1
I used the following line to retrieve the name of the file. There is only one file in the folder by the way.
>>f = filter(os.path.isfile, os.listdir(tmp_path))[0]
And then
>>print f
'521001ldrAvisoAcionistas(Retifica\xe7\xe3o)_doc'
>>print sys.stdout.encoding
'UTF-8'
>>f.decode("UTF-8")
*** UnicodeDecodeError: 'utf8' codec can't decode byte 0xe7 in position 76: invalid continuation byte
From browsing the discussions here, I realized that "\xe7\xe3o" is not a "UTF-8" encoding. Running the following line seems to back this point.
>>f.decode("latin-1")
u'521001ldrAvisoAcionistas(Retifica\xe7\xe3o)_doc'
My question is then, why is the variable f being encoded in "latin-1" when the system encoding is set to "UTF-8"?
Question 2
While f.decode("latin-1") gives me the output that I want, I am still unable to supply the variable to the hyperlink function in the spreadsheet.
>>data.append(["File", xlwt.Formula('HYPERLINK("%s";"%s")' % (os.path.join(dl_path,f.decode("latin-1")),f.decode("latin-1")))])
*** FormulaParseException: can't parse formula HYPERLINK("u'H:\\Mad Lab\\SE Doc Crawler\\bovespa\\download\\521001ldrAvisoAcionistas(Retifica\xe7\xe3o)_doc's;"u'521001ldrAvisoAcionistas(Retifica\xe7\xe3o)_doc's)
Apparently, the closing double quote got eaten up and was replaced by a " 's" suffix. Can somebody help to explain what's going on here? 0.0
Oh and if someone can suggest a solution to Question 2 above then I will be very grateful - for you would have saved my weekend from misery!
Thanks in advance all!