How to read excel Unicode characters using Python

Question

I am receiving an Excel file whose content I cannot influence. It contains some Unicode characters like "á" or "é".

My code has been unchanged, but I migrated from Eclipse Juno to LiClipse together to a migration to a different python package (2.6 from 2.5). In principle the specific package I am using has a working version on win32com package.

When I read the Excel file my code is crashing when extracting and converting to to strings using str(). The console output is the following:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 89: ordinal not in range(128)

Being more concrete I perform the following:

Read the Excel:

  xlApp = Dispatch("Excel.Application")

  excel = xlApp.Workbooks.Open(excel_location)

in an internal loop I extract the value of the cell:

cell_value = self.excel.ActiveSheet.Cells(excel_line + 1, excel_column + 1)

and finally, if I try to convert cell_value to str, crashes:

print str(cell_value)

If I go to the Excel and remove the non-ASCII characters everything is working smoothly. I have tried this encode proposal. Any other solution I have googled proposes saving the file in a specific format, that I can't do.

What puzzles me is that the code was working before with the same input Excel but this change to LiClipse and 2.6 Python killed everything.

Any idea how can I progress?

@PadraicCunningham when I created my code was the proposed solution I read. I tried already unicode but is not working. Do you have another proposal? — Trebia Project., Apr 21 '15 at 09:28
@TrebiaProject., it is already unicode. What happens just printing cell_value? — Padraic Cunningham, Apr 21 '15 at 09:49
@AaronDigulla repr() provides back > and not the actual cell value, thanks anyhow. — Trebia Project., Apr 21 '15 at 09:52
@PadraicCunningham is giving the exception I mentioned before, not printing it and killing the exceution. — Trebia Project., Apr 21 '15 at 09:54
@PadraicCunningham could you be more explicit in your question? Calling str() where? As in my problem description I am calling it to variable cell_value already and the call of str() is the one providing the exception. Do you refer to repr()? That will provide the same value. — Trebia Project., Apr 21 '15 at 09:59
I meant what happens when you `print(cell_value)` with no string call. — Padraic Cunningham, Apr 21 '15 at 10:00
@PadraicCunningham same error. Thanks a lot for your effort. — Trebia Project., Apr 21 '15 at 10:08
when you used `print(cell_value.encode("utf-8"))` you also did not use str yes? — Padraic Cunningham, Apr 21 '15 at 10:13
@PadraicCunningham enconde is not recognized as a method of cell_value. — Trebia Project., Apr 21 '15 at 13:20
@Huan-YuTseng that was one of the suggestions I found investigating... is not working — Trebia Project., Apr 21 '15 at 13:21
@TrebiaProject., is there a .Value attribute or some other attribute that returns the value for the cell? — Padraic Cunningham, Apr 21 '15 at 13:30
@PadraicCunningham yes... what I am doing with all the cells but 3 or 4. Acutally if I modify one by one the cells everything works fine... but I can do it with the hundred of files I need to process. — Trebia Project., Apr 21 '15 at 13:40
I am not overly familiar with the how it works, what exactly does `cell_value` represent? — Padraic Cunningham, Apr 21 '15 at 13:49
Maybe this link helps: [http://www.ianbicking.org/illusive-setdefaultencoding.html](http://www.ianbicking.org/illusive-setdefaultencoding.html) — Huan-Yu Tseng, Apr 21 '15 at 15:43
@Huan-YuTseng that did not help me directly, but allow me to find the solution giving me a hint!!!! The problem looks like to be in LiClipse — Trebia Project., Apr 22 '15 at 07:15

score 3 · Accepted Answer · edited May 23 '17 at 11:52

This is a common problem when working with UTF-8 encoded Unicode data in Python 2.x. The handling of this has changed in a few places between 2.4 and 2.7, so it's no surprise that you suddenly get an error.

The source of the error is print: In Python 2.x, print doesn't try to assume what encoding your terminal supports. It just plays save and assumes that ascii is the only supported charset (which means characters between 0 and 127 are fine, everything else gives an error).

Now you convert a COMObject to a string. str is just a bunch of bytes (values 0 to 255) as far as Python 2.x is concerned. It doesn't have an encoding.

Combining the two is a recipe for trouble. When Python prints, it tries to validate the input (the string) and suddenly finds UTF-8 encoded characters (UTF-8 adds these odd \xe1 markers which tells the decoder that the next byte is special in some way; check Wikipedia for the gory details).

That's when the ascii encoder says: Sorry, can't help you there.

That means you can work with this value, compare it and such, but you can't print it. A simple fix for the printing problem is:

s = str(cell_value) # Convert COM -> UTF-8 encoded string
print repr(s) # repr() converts anything to ascii

If your terminal supports UTF-8, then you need to tell Python about it:

import sys
import codecs

sys.stdout = codecs.getwriter('utf8')(sys.stdout)

You should also have a look at sys.stdout.encoding which tells what Python currently thinks the output encoding is/should be. When Python 2 is properly configured (like on modern Linux distributions), then the correct codec for output should be used automatically.

None of this is working. Actually I already tried some of them before declaring this problem. — Trebia Project., Apr 21 '15 at 13:38
Please watch the talk "Pragmatic Unicode" which I linked. It should help you understand why it breaks. I also suggests to update to the latest Python 2.7. — Aaron Digulla, Apr 21 '15 at 14:15
I read it and anything described is already available in other places. Upgrading to python 2.7 is a major task.... is more a hack than a solution — Trebia Project., Apr 22 '15 at 07:08
actually the solution is your very last link, but when I introduced it in LiClipse gave me an error and I made a double check executing code with error!. Just to help potential readers with the same problem I made more explicit your solution, but you gave the solution! Solved! — Trebia Project., Apr 22 '15 at 07:28

Mark Tolonen · Answer 2 · 2015-04-22T13:47:39.023

3

.Cells(row,col) returns a Range object. You probably want the text from the cell:

cell = xl.ActiveSheet.Cells(1,2).Text

or

cell = xl.ActiveSheet.Range('B1').Text

The resulting value will be a Unicode string. To convert to bytes that you can write to a file, use .encode(encoding), for example:

bytes = cell.encode('utf8')

The below example uses the following spreadsheet:

enter image description here

import win32com.client
xl = win32com.client.gencache.EnsureDispatch('Excel.Application')
xl.Workbooks.Open(r'book1.xlsx')
cell = xl.ActiveSheet.Cells(1,2)
cell_value = cell.Text
print repr(cell)
print repr(cell_value)
print cell_value

Output (Note, Chinese will only print if console/IDE supports the characters):

<win32com.gen_py.Microsoft Excel 14.0 Object Library.Range instance at 0x129909424>
u'\u4e2d\u56fd\u4eba'
中国人

edited Apr 22 '15 at 13:47

answered Apr 21 '15 at 15:53

Mark Tolonen

166,664
26
169
251

Unfourtunately is not working. Just in case somebody is reading this in the future I modified in your solution one small thing that is not working. cell does not allow to call the Value method. – Trebia Project. Apr 22 '15 at 07:10
@TrebiaProject., describe "not working" because I tested it and that is a cut-n-paste of the output. As you can see, it returns a `Range` object, which Microsoft documents as supporting the `Value` property: https://msdn.microsoft.com/EN-US/library/office/ff838238.aspx – Mark Tolonen Apr 22 '15 at 13:34
@TrebiaProject, `cell.Text` is also appropriate and described in the MSDN docs as returning a string. `cell.Value` returns a Variant that may not be handled correctly on your older Python/pywin32 installation. I'm using Python 2.7/pywin32 219. I've updated my answer to use `.Text`. – Mark Tolonen Apr 22 '15 at 13:46
my mistake, I should have been more explicit. The problem keeps the same, exactly same output. I found that the proble is actually the console. – Trebia Project. Apr 22 '15 at 13:56

score 2 · Answer 3 · edited May 23 '17 at 12:31

2

What is described here is a hack, you should not use as a long term solution. Looking at the comments it could crush the terminal.

Finally I found a solution helped by the suggestion that @Huan-YuTseng provided, probably the solutions offered by other might work in other context but not in this one.

So, what happened is that I migrated from Eclipse Juno version (as Pydev stopped working due to Java upgrade needed that I can't accomplish in this computer) to LiClipse direct package (I did not upgraded a downloaded Eclipse version).

By default, in my LiClipse version (1.4.0.201502042042) the Console output is not by default utf-8. So I needed to change the output from either LiClipse or using my code. Fourtunately, there was another question related to a similar problem that helped me. You can see more details here, but essentially what you need to do is to include at the begginning of your code the following code:

import sys
reload(sys)
sys.setdefaultencoding('utf-8')

And everything works. In the answers from @AarongDigulla the solution is there, but is actually the very last solution.

However, I need to say that LiClipse is giving me an error on sys.setdefaultencoding statement, that during execution is not creating any issue... no idea what's happening. That stopped me testing this solution before. Maybe there is something wrong in LiClipse (is alowing me to execute code with errors!)

edited May 23 '17 at 12:31

Community

1
1

answered Apr 22 '15 at 07:27

Trebia Project.

930
2
17
36

It shouldn't be necessary to do this in the Python script. Since LiClipse is a successor of PyDev, does this help? http://stackoverflow.com/questions/3023972/printing-unicode-in-eclipse-pydev-console-and-in-idle – Aaron Digulla Apr 22 '15 at 09:11
@AaronDigulla thanks, actually that's part of the problem, several options available in Eclipse are not in LiClipse. For example it is not possible to select Run Configuration (which shocked me), the default configuration of LiClipse and Eclipse are different. – Trebia Project. Apr 22 '15 at 12:08
This changes the default for all Python modules and is not recommended. Some modules rely on this default. – Mark Tolonen Apr 22 '15 at 13:26
@TrebiaProject, if fact, in the code of my answer, trying out the `reload(sys)` trick in my IDE made the IDE stop working...a good example of why not to change the default. It did, however, make a `Range` object printable, but the right way is to use `cell.Text` to retrieve the text of the `Range` object. – Mark Tolonen Apr 22 '15 at 13:50
@MarkTolonen it is working for me at the moment. So far, I have not been able to find the exact LiClipse configuration ot make it work... still investigating, when I find it I will provide the info for hte community but I am unblocked on the short term. – Trebia Project. Apr 22 '15 at 13:57
@MarkTolonen I will make that comment explicit in the answer in case anybody sees it. – Trebia Project. Apr 22 '15 at 13:58

score 0 · Answer 4 · answered Jun 12 '19 at 11:19

0

Use 'utf-8 BOM' which in python used as utf_8_sig for Unicode character & also to avoid irrelevant results in Excel sheet.

answered Jun 12 '19 at 11:19

Sijin John

492
7
15

How to read excel Unicode characters using Python

4 Answers4

Linked