I am parsing multiple worksheets of unicode data and building a dictionary from specific cells in each sheet, but I am having trouble decoding the unicode data. A small snippet of the code is below:

for key in shtDict:
    sht = wb[key] 
    for row in sht.iter_rows('A:A',row_offset = 1):
        for cell in row:
            if isinstance(cell.value,unicode):
                if "INC" in cell.value:
                    shtDict[key] = cell.value

The output of this section is:

{'60071508': u'\ufeffReason: INC8595939', '60074426': u'\ufeffReason. Ref INC8610481', '60071539': u'\ufeffReason: INC8603621'}

Following the question "u'\ufeff' in Python string", I tried to decode the data properly by changing the last line to:

shtDict[key] = cell.value.decode('utf-8-sig')

But I get the following error:

Traceback (most recent call last):
  File "", line 55, in <module>
    shtDict[key] = cell.value.decode('utf-8-sig')
  File "C:\Python27\lib\encodings\utf_8_sig.py", line 22, in decode
    (output, consumed) = codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 0: ordinal not in range(128)

I'm not sure what the issue is. I have also tried decoding with 'utf-16', but I get the same error. Can anyone help with this?

mickNeill
    You use `decode()` to go from an encoded string to unicode. Hence, you don't need to try and decode anything that is already unicode. – Charlie Clark Apr 26 '18 at 13:18

1 Answer

Just make it simpler: the BOM (byte order mark, U+FEFF) carries no meaning here, so just strip it from the value.

shtDict[key] = cell.value.replace(u'\ufeff', '', 1)

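Note: cell.value is already of type unicode (you just checked it with isinstance), so there is nothing left to decode. In Python 2, calling decode() on a unicode object first implicitly encodes it with the 'ascii' codec, which is why you get a UnicodeEncodeError rather than a decode error.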
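As a rough sketch of the same idea, the snippet below removes only a leading BOM and contrasts it with the case where 'utf-8-sig' does apply, namely raw bytes. The helper name `strip_bom` and the `raw` byte string are made up for illustration; the sample text is taken from the output in the question.

```python
# -*- coding: utf-8 -*-
BOM = u'\ufeff'  # U+FEFF, the byte order mark as a unicode character


def strip_bom(text):
    """Return text without one leading U+FEFF, if present (hypothetical helper)."""
    if text.startswith(BOM):
        return text[len(BOM):]
    return text


# Already-decoded text, as returned by cell.value in the question.
sample = u'\ufeffReason: INC8595939'
cleaned = strip_bom(sample)

# Decoding only applies to bytes: here 'utf-8-sig' drops the BOM for us.
# This byte string is an assumed example, not data from the workbook.
raw = b'\xef\xbb\xbfReason: INC8595939'
decoded = raw.decode('utf-8-sig')

print(repr(cleaned))
print(cleaned == decoded)
```

Unlike replace(), startswith() leaves a U+FEFF in the middle of the text alone, which may or may not be what you want. In the loop from the question this would be used as shtDict[key] = strip_bom(cell.value).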

Giacomo Catenazzi