
I'm working on an application that uses UTF-8 encoding. For debugging purposes I need to print the text. If I call print() directly on the variable containing my Unicode string, e.g. print(pred_str), I get this error:

UnicodeEncodeError: 'charmap' codec can't encode character '\ufeff' in position 0: character maps to <undefined>

So I tried print(pred_str.encode('utf-8')) and my output looks like this:

b'\xef\xbb\xbfpudgala-dharma-nair\xc4\x81tmyayo\xe1\xb8\xa5 apratipanna-vipratipann\xc4\x81n\xc4\x81m' b'avipar\xc4\xabta-pudgala-dharma-nair\xc4\x81tmya-pratip\xc4\x81dana-artham' b'tri\xe1\xb9\x83\xc5\x9bik\xc4\x81-vij\xc3\xb1apti-prakara\xe1\xb9\x87a-\xc4\x81rambha\xe1\xb8\xa5' b'pudgala-dharma-nair\xc4\x81tmya-pratip\xc4\x81danam punar kle\xc5\x9ba-j\xc3\xb1eya-\xc4\x81vara\xe1\xb9\x87a-prah\xc4\x81\xe1\xb9\x87a-artham'

But I want my output to look like this:

pudgala-dharma-nairātmyayoḥ apratipanna-vipratipannānām aviparīta-pudgala-dharma-nairātmya-pratipādana-artham triṃśikā-vijñapti-prakaraṇa-ārambhaḥ pudgala-dharma-nairātmya-pratipādanam punar kleśa-jñeya-āvaraṇa-prahāṇa-artham

If I save my string to a file using:

import codecs

with codecs.open('out.txt', 'w', 'UTF-8') as f:
    f.write(pred_str)

the string is saved as expected.

snakecharmerb
h s

2 Answers


Your data is encoded with the "UTF-8-SIG" codec, which is sometimes used in Microsoft environments.

This variant of UTF-8 prefixes the encoded text with a byte order mark (the bytes b'\xef\xbb\xbf', which decode to the character '\ufeff'), to make it easier for applications to distinguish UTF-8 encoded text from other encodings.
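As a minimal sketch of where the '\ufeff' in the error message comes from (the sample string and variable names here are illustrative, not taken from the question's code): decoding BOM-prefixed bytes with plain 'utf-8' keeps the mark as the first character, while 'utf-8-sig' strips it.

```python
import codecs

text = "nairātmya"  # illustrative sample, not the asker's data

# Encoding with utf-8-sig prepends the three BOM bytes b'\xef\xbb\xbf'.
with_bom = text.encode("utf-8-sig")
print(with_bom.startswith(codecs.BOM_UTF8))  # True

# Plain utf-8 decoding keeps the BOM as the character U+FEFF ...
decoded_plain = with_bom.decode("utf-8")
print(ascii(decoded_plain[0]))  # '\ufeff'

# ... while utf-8-sig removes it during decoding.
decoded_sig = with_bom.decode("utf-8-sig")
print(decoded_sig == text)  # True
```

This suggests that pred_str in the question was probably produced by decoding with 'utf-8' rather than 'utf-8-sig', though the question does not show that step.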

You can decode such bytestrings like this:

>>> bs = b'\xef\xbb\xbfpudgala-dharma-nair\xc4\x81tmyayo\xe1\xb8\xa5 apratipanna-vipratipann\xc4\x81n\xc4\x81m'
>>> text = bs.decode('utf-8-sig')
>>> print(text)
pudgala-dharma-nairātmyayoḥ apratipanna-vipratipannānām

To read such data from a file:

with open('myfile.txt', 'r', encoding='utf-8-sig') as f:
    text = f.read()

Note that even after decoding from UTF-8-SIG, you may still be unable to print your data, because your console's default code page may not be able to encode the other non-ASCII characters in the data. In that case you will need to adjust your console settings to support UTF-8.
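A rough way to check this from Python is sketched below. The helper names can_encode and console_safe are hypothetical, and 'cp1252' is only an assumed example of a Windows code page; your console may use a different one (see sys.stdout.encoding). The fallback escapes unencodable characters instead of raising, which is often enough for debugging output.

```python
import sys


def can_encode(text, encoding):
    """Return True if every character of text is representable in encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


def console_safe(text, encoding):
    """Return text with characters the encoding cannot represent
    replaced by backslash escapes such as \\u0101."""
    return text.encode(encoding, errors="backslashreplace").decode(encoding)


sample = "nairātmya"  # illustrative sample
print(can_encode(sample, "cp1252"))   # False: 'ā' is not in cp1252
print(console_safe(sample, "cp1252"))  # nair\u0101tmya

# Escape against whatever encoding the current stdout reports.
stdout_encoding = getattr(sys.stdout, "encoding", None) or "ascii"
print(console_safe(sample, stdout_encoding))
```

On Python 3.7 or later, calling sys.stdout.reconfigure(encoding="utf-8") or setting the PYTHONIOENCODING environment variable to utf-8 are alternatives that make Python emit UTF-8, but the console itself still has to display it correctly.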

tripleee
snakecharmerb
  • Oh my... it took me hours of trying to encode and decode in multiple ways to finally understand that Excel was using its own flavor of UTF-8. Thanks for the answer! – Vincent Teyssier Dec 27 '21 at 03:26

Try this code:

if pred_str.startswith('\ufeff'):
    # Drop only the leading BOM; split('\ufeff')[1] would also cut the text at any later '\ufeff'.
    pred_str = pred_str[1:]