I have a string that I get from a function:

>>> example = Some_function()

Some_function returns a very long string that mixes ASCII and non-ASCII Unicode characters, like 'gn1\ud123a\ud123\ud123\ud123\ud919\ud123\ud123'.
My problem is that when I try to convert this string to bytes, I get an error saying that \ud919 cannot be encoded by utf-8. I tried:

>>> further=bytes(example,encoding='utf-8')

Note: I cannot ignore this \ud919. Is there a way to solve this problem? Alternatively, how can I convert 'gn1\ud123a\ud123\ud123\ud123\ud919\ud123\ud123' to 'gn1\ud123a\ud123\ud123\ud123\\ud919\ud123\ud123', so that \ud919 is treated as plain text rather than as a Unicode escape?

Yash Makan

2 Answers

It depends on the Python version.
Python 2.x: print type(unicode_string), repr(unicode_string)
Python 3.x: print(type(unicode_string), ascii(unicode_string))
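
As a rough Python 3 sketch of that inspection step: ascii() shows every non-ASCII character as an escape, and a simple range check on the code points lists the lone surrogates. The sample string and the names sample and lone_surrogates are made up for illustration and are not from the question.

```python
# Hypothetical shortened sample; the real string comes from Some_function().
sample = 'gn1\ud123a\ud919'

# ascii() escapes every non-ASCII character, so the output is safe to print.
print(type(sample), ascii(sample))

# Surrogate code points occupy the range U+D800..U+DFFF.
lone_surrogates = [
    (index, hex(ord(ch)))
    for index, ch in enumerate(sample)
    if 0xD800 <= ord(ch) <= 0xDFFF
]
print(lone_surrogates)
```

Here only \ud919 falls in the surrogate range; \ud123 is an ordinary character that utf-8 encodes without complaint.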

  • Hello and welcome to SO! Please read the [tour](https://stackoverflow.com/tour), and [How do I write a good answer?](https://stackoverflow.com/help/how-to-answer) – Tomer Shetah Feb 07 '21 at 13:47

\ud919 is a surrogate code point, and one does not simply encode it: the strict utf-8 codec rejects lone surrogates. Use the surrogatepass error handler:

>>> 'gn1\ud123a\ud123\ud123\ud123\ud919\ud123\ud123'.encode('utf-8', 'surrogatepass')
b'gn1\xed\x84\xa3a\xed\x84\xa3\xed\x84\xa3\xed\x84\xa3\xed\xa4\x99\xed\x84\xa3\xed\x84\xa3'
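
A minimal sketch of the options, assuming Python 3. It shows that surrogatepass round-trips as long as the same handler is used for decoding. The backslashreplace part is an alternative that is not from the answer above: it is another standard error handler, and it writes the unencodable character out as the literal text \ud919, which appears to be what the question asks for. The variable names are illustrative only.

```python
text = 'gn1\ud123a\ud123\ud123\ud123\ud919\ud123\ud123'

# Strict encoding (the default) rejects the lone surrogate \ud919.
strict_error = None
try:
    text.encode('utf-8')
except UnicodeEncodeError as exc:
    strict_error = exc

# surrogatepass encodes the surrogate as if it were a normal code point.
# Decoding needs the same handler, otherwise the bytes are rejected.
raw = text.encode('utf-8', 'surrogatepass')
restored = raw.decode('utf-8', 'surrogatepass')

# backslashreplace replaces only the unencodable character with a
# backslash escape written out as plain ASCII text.
escaped = text.encode('utf-8', 'backslashreplace')
as_plain_text = escaped.decode('utf-8')
```

Note that bytes produced with surrogatepass are not valid UTF-8 for other consumers, while the backslashreplace output is valid UTF-8 but no longer distinguishes the surrogate from a literal backslash sequence already present in the text.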
Alderven