
The string in Arabic:

\u0627\u0644\u0634\u0628 \u0639\u0627\u0645\u0644 \u0634\u0648\u0641\u064a\u0631 \u0627\u0635\u0646\u0635\u064a\u0631 \u0648\u0639\u0645 \u064a\u062d\u0627\u0633\u0628 \u0627\u0644\u0637\u0627\u0644\u0639 \u0648\u0627\u0644\u0646\u0627\u0632\u0644 \u0634\u0648\u0641\u0648 \u0634\u0648 \u0635\u0627\u0631 \u0627\u062e\u0631 \u0634\u064a \ud83d\ude02\u2764\ufe0f

The real text is: الشب عامل شوفير اصنصير وعم يحاسب الطالع والنازل شوفو شو صار اخر شي 😂❤️

My code:

thumm= '\u0627\u0644\u0634\u0628 \u0639\u0627\u0645\u0644 \u0634\u0648\u0641\u064a\u0631 \u0627\u0635\u0646\u0635\u064a\u0631 \u0648\u0639\u0645 \u064a\u062d\u0627\u0633\u0628 \u0627\u0644\u0637\u0627\u0644\u0639 \u0648\u0627\u0644\u0646\u0627\u0632\u0644 \u0634\u0648\u0641\u0648 \u0634\u0648 \u0635\u0627\u0631 \u0627\u062e\u0631 \u0634\u064a \ud83d\ude02\u2764\ufe0f'

Text= bytes(thumm, "utf-8").decode("unicode_escape",errors='surrogatepass')
print(Text)

Every time I run the code, I get this error:

UnicodeEncodeError: 'utf-8' codec can't encode characters in position 67-68: surrogates not allowed

I have tried every solution I could think of.

Can anyone help?

Adriaan
  • The error means the file is *not* UTF8, or at least, not something that can be handled by the `utf-8` encoding. UTF8 isn't some kind of escape sequence. This page is UTF8, which is why you were able to post the Arabic text and the emojis. Python strings are Unicode already, which means you **can** just type `thumm = "الشب عامل شوفير اصنصير وعم يحاسب الطالع والنازل شوفو شو صار اخر شي 😂❤️"` – Panagiotis Kanavos Jan 09 '23 at 13:00
  • What is the *actual* string? Not the escape sequences, but the actual string? Where does it come from? A file? Source code? An API response? Why convert it to a byte buffer named `Text` instead of just using the string? Those escape sequences are how the *debugger* displays text, not the actual contents of the string – Panagiotis Kanavos Jan 09 '23 at 13:01
  • The string comes from a Facebook scrape. I want to print the title so that it appears as follows: الشب عامل شوفير اصنصير وعم يحاسب الطالع والنازل شوفو شو صار اخر شي 😂❤️ – Stephen Matthews Jan 09 '23 at 13:05
  • In that case don't try to "fix" anything. The string is already Unicode. Facebook, Stack Overflow and frankly almost all web sites, use UTF8 already. Your scraping code already received the FB page content and decoded it. – Panagiotis Kanavos Jan 09 '23 at 13:07
  • Why did you try to use `bytes(thumm, "utf-8").decode(` in the first place? What did you think this would fix? What's the *actual* problem? – Panagiotis Kanavos Jan 09 '23 at 13:07
  • The actual string is `\u0627\u0644\u0634\u0628 \u0639\u0627\u0645\u0644 \u0634\u0648\u0641\u064a\u0631 \u0627\u0635\u0646\u0635\u064a\u0631 \u0648\u0639\u0645 \u064a\u062d\u0627\u0633\u0628 \u0627\u0644\u0637\u0627\u0644\u0639 \u0648\u0627\u0644\u0646\u0627\u0632\u0644 \u0634\u0648\u0641\u0648 \u0634\u0648 \u0635\u0627\u0631 \u0627\u062e\u0631 \u0634\u064a \ud83d\ude02\u2764\ufe0f`. I think there is a problem with decoding the emojis. – Stephen Matthews Jan 09 '23 at 13:08
  • No it's not. That's an escape sequence produced by your debugger or editor, not Unicode. There's no problem decoding emojis. `>>> thumm = " الشب عامل شوفير اصنصير وعم يحاسب الطالع والنازل شوفو شو صار اخر شي 😂❤️" >>> print(thumm)` works just fine. In my terminal, pasting that exact string displays error characters for the string *in the editor*, but actually running the two lines prints the smiley. – Panagiotis Kanavos Jan 09 '23 at 13:09
  • If you insist that Unicode is escape sequences, open your browser's Developer tools with F12 and inspect the string you posted in your question. There are no escape sequences anywhere. – Panagiotis Kanavos Jan 09 '23 at 13:12
  • Please don't make more work for other people by vandalizing your posts. By posting on the Stack Exchange network, you've granted a non-revocable right, under the [CC BY-SA 4.0 license](https://creativecommons.org/licenses/by-sa/4.0/), for Stack Exchange to distribute that content (i.e. regardless of your future choices). By Stack Exchange policy, the non-vandalized version of the post is the one which is distributed. Thus, any vandalism will be reverted. If you want to know more about deleting a post please see: [How does deleting work?](https://meta.stackexchange.com/q/5221) – Adriaan Jan 09 '23 at 14:07

1 Answer


Unicode, and UTF-8 specifically, isn't some kind of escape sequence. Python 3 strings are Unicode already, just like this page. Stack Overflow serves its pages as UTF-8, which is why the question can contain Arabic text and emojis without any special escaping.
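As a rough illustration of that point, the sketch below writes the first word of the title twice, once typed directly and once with `\u` escapes, and compares the two. The variable names are made up for the example. It only shows that escapes are a notation in source code, while UTF-8 is what you get when the string is turned into bytes.

```python
# The same four letters written two ways: typed directly and as \u escapes.
typed = "الشب"
escaped = "\u0627\u0644\u0634\u0628"

# Python resolves the escapes when it reads the literal, so both
# names end up holding equal strings of four code points.
same = typed == escaped

# UTF-8 only enters the picture when the string is converted to bytes.
# Each of these Arabic letters takes two bytes in UTF-8.
utf8_bytes = typed.encode("utf-8")

print(same, len(escaped), len(utf8_bytes))
```

Decoding `utf8_bytes` with `"utf-8"` gives back the original string, so no extra "unescaping" step is involved.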

The fact that Python 3 strings are Unicode means you can just type

thumm='الشب عامل شوفير اصنصير وعم يحاسب الطالع والنازل شوفو شو صار اخر شي 😂❤️'
print(thumm)

The output will be

الشب عامل شوفير اصنصير وعم يحاسب الطالع والنازل شوفو شو صار اخر شي 😂❤️

Whether the terminal or editor you use can display all characters is another matter. The text itself is fine though.
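One way to check the text without relying on what the terminal can draw is to look at the code points themselves. This is just a sketch using the last few characters of the title; `text`, `names` and `stdout_encoding` are names chosen for the example.

```python
import sys
import unicodedata

# The tail of the title: two Arabic letters, a space, and the two emojis.
text = "شي 😂❤️"

# Official Unicode names describe the content independently of any font.
names = [unicodedata.name(ch, "UNNAMED") for ch in text]

# The encoding Python uses when writing to this terminal
# (can be None if stdout has been replaced by something else).
stdout_encoding = getattr(sys.stdout, "encoding", None)

# ascii() gives an escape-only view that any terminal can show.
print(ascii(text))
print(names)
print(stdout_encoding)
```

If the names come out right, the string is intact and any odd glyphs on screen are a display problem, not a data problem.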

In my case, in a Windows Terminal shell, the pasted text appeared as:

 thumm = "الشب عامل شوفير اصنصير وعم يحاسب الطالع والنازل شوفو شو صار اخر شي ��❤️"

but the output was still

الشب عامل شوفير اصنصير وعم يحاسب الطالع والنازل شوفو شو صار اخر شي 😂❤️

That's because the terminal had trouble displaying the smiley.

In VS Code, the code appeared just fine.

thumm='الشب عامل شوفير اصنصير وعم يحاسب الطالع والنازل شوفو شو صار اخر شي 😂❤️'
print(thumm)

Executing it produced the same output.
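If the scraped string really does hold the smiley as two separate surrogate code points, which the "surrogates not allowed" message in the question suggests, one possible repair is a round trip through UTF-16. This is only a sketch: the variable names are made up, and whether the scraper actually returns data in this shape is an assumption.

```python
# Assumed shape of the scraped data: the emoji arrives as a surrogate pair
# (two code points) instead of the single code point U+1F602.
scraped = "\u0634\u064a \ud83d\ude02\u2764\ufe0f"

# "surrogatepass" lets the surrogates through the encoder; decoding the
# UTF-16 bytes then joins the high/low pair into one real character.
repaired = scraped.encode("utf-16", "surrogatepass").decode("utf-16")

# The repaired string can now be encoded as UTF-8 without an error.
utf8_bytes = repaired.encode("utf-8")

print(len(scraped), len(repaired))  # prints: 7 6
```

When the string is typed or received as proper Unicode in the first place, as in the examples above, none of this is needed.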

Panagiotis Kanavos