I have a CSV file (see here) that contains metadata from posts on a public Facebook page. I need to decode all content such as \xc3\xa9 and \xf0\x9f\x91\xa9\xf0\x9f\x8f\xbb\xe2\x80\x8d\xf0\x9f\x92\xbc

The meta data "post message" is:

"b'Bom dia, genteee! Me disseram que esse emoji \xc3\xa9 a minha cara: \xf0\x9f\x91\xa9\xf0\x9f\x8f\xbb\xe2\x80\x8d\xf0\x9f\x92\xbc\nO que voc\xc3\xaas acham?'"

and its type is str.

I need to convert it to:

Bom dia, genteee! Me disseram que esse emoji é a minha cara: 👩🏻‍💼 O que vocês acham?

How do I do this? I need to convert the whole CSV.

edit 1: I tried

My_string = post_message.split("b'")[1].split("'")[0]
My_string.encode().decode('unicode_escape')

but the result is different from what I expected:

Bom dia, genteee! Me disseram que esse emoji Ã© a minha cara: ð©ð»âð¼ O que vocÃªs acham?

Solution:

As @Ben pointed out, my data is a str object that contains the text representation of bytes, not a bytes object. So I used @ShadowRanger's solution (see his answer here). I did:

My_string = post_message[2:-1]  # strip the leading "b'" and the trailing "'"
# Turn the \xNN escapes into bytes, then decode those bytes as UTF-8
My_string = My_string.encode('utf-8').decode('unicode_escape').encode('latin-1').decode('utf-8')

The result:

Bom dia, genteee! Me disseram que esse emoji é a minha cara: 👩🏻‍💼 O que vocês acham?
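
For anyone who wants to apply the same chain to every row, here is a minimal sketch that wraps it in a function. The name `decode_bytes_repr` is made up for illustration, and it assumes each cell looks like the sample above: the text of a bytes literal whose escapes encode UTF-8.

```python
def decode_bytes_repr(text: str) -> str:
    """Hypothetical helper: turn the text "b'...\\xc3\\xa9...'" into readable text."""
    # Strip the b'...' or b"..." wrapper only when it is really there.
    if len(text) >= 3 and text[0] == "b" and text[1] in "'\"" and text[-1] == text[1]:
        text = text[2:-1]
    return (
        text.encode("utf-8")        # str -> bytes, escapes still spelled out
        .decode("unicode_escape")   # each \xNN becomes the code point U+00NN
        .encode("latin-1")          # code points U+0000..U+00FF -> the same byte values
        .decode("utf-8")            # finally read those bytes as UTF-8
    )


# Raw string, so the backslashes are literal characters, as in the CSV.
sample = r"b'Bom dia, genteee! Me disseram que esse emoji \xc3\xa9 a minha cara: \xf0\x9f\x91\xa9\xf0\x9f\x8f\xbb\xe2\x80\x8d\xf0\x9f\x92\xbc\nO que voc\xc3\xaas acham?'"
print(decode_bytes_repr(sample))
```

With pandas, something like `df["post_message"].map(decode_bytes_repr)` should cover a whole column, where `post_message` is only an assumed column name. Missing values would need to be handled first, since the function expects a str.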

  • Possible duplicate of [how do I .decode('string-escape') in Python3?](https://stackoverflow.com/questions/14820429/how-do-i-decodestring-escape-in-python3) – ShadowRanger Aug 18 '18 at 02:19
  • I tried that solution, but it didn't work as I expected. – Guilherme Henrique Mendes Aug 18 '18 at 02:23
  • Ah, your particular case is the `repr` of a `bytes` object, missed that. [This answer to a similar question addresses that scenario](https://stackoverflow.com/a/1885211/364696). Can't edit my duplicate vote now, but this is a duplicate of that question (the only difference is that theirs is the `repr` of a `str`, yours is the `repr` of a `bytes`). – ShadowRanger Aug 18 '18 at 02:50
  • @ShadowRanger please post [your answer](https://stackoverflow.com/questions/4020539/process-escape-sequences-in-a-string-in-python/51904799#51904799) here so that I can accept it as the solution. You solved my problem. – Guilherme Henrique Mendes Aug 18 '18 at 11:52
  • "The meta data "post message" is:" Please show *how you get this data*. The problem should be fixed in that process instead. – Karl Knechtel Aug 05 '22 at 02:24

1 Answer

I notice that the string you posted looks like "b'...'", with double quotes around a single quoted string with b prefixed. That looks like a string containing the text representation of a bytestring, as opposed to a bytestring being printed as text.

For example:

>>> text = 'föő'
>>> text
'föő'
>>> bytestring = text.encode()
>>> bytestring
b'f\xc3\xb6\xc5\x91'
>>> str(bytestring)
"b'f\\xc3\\xb6\\xc5\\x91'"

It suggests you had a bytestring at some point and called str on it (or something similar) to turn it into a text string. That gives you the text representation of the bytestring, not the text that the bytestring is the encoding of.

However, if that theory were entirely correct, you would have doubled backslashes, as you can see in my example above. So it doesn't entirely fit, if the data is exactly as you showed in the OP.

However, it still looks like code at some point had bytes and converted them to text incorrectly. I would strongly recommend you fix this by finding where that is happening and fixing it, rather than trying to correct this data after the fact.
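
As a rough illustration of what "fixing it at the source" could look like, here is a sketch that assumes the CSV was produced by Python code holding bytes. The variable names and the sample message are invented; the only point is the difference between `str(...)` and `.decode(...)` at the moment the row is written.

```python
import csv
import io

# Hypothetical bytes as they might arrive from a scraper or an API.
raw_message = "O que vocês acham?".encode("utf-8")

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow([str(raw_message)])             # stores the text of the bytes literal
writer.writerow([raw_message.decode("utf-8")])  # stores the text the bytes encode

bad_row, good_row = buffer.getvalue().splitlines()
print(bad_row)   # b'O que voc\xc3\xaas acham?'
print(good_row)  # O que vocês acham?
```

The first row has the same shape as the data in the question (single backslashes once it is written to the file), which is why decoding before writing avoids the problem altogether.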

Ben
  • Or if it happened elsewhere not under the OP's control (handed a text file with that garbage), use `ast.literal_eval` to reverse it. – ShadowRanger Aug 18 '18 at 02:47
  • Thanks for the suggestion. pandas was reading the CSV as str rather than bytes. – Guilherme Henrique Mendes Aug 18 '18 at 02:52
  • @GuilhermeHenriqueMendes Were you able to get pandas to read it as bytes instead? – Ben Aug 18 '18 at 02:56
  • @Ben I'm looking into it. – Guilherme Henrique Mendes Aug 18 '18 at 03:08
  • @ShadowRanger `ast.literal_eval` raises an error: 'ValueError('malformed node or string: ' + repr(node))'. It looks like the CSV has some data that the library can't parse. – Guilherme Henrique Mendes Aug 18 '18 at 03:11
  • @GuilhermeHenriqueMendes Yes, that's when I noticed that the text you gave is not actually the repr of a bytestring, since that would be displayed with escaped backslashes in it. But I get the error you quote only when I try to `ast.literal_eval` the bytestring; when I try the string value you posted, I get `SyntaxError: EOL while scanning string literal`. If you've actually got the bytes, just use `.decode()` – Ben Aug 18 '18 at 03:16
  • I finally got the bytes. However, plain decode() still keeps the backslash escapes, and decode('unicode_escape') doesn't display correctly. E.g. decode() shows voc\xc3\xaas and decode('unicode_escape') shows vocÃªs, but I need vocês. – Guilherme Henrique Mendes Aug 18 '18 at 03:25
  • @GuilhermeHenriqueMendes The bytes shown in your question can be decoded into the string you wanted (modulo the section with the 3 undisplayed box characters being displayed differently in my terminal). Are the bytes you have the same bytes shown there? – Ben Aug 18 '18 at 03:28
  • Although it sounds like you would be better off asking a new question, showing some of the CSV data and asking how to get it loaded the way you want (into a pandas dataframe I presume?). – Ben Aug 18 '18 at 03:30
  • In fact, I need the data to analyze how many words each post message has, evaluate emotions, count characters, and do other textual analysis. It doesn't necessarily have to be in a data frame. – Guilherme Henrique Mendes Aug 18 '18 at 03:39
  • @GuilhermeHenriqueMendes: If `unicode-escape` is problematic because the data is UTF-8, sounds like you need [my new answer here](https://stackoverflow.com/a/51904799/364696). – ShadowRanger Aug 18 '18 at 03:47
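
The `ast.literal_eval` route discussed in the comments can be sketched as below. The helper name `bytes_repr_to_text` is hypothetical, and the fallback behaviour (return the cell unchanged when it cannot be parsed) is an assumption about what is wanted. One possible cause of the ValueError mentioned above is a non-string cell such as a missing value, so the sketch skips those.

```python
import ast


def bytes_repr_to_text(value):
    """Hypothetical helper: recover text from a cell holding "b'...'"."""
    # Leave missing values (e.g. NaN) and ordinary text untouched.
    if not isinstance(value, str) or not value.startswith(("b'", 'b"')):
        return value
    try:
        parsed = ast.literal_eval(value)  # safely evaluates the bytes literal
    except (ValueError, SyntaxError):
        return value                      # not a valid literal; keep the original
    # Assumes the bytes are UTF-8; invalid UTF-8 would raise UnicodeDecodeError.
    return parsed.decode("utf-8") if isinstance(parsed, bytes) else value


print(bytes_repr_to_text(r"b'O que voc\xc3\xaas acham?'"))
```

A cell that contains a real line break or unescaped non-ASCII characters is not a valid bytes literal, so it comes back unchanged here; the encode/decode chain from the question may be the better fit for such rows.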