
I am trying to decode the string below:

st='arroz e feij\xc3o, bife ao molho de tomate, pts com quiabo\r\nSALADA: alface, r\xdacula, rabanete e cebola\r\nSOBREMESA: ma\xc7\xc3\r\nSUCO: Amarelo 3\r\n o card\xc1pio cont\xc9m gl\xdaten no p\xc3o. n\xc3o cont\xc9m ovos e lactose. traga sua caneca!'

Using:

st.decode(?)

But I don't know the correct codec.

hildogjr
  • I can't guess encodings on-the-fly, maybe someone else has this fascinating capability. Have you tried `'utf8'`? – ForceBru Apr 03 '17 at 15:38
  • If you're using Windows, try `'mbcs'` which takes the current OS code page. – Mark Ransom Apr 03 '17 at 15:39
  • What should the decoded string look like? – Felix Apr 03 '17 at 15:39
  • @ForceBru you can tell it's not utf8 because the hex bytes are singular. Utf8 will always have 2-4 hex bytes in a row. – Mark Ransom Apr 03 '17 at 15:40
  • No, I'm using Linux. I got this string from a web page using urllib. The correctly decoded text is: 'arroz e feijão, bife ao molho de tomate, pts com quiabo\r\nSALADA: alface, r\xdacula, rabanete e cebola\r\nSOBREMESA: maçã\r\nSUCO: Amarelo 3\r\n o cardápio contêm glúten no pão. não contêm ovos e lactose. traga sua caneca!'. I tried 'utf-8' and 'ascii'. – hildogjr Apr 03 '17 at 15:55
  • This is latin1. Check this out: https://www.python.org/dev/peps/pep-0223/. `st.decode('latin1')` gives you a decoded unicode string. – Dmitry Shilyaev Apr 03 '17 at 15:59
  • I tried this codec too. The only difference is that, when I print the string, it starts with the "u" prefix, but the result doesn't change: it still contains the "\x**" escapes. I could search for each "\x**" and replace them one by one with the right character, but I don't think that is the best way. – hildogjr Apr 03 '17 at 22:11
  • I found a solution after checking the codecs used in [link](https://pypi.python.org/pypi/chardet). The code that correctly decodes this parsed HTML text is `tex = tex.decode('windows-1252').lower().encode('utf-8')` – hildogjr Apr 05 '17 at 02:58
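
The points raised in the comments can be checked with a short sketch. It uses a shortened sample of the bytes from the question, written as a `b''` literal so that it runs on Python 3 as well as Python 2.6+; the variable names (`raw`, `is_utf8`, `same`, `lowered`) are illustrative and not from the original post. It is a minimal illustration of why UTF-8 fails and why `latin1` and `windows-1252` give the same result for this particular input, not a general way to detect encodings.

```python
# Shortened sample of the byte string from the question.
raw = b'arroz e feij\xc3o\r\nSOBREMESA: ma\xc7\xc3\r\n o card\xc1pio cont\xc9m gl\xdaten no p\xc3o.'

# UTF-8 rejects the data: 0xC3 opens a two-byte sequence, but the
# following byte ('o') is not a valid continuation byte.
try:
    raw.decode('utf-8')
    is_utf8 = True
except UnicodeDecodeError:
    is_utf8 = False

# latin1 and windows-1252 map every byte used here to the same
# character, so both codecs produce identical text for this input.
text = raw.decode('windows-1252')
same = (text == raw.decode('latin1'))

# The accented bytes are upper-case letters (0xC3 is A-tilde), which is
# why the solution in the comment above also calls .lower().
lowered = text.lower()
```

On Python 2, a display such as `u'feij\xc3o'` is the `repr` of the decoded unicode object, so the `\x**` escapes seen after decoding are a display convention rather than leftover undecoded bytes.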

0 Answers