I am getting a str from BeautifulSoup that contains characters escaped in \xXX notation, and I need to decode it into a regular str.

Example:

next_url = r'\x26hl\x3den'

After conversion, I want:

next_url = '&hl=en'

It seemed simple at first, but I have not found a solution after an hour of searching. What is a good way to do this?

EDIT: adding some code in response to the comments. It is really simple.

import requests
from bs4 import BeautifulSoup
session = requests.Session()
r = session.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
# handles: onclick="window.location='http:domain.com/path'"
next_url = soup.find(class_='XXXX')['onclick'].split('=', 1)[1][1:-1]

next_url needs to be decoded.

r.v
  • You mean you have a str that contains Python style byte literals (escape sequences)? – Alastair McCormack Jun 26 '18 at 21:33
  • Exactly, you got it right. – r.v Jun 26 '18 at 21:34
  • Sounds strange. Where are you reading the data from? – Alastair McCormack Jun 26 '18 at 21:35
  • I am scraping some websites with beautiful soup, URLs are embedded in javascript with these hex sequences, and I need to unescape them to be able to use these URLs. – r.v Jun 26 '18 at 21:36
  • https://stackoverflow.com/questions/4020539/process-escape-sequences-in-a-string-in-python – Finlay McWalter Jun 26 '18 at 21:37
  • I doubt the actual data contains Python byte literals. I'm guessing you've done a `str()` somewhere. It's better to fix the root cause than to patch around it. Please provide your code. (https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem) – Alastair McCormack Jun 26 '18 at 21:38
  • The website data does not have anything related to python, but the escape sequences are a general standard and can be decoded in any language. – r.v Jun 26 '18 at 21:40
  • A web browser does not know how to decode Python style (`\x`) byte literals. Show your code and the website you're crawling from. – Alastair McCormack Jun 26 '18 at 21:44
  • See the code in the edit. Note that what you are calling python style is also javascript style and it just works in the browser – r.v Jun 26 '18 at 21:46
  • `'\x26hl\x3den'` is already `&hl=en`. – user2357112 Jun 26 '18 at 21:50
  • @user2357112 your comment is true only when the string is embedded in python code (I believe that the python parser will do the conversion) but not true if it is coming as an input – r.v Jun 26 '18 at 21:53
  • @r.v ah, ok - it's something in the JS you want to read. Please provide output of `repr(next_url)` – Alastair McCormack Jun 26 '18 at 21:54
  • Are there quotes around the output of repr? – Alastair McCormack Jun 26 '18 at 22:01
  • Yes, single quotes. This is a `str` – r.v Jun 26 '18 at 22:02
  • To write a function `decode` such that `decode('\x26hl\x3den') == '&hl=en'`, you'd just do `def decode(string): return string`. If you read a string with backslashes in it from a web request, that's very different from writing `decode('\x26hl\x3den')`. – user2357112 Jun 26 '18 at 22:04
  • You sure? A str should have double quotes. Please paste as it's shown – Alastair McCormack Jun 26 '18 at 22:05
  • Also, you say you're decoding a bytestring in your title, but then in the body, you say it's not `bytes`. – user2357112 Jun 26 '18 at 22:06
  • @user2357112 judging by the example given later, he probably meant `r'\x26hl\x3den'` - notice the `r`. – Mark Ransom Jun 26 '18 at 22:06
  • @MarkRansom is right – r.v Jun 26 '18 at 22:07
  • @AlastairMcCormack if you `print(repr(s))` there is only a single quote around the content of `s`. Of course `repr(s)` itself is a string so you would get double quotes if just writing it on the interactive REPL. For example, `>>> print(repr('hello'))` gives me `'hello'`. – r.v Jun 26 '18 at 22:09
  • Apologies, I was comparing to an interactive session. – Alastair McCormack Jun 26 '18 at 22:12

1 Answer

You have a str in which the escape sequences are spelled out as literal text (a backslash, an x, and two hex digits). Use the codecs module with the unicode-escape codec to unescape them.

import codecs
codecs.decode(r'\x26hl\x3den', 'unicode-escape')
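
For comparison, here is a minimal sketch of two ways to do the conversion, using the example string from the question. The helper names `unescape_with_codec` and `unescape_hex_only` are made up for illustration and are not part of any library. The second helper is an alternative technique (a regular expression that rewrites only `\xXX` sequences), not something the codecs module provides.

```python
import codecs
import re


def unescape_with_codec(s):
    """Decode backslash escapes in a str using the unicode_escape codec."""
    return codecs.decode(s, 'unicode_escape')


# Matches a literal backslash, an 'x', and exactly two hex digits.
_HEX_ESCAPE = re.compile(r'\\x([0-9A-Fa-f]{2})')


def unescape_hex_only(s):
    """Replace only \\xXX sequences; every other character is left as-is."""
    return _HEX_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), s)


# The raw-string prefix keeps the backslashes as literal characters,
# which mimics text scraped from a page.
next_url = r'\x26hl\x3den'

print(unescape_with_codec(next_url))  # &hl=en
print(unescape_hex_only(next_url))    # &hl=en
```

As far as I know, the unicode-escape codec treats its input as Latin-1, so non-ASCII characters that appear in the same string can come out mangled. If the scraped URLs may contain such characters, the regex version is probably the safer choice, since it touches nothing except the `\xXX` sequences.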
Alastair McCormack