Part of a website I'm trying to scrape has this weird block of hex values instead of characters. How can I decode this with Python?

I am using urllib.request to get the page source.

The page is http://www.filmovita.com/despicable-2-2013/ and here is the markup:

<span id='engimadiv5523da4eae533' data-enigmav='\u0032\u002e\u0020\u0056\u0065\u0072\u007a\u0069\u006a\u0061\u003a\u003c\u0062\u0072\u0020\u002f\u003e\u000a\u003c\u0073\u0074\u0072\u006f\u006e\u0067\u003e\u003c\u0061\u0020\u0068\u0072\u0065\u0066\u003d\u0022\u0068\u0074\u0074\u0070\u003a\u002f\u002f\u0077\u0077\u0077\u002e\u0063\u006c\u006f\u0075\u0064\u007a\u0069\u006c\u006c\u0061\u002e\u0074\u006f\u002f\u0073\u0068\u0061\u0072\u0065\u002f\u0066\u0069\u006c\u0065\u002f\u0047\u0042\u0047\u0059\u0031\u0030\u0036\u0031\u0053\u004c\u0053\u0032\u004c\u0035\u004b\u004b\u0058\u0030\u0049\u0054\u0054\u0047\u004f\u004c\u0047\u002f\u0022\u0020\u0074\u0061\u0072\u0067\u0065\u0074\u003d\u0022\u005f\u0062\u006c\u0061\u006e\u006b\u0022\u0020\u0072\u0065\u006c\u003d\u0022\u006e\u006f\u0066\u006f\u006c\u006c\u006f\u0077\u0022\u003e\u0047\u006c\u0065\u0064\u0061\u006a\u0020\u006e\u0061\u0020\u0076\u0069\u0064\u0065\u006f\u0020\u0073\u0065\u0072\u0076\u0069\u0073\u0075\u0020\u0043\u006c\u006f\u0075\u0064\u005a\u0069\u006c\u006c\u0061\u002e\u0074\u006f\u003c\u002f\u0061\u003e\u003c\u002f\u0073\u0074\u0072\u006f\u006e\u0067\u003e\u003c\u002f\u0070\u003e\u000a\u003c\u0070\u003e\u0033\u002e\u0020\u0056\u0065\u0072\u007a\u0069\u006a\u0061\u003a\u003c\u0062\u0072\u0020\u002f\u003e\u000a\u003c\u0073\u0074\u0072\u006f\u006e\u0067\u003e\u003c\u0061\u0020\u0068\u0072\u0065\u0066\u003d\u0022\u0068\u0074\u0074\u0070\u003a\u002f\u002f\u0066\u0069\u006c\u0065\u0068\u006f\u006f\u0074\u002e\u0063\u006f\u006d\u002f\u0068\u0063\u006b\u0074\u0032\u007a\u0072\u0033\u0039\u0079\u0064\u0038\u002e\u0068\u0074\u006d\u006c\u0022\u0020\u0074\u0061\u0072\u0067\u0065\u0074\u003d\u0022\u005f\u0062\u006c\u0061\u006e\u006b\u0022\u0020\u0072\u0065\u006c\u003d\u0022\u006e\u006f\u0066\u006f\u006c\u006c\u006f\u0077\u0022\u003e\u0047\u006c\u0065\u0064\u0061\u006a\u0020\u006e\u0061\u0020\u0076\u0069\u0064\u0065\u006f\u0020\u0073\u0065\u0072\u0076\u0069\u0073\u0075\u0020\u0046\u0069\u006c\u0065\u0048\u006f\u006f\u0074\u
002e\u0063\u006f\u006d\u003c\u002f\u0061\u003e\u003c\u002f\u0073\u0074\u0072\u006f\u006e\u0067\u003e\u003c\u002f\u0070\u003e\u000a\u003c\u0070\u003e\u0034\u002e\u0020\u0056\u0065\u0072\u007a\u0069\u006a\u0061\u003a\u003c\u0062\u0072\u0020\u002f\u003e\u000a\u003c\u0073\u0074\u0072\u006f\u006e\u0067\u003e\u003c\u0061\u0020\u0068\u0072\u0065\u0066\u003d\u0022\u0068\u0074\u0074\u0070\u003a\u002f\u002f\u0077\u0077\u0077\u002e\u0076\u0069\u0064\u0065\u006f\u0077\u006f\u006f\u0064\u002e\u0074\u0076\u002f\u0076\u0069\u0064\u0065\u006f\u002f\u0034\u0033\u006c\u0068\u0022\u0020\u0074\u0061\u0072\u0067\u0065\u0074\u003d\u0022\u005f\u0062\u006c\u0061\u006e\u006b\u0022\u0020\u0072\u0065\u006c\u003d\u0022\u006e\u006f\u0066\u006f\u006c\u006c\u006f\u0077\u0022\u003e\u0047\u006c\u0065\u0064\u0061\u006a\u0020\u006e\u0061\u0020\u0076\u0069\u0064\u0065\u006f\u0020\u0073\u0065\u0072\u0076\u0069\u0073\u0075\u0020\u0056\u0069\u0064\u0065\u006f\u0057\u006f\u006f\u0064\u002e\u0074\u0076\u003c\u002f\u0061\u003e\u003c\u002f\u0073\u0074\u0072\u006f\u006e\u0067\u003e\u003c\u002f\u0070\u003e\u000a\u003c\u0070\u003e\u0035\u002e\u0020\u0056\u0065\u0072\u007a\u0069\u006a\u0061\u003a\u003c\u0062\u0072\u0020\u002f\u003e\u000a\u003c\u0073\u0074\u0072\u006f\u006e\u0067\u003e\u003c\u0061\u0020\u0068\u0072\u0065\u0066\u003d\u0022\u0068\u0074\u0074\u0070\u003a\u002f\u002f\u0073\u0074\u0072\u0065\u0061\u006d\u0069\u006e\u002e\u0074\u006f\u002f\u0061\u0073\u006b\u0062\u0061\u0031\u0064\u0070\u0078\u0073\u0078\u0063\u0022\u0020\u0074\u0061\u0072\u0067\u0065\u0074\u003d\u0022\u005f\u0062\u006c\u0061\u006e\u006b\u0022\u0020\u0072\u0065\u006c\u003d\u0022\u006e\u006f\u0066\u006f\u006c\u006c\u006f\u0077\u0022\u003e\u0047\u006c\u0065\u0064\u0061\u006a\u0020\u006e\u0061\u0020\u0076\u0069\u0064\u0065\u006f\u0020\u0073\u0065\u0072\u0076\u0069\u0073\u0075\u0020\u0053\u0074\u0072\u0065\u0061\u006d\u0069\u006e\u002e\u0074\u006f\u003c\u002f\u0061\u003e\u003c\u002f\u0073\u0074\u0072\u006f\u006e\u0067\u003e\u003c\u002f\u00
70\u003e\u000a\u003c\u0070\u003e\u0036\u002e\u0020\u0056\u0065\u0072\u007a\u0069\u006a\u0061\u003a\u003c\u0062\u0072\u0020\u002f\u003e\u000a\u003c\u0073\u0074\u0072\u006f\u006e\u0067\u003e\u003c\u0061\u0020\u0068\u0072\u0065\u0066\u003d\u0022\u0068\u0074\u0074\u0070\u003a\u002f\u002f\u0070\u006c\u0061\u0079\u0065\u0064\u002e\u0074\u006f\u002f\u0072\u0076\u006b\u0036\u0034\u0068\u0039\u006d\u006a\u006f\u006e\u0062\u0022\u0020\u0074\u0061\u0072\u0067\u0065\u0074\u003d\u0022\u005f\u0062\u006c\u0061\u006e\u006b\u0022\u0020\u0072\u0065\u006c\u003d\u0022\u006e\u006f\u0066\u006f\u006c\u006c\u006f\u0077\u0022\u003e\u0047\u006c\u0065\u0064\u0061\u006a\u0020\u006e\u0061\u0020\u0076\u0069\u0064\u0065\u006f\u0020\u0073\u0065\u0072\u0076\u0069\u0073\u0075\u0020\u0050\u006c\u0061\u0079\u0065\u0064\u002e\u0074\u006f\u003c\u002f\u0061\u003e\u003c\u002f\u0073\u0074\u0072\u006f\u006e\u0067\u003e\u003c\u002f\u0070\u003e\u000a\u003c\u0070\u003e\u0037\u002e\u0020\u0056\u0065\u0072\u007a\u0069\u006a\u0061\u003a\u003c\u0062\u0072\u0020\u002f\u003e\u000a\u003c\u0073\u0074\u0072\u006f\u006e\u0067\u003e\u003c\u0061\u0020\u0068\u0072\u0065\u0066\u003d\u0022\u0068\u0074\u0074\u0070\u003a\u002f\u002f\u0077\u0077\u0077\u002e\u0074\u0068\u0065\u0076\u0069\u0064\u0065\u006f\u002e\u006d\u0065\u002f\u0076\u006b\u0074\u0075\u0035\u006a\u006f\u006f\u007a\u0062\u0069\u0064\u0022\u0020\u0074\u0061\u0072\u0067\u0065\u0074\u003d\u0022\u005f\u0062\u006c\u0061\u006e\u006b\u0022\u0020\u0072\u0065\u006c\u003d\u0022\u006e\u006f\u0066\u006f\u006c\u006c\u006f\u0077\u0022\u003e\u0047\u006c\u0065\u0064\u0061\u006a\u0020\u006e\u0061\u0020\u0076\u0069\u0064\u0065\u006f\u0020\u0073\u0065\u0072\u0076\u0069\u0073\u0075\u0020\u0054\u0068\u0065\u0056\u0069\u0064\u0065\u006f\u002e\u006d\u0065\u003c\u002f\u0061\u003e\u003c\u002f\u0073\u0074\u0072\u006f\u006e\u0067\u003e\u003c\u002f\u0070\u003e\u000a\u003c\u0070\u003e\u0038\u002e\u0020\u0056\u0065\u0072\u007a\u0069\u006a\u0061\u003a\u003c\u0062\u0072\u0020\u002f\u003e\u000a
\u003c\u0073\u0074\u0072\u006f\u006e\u0067\u003e\u003c\u0061\u0020\u0068\u0072\u0065\u0066\u003d\u0022\u0068\u0074\u0074\u0070\u003a\u002f\u002f\u0076\u006f\u0064\u006c\u006f\u0063\u006b\u0065\u0072\u002e\u0063\u006f\u006d\u002f\u006c\u0065\u0068\u0036\u007a\u0064\u0037\u0031\u0075\u0032\u0036\u0063\u0022\u0020\u0074\u0061\u0072\u0067\u0065\u0074\u003d\u0022\u005f\u0062\u006c\u0061\u006e\u006b\u0022\u0020\u0072\u0065\u006c\u003d\u0022\u006e\u006f\u0066\u006f\u006c\u006c\u006f\u0077\u0022\u003e\u0047\u006c\u0065\u0064\u0061\u006a\u0020\u006e\u0061\u0020\u0076\u0069\u0064\u0065\u006f\u0020\u0073\u0065\u0072\u0076\u0069\u0073\u0075\u0020\u0056\u006f\u0064\u004c\u006f\u0063\u006b\u0065\u0072\u002e\u0063\u006f\u006d\u003c\u002f\u0061\u003e\u003c\u002f\u0073\u0074\u0072\u006f\u006e\u0067\u003e' data-enigmad='n'></span>

Thanks!

Natko Kraševac
  • Which module are you using? – Ajay Apr 07 '15 at 18:51
  • I edited the question. I am using only urllib.request, but I also tried Beautiful Soup. – Natko Kraševac Apr 07 '15 at 18:55
  • Is it an English website or in another language? – Ajay Apr 07 '15 at 19:03
  • It is in Croatian, but the charset is UTF-8. Other characters are decoded fine; I think this is some kind of encryption of theirs, because it contains links to movies. But when I inspect the element in Chrome, this block is decoded correctly and I can see the text. I would like to be able to decode it in Python too. You can see the decoded text here: http://ddecode.com/hexdecoder/?results=571d6c87a87c9fb183cce9813e963c48 – Natko Kraševac Apr 07 '15 at 19:10
  • Well, try BeautifulSoup instead of urllib. – Ajay Apr 07 '15 at 20:02
  • I tried, but still the same story – Natko Kraševac Apr 07 '15 at 20:15
  • @NatkoKraševac Please don't just send the link to Ajay's address. Instead update the question with the link so that anyone reading the question can help. – three_pineapples Apr 08 '15 at 11:51
  • Possible duplicate of [Convert XML/HTML Entities into Unicode String in Python](http://stackoverflow.com/questions/57708/convert-xml-html-entities-into-unicode-string-in-python) – Paul Sweatte Dec 06 '16 at 21:28
  • @PaulSweatte: these are **not** HTML entities. – Martijn Pieters Dec 06 '16 at 21:56
  • The URL source does *not* contain this span. Are you sure it isn't loaded with AJAX? I can't even find it in the rendered DOM. – Martijn Pieters Dec 06 '16 at 21:57
  • `\u` is the token for unicode whereas `\x` is the token for hex. For example: try this in the browser console: `'\x22' + '\u0022'`. – Paul Sweatte Dec 06 '16 at 23:03

1 Answer

You have a JavaScript Unicode escape sequence, much like the escape sequences used in Python.

Decode the sequence as JSON; JSON is a subset of JavaScript and a JSON parser knows how to deal with the peculiarities of the JS string literal format (like handling UTF-16 surrogate pairs).
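As a small illustration of that point, here is a sketch using a made-up escaped string (the sample value is an assumption, not taken from the page). It shows `json.loads` decoding plain `\uXXXX` escapes and joining a UTF-16 surrogate pair into a single character:

```python
import json

# Hypothetical sample: 'Hi ' followed by U+1F600, written as a UTF-16
# surrogate pair the way a JavaScript string literal would escape it.
escaped = r'\u0048\u0069\u0020\ud83d\ude00'

# Wrap the raw escapes in double quotes to form a JSON string literal.
decoded = json.loads('"{}"'.format(escaped))

print(len(decoded))                # 4: three ASCII characters plus one emoji
print(decoded == 'Hi \U0001F600')  # True: the surrogate pair became one code point
```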

Use BeautifulSoup first to parse out the element, then JSON-decode the string value (adding in quotes first):

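from bs4 import BeautifulSoup
import json
import requests

# Assumption: the page is fetched with requests, so response.content is the raw HTML
response = requests.get('http://www.filmovita.com/despicable-2-2013/')
soup = BeautifulSoup(response.content, 'lxml')
# Find the span by its id attribute
span = soup.find(id='engimadiv5523da4eae533')
# The attribute value is the literal text of the \uXXXX escapes
jsdata = span['data-enigmav']
# Wrap it in double quotes to make a JSON string literal, then decode
data = json.loads('"{}"'.format(jsdata))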

Demo:

>>> span = soup.find('span')
>>> jsdata = span['data-enigmav']
>>> data = json.loads('"{}"'.format(jsdata))
>>> print(data)
2. Verzija:<br />
<strong><a href="http://www.cloudzilla.to/share/file/GBGY1061SLS2L5KKX0ITTGOLG/" target="_blank" rel="nofollow">Gledaj na video servisu CloudZilla.to</a></strong></p>
<p>3. Verzija:<br />
<strong><a href="http://filehoot.com/hckt2zr39yd8.html" target="_blank" rel="nofollow">Gledaj na video servisu FileHoot.com</a></strong></p>
<p>4. Verzija:<br />
<strong><a href="http://www.videowood.tv/video/43lh" target="_blank" rel="nofollow">Gledaj na video servisu VideoWood.tv</a></strong></p>
<p>5. Verzija:<br />
<strong><a href="http://streamin.to/askba1dpxsxc" target="_blank" rel="nofollow">Gledaj na video servisu Streamin.to</a></strong></p>
<p>6. Verzija:<br />
<strong><a href="http://played.to/rvk64h9mjonb" target="_blank" rel="nofollow">Gledaj na video servisu Played.to</a></strong></p>
<p>7. Verzija:<br />
<strong><a href="http://www.thevideo.me/vktu5joozbid" target="_blank" rel="nofollow">Gledaj na video servisu TheVideo.me</a></strong></p>
<p>8. Verzija:<br />
<strong><a href="http://vodlocker.com/leh6zd71u26c" target="_blank" rel="nofollow">Gledaj na video servisu VodLocker.com</a></strong>
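
If Beautiful Soup is not installed, the same idea can be sketched with only the standard library. This is a minimal sketch, not the code used above: the `EnigmaParser` class, the `decode_js_escapes` helper, and the shortened markup are all hypothetical names and data made up for illustration; only the `data-enigmav` attribute name comes from the question.

```python
import json
from html.parser import HTMLParser


class EnigmaParser(HTMLParser):
    """Collect the data-enigmav attribute of every tag that carries one."""

    def __init__(self):
        super().__init__()
        self.values = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lowercased.
        for name, value in attrs:
            if name == 'data-enigmav' and value is not None:
                self.values.append(value)


def decode_js_escapes(raw):
    # Wrap the escapes in double quotes so they form a JSON string literal.
    return json.loads('"{}"'.format(raw))


# Hypothetical, shortened markup in the same shape as the span in the question.
html = (r"<span id='demo' data-enigmav='\u0032\u002e\u0020\u0056\u0065"
        r"\u0072\u007a\u0069\u006a\u0061\u003a' data-enigmad='n'></span>")

parser = EnigmaParser()
parser.feed(html)
decoded = [decode_js_escapes(value) for value in parser.values]
print(decoded)  # ['2. Verzija:']
```

The quote-wrapping trick relies on every special character in the attribute being escaped, as it is here; a value containing a literal double quote or a stray backslash would need extra handling before being passed to `json.loads`.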
Martijn Pieters