1

I have a URL that links to a JavaScript file, for example http://something.com/../x.js. I need to extract a variable from x.js.

Is it possible to do this using Python? At the moment I am using urllib2.urlopen(), but when I call .read() I get this lovely mess:

U�(��%y�d�<�!���P��&Y��iX���O�������<Xy�CH{]^7e� �K�\�͌h��,U(9\ni�A ��2dp}�9���t�<M�M,u�N��h�bʄ�uV�\��0�A1��Q�.)�A��XNc��$"SkD�y����5�)�B�t9�):�^6��`(���d��hH=9D5wwK'�E�j%�]U~��0U�~ʻ��)�pj��aA�?;n�px`�r�/8<?;�t��z�{��n��W
�s�������h8����i�߸#}���}&�M�K�y��h�z�6,�Xc��!:'D|�s��,�g$�Y��H�T^#`r����f����tB��7��X�%�.X\��M9V[Z�Yl�LZ[ZM�F���`D�=ޘ5�A�0�){Ce�L*�k���������5����"�A��Y�}���t��X�(�O�̓�[�{���T�V��?:�s�i���ڶ�8m��6b��d$��j}��u�D&RL�[0>~x�jچ7�

When I look at the DOM in the dev tools, the only thing in the body is a string wrapped in tags. In the regular view, that string is a JSON object.

msturdy
EasilyBaffled

2 Answers

4

.read() should give you the same thing you see in the "view source" window of your browser, so something's wrong. It looks like the HTTP response might be gzipped, and urllib2 doesn't decompress gzip for you. urllib2 also doesn't request gzipped data, so if this is the problem, the server is probably misconfigured, but I'm assuming that's out of your control.
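If you do have to stay with the standard library, the body can be decompressed by hand. This is only a minimal sketch, assuming the mess really is gzip: decode_body and GZIP_MAGIC are hypothetical names made up for illustration, and the check relies on the two magic bytes that start every gzip stream.

```python
import zlib

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def decode_body(raw, content_encoding=""):
    """Return the decompressed body if it looks gzipped, else return it unchanged.

    Hypothetical helper: `raw` is the bytes returned by .read(), and
    `content_encoding` is the value of the Content-Encoding header, if any.
    """
    encoding = (content_encoding or "").strip().lower()
    if encoding == "gzip" or raw[:2] == GZIP_MAGIC:
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer
        return zlib.decompress(raw, 16 + zlib.MAX_WBITS)
    return raw
```

With urllib2 you would pass it the result of .read() and, if you want to be explicit, the response's Content-Encoding header. If the body is not gzip at all, it comes back untouched and the problem lies elsewhere.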

I suggest using requests instead. requests automatically decompresses gzip-encoded responses, so it should solve this problem for you.

import requests
r = requests.get('https://something.com/x.js')
r.text   # unparsed json output, shouldn't be garbled
r.json() # parses json and returns a dictionary

In general, requests is much easier to use than urllib2, so I suggest using it everywhere unless you absolutely must stick to the standard library.

sjy
  • So it's almost there. r.text gets me the string, but r.json() fails with "ValueError: No JSON object could be decoded" and r.text.json() fails with "AttributeError: 'unicode' object has no attribute 'json'". – EasilyBaffled Mar 13 '14 at 02:18
  • Any chance you could share the _actual_ url - it would make this much easier to troubleshoot! – Hugh Bothwell Mar 13 '14 at 02:24
  • Your URL probably returns JavaScript like `var data = {foo: "bar"}` rather than raw JSON. In that case, you'll need some string manipulation to pull out the JSON before parsing it with `json.loads(s)`. That could be as simple as `s = r.text` followed by `s = s[s.find("{"):s.rfind("}") + 1]`, if the object runs from the first `{` to the last `}` in the document, but it could be much more complex. JSON is also slightly stricter than JavaScript object literal syntax: for example, you'll run into issues if the original JavaScript used `'` instead of `"`, or left keys unquoted. – sjy Mar 13 '14 at 02:59
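The extraction described in the last comment could look roughly like the sketch below. It assumes the file contains an assignment such as `var data = {...};` and that the literal on the right-hand side is already strict JSON (double-quoted keys and strings); extract_json_object and the variable name "data" are hypothetical, chosen only for illustration.

```python
import json
import re


def extract_json_object(js_source, var_name):
    """Parse the JSON value assigned to `var_name` in a piece of JavaScript.

    Hypothetical helper: only works when the assigned literal is valid JSON.
    """
    # Find "var <name> =" (or let/const) and stop right before the value
    pattern = r"\b(?:var|let|const)\s+" + re.escape(var_name) + r"\s*=\s*"
    match = re.search(pattern, js_source)
    if match is None:
        raise ValueError("variable %r not found" % var_name)
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it (the ";" and the rest of the script)
    value, _end = json.JSONDecoder().raw_decode(js_source, match.end())
    return value
```

Letting the JSON decoder find the end of the value avoids counting braces by hand, so nested objects work. A literal like `{foo: "bar"}` with unquoted keys would still raise ValueError and need a more tolerant parser.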
0
import json
import urllib2

# Only works if the response body is plain, uncompressed JSON
js = urllib2.urlopen("http://something.com/../x.js").read()
data = json.loads(js)
Hugh Bothwell