
When I do this:

s = response.xpath('//meta[@id="_bootstrap-neighborhood_card"]').extract()

what I get back is:

<meta content='{"hosting":{"id":2256573,"offset_lat":39.04258923718809,"offset_lng":-95.69083697887662},"map_url":"https://maps.googleapis.com/maps/api/staticmap?markers=%2C&amp;size&amp;zoom=14","place_recommendations":[],"neighborhood_breadcrumb_details":[{"link_text":"Southwest Fillmore Street,","search_text":"Southwest Fillmore Street Topeka, KS","link":"&lt;span&gt;Southwest Fillmore Street,&lt;/span&gt;","link_route":"/s/Southwest-Fillmore-Street-Topeka--KS"},{"link_text":"Topeka,","search_text":"Topeka, KS","link":"&lt;span&gt;Topeka,&lt;/span&gt;","link_route":"/s/Topeka--KS"},{"link_text":"Kansas,","search_text":"Kansas, United States","link":"&lt;span&gt;Kansas,&lt;/span&gt;","link_route":"/s/Kansas--United-States"},{"link_text":"United States","search_text":"United States","link":"&lt;span&gt;United States&lt;/span&gt;","link_route":"/s/United-States"}],"neighborhood_basic_info":null,"neighborhood_localized_name":null,"user_info":{"user_image":"&lt;img alt=\"Elizabeth\" data-pin-nopin=\"true\" height=\"90\" src=\"https://a0.muscache.com/im/users/9199018/profile_pic/1380782460/original.jpg?aki_policy=profile_x_medium\" title=\"Elizabeth\" width=\"90\" /&gt;"}}' id="_bootstrap-neighborhood_card">

This is clearly JSON, but it's encoded (as you can see). I tried urllib.unquote, but that throws an error: AttributeError: 'list' object has no attribute 'split'

I was hoping to not have to resort to using a regex to do the URL decoding. What can I do (besides using a regex) to make this valid JSON?

Jeroen
  • That's not URL encoded. URL encoding looks like `This%20is%20a%20test`. What you have there is HTML using HTML character entities like `&lt;` for `<`. And there are already lots of good answers on how to deal with that. – larsks Mar 05 '16 at 02:17
  • Possible duplicate of [Decode HTML entities in Python string?](http://stackoverflow.com/questions/2087370/decode-html-entities-in-python-string) – larsks Mar 05 '16 at 02:17
  • Did you try `eval('('+String+')')`? Those are HTML entities. Your problem is that it is a String. You need to make it into code. – StackSlave Mar 05 '16 at 02:44
  • @larsks you are right, I reworded the title. – Jeroen Mar 05 '16 at 02:58
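
To illustrate the distinction raised in the comments, here is a minimal Python 3 sketch using only the standard library. The `raw` string is a shortened, made-up stand-in for the attribute text, with an assumed `example.invalid` URL. It shows that `urllib.parse.unquote` only touches `%XX` escapes, while `html.unescape` handles entities such as `&lt;` and `&amp;`:

```python
import html
import json
from urllib.parse import unquote

# URL (percent) encoding: this is what unquote() is meant for.
print(unquote("This%20is%20a%20test"))

# Hypothetical, shortened sample of entity-encoded JSON text
# (not the full attribute value from the question).
raw = (
    '{"link":"&lt;span&gt;Topeka,&lt;/span&gt;",'
    '"map_url":"https://example.invalid/staticmap?markers=%2C&amp;zoom=14"}'
)

# html.unescape() replaces HTML character entities; %2C is left alone.
decoded = html.unescape(raw)
data = json.loads(decoded)
print(data["link"])
print(data["map_url"])
```

Unescaping by hand like this is only needed when you hold the raw markup as text. An HTML parser (including Scrapy's selectors) already returns attribute values with entities decoded.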

2 Answers


Get the value of the content attribute and load it via json.loads():

>>> import json
>>> content = response.xpath('//meta[@id="_bootstrap-neighborhood_card"]/@content').extract_first()
>>> json.loads(content)

Note that you also need to use extract_first() instead of extract() to get a string value and not a list.
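
The same idea can be tried outside a Scrapy shell. The following is a rough Python 3 sketch, assuming only the standard library rather than Scrapy: `MetaContentParser` and `SAMPLE_HTML` are hypothetical names, and the sample markup is a heavily shortened version of the tag from the question. It shows that once a parser hands back the attribute value as a single string, the entities are already decoded and `json.loads()` works directly:

```python
import json
from html.parser import HTMLParser


class MetaContentParser(HTMLParser):
    """Hypothetical helper: grabs the content attribute of one <meta> tag."""

    def __init__(self, meta_id):
        super().__init__()
        self.meta_id = meta_id
        self.content = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; the parser has already
        # replaced character entities in the values.
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("id") == self.meta_id:
            self.content = attrs.get("content")


# Shortened stand-in for the tag in the question (an assumption, not the full data).
SAMPLE_HTML = (
    "<meta content='{\"hosting\":{\"id\":2256573},"
    "\"link\":\"&lt;span&gt;Topeka,&lt;/span&gt;\"}' "
    "id=\"_bootstrap-neighborhood_card\">"
)

parser = MetaContentParser("_bootstrap-neighborhood_card")
parser.feed(SAMPLE_HTML)

# parser.content is one string (not a list), so json.loads() accepts it.
data = json.loads(parser.content)
print(data["hosting"]["id"])
print(data["link"])
```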

alecxe

You can decode it using json.loads(), but first you need to get at the JSON string contained in the content attribute of the <meta> tag.

You can make multiple calls to xpath() to drill into the attributes of the selected tag:

import json

# Select the <meta> tag, then drill into its content attribute
meta = response.xpath('//meta[@id="_bootstrap-neighborhood_card"]')
content = meta.xpath('@content').extract_first()
data = json.loads(content)

Or you can do it in one go:

import json
from pprint import pprint

content = response.xpath('//meta[@id="_bootstrap-neighborhood_card"]').xpath('@content').extract_first()
data = json.loads(content)
pprint(data)

Output

{u'hosting': {u'id': 2256573,
              u'offset_lat': 39.04258923718809,
              u'offset_lng': -95.69083697887662},
 u'map_url': u'https://maps.googleapis.com/maps/api/staticmap?markers=%2C&size&zoom=14',
 u'neighborhood_basic_info': None,
 u'neighborhood_breadcrumb_details': [{u'link': u'<span>Southwest Fillmore Street,</span>',
                                       u'link_route': u'/s/Southwest-Fillmore-Street-Topeka--KS',
                                       u'link_text': u'Southwest Fillmore Street,',
                                       u'search_text': u'Southwest Fillmore Street Topeka, KS'},
                                      {u'link': u'<span>Topeka,</span>',
                                       u'link_route': u'/s/Topeka--KS',
                                       u'link_text': u'Topeka,',
                                       u'search_text': u'Topeka, KS'},
                                      {u'link': u'<span>Kansas,</span>',
                                       u'link_route': u'/s/Kansas--United-States',
                                       u'link_text': u'Kansas,',
                                       u'search_text': u'Kansas, United States'},
                                      {u'link': u'<span>United States</span>',
                                       u'link_route': u'/s/United-States',
                                       u'link_text': u'United States',
                                       u'search_text': u'United States'}],
 u'neighborhood_localized_name': None,
 u'place_recommendations': [],
 u'user_info': {u'user_image': u'<img alt="Elizabeth" data-pin-nopin="true" height="90" src="https://a0.muscache.com/im/users/9199018/profile_pic/1380782460/original.jpg?aki_policy=profile_x_medium" title="Elizabeth" width="90" />'}}
mhawke
  • Unrelated, but what is the difference between print and pprint? (relative newbie here) – Jeroen Mar 05 '16 at 04:31
  • `pprint` == "pretty print". For nested data structures `pprint` produces a more readable output than `print`. – mhawke Mar 05 '16 at 04:34
  • @Jeroen: thanks for accepting this answer, although I do think that alecxe's single xpath query is more concise: `content = response.xpath('//meta[@id="_bootstrap-neighborhood_card"]/@content').extract_first()`. – mhawke Mar 05 '16 at 05:47
  • yeah it is more concise, I agree, but beyond that, it's essentially the same so I looked at you guys's reputation score and yours is lower... :-) – Jeroen Mar 05 '16 at 10:35
  • @Jeroen: I'm grateful but that's not a reason to accept an answer. You should choose the one that best answers your question. If that's mine then great. Your selection may also affect future readers that might have a similar problem. On that basis I'm inclined to update the second part of my answer to be the same as alecxe's. – mhawke Mar 05 '16 at 10:52
  • Actually @mhawke, another reason was that your answer is more verbose. As seasoned secs we may select the more concise code, for future readers yours explains what is happening slightly better. At least IMHO... – Jeroen Mar 05 '16 at 10:55
  • @Jeroen: OK then. alecxe's answer is mentioned in the comments, so I suppose that's OK. Thanks. – mhawke Mar 05 '16 at 10:56
  • Secs = devs... Stupid iPhone! – Jeroen Mar 05 '16 at 11:10