
I am quite new to this, so apologies for any inconsistencies or information missing.

I am currently trying to pull information from a specific piece of JSON on a webpage, and I am having an absolute nightmare doing so.

Currently, I have this very simple script:

import requests

r = requests.get('https://www.johnlewis.com/canon-pixma-ts5151-all-in-one-wireless-wi-fi-printer-white/p3341066')

print (r.json)

I am trying to get it to print the JSON for the page, but I can't even get it to do this!

Eventually, I am trying to parse the entirety of JSON script # 111, so that I can then pull specific information from this.

How exactly can I go about this? (Either printing the JSON for the whole page, or the JSON of script # 111.)

martineau
  • When I look at that page, I don't see any JSON. Where is the data you are trying to retrieve? – Cargo23 Dec 07 '21 at 16:08
  • @Cargo23 Hi there! Thanks for your response. This may be my problem, then: when I open the console for the page, I come across this block " – Quaide Watton Dec 07 '21 at 16:12
  • @Cargo23 ***ADDED*** I noticed I may have answered my own question above: the data seems to be in dict format, which can be converted to JSON, rather than the other way around. How exactly could I extract a variable or piece of information from the dict under that name? – Quaide Watton Dec 07 '21 at 16:19
  • See https://stackoverflow.com/questions/61217541/how-to-extract-json-from-script-tag-using-beautiful-soup-python – ChrisOram Dec 07 '21 at 16:20
  • `r.json` is a method, not a property. you're using it completely wrong. It should be `r.json()`. – rv.kvetch Dec 07 '21 at 20:45
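To illustrate the point in the last comment, here is a minimal offline sketch of the difference between `r.json` and `r.json()`. The two `Response` objects are built by hand purely for illustration (the bodies are made-up sample content, not the real page), and setting the private `_content` attribute is a testing shortcut rather than normal usage.

```python
import requests

# Build Response objects by hand so the sketch runs without a network call.
# Setting the private _content attribute is a shortcut for illustration only.
html_response = requests.Response()
html_response.status_code = 200
html_response._content = b"<html><body>Not JSON</body></html>"

json_response = requests.Response()
json_response.status_code = 200
json_response._content = b'{"name": "example"}'

# Without parentheses you get the method object itself, not the data,
# which is why print(r.json) shows "<bound method ...>".
print(callable(html_response.json))

# Calling the method parses the body, which only works if the body is JSON.
parsed = json_response.json()
print(parsed["name"])

# An HTML body is not JSON, so the call raises a ValueError subclass.
try:
    html_response.json()
    html_is_json = True
except ValueError:
    html_is_json = False
print(html_is_json)
```

So even with the parentheses added, `r.json()` would likely fail on a product page that returns HTML; the JSON has to be pulled out of the markup first.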

1 Answer


I like pyquery for dealing with HTML source code. It works very much like jQuery, if you're familiar with that.

Note: In my tests, I needed to spoof the user agent in order to get a response from the server, YMMV.

import json
import requests
from pyquery import PyQuery

url = 'https://www.johnlewis.com/canon-pixma-ts5151-all-in-one-wireless-wi-fi-printer-white/p3341066'
response = requests.get(url, headers={
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:94.0) Gecko/20100101 Firefox/94.0'
})
pq = PyQuery(response.text)

data = (pq
    .find('script[type="application/ld+json"]')
    .map(lambda i, elem: json.loads(elem.text))
)

On the page you mention, there are multiple 'application/ld+json' script elements. The above code finds them all and parses their contents from JSON, so data is a list with multiple elements.
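If you would rather not add a dependency, the same idea can be sketched with only the standard library. This is an illustrative alternative, not the pyquery approach itself: the `LdJsonExtractor` class and the `SAMPLE_HTML` string are hypothetical, and the sample markup is a made-up stand-in for the real page.

```python
import json
from html.parser import HTMLParser


class LdJsonExtractor(HTMLParser):
    """Collects the parsed contents of every application/ld+json script."""

    def __init__(self):
        super().__init__()
        self.in_ld_json = False
        self.buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        # Start buffering only for <script type="application/ld+json">.
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self.in_ld_json = True
            self.buffer = []

    def handle_data(self, data):
        if self.in_ld_json:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        # At </script>, parse whatever was buffered as JSON.
        if tag == "script" and self.in_ld_json:
            self.documents.append(json.loads("".join(self.buffer)))
            self.in_ld_json = False


# Hypothetical sample markup; the real page's scripts will differ.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">{"@type": "Product", "name": "Sample printer", "sku": "000"}</script>
<script>var notJson = 1;</script>
<script type="application/ld+json">{"@type": "BreadcrumbList"}</script>
</head><body></body></html>
"""

extractor = LdJsonExtractor()
extractor.feed(SAMPLE_HTML)
extractor.close()
data = extractor.documents
print(len(data))
print(data[0]["name"])
```

For a real page you would feed `response.text` to the extractor instead of `SAMPLE_HTML`.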

If you know exactly what you're looking for, you can go there directly.

print(data[0]['offers']['seller']['name'])  # => John Lewis & Partners

Alternatively, you can use jsonpath-ng for greatly extended flexibility in querying the extracted objects:

from jsonpath_ng.ext import parse as jp

names = [r.value for r in jp('[*].offers.seller.name').find(data)]
for name in names:
    print(name) 

# => John Lewis & Partners
    
products = [r.value for r in jp('$[?"@type"="product"]').find(data)]
for product in products:
    print(product['sku'], '-', product['name']) 

# => 237161659 - Canon PIXMA TS5151 All-in-One Wireless Wi-Fi Printer, White
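For simple lookups like these, plain Python can do roughly the same job without JSONPath. The sketch below is only an approximation of the two queries: `sample_data` is a hypothetical stand-in for the parsed list, `dig` is a made-up helper, and the type comparison is deliberately case-insensitive because the exact casing of "@type" on the real page is an assumption.

```python
# Hypothetical stand-in for the parsed ld+json list; real data will differ.
sample_data = [
    {
        "@type": "Product",
        "sku": "000",
        "name": "Sample printer",
        "offers": {"seller": {"name": "Sample seller"}},
    },
    {"@type": "BreadcrumbList", "itemListElement": []},
]


def dig(obj, *keys):
    """Follow keys through nested dicts, returning None if any is missing."""
    for key in keys:
        if not isinstance(obj, dict) or key not in obj:
            return None
        obj = obj[key]
    return obj


# Rough equivalent of '[*].offers.seller.name': skip items without that path.
seller_names = []
for item in sample_data:
    name = dig(item, "offers", "seller", "name")
    if name is not None:
        seller_names.append(name)

# Rough equivalent of filtering on "@type", ignoring case.
products = [
    item for item in sample_data
    if str(item.get("@type", "")).lower() == "product"
]

for name in seller_names:
    print(name)
for product in products:
    print(product["sku"], "-", product["name"])
```

JSONPath becomes more worthwhile once the queries involve deeper nesting or wildcards at several levels.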
Tomalak