I like pyquery for dealing with HTML source code. It works very much like jQuery, if you're familiar with that.
Note: In my tests, I needed to spoof the user agent in order to get a response from the server, YMMV.
import json
import requests
from pyquery import PyQuery
url = 'https://www.johnlewis.com/canon-pixma-ts5151-all-in-one-wireless-wi-fi-printer-white/p3341066'
response = requests.get(url, headers={
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:94.0) Gecko/20100101 Firefox/94.0'
})
pq = PyQuery(response.text)
data = (pq
.find('script[type="application/ld+json"]')
.map(lambda i, elem: json.loads(elem.text))
)
On the page you mention there are multiple 'application/ld+json'
script elements. The above code finds them all and parses their contents from JSON, so data
is a list with multiple elements.
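If you're not sure which element is which, a quick way to orient yourself is to print each object's @type. The sample below stands in for the parsed data; the real page may contain different objects:

```python
# Hypothetical sample in place of the JSON-LD objects parsed above; on the
# real page, `data` comes from the pyquery/json code.
data = [
    {
        "@type": "product",
        "name": "Canon PIXMA TS5151 All-in-One Wireless Wi-Fi Printer, White",
        "sku": "237161659",
        "offers": {"seller": {"name": "John Lewis & Partners"}},
    },
    {"@type": "BreadcrumbList", "itemListElement": []},
]

# List each object's @type so you know which index holds the product data.
types = [obj.get("@type") for obj in data]
print(types)  # => ['product', 'BreadcrumbList']
```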
If you know exactly what you're looking for, you can go there directly.
print(data[0]['offers']['seller']['name']) # => John Lewis & Partners
Alternatively, you can use jsonpath-ng for more flexible querying of the extracted objects:
from jsonpath_ng.ext import parse as jp
names = [r.value for r in jp('[*].offers.seller.name').find(data)]
for name in names:
print(name)
# => John Lewis & Partners
products = [r.value for r in jp('$[?"@type"="product"]').find(data)]
for product in products:
print(product['sku'], '-', product['name'])
# => 237161659 - Canon PIXMA TS5151 All-in-One Wireless Wi-Fi Printer, White
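If you'd rather not add a dependency, the two jsonpath queries above have straightforward plain-Python equivalents. Again, the sample data here is hypothetical and merely mimics the shape of the parsed objects:

```python
# Hypothetical sample in place of the parsed JSON-LD objects.
data = [
    {
        "@type": "product",
        "name": "Canon PIXMA TS5151 All-in-One Wireless Wi-Fi Printer, White",
        "sku": "237161659",
        "offers": {"seller": {"name": "John Lewis & Partners"}},
    },
    {"@type": "BreadcrumbList", "itemListElement": []},
]

# Equivalent of '[*].offers.seller.name': collect seller names where present.
names = [obj["offers"]["seller"]["name"]
         for obj in data
         if "seller" in obj.get("offers", {})]

# Equivalent of '$[?"@type"="product"]': filter objects by @type.
products = [obj for obj in data if obj.get("@type") == "product"]

print(names)               # => ['John Lewis & Partners']
print(products[0]["sku"])  # => 237161659
```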