
I am unable to retrieve the product data I need from a website. I can see the HTML sections that I think I need to grab, but my code returns no data. It works for certain HTML tags on the same page, but not for the one I want.

I am a real beginner. I have watched YouTube videos and tried to go through the questions and answers here. From what I can tell, it seems like the data I need may be something other than HTML, but embedded in the HTML(?).

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = 'https://www.harristeeter.com/specials/weekly-list/best-deals'
uClient = uReq(my_url)      # download the page
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")

# count the matching <div> elements for each class
len(page_soup.findAll("div", {"class": "product_infoBox"}))
len(page_soup.findAll("div", {"class": "container"}))

In the code I can retrieve results for "container" (=5) but not "product_infoBox" (=0). "product_infoBox" is the section I need.
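A quick way to check whether the section exists in the downloaded HTML at all is to search the raw page source for the class name before parsing. This is a sketch only; the sample string below stands in for the real `page_html` returned by `uClient.read()`:

```python
# Stand-in for the downloaded page source; on the real page the
# "product_infoBox" markup is injected later by JavaScript, so it
# never appears in the HTML that urlopen() receives.
page_html = "<div class='container'>...</div>"

print("product_infoBox" in page_html)  # prints False
print("container" in page_html)        # prints True
```

If the class name is absent from the raw source, BeautifulSoup cannot find it either, which points to the content being loaded dynamically.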

1 Answer

The page loads its data dynamically as JSON, but you can obtain the same data through requests as well. This script searches for a store, selects the first result, and loads the weekly specials:

import requests
from bs4 import BeautifulSoup
import json

store_search_url = 'https://www.harristeeter.com/api/v1/stores/search?Address={}&Radius=10000&AllStores=true&NewOrdering=false&OnlyPharmacy=false&OnlyFreshFood=false&FreshFoodOrdering=undefined'
weekly_specials_url = 'https://www.harristeeter.com/api/v1/stores/{}/departments/0/weekly_specials?'

headers = {'Referer': 'https://www.harristeeter.com/store-locator'}

with requests.session() as s:
    r = s.get('https://www.harristeeter.com/store-locator', headers=headers)
    store_search_data = s.get(store_search_url.format('pine ridge plaza, reynolda road'), headers=headers).json()

    # This prints all results from store search:
    # print(json.dumps(store_search_data, indent=4))

    # we select the first match:
    store_number = store_search_data['Data'][0]['Number']
    weekly_specials_data = s.get(weekly_specials_url.format(store_number), headers=headers).json()

    print(json.dumps(weekly_specials_data, indent=4))

Prints:

{
    "Status": "success",
    "Data": [
        {
            "ID": "4615146",
            "AdWeek": "2019-07-16",
            "DepartmentNumber": "4",
            "AdWeekExpires": "07/16/2019",
            "ActiveAdWeekRelease": "2019-07-16",
            "StartDate": "7/10/2019",
            "EndDate": "7/16/2019",
            "IsCardRequired": true,
            "Title": "Harris Teeter Cottage Cheese, Sour Cream, French",
            "Description": "8-16 oz",
            "Detail": "e-VIC Member Price $1.27",
            "Price": "2/$3",
            "SpecialPrice": "$1.27",
            "DesktopImageUrl": "https://23360934715048b8b9a2-b55d76cb69f0e86ca2d9837472129d5a.ssl.cf1.rackcdn.com/sm_4615146.jpg",
            "MobileImageUrl": "https://23360934715048b8b9a2-b55d76cb69f0e86ca2d9837472129d5a.ssl.cf1.rackcdn.com/sm_4615146.jpg",
            "Limit": "6",
            "Savings": "Save at Least 38\u00a2 on 2",
            "Size": "8-16 oz",
            "Subtitle": "Limit 6 at e-VIC Price",
            "IsAdded": false,
            "RetinaImageUrl": "https://23360934715048b8b9a2-b55d76cb69f0e86ca2d9837472129d5a.ssl.cf1.rackcdn.com/4615146.jpg",
            "TIE": "1",
            "Organic": "0",
            "Type": "EVIC",
            "DepartmentName": "Dairy & Chilled Foods"
        },
        {
            "ID": "4614507",

... and so on.
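Once you have the JSON payload, extracting just the fields you care about is plain dictionary work. A sketch, using a small sample item with the same fields as the output above (the field selection here is illustrative, not part of the original answer):

```python
# Sample payload mirroring the structure of weekly_specials_data above.
sample = {
    "Status": "success",
    "Data": [
        {
            "Title": "Harris Teeter Cottage Cheese, Sour Cream, French",
            "Price": "2/$3",
            "SpecialPrice": "$1.27",
            "Size": "8-16 oz",
            "DepartmentName": "Dairy & Chilled Foods",
        },
    ],
}

def summarize(payload):
    # Keep only a few fields per deal; .get() tolerates missing keys.
    return [
        {"title": d.get("Title"), "price": d.get("Price"), "size": d.get("Size")}
        for d in payload.get("Data", [])
    ]

for deal in summarize(sample):
    print(deal["title"], "-", deal["price"])
```

The same `summarize()` call works directly on `weekly_specials_data`, since it has the identical `"Data"` list structure.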
Andrej Kesely
  • This output is exactly like I was hoping. This may be dumb beginner issue but....when I run it on my system I get: SSLError: ("bad handshake: Error([('SSL routines', 'ssl3_get_server_certificate', 'certificate verify failed')],)",). – Lee Miller Jul 15 '19 at 22:23
  • @LeeMiller Add `verify=False` to `requests.get()` calls. More info here: https://stackoverflow.com/questions/15445981/how-do-i-disable-the-security-certificate-check-in-python-requests – Andrej Kesely Jul 16 '19 at 04:00
  • If it is not too late, can I ask 1 follow up question - how did you determine what URLs to use for store_search and weekly_specials? – Lee Miller Jul 16 '19 at 18:47
  • @LeeMiller I looked into Firefox developer tools (or Chrome if you prefer) and watched where the page is making requests, to which URLs. – Andrej Kesely Jul 16 '19 at 18:48
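The `verify=False` workaround from the comments can also be set once on the session instead of on every call. A minimal sketch (note: disabling certificate verification is insecure and should only be a last resort):

```python
import requests
import urllib3

# Suppress the InsecureRequestWarning that requests emits when
# certificate verification is disabled.
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

s = requests.Session()
s.verify = False  # applies to every request made through this session
# All s.get(...) calls now skip certificate verification,
# so the individual calls don't need verify=False.
```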