
I am trying to scrape all the feed item images from this infinite-scrolling website: https://www.grailed.com/designers/jordan-brand/hi-top-sneakers. I have already seen many other answers to related problems and tried the solutions, and nothing has worked for me so far.
Another question that is very similar: How to handle lazy-loaded images in selenium?

The problem with this website is that unless a feed item is in the viewport, its img tag is usually replaced with a div with the class name "lazyload-placeholder". I've tried using selenium to scroll to the bottom of each page and then scrape the images, but that has not worked because the images at the top of the page are replaced by placeholders again once they leave the viewport.
Another solution that I tried, and that did work, is to scroll down incrementally and scrape after each scroll. However, it took far too long for ~10k elements. Is there a faster, more efficient way to do this?

This is how I'm currently scrolling:

driver.find_element_by_tag_name('body').send_keys(Keys.PAGE_DOWN)

I'm using bs4 to scrape the data from each item: (item is an element in the feed)

try:
    img = item.find('img')['src']
except TypeError:
    img = "N/A"
    print(item)
Jason

1 Answer


Using BeautifulSoup or Selenium for this is way more than what's required, and, as I'm sure you've discovered on your own already, pretty cumbersome to try and use for this particular use-case.

The easier and cleaner thing to do is this: if you open your browser's network traffic logger and view only the XHR (XMLHttpRequest) requests, you'll see that every time you scroll down and new products start loading, your browser makes an HTTP POST request to this API: https://mnrwefss2q-dsn.algolia.net/1/indexes/*/queries

If you simply imitate that POST request, using the same query string and POST body, you can get all the product information you could ever want, including URLs to the product images - and it's all in JSON.

For whatever reason, the API doesn't seem to care about request headers. It only cares about the query string and the POST body, which the code below sends as JSON. You can also change the hitsPerPage key-value pair to change the number of products requested. By default the page seems to load 40 new products each time, but you can set that number to whatever you want:

import requests
from urllib.parse import urlencode


def main():

    url = "https://mnrwefss2q-dsn.algolia.net/1/indexes/*/queries"

    params = {
        "x-algolia-agent": "Algolia for JavaScript (3.35.1); Browser; react (16.13.1); react-instantsearch (6.6.0); JS Helper (3.1.2)",
        "x-algolia-application-id": "MNRWEFSS2Q",
        "x-algolia-api-key": "a3a4de2e05d9e9b463911705fb6323ad"
    }

    post_json = {
        "requests":[
            {
                "indexName": "Listing_production",
                "params": urlencode({
                    "highlightPreTag": "<ais-highlight-0000000000>",
                    "highlightPostTag": "</ais-highlight-0000000000>",
                    "maxValuesPerFacet": "100",
                    "hitsPerPage": "40",
                    "filters": "",
                    "page": "4",
                    "query": "",
                    "facets": "[\"designers.name\",\"category_path\",\"category_size\",\"price_i\",\"condition\",\"location\",\"badges\",\"strata\"]",
                    "tagFilters": "",
                    "facetFilters": "[[\"category_path:footwear.hitop_sneakers\"],[\"designers.name:Jordan Brand\"]]",
                    "numericFilters": "[\"price_i>=0\",\"price_i<=99999\"]"
                })
            }
        ]
    }

    response = requests.post(url, params=params, json=post_json)
    response.raise_for_status()

    results = response.json()["results"]
    items = results[0]["hits"]

    for item in items:
        print(f"{item['title']} - price: ${item['price']}")
        print(f"Image URL: \"{item['cover_photo']['url']}\"\n")

    return 0


if __name__ == "__main__":
    import sys
    sys.exit(main())

Output:

Air Jordan 13 Retro Grey Toe 2014 - price: $150
Image URL: "https://cdn.fs.grailed.com/api/file/HZfvq06fSYOZWvB6OTxA"

Air Jordan 5 Retro Grape 2013 - price: $300
Image URL: "https://cdn.fs.grailed.com/api/file/nwQmfUzITOSVa2Qg5gCt"

Air Jordan 11 BG Legend Blue - price: $243
Image URL: "https://cdn.fs.grailed.com/api/file/AKYdESePSdK1XqDEqYMr"

Air Jordan 12 Retro Cool Grey 2012 - price: $200
Image URL: "https://cdn.fs.grailed.com/api/file/oAl5cxdCSPyCQaXUKPRm"

Air Jordan 11 Retro GS Space Jam 2009 - price: $162
Image URL: "https://cdn.fs.grailed.com/api/file/oRAMFOMTeu9fkWp5l640"

Jordan 1 Flight Mens High Tops Shoes - Size 10.5 White - price: $50
Image URL: "https://cdn.fs.grailed.com/api/file/hRZFggEyRImxzi1T623p"

Air Jordan 1 Retro High OG Royal Toe Sz 8.5 - price: $400
Image URL: "https://cdn.fs.grailed.com/api/file/MWwBxHyNRDCuDZ4Pc2LC"

Air Jordan 14 GS Black Toe 2014 Size 4Y (5.5 Womans) - price: $58
Image URL: "https://cdn.fs.grailed.com/api/file/FzV1GrMGSRqPjV0dPFcB"

Air Jordan 5 Retro Fire Red 2020 - price: $250
Image URL: "https://cdn.fs.grailed.com/api/file/vGda4X6qReBq42muTatG"

Air Jordan 6 Retro All Star Chameleon - price: $70
Image URL: "https://cdn.fs.grailed.com/api/file/wk1ySwZDQqWHtCqvi9I0"

Air Jordan 5 Retro GS Oreo - price: $64
Image URL: "https://cdn.fs.grailed.com/api/file/rWrcrhdiSBG4hvRZ53aS"

Air Jordan 11 Retro GS Bred 2012 - price: $76
Image URL: "https://cdn.fs.grailed.com/api/file/ceUppZDSo6YgvPM6OoMg"

Air Jordan 1 Mid (GS) 6Y (5.5 Uk) - price: $115
Image URL: "https://cdn.fs.grailed.com/api/file/cRFIYqE8TKOSAiKaV2uA"

Air Jordan 7 Retro GS Pure Money 3.5Y - price: $87
Image URL: "https://cdn.fs.grailed.com/api/file/uXZIVZMQQain0pUSyJIe"

Air Jordan 5 Retro Olympic 2011 - price: $120
Image URL: "https://cdn.fs.grailed.com/api/file/D7E4GvaJSiywv3vaz3A5"

J12 Grey/University Blue - price: $117
Image URL: "https://cdn.fs.grailed.com/api/file/0T9GNBSUTDSGQsqrg4IS"

1994 Jordan 1 Bred - price: $801
Image URL: "https://cdn.fs.grailed.com/api/file/rfQ68e8PRDOpKN6zwwyw"

Nike Air Jordan Retro 13 He Got Game HGG - price: $90
Image URL: "https://cdn.fs.grailed.com/api/file/k9N12MAQnKcKk3Vl0dek"

Nike Air Jordan Retro 13 XIII Grey Toe He Got Game Playoff - price: $90
Image URL: "https://cdn.fs.grailed.com/api/file/wo1hKnUQeKt0lioebLgG"

Nike Air Jordan 1 Retro Countdown Pack 2008 Vintage - price: $429
Image URL: "https://cdn.fs.grailed.com/api/file/ktAH2VmbTgaG6zdtzHrw"

Air Jordan 1 Retro High OG Bred Toe - price: $495
Image URL: "https://cdn.fs.grailed.com/api/file/a41NTCSXTgGm7KyqkTqB"

Air Jordan 10 Retro 2018 Orlando 2018 - price: $121
Image URL: "https://cdn.fs.grailed.com/api/file/dMAbTBYVSYSX6KNqVvQA"

Air Jordan 12 Retro Winterized Triple Black 2018 - price: $84
Image URL: "https://cdn.fs.grailed.com/api/file/Jg69Af0QrelcZxWM2OEW"

Air Jordan 6 Retro Diffused Blue 2018 - price: $145
Image URL: "https://cdn.fs.grailed.com/api/file/bVPC1SomTKO3yPLCm0rC"

AIR JORDAN 7 RETRO "2005 CARDINAL" - price: $57
Image URL: "https://cdn.fs.grailed.com/api/file/QzvDCMVRqykY1CnkrMwC"

Air Jordan 5 Retro Camo 2017 - price: $220
Image URL: "https://cdn.fs.grailed.com/api/file/eLDeagz4TF6Yu5aItPAE"

Air Jordan 1 Retro High Grand Purple 2009 - price: $261
Image URL: "https://cdn.fs.grailed.com/api/file/XhhIFIQ5SyjNQkEXpQjK"

Air Jordan 6 Retro 2015 Maroon 2015 - price: $150
Image URL: "https://cdn.fs.grailed.com/api/file/5UTk9ctnTnSsZEixeYwW"

ORIGINAL 1985 WEARABLE Chicago OG Air Jordan 1 Last Dance - price: $2268
Image URL: "https://cdn.fs.grailed.com/api/file/CE4eMehmQvOYtpx16R3X"

Air Jordan 5 Retro Top 3 - price: $247
Image URL: "https://cdn.fs.grailed.com/api/file/B1A8oopSR226r2rmOjfR"

AIR JORDAN 4 METALLIC RED - price: $350
Image URL: "https://cdn.fs.grailed.com/api/file/ZHJVHw7fRC2jO5wDj7JF"

Air Jordan 1 Retro Mid Size 7 GS White Gym Red - price: $67
Image URL: "https://cdn.fs.grailed.com/api/file/XPMpEjlZSmCNikSeHjg4"

Air Jordan 5 Retro Metallic White 2015 Size 13 - price: $57
Image URL: "https://cdn.fs.grailed.com/api/file/LHvMnyxxQaOCyYiTSvHw"

Air Jordan 8 Retro C&C Trophy 2016 - price: $115
Image URL: "https://cdn.fs.grailed.com/api/file/7sh2iU42T6kUKjJaAyE8"

JIRDAN 7 UNIVERSITY BLUE - price: $120
Image URL: "https://cdn.fs.grailed.com/api/file/zJzKkd1QyKvMasw1ZWRB"

AIR JORDAN 3 5LAB3 - price: $70
Image URL: "https://cdn.fs.grailed.com/api/file/IJvMAIeMTPycUk7n0idU"

Jordan 6 rings UNC - price: $189
Image URL: "https://cdn.fs.grailed.com/api/file/nE6A02dKQBa7ZTSJWGu3"

Nike Air Jordan 5 Retro Top 3 - price: $255
Image URL: "https://cdn.fs.grailed.com/api/file/UfcVOOO0QJSM6csHc1cK"

Nike Air Jordan 13 White Pink Soar Aurora Green (GS) - price: $155
Image URL: "https://cdn.fs.grailed.com/api/file/58rVbApeR8K8Ojd1Zskg"

Jordan Thunder 4 Retro 2012 - price: $350
Image URL: "https://cdn.fs.grailed.com/api/file/ZJ1EuskfThGImbij5QkS"

>>> 
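
To walk through more than one page with this first API, the `page` value can be varied in a loop. The following is only a minimal sketch: `build_search_payload`, `iter_hits` and the `fetch_page` callable are hypothetical names introduced here, the trimmed-down parameter set is an assumption (the full set from the code above may be required), and zero-based page numbering is assumed as well.

```python
from urllib.parse import urlencode

# Facet filters copied from the request above.
FACET_FILTERS = (
    '[["category_path:footwear.hitop_sneakers"],'
    '["designers.name:Jordan Brand"]]'
)


def build_search_payload(page, hits_per_page=40):
    """Build the JSON body for one page of the /queries request.

    Assumption: only a subset of the original parameters is sent here.
    """
    return {
        "requests": [
            {
                "indexName": "Listing_production",
                "params": urlencode({
                    "hitsPerPage": str(hits_per_page),
                    "page": str(page),
                    "query": "",
                    "facetFilters": FACET_FILTERS,
                }),
            }
        ]
    }


def iter_hits(fetch_page, max_pages=25, hits_per_page=40):
    """Yield hits page by page, stopping at the first empty page.

    fetch_page is a hypothetical callable that POSTs the payload and
    returns the decoded JSON response as a dict.
    """
    for page in range(max_pages):
        data = fetch_page(build_search_payload(page, hits_per_page))
        hits = data["results"][0]["hits"]
        if not hits:
            break
        yield from hits
```

With the `url` and `params` from the code above, `fetch_page` could be something like `lambda payload: requests.post(url, params=params, json=payload).json()`. Passing the fetch step in as a callable keeps the paging logic separate from the network call, so it can be tried against canned responses first.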

EDIT - Here is the updated code, which uses the other API that the page switches to once it has loaded about 1000 items:

import requests
from urllib.parse import urlencode


def main():

    url = "https://mnrwefss2q-1.algolianet.com/1/indexes/Listing_production/browse"

    params = {
        "x-algolia-agent": "Algolia for JavaScript (3.35.1); Browser",
        "x-algolia-application-id": "MNRWEFSS2Q",
        "x-algolia-api-key": "a1c6338ffe41249d0284a5a1105eafe4"
    }

    post_json = {
        "params": "query=&" + urlencode({
            "offset": "0",
            "length": "100",
            "facetFilters": "[[\"category_path:footwear.hitop_sneakers\"], [\"designers.name:Jordan Brand\"]]",
            "filters": ""
        })
    }

    response = requests.post(url, params=params, json=post_json)
    response.raise_for_status()

    items = response.json()["hits"]

    # ...

    return 0


if __name__ == "__main__":
    import sys
    sys.exit(main())
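
One way to step through the whole feed with `offset` and `length` might look like the sketch below. It is not tested against the live API: `build_browse_payload`, `collect_image_urls` and the `fetch_batch` callable are hypothetical names, and it assumes that each hit from this endpoint carries the same `cover_photo` / `url` fields as the hits from the first API.

```python
from urllib.parse import urlencode

# Facet filters copied from the request above.
FACET_FILTERS = (
    '[["category_path:footwear.hitop_sneakers"], '
    '["designers.name:Jordan Brand"]]'
)


def build_browse_payload(offset, length=100):
    """Build the JSON body for one offset/length window."""
    return {
        "params": "query=&" + urlencode({
            "offset": str(offset),
            "length": str(length),
            "facetFilters": FACET_FILTERS,
            "filters": "",
        })
    }


def collect_image_urls(fetch_batch, length=100, limit=None):
    """Collect cover photo URLs window by window.

    fetch_batch is a hypothetical callable that POSTs the payload and
    returns the decoded JSON response as a dict with a "hits" list.
    Stops when a window comes back shorter than `length`, or when the
    next offset would reach `limit`.
    """
    urls = []
    offset = 0
    while limit is None or offset < limit:
        hits = fetch_batch(build_browse_payload(offset, length))["hits"]
        for hit in hits:
            # Assumption: same hit shape as the first API; skip items
            # that have no cover photo.
            photo = hit.get("cover_photo") or {}
            if photo.get("url"):
                urls.append(photo["url"])
        if len(hits) < length:
            break
        offset += length
    return urls
```

With the `url` and `params` from the code above, `fetch_batch` could be `lambda payload: requests.post(url, params=params, json=payload).json()`. Setting `limit` to a small number is a cheap way to check the response shape before requesting everything.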
Paul M.
  • Thank you, this was very helpful. Just one question, I can only get about 25 pages of 40 results each, which is a total of around 1000 items. However the website shows that there are about 22k items in the feed. Do you know how to get beyond 1000 products? – Jason Jul 03 '20 at 20:50
  • @JasonThomo Thanks for pointing that out! I was curious and I manually scrolled down on the webpage, making sure to see what happens in the network traffic logger once I reached 1000 results. It turns out that once you hit 1000 items, for whatever reason, the webpage starts making POST requests to a different API, using a different payload. This one is actually a bit easier to digest, and you should be able to use it to access all items, not just the ones after 1000. I've updated my post with new code at the bottom. Play around with the `offset` and `length` parameters. – Paul M. Jul 03 '20 at 21:36