
I want to create a list of all of the diamonds' URLs in the table on Blue Nile, which should be ~142K entries. I noticed that I had to scroll to load more entries, so the first solution I implemented was to scroll to the end of the page before scraping. However, the maximum number of elements scraped was only 1000. I learned that this is due to the issues outlined in this question: Selenium find_elements_by_id() doesn't return all elements, but the solutions there aren't clear or straightforward to me.

I then tried to scroll the page by a fixed amount and scrape repeatedly until the page reaches the end. However, I only ever seem to get the initial 50 unique elements.

from selenium import webdriver
from bs4 import BeautifulSoup
import time

driver = webdriver.Chrome()
driver.get("https://www.bluenile.com/diamond-search?pt=setform")
source_site = 'www.bluenile.com'
SCROLL_PAUSE_TIME = 0.5

# Total height of the page when it first loads
last_height = driver.execute_script("return document.body.scrollHeight")
print(last_height)
new_height = 500
diamond_urls = []
# Parse the page source once, before any scrolling happens
soup = BeautifulSoup(driver.page_source, "html.parser")
count = 0

while new_height < last_height:
    # Collect the href of every row currently in the parsed source
    for url in soup.find_all('a', class_='grid-row row TL511DiaStrikePrice', href=True):
        full_url = source_site + url['href'][1:]
        diamond_urls.append(full_url)
        count += 1
    if count == 50:
        # Scroll down 500px, wait for new rows to load, then reset the counter
        driver.execute_script("window.scrollBy(0, 500);")
        time.sleep(SCROLL_PAUSE_TIME)
        new_height += 500
        print(new_height)
        count = 0
Please help me find the issue with my code above or suggest a better solution. Thanks!

jko0401
  • Actually, that page has only 1000 rows, as I can guess from the scroll bar size. Just try the `pagination` approach suggested in the answer – Sowjanya R Bhat Jun 09 '20 at 09:16

1 Answer


As a simpler solution, I would just query their API directly (sample below):

https://www.bluenile.com/api/public/diamond-search-grid/v2?startIndex=0&pageSize=50&_=1591689344542&unlimitedPaging=false&sortDirection=asc&sortColumn=default&shape=RD&maxDateType=MANUFACTURING_REQUIRED&isQuickShip=false&hasVisualization=false&isFiltersExpanded=false&astorFilterActive=false&country=USA&language=en-us&currency=USD&productSet=BN

One of the fields in this endpoint's response is `countRaw`, which is 100876. It should therefore be simple enough to iterate over the results in blocks of 50 (or more, though you don't want to abuse the endpoint) until you have all the data you need.
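For example, here is a minimal sketch of that pagination loop using the `requests` library. The `countRaw` field is the one mentioned above; the `results` and `detailsPageUrl` names are assumptions about the response shape, so confirm them against a real response in your browser's network tab before relying on them.

import time
import requests

BASE_URL = "https://www.bluenile.com/api/public/diamond-search-grid/v2"
PAGE_SIZE = 50

# Query parameters taken from the sample URL above (trimmed to the essentials)
params = {
    "startIndex": 0,
    "pageSize": PAGE_SIZE,
    "sortDirection": "asc",
    "sortColumn": "default",
    "shape": "RD",
    "country": "USA",
    "language": "en-us",
    "currency": "USD",
    "productSet": "BN",
}

diamond_urls = []
start = 0
total = None

while total is None or start < total:
    params["startIndex"] = start
    resp = requests.get(BASE_URL, params=params)
    resp.raise_for_status()
    data = resp.json()

    # countRaw is the total number of matching diamonds
    if total is None:
        total = int(data["countRaw"])

    # "results" / "detailsPageUrl" are assumed field names -- verify them first
    for row in data.get("results", []):
        diamond_urls.append("www.bluenile.com" + row.get("detailsPageUrl", ""))

    start += PAGE_SIZE
    time.sleep(1)  # be polite; the endpoint may rate-limit rapid callers

print(len(diamond_urls))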

Hope this helps.

cullzie
  • @jko0401 Seems like a bug on their side. If you set the filter to 24 items per page on the UI and go to the last page you get a 400 response in the network tab. Anything over 1000 seems to be broken – cullzie Jun 10 '20 at 08:45
  • Bummer. My workaround right now is to select each of the filters individually to minimize the number of items per API call, but some combinations are still a little over 1000. Plus, it seems like there's a limit on the number of calls per minute, so my script takes super long to get all items – jko0401 Jun 10 '20 at 16:31