
My goal: On the AptDeco website (URL in the code below) there are links to 60 pieces of furniture, and I want to scrape all 60 of those links. My solution is to: (1) create a Selenium driver, (2) load the AptDeco webpage in that driver, (3) pull the HTML from the loaded page into Beautiful Soup, and (4) extract all of the links with Beautiful Soup (see code below).

My issue: the HTML source code I am downloading into the variable named "html_page" only includes the first 6 pieces of furniture. I can re-create the issue manually: if I go to the URL in my browser, right-click and select "View page source", I see HTML that only includes links to the first 6 items; if I instead right-click and select "Inspect", I see HTML that includes links to all 60 items. Is there a way to write code that pulls the HTML as it appears in the "Inspect" version rather than the "View page source" version? My hypothesis is that the website is dynamic, and there is a piece of JavaScript that has been executed in the "Inspect" version but not in the "View page source" version, but I'm unsure how to get the version I want.
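If that hypothesis is right, the usual remedy is to make Selenium wait explicitly for the JavaScript-rendered elements before reading page_source. A minimal sketch (the wait condition here just looks for anchor tags; a real selector targeting the product cards would be tighter, and the helper names are my own, not from the original code):

```python
from bs4 import BeautifulSoup

def extract_links(html):
    """Pull every anchor href out of already-rendered HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return [a.get("href") for a in soup.find_all("a") if a.get("href")]

def scrape_links(url, timeout=10):
    """Load the page, wait until anchors are present in the DOM, then parse."""
    # Selenium imports deferred so extract_links stays usable without a browser
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Block (up to `timeout` seconds) until at least one <a> element
        # exists, i.e. until the JavaScript has hopefully rendered the cards
        WebDriverWait(driver, timeout).until(
            EC.presence_of_all_elements_located((By.TAG_NAME, "a"))
        )
        return extract_links(driver.page_source)
    finally:
        driver.quit()
```

Note that `presence_of_all_elements_located` only guarantees at least one match exists, so for a page that renders cards in batches you may still need a condition on the expected count.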

Edit: It was pointed out that perhaps I needed to wait for Ajax content to load. I ran a couple of tests after loading the URL to confirm this isn't the issue. First, I checked whether there were any jQuery requests still active (this raised an exception; the page doesn't use jQuery). Second, I checked that document.readyState was "complete". After these two tests, I ran the "html_page = driver.page_source" line again and still got the same issue.
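For reference, the two checks described above can be expressed roughly like this (jQuery.active only exists when the site actually loads jQuery, which explains the exception; the function name is my own):

```python
def page_is_settled(driver):
    """Return True when document.readyState is 'complete' and,
    if jQuery is present, no Ajax requests are in flight."""
    ready = driver.execute_script("return document.readyState") == "complete"
    try:
        # Raises if the page never loaded jQuery (as on AptDeco)
        jquery_idle = driver.execute_script("return jQuery.active") == 0
    except Exception:
        jquery_idle = True  # no jQuery on the page, nothing to wait for
    return ready and jquery_idle
```

A True result here only means the initial load has settled; it says nothing about content that the page fetches later (e.g. on scroll or on a "load more" click), which is consistent with these checks passing while the links were still missing.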

from selenium import webdriver
from bs4 import BeautifulSoup

url = 'https://www.aptdeco.com/catalog'
driver = webdriver.Chrome()
driver.get(url)
# Grab the rendered HTML from the browser and hand it to Beautiful Soup
html_page = driver.page_source
soup = BeautifulSoup(html_page, "html.parser")
# NB: this auto-generated class name is fragile and may change between deploys
for link in soup.find_all('a', class_='Card__CardLink-rr6223-1 crcHwb'):
    print(link.get('href'))
Kathryn
  • The links are probably added to the page via javascript after the page loads – ibrahim mahrir Nov 06 '19 at 16:07
  • The script with the id __NEXT_DATA__ contains all the data as JSON. – NineBerry Nov 06 '19 at 16:09
  • @ibrahimmahrir this is helpful advice, and my gut says you're probably right. Do you know of any way for me to access the HTML code of a website after the javascript has been executed, such that the links have been added? – Kathryn Nov 06 '19 at 19:37
  • @David the issue I was trying to solve by using selenium is that the website is dynamic, so just using BS pulled an HTML shell that didn't have the links dynamically filled in yet. Turns out using selenium didn't solve that issue at all. I'm still facing that same issue. – Kathryn Nov 06 '19 at 19:39
  • @NineBerry good point, the __NEXT_DATA__ script does contain what I asked for in this post. Unfortunately, my ultimate goal is a bit broader. After I solve this piece, I want to load the next 60 items and scrape their data. To my major disappointment, __NEXT_DATA__ doesn't update automatically when I load more data. Before I realized that, I actually built out a whole script to transform this __NEXT_DATA__ element into a table. If you know of a way to get the __NEXT_DATA__ script to update to show the next 60 elements, that would be majorly useful to me and save me a huge headache! – Kathryn Nov 06 '19 at 19:44
  • https://stackoverflow.com/a/43565160/9867451 – ibrahim mahrir Nov 06 '19 at 20:26
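Following NineBerry's comment, the __NEXT_DATA__ script tag can be decoded as JSON straight out of the static page source, with no Selenium needed for the first page. A sketch (the structure of the JSON inside it is site-specific and not shown here, so inspect the returned dict before relying on any particular key path):

```python
import json
from bs4 import BeautifulSoup

def parse_next_data(html):
    """Extract and decode the <script id="__NEXT_DATA__"> JSON blob."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("script", id="__NEXT_DATA__")
    if tag is None:
        raise ValueError("no __NEXT_DATA__ script tag found")
    return json.loads(tag.string)
```

As noted in the comments, this blob is baked into the initial HTML, so it will not reflect items loaded later by the page's JavaScript.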

0 Answers