EDIT: The suggested duplicate does not answer my question: I have already tried a headless browser without success, and that question does not explain how to use one for this (or a similar) task.
I'm scraping this page:
The first 12 products load automatically (without JS), and the remaining products (48, I believe) load after the user scrolls down a bit.
This snippet:
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
}

url_list2 = []
data2 = requests.get("https://www.finishline.com/store/men/shoes/_/N-1737dkj?mnid=men_shoes#/store/men/shoes/nike/adidas/jordan/under-armour/puma/new-balance/reebok/champion/timberland/fila/lacoste/converse/_/N-1737dkjZhtjl46Zh51uarZvnhst2Zu4e113Z16ggje2Z1alnhbgZ1lzobj2Z7oi4waZ1hzyzukZm0ym0nZj4k440Zdshbsy?mnid=men_shoes", headers=headers)
soup2 = BeautifulSoup(data2.text, 'html.parser')

# Each product tile is a div with class "product-card"
for card in soup2.find_all('div', attrs={'class': 'product-card'}):
    url_list2.append("https://www.finishline.com" + card.find('a')['href'])
print(url_list2)
will get the 12 products that load independently of JS (you can verify this by turning JS off in Chrome's settings). However, with JS enabled there are 60 (or 59) products on the page.
How can I get all of the products using BS4? I also tried Selenium, but that fails with a different error, described below.
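One idea I've been toying with: my Selenium attempt below pages through results with a `No=` offset query parameter in steps of 40, so if the server honors those offsets for plain, JS-free requests (an assumption I haven't verified), the offset URLs could be built up front and fetched with requests alone:

```python
# Sketch: build paginated listing URLs via the "No" offset parameter
# (step of 40, as in my Selenium loop). Assumes the server honors
# these offsets for JS-free requests -- not verified.
BASE = ("https://www.finishline.com/store/men/shoes/_/N-1737dkj?mnid=men_shoes"
        "#/store/men/shoes/nike/adidas/jordan/under-armour/puma/new-balance"
        "/reebok/champion/timberland/fila/lacoste/converse/_/"
        "N-1737dkjZhtjl46Zh51uarZvnhst2Zu4e113Z16ggje2Z1alnhbgZ1lzobj2"
        "Z7oi4waZ1hzyzukZm0ym0nZj4k440Zdshbsy?mnid=men_shoes&No={}")

def offset_urls(total, step=40):
    """Return one listing URL per page of results, offsets 0, step, 2*step, ..."""
    return [BASE.format(offset) for offset in range(0, total, step)]

for u in offset_urls(120):
    print(u)
```

Each URL would then be fed to the requests/BS4 loop above in place of the single hard-coded page.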
Using Selenium, I did manage to get all 59 products shown on the page. This is the code I use to collect the URLs of all product pages for further scraping:
from selenium import webdriver

page = "https://www.finishline.com/store/men/shoes/_/N-1737dkj?mnid=men_shoes#/store/men/shoes/nike/adidas/jordan/under-armour/puma/new-balance/reebok/champion/timberland/fila/lacoste/converse/_/N-1737dkjZhtjl46Zh51uarZvnhst2Zu4e113Z16ggje2Z1alnhbgZ1lzobj2Z7oi4waZ1hzyzukZm0ym0nZj4k440Zdshbsy?mnid=men_shoes"
url_list2 = []
page_num = 0

while page_num < 1160:
    driver = webdriver.Chrome()
    driver.get(page)
    for card in driver.find_elements_by_class_name('product-card'):
        # Selenium's get_attribute("href") already returns an absolute URL,
        # so prepending the domain again would corrupt it
        url_list2.append(card.find_element_by_tag_name('a').get_attribute("href"))
    print(url_list2)
    driver.quit()
    # Advance the offset before building the next URL, so the first
    # page isn't fetched twice (once plain, once as No=0)
    page_num += 40
    page = "https://www.finishline.com/store/men/shoes/_/N-1737dkj?mnid=men_shoes#/store/men/shoes/nike/adidas/jordan/under-armour/puma/new-balance/reebok/champion/timberland/fila/lacoste/converse/_/N-1737dkjZhtjl46Zh51uarZvnhst2Zu4e113Z16ggje2Z1alnhbgZ1lzobj2Z7oi4waZ1hzyzukZm0ym0nZj4k440Zdshbsy?mnid=men_shoes&No={}".format(page_num)
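A note for anyone comparing the two snippets: BeautifulSoup's `['href']` hands back the raw attribute (typically a relative path), while Selenium's `get_attribute("href")` returns the browser-resolved absolute URL. A small sketch with `urllib.parse.urljoin` handles both cases uniformly (the product path below is a made-up placeholder):

```python
from urllib.parse import urljoin

BASE = "https://www.finishline.com"

def absolute(href):
    # urljoin resolves relative paths against BASE but leaves
    # already-absolute URLs untouched
    return urljoin(BASE, href)

print(absolute("/store/product/example"))  # relative path, gets the domain prepended
print(absolute("https://www.finishline.com/store/product/example"))  # already absolute, unchanged
```

With this helper, the same URL-collecting loop works whether the hrefs come from BS4 or from Selenium.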
However, after a while, the error
raise exception_class(message, screen, stacktrace, alert_text)
selenium.common.exceptions.UnexpectedAlertPresentException: Alert Text: None
Message: unexpected alert open: {Alert text : something went wrong}
occurs because the site has detected unusual activity. If I then open finishline.com in my browser, I get an "Access Denied" message and have to clear my cookies and refresh before the site works again. Obviously, my script can't finish before this block kicks in.
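One mitigation I'm considering is throttling: instead of hammering the site with a fresh driver on every loop, wrap each fetch in randomized exponential backoff. This is only a sketch (the `fetch` callable is a placeholder for the `driver.get(...)` plus extraction step, and I don't know what actually triggers the detection):

```python
import random
import time

def fetch_with_backoff(fetch, retries=3, base_delay=1.0):
    """Call fetch(); on failure, sleep base_delay * 2**attempt
    plus random jitter, then retry, up to `retries` attempts."""
    for attempt in range(retries):
        try:
            return fetch()
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries -- re-raise the last error
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))
```

A fixed pause of a few seconds between pages may also help, though none of this is guaranteed to get past the site's detection.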
Does anyone know of a solution? Thank you in advance.