
EDIT: The possible duplicate does not answer my question because I have also tried using a headless browser, without success. That question does not explain how to use a headless browser to accomplish this or a similar task.

I'm scraping this page:

https://www.finishline.com/store/men/shoes/_/N-1737dkj?mnid=men_shoes#/store/men/shoes/nike/adidas/jordan/under-armour/puma/new-balance/reebok/champion/timberland/fila/lacoste/converse/_/N-1737dkjZhtjl46Zh51uarZvnhst2Zu4e113Z16ggje2Z1alnhbgZ1lzobj2Z7oi4waZ1hzyzukZm0ym0nZj4k440Zdshbsy?mnid=men_shoes

The first 12 products are loaded automatically (without JS), and the remaining products (I believe 48?) are loaded after the user scrolls down a bit.
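
For reference, the headless-browser attempt mentioned in the edit looked roughly like the sketch below (the window size, scroll loop and wait times are guesses rather than a known-good recipe), and it was not successful:

import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')
options.add_argument('--window-size=1920,1080')

driver = webdriver.Chrome(options=options)
driver.get("https://www.finishline.com/store/men/shoes/_/N-1737dkj?mnid=men_shoes#/store/men/shoes/nike/adidas/jordan/under-armour/puma/new-balance/reebok/champion/timberland/fila/lacoste/converse/_/N-1737dkjZhtjl46Zh51uarZvnhst2Zu4e113Z16ggje2Z1alnhbgZ1lzobj2Z7oi4waZ1hzyzukZm0ym0nZj4k440Zdshbsy?mnid=men_shoes")

# scroll down a few times to try to trigger the lazy-loaded products
for _ in range(5):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)

print(len(driver.find_elements_by_class_name('product-card')))
driver.quit()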

Separately, this requests-only snippet:

import requests
import csv
import io
import os
import re
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options
from datetime import datetime
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
    }

url_list2 = []


data2 = requests.get("https://www.finishline.com/store/men/shoes/_/N-1737dkj?mnid=men_shoes#/store/men/shoes/nike/adidas/jordan/under-armour/puma/new-balance/reebok/champion/timberland/fila/lacoste/converse/_/N-1737dkjZhtjl46Zh51uarZvnhst2Zu4e113Z16ggje2Z1alnhbgZ1lzobj2Z7oi4waZ1hzyzukZm0ym0nZj4k440Zdshbsy?mnid=men_shoes",headers=headers)
soup2 = BeautifulSoup(data2.text, 'html.parser')

# each product tile on the page is a div with the class "product-card"
x = soup2.find_all('div', attrs={'class': 'product-card'})
for url2 in x:
    # the product links in the markup are relative, so prepend the domain
    get_urls = "https://www.finishline.com"+url2.find('a')['href']
    url_list2.append(get_urls)
print(url_list2)

will get the 12 products that are loaded independently of JS (this can be checked by turning JS off in Chrome's settings). However, there are 60 (or 59) products on the page when JS is turned on.

How can I get all of the products using BS4? I also tried Selenium, but with it I get a different error.

On the Selenium attempt, I managed to get all 59 products shown on the page. I am using this code to get the URLs of all product pages for further scraping.

import requests
import csv
import io
import os
import time
from datetime import datetime
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options
from selenium.webdriver import DesiredCapabilities
from bs4 import BeautifulSoup,Tag

page = "https://www.finishline.com/store/men/shoes/_/N-1737dkj?mnid=men_shoes#/store/men/shoes/nike/adidas/jordan/under-armour/puma/new-balance/reebok/champion/timberland/fila/lacoste/converse/_/N-1737dkjZhtjl46Zh51uarZvnhst2Zu4e113Z16ggje2Z1alnhbgZ1lzobj2Z7oi4waZ1hzyzukZm0ym0nZj4k440Zdshbsy?mnid=men_shoes"

url_list2 = []

page_num = 0
#session = requests.Session()
while page_num < 1160:
    # a fresh browser is opened for every page of results
    driver = webdriver.Chrome()
    driver.get(page)
    getproductUrls = driver.find_elements_by_class_name('product-card')
    for url2 in getproductUrls:
        # get_attribute("href") already returns an absolute URL, so there is no need to prepend the domain
        get_urls = url2.find_element_by_tag_name('a').get_attribute("href")
        url_list2.append(get_urls)
        print(url_list2)
    driver.close()

    # the No= parameter pages through the results, 40 products at a time
    page = "https://www.finishline.com/store/men/shoes/_/N-1737dkj?mnid=men_shoes#/store/men/shoes/nike/adidas/jordan/under-armour/puma/new-balance/reebok/champion/timberland/fila/lacoste/converse/_/N-1737dkjZhtjl46Zh51uarZvnhst2Zu4e113Z16ggje2Z1alnhbgZ1lzobj2Z7oi4waZ1hzyzukZm0ym0nZj4k440Zdshbsy?mnid=men_shoes&No={}".format(page_num)
    page_num += 40

However, after a while, the error

raise exception_class(message, screen, stacktrace, alert_text)
selenium.common.exceptions.UnexpectedAlertPresentException: Alert Text: None
Message: unexpected alert open: {Alert text : something went wrong}

occurs, because the site has detected unusual activity. If I were to open the website finishline.com in my browser, I would get an "Access Denied" message and would have to clear my cookies and refresh for it to work again. Obviously, my script doesn't get to finish before this message pops up.

Does anyone know of a solution? Thank you in advance.

kamen1111
  • Possible duplicate of [web scraping dynamic content with python](https://stackoverflow.com/questions/17608572/web-scraping-dynamic-content-with-python) – ivan_pozdeev Apr 12 '19 at 20:01

1 Answer


The content is available in the page source. You can't fetch all of it using requests alone because most of the products are inside script tags. Moreover, you need to find the appropriate URL that lets you traverse multiple pages; the one used below is the right one, and you can grab it using the Chrome dev tools. Currently the following script grabs 120 products. You can change the range to your preference.

This is how you can do it:

import requests
from bs4 import BeautifulSoup

url = "https://www.finishline.com/store/men/shoes/nike/adidas/jordan/under-armour/puma/new-balance/reebok/champion/timberland/fila/lacoste/converse/_/N-1737dkjZhtjl46Zh51uarZvnhst2Zu4e113Z16ggje2Z1alnhbgZ1lzobj2Z7oi4waZ1hzyzukZm0ym0nZj4k440Zdshbsy?"

# query string parameters spotted in the dev tools; No= pages through the results 40 at a time
qsp = {
    'mnid': 'men_shoes_nike_adidas_jordan_underarmour_puma_newbalance_reebok_champion_timberland_fila_lacoste_converse',
    'No': 0,
    'isAjax': True
}

container = []

for page_content in range(0,120,40):
    qsp['No'] = page_content
    res = requests.get(url,params=qsp,headers={"User-Agent":"Mozilla/5.0"})
    soup = BeautifulSoup(res.text, 'lxml')

    # products rendered directly in the markup
    for item in soup.select(".product-card__details .product-name"):
        container.append(item.get_text(strip=True))

    # the rest of the products sit as HTML inside script tags, so parse each
    # script's text again and pull the product names out of that
    for items in soup.select("script"):
        sauce = BeautifulSoup(items.text,"lxml")
        for elem in sauce.select(".product-card__details .product-name"):
            container.append(elem.get_text(strip=True))

for product in container:
    print(product)

Btw, I can see 40 products on each page. Perhaps the number of products per page differs by country. Change range(0,120,40) according to how many you can see per page on your end.
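
If you would rather not hardcode the range, here is a variation of the same script that keeps requesting pages until a page adds no new product names. That stopping condition is an assumption about how the site behaves past the last page, so treat it as a sketch:

import requests
from bs4 import BeautifulSoup

url = "https://www.finishline.com/store/men/shoes/nike/adidas/jordan/under-armour/puma/new-balance/reebok/champion/timberland/fila/lacoste/converse/_/N-1737dkjZhtjl46Zh51uarZvnhst2Zu4e113Z16ggje2Z1alnhbgZ1lzobj2Z7oi4waZ1hzyzukZm0ym0nZj4k440Zdshbsy?"

qsp = {
    'mnid': 'men_shoes_nike_adidas_jordan_underarmour_puma_newbalance_reebok_champion_timberland_fila_lacoste_converse',
    'No': 0,
    'isAjax': True
}

container = []
page_no = 0

while True:
    qsp['No'] = page_no
    res = requests.get(url, params=qsp, headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(res.text, 'lxml')

    # collect names from the plain markup and from the embedded script tags
    names = [item.get_text(strip=True) for item in soup.select(".product-card__details .product-name")]
    for items in soup.select("script"):
        sauce = BeautifulSoup(items.text, "lxml")
        names.extend(elem.get_text(strip=True) for elem in sauce.select(".product-card__details .product-name"))

    # assumed stop condition: an empty page, or a page that only repeats
    # names we already have, means we are past the last page
    if not names or set(names) <= set(container):
        break

    container.extend(names)
    page_no += 40  # 40 products per page on my end; adjust if yours differs

for product in container:
    print(product)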

SIM
  • Thank you very much! I'm sorry for responding so late. I was able to use this code to achieve exactly what I wanted! Could you please explain how you grabbed the URL via Chrome Dev tools? Did you construct the URL yourself? – kamen1111 Apr 17 '19 at 12:16