You are missing `?` before `currPage`; the URL should look like this: `https://www.julian-fashion.com/en-US/men/shoes/sneakers?currPage={}`. The `?` marks the start of the query string. Now your code will work.
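By the way, you don't have to build the query string by hand at all; `requests` can append it for you through the `params` argument. A minimal sketch:

```python
import requests

# requests inserts the "?" and URL-encodes the parameters for you
r = requests.get(
    "https://www.julian-fashion.com/en-US/men/shoes/sneakers",
    params={"currPage": 1},
)
print(r.url)  # https://www.julian-fashion.com/en-US/men/shoes/sneakers?currPage=1
```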
You can also skip page 0, because this site starts pagination from 1 and requesting page 0 gives a 404 Page not found.
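If you want to fail loudly on such a response instead of silently parsing an error page, `requests` can raise on 4xx/5xx status codes:

```python
import requests

r = requests.get("https://www.julian-fashion.com/en-US/men/shoes/sneakers?currPage=0")
r.raise_for_status()  # raises requests.exceptions.HTTPError on the 404
```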
Besides that, you don't need `while True`, because you want to execute this block of code only once; the `for` loop takes care of changing pages and that is enough.
There is a bug here:

```python
for s in keywords:
    if s not in link.a["href"]:
        found = False
        break
```
You break out of the loop as soon as one keyword is not in `link.a['href']`. Notice that if the first keyword from your list is missing, that doesn't mean the next ones are too.
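Written as a loop, your version effectively requires *every* keyword to be present, while matching on *any* single keyword is what `any()` expresses directly. A quick comparison, using a hypothetical href for illustration:

```python
keywords = ["nike", "air"]
href = "/en-US/product/12345/nike/sneakers/some_runner"  # hypothetical href: contains "nike" but not "air"

# your loop is equivalent to requiring ALL keywords to be present:
all_match = all(key in href for key in keywords)  # False: "air" is missing

# matching ANY single keyword is usually the intent here:
any_match = any(key in href for key in keywords)  # True: "nike" matches
```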
Your code after a few fixes:
```python
import requests
from bs4 import BeautifulSoup

headers = {
    "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36"
}
keywords = ["nike", "air"]
base_url = "https://www.julian-fashion.com"

for page in range(1, 11):
    print(f'[PAGE NR {page}]')
    url = "https://www.julian-fashion.com/en-US/men/shoes/sneakers?currPage={}".format(page)
    r = requests.get(url, headers=headers)  # pass the headers you defined above
    soup = BeautifulSoup(r.content, "html.parser")
    all_links = soup.find_all("li", attrs={"class": "product in-stock"})
    for link in all_links:
        if any(key in link.a["href"] for key in keywords):
            print("Product found.")
            print(base_url + link.a["href"])
            print(link.img["title"])
            print(link.div.div.ul.text)
```
Here is my version of the code. I used `.select()` instead of `.find_all()`. This is better because if the creators of the page add some new classes to the elements you search for, `.select()`, which uses CSS selectors, will still be able to target them. I also used `urljoin` to create absolute links.
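For example, `urljoin` builds an absolute link from the base URL and a root-relative href (this href is copied from the example output further below):

```python
from urllib.parse import urljoin

base_url = "https://www.julian-fashion.com"
print(urljoin(base_url, "/en-US/product/47218/y3/sneakers/saikou_sneakers"))
# -> https://www.julian-fashion.com/en-US/product/47218/y3/sneakers/saikou_sneakers
```

The full code: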
```python
from bs4 import BeautifulSoup
import requests
from urllib.parse import urljoin

keywords = ["nike", "air"]
base_url = "https://www.julian-fashion.com"

for page in range(1, 11):
    print(f'[PAGE NR {page}]')
    url = "https://www.julian-fashion.com/en-US/men/shoes/sneakers?currPage={}".format(page)
    r = requests.get(url)
    soup = BeautifulSoup(r.text, "html.parser")
    all_items = soup.select('li.product.in-stock')
    for item in all_items:
        link = urljoin(base_url, item.a['href'])
        if any(key in link for key in keywords):
            title = item.img["title"]
            sizes = [size.text for size in item.select('.sizes > ul > li')]
            print(f'ITEM FOUND: {title}\n'
                  f'sizes available: {", ".join(sizes)}\n'
                  f'find out more here: {link}\n')
```
Perhaps you wanted the keywords to be brands to filter items by; if so, you can check the brand name directly instead of checking whether a keyword appears in the link to the item:

```python
if item.select_one('.brand').text.lower() in keywords:
```

instead of:

```python
if any(key in link for key in keywords):
```
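One caveat: `select_one` returns `None` when an item has no element matching the selector, so reading `.text` would crash on such items. A small defensive sketch:

```python
def brand_matches(item, keywords):
    # select_one returns None if no ".brand" element exists, so guard before .text
    brand_tag = item.select_one('.brand')
    return brand_tag is not None and brand_tag.text.strip().lower() in keywords
```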
Monitor:
To make a simple monitor that checks for new items on the website, you can use the code below and adjust it to your needs:
```python
from bs4 import BeautifulSoup
import requests
import time

item_storage = dict()

while True:
    print('scraping')
    html = requests.get('http://localhost:8000').text
    soup = BeautifulSoup(html, 'html.parser')
    for item in soup.select('li.product.in-stock'):
        item_id = item.a['href']
        if item_id not in item_storage:
            item_storage[item_id] = item
            print(f'NEW ITEM ADDED: {item_id}')
    print('sleeping')
    time.sleep(5)  # here you can adjust the frequency of checking for new items
```
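Note that `item_storage` lives only in memory, so every restart re-reports all items as new. If you want the monitor to remember what it has seen across restarts, you could persist the hrefs to disk; a minimal sketch using a JSON file (the filename is my own choice):

```python
import json
from pathlib import Path

SEEN_FILE = Path("seen_items.json")  # hypothetical filename

def load_seen():
    # returns the set of hrefs reported in previous runs (empty on first run)
    if SEEN_FILE.exists():
        return set(json.loads(SEEN_FILE.read_text()))
    return set()

def save_seen(seen):
    # call this after each scraping pass
    SEEN_FILE.write_text(json.dumps(sorted(seen)))
```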
You can test this locally by creating an `index.html` file with several `<li class="product in-stock">` elements; you can copy them from the website. Open Chrome DevTools, find some `li`s in the Elements tab, right-click one -> Copy -> Copy outerHTML, then paste it into the `index.html` file. Then, from the directory containing the file, run `python -m http.server 8000` in a console and start the script above. While it runs, you can add some more items to the file and see their `href`s printed.
Example output:
```
scraping
NEW ITEM ADDED: /en-US/product/47341/nike/sneakers/air_maestro_ii_ltd_sneakers
NEW ITEM ADDED: /en-US/product/47218/y3/sneakers/saikou_sneakers
sleeping
scraping
NEW ITEM ADDED: /en-US/product/47229/y3/sneakers/tangutsu_slip_on
sleeping
```