
I have a script that scrapes all the links, titles and sizes of products matching certain keywords. After the first scrape is done, I want the script to check again and again whether new items have been added. I tried `while True:` but it doesn't seem to work, because it gives me the same data multiple times. The script is this:

import requests
import csv
from bs4 import BeautifulSoup
import time

headers = {"user-agent" : "Mozilla/5.0 (Macintosh; Intel Mac OS X 
10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 
Safari/537.36"}
keywords = ["nike", "air"]
base_url = "https://www.julian-fashion.com"

while True:
    for page in range(0,11):
        url = "https://www.julian-fashion.com/en-US/men/shoes/sneakerscurrPage={}".format(page)
        r = requests.get(url)
        soup = BeautifulSoup(r.content,"html.parser")
        all_links = soup.find_all("li", attrs={"class":"product in-stock"})
        for link in all_links:
            for s in keywords:
                if s not in link.a["href"]:
                    found = False
                    break
                else:
                    product = link.a["href"]
                    found = True
                    if found:
                        print("Product found.")
                        print(base_url+link.a["href"])
                        print(link.img["title"])
                        print(link.div.div.ul.text)
Phil
  • Could you please change the URL back to the one from the original question (before edits), because my answer addresses that issue and will become unclear if you attempt to fix the code after getting the answer. The ending used to be: `sneakerscurrPage={}` and I comment on the missing `?`. – radzak Apr 09 '18 at 10:41

1 Answer


You are missing `?` before `currPage`; it should look like this: `https://www.julian-fashion.com/en-US/men/shoes/sneakers?currPage={}`. The `?` indicates the start of the query string. Now your code will work.

You can also omit page 0, because this site starts pagination from 1 and requesting page 0 gives a 404 Page not found. Besides that, you don't need `while True` here, because you want to execute this block of code only once; the `for` loop takes care of changing pages, and that is enough.
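As a side note, you can avoid this class of typo entirely by letting the library build the query string for you: `requests.get(url, params={"currPage": page})` appends `?currPage=...` itself. A stdlib sketch of what that produces:

```python
from urllib.parse import urlencode

base = "https://www.julian-fashion.com/en-US/men/shoes/sneakers"

# urlencode turns the dict into "currPage=3"; the "?" separator is added
# explicitly -- the same URL requests builds when you pass params= to get()
url = "{}?{}".format(base, urlencode({"currPage": 3}))
print(url)  # https://www.julian-fashion.com/en-US/men/shoes/sneakers?currPage=3
```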

There is a bug here:

for s in keywords:
    if s not in link.a["href"]:
        found = False
        break

you break out of the loop if a keyword is not in `link.a['href']`. Notice that if the first keyword from your list is not there, it doesn't mean one of the next ones won't be.

Your code after a few fixes:

import requests
from bs4 import BeautifulSoup

headers = {
    "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36"
}
keywords = ["nike", "air"]
base_url = "https://www.julian-fashion.com"

for page in range(1, 11):
    print(f'[PAGE NR {page}]')
    url = "https://www.julian-fashion.com/en-US/men/shoes/sneakers?currPage={}".format(page)
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    all_links = soup.find_all("li", attrs={"class": "product in-stock"})
    for link in all_links:
        if any(key in link.a["href"] for key in keywords):
            print("Product found.")
            print(base_url + link.a["href"])
            print(link.img["title"])
            print(link.div.div.ul.text)

Here is my version of the code. I used `.select()` instead of `.find_all()`. This is better because if the creators of the page add some new classes to the elements you search for, `.select()`, which uses CSS selectors, will still be able to target those elements. I also used `urljoin` to create absolute links, see here why.

from bs4 import BeautifulSoup
import requests
from urllib.parse import urljoin

keywords = ["nike", "air"]
base_url = "https://www.julian-fashion.com"

for page in range(1, 11):
    print(f'[PAGE NR {page}]')
    url = "https://www.julian-fashion.com/en-US/men/shoes/sneakers?currPage={}".format(page)
    r = requests.get(url)
    soup = BeautifulSoup(r.text, "html.parser")
    all_items = soup.select('li.product.in-stock')

    for item in all_items:
        link = urljoin(base_url, item.a['href'])

        if any(key in link for key in keywords):
            title = item.img["title"]
            sizes = [size.text for size in item.select('.sizes > ul > li')]

            print(f'ITEM FOUND: {title}\n'
                  f'sizes available: {", ".join(sizes)}\n'
                  f'find out more here: {link}\n')

Perhaps you wanted the keywords to be brands to filter items by; if so, you can use the code below instead of checking whether the keyword is in the link to the item.

    if item.select_one('.brand').text.lower() in keywords:

instead of:

    if any(key in link for key in keywords):

Monitor:

To make a simple monitor that checks for new items on the website, you can use the code below and adjust it to your needs:

from bs4 import BeautifulSoup
import requests
import time

item_storage = dict()

while True:
    print('scraping')
    html = requests.get('http://localhost:8000').text
    soup = BeautifulSoup(html, 'html.parser')

    for item in soup.select('li.product.in-stock'):
        item_id = item.a['href']

        if item_id not in item_storage:
            item_storage[item_id] = item
            print(f'NEW ITEM ADDED: {item_id}')

    print('sleeping')
    time.sleep(5)  # here you can adjust the frequency of checking for new items

You can test that locally by creating an index.html file with several `<li class="product in-stock">` elements copied from the website: open Chrome DevTools, find some `li`s in the Elements tab, right-click one -> Copy -> Copy outerHTML, then paste it into the index.html file. Then run `python -m http.server 8000` in the console and start the script above. During execution, you can add some more items and see their hrefs printed.

Example output:

scraping
NEW ITEM ADDED: /en-US/product/47341/nike/sneakers/air_maestro_ii_ltd_sneakers
NEW ITEM ADDED: /en-US/product/47218/y3/sneakers/saikou_sneakers
sleeping
scraping
NEW ITEM ADDED: /en-US/product/47229/y3/sneakers/tangutsu_slip_on
sleeping
radzak
  • Hello! Thanks for Your answer and suggestion! What i didn't know is how to make the script wait to get a new link added in the website. I tried with while true but didn't work. Thanks – Phil Apr 07 '18 at 20:41
  • You mean a script running 24/7 that prints items once they get added? – radzak Apr 07 '18 at 20:43
  • @Phil I updated answer with the example of a simple monitor. – radzak Apr 09 '18 at 11:40
  • Thanks for Your Reply. I re-edited the link. I try to use the last script but i'm getting:iMac-di-Filippo:desktop phil$ python moni.py File "moni.py", line 17 print(f'NEW ITEM ADDED: {item_id}') ^ SyntaxError: invalid syntax – Phil Apr 09 '18 at 17:23
  • Oh yeah, forgot to mention, I used [`f-strings`](https://www.python.org/dev/peps/pep-0498/) to format the strings. It works for Python 3.6+ only. You can just change it to `print('NEW ITEM ADDED: {}'.format(item_id))`. – radzak Apr 09 '18 at 17:40
  • Thanks again for Your support! Do you suggest to save scraped data in a file.txt or in a db? – Phil Apr 10 '18 at 10:05
  • If you want to filter the items by sizes, etc., make some more complicated queries, maybe you'll need a db, but otherwise, a txt file should be enough. – radzak Apr 10 '18 at 11:15
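Following up on the txt-file suggestion, here is a minimal sketch of persisting seen hrefs across restarts (the file name `seen_items.txt` is just an example); calling `record_if_new(item.a['href'])` inside the monitor loop would replace the in-memory `item_storage` dict:

```python
from pathlib import Path

SEEN_FILE = Path("seen_items.txt")  # example file name

# load hrefs recorded by previous runs (empty set on the first run)
seen = set(SEEN_FILE.read_text().splitlines()) if SEEN_FILE.exists() else set()

def record_if_new(item_id: str) -> bool:
    """Return True (and persist the id) only the first time an id is seen."""
    if item_id in seen:
        return False
    seen.add(item_id)
    # append so earlier entries survive; one href per line
    with SEEN_FILE.open("a") as f:
        f.write(item_id + "\n")
    return True
```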