
I am creating a Python scraper for a website to pull price, product number, catalog number, and description. When I run this script, it only pulls the first item on the page and then moves on to the next URL. I'm new to Python and wondering how I can modify it to pull all of the products from the page. Thanks. To clarify: the first URL only has one product on it, but the second and third have many products that are not being pulled.

import requests
from bs4 import BeautifulSoup
import random
import time

product_urls = [
'https://www.qiagen.com/us/shop/pcr/primer-sets/miscript-precursor-assays/#orderinginformation',
'https://www.qiagen.com/us/shop/pcr/primer-sets/miscript-primer-assay-plate/#orderinginformation',
'https://www.qiagen.com/us/shop/pcr/primer-sets/miscript-primer-assays/#orderinginformation', 
]

for URL in product_urls:
    page = requests.get(URL)
    soup = BeautifulSoup(page.text,"lxml")
    timeDelay = random.randrange(5, 25)

    for item in soup.select('.content'):
        cat_name = item.select_one('.title').text.strip()
        cat_discription = item.select_one('.copy').text.strip()
        product_name = (item.find('div',{'class':'headline'}).text.strip())
        product_discription = (item.find('div',{'class': 'copy'}).text.strip())
        product_number = (item.find('td',{'class': 'textLeft paddingTopLess'}).text.strip())
        cat_number = (item.find('td',{'class': 'textRight paddingTopLess2'}).text.strip())
        product_price = (item.find('span',{'class': 'prc'}).text.strip())
        print("Catagory Name: {}\n\nCatagory Discription:  {}\n\nProduct Name:  {}\n\nProduct Discription:  {}\n\nProduct Number:  {}\n\nCat No:  {}\n\nPrice:  {}\n\n".format(cat_name,cat_discription,product_name,product_discription,product_number,cat_number,product_price))
        time.sleep(timeDelay)
  • Does `soup.select('.content')` return more than one item? – John Gordon Jan 26 '18 at 17:52
  • soup.select('.container') produces only this. – user9269112 Jan 26 '18 at 18:00
  • Check if any of your `find`s returns `None`. – user2314737 Jan 26 '18 at 18:00
  • Catagory Name: miScript Precursor Assays Catagory Discription: miScript Precursor Assays are precursor-miRNA– Product Name: Advanced search settings Product Discription: miScript Precursor Assays are precursor-miRNA– Product Number: Cat No: Varies Price: $91.80 – user9269112 Jan 26 '18 at 18:01
  • It appears that on each webpage, only one result is listed with the product information mentioned. Are you trying to scrape the products listings for all kits on the right column of the page? – Ajax1234 Jan 26 '18 at 18:01
  • I'm trying to pull all of the products in the table. The second URL has more products on it. https://www.qiagen.com/us/shop/pcr/primer-sets/miscript-primer-assay-plate/#orderinginformation – user9269112 Jan 26 '18 at 18:04

1 Answer


You can get the table elements from the div with class `pane`. The table at index 4 is the main product, and the one at index 5 (when present) holds the additional products.

In the following example, I use a list comprehension to output a list of tuples with title, description, product number, catalog number, and price:

from bs4 import BeautifulSoup
import requests

product_urls = [
    'https://www.qiagen.com/us/shop/pcr/primer-sets/miscript-precursor-assays/#orderinginformation',
    'https://www.qiagen.com/us/shop/pcr/primer-sets/miscript-primer-assay-plate/#orderinginformation',
    'https://www.qiagen.com/us/shop/pcr/primer-sets/miscript-primer-assays/#orderinginformation',
    'https://www.qiagen.com/us/shop/pcr/real-time-pcr-enzymes-and-kits/miscript-target-protectors/#orderinginformation',
    'https://www.qiagen.com/us/shop/pcr/real-time-pcr-enzymes-and-kits/two-step-qrt-pcr/miscript-sybr-green-pcr-kit/#orderinginformation'
]

session = requests.Session()

for URL in product_urls:

    response = session.get(URL)
    soup = BeautifulSoup(response.content, "html.parser")

    # all tables inside the first div with class "pane"
    tables = soup.find_all("div", {"class": "pane"})[0].find_all("table")

    if len(tables) > 4:
        product_list = [
            (
                t[0].find_all("div", {"class": "headline"})[0].text.strip(),  # title
                t[0].find_all("div", {"class": "copy"})[0].text.strip(),      # description
                t[1].text.strip(),                                            # product number
                t[2].text.strip(),                                            # catalog number
                t[3].text.strip()                                             # price
            )
            for t in (t.find_all('td') for t in tables[4].find_all('tr'))
            if t
        ]
    elif len(tables) == 1:
        product_list = [
            (
                t[0].find_all("div", {"class": "catNo"})[0].text.strip(),     # catalog number
                t[0].find_all("div", {"class": "headline"})[0].text.strip(),  # headline
                t[0].find_all("div", {"class": "price"})[0].text.strip(),     # price
                t[0].find_all("div", {"class": "copy"})[0].text.strip()       # description
            )
            for t in (t.find_all('td') for t in tables[0].find_all('tr'))
            if t
        ]
    else:
        print("could not parse main product")
        continue  # skip this URL so a stale product_list is not reprinted

    print(product_list)

    if len(tables) > 5:
        add_product_list = [
            (
                t[0].find_all("div", {"class": "title"})[0].text.strip(),  # title
                t[0].find_all("div", {"class": "copy"})[0].text.strip(),   # description
                t[1].text.strip(),                                         # product number
                t[2].text.strip(),                                         # catalog number
                t[3].text.strip()                                          # price
            )
            for t in (t.find_all('td') for t in tables[5].find_all('tr'))
            if t
        ]
        print(add_product_list)

Check this answer if you want to convert the list of tuples into a single list for each field.
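As a quick illustration (with made-up rows, not real Qiagen data), `zip(*product_list)` transposes the list of tuples into one sequence per field:

```python
# Hypothetical rows in the same (title, description, product no, cat no, price)
# tuple shape produced by the scraper above.
product_list = [
    ("Product A", "first description", "MS-0001", "1111", "$10.00"),
    ("Product B", "second description", "MS-0002", "2222", "$20.00"),
]

# zip(*...) pairs up the nth element of every tuple, giving one column per field
titles, descriptions, product_numbers, cat_numbers, prices = [
    list(col) for col in zip(*product_list)
]

print(titles)  # one list containing every title
print(prices)  # one list containing every price
```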

Bertrand Martel
  • Thank you so much, this is so helpful. As someone who is still trying to learn Python in more detail, how did you find the numbers of the tables? Did you just count them and put them in order? I know this must be a very simple question for you. – user9269112 Jan 29 '18 at 05:17
  • Yes, I counted them. I didn't find any parent class/id that would identify them precisely, but if you find a more specific way to locate them, that would work as well. – Bertrand Martel Jan 29 '18 at 06:01
  • Thank you so much, this is very helpful, and thank you for the link. I have one question: how did you come up with the number for tables such as `tables[4]`? I'm still pretty new to Python, so I just want to make sure I understand all the steps. – user9269112 Jan 29 '18 at 15:34
  • @user9269112 In the Chrome developer tools, using the inspector, you can view an element by selecting it, which is useful when identifying what to scrape. In the script above, you look for the first div with class "pane" and then for all the table elements within it. I noticed that there are 4 tables when no additional product is shown and a 5th table when additional products are present. – Bertrand Martel Jan 29 '18 at 15:55
  • Awesome! I'll have to look into using Chrome; I have been using Firefox for this project. One more question, sorry again if it is very simple, but I want to add a few more URLs to this script and I get an index out of range error. Is that because tables is set to > 5? – user9269112 Jan 29 '18 at 16:52
  • Here is one of the problematic ones: 'https://www.qiagen.com/us/shop/pcr/real-time-pcr-enzymes-and-kits/miscript-target-protectors/#orderinginformation'. I think I need to add an if/else statement for URLs like this. – user9269112 Jan 29 '18 at 18:09
  • I've just added the URL above to the list but can't reproduce the error; it gives the main product and no additional product, as expected for this page. – Bertrand Martel Jan 29 '18 at 18:18
  • Sorry, this is the one that is acting up: 'https://www.qiagen.com/us/shop/pcr/real-time-pcr-enzymes-and-kits/two-step-qrt-pcr/miscript-sybr-green-pcr-kit/#orderinginformation'. This is the error: IndexError: list index out of range – user9269112 Jan 29 '18 at 18:26
  • @user9269112 Yes, in that case you would check the size of tables and take the first one; the mechanism is the same as for the others, see the updated answer. Note that the model (the fields) is different from the other two (main product and additional products for the earlier URLs). – Bertrand Martel Jan 29 '18 at 18:43
  • Thank you so much for your help @Bertrand Martel! Now if I were to run across more URLs that had different tags for price, could I just add in an or statement to have it look for both in the same table? – user9269112 Jan 30 '18 at 17:49
  • @user9269112 I guess for the same table you would check the td tag classname, like `t[0].findAll("div", {"class":"headline"})[0]`; if you have a table in the same position (`tables[4]`) with different content (or a different layout in the table), you'll need to check for the presence of other fields, maybe some class in a div, or ids. – Bertrand Martel Jan 30 '18 at 18:17
  • Yeah, on this one URL it appears that all the products come up together and not in two different tables as on the previous URLs, and I think the table with the products on it is now table 2. The tags have also changed in this URL. Here is the link if you want to confirm this: [https://www.qiagen.com/us/shop/rnai/mmhs-mapk1-control-sirna/#orderinginformation] – user9269112 Jan 30 '18 at 18:31
  • @user9269112 In that case we fall under the branch `len(tables) == 1`, so it gets catalog number, headline, price, and description. There is no other table, but is that a problem? – Bertrand Martel Jan 30 '18 at 18:40
  • I think I found the problem; it's coming from this URL [https://www.qiagen.com/us/shop/protein-and-cell-assays/multi-analyte-elisarray-kits/#orderinginformation]. I'm getting the list out of index again: (for t in (t.find_all('td') for t in tables[4].find_all('tr')) if t – user9269112 Jan 30 '18 at 18:55
  • What would be the best way to save this to a CSV file? I am not sure how to do this since it's in a tuple. – user9269112 Feb 07 '18 at 20:32
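For the CSV question above, a minimal sketch using the standard library's `csv` module (the rows and the `products.csv` filename here are hypothetical; `csv.writer.writerows` accepts a list of tuples directly):

```python
import csv

# Hypothetical rows in the same (title, description, product no, cat no, price)
# tuple shape that the scraper produces.
product_list = [
    ("Product A", "first description", "MS-0001", "1111", "$10.00"),
    ("Product B", "second description", "MS-0002", "2222", "$20.00"),
]

# newline="" prevents blank lines between rows on Windows
with open("products.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["title", "description", "product_no", "cat_no", "price"])
    writer.writerows(product_list)  # each tuple becomes one CSV row
```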