I have written two pieces of code for my project: one scrapes the product data from each product page and the other scrapes the images. The code below scrapes product data and outputs it to a CSV.

I want to be able to run this for 1000+ product pages without getting blocked. Can someone point me in the right direction for how this can be done? I am very new to this and have mostly written the following code myself (with some help from some fantastic people on here to get the tricky stuff sorted).

I have no idea how to start with rotating proxies so any help is appreciated!!

import requests
from bs4 import BeautifulSoup
import pandas as pd

url_list = []  # each scraped page appends a dict of product fields here

def getProduct_Data(tag):
    # Build the product-detail URL for the given part number
    url = f'https://www.whiteline.com.au/product_detail4.php?part_number={tag}'

    r = requests.get(url)

    soup = BeautifulSoup(r.text, 'html.parser')

    # Each field is located by the class or inline style of its <div> on the page
    Product_Data = {
        'sku': soup.find("div", {"class": "head2BR"}).text,
        'product': soup.find("div", {"style": "float:left; margin-left:24px; margin-top:8px; margin-right:14px;"}).text,
        'description': soup.find("div", {"style": "min-height:80px; max-height:224px; overflow-y: scroll;"}).text.strip(),
        'price': soup.find("div", {"class": "float_Q"}).text.strip(),
        'features': soup.find("div", {"class": "grey1"}).text.strip(),
        'contents': soup.find("div", {"style": "max-height:28px; overflow-y:scroll;"}).text.strip(),
        'compatibility': soup.find("div", {"style": "width:960px; margin:auto; padding-top:18px;"}).text.strip(),
    }
    url_list.append(Product_Data)

part_numbers = [
    'KBR15', 'W13374', 'BMR98', 'W51210', 'W51211', 'W92498', 'W93404',
    'W92899', 'W51710', 'W53277', 'W53379', 'BSK010M', 'KSB568',
]
for part_number in part_numbers:
    getProduct_Data(part_number)

df = pd.DataFrame(url_list)
df.to_csv('whitelinefull.csv')
print('Fin.')
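One simple mitigation before reaching for proxies is to space the requests out and send a browser-like `User-Agent` header (the default `python-requests` one is easy to block). Below is a minimal sketch reusing `getProduct_Data` and `part_numbers` from the code above; the header string and the 1-3 second delay are illustrative assumptions, not values anyone in the thread recommended:

import time
from random import uniform

# Any current browser UA string will do; this one is just an example.
HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

# Inside getProduct_Data, the request would then become:
#     r = requests.get(url, headers=HEADERS, timeout=10)

for part_number in part_numbers:
    getProduct_Data(part_number)
    time.sleep(uniform(1.0, 3.0))  # random 1-3 s pause between pages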
Lynda Harmer
  • `requests` supports [proxy specification](https://stackoverflow.com/questions/8287628/proxies-with-python-requests-module), so if you have a list of IPs, you could devise a rotation strategy (a random proxy for every request, or a new proxy every `n` requests). However, you could also use a service like [Scrapingbee](https://www.scrapingbee.com/), which handles all of that for you and gives users 1000 free calls to start. – Ajax1234 Jul 28 '21 at 15:21
  • Thanks for the help over the last few days - I think I have figured it out. I added the below code to the start and it seems to have worked! – Lynda Harmer Jul 29 '21 at 03:18
  • from random import choice  # needed for the snippet below

    def proxy_generator():
        response = requests.get("https://sslproxies.org/")
        soup = BeautifulSoup(response.content, 'html5lib')
        # Every 8th <td> on the page is an IP and the one after it is the
        # matching port; pair them up and pick one at random
        proxy = {'https': choice(list(map(lambda x: x[0] + ':' + x[1],
                                          list(zip(map(lambda x: x.text, soup.findAll('td')[::8]),
                                                   map(lambda x: x.text, soup.findAll('td')[1::8]))))))}
        return proxy

    – Lynda Harmer Jul 29 '21 at 03:19
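Tying those two comments together: here is a minimal sketch of per-request rotation, assuming the `proxy_generator()` from the comment above is defined and `requests` is imported. The retry count, timeout, and pause are illustrative, and free proxies from sslproxies.org fail often, so the retry loop does real work:

import time
import requests

def get_with_rotating_proxy(url, max_retries=5):
    # Fetch url through a fresh random proxy, retrying with a new
    # proxy whenever the request fails or times out.
    for attempt in range(max_retries):
        proxy = proxy_generator()  # random {'https': 'ip:port'} from the comment above
        try:
            r = requests.get(url, proxies=proxy, timeout=10)
            r.raise_for_status()
            return r
        except requests.RequestException:
            time.sleep(1)  # brief pause, then try a different proxy
    raise RuntimeError(f'all {max_retries} proxy attempts failed for {url}')

With this in place, the `r = requests.get(url)` line inside `getProduct_Data` becomes `r = get_with_rotating_proxy(url)`.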
