I have written two lots of code for my project. One will scrape the product data from each product page and the other will scrape the images. The code below scrapes product data and outputs to CSV.
I want to be able to run this for 1000+ product pages without getting blocked. Can someone point me in the right direction for how this can be done? I am very new to this and have mostly written the following code myself (with some help from some fantastic people on here to get the tricky stuff sorted).
I have no idea how to start with rotating proxies so any help is appreciated!!
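The closest I've gotten is this sketch based on examples I've found online, cycling through a proxy list with itertools.cycle and sleeping a random amount between requests (the proxy addresses below are just placeholders, not real proxies). Is this the right idea?

```python
import itertools
import random
import time

import requests

# Placeholder proxy addresses -- you'd substitute real proxies here
PROXIES = [
    'http://111.111.111.111:8080',
    'http://222.222.222.222:8080',
    'http://333.333.333.333:8080',
]

# cycle() loops over the list forever, so each request gets the next proxy
proxy_pool = itertools.cycle(PROXIES)

def fetch(url):
    proxy = next(proxy_pool)  # rotate to the next proxy for this request
    time.sleep(random.uniform(1, 3))  # polite random delay between requests
    return requests.get(
        url,
        proxies={'http': proxy, 'https': proxy},
        timeout=10,
    )
```

I'm not sure if cycling like this is enough on its own or whether I also need to randomise headers / User-Agent strings.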
import requests
from bs4 import BeautifulSoup
import pandas as pd

product_list = []

def get_text(soup, *args, **kwargs):
    # Return stripped text, or '' if the element is missing,
    # so one absent div doesn't crash the whole run
    el = soup.find(*args, **kwargs)
    return el.text.strip() if el else ''

def getProduct_Data(tag):
    url = f'https://www.whiteline.com.au/product_detail4.php?part_number={tag}'
    r = requests.get(url, timeout=10)
    soup = BeautifulSoup(r.text, 'html.parser')
    Product_Data = {
        'sku': get_text(soup, 'div', {'class': 'head2BR'}),
        'product': get_text(soup, 'div', {'style': 'float:left; margin-left:24px; margin-top:8px; margin-right:14px;'}),
        'description': get_text(soup, 'div', {'style': 'min-height:80px; max-height:224px; overflow-y: scroll;'}),
        'price': get_text(soup, 'div', {'class': 'float_Q'}),
        'features': get_text(soup, 'div', {'class': 'grey1'}),
        'contents': get_text(soup, 'div', {'style': 'max-height:28px; overflow-y:scroll;'}),
        'compatibility': get_text(soup, 'div', {'style': 'width:960px; margin:auto; padding-top:18px;'}),
    }
    product_list.append(Product_Data)

part_numbers = [
    'KBR15', 'W13374', 'BMR98', 'W51210', 'W51211', 'W92498', 'W93404',
    'W92899', 'W51710', 'W53277', 'W53379', 'BSK010M', 'KSB568',
]
for part in part_numbers:
    getProduct_Data(part)

df = pd.DataFrame(product_list)
df.to_csv('whitelinefull.csv', index=False)
print('Fin.')