Hello fellow coders :)
As part of my research project I need to scrape data from a website. The site detects bots, so I am trying to add proxies to a loop I know works (it collects the brand URLs):
The working loop:
brands_links = []
for country_link in country_links:
    r = requests.get(url + country_link, headers=headers)
    soup_b = BeautifulSoup(r.text, "lxml")
    for link in soup_b.find_all("div", class_='designerlist cell small-6 large-4'):
        for link in link.find_all('a'):
            durl = link.get('href')
            brands_links.append(durl)
The loop using proxies:
brands_links = []
i = 0
while i in range(0, len(country_links)):
    print(i)
    try:
        proxy_index = random.randint(0, len(proxies) - 1)
        proxy = {"http": proxies[proxy_index], "https": proxies[proxy_index]}
        r = requests.get(url + country_links[i], headers=headers, proxies=proxy, timeout=10)
        soup_b = BeautifulSoup(r.text, "lxml")
        for link in soup_b.find_all("div", class_='designerlist cell small-6 large-4'):
            for link in link.find_all('a'):
                durl = link.get('href')
                brands_links.append(durl)
        if durl is not None:
            print("scraping happening")
            i += 1
        else:
            continue
    except:
        print("proxy not working")
        proxies.remove(proxies[proxy_index])
        if i == len(country_links):
            break
        else:
            continue
Unfortunately it does not scrape all the links. With the working loop, using only headers, I get a list of length 3788; with the proxy loop I only get 2387. Inspecting the data shows it skips some country links, hence the difference in length. I am trying to force the loop to scrape every link with the "if" statement, but it does not seem to work.
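To make the structure I am aiming for clearer, here is a minimal standalone sketch with no real network calls; `fetch` and `get_links_from_html` are placeholder callables standing in for the requests/BeautifulSoup code above. The idea is to advance to the next country link only when the page actually yielded brand links, and otherwise retry the same link with a fresh random proxy:

```python
import random

# Minimal sketch; `fetch` and `get_links_from_html` are placeholder
# callables standing in for the requests/BeautifulSoup code above.
def scrape_all(country_links, proxies, fetch, get_links_from_html):
    """Collect brand links, retrying each country link until it yields
    results and rotating to another random proxy after any failure."""
    brands_links = []
    i = 0
    while i < len(country_links) and proxies:
        proxy = random.choice(proxies)
        try:
            html = fetch(country_links[i], proxy)
        except Exception:
            proxies.remove(proxy)  # drop the dead proxy, retry same link
            continue
        found = get_links_from_html(html)
        if found:
            brands_links.extend(found)
            i += 1  # advance only when the page actually yielded links
        # else: likely a bot-detection page; retry the same country link
    return brands_links
```

The loop ends early if every proxy has been removed, so a partial result list is still returned in the worst case.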
Does anyone know what I am doing wrong, or have an idea that would make it scrape everything? Thanks in advance.