
Hello fellow coders :)

So, as part of my research project, I need to scrape data from a website. Obviously it detects bots, therefore I am trying to implement proxies in a loop I know works (it gets the brand URLs):

The working loop:

brands_links = []
for country_link in country_links:
    r = requests.get(url + country_link, headers=headers)
    soup_b = BeautifulSoup(r.text, "lxml")
    for div in soup_b.find_all("div", class_='designerlist cell small-6 large-4'):
        for link in div.find_all('a'):
            durl = link.get('href')
            brands_links.append(durl)

The loop using proxies:

brands_links = []
i = 0
while i in range(0, len(country_links)):
    print(i)
    try:
        proxy_index = random.randint(0, len(proxies) - 1)
        proxy = {"http": proxies[proxy_index], "https": proxies[proxy_index]}
        r = requests.get(url + country_links[i], headers=headers, proxies=proxy, timeout=10)
        soup_b = BeautifulSoup(r.text, "lxml")
        for div in soup_b.find_all("div", class_='designerlist cell small-6 large-4'):
            for link in div.find_all('a'):
                durl = link.get('href')
                brands_links.append(durl)

        if durl is not None:
            print("scraping happening")
            i += 1
        else:
            continue

    except:
        print("proxy not working")
        proxies.remove(proxies[proxy_index])

    if i == len(country_links):
        break
    else:
        continue

Unfortunately it does not scrape all the links.

With the working loop, using only headers, I get a list of length 3788. With this one I only get 2387.

By inspecting the data I can see that it skips some country links, hence the difference in length. I am trying to force the loop to scrape every link with the "if" statement, but it does not seem to work.

Does anyone know what I am doing wrong, or have an idea which would make it scrape everything? Thanks in advance.

IYJJ
  • Any chance you could share the URL? – Paul M. Aug 08 '20 at 11:29
  • @PaulM. the url is: 'https://www.fragrantica.com'. Please bear in mind that to run the loop I need to scrape the country links first :) – IYJJ Aug 08 '20 at 11:35
  • For some reason Stack Overflow automatically transforms the link. It is preceded by https:// – IYJJ Aug 08 '20 at 11:47

2 Answers


Thanks for sharing the link.

You said:

Obviously it detects bots therefore I am trying to implement proxies...

What makes you think this? Here is some code I came up with, which seems to scrape all the divs, as far as I can tell:

def main():

    import requests
    from bs4 import BeautifulSoup

    countries = (
        ("United States", "United+States.html"),
        ("Canada", "Canada.html"),
        ("United Kingdom", "United+Kingdom.html")
    )

    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36"
    }

    for country, document in countries:
        url = f"https://www.fragrantica.com/country/{document}"

        response = requests.get(url, headers=headers)
        response.raise_for_status()

        soup = BeautifulSoup(response.content, "html.parser")
        divs = soup.find_all("div", {"class": "designerlist"})
        print(f"Number of divs in {country}: {len(divs)}")

    return 0


if __name__ == "__main__":
    import sys
    sys.exit(main())

Output:

Number of divs in United States: 1016
Number of divs in Canada: 40
Number of divs in United Kingdom: 308
Paul M.
  • Hello Paul, thank you for your quick reply! The scraping of the brands works without any problems. The issue starts when I scrape the perfume URLs from the brand URLs. Since there are more than 60,000 perfumes, the website detects a bot after a few hundred perfumes scraped. That is why I built the loop in the question: to test the proxies and only move on to the next link once it has actually been scraped. I hope I have made my problem clear now :) – IYJJ Aug 09 '20 at 09:29
  • Also, could you explain what the if statement at the end is for? – IYJJ Aug 09 '20 at 09:33
  • @IYJJ Thanks for the clarification. I'll take a look and see what I can come up with. [Here is a link to a thread which answers your question about the if-statement](https://stackoverflow.com/a/419185/10987432). In a nutshell, it's just a cute way of conditionally executing the main function, depending on whether the current module is the main program or is being imported by another module. – Paul M. Aug 09 '20 at 11:40
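To illustrate the guard discussed in that comment, a minimal sketch (the module and function names are just placeholders):

```python
# mymodule.py -- minimal illustration of the `if __name__ == "__main__"` guard
def main():
    # whatever the script should do when run directly
    print("running as a script")

if __name__ == "__main__":
    # __name__ is "__main__" only when this file is executed directly
    # (python mymodule.py); when imported, __name__ is "mymodule" and
    # main() is NOT called automatically.
    main()
```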

So I found a way to force the loop to keep scraping until it actually scrapes the link. Here's the updated code:

brands_links = []
i = 0
while i in range(0, len(country_links)):
    print(i)
    try:
        # pick a random proxy for this attempt
        proxy_index = random.randint(0, len(proxies) - 1)
        proxy = {"http": proxies[proxy_index], "https": proxies[proxy_index]}
        r = requests.get(url + country_links[i], headers=headers, proxies=proxy, timeout=10)
        soup_b = BeautifulSoup(r.text, "lxml")
        for div in soup_b.find_all("div", class_='designerlist cell small-6 large-4'):
            for link in div.find_all('a'):
                durl = link.get('href')
                brands_links.append(durl)

    except requests.exceptions.RequestException:
        # request failed: drop the dead proxy and retry the same country link
        print("proxy not working")
        proxies.remove(proxies[proxy_index])
        continue

    try:
        durl          # raises NameError if nothing was scraped this pass
    except NameError:
        print("scraping not happening")
        continue
    else:
        print("scraping happening")
        del durl      # reset so the next iteration cannot see a stale value

    i += 1
    if i == len(country_links):
        break
    else:
        continue

So it is the try/except NameError block near the end which checks whether the link was actually scraped: deleting durl after each successful pass means it only exists again if the current page really produced links.
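That reset matters because, without it, durl keeps its value from the previous iteration, so a check like `if durl is not None:` passes even when the current request silently returned nothing. A minimal sketch of the stale-variable problem (made-up data, no network calls):

```python
pages = ["<a href='/brand-a'>A</a>", ""]  # second "page" yields no links

durl = None
scraped = []
for html in pages:
    # crude stand-in for soup.find_all('a')
    if "href='" in html:
        durl = html.split("'")[1]
        scraped.append(durl)
    # when nothing matches, durl is NOT reset: it silently keeps
    # '/brand-a', so `if durl is not None:` still looks like success

print(durl)     # '/brand-a', even though the second page gave nothing
print(scraped)  # ['/brand-a']
```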

I am not really familiar with functions, so if anyone has a way to make this simpler or more efficient I would highly appreciate it. For now I will be using @Paul M.'s function to improve my loop or transform it into a function.
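One way to simplify the retry logic is to pull it into a small helper. This is only a sketch, assuming the same `proxies`, `headers`, and page URLs as in the question; `fetch_with_proxies` and `max_tries` are names made up for illustration:

```python
import random
import requests
from bs4 import BeautifulSoup

def fetch_with_proxies(page_url, proxies, headers, max_tries=10):
    """Try random proxies until one request succeeds.

    Returns the parsed soup, or None if every attempt failed.
    Dead proxies are removed from the shared `proxies` list.
    """
    for _ in range(max_tries):
        if not proxies:
            break  # ran out of proxies entirely
        proxy_addr = random.choice(proxies)
        proxy = {"http": proxy_addr, "https": proxy_addr}
        try:
            r = requests.get(page_url, headers=headers, proxies=proxy, timeout=10)
            r.raise_for_status()
            return BeautifulSoup(r.text, "lxml")
        except requests.exceptions.RequestException:
            proxies.remove(proxy_addr)  # drop the dead proxy and retry
    return None

# usage sketch with the loop from the question:
# brands_links = []
# for country_link in country_links:
#     soup_b = fetch_with_proxies(url + country_link, proxies, headers)
#     if soup_b is None:
#         continue  # no working proxy found for this page
#     for div in soup_b.find_all("div", class_="designerlist cell small-6 large-4"):
#         for a in div.find_all("a"):
#             brands_links.append(a.get("href"))
```

Because the retry happens inside the helper, the outer loop stays a plain `for` over `country_links`, which removes the need for the manual counter and the durl bookkeeping.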

IYJJ