Why does my script stop when scraping images from the web?

Question

I am currently trying to enrich a dataset for machine learning using a script that downloads images from Google.

I first iterate over a dataframe that contains the terms to search on Google. With the Selenium webdriver I then retrieve the URLs of the images to download and save them in specific folders, depending on the field, via this function:

import os
import requests

def download_image(file_path, url, file_name):
    try:
        response = requests.get(url)
        response.raise_for_status()
        with open(os.path.join(file_path, file_name), 'wb') as file:
            file.write(response.content)
        print(f"Image downloaded successfully to {os.path.join(file_path, file_name)}")
    except requests.exceptions.HTTPError as http_error:
        print(f"HTTP error occurred: {http_error}")
    except Exception as error:
        print(f"An error occurred: {error}")

which is called in this loop:

def enhanced_dataset_folder(name:str, tag:str, df):
    DRIVER_PATH = "chromedriver"
    wd = webdriver.Chrome(DRIVER_PATH)
    urls = get_images(tag, wd, 1, 2)
    folder_name = name.split('/')[0]
    props = tag.split(' ')
    test = []
    for i, url in enumerate(urls):
        try:
            img_name = str(i) + "_img"+str(i)+".jpg"
            download_image("train/"+folder_name+"/", url, img_name)
        except Exception as e:
            print('Fail: ', e)
            continue
        else:
            print("ok")
            #df.append([folder_name+"/"+img_name,tag,props[0],props[1],props[2]], ignore_index=True)
    wd.quit()

The Google Chrome window and the script always stop at the same point, no matter how many photos I get per page. I get this output, but no error is raised:

Image downloaded successfully to train/1982 Porsche 944/0_img0.jpg
ok
Image downloaded successfully to train/1982 Porsche 944/1_img1.jpg
ok
HTTP error occurred: 403 Client Error: Forbidden. Please comply with the User-Agent policy: https://meta.wikimedia.org/wiki/User-Agent_policy for url: https://upload.wikimedia.org/wikipedia/commons/1/13/1986_944_Turbo.jpg
ok
Image downloaded successfully to train/1996 Ferrari 550 Maranello/0_img0.jpg
ok
Image downloaded successfully to train/1996 Ferrari 550 Maranello/1_img1.jpg
ok
Image downloaded successfully to train/1996 Ferrari 550 Maranello/2_img2.jpg
ok
Image downloaded successfully to train/2001 BMW 3 Series Convertible/0_img0.jpg
ok

After that nothing happens, even if I let it run for more than 10 minutes. I know the problem is in the download_image function, because when I don't call it the URLs are retrieved for every row of the dataframe.

dodrg · Answer 1 · 2023-03-28T22:13:01.467

It was quite helpful to read the error message:

HTTP error occurred: 403 Client Error: Forbidden. 
    Please comply with the User-Agent policy: https://meta.wikimedia.org/wiki/User-Agent_policy 
    for url: https://upload.wikimedia.org/wikipedia/commons/1/13/1986_944_Turbo.jpg

You are evidently violating the website's policies. To protect themselves, site operators can take whatever countermeasures they like, including sending you fake content.

In this case (wikimedia.org) they tell you under what conditions they accept scraping of their files: https://meta.wikimedia.org/wiki/User-Agent_policy

They expect a proper user agent that lets them classify the access and contact you. They urge you to send an agent string that identifies you as an individual, identifiable bot. Otherwise they take countermeasures.

They expect the word "bot" within the agent string. The expected syntax of the agent string is:

<client name>/<version> (<contact information>) <library/framework name>/<version> [<library name>/<version> ...]

# Example:
User-Agent: CoolBot/0.0 (https://example.org/coolbot/; coolbot@example.org) generic-library/0.0

For Python, they also give a sample code snippet:

import requests

url = 'https://example/...'
headers = {'User-Agent': 'CoolBot/0.0 (https://example.org/coolbot/; coolbot@example.org)'}

response = requests.get(url, headers=headers)

So I would suggest that you:

Set up a web page as a contact page / vCard for your bot, including an e-mail address. A short introduction to your project might be helpful.
Customize the agent string to include the keyword "bot", your intent, and the contact page.

Then give the bot a run and report whether things got better.
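
As a rough illustration of the two suggestions above, the agent string can be assembled from its parts and attached to every request. This is only a sketch: build_user_agent and make_request are hypothetical helper names, and the CoolBot values are the placeholders from the Wikimedia example, not real contact details. It uses only the standard library so it runs without extra packages; with requests, the same string goes into the headers dict as in the snippet quoted earlier.

```python
import urllib.request


def build_user_agent(client, version, contact_url, contact_email, library=None):
    """Assemble '<client>/<version> (<contact information>) [<library>/<version>]'."""
    user_agent = f"{client}/{version} ({contact_url}; {contact_email})"
    if library:
        # Optional trailing part naming the HTTP library, e.g. 'generic-library/0.0'
        user_agent += f" {library}"
    return user_agent


# Placeholder values taken from the policy example; replace them with your own
# bot name, contact page and e-mail address.
USER_AGENT = build_user_agent(
    "CoolBot",
    "0.0",
    "https://example.org/coolbot/",
    "coolbot@example.org",
    library="generic-library/0.0",
)


def make_request(url, user_agent=USER_AGENT):
    """Return a request object that carries the identifying User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})
```

The object returned by make_request can be passed to urllib.request.urlopen in place of a bare URL.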

About the 403 - Forbidden

This is a proper response to a GET or POST request. With this response the HTTP request is finished, and your script has to decide what to do next.
=> Your script decides to continue after writing to the log.

If you had been blocked in general (e.g. by an access rule for your IP address), you would see a 403 for every single access to this server.
=> This is not the case in this log.

'Forbidden' occurs when accessing a restricted resource. As you get your URLs from a Google search, URLs to restricted files are possible, since such URLs might be published in the public area of a website.
=> At first glance there is nothing special about a 403.

A 403 can only be suspected as a trigger when a 403 hit is regularly followed by a problem at the same site (or at sites hosted by the same operator).

=> Some more details about these 403s combined with the problem would be nice.

As you write that the problem disappeared: what did you change?
Or did you just get a new search result from Google prioritizing other sites?

The answer to your question

Your statement significantly increases the probability that the 403-causing URL is a trigger URL:

I didn't change anything except bypassing the url causing the first 403 error which led to my script stopping. I didn't find yet the best behavior for this algorithm but this workaround allowed me to enrich my dataset

By doing this you bypassed the trigger.

The best way for your project to avoid the problem is to gain acceptance from the scraped websites (see above).

When they notice that their trigger has been discovered and bypassed, they choose another URL as a trigger and the game restarts. Don't be surprised if your IP (range) or fingerprinted profile gets blacklisted.

Summary

The problem does not come from your code but from the bot tool and its settings. Violating a usage policy provokes a reaction; this is common on the internet.

(I'm sure you won't like the answer...)

Thank you for the advice about the 403 error, but the script pause is not related to this error, because even if I get the 403 error in my output, the script goes to the next url and stops well after passing the problematic url... So even after applying your advice it doesn't solve my problem unfortunately, the script keeps blocking at the same time :( — N7Legend, Mar 24 '23 at 11:11
Does your agent string contain the required information? Are you only scraping Wikimedia or also other websites? How does the behavior differ (no 403 / a different number of seconds / ...)? That the flow of data continues after the 403 is only a sign that you hit a trigger. What happens afterwards might give you the "few extra seconds", perhaps to prevent you from recognizing the 403 trigger as the eject button for the bullet hitting you. It's also a chance for regular users not to be bothered only because they accidentally triggered a checkpoint, or the file is just regularly restricted — dodrg, Mar 24 '23 at 11:41
Thanks for all the clarification, I get images from any source (not just wikimedia). Apart from the --remote-allow-origins argument, my selenium agent is configured by default. I just tested without looking for the occurrence that causes this 403 error and the script does not stop anymore, what I do not understand is that at times other 403 errors appear but do not block the execution at all, any idea what could explain this behaviour? — N7Legend, Mar 24 '23 at 18:59
*403 - "Forbidden"* is a common HTTP status code. You're just not allowed to access this resource. Depending on the server configuration, this can also happen for files that do not exist, where you would expect a *404 - "Not found"*. As you obtain your URLs through a Google search, files with restricted access can be included, simply because Google found a link to that file, so there is nothing special about it. The interpretation that this could be a trigger arises when you repeatedly get a reaction within a time frame after hitting a single 403. By sending a 403, the GET request is finished — dodrg, Mar 25 '23 at 08:44
As the error seems to have disappeared: Did you change anything? — dodrg, Mar 25 '23 at 08:50
I didn't change anything except bypassing the url causing the first 403 error which led to my script stopping. I didn't find yet the best behavior for this algorithm but this workaround allowed me to enrich my dataset — N7Legend, Mar 28 '23 at 20:55
Well, that's almost proof that the URL with the 403 is a trigger. I updated my answer. — dodrg, Mar 28 '23 at 22:15
You do not think the answer is given? — Where is the problem? — dodrg, Mar 30 '23 at 07:23

PyGuy · Accepted Answer · 2023-03-29T16:11:23.567

You are calling the HTTP server in synchronous mode, which means that once the socket is connected your script waits until the data is received and the connection is closed, or until you press ^C. Keeping the connection open without sending data is a trick implemented by the firewall/web server of the service you are trying to use.
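
Before going asynchronous, a simpler guard against this kind of stall is a timeout on the blocking call. The following is a minimal sketch, not a definitive fix: it reuses the question's download_image name but adds a hypothetical timeout parameter and a True/False return value, and it uses only the standard library so it runs without extra packages. With requests the same idea is passing timeout= to requests.get, which otherwise waits indefinitely. Note that such a timeout limits each blocking socket operation rather than the total transfer time, so a server that trickles data slowly can still hold the script longer.

```python
import os
import urllib.error
import urllib.request


def download_image(file_path, url, file_name, timeout=10.0):
    """Download url into file_path/file_name; return True on success, False otherwise."""
    os.makedirs(file_path, exist_ok=True)  # create the target folder if it is missing
    target = os.path.join(file_path, file_name)
    try:
        # The timeout bounds each blocking socket operation (connect, each read),
        # not the duration of the whole download.
        with urllib.request.urlopen(url, timeout=timeout) as response:
            data = response.read()
    except urllib.error.HTTPError as http_error:
        # The server answered, but with an error status such as 403.
        print(f"HTTP error occurred: {http_error}")
        return False
    except OSError as error:
        # URLError and socket timeouts are both subclasses of OSError.
        print(f"Download failed or timed out: {error}")
        return False
    with open(target, "wb") as file:
        file.write(data)
    print(f"Image downloaded successfully to {target}")
    return True
```

Returning a boolean lets the calling loop count or skip failed URLs instead of relying on the printed messages.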

You can switch to aiohttp to be able to perform several calls in asynchronous mode. You need to be careful to adjust your connection rate properly and introduce some proper gaps between your calls. This answer might help you: aiohttp: rate limiting parallel requests

You can use asyncio.sleep after creating a set of requests, and if they don't finish in the expected time, you can drop the future objects, which effectively means you are dropping your side of the connection.
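
The idea above could look roughly like the following sketch, which uses only asyncio: a semaphore caps the number of parallel requests, a short pause spaces them out, and asyncio.wait_for abandons any request that exceeds a deadline. fake_fetch is a hypothetical stand-in that merely sleeps; in a real script it would be replaced by an actual HTTP call, for example through aiohttp. The names fetch_with_deadline and download_all and the default values are illustrative assumptions.

```python
import asyncio


async def fake_fetch(url, delay):
    """Hypothetical stand-in for a real HTTP call; sleeps to simulate server latency."""
    await asyncio.sleep(delay)
    return f"content of {url}".encode()


async def fetch_with_deadline(url, delay, semaphore, deadline, pause):
    """Fetch one URL, giving up after `deadline` seconds; returns (url, data or None)."""
    async with semaphore:  # limits how many requests are in flight at once
        try:
            data = await asyncio.wait_for(fake_fetch(url, delay), timeout=deadline)
        except asyncio.TimeoutError:
            data = None  # the pending request was cancelled, i.e. dropped on our side
        await asyncio.sleep(pause)  # small gap before this slot is reused
        return url, data


async def download_all(jobs, max_parallel=2, deadline=0.2, pause=0.05):
    """Run all (url, simulated_delay) jobs concurrently and collect the results."""
    semaphore = asyncio.Semaphore(max_parallel)
    tasks = [fetch_with_deadline(u, d, semaphore, deadline, pause) for u, d in jobs]
    return dict(await asyncio.gather(*tasks))


if __name__ == "__main__":
    jobs = [("a.jpg", 0.01), ("b.jpg", 5.0), ("c.jpg", 0.01)]
    for url, data in asyncio.run(download_all(jobs)).items():
        print(url, "timed out" if data is None else f"{len(data)} bytes")
```

With this structure a single stalled URL costs at most the deadline, and the remaining downloads continue.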

undetected Selenium · Answer 3 · 2023-03-23T23:37:09.737

This error message...

HTTP error occurred: 403 Client Error: Forbidden. Please comply with the User-Agent policy: https://meta.wikimedia.org/wiki/User-Agent_policy for url: https://upload.wikimedia.org/wikipedia/commons/1/13/1986_944_Turbo.jpg

...implies that an HTTP 403 Forbidden response status code was encountered while accessing a valid URL.

Deep Dive

Possibly it's the same issue as Invalid Status code=403 text=Forbidden, which we have been discussing for quite some time now.

Solution

A blanket solution would be to add the argument --remote-allow-origins=* through an instance of Options as follows:

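from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

options = Options()
options.add_argument("--remote-allow-origins=*")
DRIVER_PATH = "chromedriver"
# Selenium 4 takes the driver path through a Service object instead of executable_path
wd = webdriver.Chrome(service=Service(executable_path=DRIVER_PATH), options=options)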

Thank you for the advice about the 403 error, but the script pause is not related to this error, because even if I get the 403 error in my output, the script goes to the next url and stops well after passing the problematic url... So even after applying your advice it doesn't solve my problem unfortunately, the script keeps blocking at the same time :( — N7Legend, Mar 24 '23 at 11:11