I am trying to enrich a dataset for machine learning using a script that downloads images from Google.
I first iterate over a dataframe that contains the fields to search for on Google. With the Selenium webdriver I then retrieve the URLs of the images to download, and save them in specific folders depending on the field, using this function:
import os
import requests

def download_image(file_path, url, file_name):
    try:
        response = requests.get(url)
        response.raise_for_status()
        with open(os.path.join(file_path, file_name), 'wb') as file:
            file.write(response.content)
        print(f"Image downloaded successfully to {os.path.join(file_path, file_name)}")
    except requests.exceptions.HTTPError as http_error:
        print(f"HTTP error occurred: {http_error}")
    except Exception as error:
        print(f"An error occurred: {error}")
This function is called in the following loop:
def enhanced_dataset_folder(name:str, tag:str, df):
    DRIVER_PATH = "chromedriver"
    wd = webdriver.Chrome(DRIVER_PATH)
    urls = get_images(tag, wd, 1, 2)
    folder_name = name.split('/')[0]
    props = tag.split(' ')
    test = []
    for i, url in enumerate(urls):
        try:
            img_name = str(i) + "_img"+str(i)+".jpg"
            download_image("train/"+folder_name+"/", url, img_name)
        except Exception as e:
            print('Fail: ', e)
            continue
        else:
            print("ok")
            #df.append([folder_name+"/"+img_name,tag,props[0],props[1],props[2]], ignore_index=True)
    wd.quit()
The Google Chrome window and the script always stall at the same point, no matter how many photos I fetch per page. I get the following output, and no error is raised:
Image downloaded successfully to train/1982 Porsche 944/0_img0.jpg
ok
Image downloaded successfully to train/1982 Porsche 944/1_img1.jpg
ok
HTTP error occurred: 403 Client Error: Forbidden. Please comply with the User-Agent policy: https://meta.wikimedia.org/wiki/User-Agent_policy for url: https://upload.wikimedia.org/wikipedia/commons/1/13/1986_944_Turbo.jpg
ok
Image downloaded successfully to train/1996 Ferrari 550 Maranello/0_img0.jpg
ok
Image downloaded successfully to train/1996 Ferrari 550 Maranello/1_img1.jpg
ok
Image downloaded successfully to train/1996 Ferrari 550 Maranello/2_img2.jpg
ok
Image downloaded successfully to train/2001 BMW 3 Series Convertible/0_img0.jpg
ok
After that, nothing more is printed, even if I let it run for more than 10 minutes.
I know the problem is with the download_image function, because when I don't call it, the URLs are retrieved for every row of the dataframe.
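One thing that could be tried to narrow this down is a variant of download_image that passes a timeout, since requests.get waits indefinitely by default when a server stops responding. The sketch below is only a diagnostic idea, not a confirmed fix: the function name download_image_with_timeout, the 10-second timeout, the User-Agent string, and the optional session parameter are all assumptions that are not part of my original script.

```python
import os
import requests

# Assumed values for illustration; neither appears in the original script.
HEADERS = {"User-Agent": "dataset-enrichment-script/0.1 (contact: me@example.com)"}
TIMEOUT_SECONDS = 10


def download_image_with_timeout(file_path, url, file_name, session=None):
    """Hypothetical variant of download_image that gives up on silent servers.

    The timeout covers connecting and each wait for data from the server,
    not the total download time. Returns True if the file was written.
    """
    http = session if session is not None else requests
    target = os.path.join(file_path, file_name)
    try:
        response = http.get(url, headers=HEADERS, timeout=TIMEOUT_SECONDS)
        response.raise_for_status()
    except requests.exceptions.Timeout:
        # Raised when the server does not answer within TIMEOUT_SECONDS.
        print(f"Timed out after {TIMEOUT_SECONDS}s: {url}")
        return False
    except requests.exceptions.RequestException as error:
        # Covers HTTP errors (such as the 403 above) and connection failures.
        print(f"Request failed: {error}")
        return False
    # Create the target folder if it does not exist yet.
    os.makedirs(file_path, exist_ok=True)
    with open(target, "wb") as file:
        file.write(response.content)
    print(f"Image downloaded successfully to {target}")
    return True
```

If the script stops hanging and prints a "Timed out" line instead, that would suggest one of the URLs points to a server that never responds. The explicit User-Agent header is included because the 403 message from Wikimedia in the output refers to its User-Agent policy.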