How can I download images from a CAPTCHA with Python?

Question

I need to download the images inside the custom-made CAPTCHA on this login site. How can I do it?

This is the login site, there are five images

and this is the link: https://portalempresas.sb.cl/login.php

I've been trying with this code that another user (@EnriqueBet) helped me with:

from io import BytesIO
from PIL import Image

# Download image function
def downloadImage(element,imgName):
    img = element.screenshot_as_png
    stream = BytesIO(img)
    image = Image.open(stream).convert("RGB")
    image.save(imgName)

# Find all the web elements of the captcha images    
image_elements = driver.find_elements_by_xpath("/html/body/div[1]/div/div/section/div[1]/div[3]/div/div/div[2]/div[*]")

# Output name for the images
image_base_name = "Imagen_[idx].png"

# Download each image
for i in range(len(image_elements)):
    downloadImage(image_elements[i],image_base_name.replace("[idx]","%s"%i))

But when it tries to get all of the image elements

image_elements = driver.find_elements_by_xpath("/html/body/div[1]/div/div/section/div[1]/div[3]/div/div/div[2]/div[*]")

It fails and doesn't get any of them. Please, help! :(

score 0 · Answer 1 · answered Apr 08 '20 at 03:12

Instead of defining an explicit path to the images, why not simply download all the images present on the page? This will work because the page has only 5 images and you want to download all of them. See the method below.

The following should extract all images from a given page and write them to the directory where the script is run.

import re
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

site = ''  # set the page URL here
response = requests.get(site)
soup = BeautifulSoup(response.text, 'html.parser')
img_tags = soup.find_all('img')

# Skip <img> tags that have no src attribute
urls = [img['src'] for img in img_tags if img.get('src')]

for url in urls:
    # Keep only sources that end in a recognised image file name
    filename = re.search(r'/([\w_-]+[.](jpg|gif|png))$', url)
    if filename is None:
        continue
    # An image source can be relative; resolve it against the page URL
    url = urljoin(site, url)
    response = requests.get(url)
    with open(filename.group(1), 'wb') as f:
        f.write(response.content)

The code is taken from here and credit goes to the respective owner.
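
Relative `src` values can be resolved against the page URL with the standard library's `urllib.parse.urljoin`, which handles both path-relative and root-relative sources. A minimal sketch follows; the helper name `resolve_image_urls`, the page URL, and the sample paths are hypothetical values for illustration, not taken from the site in the question.

```python
from urllib.parse import urljoin


def resolve_image_urls(page_url, srcs):
    """Return absolute URLs for image sources that may be relative."""
    return [urljoin(page_url, src) for src in srcs]


# Hypothetical page URL and image sources, for illustration only
page_url = "https://example.com/app/login.php"
srcs = ["img/a.png", "/static/b.png", "https://cdn.example.com/c.png"]
absolute = resolve_image_urls(page_url, srcs)
```

Sources that are already absolute pass through unchanged, so no separate check for `http` is needed.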

answered Apr 08 '20 at 03:12

AzyCrw4282

Thank you so much, but it doesn't work :( The code downloads all the images under an 'img' tag, and the ones on the site I'm working on aren't in 'img' tags :( – Sebastián Muñoz Alvial Apr 08 '20 at 03:35
OK. After a second look, it seems that using a parser will not be a good option. The code in your question is the right way to go. So that I can debug what's going wrong, can you tell me what the `driver` is? Are you using Selenium? – AzyCrw4282 Apr 08 '20 at 04:19
Yes, I'm using Selenium and I'm using Chrome – Sebastián Muñoz Alvial Apr 08 '20 at 04:25
Unfortunately, I am having problems getting selenium to run (version issues), and so the best option I can recommend is to get in contact with @EnriqueBet to help you out on this – AzyCrw4282 Apr 08 '20 at 04:46
@SebastiánMuñozAlvial do you have an update on the problem? Have you also seen the new answer? – AzyCrw4282 Apr 09 '20 at 20:09

score 0 · Answer 2 · answered Apr 08 '20 at 06:32

This is a follow-on answer to my earlier post.

I have had no success getting Selenium to run because of version mismatches between Selenium and my browser.

However, I have thought of another way to extract all the images that appear in the CAPTCHA. The images change on each visit, so the best way to collect them is to automate the process rather than save each image from the site manually.

To automate it, follow the steps below.

First, navigate to the site with Selenium and take a screenshot of the page. For example:

from selenium import webdriver

# Assumes chromedriver is available on the PATH
driver = webdriver.Chrome()
driver.get('https://www.spotify.com')  # example URL; use the login page here
driver.save_screenshot('my_screenshot.png')
driver.quit()

This saves the screenshot locally. You can then open it with a library such as Pillow (PIL) and crop out the CAPTCHA images.

This would be done like so:

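from PIL import Image

# Open the saved screenshot and convert it to grayscale
im = Image.open('my_screenshot.png').convert('L')
# The crop box is (left, top, right, bottom); these values are placeholders
im = im.crop((1, 1, 98, 33))
im.save('captcha_0.png')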

Hopefully you get the idea. You will need to do this for each image in turn, ideally in a for loop with the crop dimensions changed appropriately.
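
The loop could be sketched roughly as follows, assuming the CAPTCHA tiles are the same size and sit side by side in a single row. The helper names `row_boxes` and `crop_tiles`, the output file pattern, and all the pixel values are hypothetical; the real offsets have to be measured from an actual screenshot.

```python
def row_boxes(left, top, width, height, count, gap=0):
    """Return (left, top, right, bottom) crop boxes for `count` equally
    sized tiles laid out left to right, separated by `gap` pixels."""
    boxes = []
    for i in range(count):
        x = left + i * (width + gap)
        boxes.append((x, top, x + width, top + height))
    return boxes


def crop_tiles(screenshot_path, boxes, name_pattern="captcha_{}.png"):
    """Crop each box out of the screenshot and save it as its own file."""
    from PIL import Image  # Pillow; imported here so row_boxes works without it

    with Image.open(screenshot_path) as im:
        for i, box in enumerate(boxes):
            im.crop(box).save(name_pattern.format(i))


# Placeholder geometry: five 97x32 tiles starting at (1, 1), 10 px apart
boxes = row_boxes(left=1, top=1, width=97, height=32, count=5, gap=10)
```

Calling `crop_tiles('my_screenshot.png', boxes)` would then write one file per tile. Keeping the geometry in its own function makes it easy to adjust the numbers without touching the cropping code.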

score 0 · Answer 3 · answered Sep 27 '20 at 12:45

You can also try this. It saves only the CAPTCHA image:

from PIL import Image
from selenium.webdriver.common.by import By

def get_captcha_text(location, size):
    # Despite its name, this only crops the CAPTCHA out of the screenshot
    im = Image.open('screenshot.png')
    left = location['x']
    top = location['y']
    right = location['x'] + size['width']
    bottom = location['y'] + size['height']
    im = im.crop((left, top, right, bottom))  # defines crop points
    im.save('screenshot.png')
    return True

# div id or the captcha container id
element = driver.find_element(By.ID, 'captcha_image')
location = element.location
size = element.size
driver.save_screenshot('screenshot.png')

get_captcha_text(location, size)
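
One possible pitfall with this approach: on a high-DPI display the screenshot may contain more pixels than the CSS pixels reported by `element.location` and `element.size`, so the crop can land in the wrong place. A minimal sketch of scaling the box is shown below. The helper name `crop_box` and the sample numbers are hypothetical, and the scale factor is assumed to come from something like `driver.execute_script('return window.devicePixelRatio')`.

```python
def crop_box(location, size, scale=1.0):
    """Convert an element's location and size (CSS pixels) into a
    (left, top, right, bottom) box in screenshot pixels."""
    left = round(location['x'] * scale)
    top = round(location['y'] * scale)
    right = round((location['x'] + size['width']) * scale)
    bottom = round((location['y'] + size['height']) * scale)
    return (left, top, right, bottom)


# Hypothetical element geometry on a display with a pixel ratio of 2
location = {'x': 10, 'y': 20}
size = {'width': 100, 'height': 40}
box = crop_box(location, size, scale=2.0)
```

The resulting tuple can be passed straight to `im.crop(box)`. The `element.screenshot_as_png` property used in the question avoids the cropping step altogether when the element itself can be located.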
