I'm trying to scrape a site that requires the user to enter the search value and a captcha. I've got an optical character recognition (OCR) routine for the captcha that succeeds about 33% of the time. Since the captchas are always alphabetic text, I want to reload the captcha if the OCR function returns non-alphabetic characters. Once I have a text "word", I want to submit the search form.
The results come back in the same page, with the form ready for a new search and a new captcha. So I need to rinse and repeat until I've exhausted my search terms.
Here's the top-level algorithm:
- Load page initially
- Download the captcha image, run it through the OCR
- If the OCR doesn't come back with a text-only result, refresh the captcha and repeat this step
- Submit the query form in the page with search term and captcha
- Check the response to see whether the captcha was correct
- If it was correct, scrape the data
- Go back to the captcha download step and repeat with the next search term
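To make the control flow concrete, here is a rough, framework-independent sketch of the loop above. The three callables (`fetch_captcha`, `ocr`, `submit`) and the `max_attempts` cap are hypothetical placeholders for illustration, not Scrapy APIs; `submit` is assumed to return `None` when the server rejects the captcha.

```python
from typing import Callable, Iterable, Iterator, Optional, Tuple


def is_plausible_captcha(text: str) -> bool:
    """True only for non-empty, purely ASCII-alphabetic OCR output."""
    return text.isascii() and text.isalpha()


def run_searches(
    terms: Iterable[str],
    fetch_captcha: Callable[[], bytes],            # hypothetical: downloads a fresh captcha image
    ocr: Callable[[bytes], str],                   # hypothetical: the OCR routine
    submit: Callable[[str, str], Optional[str]],   # hypothetical: returns None if captcha rejected
    max_attempts: int = 20,                        # assumed cap so a bad term cannot loop forever
) -> Iterator[Tuple[str, str]]:
    """Yield (term, scraped_result) for each term whose captcha was eventually solved."""
    for term in terms:
        for _ in range(max_attempts):
            guess = ocr(fetch_captcha())
            if not is_plausible_captcha(guess):
                continue  # OCR returned non-alphabetic junk: refresh the captcha
            result = submit(term, guess)
            if result is not None:
                yield term, result  # captcha accepted: scrape and move to the next term
                break
            # captcha rejected by the server: loop again with a fresh image
        # if max_attempts is exhausted, the term is skipped silently
```

In a real spider each of these callables would be an asynchronous request/callback pair rather than a blocking call, so this only shows the retry logic, not how it maps onto the framework.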
I've tried using a pipeline for getting the captcha, but then I don't have the value for the form submission. If I just fetch the image without going through the framework, using urllib or something, then the cookie with the session is not submitted, so the captcha validation on the server fails.
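For the urllib route, one possible workaround is to copy the session cookie from the page response onto the image request by hand. The sketch below is standard library only; `cookie_header` and `captcha_request` are hypothetical helper names, and it assumes the raw `Set-Cookie` header values from the page response are available as strings. It deliberately ignores cookie scoping (domain, path, expiry), so it is an illustration rather than a full cookie jar.

```python
from http.cookies import SimpleCookie
from typing import Iterable
from urllib.request import Request


def cookie_header(set_cookie_values: Iterable[str]) -> str:
    """Collapse raw Set-Cookie header values into a single Cookie header value."""
    jar = SimpleCookie()
    for raw in set_cookie_values:
        jar.load(raw)  # parses "name=value; Path=/; HttpOnly" style strings
    # Only name=value pairs are sent back; attributes such as Path are dropped.
    return "; ".join(f"{name}={morsel.value}" for name, morsel in jar.items())


def captcha_request(image_url: str, set_cookie_values: Iterable[str]) -> Request:
    """Build (but do not send) a urllib request for the captcha image that carries the session cookies."""
    return Request(image_url, headers={"Cookie": cookie_header(set_cookie_values)})
```

The returned `Request` can then be passed to `urllib.request.urlopen`, so the server sees the image download as part of the same session as the form page.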
What's the ideal Scrapy way of doing this?