I am trying to scrape some articles from xyz. However, after a certain number of scrapes, a captcha appears.
I am running into major issues with this.
I am using
from fake_useragent import UserAgent
to randomize my headers. I am also using random sleep times between requests.
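For reference, the combination of a randomized user agent and random sleeps described above can be sketched roughly as follows. This is only a minimal sketch: the `USER_AGENTS` pool, the function names, and the delay bounds are hypothetical placeholders, and a string from `UserAgent().random` could be used in place of the hard-coded pool.

```python
import random
import time

# Hypothetical pool of user-agent strings (assumption for illustration);
# fake_useragent's UserAgent().random could supply these instead.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:64.0) Gecko/20100101 Firefox/64.0",
]


def random_headers(pool=USER_AGENTS):
    """Return a header dict with a randomly chosen user agent."""
    return {"user-agent": random.choice(pool)}


def random_delay(low=2.0, high=8.0):
    """Pick a random pause length in seconds between low and high."""
    return random.uniform(low, high)


def pause(low=2.0, high=8.0):
    """Sleep for a random interval and return how long the pause was."""
    delay = random_delay(low, high)
    time.sleep(delay)
    return delay
```

Calling `pause()` between requests and `random_headers()` before each one reproduces the setup described here; it does not by itself address the captcha.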
I am changing my IP address using a VPN once a captcha appears. However, somehow a captcha still appears once my IP address changes.
It is also strange because while a captcha appears in the request response, a captcha does not appear in the browser.
So, I assume that my header is just wrong.
I turned off JS and cookies when capturing this request, because with cookies and JS enabled the website clearly has information it can track me with.
headers = {
    "authority": "seekingalpha.com",
    "method": "GET",
    "path": "/article/4230872-dillards-still-room-downside",
    "scheme": "https",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "accept-encoding": "gzip, deflate, br",
    "accept-language": "en-US,en;q=0.9",
    "upgrade-insecure-requests": "1",
    "user-agent": RANDOM
}
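One detail that may be worth checking: the keys `authority`, `method`, `path`, and `scheme` look like the HTTP/2 pseudo-headers (`:authority`, `:method`, `:path`, `:scheme`) that browser developer tools display. Assuming that is where they came from, they describe the request line, not ordinary header fields. The sketch below separates them from the real headers and rebuilds the URL from them; the helper names are hypothetical.

```python
# Names of the HTTP/2 pseudo-headers as shown in browser dev tools,
# with the leading colon dropped.
PSEUDO_HEADER_NAMES = {"authority", "method", "path", "scheme"}


def strip_pseudo_headers(headers):
    """Return a copy of headers without the pseudo-header entries."""
    return {
        name: value
        for name, value in headers.items()
        if name.lstrip(":").lower() not in PSEUDO_HEADER_NAMES
    }


def url_from_pseudo_headers(headers):
    """Rebuild the request URL from the scheme, authority, and path entries."""
    return "{}://{}{}".format(
        headers["scheme"], headers["authority"], headers["path"]
    )


# Example with the values from the question (user agent omitted here).
example = {
    "authority": "seekingalpha.com",
    "method": "GET",
    "path": "/article/4230872-dillards-still-room-downside",
    "scheme": "https",
    "accept-language": "en-US,en;q=0.9",
    "upgrade-insecure-requests": "1",
}
clean_headers = strip_pseudo_headers(example)
url = url_from_pseudo_headers(example)
```

The resulting `clean_headers` and `url` could then be passed to something like `requests.get(url, headers=clean_headers)`. Whether this changes the captcha behaviour is not established here.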
This is close to what the browser sends to the website. The browser also adds:
"cache-control": "max-age=0",
"if-none-match": 'W/"6f11a6f9219176fda72f3cf44b0a2059"',
From my research, this is an ETag, which is used for caching and can be used to track people. The 'W/...' value
changes with each request.
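To make the caching part concrete: a `W/` prefix marks a weak validator, and sending the value back in `if-none-match` asks the server to answer with 304 Not Modified if the page is unchanged. A minimal sketch of adding these two fields to an existing header dict follows; the function names are hypothetical, and the ETag would normally be taken from the `ETag` header of a previous response.

```python
def is_weak_etag(etag):
    """True if the ETag carries the weak-validator prefix W/."""
    return etag.startswith("W/")


def with_conditional_headers(headers, etag):
    """Return a copy of headers with the revalidation fields added."""
    merged = dict(headers)  # copy so the caller's dict is not modified
    merged["cache-control"] = "max-age=0"
    merged["if-none-match"] = etag
    return merged


# Example using the ETag value quoted in the question.
etag = 'W/"6f11a6f9219176fda72f3cf44b0a2059"'
conditional = with_conditional_headers({"accept-language": "en-US,en;q=0.9"}, etag)
```

Since the value changes with each request, a hard-coded ETag like the one above would go stale immediately; this only illustrates the mechanism.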
Also, when I use wkhtmltopdf to print the page as a PDF, a captcha never appears. I have also tried using Selenium, which is even worse. In addition, I have tried using proxies as seen here.
So there definitely is a way of doing this. However, I am not doing it correctly. Does anyone have an idea what I am doing wrong?
Edit:
Sessions do not seem to work.
Random headers do not seem to work.
Random sleeps do not seem to work.
I am able to access the webpage using my VPN. Even once a captcha appears using requests, there is no captcha on the website in the browser.
Selenium does not work.
I really do not want to pay for a service to solve captchas.
I believe the issue is that I am not mimicking the browser well enough.