
I am trying to scrape some articles from SeekingAlpha, but after a certain number of scrapes a captcha appears.

Despite trying the following, I am still running into major issues:

  1. I am using `from fake_useragent import UserAgent` to randomize my headers.

  2. I am using random sleep times between requests.

  3. I am changing my IP address using a VPN once a captcha appears. However, somehow a captcha still appears even after my IP address changes.

It is also strange that while a captcha appears in the request response, no captcha appears in the browser.

So I assume that my headers are just wrong.

I turned off JavaScript and cookies when capturing this request, because with cookies and JavaScript enabled there is clearly information the website could be tracking me with.

from fake_useragent import UserAgent

RANDOM = UserAgent().random  # a randomized user-agent string

headers = {
    "authority": "seekingalpha.com",
    "method": "GET",
    "path": "/article/4230872-dillards-still-room-downside",
    "scheme": "https",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "accept-encoding": "gzip, deflate, br",
    "accept-language": "en-US,en;q=0.9",
    "upgrade-insecure-requests": "1",
    "user-agent": RANDOM
}
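
For reference, here is roughly what my request loop looks like (simplified; the URL list and the substring check for the captcha are placeholders, not my exact code):

import random
import time

import requests

urls = [
    "https://seekingalpha.com/article/4230872-dillards-still-room-downside",
    # ... more article URLs ...
]

for url in urls:
    response = requests.get(url, headers=headers)
    if "captcha" in response.text.lower():  # crude check for the captcha page
        print("Blocked on", url)
        break
    # ... parse the article from response.text here ...
    time.sleep(random.uniform(5, 15))  # random sleep between requests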

This is close to what the browser sends. The browser also adds:

"cache-control": "max-age=0",
"if-none-match": 'W/"6f11a6f9219176fda72f3cf44b0a2059"',

From my research, `if-none-match` carries an ETag, which is used for caching and can also be used to track people. The `W/"..."` value changes with each request.

Also, when I use wkhtmltopdf to print the page as a PDF, a captcha never appears. I have also tried using Selenium, which is even worse. In addition, I have tried using proxies as seen here.

So there is definitely a way of doing this; I am just not doing it correctly. Does anyone have an idea what I am doing wrong?

Edit:

  1. Sessions do not seem to be working.

  2. Random headers do not seem to be working.

  3. Random sleeps do not seem to be working.

  4. I am able to access the webpage using my VPN. Even after a captcha appears using requests, there is no captcha on the website in the browser.

  5. Selenium does not work.

  6. I really do not want to pay for a service to solve captchas.

I believe the issue is that I am not mimicking the browser well enough.

user2330624
  • Are you sure your VPN/proxy is being used? Also, are you using `requests` sessions or just direct requests? – Aziz Jan 02 '19 at 19:11
  • Yes VPN is being used. using `requests.get("http://httpbin.org/ip")` to check. – user2330624 Jan 02 '19 at 19:16
  • Good :) There could be a number of reasons .. some websites receive a lot of traffic from common VPN IP addresses and treat them as potential threats. If you're using proxies (not VPN), some proxy servers are not anonymous and they pass your actual IP address in the request. Also, some website may flag fake User-Agents (if they were for old browsers or not a real user agent). I'm testing the website you posted (SeekingAlpha) to see if I get captchas too. I'll get back to you :) – Aziz Jan 02 '19 at 19:21

1 Answer


It is not easy to pinpoint the exact reason for being blocked and facing a captcha. Here are a few thoughts:

VPN and Proxies

Sometimes, the Captcha service (in this case, Google) may blacklist common VPN IP addresses and treat them as potential threats, since many people are using them and they generate a lot of traffic.

Sometimes, proxy servers (especially free ones) are not anonymous and can send your actual IP address in the request headers (specifically, the `X-Forwarded-For` header).
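
One quick way to check what a proxy actually forwards is to request an echo service such as httpbin.org/headers through it and look at the result (the proxy address below is a placeholder):

import requests

# placeholder proxy address; substitute your own
proxies = {
    "http": "http://203.0.113.10:8080",
    "https": "http://203.0.113.10:8080",
}

# httpbin echoes back the headers it received
resp = requests.get("https://httpbin.org/headers", proxies=proxies, timeout=10)
print(resp.json()["headers"])  # if X-Forwarded-For contains your real IP, the proxy is not anonymous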

Request Headers

There are certain headers that are important to have in your request. The easiest way to make your requests look legitimate is to use the "Network" tab in your browser's "Developer Tools", and copy all the headers your browser sends.

An important header to have is referer. While it may or may not be checked by the website, it is safer to just have it there with the URL of one of the website's pages (or homepage):

referer: https://seekingalpha.com/
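
For example, a minimal set of headers might look like this (the user-agent value is only an illustration; copy your browser's real one from Developer Tools):

import requests

headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",  # copy from Developer Tools
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "accept-language": "en-US,en;q=0.9",
    "referer": "https://seekingalpha.com/",
}

response = requests.get(
    "https://seekingalpha.com/article/4230872-dillards-still-room-downside",
    headers=headers,
)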

Timeouts and Sessions

Try to increase the delays between your requests. A few seconds should be reasonable.

Finally, try using the session objects in requests. They automatically maintain cookies across multiple requests, and you can set default headers (such as referer) on the session itself, to better emulate a real user browsing the website. I found them to be the most helpful when it comes to overcoming scraping protections.
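
A rough sketch of what that could look like (the headers dict from above and the list of article URLs are assumed to exist; the delay range is arbitrary):

import random
import time

import requests

session = requests.Session()
session.headers.update(headers)          # reuse the headers copied from Developer Tools
session.headers["referer"] = "https://seekingalpha.com/"

# hit the homepage first so the session picks up cookies, like a real visitor
session.get("https://seekingalpha.com/")

for url in article_urls:                 # assumed to be defined elsewhere
    time.sleep(random.uniform(3, 8))     # a few seconds between requests
    response = session.get(url)
    # ... process response.text ...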

Captcha

The last resort is to use a service to break the captcha. There are many (mostly paid) services online that do that. A popular one is DeathByCaptcha. Keep in mind that you may be breaking the website's terms of use, which I do not recommend :)

Aziz
  • This is not the case for me. With my VPN, I can open it without getting blocked. I would really rather not use a captcha service. Also, with Tor, I am able to use the website fine. I am sure there is a way to do this, since the page can be downloaded to PDF without getting blocked. – user2330624 Jan 02 '19 at 19:29
  • With cookies off or on? Because you won't be blocked with VPN if your cookies are on and the service can detect that you are a "returning" visitor. Try opening the URL in a private browser window. – Aziz Jan 02 '19 at 19:31
  • Also, just to clarify, this is the URL that is Captcha protected: https://seekingalpha.com/articles?page=10333 . The other URL (the actual article) opens with no problems. – Aziz Jan 02 '19 at 19:33
  • Sorry, I edited the post. that was supposed to be "/article/4230872-dillards-still-room-downside" – user2330624 Jan 02 '19 at 19:36
  • Oh I see. That URL works fine. I guess it may be a different reason. Do you have `referer` in your request headers? – Aziz Jan 02 '19 at 19:38
  • For the /articles/ pages a `referer` is needed. However, I believe there was no `referer` for /article/ pages. So no. Just so you know, I was able to make about 100 requests once....but then all of this started. I really have no idea what is going on here. It is very strange. – user2330624 Jan 02 '19 at 19:39
  • I have updated the answer with more details and other potential solutions :) – Aziz Jan 02 '19 at 19:59
  • Looking closely at the headers you posted, I see `authority`, `method`, `scheme`, `path`, etc. These are [HTTP/2 "pseudo-headers"](https://http2.github.io/http2-spec/#HttpRequest). `requests` supports only HTTP/1.1. You need to remove these headers – Aziz Jan 02 '19 at 20:23
  • Thank you for all the help. So maybe my header is wrong? I used the Network tab to see what SeekingAlpha uses. What should my header look like then? Am I reading your link wrong? It says "The following pseudo-header fields are defined for HTTP/2 requests:" and includes the ones you said I should remove. – user2330624 Jan 02 '19 at 20:30
  • session does not help. – user2330624 Jan 02 '19 at 20:40
  • Yes, the link explains HTTP/2 headers. `requests` does **not** support HTTP/2. It only supports HTTP/1.1. So you need to remove these headers. Try using Developer Tools in Firefox (instead of Chrome), as it may generate cleaner headers. – Aziz Jan 02 '19 at 21:26
  • It may be helpful to share part of your code (where you have the requests loop). We may be able to spot something there – Aziz Jan 02 '19 at 21:28
  • Thank you Aziz. I used the header from firefox and sessions as you mentioned I should try. – user2330624 Jan 02 '19 at 23:17
  • Were you able to get this working? I am getting a 403 randomly. – MasayoMusic May 03 '20 at 05:46