54

We're using a curl HEAD request in a PHP application to verify the validity of generic links. We check the status code just to make sure that the link the user has entered is valid. Links to all websites have succeeded, except LinkedIn.

While it seems to work locally (Mac), when we attempt the request from any of our Ubuntu servers, LinkedIn returns a 999 status code. This is not an API request, just a simple curl like the one we run for every other link. We've tried on a few different machines and tried altering the user agent, but no dice. How do I modify our curl so that working links return a 200?

A sample HEAD request:

curl -I --url https://www.linkedin.com/company/linkedin

Sample Response on Ubuntu machine:

HTTP/1.1 999 Request denied
Date: Tue, 18 Nov 2014 23:20:48 GMT
Server: ATS
X-Li-Pop: prod-lva1
Content-Length: 956
Content-Type: text/html

To respond to @alexandru-guzinschi a little better: we've tried masking the user agent. To sum up our trials:

  • Mac machine + Mac UA => works
  • Mac machine + Windows UA => works
  • Ubuntu remote machine + (no UA change) => fails
  • Ubuntu remote machine + Mac UA => fails
  • Ubuntu remote machine + Windows UA => fails
  • Ubuntu local virtual machine (on Mac) + (no UA change) => fails
  • Ubuntu local virtual machine (on Mac) + Windows UA => works
  • Ubuntu local virtual machine (on Mac) + Mac UA => works

So now I'm thinking they block any curl requests that don't provide an alternate UA and also block hosting providers?

Is there any other way I can check, from an Ubuntu machine using PHP, whether a link to LinkedIn is valid or will lead to their 404 page?

charltoons

4 Answers

24

It looks like they filter requests based on the user-agent:

$ curl -I --url https://www.linkedin.com/company/linkedin | grep HTTP
HTTP/1.1 999 Request denied

$ curl -A "Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3" -I --url https://www.linkedin.com/company/linkedin | grep HTTP
HTTP/1.1 200 OK
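
A minimal sketch of how the check above could be scripted so that only the status code is kept. The names `head_status` and `classify_status` are hypothetical, and treating 999 as "undetermined" rather than as a broken link is an assumption about what a link checker might want, not documented LinkedIn behaviour:

```shell
#!/bin/sh
# Hypothetical helper: print only the HTTP status code of a HEAD request.
# $1 = URL, $2 = User-Agent string to send.
head_status() {
    curl -s -o /dev/null -I -A "$2" -w '%{http_code}' --url "$1"
}

# Hypothetical helper: map a status code to a verdict.
# Assumptions: 2xx and 3xx count as valid; 999 ("Request denied") says
# nothing about the link itself, so it is reported as undetermined.
classify_status() {
    case "$1" in
        2??|3??) echo "valid" ;;
        999)     echo "undetermined" ;;
        *)       echo "broken" ;;
    esac
}
```

It could be called as `classify_status "$(head_status "$url" "$ua")"`, where `$url` and `$ua` are whatever link and user agent are being tested.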
Alexandru Guzinschi
  • We tried altering the user agent, though. So our responses have been: [Mac machine + Mac UA => works] [Mac machine + Windows UA => works] [Ubuntu machine + Ubuntu UA => fails] [Ubuntu machine + Mac UA => fails] [Ubuntu machine + Windows UA => fails] No access to a Windows machine at the moment, so I'm not sure about that. – charltoons Dec 03 '14 at 20:03
  • @charltoons That is strange, because I tried right now with the current UA of Chrome `curl -A "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36" -I --url https://www.linkedin.com/company/linkedin | grep HTTP` which gives me a `HTTP/1.1 200 OK` from my Ubuntu. Maybe you tried with an old (*or incorrect*) UA which they block? Run a new test with the UA that I used. – Alexandru Guzinschi Dec 04 '14 at 06:54
  • That works on my virtual machine, but fails on remote ones/servers. See above for the full trial matrix. May I ask whether you're trying from a remote machine, and if so, what provider? – charltoons Dec 05 '14 at 18:23
  • @charltoons No, the tests were made from my local machine. If you are sure that you (*or some "neighbor", if you are sharing an IP*) did not make enough requests so you could be throttled (*those are cleared after 24 hours, if I remember correctly*), most likely they have some restrictions in place for your IP range. – Alexandru Guzinschi Dec 05 '14 at 20:40
  • They filter both user agent AND IP address. So you need some kind of valid proxy address. – olefrank Jun 05 '15 at 07:05
  • This answer may be correct but it's not really helpful in trying to figure out how to do link checking for LinkedIn URLs. Providing a fake User-Agent is not something I would like to do or recommend to others. I think bots and link checkers should correctly identify themselves and provide contact information. That is what I do with my link checkers. – Sybille Peters Sep 16 '21 at 14:48
  • I'm getting `HTTP 999` with the user agent header as well – Parzival Nov 15 '22 at 22:15
  • FWIW, I get 999 on my home machine, regardless of being on a VPN or not, checking the URL from within a Word macro link checker. The link works fine if I click on it inside the Word document. – Kevin Apr 19 '23 at 13:58
14

I found a workaround: it is important to set the accept-encoding header:

curl --url "https://www.linkedin.com/in/izman" \
--header "user-agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.94 Safari/537.36" \
--header "accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8" \
--header "accept-encoding:gzip, deflate, sdch, br" \
| gunzip
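
Note that the pipe to `gunzip` above only works when the server actually answers with gzip. A hedged alternative sketch: curl's `--compressed` option makes curl send its own Accept-Encoding header and decode the response itself. Whether LinkedIn accepts the encodings curl advertises in place of the browser-style list above is untested here, and `fetch_page` is a hypothetical name:

```shell
#!/bin/sh
# Hypothetical helper: fetch a page body with browser-like headers.
# $1 = URL, $2 = User-Agent string to send.
# --compressed asks for a compressed response and lets curl decompress it,
# so no separate gunzip step is needed.
fetch_page() {
    curl -s --compressed \
        --header "user-agent: $2" \
        --header "accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8" \
        --url "$1"
}
```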
Andrey Izman
5

It seems that LinkedIn filters on both user agent AND IP address. I tried this both at home and from a DigitalOcean node:

curl -A "Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3" -I --url https://www.linkedin.com/company/linkedin

From home I got a 200 OK, from DO I got 999 Denied...

So you need a proxy service like HideMyAss or another one (I haven't tested it, so I can't say whether it works). Here is a good comparison of proxy services.

Or you could set up a proxy on your home network, for example by using a Raspberry Pi to proxy your requests. Here is a guide on that.
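
If a proxy is used, it probably only needs to carry the LinkedIn checks. A minimal sketch, assuming a proxy reachable at the placeholder address in `PROXY_URL` (hypothetical, not a real service); `url_host`, `is_linkedin_url` and `head_status_routed` are illustrative names. Whether a given proxy's IP range is itself blocked is a separate question:

```shell
#!/bin/sh
# Placeholder proxy address; an assumption for illustration only.
PROXY_URL="http://proxy.example.net:3128"

# Print the host part of a URL. Simplistic parsing: strips the scheme,
# then everything from the first "/", then any ":port". It ignores
# userinfo and IPv6 literals.
url_host() {
    h=${1#*://}
    h=${h%%/*}
    h=${h%%:*}
    printf '%s\n' "$h"
}

# Succeeds when the URL's host is linkedin.com or a subdomain of it.
is_linkedin_url() {
    case "$(url_host "$1")" in
        linkedin.com|*.linkedin.com) return 0 ;;
        *) return 1 ;;
    esac
}

# Print the status code of a HEAD request, going through the proxy
# (curl's -x option) only for LinkedIn URLs.
head_status_routed() {
    if is_linkedin_url "$1"; then
        curl -s -o /dev/null -I -x "$PROXY_URL" -w '%{http_code}' --url "$1"
    else
        curl -s -o /dev/null -I -w '%{http_code}' --url "$1"
    fi
}
```

Routing by host keeps the proxy out of the path for every non-LinkedIn link.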

olefrank
  • A proxy is a viable solution for small projects, but unfortunately this is for a larger web application. We verify thousands of links per hour this way. We're not going to be able to proxy all of those requests I'm afraid. Plus, LinkedIn URLs account for only a small fraction of them. – charltoons Jun 06 '15 at 17:43
  • A proxy alone wouldn't help. We've tried an HMA proxy, but LinkedIn still blocks URLs to profiles even from actual Chrome. After changing IP, clearing all cookies and history in Firefox and requesting some other profile, LI still responded with 999 and redirected to the login page. Perhaps they know and block HMA IP ranges? – Denis Stepanenko Dec 23 '16 at 21:57
4

A proxy would work, but I think there's another way around it. I see that requests from AWS and other clouds are blocked by IP. I can issue the request from my machine and it works just fine.

I did notice that the response sent to the cloud service contains some JS that the browser has to execute to take you to a login page. Once there, you can log in and access the page. The login page is only for those accessing via a blocked IP.

If you use a headless client that executes JS, or maybe go straight to the subsequent link and provide the credentials of a LinkedIn user, you may be able to bypass it.

dmarlow
  • Tried this. After about 20 logins, you'll get a 'We're getting things cleaned up. We'll be back' message after login. – olive_tree Oct 18 '16 at 21:23