I've usually been able to get around 403 errors by adding a known User-Agent header, but now I'm trying to log in (and eventually scrape) and can't figure out how to get past this error.

Code:

import urllib.request, urllib.parse  # import the submodules that are actually used
import http.cookiejar

cj = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
urllib.request.install_opener(opener)
authentication_url = 'https://www.linkedin.com/'
payload = {
    'session_key': 'email',
    'session_password': 'password'
}
data = urllib.parse.urlencode(payload)
binary_data = data.encode('UTF-8')
req = urllib.request.Request(authentication_url, binary_data)
resp = urllib.request.urlopen(req)
contents = resp.read()

Traceback:

Traceback (most recent call last):
  File "C:/Python34/loginLinked.py", line 16, in <module>
    resp = urllib.request.urlopen(req)
  File "C:\Python34\lib\urllib\request.py", line 161, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Python34\lib\urllib\request.py", line 469, in open
    response = meth(req, response)
  File "C:\Python34\lib\urllib\request.py", line 579, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Python34\lib\urllib\request.py", line 507, in error
    return self._call_chain(*args)
  File "C:\Python34\lib\urllib\request.py", line 441, in _call_chain
    result = func(*args)
  File "C:\Python34\lib\urllib\request.py", line 587, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
joshbenner851
  • Scraping LinkedIn profiles is against its [user agreement](https://www.linkedin.com/legal/user-agreement#dispute-resolution) and can/will get your account terminated. That said, many login forms contain other (hidden) input fields that also need to be sent. – mata Aug 24 '15 at 20:02
  • To cut a long story short: they don't want you to do that. It's forbidden, so the error message is accurate. Use the LinkedIn API instead. – Klaus D. Aug 24 '15 at 20:26
  • Unfortunately, their API doesn't let you grab all the members of a group or do anything outside your own profile, which renders it mostly useless. – joshbenner851 Aug 24 '15 at 20:45
  • Looks similar to [urllib2.HTTPError: HTTP Error 403: Forbidden](https://stackoverflow.com/questions/13303449/urllib2-httperror-http-error-403-forbidden/46213623#46213623) – Supreet Sethi Nov 06 '17 at 18:00

1 Answer

See my answer to this question: "why isn't Requests not signing into a website correctly?"

I should start by saying that you really should use their API: http://developer.linkedin.com/apis

There does not seem to be any login endpoint on the LinkedIn front page that accepts a POST with those parameters.

This is the login URL you must POST to: https://www.linkedin.com/uas/login-submit

Be aware that this probably won't work on its own either, as you need at least the csrfToken parameter from the login form.

You probably need the loginCsrfParam too, also from the login form on the front page.

Something like this might work. It is untested, and you might need to add the other POST parameters.

import requests

# A single session keeps the cookies from the GET for the later POST.
s = requests.session()


def get_csrf_tokens():
    url = "https://www.linkedin.com/"
    html = s.get(url).text

    # Cut the hidden field values out of the login form markup.
    csrf_token = html.split('name="csrfToken" value="')[1].split('" id="')[0]
    login_csrf_token = html.split('name="loginCsrfParam" value="')[1].split('" id="')[0]

    return csrf_token, login_csrf_token


def login(username, password):
    url = "https://www.linkedin.com/uas/login-submit"
    csrfToken, loginCsrfParam = get_csrf_tokens()

    data = {
        'session_key': username,
        'session_password': password,
        'csrfToken': csrfToken,
        'loginCsrfParam': loginCsrfParam
    }

    # Return the response so the caller can check whether the login worked.
    return s.post(url, data=data)


login('username', 'password')
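
The string splitting above breaks as soon as the attribute order in the form changes. As a rough alternative, here is a sketch that collects every hidden input with the standard library's html.parser, so that any other hidden fields get sent along too. The names HiddenInputCollector and hidden_fields are made up for this example, and the sample markup is invented for illustration; the real login form may look different.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs from <input type="hidden"> tags."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        # Attribute values may be None when an attribute has no value.
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def hidden_fields(html):
    """Return a dict of all hidden input fields found in the given HTML."""
    parser = HiddenInputCollector()
    parser.feed(html)
    parser.close()
    return parser.fields


# Invented markup for illustration only, not the actual LinkedIn form.
sample = '''
<form action="/uas/login-submit" method="post">
  <input type="text" name="session_key">
  <input type="password" name="session_password">
  <input type="hidden" name="csrfToken" value="ajax:123" id="csrfToken-login">
  <input type="hidden" name="loginCsrfParam" value="abc-def" id="loginCsrfParam-login">
</form>
'''

# Start from the hidden fields, then add the credentials on top.
payload = hidden_fields(sample)
payload.update({"session_key": "username", "session_password": "password"})
print(payload)
```

With a real page, the dict built this way could be passed as data= to s.post() in place of the hand-written one in login().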
scandinavian_