I've written a Python script that logs in to a website and parses the username to confirm that the login really succeeded. The approach below works, but only because I've hardcoded cookies copied from Chrome DevTools into the script.

I've tried with:

import requests
from bs4 import BeautifulSoup

url = 'https://secure.imdb.com/ap/signin?openid.pape.max_auth_age=0&openid.return_to=https%3A%2F%2Fwww.imdb.com%2Fap-signin-handler&openid.identity=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&openid.assoc_handle=imdb_pro_us&openid.mode=checkid_setup&siteState=eyJvcGVuaWQuYXNzb2NfaGFuZGxlIjoiaW1kYl9wcm9fdXMiLCJyZWRpcmVjdFRvIjoiaHR0cHM6Ly9wcm8uaW1kYi5jb20vIn0&openid.claimed_id=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&openid.ns=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0'
signin = 'https://secure.imdb.com/ap/signin'
mainurl = 'https://pro.imdb.com/'

with requests.Session() as s:
    res = s.get(url,headers={"User-agent":"Mozilla/5.0"})
    soup = BeautifulSoup(res.text,"lxml")
    payload = {i['name']: i.get('value', '') for i in soup.select('input[name]')}
    payload['email'] = 'some username'
    payload['password'] = 'some password'

    s.post(signin,data=payload,headers={
        "User-agent":"Mozilla/5.0",
        "Cookie": 'adblk=adblk_yes; ubid-main=130-2884709-6520735; _msuuid_518k2z41603=95C56F3B-E3C1-40E5-A47B-C4F7BAF2FF5D; _fbp=fb.1.1574621403438.97041399; pa=BCYm5GYAag-hj1CWg3cPXjfv2X6NGPUp6kLguepMku7Yf0W9-iSTjgmVNGmQLwUfJ5XJPHqlh84f%0D%0Agrd2voq0Q7TR_rdXU4T1BJw-1a-DdvCNSVuWSm50IXJDC_H4-wM_Qli_%0D%0A; uu=BCYnANeBBdnuTg3UKEVGDiO203C7KR0AQTdyE9Y_Y70vpd04N5QZ2bD3RwWdMBNMAJtdbRbPZMpG%0D%0AbPpC6vZvoMDzucwsE7pTQiKxY24Gr4_-0ONm7hGKPfPbMwvI1NYzy5ZhTIyIUqeVAQ7geCBiS5NS%0D%0A1A%0D%0A; session-id=137-0235974-9052660; session-id-time=2205351554; session-token=jsvzgJ4JY/TCgodelKegvXcqdLyAy4NTDO5/iEvk90VA8qWWEPJpiiRYAZe3V0EYVFlKq590mXU0OU9XMbAzwyKqXIzPLzKfLf3Cc3k0g/VQNTo6roAEa5IxmOGZjWrJuhkRZ1YgeF5uPZLcatWF1y5PFHqvjaDxQrf2LZbgRXF5N7vacTZ8maK0ciJmQEjh; csm-hit=tb:8HH0DWNBDVSWP881GYKG+s-8HH0DWNBDVSWP881GYKG|1574631571950&t:1574631571952&adb:adblk_yes'
        })

    r = s.get(mainurl,headers={
        "Cookie": 'adblk=adblk_yes; ubid-main=130-2884709-6520735; _msuuid_518k2z41603=95C56F3B-E3C1-40E5-A47B-C4F7BAF2FF5D; _fbp=fb.1.1574621403438.97041399; pa=BCYm5GYAag-hj1CWg3cPXjfv2X6NGPUp6kLguepMku7Yf0W9-iSTjgmVNGmQLwUfJ5XJPHqlh84f%0D%0Agrd2voq0Q7TR_rdXU4T1BJw-1a-DdvCNSVuWSm50IXJDC_H4-wM_Qli_%0D%0A; csm-hit=tb:KV47B1QVKP4DNB3QGY95+b-NM69W1Y35R7ARV0639V5|1574631544432&t:1574631544432&adb:adblk_yes; session-id=137-0235974-9052660; session-id-time=2205351554; session-token="EsIzROiSTmFDfXd5jnBPIBOpYG9jAu7tiWXDF8R52sUw5jS6OjddfOOQB+ytCmq0K3UnXs9wKBvQtkB4aVNsXieVbRcIUrKf3iPnYeJchbOlShMjg+MR+O7IQgPKkw0BKihdYQ1YIl7KQS8VeLxZjtzJ5sj5ocnY72fCKdwq/fGOjfieFYbe9Km3a8h++1GpC738JbwcVdpTG08v1pjhQKifqPQXnqhcyVKhi8CD1qk="; x-main="C1KbtQgFFBAYfwttdRSrU5CpCe@Fn6SPHnBTY6dO2ppimt@u1P1L7G0PueQMn6X3"; at-main=Atza|IwEBICfS3UKNp2mwmbyUPY1QzjXRHMcL6fjv2ND7BDXsZ1G-qDPJKsLJXeU9gJOvRpWsofSpOJCyhnap-bIOWCutU6VMIS9bn3UkNVRP8WFVqrs-CLB5opLbrEx6YxVGQlfaxx54gzuuGO4D30z-AgBpGe64_bn0K1iLOT3P3i7S3nBzvP_0AopwKlbU7SRnE5m21cVfVK7bwbtfZO4cf7DrpGcaHK4dlY5jKHPzNx_AR4ypqsEBFbHon36N1j8foty6wLJhFP1gNCvs24mVCec24TRho5ZXFDYqhLB-dw9V3XY1eq7q1QNgtAdYkDSJ6Mq1nllFu59WqIVs1Y3lLEaxDUExLtCt-VQArpS_hZtZR8C_kevhV01jEhWg8RUQaCdYTMwZHwa778MiEOrrrdGqFnR5; sess-at-main="tWwUfkZLx+mDAPqZo+J6yJlnjqBJvYJ0oVMS6/NcIKQ="; id=BCYhnxuM-3g3WFo4uvCv6C5LdGLJKaIcZj8E-rQwU_YsF991I3Tqe94W6IlU27FvaNcnuCyv5Te3%0D%0A0c3O1mMYhEE14wMdByo2SvGXkBS0A4oFMJMEIe0aC1X4fyNRwWYNZ72a6NDzAOqeDQi3_7sZZGH8%0D%0AxQ%0D%0A; uu=BCYsGSOaee6VbhMOMXpG3F_6i7cTIkPCN0S0_Jv7c3bVkUQ5gp9vqtfvVlOMOIOqXv-uHSTSibBp%0D%0ATO1e4tRpT1DolY2qkoOW8yICF7ZrXqAgont_ShTy8zVEg1wxWCxg3_XQX8r8_dGFCO4NWZiyLH-f%0D%0A2RpBF2IJLUSd8R4UCbbbtgo%0D%0A; sid=BCYp9inRAYR9sJgmF1FcA9Vgto81vmiCYHP_gEVv6r2ZdBtz1bKtOQg4_0iSwREudsZrPM8SHMUk%0D%0A5jFMp74veGrdwNTf8DONXPUCExLgkHzfeoZr-KHf4VbI7aI5TrJhqSioYbEhHYqm6q5RGrXfCVPr%0D%0AqA%0D%0A'
        })

    sauce = BeautifulSoup(r.text,"lxml")
    name = sauce.select_one("span.display-name").text
    print(name)

To avoid the hardcoded cookies, I tried building the cookie string from the session instead, but unfortunately it failed:

cookie_string = "; ".join([str(x)+"="+str(y) for x,y in s.cookies.items()])

This is how I used it in the script:

cookie_string = "; ".join([str(x)+"="+str(y) for x,y in s.cookies.items()])
s.post(signin,data=payload,headers={
    "User-agent":"Mozilla/5.0",
    "Cookie": cookie_string
    })
cookie_string_ano = "; ".join([str(x)+"="+str(y) for x,y in s.cookies.items()])
r = s.get(mainurl,headers={
    "Cookie": cookie_string_ano
    })

When I run the above, cookie_string and cookie_string_ano come out as session-id=130-0171771-5726549; session-id-time=2205475101l and session-id=130-0171771-5726549; session-id-time=2205475101l; ubid-main=135-8050026-6353151, respectively.

How can I fetch the username without using hardcoded cookies within the script?

MITHU
  • what do you see when you run `print(s.cookies.items())`? Are you sure that you are getting all of the necessary cookies from `s.get(url)`? – Simas Joneliunas Nov 26 '19 at 07:14
  • When I print that I can only see `session-id` and `session-id-time` and their values in the cookies but in reality there are many more in the hardcoded ones @Simas Joneliunas. – MITHU Nov 26 '19 at 07:17
  • Are you sure that your "login" seems real enough? Maybe imdb detects that "something is wrong" and does not return all of the cookies. Maybe they use other javascript files that set the remainder of the cookies. I suggest trying to perform the same login using selenium and see if you can get more cookies than through requests. – Simas Joneliunas Nov 26 '19 at 07:33
  • Please see the edit @Simas Joneliunas. – MITHU Nov 26 '19 at 07:53
  • I didn't have much success either. I tried with a real user-agent, referer and origin, but it always returns the login page with a captcha. I suspect that's because of a `metadata1` parameter, which seems to be js generated, but I can't be sure. – t.m.adam Nov 26 '19 at 15:41
  • @MITHU - Could you use one of these methods here without setting your own cookie https://stackoverflow.com/questions/189555/how-to-use-python-to-login-to-a-webpage-and-retrieve-cookies-for-later-usage? Or do you need to set your own cookie? – Kevin Ng Nov 30 '19 at 06:01
  • You might need to send all of the request headers, because some of them are mandatory, such as Accept, Accept-Encoding, etc. Better to copy all of the static request headers. – furkanayd Nov 30 '19 at 22:47
  • The session **already adds the Cookie header** from the values in `s.cookies`. Why do you feel you need to add them manually? – Martijn Pieters Dec 03 '19 at 22:10
  • What I think is happening is that those pages run JS code that set additional cookies. Those are not set by your (redundant) code to copy cookies from the session to the new request. You only need to supply those missing cookies, but what values to use may not be obvious without reverse engineering work on the JS code the site uses. – Martijn Pieters Dec 03 '19 at 22:14
  • I believe some of these cookies are set by bot-prevention software on the IMDb site. So the answer to your question is presumably that you "cannot", because if you could, it would mean you had broken the anti-bot protection, since your script is effectively a bot. – Konstantin Svintsov Dec 04 '19 at 10:58
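
The point raised in the comments, that a requests session already sends the cookies it holds, can be checked without touching the network. The sketch below is only an illustration: the example.com domain and the cookie value are made up, and cookie_header_for is a hypothetical helper, not part of requests. It prepares a request without sending it and prints the Cookie header the session would attach on its own.

```python
import requests


def cookie_header_for(session, url):
    """Return the Cookie header the session would send to `url` (nothing is sent)."""
    prepared = session.prepare_request(requests.Request("GET", url))
    return prepared.headers.get("Cookie")


session = requests.Session()
# Made-up cookie standing in for one received from an earlier response.
session.cookies.set("session-id", "130-0000000-0000000", domain="example.com", path="/")

# The header is built from session.cookies; no manual "Cookie" header is needed.
print(cookie_header_for(session, "https://example.com/"))
```

If this prints the stored cookie, the manual "Cookie" header in the question adds nothing beyond what the session does already; cookies that are missing from s.cookies have to come from somewhere else.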

2 Answers

To get the cookies that Chrome holds, you need to talk to Google Chrome from a Python script through the Chrome DevTools Protocol.

PyChromeDevTools is a Python package that exposes this protocol and lets you read the browser's cookies, which removes the need for hardcoded cookies. Reference: PyChromeDevTools.


Remember: IMDb explicitly forbids screen scraping. The IMDb Conditions of Use state:

Robots and Screen Scraping: You may not use data mining, robots, screen scraping, or similar data gathering and extraction tools on this site, except with our express written consent as noted below.


Prerequisites:

  • First, add the Chrome path to your system environment variables.

  • Then run an instance of Google Chrome with the remote-debugging option. Reference: Remote debugging with Chrome Developer Tools.

  • Use the following command in a command prompt or terminal to start that instance:

    chrome.exe --remote-debugging-port=9222 --user-data-dir=remote-profile
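
If you prefer to start Chrome from Python, the command above can be assembled as an argument list, and a plain TCP check can tell you whether the debugging port is accepting connections before the script tries to attach. This is a minimal sketch: chrome_debug_command and debug_port_is_open are hypothetical helper names, and the defaults simply mirror the command shown above.

```python
import socket


def chrome_debug_command(chrome_path="chrome.exe", port=9222, profile_dir="remote-profile"):
    """Build the argument list for launching Chrome with remote debugging."""
    return [
        chrome_path,
        f"--remote-debugging-port={port}",
        f"--user-data-dir={profile_dir}",
    ]


def debug_port_is_open(host="127.0.0.1", port=9222, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


print(" ".join(chrome_debug_command()))
# To launch and then check, something like:
#   subprocess.Popen(chrome_debug_command())
#   debug_port_is_open()
```

The port check only confirms that something is listening; it does not prove that the listener is Chrome.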


Workaround:

Once that Chrome instance is running, you can run a program like the following example.

import time
import requests
import PyChromeDevTools
from bs4 import BeautifulSoup

url = 'https://secure.imdb.com/ap/signin?openid.pape.max_auth_age=0&openid.return_to=https%3A%2F%2Fwww.imdb.com%2Fap-signin-handler&openid.identity=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&openid.assoc_handle=imdb_pro_us&openid.mode=checkid_setup&siteState=eyJvcGVuaWQuYXNzb2NfaGFuZGxlIjoiaW1kYl9wcm9fdXMiLCJyZWRpcmVjdFRvIjoiaHR0cHM6Ly9wcm8uaW1kYi5jb20vIn0&openid.claimed_id=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&openid.ns=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0'
signin = 'https://secure.imdb.com/ap/signin'
mainurl = 'https://pro.imdb.com/'


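def parse_cookies(input_url):
    # Attach to the Chrome instance started with --remote-debugging-port.
    chrome = PyChromeDevTools.ChromeInterface()
    chrome.Network.enable()
    chrome.Page.enable()
    chrome.Page.navigate(url=input_url)
    # Crude wait so the page (and its JavaScript) can finish setting cookies.
    time.sleep(2)

    cookies = chrome.Network.getCookies()

    # List of cookie dicts (name, value, domain, path, ...).
    return cookies["result"]["cookies"]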


def get_cookies(parsed_cookie_string):
    # Build a "name=value; name=value" Cookie header string from the
    # list of cookie dicts returned by parse_cookies().
    cookie_string = "; ".join(
        str(cookie['name']) + "=" + str(cookie['value'])
        for cookie in parsed_cookie_string
    )

    return cookie_string


with requests.Session() as s:
    res = s.get(url, headers={"User-agent": "Mozilla/5.0"})
    soup = BeautifulSoup(res.text, "lxml")
    payload = {i['name']: i.get('value', '') for i in soup.select('input[name]')}
    payload['email'] = 'some username'
    payload['password'] = 'some password'

    cookie_string_for_post = parse_cookies(signin)
    print("Cookies for Post Request:\n ", cookie_string_for_post)

    cookie_string_for_get = parse_cookies(mainurl)
    print("Cookies for Get Request:\n ", cookie_string_for_get)

    post_req_cookies = get_cookies(cookie_string_for_post)
    print("Post Cookie_String:\n ", post_req_cookies)

    get_req_cookies = get_cookies(cookie_string_for_get)
    print("Get Cookie_String:\n ", get_req_cookies)

    s.post(signin, data=payload, headers={
        "User-agent": "Mozilla/5.0",
        "Cookie": post_req_cookies
    })

    r = s.get(mainurl, headers={
        "Cookie": get_req_cookies
    })

    sauce = BeautifulSoup(r.text, "lxml")
    name = sauce.select_one("span.display-name").text
    print("User-Name:", name)

The script above defines two functions:

  • parse_cookies(input_url) # parses the cookies from IMDb before and after sign-in
  • get_cookies(parsed_cookie_string) # builds the { name=value; } pattern from the parsed cookies

Here are the results from the script above:

Cookies for Post Request:
  [{'name': 'csm-hit', 'value': 'adb:adblk_no&t:1575551929829', 'domain': 'secure.imdb.com', 'path': '/', 'expires': 1636031929, 'size': 35, 'httpOnly': False, 'secure': False, 'session': False}, {'name': 'session-token', 'value': 'ojv7WWBxadoA7dlcquiw9uErP2rhrTH7rHbpVhoRy4T+qTDfhwZKdDt5jOeGfZp1TKvwtzTGuJ6pOltjNFPiIuP5Rd5Vw8/e1J3RY/iye5tEh7qoRC2NHF9wc003xKG3PPAAdmgf8/mv8GeLAOOKNgWKBTUeMre9xbj5GzXxZBPdXMZttHrMYqKKSuwWLpa0', 'domain': '.imdb.com', 'path': '/', 'expires': 3723035367.931534, 'size': 205, 'httpOnly': True, 'secure': True, 'session': False}, {'name': '_msuuid_518k2z41603', 'value': '7EFA48D9-B808-4A94-AF25-DF946D700AE7', 'domain': '.imdb.com', 'path': '/', 'expires': 1607087673, 'size': 55, 'httpOnly': False, 'secure': False, 'session': False}, {'name': 'uu', 'value': 'BCYrG0JCGIzGSiHxLJnhMiZmYPKjX1M_R2SYqoaFp8H_0KTtNvuGu-u_h_WO9yjlPz2CTdiUs86i%0D%0Az7kP7F-mJu5OZVpOKhquJmQf7Ks8_flkk2XlZzTPnz7R4WTBpqeRfxQqr0M9q54Gvnd0f5s1lajr%0D%0AVA%0D%0A', 'domain': '.imdb.com', 'path': '/', 'expires': 3723035262.37521, 'size': 174, 'httpOnly': False, 'secure': True, 'session': False}, {'name': 'ubid-main', 'value': '130-4270133-5864707', 'domain': '.imdb.com', 'path': '/', 'expires': 3723035317.315112, 'size': 28, 'httpOnly': False, 'secure': True, 'session': False}, {'name': 'adblk', 'value': 'adblk_no', 'domain': '.imdb.com', 'path': '/', 'expires': 1607087639, 'size': 13, 'httpOnly': False, 'secure': False, 'session': False}, {'name': '_fbp', 'value': 'fb.1.1575551679007.40322953', 'domain': '.imdb.com', 'path': '/', 'expires': 1583327724, 'size': 31, 'httpOnly': False, 'secure': False, 'session': False}, {'name': 'session-id', 'value': '130-3480383-2108806', 'domain': '.imdb.com', 'path': '/', 'expires': 3723035262.375339, 'size': 29, 'httpOnly': False, 'secure': True, 'session': False}, {'name': 'session-id-time', 'value': '2206271615', 'domain': '.imdb.com', 'path': '/', 'expires': 3723035262.375396, 'size': 25, 'httpOnly': False, 'secure': True, 'session': 
False}]
Cookies for Get Request:
  [{'name': 'vuid', 'value': 'pl1203459194.1031556308', 'domain': '.vimeo.com', 'path': '/', 'expires': 1638623938, 'size': 27, 'httpOnly': False, 'secure': False, 'session': False}, {'name': 'session-token', 'value': 'ojv7WWBxadoA7dlcquiw9uErP2rhrTH7rHbpVhoRy4T+qTDfhwZKdDt5jOeGfZp1TKvwtzTGuJ6pOltjNFPiIuP5Rd5Vw8/e1J3RY/iye5tEh7qoRC2NHF9wc003xKG3PPAAdmgf8/mv8GeLAOOKNgWKBTUeMre9xbj5GzXxZBPdXMZttHrMYqKKSuwWLpa0', 'domain': '.imdb.com', 'path': '/', 'expires': 3723035367.931534, 'size': 205, 'httpOnly': True, 'secure': True, 'session': False}, {'name': '_msuuid_518k2z41603', 'value': '7EFA48D9-B808-4A94-AF25-DF946D700AE7', 'domain': '.imdb.com', 'path': '/', 'expires': 1607087673, 'size': 55, 'httpOnly': False, 'secure': False, 'session': False}, {'name': 'uu', 'value': 'BCYrG0JCGIzGSiHxLJnhMiZmYPKjX1M_R2SYqoaFp8H_0KTtNvuGu-u_h_WO9yjlPz2CTdiUs86i%0D%0Az7kP7F-mJu5OZVpOKhquJmQf7Ks8_flkk2XlZzTPnz7R4WTBpqeRfxQqr0M9q54Gvnd0f5s1lajr%0D%0AVA%0D%0A', 'domain': '.imdb.com', 'path': '/', 'expires': 3723035262.37521, 'size': 174, 'httpOnly': False, 'secure': True, 'session': False}, {'name': 'ubid-main', 'value': '130-4270133-5864707', 'domain': '.imdb.com', 'path': '/', 'expires': 3723035317.315112, 'size': 28, 'httpOnly': False, 'secure': True, 'session': False}, {'name': 'adblk', 'value': 'adblk_no', 'domain': '.imdb.com', 'path': '/', 'expires': 1607087639, 'size': 13, 'httpOnly': False, 'secure': False, 'session': False}, {'name': '_fbp', 'value': 'fb.1.1575551679007.40322953', 'domain': '.imdb.com', 'path': '/', 'expires': 1583327724, 'size': 31, 'httpOnly': False, 'secure': False, 'session': False}, {'name': 'session-id', 'value': '130-3480383-2108806', 'domain': '.imdb.com', 'path': '/', 'expires': 3723035262.375339, 'size': 29, 'httpOnly': False, 'secure': True, 'session': False}, {'name': 'session-id-time', 'value': '2206271615', 'domain': '.imdb.com', 'path': '/', 'expires': 3723035262.375396, 'size': 25, 'httpOnly': False, 'secure': True, 'session': False}]
Post Cookie_String:
  csm-hit=adb:adblk_no&t:1575551929829; session-token=ojv7WWBxadoA7dlcquiw9uErP2rhrTH7rHbpVhoRy4T+qTDfhwZKdDt5jOeGfZp1TKvwtzTGuJ6pOltjNFPiIuP5Rd5Vw8/e1J3RY/iye5tEh7qoRC2NHF9wc003xKG3PPAAdmgf8/mv8GeLAOOKNgWKBTUeMre9xbj5GzXxZBPdXMZttHrMYqKKSuwWLpa0; _msuuid_518k2z41603=7EFA48D9-B808-4A94-AF25-DF946D700AE7; uu=BCYrG0JCGIzGSiHxLJnhMiZmYPKjX1M_R2SYqoaFp8H_0KTtNvuGu-u_h_WO9yjlPz2CTdiUs86i%0D%0Az7kP7F-mJu5OZVpOKhquJmQf7Ks8_flkk2XlZzTPnz7R4WTBpqeRfxQqr0M9q54Gvnd0f5s1lajr%0D%0AVA%0D%0A; ubid-main=130-4270133-5864707; adblk=adblk_no; _fbp=fb.1.1575551679007.40322953; session-id=130-3480383-2108806; session-id-time=2206271615
Get Cookie_String:
  vuid=pl1203459194.1031556308; session-token=ojv7WWBxadoA7dlcquiw9uErP2rhrTH7rHbpVhoRy4T+qTDfhwZKdDt5jOeGfZp1TKvwtzTGuJ6pOltjNFPiIuP5Rd5Vw8/e1J3RY/iye5tEh7qoRC2NHF9wc003xKG3PPAAdmgf8/mv8GeLAOOKNgWKBTUeMre9xbj5GzXxZBPdXMZttHrMYqKKSuwWLpa0; _msuuid_518k2z41603=7EFA48D9-B808-4A94-AF25-DF946D700AE7; uu=BCYrG0JCGIzGSiHxLJnhMiZmYPKjX1M_R2SYqoaFp8H_0KTtNvuGu-u_h_WO9yjlPz2CTdiUs86i%0D%0Az7kP7F-mJu5OZVpOKhquJmQf7Ks8_flkk2XlZzTPnz7R4WTBpqeRfxQqr0M9q54Gvnd0f5s1lajr%0D%0AVA%0D%0A; ubid-main=130-4270133-5864707; adblk=adblk_no; _fbp=fb.1.1575551679007.40322953; session-id=130-3480383-2108806; session-id-time=2206271615
User-Name: **Logged in user-name**
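
As an alternative to joining everything into one Cookie string, the cookie dicts returned by parse_cookies could be loaded into the session's own cookie jar, so that requests applies domain and path matching itself (note the vimeo.com cookie in the GET output above, which a joined string would send to IMDb as well). This is a minimal sketch under assumptions: load_cookies_into_session is a hypothetical helper, and the sample values are made up but follow the shape of the output above.

```python
import requests


def load_cookies_into_session(session, chrome_cookies):
    """Copy cookie dicts (name/value/domain/path/secure keys) into session.cookies."""
    for cookie in chrome_cookies:
        session.cookies.set(
            cookie["name"],
            cookie["value"],
            domain=cookie["domain"],
            path=cookie["path"],
            secure=cookie["secure"],
        )
    return session


# Shortened, made-up sample in the same shape as the output above.
sample_cookies = [
    {"name": "vuid", "value": "pl0000000000.0000000000", "domain": ".vimeo.com", "path": "/", "secure": False},
    {"name": "session-id", "value": "130-0000000-0000000", "domain": ".imdb.com", "path": "/", "secure": True},
    {"name": "ubid-main", "value": "130-1111111-1111111", "domain": ".imdb.com", "path": "/", "secure": True},
]

session = load_cookies_into_session(requests.Session(), sample_cookies)

# Prepare (but do not send) a request to see which cookies would go out.
prepared = session.prepare_request(requests.Request("GET", "https://pro.imdb.com/"))
print(prepared.headers.get("Cookie"))
```

With the jar populated this way, s.post(...) and s.get(...) need no explicit "Cookie" header.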
Muhammad Usman Bashir

It seems you are copying the cookies from the browser, so I'll work from that assumption.

The first POST request you make sets some cookies and returns a page, which calls further URLs, which set more cookies, and so on. Check all the requests in the network tab to see whether there are multiple calls that set different cookies.

If there are, you need to make all of those calls in the order the page makes them, each one adding new cookies. At the end you should see all the cookies you are currently copying.
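
To see which step sets which cookie on the requests side, you can walk a response's redirect chain and list the cookies each hop set. This is a minimal sketch: cookies_by_response is a hypothetical helper, and it only covers redirects that requests follows itself, not the extra calls a page's JavaScript makes in the browser.

```python
import requests


def cookies_by_response(response):
    """List (url, cookies) for each hop in a response's redirect chain."""
    chain = list(response.history) + [response]
    return [(r.url, r.cookies.get_dict()) for r in chain]


# Example usage (needs network access, so it is left commented out):
# with requests.Session() as s:
#     res = s.get("https://example.com/")
#     for url, cookies in cookies_by_response(res):
#         print(url, sorted(cookies))
```

Comparing this list with the cookies shown in the browser's network tab shows which ones never arrive through plain HTTP responses.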

However, if some randomly generated data is computed and sent in any of the calls, it is probably there for CSRF or bot protection. In that case you are better off using http://www.omdbapi.com/ or https://imdbpy.github.io/, which offer documented interfaces, instead of the site's internal endpoints.

Himanshu Mishra
  • I can only log in or get the desired content once I follow the order in which the XHRs are sent. That is old news; it's exactly what I did in my script above. As for your API suggestion, you might have missed that I'm already able to log in to that site and access its data without any issue. What I can't do is succeed without hardcoded cookies, and that is what my question is about. Thanks. – MITHU Dec 04 '19 at 15:33