
I’m a complete noob trying to scrape data for the first time. I’ve watched some videos and read a bunch of articles to learn how to scrape data. The code I’ve written so far is this:

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = 'https://mijn.makelaarsland.nl/aanbod/kaart'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

# Parse the HTML
page_soup = soup(page_html, "html.parser")
print(page_soup.body.div)

The problem is that when I try to parse the data, all I get back is this:

<div class="login-background"></div>

I’ve watched a bunch of videos and tried to write some code to get it all working, but I don’t understand it. Maybe someone can help me and tell me what I’m doing wrong.

Here is some information that might be useful:

This is the log in URL:
LOGIN_URL = "https://mijn.makelaarsland.nl/inloggen"


content-type: application/x-www-form-urlencoded

[Screenshot: an overview of the Network tab after right-clicking the page and choosing ‘Inspect’]

  • I suggest using the `requests` package, logging in (with the help of [this previous SO answer](https://stackoverflow.com/a/17633072/5666087)), and then parsing the HTML of the page you need with beautifulsoup – jkr May 04 '20 at 13:50

2 Answers


As I wrote in my comment, I suggest using the `requests` Python package. That package has great documentation, and you can find many tutorials online. Log into the website within the scope of a `requests.Session()`, navigate to the page you want, and then scrape it with beautifulsoup.

Here is a code sample adapted from https://stackoverflow.com/a/17633072/5666087:

import requests
from bs4 import BeautifulSoup

# Fill in your details here to be posted to the login form.
payload = {
    "MyAccount.Username": "username",
    "MyAccount.Password": "password"
}

# Use 'with' to ensure the session context is closed after use.
with requests.Session() as s:
    p = s.post("https://mijn.makelaarsland.nl/inloggen", data=payload)
    # An authorized request.
    r = s.get("https://mijn.makelaarsland.nl/aanbod/kaart")
    print("status code:", r.status_code)
    page_soup = BeautifulSoup(r.text, "html.parser")
    print(page_soup.body.div)
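
Note that a 200 status code alone does not prove the login worked: many sites serve the login page itself with status 200. As a quick sanity check (a sketch, assuming the failed-login page still contains the `login-background` div from the question), you can look for that marker in the parsed response, continuing inside the `with` block above:

# A 200 status does not guarantee the login succeeded, so check for a
# marker that only appears on the login page.
if page_soup.find("div", class_="login-background") is not None:
    print("Login appears to have failed; check the posted credentials.")
else:
    print("Login looks successful.")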
– jkr
  • Thanks. I tried using the `requests` package. However, I think the website uses AJAX content; I read something about that. Do you have any experience with this? – MatthiasR May 04 '20 at 15:12
  • I don't think that should matter. You can still post the form data using the `session.post` method in requests, as I wrote in my answer. – jkr May 04 '20 at 15:18
  • Thanks! Tried using the requests package and it seems to work. It gives back: ```status code: 200 Traceback (most recent call last): File "", line 6, in File "C:\Users\matth\Anaconda3\lib\site-packages\bs4\element.py", line 1578, in __getattr__ "ResultSet object has no attribute '%s'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?" % key AttributeError: ResultSet object has no attribute 'body'. You're probably treating a list of items like a single item.```. – MatthiasR May 04 '20 at 15:20
  • That error is related to beautifulsoup now. I'm afraid I can't help with that. – jkr May 04 '20 at 15:23
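
If the page really does load its listings via AJAX, the same authenticated session can usually fetch the underlying XHR endpoint directly (look for it in the Network tab, filtered to XHR). A minimal sketch, where the endpoint URL and its response format are assumptions to verify in the browser's developer tools:

import requests

with requests.Session() as s:
    # Log in first, exactly as in the answer above.
    s.post(
        "https://mijn.makelaarsland.nl/inloggen",
        data={"MyAccount.Username": "username", "MyAccount.Password": "password"},
    )
    # Hypothetical XHR endpoint -- the real URL must be taken from the
    # Network tab, since it is not given in this thread.
    r = s.get("https://mijn.makelaarsland.nl/aanbod/kaart/data")
    # If the endpoint returns JSON, no HTML parsing is needed at all.
    print(r.json())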

I have now fixed the BeautifulSoup problem. In addition, I think I need to add the _RequestVerificationToken.

import requests
from bs4 import BeautifulSoup

headers = {"user-agent": "Mozilla/5.0 ... etc."}

login_data = {
    "MyAccount.Username": "myusername",
    "MyAccount.Password": "mypassword",
    "RememberMe": "false"
}

with requests.Session() as s:
    url = 'https://mijn.makelaarsland.nl/inloggen?ReturnUrl=%2faanbod%2fkaart'
    # Fetch the login page first so we can read the anti-forgery token.
    r = s.get(url, headers=headers)
    soup = BeautifulSoup(r.content, 'html.parser')
    # Copy the hidden token field into the form data before posting.
    login_data['_RequestVerificationToken'] = soup.find('input', attrs={'name': '_RequestVerificationToken'})['value']
    r = s.post(url, data=login_data, headers=headers)

    print(r.content)

However, it returns:

TypeError                                 Traceback (most recent call last)
<ipython-input-52-5509032e4ad3> in <module>
     16     r = s.get(url, headers=headers)
     17     soup = BeautifulSoup(r.content, 'html.parser')
---> 18     login_data[_RequestVerificationToken] = soup.find('input', attrs={'name' : '_RequestVerificationToken'})['value']
     19     r = s.post(url, data=login_data, headers=headers)
     20 

TypeError: 'NoneType' object is not subscriptable

What am I doing wrong here?

  • Managed to log in. Thanks for your help. The reason it didn't work was that I forgot the underscore "_". – MatthiasR May 05 '20 at 13:42
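
For anyone who hits the same TypeError: it means soup.find(...) returned None, i.e. no <input> with that exact name was found on the page. A small sketch of a guard that turns the cryptic NoneType traceback into a clear error message:

# Look the token up defensively instead of subscripting the result of
# find() directly, so a misspelled field name fails with a clear message.
token_input = soup.find('input', attrs={'name': '_RequestVerificationToken'})
if token_input is None:
    raise RuntimeError("No _RequestVerificationToken input found; "
                       "check the field name (including the underscore).")
login_data['_RequestVerificationToken'] = token_input['value']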