I'm trying to scrape the URL and would like to get all apartments from every available page. In this example there are only two pages, and I want to:

  1. "Click on the next button" / Go to page 2
  2. Go to the last page if there is no next-button

In the tutorials I've seen, such as this one, the next button on the page being scraped has an href link. In my case the HTML for the page list does not contain any href link but looks like this:

[screenshot of the page-list HTML; image not preserved]

In other tutorials, the href link to the next button is found by looping over the links in the webpage. When I do this, I get only the main links for the site (despite loading a nested URL) and do not find any next button.

nav = soup.nav

for url in nav.find_all('a'):
    print(url.get('href'))

Do you have any ideas on how to access the url of the next-button in this case?

karwi

2 Answers

I'm not very familiar with all the options of Beautiful Soup in particular, but the approach you tried most likely assumes the link you want to click has an href, so it will not work here. The page seems to be a SPA (single-page application), so the next button is probably implemented with JavaScript (an event listener, for example) that requests data and updates the page.

It seems like most of the data is loaded via an API call to https://www.booli.se/graphql. Right-click the page, click Inspect, open the Network tab, and click your next button. A request with parameters such as the following is sent in the request body:

{
  "operationName": "searchForSale",
  "variables": {
    "input": {
      "page": 2,
      "filters": [
        {
          "key": "objectType",
          "value": "Lägenhet,Parhus,Radhus,Kedjehus"
        },
        {
          "key": "rooms",
          "value": "3,2,4"
        },
        {
          "key": "minLivingArea",
          "value": "60"
        }
      ]
    }
  },
 "query": "query searchForSale($input: SearchRequest) {\n  search: searchForSale(input: $input) {\n    pages\n    totalCount\n    result {\n      __typename\n      ... on Listing {\n        booliId\n        blockedImages\n        descriptiveAreaName\n        livingArea {\n          formatted\n          __typename\n        }\n        listPrice {\n          formatted\n          raw\n          __typename\n        }\n        listSqmPrice {\n          formatted\n          __typename\n        }\n        latitude\n        longitude\n        daysActive\n        primaryImage {\n          id\n          __typename\n        }\n        objectType\n        rent {\n          formatted\n          raw\n          __typename\n        }\n        operatingCost {\n          raw\n          __typename\n        }\n        estimate {\n          price {\n            raw\n            formatted\n            __typename\n          }\n          __typename\n        }\n        rooms {\n          formatted\n          __typename\n        }\n        streetAddress\n        url\n        isNewConstruction\n        biddingOpen\n        upcomingSale\n        mortgageDeed\n        tenureForm\n        plotArea {\n          formatted\n          __typename\n        }\n        patio\n        hasFireplace\n        __typename\n      }\n      ... on Project {\n        booliId\n        name\n        url\n        booliUrl\n        numberOfListingsForSale\n        location {\n          namedAreas\n          __typename\n        }\n        developer {\n          name\n          id\n          __typename\n        }\n        image {\n          id\n          __typename\n        }\n        latestPriceChange\n        latitude\n        longitude\n        created\n        roomsList\n        livingAreaRange\n        listPriceRange\n        lowestProjectListPrice\n        __typename\n      }\n    }\n    __typename\n  }\n}\n"
}

Instead of trying to go to the next page, you could send this request directly and use the response data, adapting the variables in your request to your needs. If needed, you can increase the page number here (or possibly even get all the data without clicking through pages).
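The idea above can be sketched in Python using only the standard library. This is a minimal sketch, not a verified client: the trimmed-down `QUERY`, the helper names (`build_payload`, `post_json`, `fetch_all_listings`), and the assumed response shape `{"data": {"search": {...}}}` are illustrative assumptions. Only the URL, the operation name, the filters, and the `pages` field come from the request shown above.

```python
import json
import urllib.request

GRAPHQL_URL = "https://www.booli.se/graphql"  # endpoint named in the answer

# Shortened query for illustration; the field selection is an assumption
# (a subset of the full query captured from the Network tab).
QUERY = """
query searchForSale($input: SearchRequest) {
  search: searchForSale(input: $input) {
    pages
    totalCount
    result {
      ... on Listing {
        booliId
        streetAddress
        url
      }
    }
  }
}
"""

FILTERS = [
    {"key": "objectType", "value": "Lägenhet,Parhus,Radhus,Kedjehus"},
    {"key": "rooms", "value": "3,2,4"},
    {"key": "minLivingArea", "value": "60"},
]


def build_payload(page, filters=FILTERS):
    """Build the request body for one result page."""
    return {
        "operationName": "searchForSale",
        "variables": {"input": {"page": page, "filters": filters}},
        "query": QUERY,
    }


def post_json(payload, url=GRAPHQL_URL):
    """POST the payload as JSON and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


def fetch_all_listings(post=post_json):
    """Fetch page 1, read the total page count, then fetch the remaining pages."""
    first = post(build_payload(1))["data"]["search"]  # assumed response shape
    listings = list(first["result"])
    for page in range(2, first["pages"] + 1):
        listings.extend(post(build_payload(page))["data"]["search"]["result"])
    return listings
```

Calling `fetch_all_listings()` would then replace clicking through the pages. The `post` parameter is injectable so the paging logic can be tried against a fake response before hitting the real endpoint; whether the server accepts a shortened query is untested here.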

Edit: For me, the API doesn't require any form of authentication. I send a POST request with that body to https://www.booli.se/graphql; here is an example in Postman: [screenshot not preserved]

4Fingers
  • Hi, thank you for your comment. I don't exactly understand what you want me to try. I checked the API and it's closed for new developers, so I am not able to access it (unfortunately). – karwi Sep 29 '21 at 12:13
  • There is no URL (and thus no href) to another page; instead, the data is loaded using a request. I added an example of how I retrieved data from the API directly using Postman. I also updated my old explanation a bit, hope this helps. – 4Fingers Sep 30 '21 at 13:39

You can select any element in the HTML tree using CSS and/or XPath selectors.

See the linked Stack Overflow post, which goes into detail on how to do this in Beautiful Soup.

TL;DR: use `from lxml import etree` to parse the page with XPath support.

The XPath could be something like `//div/button[@class='_132gv']` or anything more specific; look up the XPath syntax for details.

You can always count the number of matching elements, check for their presence, and so on, especially with XPath, which supports much more.
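As a rough illustration of counting and presence checks, here is a sketch that uses the standard library's `xml.etree.ElementTree` as a stand-in for lxml. ElementTree supports only a subset of XPath and needs well-formed markup, so real-world HTML would normally go through lxml instead, where the equivalent query is `tree.xpath("//div/button[@class='_132gv']")`. The `SAMPLE` markup is hypothetical; only the class name `_132gv` comes from the XPath above.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the pagination markup;
# the real page's structure will differ.
SAMPLE = """
<html><body>
  <nav>
    <div>
      <button class="_132gv">1</button>
      <button class="_132gv">2</button>
    </div>
  </nav>
</body></html>
"""

root = ET.fromstring(SAMPLE.strip())

# ElementTree paths are relative, hence the leading "." before "//".
buttons = root.findall(".//div/button[@class='_132gv']")

page_count = len(buttons)               # count the matching elements
has_pagination = page_count > 0         # presence check
labels = [button.text for button in buttons]
```

Note that this only locates the buttons. Because they carry no href, finding them does not by itself yield a URL for the next page.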

Warkaz
  • Thank you for the guidance. This is a little advanced for me yet, but I'll dig more into the post with lxml, find_next and XPath selectors to see if this can solve the problem. – karwi Sep 29 '21 at 12:49
  • No problem. You can always check the XPath syntax I linked and practice on any website. In Chrome (and probably other browsers as well) there are two ways: `F12 -> Console -> $x("")`, for example `$x("//div")` returns all divs on the page; or `F12 -> Elements`, click on the HTML, press Ctrl + F, and search directly with your XPath/CSS with matches highlighted. – Warkaz Sep 30 '21 at 13:31