I would like to get the links to all the hostels listed on this page: https://www.french.hostelworld.com/s?q=Paris,%20Ile-de-France,%20France&country=France&city=Paris&type=city&id=14&from=2021-04-30&to=2021-05-03&guests=2&page=1

Something like this, as a list: ['https://www.french.hostelworld.com/pwa/hosteldetails.php/R-sidence-Internationale-de-Paris/Paris/294403?from=2021-04-30&to=2021-05-03&guests=2', 'https://www.french.hostelworld.com/pwa/hosteldetails.php/Le-Village-Montmartre-by-Hiphophostels/Paris/606?from=2021-04-30&to=2021-05-03&guests=2'...]

Here's my script:

import requests
from bs4 import BeautifulSoup

url = 'https://www.french.hostelworld.com/s?q=Paris,%20Ile-de-France,%20France&country=France&city=Paris&type=city&id=14&from=2021-04-30&to=2021-05-03&guests=2&page=1'

# Fetch the search results page and parse the HTML the server returned
results = requests.get(url)
soup = BeautifulSoup(results.text, "html.parser")

# Collect the href of every <a> element inside the "page-inner" div
links1 = [a['href'] for a in soup.find("div", {"class": "page-inner"}).find_all('a', href=True)]

print(links1)

This is the output I get:

(base) C:\Users\evanalonso\PYTHON\webscraping script>python hostelworld.py
['/']

Something is wrong, but I can't figure out what. Any ideas?

FalconMelee
  • The page takes time to load, and the response you are getting does not contain any of the hostel links. That's why the list is empty. As an alternative, you can use a web driver, something [like this](https://stackoverflow.com/questions/47730671/python-3-using-requests-does-not-get-the-full-content-of-a-web-page/47730866) – simpleApp Apr 29 '21 at 14:21
  • What? For me it loads instantly and I can see the hrefs in the code – FalconMelee Apr 29 '21 at 14:25
  • Try saving the HTML you received with `with open("html_received.html","w") as f: f.write(results.text)` to see what has been fed to soup. Thanks – simpleApp Apr 29 '21 at 14:30

1 Answer

You're trying to retrieve all the links inside the div element with class "page-inner":

links1 = [a['href'] for a in soup.find("div", {"class": "page-inner"}).find_all('a', href=True)]

Take a look at the content of that element. If we run the following code:

div = soup.find("div", {"class": "page-inner"})
print(div)

We see:

<div class="page-inner">
  <header data-v-0225089c="">
    <a class="logo nuxt-link-active" data-v-0225089c="" href="/"
    title="Hostelworld">
      <img alt="Hostelworld" data-v-0225089c=""
      src="/_nuxt/img/789e4da.svg" />
    </a>
    <div class="header-icons" data-v-0225089c="">
      <div class="user-authentication item" data-v-0225089c="">
        <button aria-label="Se connecter/cr&#195;&#169;er un compte"
        class="header-option" data-v-0225089c="" id="header-login"
        title="Se connecter/cr&#195;&#169;er un compte">
          <i aria-describedby="icon-core-user-fill-aria-description"
          aria-hidden="true"
          aria-labelledby="icon-core-user-fill-aria-label"
          class="core-icon icon-core-user-fill" data-v-0225089c="">
            <!-- -->
            <!-- -->
          </i>
          <span class="sr-only" data-v-0225089c="">Se
          connecter/cr&#195;&#169;er un compte</span>
        </button>
        <!-- -->
        <!-- -->
      </div>
      <div class="select-list item" data-v-0225089c=""
      data-v-5ec8cd6c="" id="header-language-picker"
      role="listbox">
        <div class="select-list-slot-wrapper" data-v-5ec8cd6c=""
        tabindex="0">
          <div class="icon-menu-item" data-v-0225089c=""
          data-v-5ec8cd6c="" role="option"
          title="Fran&#195;&#167;ais">Fran&#195;&#167;ais</div>
        </div>
        <!-- -->
      </div>
      <div class="select-list item" data-v-0225089c=""
      data-v-5ec8cd6c="" id="header-currency-picker"
      role="listbox">
        <div class="select-list-slot-wrapper" data-v-5ec8cd6c=""
        tabindex="0">
          <div class="icon-menu-item" data-v-0225089c=""
          data-v-5ec8cd6c="" role="option" title="USD">USD</div>
        </div>
        <!-- -->
      </div>
      <button aria-label="Menu"
      class="icon-core-menu-fill header-option item"
      data-v-0225089c="" title="Menu">
        <span class="sr-only" data-v-0225089c="">Menu</span>
      </button>
    </div>
    <!-- -->
    <!-- -->
  </header>
</div>

Note that there are no <a> elements in there other than this one:

<a class="logo nuxt-link-active" data-v-0225089c="" href="/"
title="Hostelworld">
  <img alt="Hostelworld" data-v-0225089c=""
  src="/_nuxt/img/789e4da.svg" />
</a>
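
To see why the script prints ['/'], here is a minimal sketch that extracts href values from that one element. It uses the standard library's html.parser instead of BeautifulSoup so it runs without extra packages; the HrefCollector class and the static_html variable are names made up for this illustration.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


# The only <a> element present in the server-rendered div shown above.
static_html = '''
<a class="logo nuxt-link-active" data-v-0225089c="" href="/"
title="Hostelworld">
  <img alt="Hostelworld" data-v-0225089c=""
  src="/_nuxt/img/789e4da.svg" />
</a>
'''

collector = HrefCollector()
collector.feed(static_html)
print(collector.hrefs)  # ['/']
```

The single logo link is all the static HTML offers, which matches the output in the question.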

That means your code is working correctly. The problem is that the content you see in your browser is loaded via JavaScript after the initial HTML has loaded. You can see this by using the developer tools in your browser to watch the network requests made while the page loads. For the example you've given, we see a request for a JSON document that looks like:

GET https://api.m.hostelworld.com/2.2/cities/14/properties/?currency=USD&application=web&user-id=ae71369c-e75b-45e5-8ffa-97223f9daf22&date-start=2021-04-30&num-nights=3&guests=2&per-page=1000&show-rooms=1&property-num-images=30
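
Rather than hard-coding that string, the same URL can be assembled from the search parameters. This is only a sketch under a few assumptions: the city id in the API path appears to match the `id=14` in the search page URL, `num-nights` appears to be the number of days between the `from` and `to` dates, and the `user-id` parameter is left out on the unverified assumption that the API responds without it. The build_api_url helper is a made-up name.

```python
from datetime import date
from urllib.parse import urlencode

API_BASE = "https://api.m.hostelworld.com/2.2/cities/{city_id}/properties/"


def build_api_url(city_id, date_from, date_to, guests, currency="USD"):
    """Build the properties URL from ISO dates (hypothetical helper)."""
    start = date.fromisoformat(date_from)
    end = date.fromisoformat(date_to)
    params = {
        "currency": currency,
        "application": "web",
        "date-start": start.isoformat(),
        # Assumption: num-nights is the day count between the two dates.
        "num-nights": (end - start).days,
        "guests": guests,
        "per-page": 1000,
        "show-rooms": 1,
        "property-num-images": 30,
    }
    return API_BASE.format(city_id=city_id) + "?" + urlencode(params)


api_url = build_api_url(14, "2021-04-30", "2021-05-03", 2)
print(api_url)
```

For the dates in the question this gives `num-nights=3`, the same value as in the request captured above.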

If you request that URL, you get a JSON document that probably has all the information you want:

>>> import requests
>>> url = 'https://api.m.hostelworld.com/2.2/cities/14/properties/?currency=USD&application=web&user-id=ae71369c-e75b-45e5-8ffa-97223f9daf22&date-start=2021-04-30&num-nights=3&guests=2&per-page=1000&show-rooms=1&property-num-images=30'
>>> res = requests.get(url)
>>> print('\n'.join(prop['name'] for prop in res.json()['properties']))
Enjoy Hostel
Peace & Love Hostel
Generator Paris
Le Village Montmartre by Hiphophostels
Aloha Eiffel Tower by Hiphophostels
MEININGER Paris Porte de Vincennes
Smart Place Paris Gare du Nord by Hiphophostels
.
.
.
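
To get from that JSON to the list of links asked for in the question, something like the following might work. It is a sketch built on guesses: it assumes each property also carries an `id` key (only `name` is confirmed above), and the slug rule (each run of characters other than ASCII letters and digits becomes one hyphen) is inferred from the two example URLs in the question, not from any documentation. The sample entries are hand-written from those example URLs, not real API output, and slugify and hostel_link are made-up names.

```python
import re
from urllib.parse import urlencode

DETAILS_BASE = "https://www.french.hostelworld.com/pwa/hosteldetails.php"


def slugify(name):
    # Assumption: runs of non-ASCII-alphanumeric characters become one hyphen.
    return re.sub(r"[^A-Za-z0-9]+", "-", name).strip("-")


def hostel_link(prop, city, date_from, date_to, guests):
    """Build a details-page URL for one property dict (hypothetical helper)."""
    query = urlencode({"from": date_from, "to": date_to, "guests": guests})
    return f"{DETAILS_BASE}/{slugify(prop['name'])}/{city}/{prop['id']}?{query}"


# Hand-written sample entries reconstructed from the URLs in the question.
sample_properties = [
    {"id": 294403, "name": "Résidence Internationale de Paris"},
    {"id": 606, "name": "Le Village Montmartre by Hiphophostels"},
]

links = [
    hostel_link(p, "Paris", "2021-04-30", "2021-05-03", 2)
    for p in sample_properties
]
print(links)
```

With the real response, replace sample_properties with res.json()['properties'], and check a few of the generated links in a browser, since the slug rule may not hold for every name.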
larsks