I am attempting to get the names and prices of the listings on a cruise website.

import requests
from bs4 import BeautifulSoup

URL = 'https://www.ncl.com/vacations?cruise-destination=transatlantic'
page = requests.get(URL)


soup = BeautifulSoup(page.content, "html.parser")
names = soup.find_all('h2', class_='headline -medium -small-xs -variant-1')
prices = soup.find_all('span', class_='headline-3 -variant-1')

print(names)
print(prices)

This just prints two empty lists (`[]`).

John Kugelman
Blythe
  • I don't see `headline-3` in the HTML source. And the only `headline` is `headline -small`, not `headline -medium -small-xs -variant-1` – Barmar Jul 15 '21 at 00:16
  • The elements you're looking for are added dynamically by JS. You need to use Selenium WebDriver to get the dynamic content. – Barmar Jul 15 '21 at 00:18

1 Answer

BeautifulSoup can only see HTML elements which exist in the HTML document at the time the document is served to you from the server. It cannot see elements in the DOM which normally would be populated/created asynchronously using JavaScript (by a browser).

The page you're trying to scrape is of the second kind: the HTML document the server sent you when you requested it contains only the "barebones" scaffolding of the page. When you view the page in a browser, that scaffolding is populated later via JavaScript. The browser typically does this by making additional requests to other resources/APIs, whose responses contain the information with which to populate the page.

BeautifulSoup is not a browser. It's just an HTML/XML parser. You made a single request to a (mostly empty) template HTML. You can expect BeautifulSoup NOT to work for any "fancy" pages - if you see a spinning "loading" graphic, you should immediately think "this page is populated asynchronously using JavaScript and BeautifulSoup won't work for this".
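
To make the point concrete, here is a minimal sketch of what a parser gets to work with. It uses the standard library's `html.parser` (the same backend the question passes to BeautifulSoup) so that it runs without extra installs. The `SERVED_HTML` skeleton and the `TagCollector` helper are made up for illustration; they are not the real NCL markup.

```python
from html.parser import HTMLParser

# Hypothetical skeleton standing in for what a server sends
# before any JavaScript has run. Not the real NCL page.
SERVED_HTML = """
<html>
  <body>
    <div id="app"></div>
    <script src="/bundle.js"></script>
  </body>
</html>
"""


class TagCollector(HTMLParser):
    """Record the name of every start tag found in the markup."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def collect_tags(markup):
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


tags = collect_tags(SERVED_HTML)
print(tags)          # every tag the parser can see
print("h2" in tags)  # False: the headings are only created later, in a browser
```

The parser never executes `/bundle.js`, so any `<h2>` that script would create simply does not exist in the text being parsed, and a search for it comes back empty.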

There are cases where the information you're trying to scrape is actually embedded somewhere in the HTML at the time the server serves it to you, possibly in a <script> tag, and the browser is then expected to use JavaScript to make this data presentable. In such a case, BeautifulSoup would be able to see the data, but that's a separate matter.
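
As a rough illustration of that embedded-data case, the sketch below pulls JSON out of a `<script>` tag using only the standard library. Everything about the markup is an assumption: the `initial-data` id, the JSON shape, and the `ScriptDataParser` helper are invented for this example, and a real page may embed its data differently (for instance as a JavaScript assignment rather than plain JSON).

```python
import json
from html.parser import HTMLParser

# Hypothetical markup: the script id and the data are invented for illustration.
SERVED_HTML = """
<html>
  <body>
    <div id="app"></div>
    <script id="initial-data" type="application/json">
      {"itineraries": [{"title": "Sample Cruise", "price": 999}]}
    </script>
  </body>
</html>
"""


class ScriptDataParser(HTMLParser):
    """Collect the text inside the <script> tag that has a given id."""

    def __init__(self, script_id):
        super().__init__()
        self.script_id = script_id
        self._inside = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("id") == self.script_id:
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.chunks.append(data)


def extract_embedded_json(markup, script_id):
    parser = ScriptDataParser(script_id)
    parser.feed(markup)
    parser.close()
    if not parser.chunks:
        raise ValueError("no <script> content found for id {!r}".format(script_id))
    return json.loads("".join(parser.chunks))


data = extract_embedded_json(SERVED_HTML, "initial-data")
print(data["itineraries"][0]["title"])
```

With BeautifulSoup the same idea would be to locate the `<script>` element and pass its text to `json.loads`.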

In your case, one solution would be to view the page in a browser and log your network traffic. Doing this reveals that, once the page loads, an XHR HTTP GET request is made to a REST API endpoint, the response of which is JSON and contains all the information you're trying to scrape. The trick then is to imitate that request: copy the endpoint URL (including query-string parameters) and any necessary request headers (and the payload, if it's a POST request; in this case, it isn't).
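
Before sending anything, it can help to assemble the imitated request and compare it with what the browser's network log shows. The sketch below does that with the standard library's `urllib`, as a stand-in for `requests`, and never touches the network. The endpoint, query parameters, and headers are the ones used in the script in this answer; the `build_request` helper is a name introduced here for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Values as used in the script in this answer.
ENDPOINT = "https://www.ncl.com/fr/en/api/vacations/v1/itineraries"
PARAMS = {"guests": "2", "v": "1414764913-1626184979267"}
HEADERS = {"accept": "application/json", "user-agent": "Mozilla/5.0"}


def build_request(endpoint, params, headers):
    """Assemble a GET request object without sending it."""
    url = "{}?{}".format(endpoint, urlencode(params))
    return Request(url, headers=headers, method="GET")


request = build_request(ENDPOINT, PARAMS, HEADERS)

# Compare these with the entry in the browser's network log.
print(request.get_method())
print(request.full_url)
print(request.header_items())
```

If the method, URL, and headers match the browser's request, the server has little reason to answer the script differently.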

Inspecting the response gives us further clues on how to write our script: The JSON response contains ALL itineraries, even ones we aren't interested in (such as non-transatlantic trips). This means that, normally, the browser must run some JavaScript to filter the itineraries - this happens client-side, not server-side. Therefore, our script will have to perform the same kind of filtering.

def get_itineraries():
    import requests

    # REST API endpoint found by logging the browser's network traffic.
    url = "https://www.ncl.com/fr/en/api/vacations/v1/itineraries"

    # Query-string parameters, copied from the browser's request.
    params = {
        "guests": "2",
        "v": "1414764913-1626184979267"
    }

    # Request headers, imitating the browser's request.
    headers = {
        "accept": "application/json",
        "accept-encoding": "gzip, deflate",
        "user-agent": "Mozilla/5.0"
    }
    response = requests.get(url, params=params, headers=headers)
    response.raise_for_status()

    # The API returns ALL itineraries, so keep only the transatlantic ones.
    def predicate(itinerary):
        return any(dest["code"] == "TRANSATLANTIC" for dest in itinerary["destination"])

    yield from filter(predicate, response.json()["itineraries"])

def main():

    from itertools import islice

    def get_cheapest_price(itinerary):
        # Cheapest price across all of the itinerary's sailings.

        def get_valid_option(sailing):
            # First pricing option of the sailing that actually has a price.

            def predicate(option):
                return "combinedPrice" in option

            return next(filter(predicate, sailing["pricing"]))

        return min(get_valid_option(sailing)["combinedPrice"] for sailing in itinerary["sailings"])

    itineraries = list(islice(get_itineraries(), 50))
    
    prices = map(get_cheapest_price, itineraries)

    for itinerary, price in sorted(zip(itineraries, prices), key=lambda tpl: tpl[1]):
        print("[{}{}] - {}".format(itinerary["currency"]["symbol"], price, itinerary["title"]["fullTitle"]))
    
    return 0


if __name__ == "__main__":
    import sys
    sys.exit(main())

Output:

[€983] - 12-Day Transatlantic From London To New York: Spain & Bermuda
[€984] - 11-Day Transatlantic from Miami to Barcelona: Ponta Delgada, Azores
[€1024] - 15-Day Transatlantic from Rio de Janeiro to Barcelona: Spain & Brazil
[€1177] - 15-Day Transatlantic from Rome to New York: Italy, France & Spain
[€1190] - 14-Day Transatlantic from Barcelona to New York: Spain & Bermuda
[€1234] - 14-Day Transatlantic from Lisbon to Rio de Janeiro: Spain & Brazil
[€1254] - 11-Day Europe from Rome to London: Italy, France, Spain & Portugal
[€1271] - 15-Day Transatlantic From New York to Rome: Italy, France & Spain
[€1274] - 15-Day Transatlantic from New York to Barcelona: Spain & Bermuda
[€1296] - 13-Day Transatlantic From New York to London: France & Ireland
[€1411] - 17-Day Transatlantic from Rome to Miami: Italy, France & Spain
[€1420] - 15-Day Transatlantic From New York to Barcelona: France & Spain
[€1438] - 16-Day Transatlantic from Rome to New York: Italy, France & Spain
[€1459] - 15-Day Transatlantic from Barcelona to Tampa: Bahamas, Spain & Bermuda
[€1473] - 11-Day Transatlantic from New York to Reykjavik: Halifax & Akureyri
[€1486] - 16-Day Transatlantic from Rome to New York: Italy, France & Spain
[€1527] - 15-Day Transatlantic from New York to Rome: Italy, France & Spain
[€1529] - 14-Day Transatlantic From New York to London: France & Ireland
[€1580] - 16-day Transatlantic From Barcelona to New York: Spain & Bermuda
[€1595] - 16-Day Transatlantic From New York to Rome: Italy, France & Spain
[€1675] - 16-Day Transatlantic from New York to Rome: Italy, France & Spain
[€1776] - 14-Day Transatlantic from New York to London: England & Ireland
[€1862] - 12-Day Transatlantic From London to New York: Scotland & Iceland
[€2012] - 15-Day Transatlantic from New York to Barcelona: Spain & Bermuda
[€2552] - 14-Day Transatlantic from New York to London: England & Ireland
[€2684] - 16-Day Transatlantic from New York to London: France & Ireland
[€3460] - 16-Day Transatlantic from New York to London: France & Ireland
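
To try the filtering and cheapest-price logic without hitting the network, here is a small offline sketch. The `SAMPLE_RESPONSE` payload is hand-made: its keys mirror the ones the script above reads (`destination`, `sailings`, `pricing`, `combinedPrice`, `title`), but the values, the `status` key, and the helper names are invented. One deliberate difference is that sailings with no priced option are skipped here rather than raising.

```python
# Hand-made sample payload. The keys mirror those read by the script above;
# the values and the "status" key are invented for illustration.
SAMPLE_RESPONSE = {
    "itineraries": [
        {
            "title": {"fullTitle": "Sample Transatlantic A"},
            "destination": [{"code": "TRANSATLANTIC"}],
            "sailings": [
                {"pricing": [{"status": "SOLD_OUT"}, {"combinedPrice": 1200}]},
                {"pricing": [{"combinedPrice": 950}]},
            ],
        },
        {
            "title": {"fullTitle": "Sample Caribbean B"},
            "destination": [{"code": "CARIBBEAN"}],
            "sailings": [{"pricing": [{"combinedPrice": 500}]}],
        },
        {
            "title": {"fullTitle": "Sample Transatlantic C"},
            "destination": [{"code": "TRANSATLANTIC"}],
            "sailings": [{"pricing": [{"combinedPrice": 800}]}],
        },
    ]
}


def is_transatlantic(itinerary):
    return any(dest["code"] == "TRANSATLANTIC" for dest in itinerary["destination"])


def first_priced_option(sailing):
    # First pricing option that carries a price, or None if there is none.
    return next((opt for opt in sailing["pricing"] if "combinedPrice" in opt), None)


def cheapest_price(itinerary):
    options = (first_priced_option(sailing) for sailing in itinerary["sailings"])
    return min(opt["combinedPrice"] for opt in options if opt is not None)


def cheapest_transatlantic(response_json):
    matches = filter(is_transatlantic, response_json["itineraries"])
    priced = ((cheapest_price(it), it["title"]["fullTitle"]) for it in matches)
    return sorted(priced)  # cheapest first


results = cheapest_transatlantic(SAMPLE_RESPONSE)
for price, title in results:
    print("[€{}] - {}".format(price, title))
```

The Caribbean itinerary is dropped by the filter even though it is the cheapest overall, which is the client-side filtering step described earlier.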

For more information on logging your browser's network traffic, finding REST API endpoints (if they exist), and imitating requests, take a look at this other answer I posted to a similar question.

Paul M.