
I am trying to scrape a specific string from the webpage below:

https://www.booking.com/hotel/nl/scandic-sanadome-nijmegen.en-gb.html?checkin=2020-09-19;checkout=2020-09-20;i_am_from=nl;

The information I want from the page source is the serial number in the strings below (I can find them when I right-click and choose "View Page source"):

 name="nr_rooms_4377601_232287150_0_1_0" / name="nr_rooms_4377601_232287150_1_1_0"

I am using Puppeteer, and below is my code:

const puppeteer = require('puppeteer');
(async() => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    //await page.goto('https://example.com');
    const response = await page.goto("My-url-above");
    let bodyHTML = await page.evaluate(() => document.body.innerHTML);
    let outbodyHTML = await page.evaluate(() => document.body.outerHTML);
    console.log(await response.text());
    console.log(await page.content());
    await browser.close();
})()

But I cannot find the strings I am looking for in the output of response.text() or page.content().

Am I using the wrong methods on page?

How can I dump the actual page source, exactly as it appears when I right-click and choose "View Page source"?

Jia

2 Answers


If you investigate where these strings appear, you can see that they are in <select> elements with a specific class (.hprt-nos-select):

<select
  class="hprt-nos-select"
  name="nr_rooms_4377601_232287150_0_1_0"
  data-component="hotel/new-rooms-table/select-rooms"
  data-room-id="4377601"
  data-block-id="4377601_232287150_0_1_0"
  data-is-fflex-selected="0"
  id="hprt_nos_select_4377601_232287150_0_1_0"
  aria-describedby="room_type_id_4377601 rate_price_id_4377601_232287150_0_1_0 rate_policies_id_4377601_232287150_0_1_0"
>

You can wait until this element is loaded into the DOM; once it is, it will also be visible in the page source:

await page.waitForSelector('.hprt-nos-select', { timeout: 0 });
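
Once the selector has appeared, the serials can be read from the name attributes of those <select> elements. The following is only a minimal sketch: serialFromName and scrapeRoomSerials are hypothetical helper names, not part of Puppeteer, and it assumes every name follows the nr_rooms_<digits separated by underscores> pattern shown in the question. It uses the default waitForSelector timeout rather than timeout: 0, so it fails instead of hanging forever if the element never shows up.

```javascript
// Extract the serial from a name such as "nr_rooms_4377601_232287150_0_1_0".
// Returns null when the name does not follow that (assumed) pattern.
function serialFromName(name) {
  const match = /^nr_rooms_(\d+(?:_\d+)*)$/.exec(name);
  return match ? match[1] : null;
}

// Hypothetical wrapper: open the page, wait for the room selects,
// and collect the serials from their name attributes.
async function scrapeRoomSerials(url) {
  const puppeteer = require("puppeteer"); // loaded lazily, only when scraping
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url);
    await page.waitForSelector(".hprt-nos-select");
    // $$eval runs the callback in the browser on all matching elements.
    const names = await page.$$eval(".hprt-nos-select", (selects) =>
      selects.map((select) => select.getAttribute("name"))
    );
    return names.map(serialFromName).filter((serial) => serial !== null);
  } finally {
    await browser.close();
  }
}
```

Calling scrapeRoomSerials(url).then(console.log) would print the serials, provided the page actually renders the room table for the requested dates.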

However, your actual issue is that the URL you are visiting has some extra parameters (?checkin=2020-09-19;checkout=2020-09-20;i_am_from=nl;) which do not seem to be taken into account when the page is loaded by Puppeteer. If you take a full-page screenshot, you will see that the page still shows the default hotel search form without the specific hotel offers you are expecting.
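
One way to check this is to take the full-page screenshot and compare the URL you requested with the URL the page ended up on. The sketch below is an assumption-laden diagnostic, not a fix: parseQuery, droppedParams, diagnose and the booking.png file name are all hypothetical, and the query string is split on both "&" and ";" because the URL in the question uses ";". An empty result only shows that no redirect stripped the parameters; it does not prove the server honoured them.

```javascript
// Parse a query string that may use "&" or ";" as separators.
function parseQuery(url) {
  const query = new URL(url).search.replace(/^\?/, "");
  const params = {};
  for (const part of query.split(/[&;]/)) {
    if (!part) continue; // skip the empty piece left by a trailing ";"
    const [key, value = ""] = part.split("=");
    params[key] = value;
  }
  return params;
}

// Names of parameters that are missing or changed in the final URL.
function droppedParams(requestedUrl, finalUrl) {
  const wanted = parseQuery(requestedUrl);
  const got = parseQuery(finalUrl);
  return Object.keys(wanted).filter((key) => got[key] !== wanted[key]);
}

// Hypothetical diagnostic: save a full-page screenshot and report
// which parameters did not survive navigation.
async function diagnose(url) {
  const puppeteer = require("puppeteer"); // loaded lazily, only when diagnosing
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url);
    await page.screenshot({ path: "booking.png", fullPage: true });
    return droppedParams(url, page.url());
  } finally {
    await browser.close();
  }
}
```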

You should interact with the search form through Puppeteer (page.click() etc.) and set the dates and the origin country yourself to get the expected page content.

theDavidBarton
  • Yes, what you said is correct: Puppeteer is not taking my URL parameters into account, so the page does not contain the info I am actually looking for. @thedavidbarton, is there a way to make Puppeteer accept my URL parameters? – Jia Aug 28 '20 at 04:03
  • I am not sure if there is a way. Maybe it would work if you reused cookies from your manual page visits, but in that case you would need to do a lot of things manually as well. I suggest automating the whole process from start to end with user-like actions: select the dates with page.click. That way it will work. – theDavidBarton Aug 28 '20 at 06:55
  • One finding: when I disable headless mode with "const browser = await puppeteer.launch({ headless: false })", the URL parameters are still valid when I visit the page. But I don't know why yet. – Jia Aug 28 '20 at 07:05
  • If headful mode helps with the query params, you can use this shady npm package to make your headless Chrome act like a headful Chrome: https://www.npmjs.com/package/puppeteer-extra-plugin-stealth – theDavidBarton Aug 28 '20 at 15:36

It seems booking.com is blocking you. I strongly recommend using Puppeteer with the puppeteer-extra and puppeteer-extra-plugin-stealth packages, so that the website is less likely to detect that you are using headless Chromium or a web driver.

And after you go to the URL you need to wait until the page loads:

const puppeteer = require("puppeteer-extra");
const StealthPlugin = require("puppeteer-extra-plugin-stealth");

const { executablePath } = require("puppeteer");

puppeteer.use(StealthPlugin());

(async () => {
  const browser = await puppeteer.launch({
    headless: true,
    args: ["--no-sandbox", "--disable-setuid-sandbox", "--window-size=1600,900", "--single-process"],
    executablePath: executablePath(),
  });

  const page = await browser.newPage();
  await page.setViewport({
    width: 1280,
    height: 720,
  });
  const url = "https://www.booking.com/hotel/nl/scandic-sanadome-nijmegen.en-gb.html?checkin=2020-09-19;checkout=2020-09-20;i_am_from=nl";
  await page.goto(url);
  // wait until the element with id=hp_hotel_name has loaded
  await page.waitForSelector("#hp_hotel_name");

  // now you can do what you want

  await browser.close();
})();

As an alternative, to get all the info about the hotel you can use the hotels-scraper-js library. Your code would then be:

import { booking } from "hotels-scraper-js";

booking.getHotelInfo("https://www.booking.com/hotel/nl/scandic-sanadome-nijmegen.en-gb.html").then((result) => console.dir(result, { depth: null }));

The output will look like:

{
   "title":"Sanadome Nijmegen",
   "type":"Hotel",
   "stars":4,
   "preferredBadge":true,
   "subwayAccess":false,
   "sustainability":"",
   "address":"Weg door Jonkerbos 90, 6532 SZ Nijmegen, Netherlands",
   "highlights":[

   ],
   "description":"You're eligible for a Genius discount at Sanadome Nijmegen!"... and more description,
   "descriptionHighlight":"Couples particularly like the location — they rated it 8.3 for a two-person trip.",
   "descriptionSummary":"Sanadome Nijmegen has been welcoming Booking.com guests since 10 Jun 2010.",
   "facilities":["Indoor swimming pool", "Parking on site",... and more facilities],
   "areaInfo":[
      {
         "What's nearby":[
            {
               "place":"Goffertpark",
               "distance":"650 m"
            },
            ... and more nearby places
         ]
      },
      ... and other area info
   ],
   "link":"https://www.booking.com/hotel/nl/scandic-sanadome-nijmegen.en-gb.html",
   "photos":[
      "https://cf.bstatic.com/xdata/images/hotel/max1024x768/196181914.jpg?k=e37d21c8a403e920b868bcd7845dbca656d772bc114dc10473a76de52afc67bc&o=&hp=1",
      "https://cf.bstatic.com/xdata/images/hotel/max1024x768/225703925.jpg?k=0d4938ca6752057ba607d2fd7fb8cf95cec000770a68738b92ef3b6688e8a62e&o=&hp=1",
      ... and other photos
   ],
   "reviewsInfo":{
      "score":7.8,
      "scoreDescription":"Rated good",
      "totalReviews":823,
      "categoriesRating":[
         {
            "Staff":8.5
         },
         ... and other categories
      ],
      "reviews":[
         {
            "name":"Ewelina",
            "avatar":"https://cf.bstatic.com/static/img/review/avatars/ava-e/8d80ab6bf73fa873e990c76bfc96a1bf23708307.png",
            "country":"Poland",
            "date":"16 February 2023",
            "reting":"10",
            "review":[
               {
                  "liked":"very beautiful surroundings.  I love the peace and quiet around "
               }
            ]
         },
         ... and other reviews
      ]
   }
}
Mikhail Zub