
I have to parse the following HTML page:

I am parsing it with Fizzler. I want to get the title, the rates, the days (sometimes empty) and the price, meaning the second price, the one after the span. But when I run my code, it gets only 2 objects from listRoomDetails. The first block contains Room Type 1 promotion 10% and Room Type 2 promotion 60%, but the code skips Room Type 2 promotion 60% and jumps to the first element of the second listRoomDetails (Room Type 1 promotion 90%).

I want to keep all of the room types from both listRoomDetails divs.

Is there also a way to detect whether the days value exists, so that I get it if it does and ignore it otherwise?

//HTML File
<div class="ListItem">
     <div class="ListRoom">
          <span class="title">
             <strong>Super Room</strong>
          </span>
      </div>            

     //section to get details of room
     <div class="listRoomDetails">
        <table>
            <thead>
                <tr>
                    Days
                </tr>
            </thead>
            <tbody>
                <tr>
                    <td class = "rates">
                        Room Type 1 promotion 10%
                    </td>
                    <td class = "days">
                        261.00
                    </td>
                                        <td class = "days">

                    </td>
                    <td class="price">
                        <span>290.00&euro;</span>
                        261.00&euro; //get this money
                    </td>

                </tr>
                <tr>
                    <td class = "rates">
                        Room Type 2 promotion 60%
                    </td>
                                        <td class = "days">

                    </td>
                    <td class = "days">
                        261.00
                    </td>
                    <td class="price">
                        <span>290.00&euro;</span>
                        261.00&euro; // get this money
                    </td>

                </tr>
            </tbody>
        </table>
    </div>
    <div class="listRoomDetails">
        <table>
            <thead>
                <tr>
                    Days
                </tr>
            </thead>
            <tbody>
                <tr>
                    <td class = "rates">
                        Room Type 1 promotion 90%
                    </td>
                                         <td class = "days">

                    </td>
                    <td class = "rates">
                        261.00
                    </td>
                    <td class="price">
                        <span>290.00&euro;</span>
                        261.00&euro;
                    </td>
                </tr>
                <tr>
                    <td class = "rates">
                        Room Type 2 promotion 0 % // type of room
                    </td>
                    <td class = "days">
                        261.00
                    </td>
                    <td class="price">
                        <span>290.00&euro;</span>
                        261.00&euro;
                    </td>

                </tr>
            </tbody>
        </table>
        </div>
   </div>

Source Code:

        var source = File.ReadAllText("TestHtml/HotelWithAvailability.html");

        var html = new HtmlDocument(); // with HTML Agility pack
        html.LoadHtml(source);

        var doc = html.DocumentNode;

        var rooms = (from listR in doc.QuerySelectorAll(".ListItem")
                     from listR2 in doc.QuerySelectorAll("tbody")
                     select new HotelAvailability
                     {
                         HotelName = listR.QuerySelector(".title").InnerText.Trim(), //get room name

                         TypeRooms = listR2.QuerySelector("tr td.rates").InnerText.Trim(), //get room type

                         Price = listR2.QuerySelector("tr td.price").InnerText.Trim(), //get price

                     }).ToArray();
Sergey Berezovskiy
bluewonder

1 Answer


You should query the room details inside the current room (i.e. the current ListItem) instead of querying the whole document again:

var rooms = from r in doc.QuerySelectorAll(".ListItem")
            from rd in r.QuerySelectorAll(".listRoomDetails tbody tr")
            select new HotelAvailability {
                HotelName = r.QuerySelector(".title").InnerText.Trim(),
                TypeRooms = rd.QuerySelector(".rates").InnerText.Trim(),
                Price = rd.QuerySelector(".price span").InnerText.Trim()
             };

For your sample HTML it produces:

[
  {
    HotelName: "Super Room",
    Price: "290.00&euro;",
    TypeRooms: "Room Type 1 promotion 10%"
  },
  {
    HotelName: "Super Room",
    Price: "290.00&euro;",
    TypeRooms: "Room Type 2 promotion 60%"
  },
  {
    HotelName: "Super Room",
    Price: "290.00&euro;",
    TypeRooms: "Room Type 1 promotion 90%"
  },
  {
    HotelName: "Super Room",
    Price: "290.00&euro;",
    TypeRooms: "Room Type 2 promotion 0 % // type of room"
  }
]
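
If the price you need is the text after the `<span>`, note that a CSS selector can only match elements, never a text node, so that text has to be read from the child nodes of the cell. Below is a minimal sketch, assuming HtmlAgilityPack with the Fizzler extension methods; `RoomRowParser`, `GetFinalPrice` and `GetDays` are hypothetical helper names, not part of either library. It takes the last non-blank text node directly under `td.price` (which also covers cells with a single price and no span) and the first non-blank `td.days` value, and it returns null when nothing is found instead of throwing.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;                  // NuGet package: HtmlAgilityPack
using Fizzler.Systems.HtmlAgilityPack;  // NuGet package: Fizzler.Systems.HtmlAgilityPack

// Hypothetical helper class for illustration only.
public static class RoomRowParser
{
    // Returns the last non-blank text node that is a direct child of td.price,
    // i.e. the text after the optional <span>. Returns null if there is no
    // price cell or the cell holds no direct text.
    public static string GetFinalPrice(HtmlNode row)
    {
        var cell = row.QuerySelector(".price");
        if (cell == null)
            return null;

        return cell.ChildNodes
            .Where(n => n.NodeType == HtmlNodeType.Text)
            .Select(n => HtmlEntity.DeEntitize(n.InnerText).Trim())
            .LastOrDefault(t => t.Length > 0);
    }

    // Returns the first non-blank td.days value in the row,
    // or null if the row has no days cell with content.
    public static string GetDays(HtmlNode row)
    {
        return row.QuerySelectorAll(".days")
            .Select(n => n.InnerText.Trim())
            .FirstOrDefault(t => t.Length > 0);
    }
}
```

In the query above this could be used as `Price = RoomRowParser.GetFinalPrice(rd)` and, assuming you add a `Days` property to `HotelAvailability`, as `Days = RoomRowParser.GetDays(rd)`. With the sample HTML exactly as posted, the price text would also include the `//get this money` remark, since that remark is plain text inside the cell.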
Sergey Berezovskiy
  • Thank you very much for your guidance, but I got a null error at `.price span`, and actually my idea is to get the price outside of the span tag, which in this case is 261.00&euro;. When I try `.price` only, it gets both prices. Can I try any other way? Thanks – bluewonder Mar 20 '14 at 15:33
  • @bluewonder Can you give a sample row which gives you error for Null? – Sergey Berezovskiy Mar 20 '14 at 15:36
  • @bluewonder for getting price outside of span you can use `rd.QuerySelector(".price:last-child").InnerText.Trim()` – Sergey Berezovskiy Mar 20 '14 at 15:39
  • Yes, I tried both cases, and it has a null error here: `Price = rd.QuerySelector(".price span").InnerText.Trim()`, and also with `.price:last-child`. I'm checking it again, thank you @Sergey Berezovskiy – bluewonder Mar 20 '14 at 15:41
  • @bluewonder Can you give a sample row which gives you error for Null? – Sergey Berezovskiy Mar 20 '14 at 15:41
  • I'm sorry, I don't quite understand what you mean by sample row. Do you mean the td tag? @Sergey Berezovskiy – bluewonder Mar 20 '14 at 15:43
  • @bluewonder actually I mean `tr`, but a `td` with the price will also be fine – Sergey Berezovskiy Mar 20 '14 at 15:45
  • I tried again and it worked for `.price`, but that is the price inside the span tag, so for the one outside I tried the Fizzler last-child as you said, but again it has a null error @Sergey Berezovskiy – bluewonder Mar 20 '14 at 15:49
  • @bluewonder again, without seeing the HTML which causes the error, I cannot say what went wrong. Both options work fine with your sample HTML – Sergey Berezovskiy Mar 20 '14 at 15:51
  • Thank you. In the HTML file, sometimes the price is fixed and has no reduction, so instead of 290.00€261.00€ the cell sometimes contains just one price without the span. For this reason, I'm not sure `.price:last-child` is going to work in this case. Thank you @Sergey Berezovskiy – bluewonder Mar 21 '14 at 09:27
  • @bluewonder if there is no span, then the text node will still be the last child. It will be the first child at the same time, but this does not matter – Sergey Berezovskiy Mar 21 '14 at 10:15
  • I still have no idea why I get both prices when I use `.price:last-child`. Could you try both with my sample HTML? @Sergey Berezovskiy – bluewonder Mar 21 '14 at 10:31
  • Hi @Sergey Berezovskiy, I don't know why, but now in my example `.price:last-child` doesn't work and gives a null error – bluewonder Mar 21 '14 at 15:49