This is partly a learning exercise and partly for fun. I am trying to parse the price of a 'Balcony' stateroom (currently $1039) in a C# console application. The URL is:

http://www.carnival.com/BookingEngine/Stateroom/Stateroom2/?embkCode=PCV&itinCode=SC0&durDays=8&shipCode=SH&subRegionCode=CS&sailDate=08082015&sailingID=68791&numGuests=2&showDbl=False&isOver55=N&isPastGuest=N&stateCode=&isMilitary=N&evsel=&be_version=1

The URL above loads fine with:

var document = getHtmlWeb.Load(web_address);

The container for the Balcony prices is a div with class 'col'; it is the third div inside the element with class 'column-container clearfix'. I thought all I would need was to find all divs with class 'col-bottom':

var lowest_price = document.DocumentNode.SelectNodes("//div[@class='col-bottom']");

and then select the third node to get to the Balcony prices. But the lowest_price variable keeps returning null. I know that the document itself is loaded, and I can see inside the 'col' div if I select 'col'. Is the hyphen in col-bottom preventing that div from being found?

Is there an alternative way to get to this? As I said, it is mostly a learning exercise, but I also have to create some custom monitoring solutions that require screen scraping, so it is not all just fun.

Thanks!

EDIT: HTML snippet containing the relevant info:

    <div class="col">
      <h2 data-cat-title="Balcony" class="uk-rate-title-OB"> Balcony </h2>   <p>&nbsp;</p>
        <div class="col-bottom">
        <h3> From</h3>
         <strong> $1,039.00* <span class="rate-compare-strike"> </span> </strong><a metacode="OB" href="#" class="select-btn">Select</a> </div>
    </div>
IrfanClemson
  • Can you just take a snippet of the HTML and include that in the question, instead of pointing to a link that will be dead in a month? – John Koerner Mar 27 '15 at 13:51
  • Ok. Done. Too much info in other divs. – IrfanClemson Mar 27 '15 at 13:58
  • Maybe this: http://stackoverflow.com/questions/5139564/ ? Something in the parsing might be renaming the attribute somehow. You could find out for sure by loading the HTML, then writing code to walk through it to see if the Load() changed the dashes some other way. – Moby Disk Mar 27 '15 at 15:05
  • Thanks. Will look into it; just doesn't make sense because loading would be the same as if loaded inside a browser? If that is the case then all Classes/IDs with dashes would suffer in all applications? – IrfanClemson Mar 27 '15 at 15:08

2 Answers

Nothing is wrong with hyphens in attribute names or values; that is valid HTML. The problem with your source is that the page uses JavaScript on the client to render the HTML. To verify this, download the HTML page yourself and you will notice that the elements you are looking for do not exist in it.
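To illustrate the point about hyphens, here is a minimal sketch that runs the XPath against the static snippet from the question instead of the live page. The names `BalconyPriceSketch`, `SampleHtml`, and `FindPrice` are hypothetical and exist only for this illustration; the price parsing (a regex plus `decimal.Parse` with the invariant culture) is an assumption about the text format seen in the snippet, not something the site guarantees.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

// Hypothetical helper, for illustration only.
public static class BalconyPriceSketch
{
    // The snippet from the question, used as static input (no network access).
    public const string SampleHtml = @"
<div class=""col"">
  <h2 data-cat-title=""Balcony"" class=""uk-rate-title-OB""> Balcony </h2> <p>&nbsp;</p>
  <div class=""col-bottom"">
    <h3> From</h3>
    <strong> $1,039.00* <span class=""rate-compare-strike""> </span> </strong>
    <a metacode=""OB"" href=""#"" class=""select-btn"">Select</a>
  </div>
</div>";

    // Returns the price for the given category title,
    // or null when the expected markup is not in the HTML.
    public static decimal? FindPrice(string html, string categoryTitle)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // The hyphenated class name is matched like any other attribute value.
        var xpath = "//div[@class='col'][h2[@data-cat-title='" + categoryTitle + "']]"
                  + "/div[@class='col-bottom']/strong";

        // SelectSingleNode returns null when nothing matches,
        // which is what happens with the raw, unrendered page.
        var node = doc.DocumentNode.SelectSingleNode(xpath);
        if (node == null) return null;

        // Assumed text format: a dollar sign followed by digits, commas and decimals.
        var match = Regex.Match(node.InnerText, @"\$\s*([\d,]+(?:\.\d+)?)");
        if (!match.Success) return null;

        return decimal.Parse(match.Groups[1].Value, NumberStyles.Number, CultureInfo.InvariantCulture);
    }
}
```

Calling `BalconyPriceSketch.FindPrice(BalconyPriceSketch.SampleHtml, "Balcony")` finds the col-bottom div and yields 1039.00, while passing HTML that lacks the div yields null. If the same call returns null for the downloaded page, the markup is missing from the response rather than being hidden by the hyphen.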

To parse pages where JavaScript needs to be executed first, you could use a web browser control and then pass the rendered HTML to HAP (Html Agility Pack).

Here is a simple example on how to use the WinForms web browser control:

    private void ParseSomeHtmlThatRenderedJavascript()
    {
        var browser = new System.Windows.Forms.WebBrowser() { ScriptErrorsSuppressed = true };

        string link = "yourLinkHere";

        // Called when the web page has loaded. In real code this would better be
        // a class member; it is local here only to keep the demonstration short.
        WebBrowserDocumentCompletedEventHandler onDocumentCompleted = (s, evt) =>
        {
            // Do your HTML parsing here
            var doc = new HtmlAgilityPack.HtmlDocument();
            doc.LoadHtml(browser.DocumentText);
            var someNode = doc.DocumentNode.SelectNodes("yourxpathHere");
        };

        // Subscribe to the DocumentCompleted event with the handler above before navigating
        browser.DocumentCompleted += onDocumentCompleted;

        browser.Navigate(link);
    }

You could also take a look at Awesomium and other embedded web browser controls.

If you want to run the WebBrowser control in a console app, here is a sample; if it is hard to follow, use Windows Forms instead. The sample is based on this SO answer: WebBrowser Control in a new thread

    using System;
    using System.Text;
    using System.Threading;
    using System.Threading.Tasks;
    using System.Windows.Forms;
    using HtmlAgilityPack;
    namespace ConsoleApplication276
    {

        // a container for a url and a parser Action
        public class Link
        {
            public string link { get; set; }
            public Action<string> parser { get; set; }
        }

        public class Program
        {

            // Entry Point of the console app
            public static void Main(string[] args)
            {
                try
                {
                    // Download each page and hand its content to a parser.
                    // You can add more links here and associate each one with a parser action.
                    // For any data the parser should produce, add a property to the Link container.

                    var task = MessageLoopWorker.Run(DoWorkAsync, new Link() { 
                        link = "google.com", 
                        parser = (string html) => {

                            // do whatever you need with HAP here
                            var doc = new HtmlAgilityPack.HtmlDocument();
                            doc.LoadHtml(html);
                            var someNodes = doc.DocumentNode.SelectSingleNode("//div");

                        } });


                    task.Wait();
                    Console.WriteLine("DoWorkAsync completed.");
                }
                catch (Exception ex)
                {
                    Console.WriteLine("DoWorkAsync failed: " + ex.Message);
                }

                Console.WriteLine("Press Enter to exit.");
                Console.ReadLine();
            }

            // navigate WebBrowser to the list of urls in a loop
            public static async Task<Link> DoWorkAsync(Link[] args)
            {
                Console.WriteLine("Start working.");

                using (var wb = new WebBrowser())
                {
                    wb.ScriptErrorsSuppressed = true;

                    TaskCompletionSource<bool> tcs = null;
                    WebBrowserDocumentCompletedEventHandler documentCompletedHandler = (s, e) =>
                        tcs.TrySetResult(true);

                    // navigate to each URL in the list
                    foreach (var arg in args)
                    {
                        tcs = new TaskCompletionSource<bool>();
                        wb.DocumentCompleted += documentCompletedHandler;
                        try
                        {
                            wb.Navigate(arg.link.ToString());
                            // await for DocumentCompleted
                            await tcs.Task;
                            // after the page loads pass the html to the parser 
                            arg.parser(wb.DocumentText);
                        }
                        finally
                        {
                            wb.DocumentCompleted -= documentCompletedHandler;
                        }
                        // the DOM is ready
                        Console.WriteLine(arg.link.ToString());
                        Console.WriteLine(wb.Document.Body.OuterHtml);
                    }
                }

                Console.WriteLine("End working.");
                return null;
            }

        }

        // a helper class to start the message loop and execute an asynchronous task
        public static class MessageLoopWorker
        {
            public static async Task<Object> Run(Func<Link[], Task<Link>> worker, params Link[] args)
            {
                var tcs = new TaskCompletionSource<object>();

                var thread = new Thread(() =>
                {
                    EventHandler idleHandler = null;

                    idleHandler = async (s, e) =>
                    {
                        // handle Application.Idle just once
                        Application.Idle -= idleHandler;

                        // return to the message loop
                        await Task.Yield();

                        // and continue asynchronously
                        // propagate the result or exception
                        try
                        {
                            var result = await worker(args);
                            tcs.SetResult(result);
                        }
                        catch (Exception ex)
                        {
                            tcs.SetException(ex);
                        }

                        // signal to exit the message loop
                        // Application.Run will exit at this point
                        Application.ExitThread();
                    };

                    // handle Application.Idle just once
                    // to make sure we're inside the message loop
                    // and SynchronizationContext has been correctly installed
                    Application.Idle += idleHandler;
                    Application.Run();
                });

                // set STA model for the new thread
                thread.SetApartmentState(ApartmentState.STA);

                // start the thread and await for the task
                thread.Start();
                try
                {
                    return await tcs.Task;
                }
                finally
                {
                    thread.Join();
                }
            }
        }
    }
Xi Sigma

The answer by @Decoherence did not work, as can be seen from the chats above. Using his code, I still ended up with col-bottom as null. So I ended up using the following URL instead: http://www.icruise.com/8-night-southern-caribbean-cruise_carnival-sunshine_8-8-2015.html?refPage=src

I am able to parse that page fine. I could take up parsing the URL in the question as a learning exercise or challenge later. Others are also welcome to take up the challenge!

FYI.

IrfanClemson