
I have a new project in an area I am not familiar with. One task requires me to navigate some websites and collect data from them. A sample website would be this: https://www.hudhomestore.com/Home/Index.aspx


I have read and watched tutorials on "collecting" data from a web page, such as:

But my question is: how do we usually set search preferences so that the site "searches" based on them, and then use the above links to load the results in my code?

EDIT

This is correct for setting the search criteria based on my selection. However, the total count of the search (if I do it manually for the state MI) is 223, but if I execute the code below, tdNodeCollection contains only 121 nodes. Can you show me where I am going wrong?

    HtmlWeb web = new HtmlWeb();
    HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();

    string zipCode = "", city = "", county = "", street = "", sState = "MI", fromPrice = "0", toPrice = "0", fcaseNumber = "",
           bed = "0", bath = "0", buyerType = "0", Status = "0", indoorAmenities = "", outdoorAmenities = "", housingType = "",
           stories = "", parking = "", propertyAge = "", sLanguage = "ENGLISH";

    var doc = await (Task.Factory.StartNew(() => web.Load("https://www.hudhomestore.com/Listing/PropertySearchResult.aspx?" +
        "zipCode=" + zipCode + "&city=" + city + "&county=" + county + "&street=" + street + "&sState=" + sState +
        "&fromPrice=" + fromPrice + "&toPrice=" + toPrice +
        "&fcaseNumber=" + fcaseNumber + "&bed=" + bed + "&bath=" + bath +
        "&buyerType=" + buyerType + "&Status=" + Status + "&indoorAmenities=" + indoorAmenities +
        "&outdoorAmenities=" + outdoorAmenities + "&housingType=" + housingType + "&stories=" + stories +
        "&parking=" + parking + "&propertyAge=" + propertyAge + "&sLanguage=" + sLanguage)));

    HtmlNodeCollection tdNodeCollection = doc
                             .DocumentNode
                             .SelectNodes("//*[@id=\"dgPropertyList\"]//tr//td");
Khalil Khalaf
  • Can you explain a bit about "set preferences to search"? – M. Adeel Khalid Feb 07 '17 at 03:57
  • In theory, each search criterion represents a key/value pair in the database. In this particular example the form is submitted using the GET method, so the search criteria are passed as query strings in the URL; a template then displays the results retrieved from the DB – Sergio Alen Feb 07 '17 at 03:57
  • Hi @MAdeelKhalid yes of course. For example, in my application, I would like to ask the user which state they would like to view, and then display the result to them. So how can I "query" the website with a specific "filter", then go to the result page and parse that page in my code? – Khalil Khalaf Feb 07 '17 at 03:58
  • @SergioAlen Is it doable in my case? To query their DB from my application and retrieve results? – Khalil Khalaf Feb 07 '17 at 04:01
  • @SergioAlen is talking about a web service or the MVC pattern, and I think your question is about something else, right? – M. Adeel Khalid Feb 07 '17 at 04:03
  • yes, the form needs to post to action="results-template.aspx", in that template you would have your code to query the database – Sergio Alen Feb 07 '17 at 04:04
  • @MAdeelKhalid my question is like, how can I start from [this page](https://www.hudhomestore.com/Home/Index.aspx), and reach [this page](https://www.hudhomestore.com/Listing/PropertySearchResult.aspx?zipCode=&city=&county=&street=&sState=AK&fromPrice=0&toPrice=0&fcaseNumber=&bed=0&bath=0&buyerType=0&Status=0&indoorAmenities=&outdoorAmenities=&housingType=&stories=&parking=&propertyAge=&sLanguage=ENGLISH), to then parse the results? Notice that I have selected a state and pressed on "search" – Khalil Khalaf Feb 07 '17 at 04:05
  • @SergioAlen I am sorry I did not follow that, can you explain more about _post to action_? And what is a template? – Khalil Khalaf Feb 07 '17 at 04:08
  • You are talking about crawling a whole website. That's the only way you can go through one page to another. – M. Adeel Khalid Feb 07 '17 at 04:09
  • @MAdeelKhalid Could you show me how? Or recommend to take a look somewhere? – Khalil Khalaf Feb 07 '17 at 05:09
  • I'll write an answer having suggestions of what you can do, can you tell me, are you developing a desktop or web app? – M. Adeel Khalid Feb 07 '17 at 06:03
  • @MAdeelKhalid I am developing on Desktop, I started a WPF and a WFA and tried [this solution](https://www.youtube.com/watch?v=4cPPD-MFadQ) so far, which did not succeed; as my `nodes` is junk if I use this Xpath `//*[@id=\"dgPropertyList\"]//tr//td` in this web `https://www.hudhomestore.com/Listing/PropertySearchResult.aspx?sState=MI&sLanguage=ENGLISH` – Khalil Khalaf Feb 07 '17 at 06:40

1 Answer


You can use HtmlAgilityPack for this purpose. I've written a small test snippet and tested it against the second page you wish to scrape, using search criteria that you can set.

        HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
        HtmlWeb web = new HtmlWeb();
        //string InitialUrl = "https://www.hudhomestore.com/Home/Index.aspx";
        //Here you need to set the values of these variable to whatever user inputs
        //after setting these values, add them to initial URL
        string zipCode = "", city = "", county = "", street = "", sState = "AK", fromPrice = "0", toPrice = "0", fcaseNumber = "",
               bed = "0", bath = "0", buyerType = "0", Status = "0", indoorAmenities = "", outdoorAmenities = "", housingType = "",
               stories = "", parking = "", propertyAge = "", sLanguage = "ENGLISH";
        HtmlAgilityPack.HtmlDocument document = web.Load("https://www.hudhomestore.com/Listing/PropertySearchResult.aspx?" +
            "zipCode=" + zipCode + "&city=" + city + "&county=" + county + "&street=" + street + "&sState=" + sState +
            "&fromPrice=" + fromPrice + "&toPrice=" + toPrice +
            "&fcaseNumber=" + fcaseNumber + "&bed=" + bed + "&bath=" + bath + 
            "&buyerType=" + buyerType + "&Status=" + Status + "&indoorAmenities=" + indoorAmenities + 
            "&outdoorAmenities=" +outdoorAmenities + "&housingType=" + housingType + "&stories=" + stories + 
            "&parking=" + parking + "&propertyAge=" + propertyAge + "&sLanguage=" + sLanguage);
        HtmlNodeCollection tdNodeCollection = document
                                 .DocumentNode
                                 .SelectNodes("//*[@id=\"dgPropertyList\"]//tr//td");
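
Because the criteria eventually come from user input, it may be worth building the query string from key/value pairs and percent-encoding each value, instead of concatenating raw strings. The following is only a sketch: `SearchUrlBuilder` and `Build` are hypothetical names introduced here, and which parameters the site actually requires is not established in this thread, so passing the full set used above is the safer assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper (not part of HtmlAgilityPack or of the site).
public static class SearchUrlBuilder
{
    // Results page address taken from the code above.
    private const string BaseUrl =
        "https://www.hudhomestore.com/Listing/PropertySearchResult.aspx";

    // Builds "BaseUrl?key1=value1&key2=value2", keeping the order of the
    // given pairs and percent-encoding keys and values. A null value is
    // sent as an empty string.
    public static string Build(IEnumerable<KeyValuePair<string, string>> criteria)
    {
        IEnumerable<string> pairs = criteria.Select(kv =>
            Uri.EscapeDataString(kv.Key) + "=" + Uri.EscapeDataString(kv.Value ?? ""));
        return BaseUrl + "?" + string.Join("&", pairs);
    }
}
```

Calling `Build` with the pairs `city` = `Ann Arbor`, `sState` = `MI`, `sLanguage` = `ENGLISH` yields `...PropertySearchResult.aspx?city=Ann%20Arbor&sState=MI&sLanguage=ENGLISH`, which can then be passed to `web.Load`.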

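Count them again and look at your expression: there are exactly 121 td's inside the tr rows under the element with id="dgPropertyList". The expression selects table cells, not listings, so its count need not equal the number of search results. Next, inspect a td manually, work out what you need from it, and fetch that data.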

            // SelectNodes returns null when nothing matches, so guard before iterating.
            if (tdNodeCollection != null)
            {
                foreach (HtmlAgilityPack.HtmlNode node in tdNodeCollection)
                {
                    // Example only: look for <h2> elements inside the current <td>.
                    // Each of these is null when the <td> contains no matching <h2>.
                    HtmlNode h2Node = node.SelectSingleNode("./h2");           // first <h2> that is a direct child
                    HtmlNodeCollection allH2Nodes = node.SelectNodes(".//h2"); // every <h2> at any depth

                    // You can also look at the children without using XPath (like in a tree):
                    HtmlNode h2Node_ = node.ChildNodes["h2"];
                }
            }

I've tested the code: it works and parses the whole document to reach the required table. It will get you the cells of all the rows within that table inside the div, so you can dig further into these rows, find your td, and get what you need.
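
To dig into the rows for the link of each item, one possible approach is to select the tr elements and then collect the `href` of every anchor inside each row. This is a minimal sketch that needs the HtmlAgilityPack package; `ListingLinks` and `ExtractRowLinks` are hypothetical names, and the markup of the real results page is an assumption, so the XPath may need adjusting after inspecting the actual td contents.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper for illustration only.
public static class ListingLinks
{
    // Returns, for each <tr> under the element with the given id, the href
    // values of the <a> elements found in that row. Rows without any link
    // (for example header rows) are skipped.
    public static List<List<string>> ExtractRowLinks(string html, string tableId)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<List<string>>();

        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection rows =
            doc.DocumentNode.SelectNodes("//*[@id='" + tableId + "']//tr");
        if (rows == null) return result;

        foreach (HtmlNode row in rows)
        {
            HtmlNodeCollection anchors = row.SelectNodes(".//a[@href]");
            if (anchors == null) continue;

            var links = new List<string>();
            foreach (HtmlNode anchor in anchors)
            {
                // Decode HTML entities such as &amp; in the attribute value.
                links.Add(HtmlEntity.DeEntitize(anchor.GetAttributeValue("href", "")));
            }
            result.Add(links);
        }
        return result;
    }
}
```

With the document loaded as above, `ListingLinks.ExtractRowLinks(document.DocumentNode.OuterHtml, "dgPropertyList")` would return one list of links per row. If the page nests tables inside cells, `//tr` also matches the inner rows, so some links may be reported more than once.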

Another option is Selenium WebDriver; see "Get your hands on Selenium".

If you don't want the browser to be visible but still want Selenium-like functionality, you can use PhantomJS.

Hope it helps.

M. Adeel Khalid
  • Awesome, thank you! Can you see my edit? Why is it only 121? Could you please debug it with me? Also, how can I dig into each node to retrieve the link of each item? Can I just search the `InnerHTML` string of each node? – Khalil Khalaf Feb 07 '17 at 08:00
  • Look at the modified answer. – M. Adeel Khalid Feb 07 '17 at 08:19
  • I think you answered my question so I accepted your answer, however I am unfamiliar with any of this `h2` and `tr` stuff, so I will try another approach. I have another question if you can help: http://stackoverflow.com/questions/42084130/how-to-click-a-button-on-a-web-page-that-has-no-id?noredirect=1#comment71338786_42084130 And thanks again Adeel! – Khalil Khalaf Feb 07 '17 at 08:30
  • And by the way when I execute your new code, `h2Node`, `allH2Nodes` and `h2Node_` are always `null` – Khalil Khalaf Feb 07 '17 at 08:32
  • That was just an example of fetching the `h2` tag within that node, nothing else. There might be no `h2` within that `td` node. It's a pleasure for me if I can help. :) – M. Adeel Khalid Feb 07 '17 at 08:33