
I am trying to scrape a website written in PHP to extract some information from a particular table. Here is the scenario.

The landing page has a form that takes search criteria from the user and returns matching results. If I leave the fields empty and click "Submit", it returns the full result set, which is what I am interested in. Before I knew about the `HttpWebRequest` class, I was simply passing the URL to the `HtmlWeb.Load(URL)` method in the HtmlAgilityPack library, which obviously was not the way to go.

Then I searched for `HttpWebRequest` and found an example like this:

    Dim cookies As New CookieContainer
    Dim postData As String = "postData obtained using the Live HTTP Headers plug-in in Firefox"
    Dim encoding As New UTF8Encoding
    Dim byteData As Byte() = encoding.GetBytes(postData)


    Dim postRequest As HttpWebRequest = DirectCast(WebRequest.Create("URL"), HttpWebRequest)
    postRequest.Method = "POST"
    postRequest.KeepAlive = True
    postRequest.CookieContainer = cookies
    postRequest.ContentType = "application/x-www-form-urlencoded"
    postRequest.ContentLength = byteData.Length
    postRequest.Referer = "Referer Page"
    postRequest.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 6.1; ru; rv:1.9.2.3) Gecko/20100401 Firefox/4.0 (.NET CLR 3.5.30729)"

    Dim postreqstream As Stream = postRequest.GetRequestStream()
    postreqstream.Write(byteData, 0, byteData.Length)
    postreqstream.Close()
    Dim postresponse As HttpWebResponse

    postresponse = DirectCast(postRequest.GetResponse(), HttpWebResponse)
    cookies.Add(postresponse.Cookies)
    Dim postreqreader As New StreamReader(postresponse.GetResponseStream())

    Dim thepage As String = postreqreader.ReadToEnd

Now when I output the `thepage` variable to a WebBrowser control in a VB form, I can see the page that I want (containing the tables). At this point I simply passed the URL of that page to HtmlAgilityPack like so:

    Dim web As New HtmlAgilityPack.HtmlWeb()
    Dim htmlDoc As HtmlAgilityPack.HtmlDocument = web.Load("URL")
    Dim tabletag As HtmlNodeCollection = htmlDoc.DocumentNode.SelectNodes("//table")
    Dim tablenode As HtmlNode = htmlDoc.DocumentNode.SelectSingleNode("//table[@summary='List of services']")

    If Not tabletag Is Nothing Then

        Console.WriteLine("YES")

    End If

But the `tabletag` variable is Nothing. Where am I going wrong? Also, is there any way to get the URL straight from the `HttpWebResponse` so I can pass it to the `web.Load` method?

Thank you.

Rob Schneider
  • I realised the problem is with the scripts running on that page. The WebBrowser shows the page after the scripts have finished, but the text box shows the HTML as it was before they ran, and that's why it doesn't have the tables. Now the question is: how can I wait for the scripts to run and then read the HTML? – Rob Schneider Jul 16 '12 at 20:27
  • "when I output thepage variable to a browser": if you output the value of `thepage` to a text file and examine it, does that contain the table? – Andrew Morton May 29 '13 at 19:45

1 Answer


If the content you want is built by JavaScript, you can't get it through the HtmlAgilityPack `Load` method or any simple HTTP client such as `WebRequest`. They only download the raw HTML; they don't execute scripts or interact with the page the way a browser does. If the table were present in the raw response, you could load it directly from your stream like this:

    Dim htmlDoc As New HtmlAgilityPack.HtmlDocument
    htmlDoc.Load(postresponse.GetResponseStream())
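
One caveat: in the question's code the response stream has already been consumed by `ReadToEnd`, so it cannot be read a second time. A minimal sketch, assuming the table really is present in the raw response, parses the string instead with `LoadHtml`. The HTML literal below is a made-up stand-in for `thepage`, not real output from the site:

```vbnet
Imports System
Imports HtmlAgilityPack

Module ParseFromString
    Sub Main()
        ' Hypothetical stand-in for the string read from the POST response (thepage).
        Dim thepage As String =
            "<html><body><table summary='List of services'>" &
            "<tr><td>Service A</td></tr></table></body></html>"

        Dim htmlDoc As New HtmlDocument()
        htmlDoc.LoadHtml(thepage) ' parse the in-memory string; no second HTTP request

        Dim tableNode As HtmlNode =
            htmlDoc.DocumentNode.SelectSingleNode("//table[@summary='List of services']")

        If tableNode Is Nothing Then
            ' This is the branch you will hit when the table is injected by script.
            Console.WriteLine("Table not found in the raw HTML")
        Else
            ' SelectNodes returns Nothing (not an empty collection) when nothing matches.
            Dim cells As HtmlNodeCollection = tableNode.SelectNodes(".//td")
            If cells IsNot Nothing Then
                For Each cell As HtmlNode In cells
                    Console.WriteLine(cell.InnerText.Trim())
                Next
            End If
        End If
    End Sub
End Module
```

Passing the URL to `HtmlWeb.Load` instead would issue a fresh GET without the posted form data or cookies, which is probably why `tabletag` came back as Nothing.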

First suggestion: load the form page URL in a WebBrowser control, then fill in the form and click the submit button programmatically by accessing the `HtmlDocument` through the DOM. More info in posts like this and this.
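
A rough sketch of the first suggestion, under some loud assumptions: the URL and the element id `submit` are placeholders (inspect the real page for the actual id or name), and the form is assumed to post back with a normal navigation rather than AJAX:

```vbnet
Imports System
Imports System.Windows.Forms

' Sketch only: the URL and the element id "submit" are placeholders.
Public Class ScrapeForm
    Inherits Form

    Private WithEvents browser As New WebBrowser()
    Private submitted As Boolean = False

    Public Sub New()
        browser.Dock = DockStyle.Fill
        browser.ScriptErrorsSuppressed = True
        Controls.Add(browser)
        browser.Navigate("http://example.com/search.php")
    End Sub

    Private Sub browser_DocumentCompleted(sender As Object,
            e As WebBrowserDocumentCompletedEventArgs) Handles browser.DocumentCompleted
        ' Frames raise their own completion events; react to the top-level page only.
        If Not e.Url.Equals(browser.Url) Then Return

        If Not submitted Then
            ' First load: the search form. Click submit with the fields left empty.
            submitted = True
            Dim button As HtmlElement = browser.Document.GetElementById("submit")
            If button IsNot Nothing Then button.InvokeMember("click")
        Else
            ' Second load: the results page. Body.OuterHtml reflects the live DOM,
            ' unlike DocumentText, which holds the HTML as originally downloaded.
            Dim tables As HtmlElementCollection = browser.Document.GetElementsByTagName("table")
            Console.WriteLine("Tables in the live DOM: " & tables.Count.ToString())
            Dim renderedHtml As String = browser.Document.Body.OuterHtml
            Console.WriteLine("Rendered HTML length: " & renderedHtml.Length.ToString())
        End If
    End Sub
End Class

Module Program
    <STAThread>
    Sub Main()
        Application.Run(New ScrapeForm())
    End Sub
End Module
```

`DocumentCompleted` does not guarantee that every script has finished. If the table is filled in asynchronously, you may need to re-check the DOM on a timer until the table appears. The `renderedHtml` string can then be handed to HtmlAgilityPack via `HtmlDocument.LoadHtml`.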

Second suggestion: WebBrowser gets a little tricky to handle when you don't want a visual, event-driven control on screen or, in the worst case, when you want to manipulate web pages in background threads. In that case, you can use the STAThread solution here and here, or use a headless browser or browser-automation tool such as Selenium, HtmlUnit, or WatiN and do the same through its DOM access.
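
For the background-thread case, one possible shape of the STA-thread approach is sketched below. It is not taken from the linked posts; the helper name `LoadRenderedHtml` and the URL are made up for illustration. The idea is to create the WebBrowser on a dedicated STA thread and pump a message loop there until the page has loaded:

```vbnet
Imports System
Imports System.Threading
Imports System.Windows.Forms

Module BackgroundBrowser
    ' Hypothetical helper: loads a URL in an off-screen WebBrowser on its own
    ' STA thread and returns the HTML of the DOM once the page has loaded.
    Function LoadRenderedHtml(url As String) As String
        Dim result As String = Nothing

        Dim worker As New Thread(
            Sub()
                Dim browser As New WebBrowser()
                Try
                    browser.ScriptErrorsSuppressed = True
                    AddHandler browser.DocumentCompleted,
                        Sub(sender As Object, e As WebBrowserDocumentCompletedEventArgs)
                            Dim wb As WebBrowser = DirectCast(sender, WebBrowser)
                            ' Skip completion events raised by frames.
                            If Not e.Url.Equals(wb.Url) Then Return
                            If wb.Document IsNot Nothing AndAlso wb.Document.Body IsNot Nothing Then
                                result = wb.Document.Body.OuterHtml
                            End If
                            Application.ExitThread() ' stop the message loop below
                        End Sub
                    browser.Navigate(url)
                    Application.Run() ' message loop required by the control
                Finally
                    browser.Dispose()
                End Try
            End Sub)

        worker.SetApartmentState(ApartmentState.STA) ' WebBrowser requires an STA thread
        worker.Start()
        worker.Join()
        Return result
    End Function

    Sub Main()
        ' Placeholder URL for illustration.
        Dim html As String = LoadRenderedHtml("http://example.com/")
        Console.WriteLine(If(html Is Nothing, "No document loaded", "Got " & html.Length.ToString() & " characters"))
    End Sub
End Module
```

This sketch blocks the caller until the page completes and has no timeout, so a page that never finishes loading would hang it; a real implementation would want a timer or cancellation path. As with the visible control, scripts that populate the table asynchronously may still be running when `DocumentCompleted` fires.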

Community
natenho