2

I'm trying to scrape a web page using C#. After the page loads, however, it executes some JavaScript that loads more elements into the DOM, and I need to scrape those elements too. A standard scraper simply grabs the HTML of the page on load and doesn't pick up the DOM changes made by JavaScript. How can I wait a second or two and then grab the resulting source?

Here is my current code:

// Requires: using System; using System.IO; using System.IO.Compression;
//           using System.Net; using System.Text;
private string ScrapeWebpage(string url, DateTime? updateDate)
{
    // Create the request and advertise support for HTTP compression.
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
    request.Pipelined = true;
    request.Headers.Add(HttpRequestHeader.AcceptEncoding, "gzip,deflate");
    if (updateDate != null)
        request.IfModifiedSince = updateDate.Value;

    // Get the response; the using blocks dispose everything, even on error.
    using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
    {
        Stream responseStream = response.GetResponseStream();
        string encoding = (response.ContentEncoding ?? string.Empty).ToLowerInvariant();
        if (encoding.Contains("gzip"))
            responseStream = new GZipStream(responseStream,
                CompressionMode.Decompress);
        else if (encoding.Contains("deflate"))
            responseStream = new DeflateStream(responseStream,
                CompressionMode.Decompress);

        // Read the HTML exactly as the server sent it; no JavaScript runs here.
        using (StreamReader reader = new StreamReader(responseStream, Encoding.Default))
        {
            return reader.ReadToEnd();
        }
    }
}

Here's a sample URL:

http://www.realtor.com/realestateandhomes-search/geneva_ny#listingType-any/pg-4

When the page first loads, it says "134 listings found"; about a second later, it says "187 properties found".

Toni
Justin

4 Answers

5

To execute the JavaScript, I use WebKit to render the page; it is the engine used by Chrome and Safari. Here is an example using its Python bindings.

WebKit also has .NET bindings, but I haven't used them.

hoju
4

The approach you have will not work regardless of how long you wait. You need a browser (or something else that understands JavaScript) to execute the JavaScript.

Try this question: What's a good tool to screen-scrape with Javascript support?
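
As one illustration of "a browser that executes the JavaScript", here is a minimal sketch using Selenium WebDriver with Chrome in headless mode, so no browser window is opened. It assumes the Selenium.WebDriver and Selenium.Support NuGet packages, plus an installed Chrome and a matching driver. The class name, method name, and the `readySelector` parameter are hypothetical; the selector stands for whatever element the page's script adds once it has finished.

```csharp
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Support.UI;

static class HeadlessScraper
{
    // readySelector: hypothetical CSS selector for an element that only
    // exists after the page's JavaScript has run.
    public static string GetRenderedHtml(string url, string readySelector, TimeSpan timeout)
    {
        var options = new ChromeOptions();
        options.AddArgument("--headless"); // run Chrome without a visible window

        using (var driver = new ChromeDriver(options))
        {
            driver.Navigate().GoToUrl(url);

            // Poll until the script-generated element appears;
            // throws WebDriverTimeoutException if it never does.
            var wait = new WebDriverWait(driver, timeout);
            wait.Until(d => d.FindElements(By.CssSelector(readySelector)).Count > 0);

            // PageSource reflects the DOM after script execution.
            return driver.PageSource;
        }
    }
}
```

Waiting for a specific element tends to be more reliable than sleeping for a fixed second or two, because it returns as soon as the content is there and fails loudly when it is not.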

Community
ilivewithian
  • Thanks for the response, but I wasn't able to find one that works correctly for C#. I tried Selenium and its browser drivers worked, but they open browser windows, which doesn't work for me. I then tried a .NET DLL port of Java's HtmlUnit, but it is very slow and throws obscure errors. I need someone who has gotten this working to share which tool they used and to show some code. – Justin Apr 12 '11 at 19:12
1

You would need to execute the JavaScript yourself to get this functionality. Currently, your code only receives whatever the server replies with at the URL you request. The rest of the listings are "showing up" because the browser downloads, parses, and executes the accompanying JavaScript.

dlev
  • I know this, but I'm not a browser, so I don't have the capability to execute JavaScript myself. If you have that ability then you are amazing. – Justin Apr 12 '11 at 19:13
1

The answer to this similar question says to use a web browser control to read the page in and process it before scraping it, perhaps with some kind of timer delay to give the JavaScript time to execute and return results.
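
A rough sketch of that idea, assuming Windows and a reference to System.Windows.Forms: the WebBrowser control is created on its own STA thread with its own message loop, and the DOM is captured after a quiet period following the last DocumentCompleted event. The class name, method name, and both timing parameters are hypothetical, and this is only one way to arrange it.

```csharp
using System;
using System.Threading;
using System.Windows.Forms;

static class StaBrowserScraper
{
    // scriptDelayMs: how long to let scripts run after the page loads.
    // timeoutMs: overall limit before giving up. Both must be > 0.
    public static string GetRenderedHtml(string url, int scriptDelayMs, int timeoutMs)
    {
        string html = null;
        Exception error = null;

        var thread = new Thread(() =>
        {
            try
            {
                using (var browser = new WebBrowser { ScriptErrorsSuppressed = true })
                using (var settle = new System.Windows.Forms.Timer { Interval = scriptDelayMs })
                using (var watchdog = new System.Windows.Forms.Timer { Interval = timeoutMs })
                {
                    // Restart the delay on every completed document (frames fire this too).
                    browser.DocumentCompleted += (s, e) => { settle.Stop(); settle.Start(); };

                    settle.Tick += (s, e) =>
                    {
                        settle.Stop();
                        watchdog.Stop();
                        // Serialize the live DOM rather than the originally downloaded source.
                        HtmlElementCollection roots = browser.Document.GetElementsByTagName("html");
                        html = roots.Count > 0 ? roots[0].OuterHtml : browser.DocumentText;
                        Application.ExitThread();
                    };

                    watchdog.Tick += (s, e) =>
                    {
                        watchdog.Stop();
                        error = new TimeoutException("Page did not finish loading: " + url);
                        Application.ExitThread();
                    };

                    watchdog.Start();
                    browser.Navigate(url);
                    Application.Run(); // pump messages until ExitThread is called
                }
            }
            catch (Exception ex)
            {
                error = ex;
            }
        });

        thread.SetApartmentState(ApartmentState.STA); // the control requires an STA thread
        thread.IsBackground = true;
        thread.Start();
        thread.Join();

        if (error != null) throw error;
        return html;
    }
}
```

Because the STA thread is created inside the method, the caller's own thread can be of any apartment type; the cost is one extra thread and one browser instance per call.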

Community
Shane Wealti
  • The web browser control is a good solution for some, but it doesn't work in my case: it requires an STA thread, and this is a high-performance multi-threaded app using Parallel.ForEach, so I don't think they can play well together. – Justin Apr 13 '11 at 17:55