
For a web crawler project in C#, I am trying to execute JavaScript and Ajax to retrieve the full page source of a crawled page.

I am using an existing web crawler (Abot) that needs a valid HttpWebResponse object, so I cannot simply use the driver.Navigate().GoToUrl() method to retrieve the page source.

The crawler downloads the page source, and I want to execute the existing JavaScript/Ajax inside that source.

In a sample project I tried the following without success:

        // Download the raw HTML and write it to a temporary file
        WebClient wc = new WebClient();
        string content = wc.DownloadString("http://www.newegg.com/Product/Product.aspx?Item=N82E16834257697");
        string tmpPath = Path.Combine(Path.GetTempPath(), "temp.htm");
        File.WriteAllText(tmpPath, content);

        // Open the temporary file in PhantomJS and read back the rendered source
        var driverService = PhantomJSDriverService.CreateDefaultService();
        var driver = new PhantomJSDriver(driverService);
        driver.Navigate().GoToUrl(new Uri(tmpPath));
        string renderedContent = driver.PageSource;
        driver.Quit();

You need the following NuGet packages to run the sample: https://www.nuget.org/packages/phantomjs.exe/ and http://www.nuget.org/packages/selenium.webdriver

The problem is that the code hangs at GoToUrl(), and it takes several minutes until the program terminates, without ever giving me driver.PageSource.

Doing this returns the correct HTML:

driver.Navigate().GoToUrl("http://www.newegg.com/Product/Product.aspx?Item=N82E16834257697");
string renderedContent = driver.PageSource;

But I don't want to download the data twice. The crawler (Abot) downloads the HTML, and I just want to render/execute the JavaScript and Ajax it contains.

Thank you!

jimbo
  • After reading your question again, I don't think it is that much faster to let it run on a local file, because the external files such as javascript and styles still have to be downloaded. You only reduce this by one request. – Artjom B. Jan 16 '15 at 20:58

2 Answers


Without running it, I would bet you need file:/// prior to tmpPath. That is:

    WebClient wc = new WebClient();
    string content = wc.DownloadString("http://www.newegg.com/Product/Product.aspx?Item=N82E16834257697");
    string tmpPath = Path.Combine(Path.GetTempPath(), "temp.htm");
    File.WriteAllText(tmpPath, content);

    var driverService = PhantomJSDriverService.CreateDefaultService();
    var driver = new PhantomJSDriver(driverService);
    // Prefix the local path with file:/// so it is treated as a local file URL
    driver.Navigate().GoToUrl(new Uri("file:///" + tmpPath));
    string renderedContent = driver.PageSource;
    driver.Quit();
Dave Bush

You probably need to allow PhantomJS to make arbitrary requests. Requests are blocked when the domain/protocol doesn't match, as is the case when a local file is opened.

var driverService = PhantomJSDriverService.CreateDefaultService();
driverService.LocalToRemoteUrlAccess = true;
driverService.WebSecurity = false; // may not be necessary
var driver = new PhantomJSDriver(driverService);

You might need to combine this with Dave Bush's solution:

driver.Navigate().GoToUrl(new Uri("file:///" + tmpPath));

Some of the resources have URLs that begin with //, which means that the protocol of the page is used when the browser retrieves those resources. When a local file is read, this protocol is file://, in which case none of those resources will be found. The protocol must be added to the local file in order to download all those resources.

File.WriteAllText(tmpPath, content.Replace("\"//", "\"http://"));

It is apparent from your output that you use PhantomJS 1.9.8. It may be the case that a newly introduced bug is responsible for this sort of thing. You should use PhantomJS 1.9.7 with driverService.SslProtocol = "tlsv1".


You should also enable the disk cache if you do this multiple times for the same domain. Otherwise, the resources are downloaded each time you try to scrape it. This can be done with driverService.DiskCache = true;
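Putting the pieces together, a rough, untested sketch of the driver service configuration discussed above could look like this (whether each property is needed depends on your PhantomJS/Selenium versions):

// Sketch only: combines the settings mentioned in this answer, not a verified fix
var driverService = PhantomJSDriverService.CreateDefaultService();
driverService.LocalToRemoteUrlAccess = true; // let the local file request remote resources
driverService.WebSecurity = false;           // may not be necessary
driverService.SslProtocol = "tlsv1";         // workaround for the PhantomJS 1.9.8 SSL issue
driverService.DiskCache = true;              // reuse downloaded resources across runs
var driver = new PhantomJSDriver(driverService);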

Artjom B.
  • I haven't tested this, but this is usually the solution for plain PhantomJS. – Artjom B. Jan 16 '15 at 18:50
  • Thanks for your quick answer but your solution does not work. I tried with and without Dave's suggestion in addition to your solution. Still infinite loading at GoToUrl() – jimbo Jan 16 '15 at 20:48
  • What does it show when it terminates? Are there any errors? – Artjom B. Jan 16 '15 at 21:01
  • I've seen your output and it doesn't say anything useful. I can't run it myself, so I can't verify it. Some things to try: 1. Does it load the page when you pass the actual newegg url instead of the temporary file? This verifies whether this is a file issue or a site issue. 2. Check with a non-existing file whether errors are thrown. 3. Try to use PhantomJS 1.9.7 instead of 1.9.8. This shouldn't make a difference, but a new bug was introduced in 1.9.8. 4. Change the extension from htm to html. – Artjom B. Jan 16 '15 at 22:41
  • 1. Yes. 2. No errors on a non-existing file; PageSource returns an empty HTML construct then. 3. This works, but it's extremely slow (about 1 minute load time). The same applies to using 1.9.8 with the workaround driverService.SslProtocol = "tlsv1". 4. No difference. Thanks for being so helpful. If you can help me with the extremely slow load time I would be happy. – jimbo Jan 16 '15 at 23:06
  • I'm out of ideas. This may be a bug in GhostDriver or the language bindings. Have you looked through the bug tracker/issues? – Artjom B. Jan 16 '15 at 23:14
  • Thanks for all the effort. My problem is not completely solved, but you helped a lot. I have to wait 17 hours until I can award the bounty. – jimbo Jan 17 '15 at 00:34
  • I added the solution to the resource problem and the cache hint. – Artjom B. Jan 17 '15 at 09:32
  • Even with the resource fix it's not working for me; the load time is still really slow. But thank you anyway. I have found a workaround: starting phantomjs as a process with a js file that renders the local html file (a rough sketch of that approach is shown below). Your information was still helpful. I will award you the bounty as soon as I can (2h). – jimbo Jan 17 '15 at 15:45
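For reference, a rough, untested sketch of that process-based workaround might look like the following. It assumes phantomjs.exe is on the PATH and that tmpPath still points at the downloaded HTML; the render script name, the file layout, and the fixed 2-second Ajax wait are illustrative choices, not jimbo's actual code:

    // Write a small PhantomJS render script (JavaScript) to disk.
    // "render.js" and the 2000 ms wait are arbitrary choices for this sketch.
    string scriptPath = Path.Combine(Path.GetTempPath(), "render.js");
    File.WriteAllText(scriptPath,
        "var page = require('webpage').create();\n" +
        "page.open('file:///" + tmpPath.Replace("\\", "/") + "', function () {\n" +
        "    window.setTimeout(function () {\n" +
        "        console.log(page.content);\n" +   // print the rendered DOM to stdout
        "        phantom.exit();\n" +
        "    }, 2000);\n" +                        // crude wait for Ajax to finish
        "});");

    // Run phantomjs.exe directly and capture its stdout as the rendered HTML
    var psi = new ProcessStartInfo("phantomjs.exe", "\"" + scriptPath + "\"")
    {
        RedirectStandardOutput = true,
        UseShellExecute = false
    };
    using (var process = Process.Start(psi))
    {
        string renderedContent = process.StandardOutput.ReadToEnd();
        process.WaitForExit();
    }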