For a web crawler project in C#, I am trying to execute JavaScript and Ajax to retrieve the full page source of a crawled page.
I am using an existing web crawler (Abot) that needs a valid HttpWebResponse object, so I cannot simply use the driver.Navigate().GoToUrl() method to retrieve the page source.
The crawler downloads the page source, and I want to execute the existing JavaScript/Ajax inside that source.
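For context, the raw HTML reaches my code through Abot's crawl events, roughly like the sketch below (the event and property names are what I understand Abot 1.x to expose; treat them as assumptions):

var crawler = new PoliteWebCrawler();
crawler.PageCrawlCompletedAsync += (sender, e) =>
{
    // Raw HTML exactly as downloaded; no JavaScript or Ajax has run yet.
    string rawHtml = e.CrawledPage.Content.Text;
};
crawler.Crawl(new Uri("http://www.newegg.com/"));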
In a sample project I tried the following without success:
using System;
using System.IO;
using System.Net;
using OpenQA.Selenium.PhantomJS;

// Download the raw HTML once and save it to a temporary file.
WebClient wc = new WebClient();
string content = wc.DownloadString("http://www.newegg.com/Product/Product.aspx?Item=N82E16834257697");
string tmpPath = Path.Combine(Path.GetTempPath(), "temp.htm");
File.WriteAllText(tmpPath, content);

// Point PhantomJS at the local copy so its JavaScript gets executed.
var driverService = PhantomJSDriverService.CreateDefaultService();
var driver = new PhantomJSDriver(driverService);
driver.Navigate().GoToUrl(new Uri(tmpPath)); // execution stalls here
string renderedContent = driver.PageSource;
driver.Quit();
You need the following NuGet packages to run the sample: https://www.nuget.org/packages/phantomjs.exe/ and http://www.nuget.org/packages/selenium.webdriver
The problem is that the code stalls at GoToUrl(): it takes several minutes until the program terminates, without ever giving me driver.PageSource.
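To at least fail fast instead of blocking for minutes, I believe Selenium lets you put an upper bound on the page load. A sketch against the Selenium 2.x .NET API; it assumes the driver from the sample above and additionally needs using OpenQA.Selenium; for WebDriverTimeoutException:

// Bound the page load so GoToUrl() throws instead of hanging indefinitely.
driver.Manage().Timeouts().SetPageLoadTimeout(TimeSpan.FromSeconds(30));
try
{
    driver.Navigate().GoToUrl(new Uri(tmpPath));
}
catch (WebDriverTimeoutException)
{
    // PhantomJS never finished loading the local copy within the limit.
}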
By contrast, navigating straight to the live URL returns the correct HTML:
driver.Navigate().GoToUrl("http://www.newegg.com/Product/Product.aspx?Item=N82E16834257697");
string renderedContent = driver.PageSource;
But I don't want to download the data twice: the crawler (Abot) has already downloaded the HTML, and I just want to render/execute the JavaScript and Ajax inside it.
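One direction I have considered but not verified is handing the already-downloaded HTML straight to PhantomJS through IJavaScriptExecutor (which PhantomJSDriver implements), so the scripts run without a second request. A minimal, untested sketch:

// Write the HTML that was already downloaded into a blank document;
// document.write() makes PhantomJS parse it and execute its scripts.
driver.Navigate().GoToUrl("about:blank");
((IJavaScriptExecutor)driver).ExecuteScript(
    "document.open(); document.write(arguments[0]); document.close();",
    content);
// Caveat: relative script/Ajax URLs now resolve against about:blank,
// so a <base href> would probably have to be injected into the HTML first.
string rendered = driver.PageSource;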
Thank you!