I have just started using Selenium WebDriver and I am stuck on a problem: I want to download a web page's source into my Java program. I tried driver.getPageSource() with the HtmlUnit driver, but the result does not exactly match what I get when I manually do the following:

right click on the browser -> view page source.

I cannot figure out what the problem is. Is there a different API for this purpose, or am I using the wrong driver? Should I use the Chrome driver instead of the HtmlUnit driver? If so, how do I use the Chrome driver?

Here is what I am doing:

    WebDriver driver = new HtmlUnitDriver();
    driver.get(webPage);
    System.out.println(driver.getPageSource());
alex
Vasanth Nag K V
  • What do you mean by "does not exactly match the result"? What's different? Is it giving you the source of another page? Is there JavaScript on the page altering the content? I'm curious whether that would affect this. – Avery Nov 13 '13 at 21:29
  • Some elements are missing. For example, SHANKAR HARDWARE does not come up with the method I used in the question, but when I do view page source, it shows up. – Vasanth Nag K V Nov 14 '13 at 04:37

2 Answers

I've just checked out Fluent Selenium, which I ran on top of the Firefox WebDriver. It's a testing framework, so don't be surprised by the presence of assertion methods, but it can also be used for crawling. It worked perfectly for me with very little configuration. It requires Maven to run; here is my working example:

package fluent;

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.seleniumhq.selenium.fluent.FluentWebDriver;
import org.seleniumhq.selenium.fluent.Period;
import org.seleniumhq.selenium.fluent.TestableString;

import java.util.concurrent.TimeUnit;

import static org.openqa.selenium.By.className;

public class Test {
    public static void main(String[] args) {
        WebDriver driver = new FirefoxDriver();
        FluentWebDriver fwd = new FluentWebDriver(driver);

        driver.manage().timeouts().implicitlyWait(5, TimeUnit.SECONDS);
        driver.get("http://www.hudku.com/search/business-list/Paint%20%26%20Hardware%20in%20Kanakapura%20Road,%20Bangalore,%20Karnataka,%20India?p=6&h1=mgK%3DFsPlSAsPTaOVwo%2F0FIMA");

        // driver.get(...) above already loads the page; no separate navigate() call is needed.

        TestableString test = fwd.div(className("heading")).within(Period.secs(3)).getText();

        System.out.println("header: " + test.toString());

        test.shouldContain("Paint");

        System.out.println("all is fine!");

        // Close the browser window and end the WebDriver session.
        driver.quit();
    }
}

My pom.xml:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>testPrj3</groupId>
    <artifactId>testPrj3</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <dependency>
            <groupId>org.seleniumhq.selenium.fluent</groupId>
            <artifactId>fluent-selenium</artifactId>
            <version>1.14.2</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.hamcrest</groupId>
            <artifactId>hamcrest-all</artifactId>
            <version>1.3</version>
            <scope>test</scope>
        </dependency>

        <!-- If you're needing Coda Hale's Metrics integration (optional) -->
        <dependency>
            <groupId>com.codahale.metrics</groupId>
            <artifactId>metrics-core</artifactId>
            <version>3.0.0</version>
        </dependency>

    </dependencies>


    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.1</version>
                <configuration>
                    <source>1.7</source>
                    <target>1.7</target>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>

UPDATE

FluentLenium seems to be a little more popular.

Andrey Chaschev

The problem is that the browser sends a string to the web server (the User-Agent request header) that declares what type of browser it is, and the server may then return different content depending on the browser. This is a basic fact of web programming: developers often have to adjust page content, especially CSS declarations, depending on the browser.
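
One way to check whether this is what is happening is to fetch the raw page yourself while sending a browser-like identification string, then look for the missing element. The following is only a rough sketch, not something from the question: it assumes Java 11+ (for java.net.http), the class and method names are made up for illustration, and the User-Agent value is just an example string that you should replace with the one your own browser sends.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper class; the name is illustrative only.
class UserAgentFetch {

    // Example value only (an assumption): copy the real string from your own browser.
    static final String CHROME_LIKE_UA =
            "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 "
            + "(KHTML, like Gecko) Chrome/31.0.1650.57 Safari/537.36";

    // Builds a GET request that identifies itself with the given User-Agent.
    static HttpRequest buildRequest(String url, String userAgent) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", userAgent)
                .GET()
                .build();
    }

    // Sends the request and returns the response body as the server delivered it,
    // before any JavaScript has had a chance to modify it.
    static String fetchSource(String url, String userAgent)
            throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        HttpResponse<String> response =
                client.send(buildRequest(url, userAgent), HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.out.println("usage: java UserAgentFetch.java <url> <marker>");
            return;
        }
        String source = fetchSource(args[0], CHROME_LIKE_UA);
        // Reports whether the text you expect (e.g. a business name) is in the raw source.
        System.out.println("marker present: " + source.contains(args[1]));
    }
}
```

If the marker shows up with the browser-like string but not with a different one, the server is probably varying its response by User-Agent. If it is missing either way, the element is more likely added by JavaScript after the page loads, which is a different problem.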

djangofan
  • Hi djangofan, thanks for turning up on my question. The expected result is correct in Google Chrome, so are you saying that I have to use the Chrome driver API to download the source code of the web page? – Vasanth Nag K V Nov 14 '13 at 07:48