I am trying to extract the first 5 URLs from a Google search results page using the Selenium WebDriver. Firefox opens and the page loads, but the regex does not match the URLs on the page. How do I extract the URLs?

I have used the following code so far:

import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;

public class Weburlext {

    public static void main(String[] args) {

        String line = null;
        WebDriver driver = new FirefoxDriver();
        driver.get("http://www.google.co.in/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=sample%20data");

        String regex = "@^(http\\:\\/\\/|https\\:\\/\\/)?([a-z0-9][a-z0-9\\-]*\\.)+[a-z0-9][a-z0-9\\-]*$@i";
        Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        Matcher m = p.matcher(line);

        System.out.print(line);

        driver.quit();

    }
}
  • [Don't do this](http://stackoverflow.com/questions/22657548/is-it-ok-to-scrape-data-from-google-results), you are risking your IP being blocked by Google. Use Google API for automated access to Google search results. – Amadan Feb 02 '16 at 05:11
  • In the code you have provided line is always null. – Ardesco Feb 02 '16 at 08:38
  • You have to check your regex first. http://www.regexpal.com/ – Sagar007 Feb 02 '16 at 09:07
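
If you do want to stay with a regex, the two comments above point at the likely causes: `line` is never filled, and the pattern carries PHP-style `@...@i` delimiters plus `^`/`$` anchors, which in Java are treated as literal `@` characters and as a whole-input match. Below is a minimal sketch of how the matching could look once those are removed. The class name `UrlRegexSketch`, the method `extractUrls`, and the sample HTML are made up for illustration. I also made the scheme mandatory (an assumption, so that plain dotted words are not matched) and, like the original pattern, it only captures the scheme and host, not the path. With Selenium, the input text would presumably come from `driver.getPageSource()`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlRegexSketch {

    // No @...@i delimiters and no ^/$ anchors: Java would treat the delimiters
    // as literal characters, and the anchors would only allow a whole-input match.
    // Case-insensitivity is requested through the flag instead of a trailing "i".
    private static final Pattern URL = Pattern.compile(
            "https?://(?:[a-z0-9][a-z0-9-]*\\.)+[a-z0-9][a-z0-9-]*",
            Pattern.CASE_INSENSITIVE);

    // Returns up to `limit` scheme-plus-host URLs found in `text`, in document order.
    static List<String> extractUrls(String text, int limit) {
        List<String> urls = new ArrayList<>();
        if (text == null) {
            return urls; // nothing to scan, avoids the NullPointerException from matcher(null)
        }
        Matcher m = URL.matcher(text);
        // find() scans for the next match anywhere in the text.
        while (urls.size() < limit && m.find()) {
            urls.add(m.group());
        }
        return urls;
    }

    public static void main(String[] args) {
        // Hypothetical sample input standing in for the page source.
        String sample = "<a href=\"http://example.com/a\">A</a> "
                + "<a href=\"https://www.example.org\">B</a>";
        extractUrls(sample, 5).forEach(System.out::println);
    }
}
```

The key difference from the code in the question is the `while (m.find())` loop: creating a `Matcher` does not search anything by itself, so without calling `find()` no URL is ever extracted.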

1 Answer

I'm curious why you are using a regex to match the http pattern in the page source. The proper way to find the first 5 results with Selenium is to locate the result elements and then read each one's "href" attribute. See the code below:

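driver.get("https://www.google.com.ph/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=sample%20data");

// Select the <a> directly under the <h3> heading of each result container.
List<WebElement> results = driver.findElements(By.cssSelector("div[class='rc'] > h3 > a"));
// Print the href of the first 5 results only.
results.stream().limit(5).forEach(e -> System.out.println(e.getAttribute("href")));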
Buaban