1

I am trying to retrieve the HTML of a Google search result page using Java. That is, if I search Google.com for a particular phrase, I would like to retrieve the HTML of the resulting web page (the page containing the links to possible matches along with their descriptions, URLs, etc.).

I tried doing this using the following code that I found in a related post:

import java.io.*;
import java.net.*;
import java.util.*;

public class Main {

    public static void main (String args[]) {

        URL url;
        InputStream is = null;
        DataInputStream dis;
        String line;

        try {
            url = new URL("https://www.google.com/#hl=en&output=search&sclient=psy-ab&q=UCF&oq=UCF&aq=f&aqi=g4&aql=&gs_l=hp.3..0l4.1066.1471.0.1862.3.3.0.0.0.0.382.1028.2-1j2.3.0...0.0.OxbV2LOXcaY&pbx=1&bav=on.2,or.r_gc.r_pw.r_cp.r_qf.,cf.osb&fp=579625c09319dd01&biw=944&bih=951");
            is = url.openStream();  // throws an IOException
            dis = new DataInputStream(new BufferedInputStream(is));

            while ((line = dis.readLine()) != null) {
                System.out.println(line);
            }
        } catch (MalformedURLException mue) {
             mue.printStackTrace();
        } catch (IOException ioe) {
             ioe.printStackTrace();
        } finally {
            try {
                // is is still null if openStream() threw
                if (is != null) {
                    is.close();
                }
            } catch (IOException ioe ) {
                // nothing to see here
            }
        }
    }
} 

From: How do you Programmatically Download a Webpage in Java

The URL used in this code was obtained by running a search from the Google homepage. For some reason I do not understand, if I instead type the search phrase into the URL bar of my web browser and use the URL of the resulting page in the code, I get a 403 error.

This code, however, did not return the HTML of the search result page. Instead, it returned the source code of the Google homepage.

After further research I noticed that if you view the source code of a Google search result page (by right-clicking on the background of the page and selecting "View page source") and compare it with the source code of the Google homepage, the two are identical.

If, instead of viewing the source code of the search result page, I save the page (by pressing Ctrl+S), I get the HTML that I am looking for.

Is there a way to retrieve the HTML of the search result page using Java?

Thank you!

Community
Erich

3 Answers

2

Rather than parsing the HTML page returned by a standard Google search, you would probably be better off using the official Custom Search API, which returns results from Google in a more usable format. The API is definitely the way to go; otherwise your code could break whenever Google changes the HTML of the google.com front end. The API is designed to be used by developers, and your code would be far less fragile.
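For what it's worth, here is a rough sketch of how a request URL for the Custom Search JSON API could be put together. As far as I know the endpoint is https://www.googleapis.com/customsearch/v1 and it takes key, cx and q parameters, but check the official documentation before relying on that. The class name CustomSearchUrl, the build helper, and the YOUR_API_KEY / YOUR_ENGINE_ID placeholders are made up for illustration.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class CustomSearchUrl {

    // Assumed endpoint of the Custom Search JSON API; verify against the docs.
    static final String ENDPOINT = "https://www.googleapis.com/customsearch/v1";

    // Builds the request URL; every parameter value is URL-encoded.
    static String build(String apiKey, String engineId, String query) {
        return ENDPOINT
                + "?key=" + URLEncoder.encode(apiKey, StandardCharsets.UTF_8)
                + "&cx=" + URLEncoder.encode(engineId, StandardCharsets.UTF_8)
                + "&q=" + URLEncoder.encode(query, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Placeholder credentials: replace with your own key and engine ID.
        System.out.println(build("YOUR_API_KEY", "YOUR_ENGINE_ID", "UCF"));
    }
}
```

The response to such a request is JSON rather than HTML, so there is no page markup to scrape.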

To answer your question, though: we can't really help you with just the information you've provided. Your code seems to retrieve the HTML of Stack Overflow; it is an exact copy-and-paste of the code from the question you linked to. Did you attempt to change the code at all? What URL are you actually using to retrieve Google search results?

I tried to run your code using url = new URL("http://www.google.com/search?q=test"); and I personally get an HTTP 403 Forbidden error. A quick search suggests that this happens when the web request does not include a User-Agent header, though that doesn't exactly help you if you're actually getting HTML returned. You will have to provide more information if you want specific help, though switching to the Custom Search API will likely solve your problem.
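If you want to see the status code for yourself, something along these lines should do it. This is only a sketch: the class name StatusCheck and the prepare helper are my own invention, and the status you get back depends on what Google's servers decide to do with the request.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

class StatusCheck {

    // Creates the connection and sets the User-Agent header.
    // Nothing is sent over the network until the response is requested.
    static HttpURLConnection prepare(String spec, String userAgent) {
        try {
            HttpURLConnection c =
                    (HttpURLConnection) URI.create(spec).toURL().openConnection();
            c.setRequestProperty("User-Agent", userAgent);
            return c;
        } catch (IOException e) {
            throw new IllegalStateException("Could not open " + spec, e);
        }
    }

    public static void main(String[] args) throws IOException {
        HttpURLConnection c = prepare(
                "https://www.google.com/search?q=test",
                "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.168");
        // getResponseCode() sends the request and reads the status line.
        System.out.println("HTTP status: " + c.getResponseCode());
        c.disconnect();
    }
}
```

Leaving out the setRequestProperty call makes Java send its default User-Agent (something like Java/1.6.0_30), which is the case where I saw the 403.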


edit: new information provided in original question; can directly answer question now!

I figured out your problem after packet-capturing the web request that java was sending and applying some basic debugging... Let's take a look!

Here's the web request that Java was sending with your provided example URL:

GET / HTTP/1.1
User-Agent: Java/1.6.0_30
Host: www.google.com
Accept: text/html, image/gif, image/jpeg, *; q=.2, */*; q=.2
Connection: keep-alive

Notice that the request seemed to ignore most of the URL... leaving just the "GET /". That is strange. I had to look this one up.

As per the documentation of the Java URL class (and this is standard for all web pages): "A URL may have appended to it a "fragment", also known as a "ref" or a "reference". The fragment is indicated by the sharp sign character "#" followed by more characters ... This fragment is not technically part of the URL."

Let's take a look at your example URL...

https://www.google.com/#hl=en&output=search&sclient=psy-ab&q=UCF&oq=UCF&aq=f&aqi=g4&aql=&gs_l=hp.3..0l4.1066.1471.0.1862.3.3.0.0.0.0.382.1028.2-1j2.3.0...0.0.OxbV2LOXcaY&pbx=1&bav=on.2,or.r_gc.r_pw.r_cp.r_qf.,cf.osb&fp=579625c09319dd01&biw=944&bih=951

Notice that "#" comes immediately after the "/" path? Java never sends anything after the "#" to the server, because fragments are only used by the client / web browser. That leaves you with the URL https://www.google.com/. Hey, at least it was working as intended!
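You can check how a URL splits into path, query and fragment with the getters on the URL class. A small sketch follows; the class name FragmentDemo and its helpers are made up for illustration, and the fragment is a shortened stand-in for the one in the question.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

class FragmentDemo {

    // Parses the string into a java.net.URL (hypothetical helper).
    static URL parse(String spec) {
        try {
            return URI.create(spec).toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Not a valid URL: " + spec, e);
        }
    }

    // Path plus query: the part of the URL that ends up in the request line.
    static String requestTarget(String spec) {
        return parse(spec).getFile();
    }

    public static void main(String[] args) {
        String withFragment = "https://www.google.com/#hl=en&q=UCF";
        String withQuery = "https://www.google.com/search?q=UCF";

        System.out.println(requestTarget(withFragment));   // prints "/"
        System.out.println(parse(withFragment).getRef());  // prints "hl=en&q=UCF"
        System.out.println(requestTarget(withQuery));      // prints "/search?q=UCF"
        System.out.println(parse(withQuery).getRef());     // prints "null"
    }
}
```

Only the value returned by getFile() shows up in the "GET ..." line of the request; the ref stays on the client.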

I can't tell you exactly what Google is doing, but the "#" in the URL definitely means that Google is returning the results of the query through some client-side (AJAX / JavaScript) scripting. I'd be willing to bet that any query you send directly to the server (i.e. with no "#" symbol) without the proper headers will return a 403 Forbidden error. Looks like they're encouraging you to use the API :)

edit2: As per Tengji Zhang's answer to the question, here is working code that returns the results of the Google query for "test":

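    import java.io.*;
    import java.net.*;

    public class Main {

        public static void main(String[] args) {
            BufferedReader reader = null;
            String line;

            try {
                // A real query string ("?q=...") instead of a "#" fragment
                URL url = new URL("https://www.google.com/search?q=test");
                URLConnection c = url.openConnection();
                // A browser-like User-Agent avoids the 403 response
                c.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.168");
                c.connect();
                // BufferedReader replaces the deprecated DataInputStream.readLine();
                // UTF-8 is an assumption about the response encoding
                reader = new BufferedReader(new InputStreamReader(c.getInputStream(), "UTF-8"));
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                }
            } catch (MalformedURLException mue) {
                mue.printStackTrace();
            } catch (IOException ioe) {
                ioe.printStackTrace();
            } finally {
                // reader is still null if the connection failed
                if (reader != null) {
                    try {
                        reader.close();
                    } catch (IOException ioe) {
                        // nothing to see here
                    }
                }
            }
        }
    }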
Alex Lynch
  • Thank you for all the information! I will look into the Google API. However, I would like to understand why the Java code does not return the desired result. I updated the original post with the code I used and added an explanation of how I got a URL that did not produce a 403 error. I hope this makes it more understandable. – Erich Jun 26 '12 at 04:27
  • @Kyndod7 Not sure if you receive notifications for my edit - but I answered your question :) Why are you trying to programmatically google search the name of my university? :) – Alex Lynch Jun 26 '12 at 05:43
  • Thank you so much Alex! I just randomly chose UCF when I was testing the code; it is my university too :) – Erich Jun 27 '12 at 03:22
1

I suggest you try Selenium: http://seleniumhq.org/

There is a good tutorial on searching Google with it:

http://code.google.com/p/selenium/wiki/GettingStarted

Jianyu
-1

You don't set the User-Agent header in your code. Call setRequestProperty on the URLConnection (c in the code further down) before connecting:

c.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.168");

You can also read http://www.google.com/robots.txt. This file tells you which URLs Google allows crawlers to fetch.
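As a rough illustration of how such a file can be read, here is a simplified sketch that collects the Disallow prefixes listed under "User-agent: *" and checks a path against them. It deliberately ignores Allow rules, wildcards and rule precedence, so it is not a complete robots.txt parser. The class name RobotsSketch and the sample text are made up; the sample is not a copy of Google's actual file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class RobotsSketch {

    // Collects the Disallow prefixes that appear under "User-agent: *".
    static List<String> disallowedPrefixes(String robotsTxt) {
        List<String> prefixes = new ArrayList<>();
        boolean inWildcardGroup = false;
        for (String raw : robotsTxt.split("\n")) {
            String line = raw.trim();
            String lower = line.toLowerCase(Locale.ROOT);
            if (lower.startsWith("user-agent:")) {
                String agent = line.substring("user-agent:".length()).trim();
                inWildcardGroup = agent.equals("*");
            } else if (inWildcardGroup && lower.startsWith("disallow:")) {
                String prefix = line.substring("disallow:".length()).trim();
                if (!prefix.isEmpty()) {
                    prefixes.add(prefix);
                }
            }
        }
        return prefixes;
    }

    // Simplified check: a path is disallowed if it starts with any prefix.
    static boolean isDisallowed(String robotsTxt, String path) {
        for (String prefix : disallowedPrefixes(robotsTxt)) {
            if (path.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Illustrative sample only, not Google's real robots.txt.
        String sample = "User-agent: *\nDisallow: /search\nDisallow: /groups\n";
        System.out.println(isDisallowed(sample, "/search?q=test")); // prints "true"
        System.out.println(isDisallowed(sample, "/"));              // prints "false"
    }
}
```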

The code below runs successfully for me:

package org.test.stackoverflow;

import java.io.*;
import java.net.*;
import java.util.*;

public class SearcherRetriver {
    public static void main (String args[]) {

        URL url;
        InputStream is = null;
        DataInputStream dis;
        String line;
        URLConnection c;

        try {
            url = new URL("https://www.google.com.hk/#hl=en&output=search&sclient=psy-ab&q=UCF&oq=UCF&aq=f&aqi=g4&aql=&gs_l=hp.3..0l4.1066.1471.0.1862.3.3.0.0.0.0.382.1028.2-1j2.3.0...0.0.OxbV2LOXcaY&pbx=1&bav=on.2,or.r_gc.r_pw.r_cp.r_qf.,cf.osb&fp=579625c09319dd01&biw=944&bih=951");
            c = url.openConnection();
            c.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.168");
            c.connect();
            is = c.getInputStream();
            dis = new DataInputStream(new BufferedInputStream(is));
            while ((line = dis.readLine()) != null) {
                System.out.println(line);
            }
        } catch (MalformedURLException mue) {
             mue.printStackTrace();
        } catch (IOException ioe) {
             ioe.printStackTrace();
        } finally {
            try {
                // is is still null if the connection failed
                if (is != null) {
                    is.close();
                }
            } catch (IOException ioe ) {
                // nothing to see here
            }
        }
    }
}
Jimmy Zhang
  • Your code does not work. I tested with google.com rather than google.com.hk - but it should make no difference. See my answer as to why it does not work. – Alex Lynch Jun 26 '12 at 09:03
  • My code works on my computer. @Kyndod7's code does not follow the rules Google sets for crawlers, so it gets the 403 error. – Jimmy Zhang Jun 26 '12 at 11:34
  • Yes, but your code still returns the Google homepage rather than actual search results. The 403 error does not occur because you *never actually perform a Google search*. Only the Google homepage HTML is returned, not the HTML of a search query (which is what the author wants). If you combine your request header with a URL that *will actually return search results*, then your code is correct and the OP's question is answered. But in its current state, your answer does not describe why the OP's code does not return HTML related to a search query. – Alex Lynch Jun 26 '12 at 11:46