I'm trying to write a small Java program with a simple UI that lets you use Google's search operators to refine a search.

I have two text fields (one for the site and one for the keywords) and two date pickers that let the user select the date range for the search results.

When I press the search button, it connects to the following URL:

"https://www.google.it/search?q=" + site + keywords + daterange
  • site = "site:SITE_MAIN_URL"
  • keywords are the keywords I am looking for
  • daterange = "daterange:JULIAN_DATE_1-JULIAN_DATE_2"

After all this I fetch the first 10 results, but here's the problem:

If I select no dates, I can fetch the links without trouble.

If I set the date range, I get an HTTP 503 (Service Unavailable) error, even though the generated URL works fine when I paste it into my web browser.

(The User-Agent is set to "Mozilla/5.0".)

EDIT: I forgot to post the code:

// build the "site:" operator
site = "site:" + website_field.getText();

// convert the dates to Julian day numbers using a class found on the net
d1 = (int) DateLabelFormatter.dateToJulian(date1);
d2 = (int) DateLabelFormatter.dateToJulian(date2);
daterange += "+daterange:" + d1 + "-" + d2;

// join the keywords with "+"
keywords = keyword_field.getText();
String[] keyword = keywords.split(" ");
for (int i = 0; i < keyword.length; i++) {
    tempKeyword += "+" + keyword[i];
}

// the query
query = "https://www.google.it/search?q=" + site + tempKeyword + daterange;

// the connection (wrapped in a try-catch)
Document jSoupDoc = Jsoup.connect(query).userAgent("Mozilla/5.0").timeout(5000).get();

// fetching the links
Elements links = jSoupDoc.select("a[href]");
Element link;
for (int i = 0; i < links.size(); i++) {
    link = links.get(i);
    String temp = link.attr("href");

    // keep only the result links: they contain "url" and are not cached copies
    if (temp.contains("url") && !temp.contains("webcache")) {
        String[] splitTemp = temp.split("=");
        String[] splitTemp2 = splitTemp[1].split("&sa");
        System.out.println(splitTemp2[0]);
    }
}

After running this (not so well written) code with no dates selected, using just the site and the keywords, I can see the first 10 results from the Google search page on the console. If I select a date range with the date pickers, I get the 503 error.

If you want to try a working query, here's one generated with this tool. It searches facebook.com for the keyword "dog" from the 1st to the 15th of November:

https://www.google.it/search?q=site:facebook.com+dog+daterange:2457328-2457342

Michele
  • Could you provide the code you use to make the actual call? Are you using a simple URLConnection, Apache HTTP Client, or anything else? – Adrian B. Nov 16 '15 at 15:32
  • Also, could you provide a sample generated URL that works in the browser but not in the code? – Adrian B. Nov 16 '15 at 15:32

2 Answers

I have no problems using the following code:

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Main
{
    public static void main(String[] args) throws IOException
    {
        // the connection (main declares IOException instead of using a try-catch)
        Document jSoupDoc = Jsoup.connect("https://www.google.it/search?q=site:facebook.com+dog+daterange:2457328-2457342").userAgent("Mozilla/5.0").timeout(5000).get();

        // fetching the links
        Elements links = jSoupDoc.select("a[href]");
        Element link;
        for (int i = 0; i < links.size(); i++)
        {
            link = links.get(i);
            String temp = link.attr("href");

            // filtering the first 10 google links
            if (temp.contains("url") && !temp.contains("webcache"))
            {
                String[] splitTemp = temp.split("=");
                String[] splitTemp2 = splitTemp[1].split("&sa");
                System.out.println(splitTemp2[0]);
            }
        }
    }
}

The code gives this as output on my computer:

https://www.facebook.com/uniladmag/videos/1912071728815877/
https://it-it.facebook.com/DogEvolutionAsd
https://it-it.facebook.com/DylanDogSergioBonelliEditore
https://www.facebook.com/DelawareCountyDogShelter/
https://www.facebook.com/LostDogAlert/
https://it-it.facebook.com/pages/Toelettatura-Vanity-DOG/270854126382923
https://it-it.facebook.com/washdogsgm
https://www.facebook.com/thedailystar/videos/1193933410623520/
https://www.facebook.com/OakhurstDogPark/
https://www.facebook.com/bigdogdinerco/
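
One caveat about the filtering step: split("=") keeps only the text between the first and second "=", so a result whose address itself contains "=" would be cut short. Below is a small sketch of a more defensive extraction that needs only the standard library. It assumes the result hrefs look like "/url?q=TARGET&sa=...", which is inferred from the splitting logic above and not from any documented Google format; the class and method names are made up for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class ResultLinkExample {
    /**
     * Returns the decoded target of a result href, or null if the href does
     * not look like "/url?q=TARGET&..." (an assumed format) or points to a
     * cached copy.
     */
    static String extractTarget(String href) {
        if (!href.startsWith("/url?") || href.contains("webcache")) {
            return null;
        }
        String queryString = href.substring("/url?".length());
        for (String pair : queryString.split("&")) {
            if (pair.startsWith("q=")) {
                try {
                    // Decode percent-escapes such as %3F and %3D in the target.
                    return URLDecoder.decode(pair.substring(2), "UTF-8");
                } catch (UnsupportedEncodingException e) {
                    throw new IllegalStateException("UTF-8 is always supported", e);
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(extractTarget("/url?q=https://www.facebook.com/LostDogAlert/&sa=U"));
    }
}
```

Inside the loop you would call extractTarget(link.attr("href")) and print the result only when it is not null.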

A 503 error usually means that the web server is having temporary issues. Specifically:

503: The Web server (running the Web site) is currently unable to handle the HTTP request due to a temporary overloading or maintenance of the server. The implication is that this is a temporary condition which will be alleviated after some delay.
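
Since a 503 is described as temporary, one option is to wait and retry instead of failing on the first attempt. The following is only a sketch of that idea with exponential backoff. The request is passed in as an IntSupplier that returns the HTTP status code, so the helper does not depend on any HTTP library; the class name, method name and parameters are all hypothetical. With jsoup, the status could presumably be obtained through Connection.ignoreHttpErrors(true), execute() and Response.statusCode().

```java
import java.util.function.IntSupplier;

class RetryOn503Example {
    /**
     * Calls request until it returns a status other than 503 or until
     * maxAttempts calls have been made, doubling the pause after each 503.
     * Returns the last status seen.
     */
    static int fetchWithBackoff(IntSupplier request, int maxAttempts, long initialDelayMillis) {
        long delay = initialDelayMillis;
        int status = request.getAsInt();
        for (int attempt = 1; attempt < maxAttempts && status == 503; attempt++) {
            try {
                Thread.sleep(delay);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and stop retrying.
                Thread.currentThread().interrupt();
                return status;
            }
            delay *= 2;
            status = request.getAsInt();
        }
        return status;
    }

    public static void main(String[] args) {
        // Fake request for demonstration: fails twice with 503, then succeeds.
        int[] calls = {0};
        int status = fetchWithBackoff(() -> ++calls[0] < 3 ? 503 : 200, 5, 100);
        System.out.println(status + " after " + calls[0] + " calls");
    }
}
```

Keep the number of attempts small and the delays generous: hammering the server with retries is likely to make any rate limiting worse.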

If this code works but your original code still does not, then your code is not generating the URL you posted and you should investigate further.

Daniel Centore
  • Thanks for the reply. I tried it today and didn't get any 503s; it must have been some external issue. – Michele Nov 17 '15 at 09:41

Coding style aside, I don't see any functional problems with the provided code, and it returns the expected results (I tested it locally). The problem might lie in dateToJulian: I don't know what it returns, or whether information is lost when the result is cast to int.
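
To rule out the conversion, you could compare its output with the Julian day number that java.time computes (Java 8 or later). This is a minimal sketch, not a drop-in replacement for your DateLabelFormatter; the class and method names are invented here, and it assumes your date pickers can give you a LocalDate.

```java
import java.time.LocalDate;
import java.time.temporal.JulianFields;

class JulianDayExample {
    /** Returns the Julian day number of the given calendar date. */
    static long toJulianDay(LocalDate date) {
        return date.getLong(JulianFields.JULIAN_DAY);
    }

    /** Builds the "daterange:START-END" operator used in the question. */
    static String dateRange(LocalDate from, LocalDate to) {
        return "daterange:" + toJulianDay(from) + "-" + toJulianDay(to);
    }

    public static void main(String[] args) {
        // prints daterange:2457328-2457342
        System.out.println(dateRange(LocalDate.of(2015, 11, 1), LocalDate.of(2015, 11, 15)));
    }
}
```

The values are whole numbers of type long, so no cast is involved, and they match the sample URL in the question.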

Also, consider the case where the keywords contain characters that have a special meaning in a URL (such as &, + or #) and are not escaped. They should be URL-encoded beforehand.
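
As an illustration, here is one way to build the query string with java.net.URLEncoder, encoding the whole search expression once instead of joining the pieces with "+" by hand. It is a sketch under assumptions: the class name, method name and parameters are hypothetical, and the base URL is simply the one from the question.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class QueryBuilderExample {
    /**
     * Joins the site operator, the keywords and an optional date range with
     * spaces, then URL-encodes the result as the value of the q parameter.
     */
    static String buildSearchUrl(String site, String keywords, String dateRange) {
        StringBuilder query = new StringBuilder("site:").append(site.trim());
        if (!keywords.trim().isEmpty()) {
            query.append(' ').append(keywords.trim());
        }
        if (dateRange != null && !dateRange.isEmpty()) {
            query.append(' ').append(dateRange);
        }
        try {
            // Spaces become "+", and characters such as & + # : are percent-encoded.
            return "https://www.google.it/search?q=" + URLEncoder.encode(query.toString(), "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(buildSearchUrl("facebook.com", "dog", "daterange:2457328-2457342"));
        System.out.println(buildSearchUrl("example.com", "c++ & java", null));
    }
}
```

Note that the colons come out as %3A; that is the standard encoding of ":" inside a query value.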

Another possibility is that Google is rejecting your queries because you are sending too many, too fast. If this were done in a regular browser, you'd get a "We want to make sure you're not a robot." message and a CAPTCHA page. That is why I'd recommend using the Google API for your searches. See this SO question for more info: How can you search Google Programmatically Java API

Adrian B.