I am using Jsoup to extract data by zip code from a web site. The zip codes are read from a text file and the results are written to the console. I have around 1500 zip codes. The program throws two kinds of exceptions:

org.jsoup.HttpStatusException: HTTP error fetching URL. Status=500, URL=http://www.moving.com/real-estate/city-profile/...

java.net.SocketTimeoutException: Read timed out

I thought the solution was to read only a little data at a time. So I used a counter to count 200 zip codes from the text file, and I pause the program for 5 minutes after I have data for 200 zip codes. Even so, I still get the exceptions. So far, when I see an exception, I copy and paste the available data and then continue with the remaining zip codes. But I want to read all the data without interruptions. Is this possible? Any hint will be appreciated. Thank you in advance!

This is my code for reading all data:

    while (br.ready())
        {
            count++;

            String s = br.readLine();
            String str="http://www.moving.com/real-estate/city-profile/results.asp?Zip="+s; 
            Document doc = Jsoup.connect(str).get();

            for (Element table : doc.select("table.DataTbl"))
            {
                for (Element row : table.select("tr")) 
                {
                    Elements tds = row.select("td");
                    if (tds.size() > 1)
                    {
                        if (tds.get(0).text().contains("Per capita income"))
                            System.out.println(s+","+tds.get(2).text());
                    }
                }
            }
            if(count%200==0)
            {
                System.out.println("Stopping for 5 minutes");
                Thread.sleep(300000); // pause for 5 minutes (300,000 ms)
            }
        }
Lavinia

2 Answers

1

Replace the line Document doc = Jsoup.connect(str).get(); with the following, which sets the timeout:

        Connection conn = Jsoup.connect(str);
        conn.timeout(300000); //5 minutes
        Document doc = conn.get();
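
A longer timeout does not help when the server itself answers with an error such as HTTP 500, so it can be worth retrying a failed request a few times before giving up. Here is a minimal sketch, assuming a hypothetical helper class Retry (not part of jsoup) that wraps any call throwing IOException; the attempt count and delays are placeholders to tune:

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Hypothetical helper, not part of jsoup: retries a task that fails with IOException.
final class Retry {
    private Retry() {}

    /**
     * Runs the task up to maxAttempts times. After each failed attempt except
     * the last it sleeps, doubling the delay each time, starting at
     * initialDelayMillis. Rethrows the last IOException if every attempt fails.
     */
    static <T> T withBackoff(Callable<T> task, int maxAttempts, long initialDelayMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delay = initialDelayMillis;
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (IOException e) {
                last = e; // remember the failure and try again
                if (attempt < maxAttempts) {
                    Thread.sleep(delay);
                    delay *= 2;
                }
            }
        }
        throw last; // non-null here: every attempt failed with an IOException
    }
}
```

With jsoup on the classpath, the fetch could then be written as Retry.withBackoff(() -> Jsoup.connect(str).timeout(300000).get(), 3, 2000). Both HttpStatusException and SocketTimeoutException extend IOException, so both would be retried.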
Yogendra Singh
  • @LaviniaTomole - Did you try the above as well? If yes, did you get the same error? – Yogendra Singh Oct 20 '12 at 03:48
  • I am trying right now..so far it is working...but for 1500 zip codes it takes me a while...I will let you know....thank you! – Lavinia Oct 20 '12 at 03:50
  • I got the error "org.jsoup.HttpStatusException: HTTP error fetching URL. Status=500, URL=http://www.moving.com/real-estate/city-profile/results.asp?Zip=629.." after 700 extractions. – Lavinia Oct 20 '12 at 04:26
  • Increase the timeout to 10 or 15 minutes and reduce your thread sleep, if you are still using it, and then try again. That should help. – Yogendra Singh Oct 20 '12 at 04:29
  • Do you think it is a good idea if I do not use sleep()? I think the problem was fixed by setting the timeout, right? I made the changes and I am waiting again :) – Lavinia Oct 20 '12 at 04:40
  • Yes. I don't think sleep is doing anything other than interrupting your process. – Yogendra Singh Oct 20 '12 at 04:46
  • It was interrupted again, but I could read 200 more records (900 in total). So the changes helped me a lot. I set the timeout to 10 minutes only. Can I set it as high as I want? What would be the optimal timeout, in your opinion? – Lavinia Oct 20 '12 at 04:52
0

    Connection conn = Jsoup.connect(str);
    conn.timeout(0); // infinite timeout

Set the request timeouts (connect and read). If a timeout occurs, an IOException will be thrown. The default timeout is 3 seconds (3000 millis). A timeout of zero is treated as an infinite timeout.

Parameters:

millis - number of milliseconds before timing out connects or reads.

Returns:

this Connection, for chaining

Source: jsoup API

Set the timeout to zero. This way you won't have to stop for 5 minutes.
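
Whatever timeout you pick, a single request can still fail, and one uncaught exception ends the whole run. Below is a rough sketch of a loop that records the failed zip codes and moves on to the next one. It assumes a hypothetical ZipFetcher interface that stands in for the jsoup call plus the table parsing; neither BatchRunner nor ZipFetcher is a jsoup type:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: process every zip code, recording failures instead of aborting.
final class BatchRunner {
    private BatchRunner() {}

    /** Stand-in for the real per-zip fetch and parse step (an assumption). */
    interface ZipFetcher {
        String fetch(String zip) throws IOException;
    }

    /**
     * Calls the fetcher for each zip code. On success, appends "zip,value" to
     * results. Returns the zip codes whose fetch threw an IOException.
     */
    static List<String> run(List<String> zips, ZipFetcher fetcher, List<String> results) {
        List<String> failed = new ArrayList<>();
        for (String zip : zips) {
            try {
                results.add(zip + "," + fetcher.fetch(zip));
            } catch (IOException e) {
                // HttpStatusException and SocketTimeoutException are both IOExceptions.
                failed.add(zip);
            }
        }
        return failed;
    }
}
```

The returned list can be passed to run again for a second pass over only the zip codes that failed.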

raam86
  • I tried to modify the code to Connection.timeout(300000); (I want 5 minutes) and I get the error "The method timeout(int) is undefined for the type Connection" – Lavinia Oct 20 '12 at 03:40