
I was trying to scrape data from a web page using a Java servlet, but I found that the page is served compressed. So when I open a URLConnection, all I get is the gzipped file to download.

Can anyone help me with this? I will actually be visiting thousands of pages like this, parsing the table data with DOM, and populating a database so I can query for some of the text words and display the results. So I was wondering whether this could make the process too slow.

Is there a way to do this without downloading the file? Any suggestions would be greatly appreciated. Thanks.

try {
    URL url = new URL("example.html.gz");
    URLConnection conn = url.openConnection();
    conn.setAllowUserInteraction(false);

    // My attempt at decompressing, but I don't know what to pass to FileInputStream:
    //FileInputStream instream = new FileInputStream(???What do I enter???);
    //GZIPInputStream ginstream = new GZIPInputStream(instream);

    InputStream urlStream = url.openStream();
    BufferedReader buffer = new BufferedReader(new InputStreamReader(urlStream));

    String temp = "";
    String t = buffer.readLine();
    while (t != null) {
        temp = temp + t;
        t = buffer.readLine();
    }
} catch (IOException e) {
    e.printStackTrace();
}
  • After a lot of searching I finally found the answer here: http://stackoverflow.com/questions/11093153/how-to-read-compressed-html-page-with-content-encoding-gzip – Crocode May 26 '13 at 00:53

1 Answer


Can you try this:

GZIPInputStream ginstream =new GZIPInputStream(conn.getInputStream());

The rest is the same as your code.
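For reference, here is a minimal, self-contained sketch of how that line might slot into the question's code, assuming the server always serves the page gzip-compressed (the URL and class name are placeholders, not from the original post):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.net.URLConnection;
    import java.util.zip.GZIPInputStream;

    public class GzipPageReader {
        public static void main(String[] args) throws Exception {
            // Placeholder URL; substitute the real address of the .gz page
            URL url = new URL("http://example.com/example.html.gz");
            URLConnection conn = url.openConnection();
            conn.setAllowUserInteraction(false);

            // Wrap the connection's stream so the gzip content is decompressed on the fly
            GZIPInputStream ginstream = new GZIPInputStream(conn.getInputStream());
            BufferedReader buffer = new BufferedReader(new InputStreamReader(ginstream));

            StringBuilder page = new StringBuilder();
            String t;
            while ((t = buffer.readLine()) != null) {
                page.append(t).append('\n');
            }
            buffer.close();

            System.out.println(page);
        }
    }

If some of the pages turn out not to be compressed, checking the Content-Encoding response header before wrapping the stream would avoid a ZipException.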
