I'm trying to download all the PDF files on a website and my code is bad. I guess there is a better way out there. Anyway, here it is:

try {
    System.out.println("Download started");
    URL getURL = new URL("http://cs.lth.se/eda095/foerelaesningar/?no_cache=1");
    URL pdf;
    URLConnection urlC = getURL.openConnection();

    InputStream is = urlC.getInputStream();
    BufferedReader buffRead = new BufferedReader(new InputStreamReader(is));
    FileOutputStream fos = null;

    byte[] b = new byte[1024];
    String line;
    double i = 1;
    int t = 1;
    int length;

    while ((line = buffRead.readLine()) != null) {

        while ((length = is.read(b)) > -1) {

            if (line.contains(".pdf")) {

                pdf = new URL("http://fileadmin.cs.lth.se/cs/Education/EDA095/2015/lectures/"
                        + "f" + i + "-" + t + "x" + t);

                fos = new FileOutputStream(new File("fil" + i + "-" + t + "x" + t + ".pdf"));
                fos.write(b, 0, line.length());
                i += 0.5;
                t += 1;

                if (t > 2) {
                    t = 1;
                }
            }
        }
    }
    is.close();
    System.out.println("Download finished");
} catch (MalformedURLException e) {
    e.printStackTrace();
} catch (IOException e) {
    e.printStackTrace();
}

The files I get are damaged. BUT, is there a better way to download the PDF files? Because on the site some of the files are f1-1x1, f1-2x2, f2-1x1... But what if the files were donalds.pdf, stack.pdf, etc.?

So the question is: how do I make my code better, so that it downloads all the PDF files?
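
For a single file whose direct URL is already known, I believe the download itself should look roughly like this (a minimal sketch; the class name and the example URL are just placeholders based on the pattern I mentioned, so the exact file may not exist):

import java.io.FileOutputStream;
import java.io.InputStream;
import java.net.URL;

public class SinglePdfDownload {
    public static void main(String[] args) throws Exception {
        // Example URL following the f<lecture>-<n>x<n>.pdf pattern; the exact file may not exist.
        URL pdf = new URL("http://fileadmin.cs.lth.se/cs/Education/EDA095/2015/lectures/f1-1x1.pdf");
        try (InputStream in = pdf.openStream();
             FileOutputStream out = new FileOutputStream("f1-1x1.pdf")) {
            byte[] buffer = new byte[4096];
            int n;
            // Copy raw bytes; reading the stream as text lines is what corrupts binary files
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        }
    }
}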


1 Answer


Basically you are asking: "how can I reliably parse HTML to identify all download links that point to PDF files?"

Anything else (like what you have right now: anticipating how links would/could/should look) will be a constant source of grief, because any update to the web site, or trying to run your code against a different web site, is very likely to fail. And that is because HTML is complex and comes in so many flavors that you should simply forget about "easy" solutions for analysing HTML content.

In that sense: learn how to use an HTML parser. A first starting point could be Which HTML Parser is the best?
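
To give a rough idea, here is a minimal sketch using jsoup (just one commonly used HTML parser; the class name below is made up): it loads the lecture page, selects every anchor whose href ends in ".pdf", and streams each file to disk.

import java.io.FileOutputStream;
import java.io.InputStream;
import java.net.URL;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class PdfLinkDownloader {
    public static void main(String[] args) throws Exception {
        // Parse the lecture page and keep only anchors whose href ends in ".pdf"
        Document doc = Jsoup.connect("http://cs.lth.se/eda095/foerelaesningar/?no_cache=1").get();
        for (Element link : doc.select("a[href$=.pdf]")) {
            String fileUrl = link.attr("abs:href"); // resolve relative links against the page URL
            String fileName = fileUrl.substring(fileUrl.lastIndexOf('/') + 1);
            System.out.println("Downloading " + fileUrl);

            // Stream the raw bytes of each PDF into a local file
            try (InputStream in = new URL(fileUrl).openStream();
                 FileOutputStream out = new FileOutputStream(fileName)) {
                byte[] buffer = new byte[4096];
                int n;
                while ((n = in.read(buffer)) != -1) {
                    out.write(buffer, 0, n);
                }
            }
        }
    }
}

A selector-based approach like this keeps working even if the files were named donalds.pdf or stack.pdf, because it never guesses at names; it only follows whatever links the page actually contains.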

  • You are absolutely right. In this course I'm supposed to learn how to download the links that point to PDF files without an HTML parser. EDIT: I'm taking this course voluntarily in my free time. – bCM May 16 '15 at 16:43
  • I guess then your learning goal should be to come up with different strategies for doing that **yourself**. What do you learn from other people exploring the option space for you? – GhostCat May 16 '15 at 16:45
  • On one hand I can't argue against that, because you're right. On the other hand, I'm doing a project course with a few classmates and I wanted to implement a multiplayer option. For that I had to learn network programming, so time is short and running out. – bCM May 16 '15 at 16:53
  • The point is: what exactly do you want to learn? Is it more about the "networking" part, or is your actual question: what are the best ways to parse HTML? If the latter is true, you can try to use regular expressions (see the sketch after these comments); but of course, as powerful as they are, regular expressions are still not capable of "handling" generic HTML input. – GhostCat May 16 '15 at 17:11
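
If you do go the regex route despite those caveats, a minimal sketch could look like the following (the class name is made up, and it only prints href values ending in ".pdf"; resolving relative links and downloading them would still be up to you):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PdfHrefGrep {
    public static void main(String[] args) throws Exception {
        // Read the whole page into one string
        StringBuilder html = new StringBuilder();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new URL("http://cs.lth.se/eda095/foerelaesningar/?no_cache=1").openStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                html.append(line).append('\n');
            }
        }

        // Grab href values that end in ".pdf"; this only copes with straightforward markup
        Pattern hrefPdf = Pattern.compile("href=[\"']([^\"']+\\.pdf)[\"']", Pattern.CASE_INSENSITIVE);
        Matcher m = hrefPdf.matcher(html);
        while (m.find()) {
            System.out.println(m.group(1)); // may be relative; still needs resolving and downloading
        }
    }
}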