
I have multiple HTML files in a folder. The code below lists all of them and parses each one with Jsoup. I can't manage to write the results of all the parsed files to a single text file: I only get the result of the last file that was parsed. What's wrong?

The code is:

package jsouppackage;

import java.io.File;
import java.io.IOException;
import java.io.PrintWriter;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Main {

    public static void main(String[] args) {
        File input = new File("C:/html");
        File[] st = input.listFiles();
        for (int i = 0; i < st.length; i++) {
            if(st[i].isFile()){
                parse(st[i]);
            }
        }

    }

    private static void parse(File input ) {
        Document doc;

        try{

            doc = Jsoup.parse(input, "UTF-8", "");


            Elements ids = doc.select("div[id^=osdi] p");
            PrintWriter out = new PrintWriter("C:/html/output/output.txt", "UTF-8");

            for (Element id : ids){

                out.println("\n"+id.text());

            }
            out.close();

        }catch(IOException e){

        }
    }
}

Thanks for your help

Learn15773

1 Answer


Each time you invoke

PrintWriter out = new PrintWriter("C:/html/output/output.txt", "UTF-8");

you create a new file, which means the content of the old file is discarded. What you want is for the writer to append data to the existing file, or to create the file if it doesn't exist yet.

So if you want to set the encoding, you can use

OutputStreamWriter(OutputStream out, String charsetName)

and since it accepts an OutputStream instead of a Writer, to write to a file in append mode use

FileOutputStream(String name, boolean append)

where you set the append parameter to true.


In other words you can use

String outputFile = "C:/html/output/output.txt";
FileOutputStream fos = new FileOutputStream(outputFile, true);
PrintWriter out = new PrintWriter(new OutputStreamWriter(fos, "UTF-8"));

or, to improve performance, add buffering by wrapping the writer in a BufferedWriter decorator

String outputFile = "C:/html/output/output.txt";
FileOutputStream fos = new FileOutputStream(outputFile, true);
PrintWriter out = new PrintWriter(new BufferedWriter(
        new OutputStreamWriter(fos, "UTF-8")));
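To see the effect of the append flag in isolation, here is a minimal sketch that writes two lines to a temporary file, reopening the writer for each line just as the original parse method reopens it for each HTML file. The class name AppendDemo and its helper methods are made up for this illustration, and a temporary file stands in for C:/html/output/output.txt.

```java
import java.io.BufferedWriter;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

class AppendDemo {

    // Opens the file, writes one line, and closes it again.
    // The 'append' flag decides whether earlier content survives.
    static void writeLine(Path file, String line, boolean append) {
        try (PrintWriter out = new PrintWriter(new BufferedWriter(
                new OutputStreamWriter(
                        new FileOutputStream(file.toFile(), append),
                        StandardCharsets.UTF_8)))) {
            out.println(line);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes two lines with separate writers and returns the file's final content.
    static List<String> run(boolean append) {
        try {
            Path file = Files.createTempFile("append-demo", ".txt");
            try {
                writeLine(file, "first", append);
                writeLine(file, "second", append);
                return Files.readAllLines(file, StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(run(false)); // prints [second]
        System.out.println(run(true));  // prints [first, second]
    }
}
```

With append set to false only the last line survives, which matches the symptom in the question.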

BTW, you shouldn't close your writers/readers/streams inside the try block: if an exception is thrown before close() is reached, the resource is never closed. Close them in a finally block instead, or, to make things easier, use try-with-resources. Also, never leave catch blocks empty; always at least print information about the thrown exception with e.printStackTrace();

So your parse method can look like

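// Additional imports needed: java.io.BufferedWriter, java.io.FileOutputStream,
// java.io.OutputStreamWriter
private static void parse(File input) {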

    String outputFile = "C:/html/output/output.txt";

    try (FileOutputStream fos = new FileOutputStream(outputFile, true);
         PrintWriter      out = new PrintWriter(new BufferedWriter(
                    new OutputStreamWriter(fos, "UTF-8")))) {

        Document doc = Jsoup.parse(input, "UTF-8", "");
        Elements ids = doc.select("div[id^=osdi] p");

        for (Element id : ids) {
            out.println("\n" + id.text());
        }
        //out.close(); // this will be invoked automatically now
    } catch (IOException e) {
        e.printStackTrace();
    }
}
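One consequence of append mode is that running the program a second time adds the same results to output.txt again. A possible alternative, sketched below and not part of the code above, is to open the writer once and pass it to parse, so the file is truncated once per run and then filled by every input. To keep the sketch runnable without Jsoup, the hypothetical extract method simply returns the lines of a text file; in the real program it would be replaced by the Jsoup.parse and doc.select calls. The names SingleWriterDemo, extract, processAll and demo are all invented for this example.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

class SingleWriterDemo {

    // Hypothetical stand-in for the Jsoup extraction step:
    // here it just returns the lines of the input file.
    static List<String> extract(Path input) throws IOException {
        return Files.readAllLines(input, StandardCharsets.UTF_8);
    }

    // Writes the results for one input to a writer owned by the caller.
    static void parse(Path input, PrintWriter out) throws IOException {
        for (String text : extract(input)) {
            out.println(text);
        }
    }

    // Opens the output once (truncating it), then processes every input.
    static void processAll(List<Path> inputs, Path output) {
        try (PrintWriter out = new PrintWriter(
                Files.newBufferedWriter(output, StandardCharsets.UTF_8))) {
            for (Path input : inputs) {
                parse(input, out);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Builds two small input files, runs the program twice, returns the output.
    static List<String> demo() {
        try {
            Path dir = Files.createTempDirectory("single-writer-demo");
            Path a = Files.write(dir.resolve("a.txt"),
                    Arrays.asList("a1", "a2"), StandardCharsets.UTF_8);
            Path b = Files.write(dir.resolve("b.txt"),
                    Arrays.asList("b1"), StandardCharsets.UTF_8);
            Path output = dir.resolve("output.txt");

            processAll(Arrays.asList(a, b), output);
            processAll(Arrays.asList(a, b), output); // second run overwrites the first

            List<String> result = Files.readAllLines(output, StandardCharsets.UTF_8);
            for (Path p : Arrays.asList(a, b, output, dir)) {
                Files.delete(p); // clean up the temporary files, directory last
            }
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [a1, a2, b1]
    }
}
```

Which variant fits better depends on whether results should accumulate across runs (append mode) or be regenerated each time (single writer).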
Pshemo
  • Wow, it's excellent, thanks! Could I ask what "fos" is and what the BufferedWriter decorator does? – Learn15773 Oct 25 '14 at 11:46
  • Decorator is a design pattern based on wrapping an instance of one class in an instance of a different class of a similar type. Java's input/output classes use it so that we can build our own combinations of the behaviour we want. For example, to create a stream that can write to a file we can use `FileOutputStream` (I created one and named the reference `fos`). This class let me append content, but it didn't let me specify the encoding, so I wrapped the instance in an `OutputStreamWriter`, which takes the encoding as its second argument. – Pshemo Oct 25 '14 at 11:57
  • Now the problem is that `OutputStreamWriter` doesn't have a `println` method, so I needed to wrap it in another writer that does, such as `PrintWriter`. – Pshemo Oct 25 '14 at 12:00
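The layering described in these comments can be sketched as follows. This is only an illustration: the class name DecoratorDemo and the encode method are invented, and a ByteArrayOutputStream replaces the FileOutputStream so the resulting bytes can be inspected in memory.

```java
import java.io.BufferedWriter;
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DecoratorDemo {

    // Sends text through the same chain of decorators as in the answer
    // and returns the bytes that reach the underlying stream.
    static byte[] encode(String text, Charset charset) {
        // Innermost layer: a sink for raw bytes (stands in for FileOutputStream).
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        // Converts characters to bytes using the chosen encoding.
        Writer encoder = new OutputStreamWriter(sink, charset);
        // Collects small writes in memory before passing them on.
        Writer buffered = new BufferedWriter(encoder);
        // Outermost layer: adds the convenient print/println methods.
        try (PrintWriter out = new PrintWriter(buffered)) {
            out.print(text);
        } // closing the outer writer flushes and closes the whole chain
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        String text = "\u00e9"; // the letter e with an acute accent
        System.out.println(encode(text, StandardCharsets.UTF_8).length);      // prints 2
        System.out.println(encode(text, StandardCharsets.ISO_8859_1).length); // prints 1
    }
}
```

Each wrapper adds one behaviour and delegates the rest to the object it wraps, which is why the same character ends up as a different number of bytes depending only on the charset given to the OutputStreamWriter.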