
I have source files in Cp1250 encoding. All of those files are in the dirName directory or its subdirectories. I would like to merge them into one UTF-8 file by concatenating their contents. Unfortunately I get an empty line at the beginning of the result file.

public static void processDir(String dirName, String resultFileName) {
    try {
        File resultFile = new File(resultFileName);
        BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(resultFile), "utf-8"));
        Files.walk(Paths.get(dirName)).filter(Files::isRegularFile).forEach((path) -> {
            try {
                Files.readAllLines(path, Charset.forName("Windows-1250")).stream().forEach((line) -> {
                    try {
                        bw.newLine();
                        bw.write(line);                     
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                });
            } catch (Exception e) {
                e.printStackTrace();
            }
        });
        bw.close();
    } catch (Exception e) {
        e.printStackTrace();
    }
}

The reason is that I don't know how to detect the first file in my stream.


I came up with an extremely clumsy solution which does not really rely on streams, so it is unsatisfactory:

public static void processDir(String dirName, String resultFileName) {
    try {
        File resultFile = new File(resultFileName);
        BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(resultFile), "utf-8"));
        Files.walk(Paths.get(dirName)).filter(Files::isRegularFile).forEach((path) -> {
            try {
                Files.readAllLines(path, Charset.forName("Windows-1250")).stream().forEach((line) -> {
                    try {
                        if (resultFile.length() != 0) {
                            bw.newLine();
                        }
                        bw.write(line);
                        if (resultFile.length() == 0) {
                            bw.flush();
                        }
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                });
            } catch (Exception e) {
                e.printStackTrace();
            }
        });
        bw.close();
    } catch (Exception e) {
        e.printStackTrace();
    }
}

I could also use a static boolean flag, but that would be total gibberish.

    Why not write the newLine at the end each time? This is how most files are written. – Peter Lawrey Nov 18 '15 at 08:32
  • @PeterLawrey Then I get dangling empty line at the end. – Yoda Nov 18 '15 at 08:33
  • 2
    Again, this is how most text files are written. There is a new line at the end of the last line. – Peter Lawrey Nov 18 '15 at 08:34
  • Most specifically, there is a new line at the end of every line as written by `println` – Peter Lawrey Nov 18 '15 at 08:35
  • @PeterLawrey I could leave it as it is, but mathematically speaking output file with empty line at the end is not a sum of input files without those empty lines at the end. I'll leave the question for a while. Thanks for commentary. – Yoda Nov 18 '15 at 08:39
  • If you want a trivial concatenation, why not use java.io.Reader and java.io.Writer, and perform a direct "copy" from one to the other? No more end-of-line guessing, just pure input/output copying of what's actually in the file. – GPI Nov 18 '15 at 08:45
  • @GPI Done it that way too, without streams, It worked. I wanted to polish my stream skills, because there aren't any. – Yoda Nov 18 '15 at 08:46
  • 1
    Yes, but your issue here is not with Java 8 Streams; it is that the `readAllLines` method is **destructive**, as it strips line endings, leaving you with no clue whether there were some or not, and what they actually were (`\r`, `\n`, `\r\n`, ...) – GPI Nov 18 '15 at 08:48
  • 1
    You are discarding the end of the line, so you have no idea which newline was used (CR/LF, LF, or CR) or whether the last line had a newline at all, so I didn't think you were too worried about mathematical correctness. – Peter Lawrey Nov 18 '15 at 08:50
  • If you want to preserve the original new lines and just concatenate them, you could copy the original file without breaking it into lines at all. This would be closer to the original (apart from a change in character encoding) but also quite a bit faster. – Peter Lawrey Nov 18 '15 at 08:53
  • BTW if you read a file while it is in the middle of being written, it is likely that the last line written will be incomplete and without a newline. Some might consider text without a trailing newline a sign of a corrupt or incomplete file. – Peter Lawrey Nov 18 '15 at 08:56

2 Answers


You can use flatMap to create a stream of all lines of all files, then use flatMap again to interleave it with the line separator, then use skip(1) to drop the leading separator, like this:

public static void processDir(String dirName, String resultFileName) {
    try(BufferedWriter bw = Files.newBufferedWriter(Paths.get(resultFileName))) {
        Files.walk(Paths.get(dirName)).filter(Files::isRegularFile)
            .flatMap(path -> {
                try {
                    return Files.lines(path, Charset.forName("Windows-1250"));
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            })
            .flatMap(line -> Stream.of(System.lineSeparator(), line))
            .skip(1)
            .forEach(line -> {
                try {
                    bw.write(line);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
    } catch (IOException e) {
        throw new UncheckedIOException(e);
    }
}

In general, the flatMap+skip combination can help solve many similar problems.
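The same interleave-then-skip trick works for any separator-joining problem, not just files. Here's a minimal, self-contained sketch (FlatMapSkipDemo and joinWith are hypothetical names, not from the answer):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class FlatMapSkipDemo {

    // Put the separator before every element, then skip(1) to drop
    // the leading separator -- the same trick as in the answer above.
    static String joinWith(List<String> parts, String sep) {
        return parts.stream()
                .flatMap(p -> Stream.of(sep, p))
                .skip(1)
                .collect(Collectors.joining());
    }

    public static void main(String[] args) {
        System.out.println(joinWith(Arrays.asList("a", "b", "c"), "-")); // a-b-c
    }
}
```

Because the separator only ever appears *before* an element, there is no dangling separator at the end, and skip(1) removes the one at the start.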

Also note the Files.newBufferedWriter method, which is a simpler way to create a BufferedWriter. And don't forget about try-with-resources.
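For reference, a minimal sketch of that pattern on its own (WriterDemo and writeUtf8 are hypothetical names): since Java 8, the no-charset overload of Files.newBufferedWriter defaults to UTF-8, and try-with-resources guarantees the writer is closed even if an exception is thrown mid-write.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class WriterDemo {

    // Files.newBufferedWriter(Path) defaults to UTF-8 (Java 8+);
    // try-with-resources closes the writer on success and on failure.
    static void writeUtf8(Path target, String text) throws IOException {
        try (BufferedWriter bw = Files.newBufferedWriter(target)) {
            bw.write(text);
        }
    }
}
```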

Tagir Valeev

Rethink your strategy. If you want to join files and neither remove nor convert line terminators, there is no reason to process lines at all. It seems the only reason you wrote line-processing code is a desire to bring lambda expressions and streams into the solution, and the only such possibility offered by the current API is to process streams of lines. But obviously, they are not the right tool for the job:

public static void processDir(String dirName, String resultFileName) throws IOException {
    // requires: import static java.nio.file.StandardOpenOption.*;
    Charset cp1250 = Charset.forName("Windows-1250");
    CharBuffer buffer = CharBuffer.allocate(8192);
    try(BufferedWriter bw
          = Files.newBufferedWriter(Paths.get(resultFileName), CREATE, TRUNCATE_EXISTING)) {
        Files.walkFileTree(Paths.get(dirName), new SimpleFileVisitor<Path>() {
            @Override public FileVisitResult visitFile(
                             Path path, BasicFileAttributes attrs) throws IOException {
                try(BufferedReader r = Files.newBufferedReader(path, cp1250)) {
                    // copy whatever was read into the buffer, then reset it
                    while(r.read(buffer) > 0) {
                        bw.write(buffer.array(), buffer.arrayOffset(), buffer.position());
                        buffer.clear();
                    }
                }
                return FileVisitResult.CONTINUE;
            }
        });
    }
}

Note how this solution solves the problems of your first attempt. You don't have to deal with line terminators here; this code doesn't even waste resources trying to find them in the input. All it does is perform the charset conversion on chunks of input data and write them to the target. The performance difference can be significant.

Further, the code isn't cluttered with catching exceptions that you can't handle. If an IOException occurs at any point of the operation, all pending resources are properly closed and the exception is relayed to the caller.

Granted, it just uses a good old inner class instead of a lambda expression, but that doesn't reduce the readability compared to your attempt. If it still really bothers you that there is no lambda expression involved, you may check this question & answer for a way to bring them back in.
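As a side note, if you can target Java 10 or later, the same chunked, line-agnostic copy can be expressed even more compactly with Reader.transferTo. This is a sketch under that assumption (TransferDemo is a hypothetical name, and it is not part of the original answer):

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class TransferDemo {

    // Concatenate every regular file under dirName into one UTF-8 file,
    // converting from Cp1250 without ever splitting the input into lines.
    static void processDir(String dirName, String resultFileName) throws IOException {
        Charset cp1250 = Charset.forName("Windows-1250");
        try (BufferedWriter bw = Files.newBufferedWriter(
                 Paths.get(resultFileName), StandardCharsets.UTF_8);
             Stream<Path> paths = Files.walk(Paths.get(dirName))) {
            for (Path path : (Iterable<Path>) paths.filter(Files::isRegularFile)::iterator) {
                try (BufferedReader r = Files.newBufferedReader(path, cp1250)) {
                    r.transferTo(bw); // chunked copy; Reader.transferTo is Java 10+
                }
            }
        }
    }
}
```

Like the visitor-based version, it never inspects line terminators; the decoding reader and encoding writer do all the charset work.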

Holger