I have a tar.gz file containing a huge number of small XML files (slightly fewer than 1.5 million, no subdirectories). I want to iterate through them, and I am trying to use Apache Commons Compress to do so. I don't want to extract or write anything to a new file, as is often seen in similar topics. I just want to read the contents incrementally (ideally I could stop at some point and continue on another run of the program, but that's secondary).

So for starters I thought I should begin with something small like this (the counter exists only for testing purposes, to reduce run time):

public static void readTar(String in) throws IOException {
    try (TarArchiveInputStream tarArchiveInputStream =
                 new TarArchiveInputStream(
                         new BufferedInputStream(
                                 new GzipCompressorInputStream(
                                         new FileInputStream(in))))){
        TarArchiveEntry entry;
        int counter = 0;
        while ((entry = tarArchiveInputStream.getNextTarEntry()) != null && counter < 1000) {
            counter++;
            System.out.println(entry.getFile());
        }
    }
}

But the result of entry.getFile() is always null, so I cannot work with the entry's content, while entry.getName() returns the expected result.

I would be glad if someone could point out my mistake.

Wolfone
  • Here's an example of reading from a tar file: http://thinktibits.blogspot.com/2013/01/read-extract-tar-file-java-example.html – Devon_C_Miller Nov 10 '18 at 03:48
  • Good reference! Thank you! Explains exactly what has to be done to solve the initial question, which is why I think this should be the accepted answer. If you post it with the relevant code-snippet I will accept that. Otherwise I would do that in a few days so the relevant code is in an answer that is not just a link for possible future reference. – Wolfone Nov 10 '18 at 11:52

1 Answer

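The Javadoc for getFile says the method is only useful for entries that were created from a file on disk, not for entries read from an archive. For entries read from an archive it returns null, which is what you are seeing.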

https://commons.apache.org/proper/commons-compress/apidocs/org/apache/commons/compress/archivers/tar/TarArchiveEntry.html#getFile--

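I believe you need to call "read" on the TarArchiveInputStream itself. After getNextTarEntry() the stream is positioned at that entry's data, and read stops at the end of the current entry: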

https://commons.apache.org/proper/commons-compress/javadocs/api-1.18/org/apache/commons/compress/archivers/tar/TarArchiveInputStream.html#read-byte:A-int-int-
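
As a rough sketch of what that could look like (not tested against your archive): the class name TarGzReader, the method readEntries, and the maxEntries parameter are made up for illustration, and I am assuming the XML files are UTF-8. Collecting everything in a map is only for demonstration; with 1.5 million files you would process each entry inside the loop instead of keeping it.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream;

public class TarGzReader {

    /**
     * Reads up to maxEntries regular files from a gzipped tar stream and
     * returns their contents keyed by entry name. Hypothetical helper for
     * illustration; assumes the entries are UTF-8 text.
     */
    public static Map<String, String> readEntries(InputStream tarGz, int maxEntries)
            throws IOException {
        Map<String, String> contents = new LinkedHashMap<>();
        try (TarArchiveInputStream tar = new TarArchiveInputStream(
                new GzipCompressorInputStream(new BufferedInputStream(tarGz)))) {
            TarArchiveEntry entry;
            byte[] buffer = new byte[8192];
            // Check the limit first so no extra entry is consumed once it is reached.
            while (contents.size() < maxEntries
                    && (entry = tar.getNextTarEntry()) != null) {
                if (!entry.isFile()) {
                    continue; // skip directories, links, etc.
                }
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                int n;
                // read() returns -1 at the end of the current entry, not of the archive.
                while ((n = tar.read(buffer, 0, buffer.length)) != -1) {
                    out.write(buffer, 0, n);
                }
                contents.put(entry.getName(),
                        new String(out.toByteArray(), StandardCharsets.UTF_8));
            }
        }
        return contents;
    }
}
```

You would call it with something like TarGzReader.readEntries(new FileInputStream(in), 1000). Note that no temporary files are written; each entry's bytes come straight from the decompressed stream.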

The other thing I do when figuring out how a library works is attach its source in my IDE and read the library code to understand what is actually happening under the hood.

  • Shame on me! I was confused by the TarArchiveEntry class doc and was so busy asking myself how to properly construct one that I didn't bother to look into the method docs, as the method's purpose seemed absolutely clear to me. Good advice, btw. I still tend to forget that looking at the source of libraries can often be useful! – Wolfone Nov 10 '18 at 11:51