I have gone through the link on how to extract a .tar file and several links on SO about doing this in Java. However, I didn't find anything that addresses my concern, which is multilevel (nested) .tar/.tgz/.zip files. My concern is with something like the following:

Abc.tar.gz
    --DEF.tar
          --sample1.txt
          --sample2.txt 
    --FGH.tgz
          --sample3.txt
-sample4.txt    

This is a simple example. In practice the archives can be nested in any combination, such as a .tar inside a .tar inside a .gz inside a .tgz, and so on.

My problem is that I can only extract the first level using the Apache Commons Compress library. That is, when Abc.tar.gz is extracted, only DEF.tar is available in the destination/output folder. Beyond that, my extraction does not work.

I tried to feed the output of the first extraction to the second as input on the fly, but I got stuck with a FileNotFoundException: at that point the output file is not yet in place, so the second extraction cannot find the file.

Pseudocode:

import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream;

public class CommonExtraction {

    public static void extract(String sourcePath, String destPath) {
        String name = sourcePath.trim().toLowerCase();
        if (name.endsWith(".tar.gz") || name.endsWith(".tgz")) {
            try {
                TarArchiveInputStream tar = new TarArchiveInputStream(
                        new GzipCompressorInputStream(
                                new BufferedInputStream(new FileInputStream(sourcePath))));
                extractTar(tar, destPath);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

    public static void extractTar(TarArchiveInputStream tar, String outputFolder) {
        try {
            TarArchiveEntry entry;
            while (null != (entry = tar.getNextTarEntry())) {
                if (entry.getName().trim().toLowerCase().endsWith(".tar")) {
                    final String path = outputFolder + entry.getName();
                    // Failing here: the nested .tar decompressed from the .gz
                    // is not yet available at the destination path
                    tar = new TarArchiveInputStream(new BufferedInputStream(new FileInputStream(path)));
                    extractTar(tar, outputFolder);
                }
                extractEntry(entry, tar, outputFolder);
            }
            tar.close();
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }

    public static void extractEntry(TarArchiveEntry entry, InputStream tar, String outputFolder) {
        final String path = outputFolder + entry.getName();
        if (entry.isDirectory()) {
            new File(path).mkdirs();
        } else {
            // create the parent directory for the file if it does not exist
        }
        // code to read and write until the last byte is encountered
    }
}

PS: Please ignore any syntax errors in the code.

Olivier
Trips

2 Answers

Try this:

try (InputStream fi = file.getInputStream();
    InputStream bi = new BufferedInputStream(fi);
    InputStream gzi = new GzipCompressorInputStream(bi, false);
    ArchiveInputStream archive = new TarArchiveInputStream(gzi)) {

        withArchiveStream(archive, result::appendEntry);
}

As far as I can see, .tar.gz and .tgz are the same format. My withArchiveStream method is:

private void withArchiveStream(ArchiveInputStream archInStream, BiConsumer<ArchiveInputStream, ArchiveEntry> entryConsumer) throws IOException {
    ArchiveEntry entry;
    while((entry = archInStream.getNextEntry()) != null) {
        entryConsumer.accept(archInStream, entry);
    }
}

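private void appendEntry(ArchiveInputStream archive, ArchiveEntry entry) {

    if (!archive.canReadEntryData(entry)) {
        // BiConsumer.accept cannot throw checked exceptions, so wrap the IOException
        throw new UncheckedIOException(new IOException("Can't read archive entry"));
    }

    if (entry.isDirectory()) {
        return;
    }

    // For example, print the entry's content (readAllBytes requires Java 9+)
    try {
        String content = new String(archive.readAllBytes(), StandardCharsets.UTF_8);
        System.out.println(content);
    } catch (IOException e) {
        throw new UncheckedIOException(e);
    }
}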
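
A note on the callback type: `BiConsumer.accept` declares no checked exceptions, so a callback that does I/O has to either wrap `IOException` in an unchecked exception or use a custom functional interface. Below is a minimal sketch of the second option. It is not part of the original answer: `EntryWalker`, `IOBiConsumer`, `sampleZip` and `readAll` are hypothetical names, and the JDK's `ZipInputStream` stands in for Commons Compress's `ArchiveInputStream` only so that the sketch runs without extra dependencies.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class EntryWalker {

    /** Hypothetical callback type: like BiConsumer, but allowed to throw IOException. */
    @FunctionalInterface
    interface IOBiConsumer<T, U> {
        void accept(T t, U u) throws IOException;
    }

    /** Visits every entry; during the callback the stream is positioned at that entry's data. */
    static void withArchiveStream(ZipInputStream in,
                                  IOBiConsumer<ZipInputStream, ZipEntry> consumer) throws IOException {
        ZipEntry entry;
        while ((entry = in.getNextEntry()) != null) {
            consumer.accept(in, entry);
        }
    }

    /** Builds a small in-memory zip for the demo (names and contents are made up). */
    static byte[] sampleZip() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("sample1.txt"));
            zip.write("one".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
            zip.putNextEntry(new ZipEntry("sample2.txt"));
            zip.write("two".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return bytes.toByteArray();
    }

    /** Reads each entry as UTF-8 text and returns "name=content" strings. */
    static List<String> readAll(byte[] zipBytes) throws IOException {
        List<String> result = new ArrayList<>();
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            withArchiveStream(in, (stream, entry) -> {
                if (entry.isDirectory()) {
                    return;
                }
                // readAllBytes() (Java 9+) reads up to the end of the current entry
                String content = new String(stream.readAllBytes(), StandardCharsets.UTF_8);
                result.add(entry.getName() + "=" + content);
            });
        }
        return result;
    }
}
```

Calling `EntryWalker.readAll(EntryWalker.sampleZip())` should yield one string per file entry. The same shape should carry over to a `TarArchiveInputStream`, since it also exposes `getNextEntry()` and ends each read at the entry boundary.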
George_A
  • Perhaps you should mention that class `GzipCompressorInputStream` is part of Apache Commons? – Abra Jul 31 '22 at 05:53
  • Yes. org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream – George_A Jul 31 '22 at 11:47
  • @George_A Thanks for your reply. Sorry, but to try this I need to understand it clearly. Do you want me to use the above at the very start, where I check the file type (`.tar.gz` or `.tgz`), rather than in the `extractTar` or `extractEntry` method (referring to my code snippet)? Also, I am confused by your call `withArchiveStream(archive, result::appendEntry)`: what does it contain? Why are you passing the boolean `false` to `GzipCompressorInputStream`? And will it handle all scenarios, or only the .gz type that I can see? – Trips Jul 31 '22 at 14:49
  • I'll expand my answer. – George_A Jul 31 '22 at 17:07
  • @George_A Sorry, but I am confused here. I am trying to understand whether this is a completely new approach or something I can use in the pseudocode I provided. I also cannot tell what `result` is or how `appendEntry` is used, since you are not calling that method directly. Can you please elaborate on your sample code and say whether I can use it in mine? Thanks. – Trips Aug 01 '22 at 16:05

You have a recursive problem, so you can use recursion to solve it. Here is some pseudocode to show how it can be done:

public class ArchiveExtractor
{
    public void extract(File file)
    {
        List<File> files = new ArrayList<>(); // list of extracted files (stays empty for unknown types)

        if(isZip(file))
            files = extractZip(file);
        else if(isTGZ(file))
            files = extractTGZ(file);
        else if(isTar(file))
            files = extractTar(file);
        else if(isGZip(file))
            files = extractGZip(file);

        for(File f : files)
        {
            if(isArchive(f))
                extract(f); // recursive call
        }
    }

    private List<File> extractZip(File file)
    {
        // extract archive and return list of extracted files
    }

    private List<File> extractTGZ(File file)
    {
        // extract archive and return list of extracted files
    }

    private List<File> extractTar(File file)
    {
        // extract archive and return list of extracted files
    }

    private List<File> extractGZip(File file)
    {
        // extract archive and return list of extracted file
    }
}

where:

  • isZip() tests if the file extension is zip
  • isTGZ() tests if the file extension is tgz
  • isTar() tests if the file extension is tar
  • isGZip() tests if the file extension is gz
  • isArchive() means isZip() || isTGZ() || isTar() || isGZip()
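
The helper predicates listed above could look roughly like this. This is a sketch rather than the answerer's code: the class name `ArchiveTypes` and the `lowerName` helper are made up, and treating `.tar.gz` as equivalent to `.tgz` (and therefore excluding it from the plain-gzip check) is an assumption based on the question.

```java
import java.io.File;
import java.util.Locale;

final class ArchiveTypes {

    /** Lower-cased file name, so that DEF.TAR and def.tar are treated alike. */
    static String lowerName(File f) {
        return f.getName().trim().toLowerCase(Locale.ROOT);
    }

    static boolean isZip(File f) {
        return lowerName(f).endsWith(".zip");
    }

    /** Assumption: .tar.gz is handled the same way as .tgz. */
    static boolean isTGZ(File f) {
        String n = lowerName(f);
        return n.endsWith(".tgz") || n.endsWith(".tar.gz");
    }

    static boolean isTar(File f) {
        return lowerName(f).endsWith(".tar");
    }

    /** Plain gzip only: .gz, but not .tar.gz, which isTGZ() already covers. */
    static boolean isGZip(File f) {
        String n = lowerName(f);
        return n.endsWith(".gz") && !n.endsWith(".tar.gz");
    }

    static boolean isArchive(File f) {
        return isZip(f) || isTGZ(f) || isTar(f) || isGZip(f);
    }
}
```

Extension checks are cheap but trust the file name. If names cannot be trusted, inspecting the first bytes of the file is an alternative.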

As for the directory where each archive is extracted: you are free to do as you want. If you process test.zip for example, you may extract in the same directory as where the archive is, or create the directory test and extract in it.
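
To make the recursion concrete, here is a minimal runnable sketch that uses only the JDK, so it handles nested .zip files rather than tar archives. `NestedZipExtractor` and its method names are hypothetical, and extracting each archive into a sibling directory named after it is just one of the conventions mentioned above. Each extraction method collects the files it writes in a list and returns it, which is what lets the caller recurse.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

final class NestedZipExtractor {

    /** Extracts the archive next to itself, then recurses into any extracted archives. */
    static List<File> extract(File archive) throws IOException {
        // Convention chosen for this sketch: Abc.zip is extracted into a sibling directory Abc/
        String name = archive.getName();
        int dot = name.lastIndexOf('.');
        String base = dot > 0 ? name.substring(0, dot) : name + ".out";
        File outDir = new File(archive.getAbsoluteFile().getParentFile(), base);

        List<File> extracted = extractZip(archive, outDir);
        List<File> all = new ArrayList<>(extracted);
        for (File f : extracted) {
            if (isZip(f)) {
                all.addAll(extract(f)); // recursive call: the nested archive is on disk by now
            }
        }
        return all;
    }

    static boolean isZip(File f) {
        return f.getName().toLowerCase(Locale.ROOT).endsWith(".zip");
    }

    /** Writes every file entry under outDir and returns the files that were written. */
    static List<File> extractZip(File archive, File outDir) throws IOException {
        List<File> files = new ArrayList<>();
        Path root = outDir.toPath().toAbsolutePath().normalize();
        try (ZipInputStream in = new ZipInputStream(Files.newInputStream(archive.toPath()))) {
            ZipEntry entry;
            while ((entry = in.getNextEntry()) != null) {
                Path target = root.resolve(entry.getName()).normalize();
                // Reject entries such as ../../x that would land outside the output directory
                if (!target.startsWith(root)) {
                    throw new IOException("Entry escapes output directory: " + entry.getName());
                }
                if (entry.isDirectory()) {
                    Files.createDirectories(target);
                    continue;
                }
                Files.createDirectories(target.getParent());
                // Copies up to the end of the current entry; does not close the zip stream
                Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
                files.add(target.toFile());
            }
        }
        return files;
    }
}
```

The same structure should carry over to tar and tgz by adding an `extractTar` that reads from Commons Compress's `TarArchiveInputStream` (wrapped around a `GzipCompressorInputStream` for .tgz) and dispatching on the file type in `extract`. Because the recursive call only happens after the current archive has been fully written to disk, the nested file exists by the time it is opened.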

Olivier
  • Thanks. I believe that in `public void extract(File file)`, `file` is the input/source file. But I have a doubt about the destination path where it gets extracted. I also didn't understand `isArchive()`: since there is a separate check for each file type, what is `isArchive()` used for? Does it cover all file types at once? And how do I check for the different file types at the destination path once extraction starts? – Trips Aug 03 '22 at 10:17
  • While implementing the above logic, I am stuck on extracting each file type and returning the list of extracted files. Most of my extraction methods are `void`, and I cannot simply change them to return a list of files because during extraction I have to go down to the write level for each file type. Can you suggest something? – Trips Aug 08 '22 at 02:15
  • @Trips I'm not sure I understand your problem. Every time you extract a file, you know where you extract it, so you can keep (or create) a `File` object that represents that file. – Olivier Aug 08 '22 at 07:31
  • If you look at the pseudocode I provided, `extractEntry` is a simple static `void` method containing the read/write logic that runs until the last byte of the file is read. According to the recursive logic, when I extract a file I am expected to return the list of extracted files. My concern is what I can return other than null, since only read/write logic is present there. – Trips Aug 08 '22 at 10:58
  • @Trips Create an `ArrayList`; for every processed `extractEntry`, add a file to it; return the list. – Olivier Aug 08 '22 at 18:36
  • @Trips See [here](https://stackoverflow.com/a/7556307/12763954) for an example. – Olivier Aug 08 '22 at 18:43
  • I am stuck again on the original issue, the FileNotFoundException. Using the code from the above link, I get `java.io.FileNotFoundException (The system cannot find the path specified)` at the line `final TarArchiveInputStream debInputStream = (TarArchiveInputStream) new ArchiveStreamFactory().createArchiveInputStream("tar", is)`, even though the input file path exists and I don't see any permission issue. It seems some kind of lock is happening at this level. – Trips Aug 09 '22 at 15:33
  • @Trips Are you sure the exception is not raised by the line just before? The file is opened by `InputStream is = new FileInputStream(inputFile);`. – Olivier Aug 10 '22 at 07:28
  • Sorry, my bad, and thanks. It was failing at the step above, and I have fixed that error. But now I am getting `java.io.IOException: Truncated TAR archive` at the line `IOUtils.copy(debInputStream, outputFileStream)` in the `unTar()` method. Any suggestions? – Trips Aug 10 '22 at 12:55