
Referring to: http://www.pixeldonor.com/2013/oct/12/concurrent-zip-compression-java-nio/

I'm trying to unzip a 5 GB zip file. On average it takes about 30 minutes, which is too long for our app, so I'm trying to reduce the time.

I've tried a lot of combinations: changed the buffer size (by default my write chunk is 4096 bytes), changed NIO methods and libraries, and all the results are pretty much the same.

One thing I haven't tried yet is splitting the zipped file into chunks, so it can be read by multiple threads in parallel.

The code snippet is:

  private static final int BUFSIZE = 4096; // the write chunk size mentioned above
  private static ExecutorService e = Executors.newFixedThreadPool(20);
  public static void main(String argv[]) {
        try {
            String selectedZipFile = "/Users/xx/Documents/test123/large.zip";
            String selectedDirectory = "/Users/xx/Documents/test2";
            long st = System.currentTimeMillis();

            unzip(selectedDirectory, selectedZipFile);

            System.out.println(System.currentTimeMillis() - st);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }


public static void unzip(String targetDir, String zipFilename) {
    try {
        List<ZipEntry> list = new ArrayList<>();
        ZipInputStream archive = new ZipInputStream(new BufferedInputStream(new FileInputStream(zipFilename)));
        ZipEntry entry;
        while ((entry = archive.getNextEntry()) != null) {
            list.add(entry);
        }

        for (List<ZipEntry> partition : Lists.partition(list, 1000)) {
            e.submit(new Multi(targetDir, partition, archive));
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
}

and the runnable is:

  static class Multi implements Runnable {

    private List<ZipEntry> partition;
    private ZipInputStream zipInputStream;
    private String targetDir;

    public Multi(String targetDir, List<ZipEntry> partition, ZipInputStream zipInputStream) {
        this.partition = partition;
        this.zipInputStream = zipInputStream;
        this.targetDir = targetDir;
    }

    @Override
    public void run() {
        for (ZipEntry entry : partition) {
            File entryDestination = new File(targetDir, entry.getName());
            if (entry.isDirectory()) {
                entryDestination.mkdirs();
            } else {
                entryDestination.getParentFile().mkdirs();

                BufferedOutputStream output = null;
                try {
                    int n;
                    byte buf[] = new byte[BUFSIZE];
                    output = new BufferedOutputStream(new FileOutputStream(entryDestination), BUFSIZE);
                    while ((n = zipInputStream.read(buf, 0, BUFSIZE)) != -1) {
                        output.write(buf, 0, n);
                    }
                    output.flush();


                } catch (FileNotFoundException e1) {
                    e1.printStackTrace();
                } catch (IOException e1) {
                    e1.printStackTrace();
                } finally {

                    try {
                        output.close();
                    } catch (IOException e1) {
                        e1.printStackTrace();
                    }

                }
            }
        }
    }
}

But for some reason it stores only the directories, without the files' content...

My question is: what is the right way to split a large zip file into chunks and process them with multiple threads, along the lines of the "compression" article mentioned above?

scopchanov
VitalyT

2 Answers


A ZipInputStream is a single stream of data, it cannot be split.
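This is exactly why the code in the question writes empty files. A minimal sketch (the class name and in-memory archive are made up for illustration) showing that once all entries have been enumerated with `getNextEntry()`, the stream is exhausted and `read()` only ever returns -1:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class StreamConsumedDemo {

    // Returns what read() yields after all entries have been enumerated.
    static int readAfterEnumeration() throws IOException {
        // Build a tiny zip in memory (content is made up for the demo)
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bos)) {
            zos.putNextEntry(new ZipEntry("a.txt"));
            zos.write("hello".getBytes());
            zos.closeEntry();
        }

        try (ZipInputStream zin = new ZipInputStream(new ByteArrayInputStream(bos.toByteArray()))) {
            while (zin.getNextEntry() != null) {
                // enumerate entries, as the question's unzip() does
            }
            // No current entry is open anymore, so read() reports end-of-stream
            return zin.read();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readAfterEnumeration()); // prints -1
    }
}
```

The `ZipEntry` objects collected in the question's list are just metadata; the bytes can only be read from the stream while the corresponding entry is the current one.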

If you want multi-threaded unzipping, you need to use ZipFile. With Java 8 you even get the multi-threading for free.

public static void unzip(String targetDir, String zipFilename) {
    Path targetDirPath = Paths.get(targetDir);
    try (ZipFile zipFile = new ZipFile(zipFilename)) {
        zipFile.stream()
               .parallel() // enable multi-threading
               .forEach(e -> unzipEntry(zipFile, e, targetDirPath));
    } catch (IOException e) {
        throw new RuntimeException("Error opening zip file '" + zipFilename + "': " + e, e);
    }
}

private static void unzipEntry(ZipFile zipFile, ZipEntry entry, Path targetDir) {
    try {
        Path targetPath = targetDir.resolve(Paths.get(entry.getName()));
        if (Files.isDirectory(targetPath)) {
            Files.createDirectories(targetPath);
        } else {
            Files.createDirectories(targetPath.getParent());
            try (InputStream in = zipFile.getInputStream(entry)) {
                Files.copy(in, targetPath, StandardCopyOption.REPLACE_EXISTING);
            }
        }
    } catch (IOException e) {
        throw new RuntimeException("Error processing zip entry '" + entry.getName() + "': " + e, e);
    }
}
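For reference, a self-contained smoke test of this approach (the temporary paths and tiny archive are made up for the demo; the extraction logic mirrors the answer above):

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

public class ZipFileParallelDemo {

    // Each ZipFile entry can be opened independently, so a parallel
    // stream over the entries is safe (unlike a shared ZipInputStream).
    static void unzip(Path zip, Path targetDir) throws IOException {
        try (ZipFile zipFile = new ZipFile(zip.toFile())) {
            zipFile.stream().parallel().forEach(entry -> {
                try {
                    Path target = targetDir.resolve(entry.getName());
                    if (entry.isDirectory()) {
                        Files.createDirectories(target);
                    } else {
                        Files.createDirectories(target.getParent());
                        try (InputStream in = zipFile.getInputStream(entry)) {
                            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
                        }
                    }
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("unzip-demo");
        Path zip = dir.resolve("demo.zip");
        // Create a tiny archive with one nested entry
        try (ZipOutputStream zos = new ZipOutputStream(Files.newOutputStream(zip))) {
            zos.putNextEntry(new ZipEntry("sub/a.txt"));
            zos.write("hello".getBytes());
            zos.closeEntry();
        }
        unzip(zip, dir.resolve("out"));
        System.out.println(new String(Files.readAllBytes(dir.resolve("out/sub/a.txt")))); // prints hello
    }
}
```

Note that this sketch, like the answer's code, trusts the entry names; for untrusted archives you would also want to validate that each resolved path stays inside the target directory.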

You might also want to check out this answer, which uses FileSystem to access the zip file content, for a true Java 8 experience.

Andreas
  • Checking your comment :). Btw, which gives the lowest handling time: the "walk" approach or your answer? – VitalyT Aug 19 '18 at 19:34
  • @VitalyT runtime will depend heavily on the target system - primarily IO speed, number of CPU cores and CPU speed - times will vary a lot between machines. – Hulk Aug 20 '18 at 05:28
  • Thanks, I know this. I'm trying to measure different scenarios on the same machine to find the ideal parameters... I need to understand conceptually the best practice for multithreaded unzipping. Currently it decreases the time by 10%... but it's not enough... – VitalyT Aug 20 '18 at 05:41
  • @VitalyT Multi-threading will likely not help much, unless it was the CPU that was the performance bottleneck. It is more likely your hard disk that can't keep up, so multi-threading just means more threads waiting on the disk. – Andreas Aug 20 '18 at 16:31

Here is a parallel version leveraging FileSystem. You should tweak it a bit (e.g., actually use streaming, add error handling), but it should be a decent start.

import java.io.IOException;
import java.net.URI;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.HashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

public class ParallelUnzip {

    static class UnzipVisitor extends SimpleFileVisitor<Path> {
        private Consumer<Path> unzipper;

        public UnzipVisitor(Consumer<Path> unzipper) {
            this.unzipper = unzipper;
        }
        @Override
        public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
            if (Files.isRegularFile(file)) {
                unzipper.accept(file);
            }
            return FileVisitResult.CONTINUE;
        }
    }

    // I would not risk creating directories in parallel, so adding synchronized here
    synchronized static void createDirectories(Path path) throws IOException {
        if (!Files.exists(path.getParent())) {
            Files.createDirectories(path.getParent());
        }
    }

    public static void main(String[] args) throws IOException, InterruptedException {

        FileSystem fs = FileSystems.newFileSystem(URI.create("jar:file:/tests.zip"), new HashMap<>());
        Path root = fs.getRootDirectories().iterator().next();
        Path target = Paths.get("target");

        ExecutorService executor = Executors.newFixedThreadPool(2);

        Files.walkFileTree(root, new UnzipVisitor((path) -> {
            System.out.println(Thread.currentThread().getName() + " " + path.toAbsolutePath().toString());

            executor.submit(() -> {
                try {
                    Path t = target.resolve(path.toString().substring(1));

                    createDirectories(t);

                    System.out.println("Extracting with thread " + Thread.currentThread().getName() + " File: "
                            + path.toAbsolutePath().toString() + " -> " + t.toAbsolutePath().toString());
                    // Should be using streaming here
                    byte[] bytes = Files.readAllBytes(path);
                    Files.write(t, bytes);
                } catch (Exception ioe) {
                    ioe.printStackTrace();
                    throw new RuntimeException(ioe);
                }
            });

        }));

        executor.shutdown();
        executor.awaitTermination(1000, TimeUnit.SECONDS);
        fs.close(); // close the zip file system once all workers are done
    }
}
k5_
  • @k5_ Thanks, trying your solution... hope that it will decrease the unzipping time by more than 10% :) – VitalyT Aug 20 '18 at 05:42
  • @k5_ Btw, I saw that `FileSystem fs = FileSystems.newFileSystem(URI.create("jar:file:/tests.zip"), new HashMap<>());` took about 5 min to load... why do you need this? Isn't it easier to use Files or Path like in the comment before? – VitalyT Aug 20 '18 at 05:44