Java: How to efficiently process Zipfile reading and create byte[] using Multithreading and Async

Question

I am currently developing a method in the service layer that receives a .zip file (up to 600~700 MB) as a MultipartFile. Of all the files in that archive, only 4-5 JSON files are of interest to me. I read them from the zip using ZipInputStream and store them as String values for further use.

Service class:

@Async("taskExecutor")
public CompletableFuture<ResponseEntity<?>> methodname(MultipartFile file) throws IOException {

    // Declared outside the loop so the values are still available after it
    String value1 = null;
    String value2 = null;
    String value3 = null;

    ZipEntry entry = null;
    try (ZipInputStream zipFileStream = new ZipInputStream(file.getInputStream())) {
        while ((entry = zipFileStream.getNextEntry()) != null) {
            String entryName = entry.getName();

            if (entryName.contains("<file1name>")) {
                BufferedReader br = new BufferedReader(new InputStreamReader(zipFileStream));
                value1 = br.lines().collect(Collectors.joining("\n"));
                zipFileStream.closeEntry();
            }

            if (entryName.contains("<file2name>")) {
                BufferedReader br = new BufferedReader(new InputStreamReader(zipFileStream));
                value2 = br.lines().collect(Collectors.joining("\n"));
                zipFileStream.closeEntry();
            }

            if (entryName.contains("<file3name>")) {
                BufferedReader br = new BufferedReader(new InputStreamReader(zipFileStream));
                value3 = br.lines().collect(Collectors.joining("\n"));
                zipFileStream.closeEntry();
            }
        }
    }

    //String value1 & String value2 merged based on some condition to finally prepare String value1.
    //some logic to prepare a file

    if (fileExists) {
        //create byte[] and HttpHeaders with content disposition and media type and send CompletableFuture<ResponseEntity<?>>
    }
}

I have annotated the method with @Async (I have created an Executor bean in a config class), but I have not been able to figure out how to run the different steps of this method asynchronously or on multiple threads to make the processing faster. The entire process still runs on a single thread from that executor's pool.

Can anyone please advise how I can introduce asynchronous or multithreaded processing in the method above, so that steps such as

Reading the Zip file
Creating the final byte[]

run a bit faster and reduce the overall response time?

Try using ZipFile instead: https://stackoverflow.com/questions/51920911/how-can-i-unzip-huge-folder-with-multithreading-with-java-preferred-java8 – Marc Stroebel, Sep 15 '22 at 09:50
Using `contains` on the zip entry’s name is very likely wrong. In most cases, you want `equals`; sometimes `endsWith` might be the right check for a path. Further, leaving aside the fact that `String value1 = br.lines().collect();` isn’t valid code, you apparently want to read the entire entry into a single `String`, so you should do exactly that, rather than splitting it into lines and reassembling the lines into a string. And if you are doing exactly the same for the three entries, you should not duplicate (triplicate) the code. – Holger, Sep 15 '22 at 10:04
Hey Holger, I used contains because the filename will have some dynamic content prefixed, which I have no way of knowing. I understand that's not the right way; maybe I'll try endsWith() as suggested. Secondly, I am reading the entries into Strings because they are JSON files, and I use those strings to create JSON nodes and do some field-based comparisons. As for the code duplication, I want to understand how I can write that piece of code without duplication and still achieve the same result. Could you advise? That would be helpful. – New2Java, Sep 15 '22 at 10:18
Please see my update to the question for reading the entry into a String. – New2Java, Sep 15 '22 at 10:32
You avoid code duplication by putting the common code into a method and calling that method as often as needed. And I’m not objecting to the need to read into a `String`, but you shouldn’t split the data into lines just to join them afterwards. See [this answer](https://stackoverflow.com/a/32352386/2711488) for an example of how to do this without dealing with lines. – Holger, Sep 16 '22 at 14:58
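
To illustrate the two suggestions in the comments above (one shared method instead of three copies, and reading an entry straight into a String), here is a minimal sketch. The class name ZipJsonReader, the method readEntries and the suffix list are made up for this example and are not from the question. It assumes Java 9+ (for InputStream.readAllBytes()) and that the JSON files are UTF-8 encoded.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, not part of the original question.
class ZipJsonReader {

    // Reads every entry whose name ends with one of the given suffixes.
    // The returned map is keyed by the suffix that matched.
    static Map<String, String> readEntries(InputStream in, List<String> suffixes) throws IOException {
        Map<String, String> result = new HashMap<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                for (String suffix : suffixes) {
                    if (name.endsWith(suffix)) {
                        // On a ZipInputStream, readAllBytes() stops at the end of
                        // the current entry, not at the end of the archive.
                        // Assumption: the entry content is UTF-8 text.
                        result.put(suffix, new String(zip.readAllBytes(), StandardCharsets.UTF_8));
                        break;
                    }
                }
            }
        }
        return result;
    }
}
```

In the service method this could be called as ZipJsonReader.readEntries(file.getInputStream(), List.of("file1.json", "file2.json")), where the suffixes are placeholders for the real file names.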

score 0 · Answer 1 · answered Sep 15 '22 at 11:01

Store the MultipartFile in a temp file and try ZipFile, which supports streams out of the box:

// "dummy.zip" is a placeholder path; try-with-resources closes the ZipFile afterwards
try (ZipFile zipFile = new ZipFile("dummy.zip")) {
  zipFile
    .stream()
    .parallel()
    .filter(entry -> entry.getName().matches("regexFile1")
      || entry.getName().matches("regexFile2")
      || entry.getName().matches("regexFile3")
    )
    .map(entry -> {
      // close each entry's stream once its content has been read
      try (InputStream in = zipFile.getInputStream(entry)) {
        return new EntryDto(entry.getName(), new String(in.readAllBytes(), StandardCharsets.UTF_8));
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
    })
    .map(dto -> {
      // custom logic
      return ...;
    })
    .collect(Collectors.toList());
}

DTO class:

class EntryDto {
    private final String name;
    private final String json;

    public EntryDto(String name, String json) {
        this.name = name;
        this.json = json;
    }

    public String getName() {
        return name;
    }

    public String getJson() {
        return json;
    }
}
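
A fuller sketch of the temp-file approach, under a few assumptions: it takes a plain InputStream (in a Spring service this would be file.getInputStream()), the names TempZipReader, readMatching and readEntry are hypothetical, and it needs Java 9+ for readAllBytes() and Map.entry(). ZipFile needs a real file on disk, which is why the upload is copied first. Whether .parallel() actually helps for 4-5 small entries is something to measure rather than assume.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Map;
import java.util.function.Predicate;
import java.util.stream.Collectors;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper, not part of the original answer.
class TempZipReader {

    // Copies the upload to a temp file, reads all matching entries as UTF-8
    // strings (keyed by entry name) and deletes the temp file again.
    static Map<String, String> readMatching(InputStream upload, Predicate<String> nameFilter) throws IOException {
        Path tmp = Files.createTempFile("upload-", ".zip");
        try {
            // createTempFile already created the file, so it must be replaced
            Files.copy(upload, tmp, StandardCopyOption.REPLACE_EXISTING);
            try (ZipFile zipFile = new ZipFile(tmp.toFile())) {
                return zipFile.stream()
                        .parallel()
                        .filter(e -> !e.isDirectory() && nameFilter.test(e.getName()))
                        .map(e -> Map.entry(e.getName(), readEntry(zipFile, e)))
                        // keep the first value if an archive contains duplicate names
                        .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue, (a, b) -> a));
            }
        } finally {
            Files.deleteIfExists(tmp);
        }
    }

    private static String readEntry(ZipFile zipFile, ZipEntry entry) {
        try (InputStream in = zipFile.getInputStream(entry)) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A call could look like TempZipReader.readMatching(file.getInputStream(), name -> name.endsWith(".json")), with the predicate standing in for the real file name checks.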

Marc Stroebel

But storing to disk and parallel processing may be slower than serial processing from memory using ZipInputStream. – Marc Stroebel Sep 15 '22 at 11:06
I doubt this will speed things up. Even if the zip file is processed in several threads and the actual processing gets faster, you pay the penalty of disk I/O, which is incredibly slow compared to in-memory operations. – Jochen Reinhardt Sep 15 '22 at 11:07
In-memory filesystem to the rescue: https://www.baeldung.com/jimfs-file-system-mocking – Marc Stroebel Sep 15 '22 at 11:12
To add the com.google.jimfs dependency to the pom, I might have to go through additional approvals. To Jochen's point, I did actually try storing the details in DTO/entity objects, but I did not see much improvement. Is there any way to do this with plain Java or Commons IO? – New2Java Sep 15 '22 at 14:50
Btw, I am a little confused here. If I do try (ZipFile zipFile = new ZipFile(file.getOriginalFilename())) {}, will it still use disk operations to store and read entries? – New2Java Sep 15 '22 at 15:00
Yes, but with an in-memory filesystem for reading and writing the files, I/O is no longer a bottleneck. – Marc Stroebel Sep 16 '22 at 06:13
`ZipFile` doesn’t work with custom `FileSystem` implementations. But anyway, I wouldn’t rely on generic statements about performance. Most systems cache data in memory anyway, as long as you’re not forcing a sync with physical storage. Just try and measure… – Holger Sep 16 '22 at 14:47
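
For the "try and measure" part, a rough timing helper could look like the sketch below. The class and method names (Stopwatch, bestMillis) are invented for this example. It only gives a coarse wall-clock comparison; for anything more reliable, a benchmark harness such as JMH would be the better tool.

```java
import java.util.concurrent.Callable;

// Hypothetical helper for coarse comparisons, not a real benchmark harness.
class Stopwatch {

    // Runs the task `runs` times and returns the best wall-clock time in milliseconds.
    static double bestMillis(Callable<?> task, int runs) throws Exception {
        long best = Long.MAX_VALUE;
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.call();
            best = Math.min(best, System.nanoTime() - start);
        }
        return best / 1_000_000.0;
    }
}
```

Each variant (serial ZipInputStream, temp file plus ZipFile, with and without .parallel()) would be wrapped in its own Callable and timed against the same uploaded archive, ideally one of realistic size.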
