1

I need to parse the contents of an epub file and I am trying to find the most efficient way to do it. The epub file may contain images, a lot of text, and occasionally videos too. Should I go for a FileInputStream or a FileReader?

Zooter

2 Answers

2

As epub uses a ZIP archive structure, I would propose handling it as such. Below is a small snippet which lists the contents of an epub file.

// Needs: java.io.IOException, java.net.URI, java.nio.file.*,
// java.nio.file.attribute.BasicFileAttributes, java.util.*

// "create" = "false": open the existing archive instead of creating a new one
Map<String, String> env = new HashMap<>();
env.put("create", "false");

Path path = Paths.get("foobar.epub");
URI uri = URI.create("jar:" + path.toUri());
// try-with-resources closes the zip file system when the walk is done
try (FileSystem zipFs = FileSystems.newFileSystem(uri, env)) {
    Path root = zipFs.getPath("/");
    Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
        @Override
        public FileVisitResult visitFile(Path file,
                BasicFileAttributes attrs) throws IOException {
            print(file);
            return FileVisitResult.CONTINUE;
        }

        @Override
        public FileVisitResult preVisitDirectory(Path dir,
                BasicFileAttributes attrs) throws IOException {
            print(dir);
            return FileVisitResult.CONTINUE;
        }

        // prints last-modified time, size and path of one entry
        private void print(Path file) throws IOException {
            Date lastModifiedTime = new Date(Files.getLastModifiedTime(file).toMillis());
            System.out.printf("%td.%<tm.%<tY %<tH:%<tM:%<tS %9d %s\n",
                    lastModifiedTime, Files.size(file), file);
        }
    });
}

sample output

01.01.1970 00:59:59         0 /META-INF/
11.02.2015 16:33:44       244 /META-INF/container.xml
11.02.2015 16:33:44      3437 /logo.jpg
...

Edit: If you only want to extract files based on their names, you could do it as shown in this snippet for the visitFile(...) method.

public FileVisitResult visitFile(Path file,
    BasicFileAttributes attrs) throws IOException {
    // Path.endsWith compares whole path elements: this matches entries
    // whose file name is exactly "logo.jpg", not e.g. "mylogo.jpg"
    if (file.endsWith("logo.jpg")) {
        // extract the file into the directory /tmp/
        Files.copy(file, Paths.get("/tmp/",
            file.getFileName().toString()));
    }
    return FileVisitResult.CONTINUE;
}
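As a rough sketch of how visitFile could choose between a Reader and an InputStream per entry, the example below reads text entries through a BufferedReader and everything else through a plain stream. The class name `EpubEntryReader`, the method `describe`, the extension list, and the UTF-8 charset are assumptions for illustration, not part of the snippet above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class EpubEntryReader {

    // Assumption: these extensions mark text entries; adjust for your files.
    private static final List<String> TEXT_EXTENSIONS =
            Arrays.asList(".xml", ".xhtml", ".html", ".opf", ".ncx", ".css");

    static boolean isText(String name) {
        String lower = name.toLowerCase();
        for (String ext : TEXT_EXTENSIONS) {
            if (lower.endsWith(ext)) {
                return true;
            }
        }
        return false;
    }

    /** Maps each entry path to a short description of what was read. */
    static Map<String, String> describe(Path epub) throws IOException {
        final Map<String, String> result = new TreeMap<>();
        URI uri = URI.create("jar:" + epub.toUri());
        try (FileSystem zipFs = FileSystems.newFileSystem(
                uri, Collections.<String, String>emptyMap())) {
            Files.walkFileTree(zipFs.getPath("/"), new SimpleFileVisitor<Path>() {
                @Override
                public FileVisitResult visitFile(Path file,
                        BasicFileAttributes attrs) throws IOException {
                    String name = file.toString();
                    if (isText(name)) {
                        // text entry: decode through a Reader (UTF-8 assumed)
                        int lines = 0;
                        try (BufferedReader reader = Files.newBufferedReader(
                                file, StandardCharsets.UTF_8)) {
                            while (reader.readLine() != null) {
                                lines++;
                            }
                        }
                        result.put(name, "text, " + lines + " lines");
                    } else {
                        // binary entry: read raw bytes through a stream
                        long bytes = 0;
                        byte[] buffer = new byte[8192];
                        try (InputStream in = Files.newInputStream(file)) {
                            int n;
                            while ((n = in.read(buffer)) != -1) {
                                bytes += n;
                            }
                        }
                        result.put(name, "binary, " + bytes + " bytes");
                    }
                    return FileVisitResult.CONTINUE;
                }
            });
        }
        return result;
    }
}
```

Calling `EpubEntryReader.describe(Paths.get("foobar.epub"))` would return one line of summary per entry; in real code the counting loops would be replaced by whatever parsing the entry needs.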

Depending on how you want to process the files inside the epub, you might also have a look at ZipInputStream.

try (ZipInputStream in = new ZipInputStream(new FileInputStream("foobar.epub"))) {
    for (ZipEntry entry = in.getNextEntry(); entry != null; 
        entry = in.getNextEntry()) {
        System.out.printf("%td.%<tm.%<tY %<tH:%<tM:%<tS %9d %s\n",
                new Date(entry.getTime()), entry.getSize(), entry.getName());
        if (entry.getName().endsWith("logo.jpg")) {
            try (FileOutputStream out = new FileOutputStream(entry.getName())) {
                // process the file
            }
        }
    }
}

sample output

11.02.2013 16:33:44       244 META-INF/container.xml
11.02.2013 16:33:44      3437 logo.jpg
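The `// process the file` placeholder in the ZipInputStream loop could be filled in along these lines. This is only a sketch: the class `EpubStreamReader` and the method `readEntries` are hypothetical names, and it collects matching entries in memory instead of writing them to a FileOutputStream, which assumes the selected entries are small enough to hold in memory.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class EpubStreamReader {

    /** Reads every non-directory entry whose name ends with the given suffix. */
    static Map<String, byte[]> readEntries(InputStream epub, String suffix)
            throws IOException {
        Map<String, byte[]> contents = new LinkedHashMap<>();
        try (ZipInputStream in = new ZipInputStream(epub)) {
            byte[] buffer = new byte[8192];
            for (ZipEntry entry = in.getNextEntry(); entry != null;
                    entry = in.getNextEntry()) {
                if (entry.isDirectory() || !entry.getName().endsWith(suffix)) {
                    continue; // skipped entries are never decompressed into memory
                }
                // read() returns -1 at the end of the current entry,
                // not at the end of the whole archive
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                int n;
                while ((n = in.read(buffer)) != -1) {
                    out.write(buffer, 0, n);
                }
                contents.put(entry.getName(), out.toByteArray());
            }
        }
        return contents;
    }
}
```

For example, `EpubStreamReader.readEntries(new FileInputStream("foobar.epub"), "logo.jpg")` would return the bytes of each entry ending in `logo.jpg`. Unlike the zip file system approach, this reads the archive strictly front to back, so it also works on streams that cannot be reopened or seeked.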
SubOptimal
  • This is a good approach. One more addition: the `visitFile` method should decide whether to use an InputStream or a Reader to read the contents of each file. – Little Santi Dec 12 '15 at 10:05
  • Thanks SubOptimal for the comprehensive suggestion. @Little Santi, I did not understand the comment about visitFile deciding the approach to read contents of each file. – Zooter Dec 13 '15 at 16:18
  • @Zooter I meant that each file must be read either as text (through Reader APIs) or as binary (through Stream APIs), and that decision must be made within the `visitFile` method. – Little Santi Dec 13 '15 at 19:16
  • @LittleSanti I agree, but this mainly depends on how "Zooter" wants to process the files. – SubOptimal Dec 14 '15 at 08:59
  • @Zooter I updated my answer to provide another way. Depending on your requirements, you need to choose the one that better fits your needs. – SubOptimal Dec 14 '15 at 09:00
0

The easiest way to read a whole file as bytes (and that's what you want if it's not plain text) is to use the java.nio.file.Files class:

byte[] content = Files.readAllBytes(Paths.get("example.epub"));

Advantages of this method:

  • less code, so the code is more readable and has less potential for errors
  • Java takes care of opening and closing the file

Edit:

In order to read a file really fast you can use java.nio as well. This time java.nio.channels.FileChannel:

import java.io.FileInputStream;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

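// Load the file
FileChannel channel = new FileInputStream("example.epub").getChannel();
MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());

// Process the data: copy 50 bytes starting at file offset 1120
// (assumes the file is at least 1170 bytes long)
byte[] myBytes = new byte[50];
buffer.position(1120);
buffer.get(myBytes, 0, myBytes.length);

// when finished
channel.close();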

This will not read the whole file into memory; it maps the file and loads only the parts you actually access. Changes made to the file by other programs may also become visible through the buffer, but whether and when that happens depends on the operating system.
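A minimal sketch of reading just one region through a mapping, assuming you already know the offset and length you need. The class `MappedSlice` and the method `readSlice` are hypothetical names used for illustration.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MappedSlice {

    /** Copies length bytes starting at the given file offset. */
    static byte[] readSlice(Path file, long offset, int length) throws IOException {
        // try-with-resources closes the channel even if mapping fails
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            if (offset < 0 || length < 0 || offset + length > channel.size()) {
                throw new IllegalArgumentException("slice is outside the file");
            }
            // map only the requested region rather than the whole file
            MappedByteBuffer buffer =
                    channel.map(FileChannel.MapMode.READ_ONLY, offset, length);
            byte[] slice = new byte[length];
            buffer.get(slice); // copies from the start of the mapped region
            return slice;
        }
    }
}
```

`MappedSlice.readSlice(Paths.get("example.epub"), 1120, 50)` would return the same 50 bytes as the snippet above. Mapping a region rather than the whole file also sidesteps the 2 GB limit of a single mapping.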

Dennis Kriechel
  • Thanks Dennis for the response. But the documentation says that this method serves small files better and it's not ideal for larger files. I need to read epub files which can go up to 50 MB or more. – Zooter Dec 11 '15 at 15:22
  • Yeah, this depends on how your code will work later on. I wouldn't call 50 MB a large file, but that depends on the PC it's running on (especially the memory). Of course you can process the file step by step; I will add an example. – Dennis Kriechel Dec 11 '15 at 15:28
  • As stated here: http://stackoverflow.com/a/9094629/2546444 the example shown in the edit should be able to prepare 2 GB in under 10 milliseconds, which should be fast enough for you ;) – Dennis Kriechel Dec 11 '15 at 15:36