3

I've done some testing with Streams in special with DirectoryStreams of the nio-package. I simply try to get a list of all files in a directory sorted by last modified date and size.

The JavaDoc of old File.listFiles() stated a Note to the method in Files:

Note that the Files class defines the newDirectoryStream method to open a directory and iterate over the names of the files in the directory. This may use less resources when working with very large directories.

I run the code down below a lot of times (first three times below):

First-run:

Run time of Arrays.sort: 1516
Run time of Stream.sorted as Array: 2912
Run time of Stream.sorted as List: 2875

Second-run:

Run time of Arrays.sort: 1557
Run time of Stream.sorted as Array: 2978
Run time of Stream.sorted as List: 2937

Third-run:

Run time of Arrays.sort: 1563
Run time of Stream.sorted as Array: 2919
Run time of Stream.sorted as List: 2896

My question is: Why do the streams perform so bad?

import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.FileTime;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class FileSorter {

  // This sorts from old to young and from big to small
  Comparator<Path> timeSizeComparator = (Path o1, Path o2) -> {
    int sorter = 0;
    try {
      FileTime lm1 = Files.getLastModifiedTime(o1);
      FileTime lm2 = Files.getLastModifiedTime(o2);
      if (lm2.compareTo(lm1) == 0) {
        Long s1 = Files.size(o1);
        Long s2 = Files.size(o2);
        sorter = s2.compareTo(s1);
      } else {
        sorter = lm1.compareTo(lm2);
      }
    } catch (IOException ex) {
      throw new UncheckedIOException(ex);
    }
    return sorter;
  };

  public String[] getSortedFileListAsArray(Path dir) throws IOException {
    Stream<Path> stream = Files.list(dir);
    return stream.sorted(timeSizeComparator).
            map(Path::getFileName).map(Path::toString).toArray(String[]::new);
  }

  public List<String> getSortedFileListAsList(Path dir) throws IOException {
    Stream<Path> stream = Files.list(dir);
    return stream.sorted(timeSizeComparator).
            map(Path::getFileName).map(Path::toString).collect(Collectors.
            toList());
  }

  public String[] sortByDateAndSize(File[] fileList) {
    Arrays.sort(fileList, (File o1, File o2) -> {
      int r = Long.compare(o1.lastModified(), o2.lastModified());
      if (r != 0) {
        return r;
      }
      return Long.compare(o1.length(), o2.length());
    });
    String[] fileNames = new String[fileList.length];
    for (int i = 0; i < fileNames.length; i++) {
      fileNames[i] = fileList[i].getName();
    }
    return fileNames;
  }

  public static void main(String[] args) throws IOException {
    // File (io package)
    File f = new File("C:\\Windows\\system32");
    // Path (nio package)
    Path dir = Paths.get("C:\\Windows\\system32");

    FileSorter fs = new FileSorter();

    long before = System.currentTimeMillis();
    String[] names = fs.sortByDateAndSize(f.listFiles());
    long after = System.currentTimeMillis();
    System.out.println("Run time of Arrays.sort: " + ((after - before)));

    long before2 = System.currentTimeMillis();
    String[] names2 = fs.getSortedFileListAsArray(dir);
    long after2 = System.currentTimeMillis();
    System.out.
            println("Run time of Stream.sorted as Array: " + ((after2 - before2)));

    long before3 = System.currentTimeMillis();
    List<String> names3 = fs.getSortedFileListAsList(dir);
    long after3 = System.currentTimeMillis();
    System.out.
            println("Run time of Stream.sorted as List: " + ((after3 - before3)));
  }
}

Update

After applying the code from Peter I have this results:

Run time of Arrays.sort: 1615
Run time of Stream.sorted as Array: 3116
Run time of Stream.sorted as List: 3059
Run time of Stream.sorted as List with caching: 378

Update 2

After doing some research on the solution of Peter, I can say, that reading file attributes with for ex. Files.getLastModified must be a heavy crunch. Changing only that part in Comparator to:

  Comparator<Path> timeSizeComparator = (Path o1, Path o2) -> {
    File f1 = o1.toFile();
    File f2 = o2.toFile();
    long lm1 = f1.lastModified();
    long lm2 = f2.lastModified();
    int cmp = Long.compare(lm2, lm1);
    if (cmp == 0) {
      cmp = Long.compare(f2.length(), f1.length());
    }
    return cmp;
  };

Gets the even better result on my computer:

Run time of Arrays.sort: 1968
Run time of Stream.sorted as Array: 1999
Run time of Stream.sorted as List: 1975
Run time of Stream.sorted as List with caching: 488

But as you can see, caching the object is the much best way. And as jtahlborn mentioned, it is a kind of stable sort.

Update 3 (best solution I've found)

After a bit more research, I've seen, that the methods Files.lastModified and Files.size, both do a huge job on a same thing: Attributes. So I made three versions of the PathInfo class to test:

  1. Peters version as described down below
  2. An old style File version, where I do a Path.toFile() once in the constructor and get all values from that file with f.lastModified and f.length
  3. An version of Peters solution, but now I read an Attribute object with Files.readAttributes(path,BasicFileAttributes.class) and done things on this.

Putting it all in a loop for doing it 100 times each, I came up with these results:

After doing all hundred times
Mean performance of Peters solution: 432.26
Mean performance of old File solution: 343.11
Mean performance of read attribute object once solution: 255.66

Code in constructor of PathInfo for the best solution:

public PathInfo(Path path) {
  try {
    // read the whole attributes once
    BasicFileAttributes bfa = Files.readAttributes(path, BasicFileAttributes.class);
    fileName = path.getFileName().toString();
    modified = bfa.lastModifiedTime().toMillis();
    size = bfa.size();
  } catch (IOException ex) {
    throw new UncheckedIOException(ex);
  }
}

My result: Never read attributes twice and caching in an Object is bursting performance.

aw-think
  • 4,723
  • 2
  • 21
  • 42
  • you run lm2 compareTo lm1 twice in your Path based comparator. not the end of the world, but could affect times. also, why are you using Long for the file size (instead of long). – jtahlborn Jun 18 '15 at 16:44
  • @jtahlborn You're right, that could be one part of the problem. – aw-think Jun 18 '15 at 17:18
  • @jtahlborn I've updated my answer/solution so this maybe could be the "fastet" solution to my problem without parallel streaming?! – aw-think Jun 19 '15 at 06:23

1 Answers1

4

Files.list() is a O(N) operation whereas sorting is O(N log N). It is far more likely that the operations inside the sorting which matter. Given the comparisons don't do the same thing, this is the most likely explanation. There is a lot of files with the same modification date under C:/Windows/System32 meaning the size would be checked quite often.

To show that most of the time is not spent in FIles.list(dir) Stream, I have optimise the comparison so the data about a file is only obtained once per file.

import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.FileTime;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class FileSorter {

    // This sorts from old to young and from big to small
    Comparator<Path> timeSizeComparator = (Path o1, Path o2) -> {
        int sorter = 0;
        try {
            FileTime lm1 = Files.getLastModifiedTime(o1);
            FileTime lm2 = Files.getLastModifiedTime(o2);
            if (lm2.compareTo(lm1) == 0) {
                Long s1 = Files.size(o1);
                Long s2 = Files.size(o2);
                sorter = s2.compareTo(s1);
            } else {
                sorter = lm1.compareTo(lm2);
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return sorter;
    };

    public String[] getSortedFileListAsArray(Path dir) throws IOException {
        Stream<Path> stream = Files.list(dir);
        return stream.sorted(timeSizeComparator).
                map(Path::getFileName).map(Path::toString).toArray(String[]::new);
    }

    public List<String> getSortedFileListAsList(Path dir) throws IOException {
        Stream<Path> stream = Files.list(dir);
        return stream.sorted(timeSizeComparator).
                map(Path::getFileName).map(Path::toString).collect(Collectors.
                toList());
    }

    public String[] sortByDateAndSize(File[] fileList) {
        Arrays.sort(fileList, (File o1, File o2) -> {
            int r = Long.compare(o1.lastModified(), o2.lastModified());
            if (r != 0) {
                return r;
            }
            return Long.compare(o1.length(), o2.length());
        });
        String[] fileNames = new String[fileList.length];
        for (int i = 0; i < fileNames.length; i++) {
            fileNames[i] = fileList[i].getName();
        }
        return fileNames;
    }

    public List<String> getSortedFile(Path dir) throws IOException {
        return Files.list(dir).map(PathInfo::new).sorted().map(p -> p.getFileName()).collect(Collectors.toList());
    }

    static class PathInfo implements Comparable<PathInfo> {
        private final String fileName;
        private final long modified;
        private final long size;

        public PathInfo(Path path) {
            try {
                fileName = path.getFileName().toString();
                modified = Files.getLastModifiedTime(path).toMillis();
                size = Files.size(path);
            } catch (IOException ex) {
                throw new UncheckedIOException(ex);
            }
        }

        @Override
        public int compareTo(PathInfo o) {
            int cmp = Long.compare(modified, o.modified);
            if (cmp == 0)
                cmp = Long.compare(size, o.size);
            return cmp;
        }

        public String getFileName() {
            return fileName;
        }
    }

    public static void main(String[] args) throws IOException {
        // File (io package)
        File f = new File("C:\\Windows\\system32");
        // Path (nio package)
        Path dir = Paths.get("C:\\Windows\\system32");

        FileSorter fs = new FileSorter();

        long before = System.currentTimeMillis();
        String[] names = fs.sortByDateAndSize(f.listFiles());
        long after = System.currentTimeMillis();
        System.out.println("Run time of Arrays.sort: " + ((after - before)));

        long before2 = System.currentTimeMillis();
        String[] names2 = fs.getSortedFileListAsArray(dir);
        long after2 = System.currentTimeMillis();
        System.out.println("Run time of Stream.sorted as Array: " + ((after2 - before2)));

        long before3 = System.currentTimeMillis();
        List<String> names3 = fs.getSortedFileListAsList(dir);
        long after3 = System.currentTimeMillis();
        System.out.println("Run time of Stream.sorted as List: " + ((after3 - before3)));
        long before4 = System.currentTimeMillis();
        List<String> names4 = fs.getSortedFile(dir);
        long after4 = System.currentTimeMillis();
        System.out.println("Run time of Stream.sorted as List with caching: " + ((after4 - before4)));
    }
}

This prints on my laptop.

Run time of Arrays.sort: 1980
Run time of Stream.sorted as Array: 1295
Run time of Stream.sorted as List: 1228
Run time of Stream.sorted as List with caching: 185

As you can see, about 85% of the time is spent obtaining the modification date and size of the files repeatedly.

Peter Lawrey
  • 525,659
  • 79
  • 751
  • 1,130
  • Not to mention this provides a stable sort if the files are being touched while you are sorting! – jtahlborn Jun 18 '15 at 17:03
  • Nice explanation! Thank you, thought that this is my Comparator – aw-think Jun 18 '15 at 17:04
  • @jtahlborn Like the javadoc explained – aw-think Jun 18 '15 at 17:04
  • @NwDx streams provide a nice way to produce a caching object to sort from. This reduces repeated work. – Peter Lawrey Jun 18 '15 at 17:04
  • @NwDx some caching is done by the OS, but calling the OS is still more expensive than looking up a `long`. Looks like about 7x difference. Can you see I cached in Java by mapping each Path to a PathInfo object. – Peter Lawrey Jun 18 '15 at 17:12
  • @PeterLawrey I just run your code and get the results (updated in my question). Wow, there seems to be some problem in my system? I have a Win8.1 x64 but run the code in x86 JDK/JRE (have problems with x64 Java). – aw-think Jun 18 '15 at 17:13
  • @PeterLawrey Ah, now I got the clue with the "get information from a file **once**", this a quiet nice. – aw-think Jun 18 '15 at 17:28
  • @jtahlborn I mean this [JavaDocs](http://docs.oracle.com/javase/8/docs/api/java/nio/file/DirectoryStream.html) explains, that the DirectoryStream Iterator is not able to reflect changes to file-system while it is iterating over it: _The iterator is weakly consistent. It is thread safe but does not freeze the directory while iterating, so it may (or may not) reflect updates to the directory that occur after the DirectoryStream is created_ . I think if you are sorting on a stream, it never reflects the touched files. Or am I wrong? – aw-think Jun 18 '15 at 17:46
  • @NwDx - that javadoc only means that you may not see files come and go. The file _attributes_ (size, last modified) are unrelated to the DirectoryStream, and that's what i was referring to. your original impl could see different mod times _while sorting_ which could break the sort. This version caches the sorting values so that you get a stable sort. – jtahlborn Jun 18 '15 at 18:00
  • @jtahlborn Could you explain me why my orginial impl will break? Files.list() creates an DirectoryStream.iterator, and when will it break with the attributes? Ah, now I know what you mean. Is it really sequentiel in a stream? – aw-think Jun 18 '15 at 18:06
  • @NwDx - when you are sorting, a given Path could be passed to the comparator multiple times. Each time, you get the last modified attribute. if that value changes while sorting, your sort becomes invalid. – jtahlborn Jun 18 '15 at 18:17
  • @jtahlborn Thank you for your explanation. – aw-think Jun 18 '15 at 18:36
  • 1
    @PeterLawrey I've updated my answer/solution so this maybe could be the "fastet" solution to my problem without parallel streaming?! – aw-think Jun 19 '15 at 06:24