2

I looked up sorting files in directory by size using java array list && How to sort an ArrayList by its elements size in Java?

My question is what is the best way for the Comparator to be implemented so that sort will be faster? I was told that the sort for 100k files should be done in seconds not in minutes, as the file sizes are in long. Is there a better way to implement the Comparator?

My Comparator is:

public static List<File> sortFilesBySize(List<File> xmlFileList) {
     xmlFileList.sort(Comparator.comparing(File::length).reversed());
     return xmlFileList;
}

where

private static List<File> xmlFileList = new ArrayList<File>();

xmlFileList is populated as:

pathList = pathList.subList(0,filterCount);
for (Path filePath : pathList)
    xmlFileList.add(filePath.toFile());

filterCount is how I filter by number of files to be sorted

and the sortFilesBySize is invoked as:

long startSortMillis = System.currentTimeMillis();
sortFilesBySize(xmlFileList);
long timeInMillis = System.currentTimeMillis() - startSortMillis;

By varying the number of files sorted as 5k, 10k 20k etc. I get

  1. 5k ----> 1329 ms
  2. 10k ---> 2808 ms
  3. 20k ---> 29790 ms
  4. 40k ---> 428408 ms
  5. 80k ---> 838658 ms
  6. 100k --> 1159034 ms

It can be observed that after 20k the sort takes minutes. Any suggestions how I can lower the sort time?

I also looked up https://docs.oracle.com/javase/8/docs/api/java/io/File.html to see if I could improve my current implementation, but nothing seemed to jump out.

Daniel
  • 21
  • 1
  • I believe your question belongs [here](https://codereview.stackexchange.com). – Cardinal System Feb 05 '18 at 00:05
  • 3
    Have you tried caching the file size? – MiguelKVidal Feb 05 '18 at 00:09
  • 2
    Have you considered that `File.length()` might be a relatively slow operation? Perhaps caching that would help, i.e. wrap `File` with a custom class containing `File` and the size, implement `Comparable`, then sort that. – Andreas Feb 05 '18 at 00:10
  • Probably a duplicate of [Efficiency of the way comparator works](https://stackoverflow.com/questions/32878398/efficiency-of-the-way-comparator-works) – Erwin Bolwidt Feb 05 '18 at 00:59

4 Answers4

2

It is indeed caused by the system calls on File.length(). The number of them increases more than linearly with the number of files. Do cache it as suggested. You will find that the sort time almost vanishes.

user207421
  • 305,947
  • 44
  • 307
  • 483
1

Try caching the length:

public static List<FileWithCachedLength> sortFilesBySize(List<FileWithCachedLength> xmlFileList) {
    xmlFileList.sort(Comparator.comparing(FileWithCachedLength::length).reversed());
    return xmlFileList;
}

Where:

public class FileWithCachedLength {
    private final File file;
    private final int length;
    // getters omitted
    public FileWithCachedLength( File f ) {
        file = f;
        length = f.length();
    }
}
MiguelKVidal
  • 1,498
  • 1
  • 15
  • 23
0

As others have noted, this is due to the cost of the File.length() method.

If you had a method like this:

public static <T, R> Function<T, R> memoized(Function<? super T, ? extends R> f) {
    Objects.requireNonNull(f);
    Map<T, R> map = new HashMap<>();
    return t -> map.computeIfAbsent(t, f);
}

You could use it like this:

public static List<File> sortFilesBySize(List<File> xmlFileList) {
    xmlFileList.sort(Comparator.comparing(memoized(File::length)).reversed());
    return xmlFileList;
}

Which would only incur the cost of calling File.length() once for each file.

clstrfsck
  • 14,715
  • 4
  • 44
  • 59
0

Thanks to all the answers and they were very helpful. I chose the implementation suggested by MiguelKVidal.

I started by reviewing Java Pairs - from http://www.baeldung.com/java-pairs.

After the implementation, my sort time was nice and low. But it is still longer to walk the directory path:

  1. 5k ----> Sort (16 ms) -----> WalkDirectoryPath(102 seconds)
  2. 10k --->Sort(31 ms)------->WalkDirectoryPath(94 seconds)
  3. 20k---->Sort(68 ms)------->WalkDirectoryPath(177 seconds)
  4. 40k---->Sort(131 ms)----->WalkDirectoryPath(328 seconds)
  5. 80k---->Sort(158 ms)----->WalkDirectoryPath(1219 seconds)
  6. 100k--->Sort(322 ms)---->WalkDirectoryPath(479 seconds)
Daniel
  • 21
  • 1