6

I have an ArrayList in Java with a huge amount of files (~40.000 files). I need to sort these files ascending/descending by their date. Currently, I use a simple

Collections.sort(fileList, new FileDateComparator());

where FileDateComparator is

public class FileDateComparator implements Comparator<File>
{
       @Override
       public int compare(File o1, File o2)
       {
           if(o1.lastModified() < o2.lastModified())
               return -1;
           if(o1.lastModified()==o2.lastModified())
               return 0;
          return 1;
       }
    }

Sorting takes up up a too long time for me, like 20 seconds or more. Is there a more efficient way to realize this? I already tried Apache I/O LastModifiedFileComparator as comparator, but it seems to be implemented the same way, since it takes the same time.

user3581199
  • 239
  • 3
  • 17
  • Have you tried adding your objects to a TreeSet (http://docs.oracle.com/javase/6/docs/api/java/util/TreeSet.html) so they are sorted on the fly? – tom Apr 28 '14 at 11:57
  • 1
    Each call to `File.lastModified` makes a kernel call, so if you want to avoid that overhead you'll have to read the entire list once and keep it in memory (you can use a `Map` to do this. But it's still going to take time to load this cache, because again, you need to make a kernel call for each timestamp. – kdgregory Apr 28 '14 at 12:02
  • TreeSet gives you log(n) lookup, not linear.one structure in Apache Commons Collections that may be what you are looking for, the TreeList http://commons.apache.org/proper/commons-collections/javadocs/api-3.2.1/org/apache/commons/collections/list/TreeList.html – Asif Bhutto Apr 28 '14 at 12:03
  • 1
    See also http://stackoverflow.com/questions/6321180/how-expensive-is-file-exists-in-java – Raedwald Apr 28 '14 at 12:12
  • Not quite a duplicate of [_Best way to list files in Java, sorted by Date Modified?_](http://stackoverflow.com/questions/203030/best-way-to-list-files-in-java-sorted-by-date-modified), as this question is about the performance problem of sorting a long list of files. – Raedwald Apr 28 '14 at 12:26
  • Do you really need to sort the list at all? Do you just need the most recently modified file? If so, you can do it in one pass, which will be about 15 times faster. – Raedwald Apr 28 '14 at 12:27
  • See also http://stackoverflow.com/questions/11907273/java-arrays-sort-taking-a-long-time – Raedwald Apr 28 '14 at 12:32

4 Answers4

4

I think you need to cache the modification times to speed this up. You could e.g. try something like this:

class DatedFile {
  File f;
  long moddate;

  public DatedFile(File f, long moddate) {
    this.f = f;
    this.moddate = moddate;
  }
};


ArrayList<DatedFile> datedFiles = new ArrayList<DatedFile>();
for (File f: fileList) {
  datedFiles.add(new DatedFile(f, f.lastModified()));
}
Collections.sort(fileList, new FileDateComparator());
ArrayList<File> sortedFiles = new ArrayList<File>();
for (DatedFile f: datedFiles) {
  sortedFiles.add(f.f);
}

(with an appropriate FileDateComparator implementation)

thejh
  • 44,854
  • 16
  • 96
  • 107
  • This will give a benefit, but the O/S itself will cache the file modification times, so it might not be that dramatic an improvement. – Raedwald Apr 28 '14 at 12:06
  • @Raedwald Syscalls are pretty expensive, so I would expect a relatively big improvement. Also, I think that it's likely that the OP got the list of files by listing the directory, which at least on linux implies that the OS has already cached the modification times. – thejh Apr 28 '14 at 12:09
2

Sorting is O(n lg N), so your list of 40,000 files will need about 600,000 operations (comparisons). If it takes about 20 seconds, that is about 30,000 comparisons per second. So each comparison is taking about 100,000 clock cycles. That can not be due to CPU-bound processing. The sorting is almost certainly I/O bound rather than CPU bound. Disk seeks are particularly expensive.

You might be able to reduce the time by using multi-threading to reduce the impact of disk seeks. That is, by having several reads queued and waiting for the disk drive to provide their data. To do that, use a (concurrent) map that maps file names to modification times, and populate that map using multiple threads. Then have your sort method use that map rather than use File.lastModified() itself.

Even if you populated that map with only one thread, you would gain a little benefit because your sort method would be using locally cached modification times, rather than querying the O/S every time for the modification times. The benefit of that caching might not be large, because the O/S itself is likely to cache that information.

Raedwald
  • 46,613
  • 43
  • 151
  • 237
1

Java's array .sort() is (from about Java 6) actually TimSort [ http://svn.python.org/projects/python/trunk/Objects/listsort.txt ], the fastest general purpose #sort out there (much better than qsort in many situations); you won't be able to sort anything noticeably faster without a heuristic.

"like 20 seconds or more" signifies to me that your problem is probably the famous ApplicationProfilingSkippedByDeveloperException - do a profiling and locate the exact bottleneck. I'd go with the OS file I/O as one; doing a native request of the file attributes in batch, caching the results and then processing them at once seems the only sensible solution here.

0

You need to cache the lastModified() One way you can do this is in the Comparator itself.

public class FileDateComparator implements Comparator<File> {
   Map<File, Long> lastModifiedMap = new HashMap<>();

   Long lastModified(File f) {
       Long ts = lastModifiedMap.get(f);
       if (ts == null)
           lastModifiedMap.put(f, ts = f.lastModified());
       return ts;
   }

   @Override
   public int compare(File f1, File f2) {
       return lastModified(f1).compareTo(lastModified(f2));
   }
}

This will improve performance by only looking up the modified date of each file once.

Peter Lawrey
  • 525,659
  • 79
  • 751
  • 1,130