0

What's the most effective way of walking a folder hierarchy and obtaining a list of unqiue extensions?

This is very similar to this question, except that I'd like to do it from within Java.

There's an obvious recursive solution of checking File.isDirectory(), iterate over all children, checking extension and isDirectory on each and then keeping a unique collection (such as a Set), but I'm trying to see if there's something a bit more efficient.

Community
  • 1
  • 1
Randyaa
  • 2,935
  • 4
  • 36
  • 49

2 Answers2

2

There isn't a more efficient one. The algorithm will have to test each and every file if it's extension is one, that hasn't been seen before. So the best algorithm will have a complexity of O(n).

Recursing into all directories and throwing the exstensions of all files in a Set is the best we can do, to my opinion.


The dramatic performance gain may be a side effect of not useing a HashMap correctly ;) I see, that you iterate through the entire set instead of using the contains method. If you did that in your original version true, then it's cleear to me, that the performance was questionable.

I still expect, that extracting the extensions and just adding them to a HashSet is the most performant solution:

static String[] filenames = { "edit.txt", "my.notes.txt", "sheet.xlxs",
        ".bash", "README" };
static HashSet<String> exts = new HashSet<>();

public static void main(String[] args) {
    // we add every extension to a hashset
    for (String filename : filenames) {
        exts.add(getExtension(filename));
    }

    // just dumps the set contents
    for (String ext: exts) {
        System.out.println(ext);
    }
}

private static String getExtension(String filename) {
    String ext = "";

    // calculate the index only once
    int lastIndexOfDot = filename.lastIndexOf('.');

    // "README" and ".bash" are files with no extension!
    if (lastIndexOfDot > 0) {
        exts.add(filename.substring(lastIndexOfDot));
    }
    return ext;
}
Andreas Dolk
  • 113,398
  • 19
  • 180
  • 268
  • Found a slightly improved approach by using a fileNameFilter – Randyaa Mar 25 '12 at 07:18
  • It just hides the complexity. The `FileNameFilter` will have to look at each and every filename too. It's still `O(n)`. – Andreas Dolk Mar 25 '12 at 16:05
  • I agree, but I saw a dramatic performance increase once I added the filter. Perhaps working with the set is really where the slow-down was occurring. – Randyaa Mar 26 '12 at 03:30
0

A custom FileFilter:

public class FileExtensionFilter implements FilenameFilter {
    private Set<String> filteredExtensions;
    public FileExtensionFilter() {
        filteredExtensions = new HashSet<String>();
    }
    @Override
    public boolean accept(File dir, String name) {
        boolean accept = true;
        for (String filteredExtension:filteredExtensions) {
            accept = accept && !name.endsWith(filteredExtension);
        }
        return accept;
    }
    public void addFilteredExtension(String extension) {
        filteredExtensions.add(extension);
    }
}

Recursive method solution:

public Set<String> checkForExtensions(File file) {
    Set<String> extensions = new HashSet<String>();
    if (file.isDirectory()) {
        for (File f : file.listFiles(fileExtensionFilter)) {
            extensions.addAll(checkForExtensions(f));
        }
    } else {
        //NOTE: if you don't want the '.' in the extension you'll need to add a '+1' to the substring call
        String extension = file.getName().substring(Math.max(file.getName().lastIndexOf('.'),0));
        extensions.add(extension);
        fileExtensionFilter.addFilteredExtension(extension);
    }
    return extensions;
}

Originally I had the same solution without the FileExtensionFilter but noticed I could improve the efficiency a bit by dynamically adding to the filtering. The savings was exponential. I went from 47 seconds down to 700 milliseconds.

You could also clean up memory usage a bit more now by eliminating the Set all together since the FileExtensionFilter will contain a duplicate copy of all the extensions in the Set.

Randyaa
  • 2,935
  • 4
  • 36
  • 49