
I need to extract extensions from file names.

I know this can be done for single extensions like .gz or .tar by using filePath.lastIndexOf('.') or using utility methods like FilenameUtils.getExtension(filePath) from Apache commons-io.

But, what if I have a file with an extension like .tar.gz? How can I manage files with extensions that contain . characters?

grkvlt
Bernice
  • Actually, the extension is not `.tar.gz`. The extension - by definition - is `.gz`. The `.tar` part is only for informational purposes. If the file ended with `.test.zip` you would consider the extension `.zip`, wouldn't you? – Thorsten Dittmar Jul 16 '13 at 12:38
  • And for `commons-io-2.4.jar`, what would you expect in this case? – vels4j Jul 16 '13 at 12:40
  • @ThorstenDittmar It is still reasonable to handle *.tar.gz sensibly, just like *.tgz, mind you. – Ingo Jul 16 '13 at 12:40
  • Oh, that's true, I hadn't realized that, @ThorstenDittmar. So .tar.gz and .gz both have the same file properties (e.g. icon) in this case? – Bernice Jul 16 '13 at 12:40
  • If you don't want to use commons-io, there is a similar method in Guava; `Files.getFileExtension(filePath)` - but it uses the same `lastIndexOf('.')` technique. – grkvlt Jul 21 '13 at 14:09
  • Why has this been closed? The question is reasonable, and *demonstrate[s] a minimal understanding of the problem* as far as I can see. Certainly, there was enough information for me to provide an answer, which was then accepted. I have edited the question slightly to clarify this, and the edit has been accepted. – grkvlt Jul 22 '13 at 13:29
  • @vels4j In that case `commons-io-2.4.jar` has the extension `jar`. For other extensions, the mappings should look like this: `commons-io-2.4.tar` = `tar`, `commons-io-2.4.tar.bz2` = `tar.bz2`, `commons-io-2.4.tgz.shar` = `tgz.shar`, `commons-io-2.4.tgz` = `tgz`, `commons-io-2.4.jar.md5` = `md5`. Note the last one, this is an MD5 signature of the Jar file, so the extension is `md5` not `jar.md5`. – grkvlt Jul 22 '13 at 13:35
  • @grkvlt I know, but what if you get an unknown type? That is why this question was closed. – vels4j Jul 22 '13 at 14:04

4 Answers


If you know what extensions are important, you can simply check for them explicitly. You would have a collection of known extensions, like this:

List<String> EXTS = Arrays.asList("tar.gz", "tgz", "gz", "zip");

You could get the (first) longest matching extension like this:

String getExtension(String fileName) {
  String found = null;
  for (String ext : EXTS) {
    if (fileName.endsWith("." + ext)) {
      if (found == null || found.length() < ext.length()) {
        found = ext;
      }
    }
  }
  return found;
}

So calling getExtension("file.tar.gz") would return "tar.gz". If no known extension matches, the method returns null.

If you have mixed-case names, perhaps try changing the check to fileName.toLowerCase().endsWith("." + ext) inside the loop.
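
Putting the pieces above together, a minimal self-contained sketch might look like the following. The class name KnownExtensions is made up for illustration, the list of extensions is only an example, and the lowercasing is done once up front rather than inside the loop.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class; the name and the extension list are assumptions.
final class KnownExtensions {
    // Known extensions, without the leading dot. Extend as needed.
    private static final List<String> EXTS = Arrays.asList("tar.gz", "tgz", "gz", "zip");

    // Returns the longest known extension the name ends with, or null if none matches.
    static String getExtension(String fileName) {
        // Lowercase once so the comparison ignores case.
        String lower = fileName.toLowerCase(Locale.ROOT);
        String found = null;
        for (String ext : EXTS) {
            if (lower.endsWith("." + ext) && (found == null || found.length() < ext.length())) {
                found = ext;
            }
        }
        return found;
    }
}
```

With this sketch, KnownExtensions.getExtension("FILE.TAR.GZ") gives "tar.gz", and a name such as "notes.txt" gives null because "txt" is not in the list.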

grkvlt

A file can have only one extension!

If you have a file test.tar.gz,

  • .gz is the extension and
  • test.tar is the base name!

.tar in this case is part of the base name, not part of the extension!

If you want a file that is both tarred and gzipped, you should call it .tgz. Using .tar.gz is bad practice; if you need to handle these files, use a workaround such as renaming the file to test.tgz.

Grim
  • Beg to differ: the "extension" is whatever some application defines it to be. From an OS point of view there is no such thing as an "extension". – Ingo Jul 16 '13 at 13:41
  • An OS POV, hm ... it's maybe safer to say from an FS POV. – Grim Jul 16 '13 at 13:54
  • This does not help if you are renaming your file to uuid + extension: you will go from `foo.tar.gz` to `61822326-ef4d-49f4-971d-b20269c72db9.gz` ... – Enerccio Jul 19 '18 at 17:52

Found a simple way. Use substring to get the file name only (without the directory part), then use indexOf instead of lastIndexOf to find the first '.' and take everything after it as the extension.
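
A rough sketch of that idea follows. The class name FirstDotExtension is made up for illustration, and returning an empty string when there is no dot is an assumption. Note that a name such as commons-io-2.4.jar, mentioned in the comments on the question, yields "4.jar" with this approach, so it only suits names whose base part contains no dots.

```java
// Hypothetical helper class; the name and the empty-string fallback are assumptions.
final class FirstDotExtension {
    static String getExtension(String filePath) {
        // Drop the directory part; accept both '/' and '\' as separators.
        int sep = Math.max(filePath.lastIndexOf('/'), filePath.lastIndexOf('\\'));
        String name = filePath.substring(sep + 1);
        // Everything after the FIRST dot in the file name is treated as the extension.
        int dot = name.indexOf('.');
        return dot < 0 ? "" : name.substring(dot + 1);
    }
}
```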

Bernice

You can get the filename part of the path, split on . and take the final 0, 1, or 2 elements in the array as the extension.

Of course if .tar.* (gz, bz2, etc.) is your only edge case it may be pragmatic to just build a solution that filters filenames for .tar. and use that as the point at which to extract the extension (to include the .tar portion).
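
The .tar.* special case could be sketched roughly as below. The class name TarAwareExtension is made up for illustration, and two choices are assumptions rather than part of the answer: the .tar part is kept only when exactly one component follows it, and an empty string is returned when the name has no dot.

```java
// Hypothetical helper class; the name and the fallback rules are assumptions.
final class TarAwareExtension {
    static String getExtension(String filePath) {
        // Work on the file name only, without the directory part.
        int sep = Math.max(filePath.lastIndexOf('/'), filePath.lastIndexOf('\\'));
        String name = filePath.substring(sep + 1);

        // Special case: ".tar." followed by exactly one non-empty component, e.g. ".tar.bz2".
        int tar = name.lastIndexOf(".tar.");
        int afterTar = tar + ".tar.".length();
        if (tar >= 0 && afterTar < name.length() && name.indexOf('.', afterTar) < 0) {
            return name.substring(tar + 1);
        }

        // Default: everything after the last dot.
        int dot = name.lastIndexOf('.');
        return dot < 0 ? "" : name.substring(dot + 1);
    }
}
```

Under these assumptions, commons-io-2.4.tar.bz2 gives "tar.bz2", commons-io-2.4.jar gives "jar", and backup.tar.gz.md5 gives "md5".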

cfeduke