What's a reliable way to get the extension of a file in Java?

I'm not talking about doing a substring / lastIndexOf on `.` against File.getName(), because that is useless for complex extensions such as .tar.gz and so on. (This is what all the libraries out there, such as Commons IO and Guava, seem to be doing.) I am looking for a more sophisticated/reliable way of doing it, one which returns the real extension.

Although this sounds like a duplicate of many other questions here, it's not the same. The other posters have been happy with a simple answer that does a lastIndexOf on `.`. This breaks in cases where the extension contains more than one dot.

Isn't there simply a method which can be used to return this?

Any hints would be appreciated.

carlspring
  • Technically speaking `.tar.gz` is a `.gz` of a `.tar` methinks. – Ceiling Gecko Oct 12 '15 at 11:49
  • Please define "real extension". – biziclop Oct 12 '15 at 11:49
  • Possible duplicate of [Getting A File's Mime Type In Java](http://stackoverflow.com/questions/51438/getting-a-files-mime-type-in-java) – Tunaki Oct 12 '15 at 11:49
  • How should it be possible as long as `.` can be entered by the user as part of the file name? Nobody prevents me from creating a file called `my.file.name.that.has.dots.txt` - who could ever know what the extension and what the file name is? – Thomas Weller Oct 12 '15 at 11:50
  • @Thomas that's why OP mentions `lastIndexOf` which will return the last dot position. – Ceiling Gecko Oct 12 '15 at 11:50
  • OP, to my small knowledge, extensions with multiple dots do not exist. – Ceiling Gecko Oct 12 '15 at 11:54
  • @CeilingGecko: what if there is an application that can process `.dots.txt` files, and another one that can process `.has.dots.txt` files? Then there are three different extensions which are all valid. And you still don't know which program to use for opening the file. – Thomas Weller Oct 12 '15 at 11:54
  • [How about this? Get the MIME type and find the file extension with Apache Tika](http://stackoverflow.com/questions/13650372/how-to-determine-appropriate-file-extension-from-mime-type-in-java) – Murat Karagöz Oct 12 '15 at 11:54
  • @CeilingGecko although the OS doesn't care how the file ends there are numerous of scenarios where you ACTUALLY need to remove the "full" extension (i.e. `tar.gz`) from the filename, which is probably what the OP wants. So stop being a wise-ass, will you? – tftd Oct 12 '15 at 11:58
  • Traditionally, in Unix and Unix-like operating systems, extensions are not mandatory. Executable files usually don't have an extension. Hidden files start with a dot. Extensions like `txt` etc. are conveniences. Some people use `.tar.gz` to mark the gzipped version of a tarball. Some use `.tgz` instead. There is no rule. – RealSkeptic Oct 12 '15 at 11:58
  • @MuratK.: Thanks for pointing out the most useful suggestion so far! – carlspring Oct 12 '15 at 12:00
  • @tftd The extension separator cannot be a part of the extension identifier itself for the extension to be considered a "legal" extension, otherwise the OS itself would run into the same problem as OP. I realize that there are special cases, but for every application that actually handles them a special case is defined for just that "extension combo". – Ceiling Gecko Oct 12 '15 at 12:03
  • On Windows, [it's not possible to register a multi-dotted file extension](http://stackoverflow.com/a/16200023/4136325) – Thomas Weller Oct 12 '15 at 12:04
  • @Thomas: But that's for `C#`, not `Java`, isn't it? – carlspring Oct 12 '15 at 13:35
  • It's for Windows, independent of the programming language – Thomas Weller Oct 13 '15 at 11:08

1 Answer

What's a reliable way to get the extension of a file in Java?

There is no reliable way, because there is no reliable way of distinguishing a file suffix from a filename that has dot (period) characters in it.

Or to put it another way, the "real" extension is a construction placed on the filename by the human reader. And I think you will find that different people place different constructions on it. (The real extension for "foo.tar.gz" is either "gz" or "tar.gz", depending on your point of view ... and what the application is designed to do.)

The best you can do is to code your application to use either "stuff after first dot" or "stuff after last dot" as the suffix, depending on what it needs. (And maybe a bit of filtering to distinguish expected extensions from stuff that the application does not understand.)
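
As a rough illustration of the options above, here is a minimal sketch in plain Java. The class name `ExtensionSketch`, its method names, and the `KNOWN_COMPOUND` list are all hypothetical; the list in particular is only an assumption about which multi-dot suffixes an application might care about, and you would replace it with your own.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical helper; none of these names come from a real library.
final class ExtensionSketch {

    // Assumption: the compound suffixes this particular application understands.
    private static final List<String> KNOWN_COMPOUND =
            List.of("tar.gz", "tar.bz2", "tar.xz");

    /** "Stuff after last dot", or "" if there is no usable dot. */
    static String afterLastDot(String name) {
        int i = name.lastIndexOf('.');
        // i <= 0 covers both "no dot" and hidden files such as ".bashrc".
        return (i <= 0 || i == name.length() - 1) ? "" : name.substring(i + 1);
    }

    /** "Stuff after first dot", or "" if there is no usable dot. */
    static String afterFirstDot(String name) {
        // Search from index 1 so the leading dot of a hidden file is skipped.
        int i = name.indexOf('.', 1);
        return (i < 0 || i == name.length() - 1) ? "" : name.substring(i + 1);
    }

    /**
     * Filtered variant: prefer a known compound suffix,
     * otherwise fall back to "stuff after last dot".
     */
    static String extension(String name) {
        String lower = name.toLowerCase(Locale.ROOT);
        for (String suffix : KNOWN_COMPOUND) {
            String dotted = "." + suffix;
            // Require at least one character before the suffix.
            if (lower.length() > dotted.length() && lower.endsWith(dotted)) {
                // Return the suffix as it is spelled in the original name.
                return name.substring(name.length() - suffix.length());
            }
        }
        return afterLastDot(name);
    }
}
```

With this sketch, `extension("foo.tar.gz")` yields "tar.gz" while `extension("my.file.name.that.has.dots.txt")` yields "txt". Note that this only encodes one application's point of view; it does not make the extension any more "real".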


Then there is the problem that the file extension (however you extract it) is not a reliable indicator of the file's format / meaning. You can attempt to determine the format by using something like Apache Tika. However, even that can be problematic, if the format is not recognized, or (worse) if there are multiple possible formats for a given file.
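
To illustrate looking at the content rather than the name, here is a minimal sketch that checks only for the gzip magic number (the bytes 0x1f 0x8b at the start of the file). It is not Tika and recognises nothing else; the class name `GzipSniffer` and the method `looksLikeGzip` are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper; a tool like Tika checks far more than this.
final class GzipSniffer {

    /** Returns true if the file starts with the gzip magic bytes 0x1f 0x8b. */
    static boolean looksLikeGzip(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            int first = in.read();   // -1 if the file is empty
            int second = in.read();  // -1 if the file has fewer than two bytes
            return first == 0x1f && second == 0x8b;
        }
    }
}
```

A file named "foo.tar.gz" that contains plain text fails this check, and a gzip stream saved as "foo.txt" passes it, which is the point: the name and the content are independent.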


Returning to the foo.tar.gz example, as far as I am aware, the only program that relies on the file extension is the gunzip command which will uncompress foo.tar.gz as foo.tar. The tar command itself is agnostic of the file extension:

  • It will read any file as a TAR file, irrespective of the extension.
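  • If the TAR file is compressed (using gzip compression), then you may need to supply the -z or --gzip or equivalent option, irrespective of the extension. (Some implementations, such as recent versions of GNU tar, detect the compression automatically when reading an archive, again without looking at the extension.)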

Most UNIX / Linux programs are similarly agnostic of file extensions.

Stephen C