331

On a Linux machine I would like to traverse a folder hierarchy and get a list of all of the distinct file extensions within it.

What would be the best way to achieve this from a shell?

GloryFish

18 Answers

475

Try this (not sure if it's the best way, but it works):

find . -type f | perl -ne 'print $1 if m/\.([^.\/]+)$/' | sort -u

It works as follows:

  • Finds all files under the current folder
  • Prints the extension of each file, if it has one
  • Makes a unique sorted list
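
For anyone who wants to see the three steps spelled out, here is a minimal Python sketch of the same pipeline. It reuses the regex from the one-liner; the names `distinct_extensions` and `walk_files` are made up for illustration and are not part of any answer here. Two assumptions to be aware of: `os.walk` also reports symlinks and other non-directory entries, so it is not an exact match for `find . -type f`, and Python's `sorted` orders by code point while `sort -u` follows the locale.

```python
import os
import re

# Same pattern as the Perl one-liner: a dot followed by one or more
# characters that are neither a dot nor a slash, anchored at the end.
EXT_RE = re.compile(r"\.([^./]+)$")


def distinct_extensions(paths):
    """Return the sorted, de-duplicated extensions found in paths."""
    found = set()
    for path in paths:
        match = EXT_RE.search(path)
        if match:  # paths without an extension are skipped
            found.add(match.group(1))
    return sorted(found)


def walk_files(root):
    """Yield the path of every non-directory entry below root (hypothetical helper)."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            yield os.path.join(dirpath, name)


# Usage on a real tree (not executed here):
#     for ext in distinct_extensions(walk_files(".")):
#         print(ext)
```

Because the pattern refuses to cross a slash, a file without an extension inside a directory such as `x.d/` is skipped rather than reported.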
Benjamin Loison
Ivan Nevostruev
  • 9
    just for reference: if you want to exclude some directories from searching (e.g. `.svn`), use `find . -type f -path '*/.svn*' -prune -o -print | perl -ne 'print $1 if m/\.([^.\/]+)$/' | sort -u` [source](http://stackoverflow.com/a/2314680/304209) – Dennis Golomazov Nov 22 '12 at 13:05
  • Spaces will not make any difference. Each file name will be in separate line, so file list delimiter will be "\n" not space. – Ivan Nevostruev Aug 20 '13 at 20:43
  • 1
    On Windows, this works better and is much faster than find: dir /s /b | perl -ne 'print $1 if m/\.([^^.\\\\\]+)$/' | sort -u – Ryan Shillington Dec 09 '13 at 22:30
  • Note: if you want to make it an alias in `.bashrc`, you have to escape `$1` as `\$1`. In fact it seems escaping `$1` doesn't do harm for console usage either. – jakub.g Dec 04 '15 at 12:32
  • 4
    git variation of the answer: [use `git ls-tree -r HEAD --name-only` instead of `find`](http://stackoverflow.com/a/34088712/245966) – jakub.g Dec 04 '15 at 12:45
  • This seems to show the string after the first dot, e.g. `theme` from `page_manager.theme.inc`. – user151841 Jan 05 '16 at 17:52
  • 17
    A variation, this shows the list with counts per extension: `find . -type f | perl -ne 'print $1 if m/\.([^.\/]+)$/' | sort | uniq -c | sort -n` – marcovtwout May 17 '16 at 15:08
  • As a drawback, it also finds files like `configs-0.1.6`, which don't have extensions but have dots in their names. – mrgloom Apr 03 '19 at 16:53
  • how to ignore hidden files coming into the list? – Ghansham Sep 23 '19 at 03:28
  • This is black magic, people. It works incredibly fast. Incredibly! – cablop Oct 25 '20 at 20:25
  • The perl bit can be shortened to `perl -ne 's/.+\.// && print'` – Ahmad Ismail Jul 15 '21 at 15:35
88

No need for the pipe to sort, awk can do it all:

find . -type f | awk -F. '!a[$NF]++{print $NF}'
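
The awk idiom `!a[$NF]++` prints a value only the first time it is seen, so the output keeps first-seen order instead of being sorted. A rough Python sketch of the same logic, assuming the input lines look like `find` output; the function name `first_seen_last_fields` is hypothetical:

```python
def first_seen_last_fields(lines, sep="."):
    """Mimic awk -F. '!a[$NF]++{print $NF}': keep the last field of each
    line, printing it only the first time it appears."""
    seen = set()
    ordered = []
    for line in lines:
        last_field = line.split(sep)[-1]  # $NF with "." as the separator
        if last_field not in seen:        # !a[$NF]++ is true only once
            seen.add(last_field)
            ordered.append(last_field)
    return ordered
```

The sketch also makes the limitation raised in the comments visible: the split happens on the whole path, so `./test.dir/myfile` yields `dir/myfile`, not an extension.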
Benjamin Loison
SiegeX
  • I am not getting this to work as an alias, I am getting awk: syntax error at source line 1 context is >>> !a[] <<< awk: bailing out at source line 1. What am I doing wrong? My alias is defined like this: alias file_ext="find . -type f -name '*.*' | awk -F. '!a[$NF]++{print $NF}'" – user2602152 Mar 01 '15 at 15:55
  • 2
    @user2602152 the problem is that you are trying to surround the entire one-liner with quotes for the `alias` command but the command itself already uses quotes in the find command. To fix this I would use `bash`'s literal string syntax as so: `alias file_ext=$'find . -type f -name "*.*" | awk -F. \'!a[$NF]++{print $NF}\''` – SiegeX Mar 14 '15 at 06:04
  • This doesn't work if a subdirectory has a `.` in its name and the file has no extension. Example: when run from maindir, it fails for `maindir/test.dir/myfile` – Nelson Teixeira Apr 02 '17 at 05:52
  • 1
    @NelsonTeixeira Add `-printf "%f\n"` to the end of the 'find' command and re-run your test. – SiegeX Apr 03 '17 at 16:33
  • I found what I was looking for. Your command helped me list the file types, but I wanted a count next to each type. I googled and found this: `find . -type f | sed -n 's/..*\.//p' | sort | uniq -c`. Thanks for the help – Big Joe Sep 28 '21 at 02:26
67

My awk-less, sed-less, Perl-less, Python-less alternative, using only standard command-line utilities (`rev` is not specified by POSIX, but ships with util-linux on Linux):

find . -name '*.?*' -type f | rev | cut -d. -f1 | rev | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -rn

The trick is that it reverses each line so that the extension comes first, cuts off everything after the first dot, and reverses the result back.
It also converts the extensions to lower case.

Example output:

   3689 jpg
   1036 png
    610 mp4
     90 webm
     90 mkv
     57 mov
     12 avi
     10 txt
      3 zip
      2 ogv
      1 xcf
      1 trashinfo
      1 sh
      1 m4v
      1 jpeg
      1 ini
      1 gqv
      1 gcs
      1 dv
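
The reverse, cut, reverse trick behind that pipeline can be sketched in Python as below. This is only an illustration: `extension_counts` is a made-up name, and the input is assumed to be already filtered the way `-name '*.?*'` does it. Ties in the counts may come out in a different order than `sort -rn` produces.

```python
from collections import Counter


def extension_counts(paths):
    """Count lower-cased extensions, most frequent first."""
    counts = Counter()
    for path in paths:
        reversed_path = path[::-1]                    # rev
        first_field = reversed_path.split(".", 1)[0]  # cut -d. -f1
        ext = first_field[::-1].lower()               # rev | tr upper lower
        counts[ext] += 1
    # Roughly: sort | uniq -c | sort -rn
    return counts.most_common()
```

Reversing first means the field of interest is always field 1, no matter how many dots the path contains.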
Ondra Žižka
62

Recursive version:

find . -type f | sed -e 's/.*\.//' | sed -e 's/.*\///' | sort -u

If you want totals (how many times each extension was seen):

find . -type f | sed -e 's/.*\.//' | sed -e 's/.*\///' | sort | uniq -c | sort -rn

Non-recursive (single folder):

for f in *.*; do printf "%s\n" "${f##*.}"; done | sort -u

I've based this upon this forum post, credit should go there.

David Mohundro
ChristopheD
  • Great! also works for my git scenario, was trying to figure out which type of files I have touched in the last commit: `git show --name-only --pretty="" | sed -e 's/.*\.//' | sed -e 's/.*\///' | sort -u` – vulcan raven Feb 03 '20 at 14:45
41

Powershell:

dir -recurse | select-object extension -unique

Thanks to http://kevin-berridge.blogspot.com/2007/11/windows-powershell.html

David Mohundro
Simon R
  • 21
    The OP said "On a Linux machine" – Forbesmyester Aug 05 '13 at 13:37
  • 12
    Actually, there is PowerShell for Linux out now: https://github.com/Microsoft/PowerShell-DSC-for-Linux – KIC Sep 16 '16 at 13:44
  • 5
    As written, this will also pick up directories that have a `.` in them (e.g. `jquery-1.3.4` will show up as `.4` in the output). Change to `dir -file -recurse | select-object extension -unique` to get only file extensions. – mcw Mar 05 '18 at 15:49
  • 2
    @Forbesmyester: People with Windows (like me) will find this question too. So this is useful. – Roel Feb 25 '20 at 09:39
  • 3
    Thanks for Powershell answer. You don't assume how users search. Lot of people upvoted for a reason – Mahesh Apr 08 '20 at 02:52
19

Adding my own variation to the mix. I think it's the simplest of the lot and can be useful when efficiency is not a big concern.

find . -type f | grep -oE '\.(\w+)$' | sort -u
gkb0986
  • 1
    +1 for portability, although the regex is quite limited, as it only matches extensions consisting of a single letter. Using the regex from the accepted answer seems better: `$ find . -type f | grep -o -E '\.[^.\/]+$' | sort -u` – mMontu Dec 09 '13 at 11:48
  • 1
    Agreed. I slacked off a bit there. Editing my answer to fix the mistake you spotted. – gkb0986 Dec 09 '13 at 17:38
  • Cool. I changed the quotes to double quotes and updated the grep [binaries and **dependencies**](http://gnuwin32.sourceforge.net/packages/grep.htm) (because the version provided with git is outdated), and now this works under Windows. I feel like a Linux user. – msangel Apr 21 '15 at 00:24
  • 1
    I like this approach. Just would change the regex a bit `$ find . -type f | grep -Eo '\.(\w+)$' | sort -u`. The original one shows files without extension in my case that was not what I needed. – Fernando Crespo Mar 17 '21 at 17:51
  • Nr1, thanks alot for this minimal and elegant example – wuseman Oct 16 '22 at 23:45
13

Find everything with a dot and show only the suffix.

find . -type f -name "*.*" | awk -F. '{print $NF}' | sort -u

If you know all suffixes have 3 characters, then:

find . -type f -name "*.???" | awk -F. '{print $NF}' | sort -u

Or, with sed, show all suffixes of one to four characters. Change {1,4} to the range of characters you are expecting in the suffix.

find . -type f | sed -n 's/.*\.\(.\{1,4\}\)$/\1/p'| sort -u
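
A small Python sketch of the length-limited variant, using the same regular expression as the sed command. The function name `suffixes_with_length` and its parameters are hypothetical; they just make the {1,4} range adjustable.

```python
import re


def suffixes_with_length(paths, min_len=1, max_len=4):
    """Mimic sed -n 's/.*\\.\\(.\\{min,max\\}\\)$/\\1/p' | sort -u."""
    pattern = re.compile(r".*\.(.{%d,%d})$" % (min_len, max_len))
    found = set()
    for path in paths:
        match = pattern.match(path)
        if match:  # like sed -n ... p: lines that do not match print nothing
            found.add(match.group(1))
    return sorted(found)
```

One caveat the sketch shares with the sed command: the suffix part matches any character, including a slash, so a short extension-less path such as `./ab` is reported as `/ab`, because the dot in the leading `./` is taken as the separator.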
user224243
  • 1
    No need for the pipe to 'sort', awk can do it all: find . -type f -name "*.*" | awk -F. '!a[$NF]++{print $NF}' – SiegeX Dec 06 '09 at 12:14
  • @SiegeX Yours should be a separate answer. I found that command to work the best for large folders, as it prints the extensions as it finds them. But note that it should be: -name "*.*" – Ralf Aug 18 '11 at 07:54
  • @Ralf done, posted answer [here](http://stackoverflow.com/questions/1842254/how-can-i-find-all-of-the-distinct-file-extensions-in-a-folder-hierarchy/7170782#7170782). Not quite sure about what you mean by the `-name "."` thing because that's what it already is – SiegeX Aug 24 '11 at 05:24
  • 1
    I meant it should be -name "\*.\*", but StackOverflow removes the * characters, which probably happened in your comment as well. – Ralf Aug 24 '11 at 10:40
  • It seems like this should be the accepted answer, awk is preferable to perl as a command-line tool and it embraces the unix philosophy of piping small interoperable programs into cohesive and readable procedures. – jrz Sep 15 '15 at 15:54
9

I tried a bunch of the answers here, even the "best" answer. They all came up short of what I specifically was after. So, after spending the past 12 hours working through regex syntax for multiple programs and reading and testing these answers, this is what I came up with, which works EXACTLY like I want.

 find . -type f -name "*.*" | grep -o -E "\.[^\.]+$" | grep -o -E "[[:alpha:]]{2,16}" | awk '{print tolower($0)}' | sort -u
  • Finds all files which may have an extension.
  • Greps only the extension
  • Greps for file extensions between 2 and 16 characters (just adjust the numbers if they don't fit your need). This helps avoid cache files and system files (system file bit is to search jail).
  • Awk to print the extensions in lower case.
  • Sort and bring in only unique values. Originally I had attempted to try the awk answer but it would double print items that varied in case sensitivity.

If you need a count of the file extensions then use the below code

find . -type f -name "*.*" | grep -o -E "\.[^\.]+$" | grep -o -E "[[:alpha:]]{2,16}" | awk '{print tolower($0)}' | sort | uniq -c | sort -rn

While these methods will take some time to complete and probably aren't the best ways to go about the problem, they work.

Update: Per @alpha_989, long file extensions will cause an issue. That was due to the original regex "[[:alpha:]]{3,6}". I have updated the answer to use the regex "[[:alpha:]]{2,16}". However, anyone using this code should be aware that those numbers are the minimum and maximum length allowed for an extension in the final output. A longer run of letters will be split across multiple lines in the output, and a shorter one will be dropped.
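
To see what the length range does to a single extension, here is a tiny Python sketch of the second `grep -o -E` stage. It assumes ASCII letters stand in for `[[:alpha:]]` (which is locale-dependent in grep), and `alpha_runs` is a made-up helper name.

```python
import re


def alpha_runs(text, min_len, max_len):
    """Return every run of min_len..max_len ASCII letters, the way
    grep -o -E "[[:alpha:]]{min,max}" prints one match per line."""
    return re.findall(r"[A-Za-z]{%d,%d}" % (min_len, max_len), text)


# A long extension is truncated or split, depending on the minimum:
#   alpha_runs(".download", 3, 6)  gives ["downlo"]
#   alpha_runs(".download", 2, 6)  gives ["downlo", "ad"]
#   alpha_runs(".download", 2, 16) gives ["download"]
# Digits are not letters, so they are silently dropped:
#   alpha_runs(".mp4", 2, 16)      gives ["mp"]
```

So besides the length limits, extensions containing digits (such as `mp4`) lose those digits in the final list.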

Note: Original post did read "- Greps for file extensions between 3 and 6 characters (just adjust the numbers if they don't fit your need). This helps avoid cache files and system files (system file bit is to search jail)."

Idea: Could be used to find file extensions over a specific length via:

 find . -type f -name "*.*" | grep -o -E "\.[^\.]+$" | grep -o -E "[[:alpha:]]{4,}" | awk '{print tolower($0)}' | sort -u

Here 4 is the minimum extension length to include; any extension longer than that is matched as well.

HoldOffHunger
Shinrai
  • Is the count version recursive? – Fernando Montoya Feb 03 '16 at 03:51
  • @Shinrai, In general works well. but if you have some random file extensions which are really long such as .download, it will break the ".download" into 2 parts and report 2 files one which is "downlo" and another which is "ad" – alpha_989 Dec 09 '17 at 20:49
  • @alpha_989, That's due to the regex "[[:alpha:]]{3,6}" will also cause an issue with extensions smaller than 3 characters. Adjust to what you need. Personally I'd say 2,16 should work in most cases. – Shinrai Apr 04 '18 at 00:28
  • Thanks for replying.. Yeah.. thats what I realized later on. It worked well after I modified it similar to what you mentioned. – alpha_989 Apr 04 '18 at 03:28
  • `find . -type f -name "*.*" | grep -o -E "\.[^\.]+$" | grep -o -E "[[:alpha:]]{2,16}" | awk '{print tolower($0)}' | sort | uniq -c | sort -rn` - this works well - but is there a way to get the total file size of each php extension ? – anjanesh Mar 02 '23 at 12:12
5

In Python using generators for very large directories, including blank extensions, and getting the number of times each extension shows up:

import json
import collections
import itertools
import os

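root = '/home/andres'
# Lazily chain the per-directory file lists produced by os.walk.
files = itertools.chain.from_iterable((
    files for _,_,files in os.walk(root)
    ))
# Count each extension; files without one are counted under ''.
counter = collections.Counter(
    (os.path.splitext(file_)[1] for file_ in files)
)
# print() is a function in Python 3.
print(json.dumps(counter, indent=2))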
Alvin
Andres Restrepo
4

Since there's already another solution which uses Perl:

If you have Python installed you could also do (from the shell):

python3 -c "import os;e=set();[[e.add(os.path.splitext(f)[-1]) for f in fn]for _,_,fn in os.walk('/home')];print('\n'.join(e))"
ChristopheD
4

Another way:

find . -type f -name "*.*" -printf "%f\n" | while IFS= read -r; do echo "${REPLY##*.}"; done | sort -u

You can drop the -name "*.*" but this ensures we are dealing only with files that do have an extension of some sort.

The -printf here is find's own action (a GNU find extension), not bash's printf. -printf "%f\n" prints only the filename, stripping the path (and adds a newline).

Then we use string substitution to remove up to the last dot using ${REPLY##*.}.

Note that $REPLY is simply read's built-in default variable. We could just as well use our own, in the form: while IFS= read -r file, and then $file would be the variable.
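
For readers less familiar with `${REPLY##*.}`, this is roughly what the expansion does, sketched in Python. Both function names are hypothetical, and the input is assumed to be bare filenames, as produced by `-printf "%f\n"`.

```python
def strip_through_last_dot(name):
    """Mimic ${name##*.}: remove the longest prefix ending in a dot.
    A name without any dot is returned unchanged, as in the shell."""
    return name.rpartition(".")[2]


def extensions_from_basenames(basenames):
    """Apply the expansion to each name, then de-duplicate like sort -u."""
    return sorted({strip_through_last_dot(name) for name in basenames})
```

This is also why the `-name "*.*"` filter matters: without it, a name with no dot passes through unchanged and shows up in the list as if it were an extension.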

Rajib
2

None of the replies so far deal with filenames with newlines properly (except for ChristopheD's, which just came in as I was typing this). The following is not a shell one-liner, but works, and is reasonably fast.

import os, sys

def names(roots):
    for root in roots:
        for a, b, basenames in os.walk(root):
            for basename in basenames:
                yield basename

sufs = set(os.path.splitext(x)[1] for x in names(sys.argv[1:]))
for suf in sufs:
    if suf:
        print(suf)
2

I think the simplest and most straightforward way is

for f in *.*; do echo "${f##*.}"; done | sort -u

It's a modification of ChristopheD's third (non-recursive) variant.

Robert
2

I don't think this one was mentioned yet:

find . -type f -exec sh -c 'echo "${0##*.}"' {} \; | sort | uniq -c
Dmitry B.
2

The accepted answer uses a regex, which is awkward to quote inside an alias, so I put it into a shell script instead. I'm using Amazon Linux 2 and did the following:

  1. Put the accepted answer's code into a file:

    sudo vim find.sh

     Add this code:

find ./ -type f | perl -ne 'print $1 if m/\.([^.\/]+)$/' | sort -u

     Save the file by typing :wq!

  2. sudo vim ~/.bash_profile

  3. alias getext=". /path/to/your/find.sh"

  4. :wq!

  5. . ~/.bash_profile

Chris Medina
0

you could also do this

find . -type f -name "*.php" -exec PATHTOAPP {} +
jrock2004
0

I've found it simple and fast...

   # find . -type f -exec basename {} \; | awk -F"." '{print $NF}' > /tmp/outfile.txt
   # cat /tmp/outfile.txt | sort | uniq -c| sort -n > /tmp/outfile_sorted.txt
0

If you are looking for an answer that respects .gitignore, use the command below.

git ls-tree -r HEAD --name-only | perl -ne 'print $1 if m/\.([^.\/]+)$/' | sort -u 
Nisharg Shah