Get distinct extension list Linux

Question

I am new to Linux and am currently stuck on a problem. I want to get the list of file extensions (.doc, .pdf) found in a folder. I googled a lot and finally found the solution given below:

 find . -type f | awk -F. '!a[$NF]++{print $NF}'

I understand find . -type f, but I am unable to understand awk -F. '!a[$NF]++{print $NF}'. What does it mean?

NF = Number of Fields in the current record

Can anyone explain?

Thanks in advance.

Possible duplicate of [How can I delete duplicate lines in a file in Unix?](https://stackoverflow.com/questions/1444406/how-can-i-delete-duplicate-lines-in-a-file-in-unix) — Sundeep, Feb 06 '18 at 04:47
Stack Overflow is a site for programming and development questions. This question appears to be off-topic because it is not about programming or development. See [What topics can I ask about here](http://stackoverflow.com/help/on-topic) in the Help Center. Perhaps [Super User](http://superuser.com/) or [Unix & Linux Stack Exchange](http://unix.stackexchange.com/) would be a better place to ask. — jww, Feb 06 '18 at 05:29

kvantour · Answer 1 · 2018-02-06T13:23:56.083

To answer your question about what the awk line is doing:

As you already indicated, find . -type f returns a list of the regular files located in the current directory and all of its subdirectories, e.g.

./foo.ext1
./bar.ext2
./spam.ext2
./ham.ext3
./spam.ham.eggs

This list of files is sent through a pipe to the command awk -F. '!a[$NF]++{print $NF}'. This awk line packs in a lot of information. First of all, you need to know that awk is a record parser where each record consists of a number of fields. The default record is a line, while the default field separator is a sequence of blanks (spaces or tabs). So what does your awk line do?

-F. :: this redefines the field separator to be a dot (.). From this point on, ignoring the leading ./ that find prepends, the first four lines of the example have 2 fields (e.g. line 1: foo and ext1), while the last line has 3 fields (spam, ham and eggs). Strictly speaking, the dot in ./ is a separator too, so awk sees one extra empty field at the front, but that does not change the last field.
NF :: this is an awk variable that holds the number of fields in the current record. The extension is therefore the last field, $NF.
a[$NF] :: this is an array where the index is the extension. An element that has never been assigned evaluates to zero in a numeric context.
a[$NF]++ :: this returns the current value of a[$NF] and increments the value by 1 after the return. Thus for line 1, a["ext1"]++ returns 0 and sets a["ext1"] to 1, while for line 3, a["ext2"]++ returns 1 and sets a["ext2"] to 2. In other words, a[$NF] keeps track of the number of times $NF has appeared.
!a[$NF]++ :: this combines the logic of the above but checks whether the return value of a[$NF]++ is 0. If it is 0, the result is true, otherwise false. For line 2 of the example, the statement is true because a["ext2"]++ returns 0; afterwards a["ext2"] has the value 1. When reading line 3, the statement is false. In other words, it asks "is this the first time we see $NF?" and, whatever the answer, increments the count of $NF by one.
!a[$NF]++{print $NF} :: this combines everything. It essentially states: if !a[$NF]++ is true, then print $NF, but before printing increment a[$NF] by one. Or in other words, if the field representing the extension ($NF) appears for the first time, print that field. If it has already appeared before, do nothing.
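
If the compact form is hard to read, the same logic can be spelled out longhand. This is only a sketch to illustrate the idiom: the names `paths`, `result`, `ext` and `seen` are made up for the example, and the sample paths are fed in with printf rather than taken from a real find run.

```shell
# Sample paths standing in for the output of `find . -type f`
paths='./foo.ext1
./bar.ext2
./spam.ext2
./ham.ext3
./spam.ham.eggs'

# Longhand equivalent of: awk -F. '!a[$NF]++{print $NF}'
result=$(printf '%s\n' "$paths" | awk -F. '{
    ext = $NF                        # last dot-separated field
    if (seen[ext] == 0) print ext    # a never-assigned element compares equal to 0
    seen[ext]++                      # count this occurrence
}')

printf '%s\n' "$result"
```

Each extension is printed once, in the order in which it is first seen.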

The incrementing of the array is important, as it keeps track of what has been seen already. So, line by line, the following happens:

foo.ext1       => $NF="ext1", a["ext1"] is 0 so print $NF and set a["ext1"]=1
bar.ext2       => $NF="ext2", a["ext2"] is 0 so print $NF and set a["ext2"]=1
spam.ext2      => $NF="ext2", a["ext2"] is 1 so do not print and set a["ext2"]=2
ham.ext3       => $NF="ext3", a["ext3"] is 0 so print $NF and set a["ext3"]=1
spam.ham.eggs  => $NF="eggs", a["eggs"] is 0 so print $NF and set a["eggs"]=1

The output is

ext1
ext2
ext3
eggs

General comments:

For a file without any extension, whether or not it sits in a hidden directory (e.g. ./path/to/awesome_filename_without_extension or ./path/to/.secret/filename_without_extension), part of its full path is printed as if it were the extension. The result is meaningless, i.e.
```
/path/to/awesome_filename_without_extension
secret/filename_without_extension
```
This is best resolved as
```
find . -type f -exec  basename -a '{}' + \
  | awk -F. '((NF>1)&&(!a[$NF]++)){print $NF}'
```
Here the files found by find are passed directly to basename, which strips the directory from each filename. The awk line does one more check: do we have more than 1 field (i.e. is there an extension)?
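
To see the difference between the two commands, here is a small throwaway test. It is a sketch under a few assumptions: `mktemp -d` is available, `basename` accepts `-a` (GNU coreutils does), and the file and directory names are invented for the demo.

```shell
# Build a scratch tree with two "real" extensions and two extension-less files
tmp=$(mktemp -d)
mkdir -p "$tmp/path/to/.secret"
touch "$tmp/foo.ext1" "$tmp/bar.ext2" \
      "$tmp/path/to/no_extension" "$tmp/path/to/.secret/also_none"

# Original one-liner: path fragments leak into the result
naive=$(cd "$tmp" && find . -type f | awk -F. '!a[$NF]++{print $NF}' | sort)

# basename + NF>1 check: only genuine extensions remain
fixed=$(cd "$tmp" && find . -type f -exec basename -a '{}' + \
        | awk -F. '((NF>1)&&(!a[$NF]++)){print $NF}' | sort)

printf 'naive:\n%s\n\nfixed:\n%s\n' "$naive" "$fixed"
rm -rf "$tmp"
```

The sort is only there to make the output order repeatable, since find does not guarantee any particular order.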

score 1 · Answer 2 · answered Feb 06 '18 at 05:45

A very simple way of doing what you are attempting is to sort the output keeping only unique extensions, e.g.

find . -type f -regex ".*[.][a-zA-Z0-9][a-zA-Z0-9]*$" | \
awk -F '.' '{ print $NF }' | sort -u

If your sort doesn't support the -u option, then you can pipe the results of sort to uniq, e.g.

find . -type f -regex ".*[.][a-zA-Z0-9][a-zA-Z0-9]*$" | \
awk -F '.' '{ print $NF }' | sort | uniq
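
One practical difference from the awk-only version in the question is the order of the result: the awk array keeps first-seen order, while sort -u returns the list sorted. A minimal sketch with made-up file names (`paths`, `first_seen` and `sorted` are just names for this example):

```shell
paths='./b.txt
./a.pdf
./c.txt'

# awk array dedupe: keeps the order in which extensions first appear
first_seen=$(printf '%s\n' "$paths" | awk -F. '!a[$NF]++{print $NF}')

# sort -u dedupe: output is sorted
sorted=$(printf '%s\n' "$paths" | awk -F '.' '{ print $NF }' | sort -u)

printf 'awk dedupe: %s\n' "$(printf '%s' "$first_seen" | tr '\n' ' ')"   # txt pdf
printf 'sort -u:    %s\n' "$(printf '%s' "$sorted" | tr '\n' ' ')"       # pdf txt
```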

The -regex option limits the find selection to filenames whose extension consists of at least one ASCII alphanumeric character. However, it will also pick up files that have no real extension but contain a '.', e.g. foo.bar.fatcat would result in fatcat being included in the list.

You could adjust the regular expression to meet your needs. If your version of find supports posix-extended regular expressions then you can prevent longer extensions from being picked up. For example to limit the extension to 1-3 characters, you could use:

find . -type f -regextype posix-extended -regex ".*[.][a-zA-Z0-9]{1,3}$" | \
awk -F '.' '{ print $NF }' | sort -u

There are other ways to approach this, but given your initial example, this is a close follow-on.
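
To check what the two regular expressions let through, you can try them on a scratch directory. This is a sketch that assumes GNU find (for -regextype) and `mktemp -d`; the file names are invented for the test.

```shell
tmp=$(mktemp -d)
touch "$tmp/foo.txt" "$tmp/noext" "$tmp/foo.bar.fatcat" "$tmp/archive.tar.gz"

# Any alphanumeric extension of length >= 1
any_len=$(cd "$tmp" && find . -type f -regex ".*[.][a-zA-Z0-9][a-zA-Z0-9]*$" |
    awk -F '.' '{ print $NF }' | sort -u)

# Extension limited to 1-3 characters (posix-extended syntax)
short=$(cd "$tmp" && find . -type f -regextype posix-extended \
        -regex ".*[.][a-zA-Z0-9]{1,3}$" |
    awk -F '.' '{ print $NF }' | sort -u)

printf 'any length:\n%s\n\n1-3 chars:\n%s\n' "$any_len" "$short"
rm -rf "$tmp"
```

The first list contains fatcat, gz and txt; the second drops fatcat. In both cases archive.tar.gz contributes only gz, and noext is skipped.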

Nice answer+1! However what do you do with the hidden files? and the files with extension like `.tar.gz`? — Allan, Feb 06 '18 at 05:47
Yes, that is the sticky wicket. You can always adjust your regex in as wild of ways as you need. As is -- it does handle hidden file natively, but would not handle extensions with multiple parts such as `.tar.gz`. (doable -- just much longer regular expressions and `awk` logic) — David C. Rankin, Feb 06 '18 at 05:49

score -1 · Accepted Answer · answered Feb 06 '18 at 05:08

You can use the following command for this purpose:

$ find <DIR> -type f -print0 | xargs -0 -n1 basename | grep -Po '(?<=.)\..*$' | sort | uniq
.bak
.c
.file
.file.bak
.input
.input.bak
.log
.log.bak
.out
.out.bak
.test
.test.bak
.txt
.txt.bak

Here the find command looks for all files under the <DIR> subtree and passes them to basename to get only the filename without the path part (-print0 and -0 are used to cope with spaces in file names). grep then keeps only the part of the name that starts at the first dot which is not the very first character (the extension: .tar, .txt, .tar.gz), so hidden files whose names start with a dot and contain no other dot are ignored. After that you sort the results and keep only the unique values.

If you do not need the leading . in the extension name, append

| sed 's/^\.//'
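
Here is a quick way to see what the grep pattern keeps, without touching the file system. It is a sketch that assumes a grep built with PCRE support (the -P option, as in GNU grep); the names in `names` are made up.

```shell
# Basenames as they would come out of `basename`
names='archive.tar.gz
notes.txt
notes.txt.bak
.hidden
Makefile'

# (?<=.) demands one character before the dot, so a leading dot cannot start a match
exts=$(printf '%s\n' "$names" | grep -Po '(?<=.)\..*$' | sort | uniq | sed 's/^\.//')

printf '%s\n' "$exts"
```

.hidden and Makefile produce nothing, while archive.tar.gz yields tar.gz and notes.txt.bak yields txt.bak, because everything from the first matching dot onwards is treated as the extension.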

answered Feb 06 '18 at 05:08

Allan


Thanks I got what you said, Can you please explain what does 'awk' means? – y hasnat Feb 06 '18 at 05:24
Hope it helps you! Yeah sure: in brief, `awk` is a programming language used to manipulate text files column by column; you can use it for computation or filtering ;-) It is very powerful, but you need a bit of time to learn it since it is a whole programming language by itself ;-) The awk command you are using takes `.` as the field separator and prints the last field of each line if it has not already been printed (using an `awk` array to do that and increasing the value of the cell indexed by the extension) – Allan Feb 06 '18 at 05:26
However do not use that command since it will fail in many cases if you have hidden files for example or extensions with several `.` ;-) like `.tar.gz` what is really common in Unix world – Allan Feb 06 '18 at 05:30
The name `awk` comes from Aho, Weinberger, and Kernighan, the surnames of its inventors/initial programmers; Kernighan is the K in Kernighan & Ritchie, who are known for the book The C Programming Language. – user unknown Feb 06 '18 at 10:33

The `-print0 | xargs -0 -n1 basename` can be replaced with `-printf "%f\n"`. In most cases `-print0 | xargs -0` is better replaced with one of the `-exec` variants (-exec, -execdir, -ok, -okdir), because it keeps the find chain alive and it is just useless to pipe the output to another command which merely iterates over files as find already does, unless your find lacks the -exec options. – user unknown Feb 06 '18 at 10:42
