
I need to write a script that looks for the largest file in a given directory (including its subdirectories).

I figured out that if I use "tree" to generate a textual representation of all files, I can then have the script compare the sizes and output the largest one.

I ended up with a text file that looks something like this:

.
[        939]  "./Documents/Alfa/driver/wlan0up"
[        234]  "./Documents/Alfa/driver/wpa1.conf"
[    1623520]  "./Documents/Alfa/driver/wpa_supplicant-0.5.5.zip"
[    5488640]  "./Documents/Alfa/R36-V1.2.1.2b6.img"
[       3385]  "./Documents/C code/Ide.s"
[       4096]  "./Documents/fluxion-master"
[         25]  "./Documents/fluxion-master/_config.yml"
[       4096]  "./Documents/fluxion-master/docs"
[      35141]  "./Documents/fluxion-master/docs/LICENSE"
[      83788]  "./Documents/fluxion-master/fluxion"
~~ long list of other files
[       6909]  "./.ZAP/session/untitled2.script"
[      64411]  "./.ZAP/zap.log"
[       4096]  "./.zenmap"
[          0]  "./.zenmap/recent_scans.txt"
[       2018]  "./.zenmap/scan_profile.usp"
[         85]  "./.zenmap/target_list.txt"
[       1486]  "./.zenmap/zenmap.conf"
[     409600]  "./.zenmap/zenmap.db"
[          5]  "./.zenmap/zenmap_version"

429 directories, 3327 files

Now, all I need is to have the script read through the list and compare the sizes until the list ends, then output the largest file's name and size.
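Since each entry carries its size between square brackets, a short awk pass over the listing can already do the comparison. A minimal sketch, assuming the listing above is saved as list.txt (the file name is my own choice):

```shell
# Split each line on "[" and "]": field 2 is the bracketed size,
# the rest of the line is the quoted path.
# Lines without brackets (the "." header and the final summary)
# coerce to size 0 and are effectively ignored.
awk -F'[][]' '
    { size = $2 + 0                        # turn "        939" into a number
      if (size > max) { max = size; line = $0 } }
    END { print line }                     # entry with the largest size
' list.txt
```

For the sample listing above, this prints the line for R36-V1.2.1.2b6.img, the 5488640-byte entry.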

I went through some other Stack Overflow entries using sed and grep, but didn't have any luck:

Read a file line by line assigning the value to a variable

Looping through the content of a file in Bash?

https://codereview.stackexchange.com/questions/59417/extracting-data-from-text-file-in-bash-using-awk-grep-head-and-tail

Please note that tree is capable of formatting its output as an XML file, using tags and attributes like <directory name="fileName" size="XXXX"></directory>, so if parsing the XML file is easier, that would be fine too.

Folders are also listed in there, but we can ignore those.

Any help would be appreciated. Thanks!

Mo3tasm

2 Answers


Just sort your list by the numbers and grab the first line:

sort -Vr yourList.txt | head -n 1

I have the feeling that you used a rather large script to produce the list. The list is also a bit unsafe: what happens if a filename contains a newline (yes, that is possible on Linux)? The following command finds the largest file in the current directory (including subdirectories) and prints its size and name.

find . -type f -exec du -b {} + | sort -nr | head -n 1

If you want just the file name, add | sed 's/^[0-9]\+\t//' to the end.
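Putting the two pieces together (note that du -b, which reports the apparent size in bytes, is a GNU extension):

```shell
# Largest file under the current directory: size and path...
find . -type f -exec du -b {} + | sort -n -r | head -n 1
# ...or just the path, with the leading size field stripped.
find . -type f -exec du -b {} + | sort -n -r | head -n 1 | sed 's/^[0-9]\+\t//'
```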

Socowi

Don't use tree. Instead, just iterate over the files and call stat to get the size of each file, remembering the largest file seen so far. In bash 4 or later, it is as simple as

shopt -s globstar
max_size=0
for f in **/*; do
    [[ -f $f ]] || continue          # skip directories
    size=$(stat -c %s "$f")
    if (( size > max_size )); then
        max_size=$size
        max_file=$f
    fi
done
echo "$max_file ($max_size bytes)"


With earlier versions of bash, you need to define a recursive function to simulate **:

dir_iter () {
    for f in "$1"/*; do
        if [[ -d $f ]]; then
            dir_iter "$f"
        else
            size=$(stat -c %s "$f")
            if ((size > max_size)); then
                max_size=$size
                max_file=$f
            fi
        fi
    done
}
max_size=0
dir_iter .
echo "$max_file ($max_size bytes)"

(Note that you should consult your local documentation for the exact form of the stat command, which may vary. BSD stat, for instance, uses -f instead of -c, e.g. size=$(stat -f %z "$f").)


One objection is that this requires one call to stat per file. That is expensive, but it avoids the (admittedly rare) issue of handling file names that contain newlines.

If you have zsh available, it is as simple as max_file=$(zsh -c 'print **/*(OL[1])'). If you are actually using zsh, then it's just print -v max_file **/*(OL[1]).

If you decide not to worry about filenames with newlines, you can do the following:

find . -type f -exec stat -c '%s %n' {} + | sort -k1,1nr | head -n 1

I leave dealing with filenames containing newlines as an exercise to the reader; typically, I would just use a different language that can represent sequences of arbitrary strings properly. Another option would be to look at the finfo command found in the examples/loadables directory of the bash source distribution. It's an example of creating a shell built-in command that does the same thing as stat without creating a new process. It could be modified to add a -v option similar to that supported by printf so that you can set a shell variable from the output.

finfo -v size -s "$f"  # equivalent to size=$(stat -c %s "$f"), but all in shell
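For completeness, a newline-safe version is possible in shell too, by using NUL-delimited records instead of a different language. This sketch assumes GNU find, sort, and head (the -z options are GNU extensions):

```shell
# Emit "size<TAB>path" records terminated by NUL instead of newline,
# so a newline embedded in a file name cannot break record boundaries.
find . -type f -printf '%s\t%p\0' |
    sort -z -n -r |        # -z: NUL-terminated records; -n -r: largest first
    head -z -n 1 |         # keep only the first (largest) record
    tr '\0' '\n'           # end the output with a newline for display
```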
chepner