
I'm writing bash code that will search for specific files in the directory it is run in and add them to an array variable. The problem I'm having is formatting the results. I need to find all the compressed files in the current directory and display both the names and sizes of the files, ordered by last-modified time. I want to take the results of that command and put them into an array variable, with each element containing a file's name and corresponding size, but I don't know how to do that. I'm not sure whether I should be using `find` instead of `ls`, but here is what I have so far:

find_files="$(ls -1st --block-size=MB)"
arr=( $find_files )
  • https://mywiki.wooledge.org/ParsingLs – jordanm Sep 09 '20 at 20:44
  • See also [BashFAQ/003](https://mywiki.wooledge.org/BashFAQ/003) – Benjamin W. Sep 09 '20 at 20:48
  • There's still the issue of finding the right command. The closest one I've found was (find . -maxdepth 1 -type f -printf "%T@\t%f\n" | sort -n | cut -f2-). But this command doesn't display the MB size – wcr221 Sep 09 '20 at 20:52
  • define 'compressed files'? please provide an example of a directory of files (with and without some 'compressed' files), then provide the expected array contents (eg, an associative array where the indices are the filenames and the values are the MBs?); what if the MB value is not a nice round number (eg, 1,325.27 MB) ... store the integer value, the real value, include commas? – markp-fuso Sep 09 '20 at 21:15
  • Compressed files meaning any file that is either a .zip, .bz, .lzma, .xz, or .gz file – wcr221 Sep 09 '20 at 21:17
  • `arr=( $(anything) )` is an antipattern. See [BashPitfalls #50](http://mywiki.wooledge.org/BashPitfalls#hosts.3D.28_.24.28aws_....29_.29). – Charles Duffy Sep 09 '20 at 21:27
  • ...if you want to read a stream into an array, use `readarray` or `mapfile` if you have a modern version bash, or a `while read` loop if you don't. – Charles Duffy Sep 09 '20 at 21:28
  • ...btw, your best answer here will depend on whether you're guaranteed to have a GNU version of `find` (with `-printf`, to tell it to customize its output format). – Charles Duffy Sep 09 '20 at 21:29
  • BTW, [How can I store the “find” command results as an array in Bash](https://stackoverflow.com/questions/23356779) is closely related; if you didn't have the size requirement, I'd be tagging this duplicate -- it's probably worth reading even absent that. – Charles Duffy Sep 09 '20 at 21:30
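
For reference, a minimal sketch of the `mapfile` approach the comments suggest, assuming GNU find/sort/cut and bash 4.4+ (for `mapfile -d ''`); sizes here are raw bytes, with the MB formatting left to the answers below:

mapfile -d '' -t arr < <(
    find . -maxdepth 1 -type f \
        \( -name '*.zip' -o -name '*.bz' -o -name '*.lzma' \
           -o -name '*.xz' -o -name '*.gz' \) \
        -printf '%T@ %s %p\0' |   # epoch, size (bytes), name
    sort -zn |                    # NUL-delimited sort by mtime
    cut -zd ' ' -f2-              # drop the epoch field
)
printf '%s\n' "${arr[@]}"         # each element: "size name"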

2 Answers


Both of these solutions work, and were tested by copy-pasting from this post.

The first is fairly slow. One problem is external program invocations within the loop: date, for example, is invoked once per file. You could make it quicker by not including the date in the output array (see Notes below). Particularly for method 2, that would leave no external command invocations inside the while loop. But method 1 is really the problem: it is orders of magnitude slower.

Also, somebody probably knows how to convert an epoch date to another format in awk, for example, which could be faster (a sketch follows below). Maybe you could do the sort in awk too. Perhaps just keep the epoch date?

These solutions are bash / GNU heavy and not portable to other environments (process substitution, find -printf). OP tagged linux and bash though, so GNU can be assumed.
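
For what it's worth, a sketch of that awk idea, assuming GNU awk (strftime() is a gawk extension): sort on the raw epoch first, then let a single gawk process format every date, rather than invoking date once per file:

find "$TARGET" -type f -printf '%p %s %T@\n' |
sort -n -k 3 |                                            # sort by mtime (epoch)
gawk '{ $3 = strftime("%Y-%m-%d %H:%M:%S", $3); print }'  # format dates in-process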

Solution 1 - capture any compressed file - using `file` to match (slow)

  • The criterion for 'compressed' is whether the `file` output contains the word compress
  • Reliable enough, but perhaps there is a conflict with some other file type description?
  • file -l | grep compress (file 5.38, Ubuntu 20.04, WSL) indicates for me there are no conflicts at all (all files listed are compression formats)
  • I couldn't find a way of classifying any compressed file other than this
  • I ran this on a directory containing 1664 files - time (real) was 40 seconds
#!/bin/bash

# Capture all files, recursively, in $TARGET, that are
# compressed files, in an indexed array, using `file`
# output to match.

# Initialise variables, and check the target is valid
declare -g c=0 compressed_files=() path= TARGET=$1
[[ -r "$TARGET" ]] || exit 1

# Make the array.
# Process substitution (< <(...)) must be used, to keep the
# array in the global environment; piping into the loop would
# populate it in a subshell. Records are space/newline
# delimited, so file names containing spaces or newlines are
# not handled here.
while IFS= read -r path; do
    [[ "$(file --brief "${path%% *}")" == *compress* ]] &&
    compressed_files[c++]="${path% *} $(date -d @"${path##* }")"
done < \
    <(
        find "$TARGET" -type f -printf '%p %s %T@\n' |
        awk '{$2 = ($2 / 1024); print}' |   # size: bytes -> KB
        sort -n -k 3                        # oldest modified first
    )

# Print results - to test
printf '%s\n' "${compressed_files[@]}"

Solution 2 - use file extensions - orders of magnitude faster

  • If you know exactly which extensions you are looking for, you can compose them in a find command

  • This is a lot faster

  • On the same directory as above, containing 1664 files - time (real) was 200 milliseconds

  • This example looks for .gz, .zip, and .7z (gzip, zip and 7zip respectively)

  • I'm not sure if -type f -and -regex '.*[.]\(gz\|zip\|7z\)' -and -printf may be faster again, now I think of it (see the sketch after this list). I started with globs because I assumed that was quicker

  • That may also allow for storing the extension list in a variable...

  • This method avoids a file analysis on every file in your target

  • It also makes the while loop shorter - you're only iterating over matches

  • Note the repetition of -printf here; this is due to the logic that find uses: -printf is 'true'. If it were included by itself, it would act as a 'match' and print all files

  • It has to be used as a result of a name match being true (using -and)

  • Perhaps somebody has a better composition?
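
For reference, a sketch of that -regex variant, with the extension list in a variable (untested for speed; -regextype posix-egrep is GNU find syntax):

exts='gz|zip|7z'
find "$TARGET" -regextype posix-egrep \
    -type f -regex ".*\.($exts)" -printf '%p %s %T@\n'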

#!/bin/bash

# Capture all files, recursively, in $TARGET, that are
# compressed files. In an indexed array. Using file name
# extensions to match.

# Initialise variables, and check the target is valid
declare -g c=0 compressed_files=() path= TARGET=$1
[[ -r "$TARGET" ]] || exit 1

# As in solution 1: process substitution keeps the array in the
# global environment, and records are space/newline delimited
# (file names with spaces or newlines are not handled).
while IFS= read -r path; do
    compressed_files[c++]="${path% *} $(date -d @"${path##* }")"
done < \
    <(
        find "$TARGET" \
            -type f -and -name '*.gz'  -and -printf '%p %s %T@\n' -or \
            -type f -and -name '*.zip' -and -printf '%p %s %T@\n' -or \
            -type f -and -name '*.7z'  -and -printf '%p %s %T@\n' |
        awk '{$2 = ($2 / 1024); print}' |   # size: bytes -> KB
        sort -n -k 3                        # oldest modified first
    )

# Print results - for testing
printf '%s\n' "${compressed_files[@]}"

Sample output (of either method):

$ comp-find.bash /tmp
/tmp/comptest/websters_english_dictionary.tmp.tar.gz 265.148 Thu Sep 10 07:53:37 AEST 2020
/tmp/comptest/What_is_Systems_Architecture_PART_1.tar.gz 1357.06 Thu Sep 10 08:17:47 AEST 2020

Note:

  • You can append a literal K in the awk print (e.g. ($2 / 1024) "K") to label the size units (kilobytes)

  • If you want to print only the path from this array, you can use suffix removal: printf '%s\n' "${compressed_files[@]%% *}"

  • For no date in the array (it's used to sort, but then its job may be done), simply remove $(date -d @"${path##* }") (incl. the space), as sketched below.

  • Kind of tangential, but to use different date formats, replace $(date -d @"${path##* }") with one of:
    • $(date -I -d @"${path##* }") - ISO format; note that the short-opts style date -Id @[date] did not work for me
    • $(date -d @"${path##* }" +%Y-%m-%d_%H-%M-%S) - like ISO, but with seconds
    • $(date -d @"${path##* }" +%Y-%m-%d_%H-%M-%S.%N) - same again, but with nanoseconds (find gives you nanoseconds)
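
Combining those notes, a sketch of the "K-labelled size, no date" variation - only two lines of either script change:

awk '{$2 = ($2 / 1024) "K"; print}'   # in the pipeline: label the KB value

compressed_files[c++]="${path% *}"    # in the loop: store "path sizeK", no date call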

Sorry for the long post; hopefully it's informative.

– dan

I'm not sure exactly what format you want the array to be in, but here is a snippet that creates an associative array keyed by filename with the size as the value:

$ ls -l test.{zip,bz2}
-rw-rw-r-- 1 user group 0 Sep 10 13:27 test.bz2
-rw-rw-r-- 1 user group 0 Sep 10 13:26 test.zip

$ declare -A sizes; while read -r SIZE FILENAME ; do sizes["$FILENAME"]="$SIZE"; done < <(find * -prune \( -name '*.zip' -o -name '*.bz2' \) | xargs stat -c "%Y %s %N" | sort -n | cut -f 2,3 -d " ")

$ echo "${sizes[@]@A}"
declare -A sizes=(["'test.zip'"]="0" ["'test.bz2'"]="0" )

$
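
One quirk worth noting: stat's %N prints each name inside quotes, so the quotes become part of the key. A lookup therefore has to include them (or you can strip them while filling the array):

$ key="'test.zip'"        # embedded single quotes, as produced by %N
$ echo "${sizes[$key]}"
0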

And if you just want an array of literally "filename size" entries, that's even easier:

$ unset sizes; while read -r SIZE FILENAME ; do sizes+=("$FILENAME $SIZE"); done < <(find * -prune \( -name '*.zip' -o -name '*.bz2' \) | xargs stat -c "%Y %s %N" | sort -n | cut -f 2,3 -d " ")

$ echo "${sizes[@]@A}"
declare -a sizes=([0]="'test.zip' 0" [1]="'test.bz2' 0")

$