Both of these solutions work, and were tested by copy-pasting from this post.
The first is fairly slow. One problem is external program invocations within a loop - `date`, for example, is invoked for every file. You could make it quicker by not including the date in the output array (see Notes below). Particularly for method 2, that would result in no external command invocations inside the `while` loop. But method 1 is really the problem - orders of magnitude slower.
Also, somebody probably knows how to convert an epoch date to another format in `awk`, for example, which could be faster. Maybe you could do the sort in `awk` too. Perhaps just keep the epoch date?
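For instance, GNU awk has a built-in `strftime()`, which would format the epoch without invoking `date` at all. A rough sketch only - I haven't benchmarked it; it assumes gawk, truncates the fractional seconds that `%T@` produces, and uses newline delimiters for simplicity:

# gawk's strftime() replaces the per-file date call; sorting first
# means the sort can still use the raw epoch in field 3
find "$TARGET" -type f -printf '%p %s %T@\n' |
    sort -n -k 3 |
    awk '{ $3 = strftime("%Y-%m-%d %H:%M:%S", int($3)); print }'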
These solutions are bash/GNU heavy and not portable to other environments (bash process substitution, `find -printf`). OP tagged `linux` and `bash` though, so GNU can be assumed.
Solution 1 - capture any compressed file - using `file` to match (slow)
- The criterion for 'compressed' is whether the `file` output contains the word `compress`
- Reliable enough, but perhaps there is a conflict with some other file type description?
- `file -l | grep compress` (file 5.38, Ubuntu 20.04, WSL) indicates for me there are no conflicts at all (all files listed are compression formats)
- I couldn't find a way of classifying any compressed file other than this
- I ran this on a directory containing 1664 files - time (real) was 40 seconds
#!/bin/bash

# Capture all files, recursively, in $TARGET, that are
# compressed files, in an indexed array - using `file`
# output to match.

# Initialise variables, and check the target is valid
declare -g c=0 compressed_files=() path= TARGET=$1
[[ -r "$TARGET" ]] || exit 1

# Make the array. Process substitution (< <(...)) must be used,
# rather than a pipe, to keep the array in the global environment.
# Records are NUL-delimited throughout (gawk RS/ORS, sort -z), and
# the parameter expansions assume no whitespace in file names:
#   ${path%% *}  ->  path only
#   ${path% *}   ->  path and size
#   ${path##* }  ->  epoch timestamp only
while IFS= read -r -d '' path; do
    [[ "$(file --brief "${path%% *}")" == *compress* ]] &&
        compressed_files[c++]="${path% *} $(date -d @${path##* })"
done < <(
    find "$TARGET" -type f -printf '%p %s %T@\0' |
    awk 'BEGIN { RS = ORS = "\0" } { $2 = ($2 / 1024); print }' |
    sort -z -n -k 3
)

# Print results - to test
printf '%s\n' "${compressed_files[@]}"
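As an aside, this is why the `done < <(...)` construction matters: with a plain pipe, the `while` loop runs in a subshell, and the array vanishes when that subshell exits. A quick demonstration:

$ unset arr; printf 'a\nb\n' | while read -r x; do arr+=("$x"); done; echo "${#arr[@]}"
0
$ unset arr; while read -r x; do arr+=("$x"); done < <(printf 'a\nb\n'); echo "${#arr[@]}"
2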
Solution 2 - use file extensions - orders of magnitude faster
- If you know exactly which extensions you are looking for, you can compose them in a `find` command
- This is a lot faster
- On the same directory as above, containing 1664 files - time (real) was 200 milliseconds
- This example looks for `.gz`, `.zip`, and `.7z` (gzip, zip and 7zip respectively)
- I'm not sure if `-type f -and -regex '.*[.]\(gz\|zip\|7z\)$' -and -printf ...` may be faster again, now I think of it. I started with globs because I assumed that was quicker. That may also allow for storing the extension list in a variable - see the sketch after this list
- This method avoids a `file` analysis on every file in your target
- It also makes the `while` loop shorter - you're only iterating over matches
- Note the repetition of `-printf` here; this is due to the logic `find` uses: `-printf` is 'True'. If it were included by itself, it would act as a 'match' and print all files. It has to be used as the result of a name match being true (using `-and`)
- Perhaps somebody has a better composition?
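For reference, the single-`-regex` composition mentioned above might look like this. Only a sketch - I haven't benchmarked it; it relies on GNU find's default emacs regex syntax (where `\(...\)` and `\|` do grouping and alternation), and `exts` is just an illustrative variable name:

# Extension list in a variable, matched with one -regex instead of three -name tests
exts='gz\|zip\|7z'
find "$TARGET" -type f -regex ".*[.]\($exts\)$" -printf '%p %s %T@\0'

Note that `-regex` matches against the whole path, hence the leading `.*`.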
#!/bin/bash

# Capture all files, recursively, in $TARGET, that are
# compressed files, in an indexed array - using file name
# extensions to match.

# Initialise variables, and check the target is valid
declare -g c=0 compressed_files=() path= TARGET=$1
[[ -r "$TARGET" ]] || exit 1

# As above: process substitution keeps the array global, and
# records are NUL-delimited throughout (gawk RS/ORS, sort -z)
while IFS= read -r -d '' path; do
    compressed_files[c++]="${path% *} $(date -d @${path##* })"
done < <(
    find "$TARGET" \
        -type f -and -name '*.gz'  -and -printf '%p %s %T@\0' -or \
        -type f -and -name '*.zip' -and -printf '%p %s %T@\0' -or \
        -type f -and -name '*.7z'  -and -printf '%p %s %T@\0' |
    awk 'BEGIN { RS = ORS = "\0" } { $2 = ($2 / 1024); print }' |
    sort -z -n -k 3
)

# Print results - for testing
printf '%s\n' "${compressed_files[@]}"
Sample output (of either method):
$ comp-find.bash /tmp
/tmp/comptest/websters_english_dictionary.tmp.tar.gz 265.148 Thu Sep 10 07:53:37 AEST 2020
/tmp/comptest/What_is_Systems_Architecture_PART_1.tar.gz 1357.06 Thu Sep 10 08:17:47 AEST 2020
Notes:
- You can add a literal `K` in the `awk` expression to indicate the block size / units (kilobytes), e.g. `$2 = ($2 / 1024) "K"`
- If you want to print the path only from this array, you can use suffix removal: `printf '%s\n' "${compressed_files[@]%% *}"`
- For no date in the array (it's used to sort, but then its job may be done), simply remove `$(date -d @${path##* })` (incl. the space) - a combined sketch follows these notes
- Kind of tangential, but to use different date formats, replace `$(date -d @${path##* })` with:
  - `$(date -I -d @${path##* })` - ISO format - note that the short opts style `date -Id @[date]` did not work for me
  - `$(date -d @${path##* } +%Y-%m-%d_%H-%M-%S)` - like ISO, but w/ seconds
  - `$(date -d @${path##* } +%Y-%m-%d_%H-%M-%S.%N)` - same again, but w/ nanoseconds (`find` gives you nanoseconds)
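Combining the last two notes - if you drop the `date` call and keep the raw epoch, the loop body needs no external commands at all, and the paths can still be printed cleanly afterwards. A sketch of just the changed lines:

# Inner loop: store the record as-is (path, size, epoch) - no subshells
compressed_files[c++]=$path

# Later, paths only:
printf '%s\n' "${compressed_files[@]%% *}"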
Sorry for the long post, hopefully it's informative.