
I tried searching for a solution everywhere but wasn't lucky. Hoping to find a solution quickly here. I have some migrated files in S3, and now there is a requirement to identify the number of folders involved in a given path. Say I have some files as below.

If I run aws s3 ls s3://my-bucket/foo1 --recursive >> file_op.txt

then cat file_op.txt will show something like this:

my-bucket/foo1/foo2/foo3/foo4/foo5/foo6/foo7/file1.txt
my-bucket/foo1/foo2/foo3/foo4/foo5/foo6/foo7/file2.txt
my-bucket/foo1/foo2/foo3/foo4/foo5/foo6/file1.pdf
my-bucket/foo1/foo2/foo3/foo4/foo6/file2.txt
my-bucket/foo1/foo2/foo3/file3.txt
my-bucket/foo1/foo8/file1.txt
my-bucket/foo1/foo9/foo10/file4.csv

I stored the output in a file and counted the number of files with wc -l, but I couldn't work out the number of folders involved in the path.

I need the output as below:

number of files : 7
number of folders : 9

EDIT 1: Corrected the expected number of folders.

(Excluding my-bucket and foo1)

(foo6 is in foo5 and foo4 directories)
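For reference, the two expected counts can be derived directly from the listing in a single awk pass that skips the first two path components (my-bucket and foo1) and the trailing file name, counting each remaining folder name once. This is only a sketch, assuming the listing is saved in file_op.txt as shown above:

```shell
# Recreate the sample listing (as produced by "aws s3 ls ... --recursive").
cat > file_op.txt <<'EOF'
my-bucket/foo1/foo2/foo3/foo4/foo5/foo6/foo7/file1.txt
my-bucket/foo1/foo2/foo3/foo4/foo5/foo6/foo7/file2.txt
my-bucket/foo1/foo2/foo3/foo4/foo5/foo6/file1.pdf
my-bucket/foo1/foo2/foo3/foo4/foo6/file2.txt
my-bucket/foo1/foo2/foo3/file3.txt
my-bucket/foo1/foo8/file1.txt
my-bucket/foo1/foo9/foo10/file4.csv
EOF

# Every line is one file; fields 3..NF-1 are the folder names to count uniquely.
awk -F/ '{ files++; for (i = 3; i < NF; i++) dirs[$i] = 1 }
         END { printf "number of files : %d\nnumber of folders : %d\n", files, length(dirs) }' file_op.txt
```

On the sample data this prints "number of files : 7" and "number of folders : 9", matching the expected output above.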

Below is my code; it fails in calculating the count of directories:

#!/bin/bash
if [[ "$#" -ne 1 ]] ; then
    echo "Usage: $0 \"s3 folder path\" <eg. \"my-bucket/foo1\"> "
    exit 1
else
    start=$SECONDS
    input=$1
    input_code=$(echo $input | awk -F'/' '{print $1 "_" $3}')
    #input_length=$(echo $input | awk -F'/' '{print NF}' )
    s3bucket=$(echo $input | awk -F'/' '{print $1}')
    db_name=$(echo $input | awk -F'/' '{print $3}')
    pathfinder=$(echo $input | awk 'BEGIN{FS=OFS="/"} {first = $1; $1=""; print}'|sed 's#^/##g'|sed 's#$#/#g')
    myn=$(whoami)
    cdt=$(date +%Y%m%d%H%M%S)
    filename=$0_${myn}_${cdt}_${input_code}
    folders=${filename}_folders
    dcountfile=${filename}_dir_cnt
    aws s3 ls s3://${input} --recursive | awk '{print $4}' > $filename
    cat $filename |awk -F"$pathfinder" '{print $2}'| awk 'BEGIN{FS=OFS="/"}{NF--; print}'| sort -n | uniq > $folders
    #grep -oP '(?<="$input_code" ).*'
    fcount=$(wc -l < "$filename")
    awk 'BEGIN{FS="/"}
    {   if (NF > maxNF)
             {
                 for (i = maxNF + 1; i <= NF; i++)
                     count[i] = 1;
                 maxNF = NF;
             }
             for (i = 1; i <= NF; i++)
             {
                 if (col[i] != "" && $i != col[i])
                    count[i]++;
                 col[i] = $i;
             }
         }
         END {
             for (i = 1; i <= maxNF; i++)
                 print count[i];
    }'  $folders > $dcountfile
    dcount=$(cat $dcountfile | xargs | awk '{for(i=t=0;i<NF;) t+=$++i; $0=t}1' )
    printf "Bucket name : \e[1;31m $s3bucket \e[0m\n" | tee -a ${filename}.out
    printf "DB name : \e[1;31m $db_name \e[0m\n" | tee -a ${filename}.out
    printf "Given folder path : \e[1;31m $input \e[0m\n" | tee -a ${filename}.out
    printf "The number of folders in the given directory are\e[1;31m $dcount \e[0m\n" | tee -a ${filename}.out
    printf "The number of files in the given directory are\e[1;31m $fcount \e[0m\n" | tee -a ${filename}.out
    end=$SECONDS
    elapsed=$((end - start))
    printf '\n*** Script completed in %d:%02d:%02d - Elapsed %d:%02d:%02d ***\n' \
           $((end / 3600)) $((end / 60 % 60)) $((end % 60)) \
           $((elapsed / 3600)) $((elapsed / 60 % 60)) $((elapsed % 60)) | tee -a ${filename}.out
    exit 0
fi
CodeDBA

3 Answers


Your question is not clear.

If we count the unique relative folder paths in the list provided, there are 12:

my-bucket/foo1/foo2/foo3/foo4/foo5/foo6/foo7
my-bucket/foo1/foo2/foo3/foo4/foo5/foo6
my-bucket/foo1/foo2/foo3/foo4/foo6
my-bucket/foo1/foo2/foo3/foo4/foo5
my-bucket/foo1/foo2/foo3/foo4
my-bucket/foo1/foo2/foo3
my-bucket/foo1/foo2
my-bucket/foo1/foo8
my-bucket/foo1/foo9/foo10
my-bucket/foo1/foo9
my-bucket/foo1
my-bucket

The awk script to count this is:

BEGIN {FS = "/";} # set the field separator to "/"
{  # for each input line
  cumulativePath = OFS = ""; # reset cumulativePath and OFS (Output Field Separator) to ""
  for (i = 1; i < NF; i++) { # loop over all folders up to the file name
    if (i > 1) OFS = FS; # set OFS to "/" from the second component onward
    cumulativePath = cumulativePath OFS $i;  # append the current field to cumulativePath
    dirs[cumulativePath] = 0; # record cumulativePath in the associative array dirs
  }
}
END {
  print NR " " length(dirs); # print the record count and the number of entries in dirs
}
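Running this against the sample listing shows the record count followed by the unique-path count. A quick check, with the script saved as a hypothetical count_paths.awk and the listing as input.txt:

```shell
# Sample listing from the question.
cat > input.txt <<'EOF'
my-bucket/foo1/foo2/foo3/foo4/foo5/foo6/foo7/file1.txt
my-bucket/foo1/foo2/foo3/foo4/foo5/foo6/foo7/file2.txt
my-bucket/foo1/foo2/foo3/foo4/foo5/foo6/file1.pdf
my-bucket/foo1/foo2/foo3/foo4/foo6/file2.txt
my-bucket/foo1/foo2/foo3/file3.txt
my-bucket/foo1/foo8/file1.txt
my-bucket/foo1/foo9/foo10/file4.csv
EOF

# The script above, building each cumulative path and storing it in an array.
cat > count_paths.awk <<'EOF'
BEGIN {FS = "/"}
{
  cumulativePath = OFS = ""
  for (i = 1; i < NF; i++) {
    if (i > 1) OFS = FS
    cumulativePath = cumulativePath OFS $i
    dirs[cumulativePath] = 0
  }
}
END { print NR " " length(dirs) }
EOF

awk -f count_paths.awk input.txt   # prints "7 12"
```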

If we count unique folder names, there are 11:

my-bucket
foo1
foo2
foo3
foo4
foo5
foo6
foo7
foo8
foo9
foo10

The awk script to count this is:

awk -F'/' '{for(i=1;i<NF;i++)dirs[$i]=1;}END{print NR " " length(dirs)}' input.txt
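On the sample listing this one-liner reports 7 records and 11 unique names (a quick check, with the listing saved as input.txt):

```shell
# Sample listing from the question.
cat > input.txt <<'EOF'
my-bucket/foo1/foo2/foo3/foo4/foo5/foo6/foo7/file1.txt
my-bucket/foo1/foo2/foo3/foo4/foo5/foo6/foo7/file2.txt
my-bucket/foo1/foo2/foo3/foo4/foo5/foo6/file1.pdf
my-bucket/foo1/foo2/foo3/foo4/foo6/file2.txt
my-bucket/foo1/foo2/foo3/file3.txt
my-bucket/foo1/foo8/file1.txt
my-bucket/foo1/foo9/foo10/file4.csv
EOF

# Each field except the last is a folder name; the array keeps them unique.
awk -F'/' '{for(i=1;i<NF;i++)dirs[$i]=1;}END{print NR " " length(dirs)}' input.txt   # prints "7 11"
```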
Dudi Boy

You have clarified that you wanted to count the unique names, ignoring the top two levels (my-bucket and foo1) and the last level (the file name).

perl -F/ -lane'
   ++$f;
   ++$d{ $F[$_] } for 2 .. $#F - 1;
   END {
      print "Number of files: ".( $f // 0 );
      print "Number of dirs: ".( keys(%d) // 0 );
   }
'

Output:

Number of files: 7
Number of dirs: 9

See also: Specifying file to process to Perl one-liner.

ikegami
  • Thanks @ikegami for the response, my bad, I was not clear, but I was trying to exclude bucket name and also, trying to exclude the folders which are given in input. Edited the question now. Also, seems like the code given here not working properly for me as I'm getting output as below. Number of files: 18978 (as expected) Number of dirs: 4436/8192 (expected here is 24464) – CodeDBA Jan 24 '22 at 09:14
  • Adjusted to count the unique names, ignoring the top two levels and the last level. (Such a weird thing to count!) /// You are using an older version of Perl, which requires the use of `keys(%d)` instead of `%d` to get the number of elements in the hash. Adjusted. – ikegami Jan 24 '22 at 15:23
  • Thank you @ikegami, for the clarity but I still think I was not clear on my requirement, however, second answer gave me was really helpful to get the stuff going. The reason for this requirement was to verify the migrated folders count between source and target once migration completed. – CodeDBA Jan 25 '22 at 02:39

If you don't mind using a pipe and calling awk twice, it's rather clean:

 mawk 'BEGIN {OFS=ORS; FS="/"; _^=_} _+_<NF && --NF~($_="")' file \
 | mawk 'NF {_[$__]} END { print length(_) }'
RARE Kpop Manifesto