11

I am running this command --

sudo -u hdfs hadoop fs -du -h /user | sort -nr 

and the output is not sorted correctly by size, because the units (GB, TB, etc.) are ignored.

I found this command -

hdfs dfs -du -s /foo/bar/*tobedeleted | sort -r -k 1 -g | awk '{ suffix="KMGT"; for(i=0; $1>1024 && i < length(suffix); i++) $1/=1024; print int($1) substr(suffix, i, 1), $3; }' 

but it did not seem to work.

Is there a way or a command-line flag I can use to make it sort so that the output looks like this --

123T  /xyz
124T  /xyd
126T  /vat
127G  /ayf
123G  /atd

Please help

regards Mayur

Mayur Narang

5 Answers

10

hdfs dfs -du -h <PATH> | sed 's/ //' | sort -hr

sed will strip out the space between the number and the unit, after which sort will be able to understand it.
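As a sketch of the end result, with made-up sizes and paths (note that newer Hadoop releases also print a second, replicated-size column, which this pipeline simply leaves untouched):

hdfs dfs -du -h /user | sed 's/ //' | sort -hr
126.4T  /user/vat
127.2G  /user/ayf
45.3M  /user/tmp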

Neil
  • It works as expected. With just "sort -hr" (without the sed), the order ignores the unit and only compares the numbers of the human-readable sizes, e.g. 45 MB, 50 GB, 75 MB, 80 GB. Thanks – m hanif f Mar 15 '21 at 10:13
7
hdfs dfs -du -h <PATH> | awk '{print $1$2,$3}' | sort -hr

Short explanation:

  • The hdfs command gets the input data.
  • The awk glues the first two fields together (the size and its unit) and prints the third field (the path); the comma in print just inserts a space between them (see the sample after this list).
  • The -h of sort compares human readable numbers like 2K or 4G, while the -r reverses the sort order.
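You can see what the awk step does by feeding it a single sample line (the line is made up for illustration):

echo '1.2 G  /user/foo' | awk '{print $1$2,$3}'
1.2G /user/foo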
B--rian
Li Su
  • Thanks for your contribution. Could you be so kind as to elaborate on what is happening and relate it to the question a bit, rather than giving a pure code-snippet answer? – B--rian Aug 05 '19 at 19:53
3

This is a rather old question, but I stumbled across it while trying to do the same thing. Because you were providing the -h (human-readable) flag, the sizes were being converted to different units to make them easier for a human to read. By leaving that flag off, we get the aggregate summary of file lengths in plain bytes.

sudo -u hdfs hadoop fs -du -s '/*' | sort -nr

It is not as easy to read, but it means you can sort it correctly.

See https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/FileSystemShell.html#du for more details.
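If you still want human-readable units in the final listing, one option (a sketch, assuming GNU coreutils' numfmt is available on the machine where you run the command) is to sort on the raw byte counts and convert the first column back afterwards:

sudo -u hdfs hadoop fs -du -s '/*' | sort -nr | numfmt --to=iec --field=1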

0

I would use a small script. It's primitive but reliable:

#!/bin/bash
# Sort "hdfs dfs -du -h" output one unit group at a time:
# plain bytes first, then the K, M, G and T entries, each group sorted numerically.
PATH_TO_FOLDER="$1"
hdfs dfs -du -h "$PATH_TO_FOLDER" > /tmp/output
# Sizes with no unit letter (plain bytes): the second field is numeric.
awk '$2 ~ /^[0-9]+$/ {print $1,$NF}' /tmp/output | sort -k1,1n
# Sizes with a unit letter in the second field.
awk '$2 == "K" {print $1,$2,$NF}' /tmp/output | sort -k1,1n
awk '$2 == "M" {print $1,$2,$NF}' /tmp/output | sort -k1,1n
awk '$2 == "G" {print $1,$2,$NF}' /tmp/output | sort -k1,1n
awk '$2 == "T" {print $1,$2,$NF}' /tmp/output | sort -k1,1n
rm /tmp/output
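A minimal usage sketch, assuming the script is saved as du_sorted.sh (the name is just an example):

chmod +x du_sorted.sh
./du_sorted.sh /user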
tamm
-1

Try this to sort: hdfs dfs -ls -h /path | sort -r -n -k 5

-rw-r--r-- 3 admin admin 108.5 M 2016-05-05 17:23 /user/admin/2008.csv.bz2
-rw-r--r-- 3 admin admin 3.1 M 2016-05-17 16:19 /user/admin/warand_peace.txt
Found 11 items
drwxr-xr-x - admin admin 0 2016-05-16 17:34 /user/admin/oozie-oozi
drwxr-xr-x - admin admin 0 2016-05-16 16:35 /user/admin/Jars
drwxr-xr-x - admin admin 0 2016-05-12 05:30 /user/admin/.Trash
drwxrwxrwx - admin admin 0 2016-05-16 11:21 /user/admin/2015_11_21
drwxrwxrwx - admin admin 0 2016-05-16 11:21 /user/admin/2015_11_20
drwxrwxrwx - admin admin 0 2016-05-16 11:21 /user/admin/2015_11_19
drwxrwxrwx - admin admin 0 2016-05-16 11:21 /user/admin/2015_11_18
drwx------ - admin admin 0 2016-05-16 17:38 /user/admin/.staging

BruceWayne
  • It did not work :( I am trying to get the disk usage. But I fixed it with sort -r -k 1 -g | awk '{ suffix="KMGT"; for(i=0; $1>1024 && i < length(suffix); i++) $1/=1024; print int($1) substr(suffix, i, 1), $3; }' – Mayur Narang Jul 12 '16 at 17:55