I've got the following command, which gives me the size in bytes of a bunch of folders in my Hadoop cluster:

$ hdfs dfs -du -s /foo/bar/*tobedeleted | sort -r -k 1 -g | awk '{print $1, $3}'
31641789771845 /foo/bar/card_dim_h_tobedeleted
22541622495592 /foo/bar/transaction_item_fct_tobedeleted
3174354180367 /foo/bar/card_dim_h_new_tobedeleted
2336463389768 /foo/bar/hshd_loyalty_seg_tobedeleted
1238268384713 /foo/bar/prod_dim_h_tobedeleted
652639933614 /foo/bar/promo_item_fct_tobedeleted
490394392674 /foo/bar/card_dim_c_tobedeleted
365312782231 /foo/bar/ch_contact_offer_alc_fct_tobedeleted
218694228546 /foo/bar/prod_dim_h_new_tobedeleted
197884747070 /foo/bar/card_dim_h_test_tobedeleted
178553987067 /foo/bar/offer_dim_h_tobedeleted
124005189706 /foo/bar/promo_dim_h_tobedeleted
94380212623 /foo/bar/offer_tier_dtl_h_tobedeleted
91109144322 /foo/bar/ch_contact_offer_dlv_fct_tobedeleted
54487330914 /foo/bar/ch_contact_event_dlv_fct_tobedeleted

What I'd like to do is format those numbers with GB/TB suffixes. I know I can use du -h to format them, but once I do that the sort command doesn't work.
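
(If your local sort is GNU coreutils it has a -h/--human-numeric-sort flag that can order human-readable sizes, e.g.

$ du -sh /foo/bar/* | sort -rh

but hdfs seems to print a space between the number and the unit, which sort -h doesn't parse, so I'm not sure it helps here.)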

I know I can do something like this:

$ hdfs dfs -du -s /foo/bar/*tobedeleted | sort -r -k 1 -g | awk '{print $1, $3}' | awk '{total = $1 / 1024 /1024 / 1024 / 1024; print total "TB", $2}'
28.778TB /foo/bar/card_dim_h_tobedeleted
20.5015TB /foo/bar/transaction_item_fct_tobedeleted
2.88706TB /foo/bar/card_dim_h_new_tobedeleted
2.125TB /foo/bar/hshd_loyalty_seg_tobedeleted
1.1262TB /foo/bar/prod_dim_h_tobedeleted
0.593573TB /foo/bar/promo_item_fct_tobedeleted
0.446011TB /foo/bar/card_dim_c_tobedeleted
0.33225TB /foo/bar/ch_contact_offer_alc_fct_tobedeleted
0.198901TB /foo/bar/prod_dim_h_new_tobedeleted
0.179975TB /foo/bar/card_dim_h_test_tobedeleted
0.162394TB /foo/bar/offer_dim_h_tobedeleted
0.112782TB /foo/bar/promo_dim_h_tobedeleted
0.0858383TB /foo/bar/offer_tier_dtl_h_tobedeleted
0.0828633TB /foo/bar/ch_contact_offer_dlv_fct_tobedeleted
0.0495559TB /foo/bar/ch_contact_event_dlv_fct_tobedeleted

but that prints everything as TB, which isn't what I want. I could probably put some clever if...then...else logic into that last awk command to do what I want, but I'm hoping there's a simple formatting option I don't know about.
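
For the record, here's roughly the if...then...else logic I have in mind (untested sketch, picking the largest unit that fits):

$ hdfs dfs -du -s /foo/bar/*tobedeleted | sort -r -k 1 -g | awk '{
      if      ($1 >= 1024^4) printf "%.1fTB %s\n", $1 / 1024^4, $3
      else if ($1 >= 1024^3) printf "%.1fGB %s\n", $1 / 1024^3, $3
      else if ($1 >= 1024^2) printf "%.1fMB %s\n", $1 / 1024^2, $3
      else if ($1 >= 1024)   printf "%.1fKB %s\n", $1 / 1024, $3
      else                   printf "%dB %s\n", $1, $3
  }'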

jamiet
  • If you need to sort it first, then you're probably stuck with updating your `awk` script. If it weren't for the sort, you could use `du -h ...`. Also, you probably don't need two `awk` calls. You can do it with one. – lurker Apr 19 '16 at 15:23
  • Yes I realise it can be done in one awk call. I did it in two partly for clarity and partly cos I'm a noob :) – jamiet Apr 19 '16 at 15:27
  • 1
    Here are some [simple `awk` solutions](https://blog.urfix.com/25-awk-commands-tricks/). See #13. – lurker Apr 19 '16 at 15:28

3 Answers


Perhaps this is what you are looking for:

hdfs dfs -du -s /foo/bar/*tobedeleted | \
    sort -r -k 1 -g | \
    awk '{ suffix=" KMGT"; for(i=1; $1>1024 && i < length(suffix); i++) $1/=1024; print int($1) substr(suffix, i, 1), $3; }'
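
Written out with comments, the awk program is doing this (same logic, just spread over several lines):

awk '{
    suffix = " KMGT"    # position 1 is a space: values that stay in bytes get no suffix
    # keep dividing by 1024 until the value fits or we run out of suffixes
    for (i = 1; $1 > 1024 && i < length(suffix); i++)
        $1 /= 1024
    # i now indexes the suffix character matching the last division
    print int($1) substr(suffix, i, 1), $3
}'
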
  • almost :) This did the job: `hdfs dfs -du -s /foo/bar/* | sort -r -k 1 -g | head -5 | awk '{ suffix="KMGT"; for(i=0; $1>1024 && i < length(suffix); i++) $1/=1024; print int($1) substr(suffix, i, 1), $3; }'` – jamiet Apr 19 '16 at 20:29
  • I posted a slightly modified version of your solution as an answer. I tried to @ mention you but it didn't work, not sure why, don't quite understand @ mentions on SO. Thanks again anyway. – jamiet Apr 19 '16 at 20:40
  • @jamiet, no problem. Thanks for accepting. I discovered there is a bug in my solution if the number is <=1024 – Super-intelligent Shade Apr 20 '16 at 14:18
  • I fixed it now. Please note the space in the suffix variable. – Super-intelligent Shade Apr 20 '16 at 14:27

You can use du with the -h option to display the sizes in a human-readable way:

hdfs dfs -du -s -h /user/vgunnu

Here is more info: https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-common/FileSystemShell.html#du
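
Note that -h on its own doesn't solve the sorting problem. If your local sort is GNU coreutils, one option is to glue each number back onto its unit so that sort -h can read it (a sketch, assuming hdfs prints a space between the two):

hdfs dfs -du -s -h /foo/bar/*tobedeleted | \
    sed -E 's/^([0-9.]+) ([KMGTPE])/\1\2/' | \
    sort -h -r -k 1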

vgunnu

@innocent-bystander figured it out (just had to slightly modify his/her suggested solution):

$ hdfs dfs -du -s /foo/bar/* | sort -r -k 1 -g | head -5 | awk '{ suffix="KMGT"; for(i=0; $1>1024 && i < length(suffix); i++) $1/=1024; print int($1) substr(suffix, i, 1), $3; }' 
28T /foo/bar/card_dim_h_tobedeleted
20T /foo/bar/transaction_item_fct_tobedeleted
2T /foo/bar/card_dim_h_new_tobedeleted
2T /foo/bar/hshd_loyalty_seg_tobedeleted
1T /foo/bar/prod_dim_h_tobedeleted

(taking head as well, just to save some space on this page)

Thank you so much, not only for solving this but also for teaching me stuff I didn't know about awk. Very powerful, isn't it?

jamiet