
I am trying to obtain the size of directories in a Google Cloud Storage bucket, but the command is running for a long time.

I have tried it with 8 TB of data containing 24k subdirectories and files; it takes around 20-25 minutes. The same data on HDFS takes less than a minute to report its size.

Commands that I use to get the size:

  1. `hadoop fs -du gs://mybucket`

  2. `gsutil du gs://mybucket`

Please suggest how I can do this faster.

  • Possible duplicate of [Fastest way to get Google Storage bucket size?](https://stackoverflow.com/questions/27374138/fastest-way-to-get-google-storage-bucket-size) – tix Feb 17 '18 at 00:12
  • @tix It does seem like a duplicate; the only addition in my question is the comparison with HDFS. I really want to know why bucket performance is so much lower than HDFS on Dataproc. – Kaustubh Deshpande Feb 18 '18 at 05:26
  • HDFS is a real filesystem. GCS isn't. It has to iterate through every object in a bucket and stat it to calculate the bucket size. The linked answer suggests enabling Access Logs as a way of improving performance. – tix Feb 18 '18 at 20:33
  • @KaustubhDeshpande 1) You should tell us what you're trying to do so we can offer a better solution. Is there a reason you need the actual dataset size? 2) Is `gsutil -m du gs://mybucket` faster? The `-m` flag means multithreaded. 3) Have you tried using a Spark/Hadoop job to find this faster? I bet something like `sc.wholeTextFiles("gs://mybucket").keys.map(filename => fs.get(filename).size).collect().foldLeft(0)(_ + _)` would work (a runnable sketch follows these comments). Aside: GCS has much higher latency than on-cluster HDFS (like 100 ms vs. < 1 ms). It is optimized for large files and high-throughput reads, not metadata operations. – Karthik Palaniappan Feb 20 '18 at 05:17
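The Spark snippet in the last comment is pseudocode: `fs.get(filename).size` does not correspond to a Hadoop `FileSystem` method, and `wholeTextFiles` would read all 8 TB of data rather than just metadata. A minimal runnable sketch of the same idea, assuming the GCS connector is on the cluster classpath, is to recursively list file statuses and sum their lengths:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Sum object sizes under a gs:// path via the Hadoop FileSystem API backed by the
// GCS connector. This performs the same recursive listing as `hadoop fs -du`, so it
// illustrates the mechanism rather than speeding it up.
object BucketSize {
  def main(args: Array[String]): Unit = {
    val root  = new Path("gs://mybucket/")               // bucket from the question
    val fs    = root.getFileSystem(new Configuration())  // resolves to the GCS connector's FileSystem
    val files = fs.listFiles(root, true)                 // recursive, metadata only
    var totalBytes = 0L
    while (files.hasNext) totalBytes += files.next().getLen
    println(s"$root: $totalBytes bytes")
  }
}
```

Because this issues the same list requests as the commands in the question, expect comparable runtimes; it is mainly useful as a building block when you only need sizes for specific prefixes.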

1 Answer


Commands 1 and 2 are nearly identical: command 1 just goes through the GCS Connector, so both end up issuing the same kinds of requests to GCS.

GCS calculates usage by making list requests, which can take a long time if you have a large number of objects.

This article suggests setting up Access Logs as an alternative to `gsutil du`: https://cloud.google.com/storage/docs/working-with-big-data#data
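If you do go the access-logs route: once logging is enabled, Cloud Storage delivers daily storage logs as CSV objects (named like `<bucket>_storage_...`) into the logging bucket you configure, including a `storage_byte_hours` field. A sketch of reading them back with Spark on Dataproc follows; `mylogbucket` is a placeholder for your logging bucket, and the field names are taken from the access-logs documentation, so verify them against your log format:

```scala
import org.apache.spark.sql.SparkSession

// Read the daily storage logs from the logging bucket and turn storage_byte_hours
// (byte-hours accumulated over a 24-hour day) into an approximate size in bytes.
// "mylogbucket" and the object prefix are placeholders for your logging configuration.
object StorageLogSize {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("gcs-storage-log-size").getOrCreate()
    val storageLogs = spark.read
      .option("header", "true")
      .csv("gs://mylogbucket/mybucket_storage_*")
    storageLogs
      .selectExpr("bucket", "storage_byte_hours / 24 AS approx_bytes")
      .show(truncate = false)
    spark.stop()
  }
}
```

Keep in mind the logs are delivered roughly once a day, so this gives you the bucket size as of the last delivery rather than an on-demand figure.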

However, you will likely still incur the same 20-25 minute cost if you intend to do any analytics on the data. From the GCS Best Practices guide:

Forward slashes in objects have no special meaning to Cloud Storage, as there is no native directory support. Because of this, deeply nested directory-like structures using slash delimiters are possible, but won't have the performance of a native filesystem listing deeply nested sub-directories.

Assuming that you intend to analyze this data, you may want to benchmark fetch performance for different file sizes and glob expressions with `time hadoop distcp`, as in the sketch below.
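For the benchmarking itself, shelling out to `hadoop distcp` and timing each run is enough. Below is a sketch; the glob patterns and the `hdfs:///tmp/...` staging targets are placeholders:

```scala
import scala.sys.process._

// Time `hadoop distcp` for a few glob patterns to compare fetch performance of
// different file sizes and layouts. Globs and hdfs:///tmp targets are placeholders.
object DistcpBenchmark {
  def main(args: Array[String]): Unit = {
    val globs = Seq(
      "gs://mybucket/small-files/*",
      "gs://mybucket/large-files/*"
    )
    globs.zipWithIndex.foreach { case (glob, i) =>
      val target = s"hdfs:///tmp/distcp-bench-$i"      // separate target per run
      val start  = System.nanoTime()
      val exit   = Seq("hadoop", "distcp", glob, target).!
      val secs   = (System.nanoTime() - start) / 1e9
      println(f"$glob%-40s exit=$exit%d elapsed=$secs%.1f s")
    }
  }
}
```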

tix