
I am trying to obtain the size of directories in a Google Cloud Storage bucket, but the command is running for a long time.

I have tried it with 8 TB of data containing 24k subdirectories and files; it takes around 20-25 minutes. The same data on HDFS takes less than a minute to report its size.

Commands that I use to get the size:

  1. `hadoop fs -du gs://mybucket`

  2. `gsutil du gs://mybucket`

Please suggest how I can do this faster.

  • Possible duplicate of [Fastest way to get Google Storage bucket size?](https://stackoverflow.com/questions/27374138/fastest-way-to-get-google-storage-bucket-size) – tix Feb 17 '18 at 00:12
  • @tix It does seem like a duplicate; the only addition in my question is the comparison with HDFS. I really want to know why bucket performance is so much lower than HDFS on Dataproc. – Kaustubh Deshpande Feb 18 '18 at 05:26
  • HDFS is a real filesystem. GCS isn't. It has to iterate through every object in a bucket and stat it to calculate the bucket size. The linked answer suggests enabling Access Logs as a way of improving performance. – tix Feb 18 '18 at 20:33
  • @KaustubhDeshpande 1) You should tell us what you're trying to do so we can offer a better solution. Is there a reason you need the actual dataset size? 2) Is `gsutil -m du gs://mybucket` faster? The `-m` flag means multithreaded. 3) Have you tried using a Spark/Hadoop job to find this faster? I bet something like `sc.wholeTextFiles("gs://mybucket").keys.map(filename => fs.get(filename).size).collect().foldLeft(0)(_ + _)` would work (a runnable sketch follows these comments). Aside: GCS has much higher latency than on-cluster HDFS (like 100 ms vs. < 1 ms). It is optimized for large files and high-throughput reads, not metadata operations. – Karthik Palaniappan Feb 20 '18 at 05:17
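The Spark snippet in the last comment is pseudocode: `fs.get(filename).size` does not correspond to a Hadoop `FileSystem` method, and `wholeTextFiles` would read all 8 TB of data rather than just metadata. A minimal runnable sketch of the same idea, assuming the GCS connector is on the cluster classpath, is to recursively list file statuses and sum their lengths:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Sum object sizes under a gs:// path via the Hadoop FileSystem API backed by the
// GCS connector. This performs the same recursive listing as `hadoop fs -du`, so it
// illustrates the mechanism rather than speeding it up.
object BucketSize {
  def main(args: Array[String]): Unit = {
    val root  = new Path("gs://mybucket/")               // bucket from the question
    val fs    = root.getFileSystem(new Configuration())  // resolves to the GCS connector's FileSystem
    val files = fs.listFiles(root, true)                 // recursive, metadata only
    var totalBytes = 0L
    while (files.hasNext) totalBytes += files.next().getLen
    println(s"$root: $totalBytes bytes")
  }
}
```

Because this issues the same list requests as the commands in the question, expect comparable runtimes; it is mainly useful as a building block when you only need sizes for specific prefixes.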

1 Answer


Commands 1 and 2 are nearly identical: command 1 just goes through the GCS Connector, so both end up issuing the same kinds of requests to GCS.

GCS calculates usage by making list requests, which can take a long time if you have a large number of objects.

This article suggests setting up Access Logs as an alternative to `gsutil du`: https://cloud.google.com/storage/docs/working-with-big-data#data
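If you do go the access-logs route: once logging is enabled, Cloud Storage delivers daily storage logs as CSV objects (named like `<bucket>_storage_...`) into the logging bucket you configure, including a `storage_byte_hours` field. A sketch of reading them back with Spark on Dataproc follows; `mylogbucket` is a placeholder for your logging bucket, and the field names are taken from the access-logs documentation, so verify them against your log format:

```scala
import org.apache.spark.sql.SparkSession

// Read the daily storage logs from the logging bucket and turn storage_byte_hours
// (byte-hours accumulated over a 24-hour day) into an approximate size in bytes.
// "mylogbucket" and the object prefix are placeholders for your logging configuration.
object StorageLogSize {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("gcs-storage-log-size").getOrCreate()
    val storageLogs = spark.read
      .option("header", "true")
      .csv("gs://mylogbucket/mybucket_storage_*")
    storageLogs
      .selectExpr("bucket", "storage_byte_hours / 24 AS approx_bytes")
      .show(truncate = false)
    spark.stop()
  }
}
```

Keep in mind the logs are delivered roughly once a day, so this gives you the bucket size as of the last delivery rather than an on-demand figure.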

However, you will likely still incur the same 20-25 minute cost if you intend to do any analytics on the data. From the GCS Best Practices guide:

Forward slashes in objects have no special meaning to Cloud Storage, as there is no native directory support. Because of this, deeply nested directory-like structures using slash delimiters are possible, but won't have the performance of a native filesystem listing deeply nested sub-directories.

Assuming that you intend to analyze this data, you may want to benchmark fetch performance for different file sizes and glob expressions with `time hadoop distcp`, as in the sketch below.
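For the benchmarking itself, shelling out to `hadoop distcp` and timing each run is enough. Below is a sketch; the glob patterns and the `hdfs:///tmp/...` staging targets are placeholders:

```scala
import scala.sys.process._

// Time `hadoop distcp` for a few glob patterns to compare fetch performance of
// different file sizes and layouts. Globs and hdfs:///tmp targets are placeholders.
object DistcpBenchmark {
  def main(args: Array[String]): Unit = {
    val globs = Seq(
      "gs://mybucket/small-files/*",
      "gs://mybucket/large-files/*"
    )
    globs.zipWithIndex.foreach { case (glob, i) =>
      val target = s"hdfs:///tmp/distcp-bench-$i"      // separate target per run
      val start  = System.nanoTime()
      val exit   = Seq("hadoop", "distcp", glob, target).!
      val secs   = (System.nanoTime() - start) / 1e9
      println(f"$glob%-40s exit=$exit%d elapsed=$secs%.1f s")
    }
  }
}
```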

tix