Options 1 and 2 are nearly identical, in that option 1 uses the GCS Connector.
GCS calculates usage by making list requests, which can take a long time if you have a large number of objects.
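For example, even the summarized form of gsutil du has to enumerate every object before it can report a total (the bucket name below is a placeholder):

    # Bucket name is a placeholder; substitute your own.
    # -s prints only the summary total, but gsutil still has to
    # list every object in the bucket to compute it.
    time gsutil du -s gs://your-bucket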
This article suggests setting up Access Logs as an alternative to gsutil du:
https://cloud.google.com/storage/docs/working-with-big-data#data
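As a rough sketch of that setup (bucket names are placeholders; see the linked docs for the authoritative steps), you create a bucket to receive the logs, grant Cloud Storage's analytics group write access to it, and enable logging on the bucket you want to measure:

    # All bucket names below are placeholders.
    gsutil mb gs://your-log-bucket
    # Allow the Cloud Storage analytics group to write log objects.
    gsutil acl ch -g cloud-storage-analytics@google.com:W gs://your-log-bucket
    # Enable usage/storage logging for the bucket being measured.
    gsutil logging set on -b gs://your-log-bucket gs://your-bucket

Storage logs are then delivered periodically, so the bucket's size can be read from a small log object instead of a full listing.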
However, you will likely still incur the same 20-25 minute cost if you intend to do any analytics on the data. From the GCS Best Practices guide:
Forward slashes in objects have no special meaning to Cloud Storage, as there is no native directory support. Because of this, deeply nested directory-like structures using slash delimiters are possible, but won't have the performance of a native filesystem listing deeply nested sub-directories.
Assuming that you intend to analyze this data, you may want to consider benchmarking fetch performance of different file sizes and glob expressions with time hadoop distcp.
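A minimal benchmark along those lines (paths and glob patterns are placeholders):

    # Paths and glob patterns below are placeholders.
    # Compare a broad glob against a narrower one to see how
    # fetch time scales with the number of matched objects.
    time hadoop distcp "gs://your-bucket/data/2016/*"    hdfs:///tmp/bench-broad
    time hadoop distcp "gs://your-bucket/data/2016/01/*" hdfs:///tmp/bench-narrow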