1. Get access to your client object.
Where is the code running?
I am (somewhere) inside the Google Cloud Platform (GCP)
If you are accessing Google Cloud Storage (GCS) from inside GCP, for example from Google Kubernetes Engine (GKE), you should use Workload Identity to configure your GKE service account to act as a GCS service account: https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity
Once you have done this, creating your client is as easy as
import google.cloud.storage as gcs
client = gcs.Client()
Out in the wild
If you are somewhere else (AWS, Azure, your dev machine, or otherwise outside GCP), then you need to choose between creating a service account key that you download (it's a JSON file with a cryptographic PRIVATE KEY in it) and using workload identity federation, as provided by AWS, Azure and "friends".
Let's assume you have decided to download the new GCS service account key file to /secure/gcs.json.
PROJECT_NAME = "MY-GCP-PROJECT"
from google.oauth2.service_account import Credentials
import google.cloud.storage as gcs
client = gcs.Client(
project=PROJECT_NAME,
credentials=Credentials.from_service_account_file("/secure/gcs.json"),
)
2. Make the list-folders request to GCS
In the OP, we are trying to get the folders inside path xyz in bucket abc. Note that paths in GCS, unlike in Linux, do not start with a /; they should, however, finish with one. So we will be looking for folders with the prefix xyz/. That means just those folders, not the folders plus all of their subfolders.
BUCKET_NAME = "abc"
blobs = client.list_blobs(
BUCKET_NAME,
prefix="xyz/", # <- you need the trailing slash
delimiter="/",
max_results=1,
)
Note how we have asked for no more than a single blob. This is not a mistake: the blobs are the files themselves, and we are only interested in folders. Setting max_results to zero doesn't work; see below.
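To build intuition for what the delimiter does, here is a small pure-Python sketch. This is not the real API, just an illustration of how the server collapses object names into prefixes when delimiter="/" is set; the function name and the sample object names are my own invention:

```python
def collapse_to_prefixes(object_names, prefix, delimiter="/"):
    """Illustrate how GCS's delimiter groups object names into prefixes.

    Any object whose name, after the prefix, still contains the
    delimiter is reported once as a truncated prefix rather than
    being returned as a blob.
    """
    prefixes = set()
    for name in object_names:
        if not name.startswith(prefix):
            continue
        rest = name[len(prefix):]
        if delimiter in rest:
            # Keep everything up to and including the first delimiter.
            prefixes.add(prefix + rest.split(delimiter)[0] + delimiter)
    return prefixes

# Hypothetical object names in a bucket:
names = ["xyz/aaa/file1", "xyz/bbb/file2", "xyz/ccc/zzz/file3", "xyz/top.txt"]
print(collapse_to_prefixes(names, "xyz/"))
# → {'xyz/aaa/', 'xyz/bbb/', 'xyz/ccc/'}
```

Note that xyz/top.txt produces no prefix: it sits directly at the xyz/ level, so the real service would return it as a blob, not a prefix.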
3. Force the lazy-loading to...err..load!
Several of the answers up here have looped through every element of the iterator blobs, which could be many millions, but we don't need to do that. That said, if we don't loop through any elements, blobs won't bother making the API request to GCS at all.
next(blobs, ...) # Force blobs to load.
print(blobs.prefixes)
The blobs variable is an iterator with at most one element, but if your folder has no files in it (at its level) then there may be zero elements. If there are zero elements, then next(blobs) will raise a StopIteration.
The second argument, the ellipsis ..., is simply my choice of default return value should there be no next element. I feel this is more readable than, say, None, because it suggests to the reader that something worth noticing is happening here. After all, code that requests a value only to discard it on the same line does have all the hallmarks of a potential bug, so it is good to reassure our reader that this is deliberate.
Finally, suppose we have a tree structure under xyz of aaa, bbb and ccc, and then under ccc we have the subsubfolder zzz. The output will then be

{'xyz/aaa/', 'xyz/bbb/', 'xyz/ccc/'}

Note that, as required in the OP, we do not see the subsubfolder xyz/ccc/zzz.
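Putting the three steps together, here is a sketch of a small helper. The name list_folders is my own, not part of the library; it should work with any client object created as in step 1, since it only relies on the list_blobs call shown above:

```python
def list_folders(client, bucket_name, path):
    """Return the set of immediate sub-folder prefixes under path."""
    # GCS prefixes must end with the delimiter, so add it if missing.
    prefix = path if path.endswith("/") else path + "/"
    blobs = client.list_blobs(
        bucket_name,
        prefix=prefix,
        delimiter="/",
        max_results=1,
    )
    next(blobs, ...)            # force the lazy API request to happen
    return set(blobs.prefixes)

# Usage, assuming `client` was created as in step 1:
# print(list_folders(client, "abc", "xyz"))
```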