I'm using the Google Cloud Java API to get objects out of Google Cloud Storage (GCS). The code looks something like this:

Storage storage = ...
List<StorageObject> storageObjects = storage.objects().list(bucket).execute().getItems();

However, this does not return all items (storage objects) in the GCS bucket; it only returns the first "page" of up to 1000 items. To get the next 1000 items, one has to do something like this:

Objects objects = storage.objects().list(bucket).execute();
String nextPageToken = objects.getNextPageToken();
List<StorageObject> itemsInFirstPage = objects.getItems();

if (nextPageToken != null) {
    // recurse
}

What I want to do is to find an item that matches a Predicate while traversing all items in the GCS bucket until the predicate is matched. To make this efficient I'd like to only load the items in the next page when the item wasn't found in the current page. For a single page this works:

Predicate<StorageObject> matchesItem = ...
takeWhile(storage.objects().list(bucket).execute().getItems().stream(), not(matchesItem));

Where takeWhile is copied from here.

And this will load the storage objects from all pages recursively:

private Stream<StorageObject> listGcsPageItems(String bucket, String pageToken) {
    if (pageToken == null) {
        return Stream.empty();
    }

    Storage.Objects.List list = storage.objects().list(bucket);
    if (!pageToken.equals(FIRST_PAGE)) {
        list.setPageToken(pageToken);
    }
    Objects objects = list.execute();
    String nextPageToken = objects.getNextPageToken();
    List<StorageObject> items = objects.getItems();
    return Stream.concat(items.stream(), listGcsPageItems(bucket, nextPageToken));    
}

where FIRST_PAGE is just a "magic" String that instructs the method not to set a specific page (which will result in the first page items).

The problem with this approach is that it's eager: the recursive call is evaluated as an argument before Stream.concat runs, so all items from all pages are loaded before the "matching predicate" is applied. I'd like this to be lazy (one page at a time). How can I achieve this?

Johan

1 Answer

I would implement a custom Iterator<StorageObject> or Supplier<StorageObject> that keeps the current page's list and the next page token as internal state and produces StorageObjects one at a time.

Then I would use the following code to find the first match:

Optional<StorageObject> result =
    Stream.generate(new StorageObjectSupplier(...))
        .filter(predicate)
        .findFirst();

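The supplier is invoked only until a match is found, i.e. lazily. Note that Stream.generate produces an infinite stream, so if no object matches, the pipeline never finishes unless the supplier has some way to signal that the last page is exhausted.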
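Here is a minimal sketch of the Iterator variant, assuming Java 8 or later. Page, PagedIterator, and the fetchPage function are hypothetical names for illustration, not part of the GCS client; fetchPage takes a page token (null for the first page) and returns that page. The sketch wraps the iterator with StreamSupport instead of Stream.generate, so the stream ends once the last page is consumed.

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.function.Function;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

class LazyPaging {

    /** Hypothetical stand-in for one page of results (items plus next-page token). */
    static final class Page<T> {
        final List<T> items;          // may be null for an empty page
        final String nextPageToken;   // null when this is the last page

        Page(List<T> items, String nextPageToken) {
            this.items = items;
            this.nextPageToken = nextPageToken;
        }
    }

    /** Yields items one by one and fetches the next page only when the current one runs out. */
    static final class PagedIterator<T> implements Iterator<T> {
        private final Function<String, Page<T>> fetchPage;
        private Iterator<T> current = Collections.emptyIterator();
        private String nextToken;     // null before the first fetch and after the last page
        private boolean started;

        PagedIterator(Function<String, Page<T>> fetchPage) {
            this.fetchPage = fetchPage;
        }

        @Override
        public boolean hasNext() {
            while (!current.hasNext()) {
                if (started && nextToken == null) {
                    return false;     // last page already consumed
                }
                Page<T> page = fetchPage.apply(nextToken);
                started = true;
                nextToken = page.nextPageToken;
                current = page.items == null
                        ? Collections.<T>emptyIterator()
                        : page.items.iterator();
            }
            return true;
        }

        @Override
        public T next() {
            if (!hasNext()) {
                throw new NoSuchElementException();
            }
            return current.next();
        }
    }

    /** Wraps the iterator in a sequential, ordered stream of unknown size. */
    static <T> Stream<T> stream(Function<String, Page<T>> fetchPage) {
        Iterator<T> it = new PagedIterator<>(fetchPage);
        return StreamSupport.stream(
                Spliterators.spliteratorUnknownSize(it, Spliterator.ORDERED), false);
    }
}
```

For GCS, fetchPage would presumably wrap storage.objects().list(bucket), call setPageToken when the token is non-null, run execute() (handling its IOException), and copy getItems() and getNextPageToken() into a Page. LazyPaging.stream(fetchPage).filter(matchesItem).findFirst() then loads a page only when the previous one held no match.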

Another way is to implement the supplier per page, i.e. class StorageObjectPageSupplier implements Supplier<List<StorageObject>>, and use the stream API to flatten it:

Optional<StorageObject> result =
    Stream.generate(new StorageObjectPageSupplier(...))
        .flatMap(List::stream)
        .filter(predicate)
        .findFirst();
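The per-page idea can be sketched as follows, assuming Java 9 or later. This sketch swaps Stream.generate for the three-argument Stream.iterate, which can stop when there is no further page. Page, PagedStreams, and fetchPage are hypothetical names, and fetchPage is assumed to return the page for a given token (null meaning the first page).

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Stream;

class PagedStreams {

    /** Hypothetical stand-in for one page of results (items plus next-page token). */
    static final class Page<T> {
        final List<T> items;          // may be null for an empty page
        final String nextPageToken;   // null when this is the last page

        Page(List<T> items, String nextPageToken) {
            this.items = items;
            this.nextPageToken = nextPageToken;
        }
    }

    /** Streams the items of all pages; each later page is fetched only when the stream reaches it. */
    static <T> Stream<T> items(Function<String, Page<T>> fetchPage) {
        return Stream.<Page<T>>iterate(
                        fetchPage.apply(null),   // first page is fetched when the stream is built
                        page -> page != null,    // a null page marks the end
                        page -> page.nextPageToken == null
                                ? null
                                : fetchPage.apply(page.nextPageToken))
                .flatMap(page -> page.items == null
                        ? Stream.<T>empty()
                        : page.items.stream());
    }
}
```

Used as PagedStreams.items(fetchPage).filter(predicate).findFirst(), this fetches the first page up front and each following page only if the pages before it produced no match.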
Yaroslav Stavnichiy