
With the following method I'm able to list all files from my Google Drive account:

def listAllFiles(self):
    result = [];
    page_token = None;

    while True:
        try:
            param = {"q" : "trashed=false", "orderBy": "createdTime"};
            if page_token: param['pageToken'] = page_token;
            files = self.service.files().list(**param).execute();

            result.extend(files["files"]);
            page_token = files.get('nextPageToken');
            if not page_token: break;

        except errors.HttpError as error:
            print('An error occurred:', error);
            break; # Exit with empty list

    return result;

For better runtime I would like to return a generator from this method. I'm pretty new to Python, so I don't know how to do this.

The execute method of the files service always returns 100 items, and if it also returns a page_token there are more items to fetch. It would be great if I could iterate over the generator to process the items already fetched while the next items are being fetched from the service in the meantime. I hope you understand what I mean...

Is this possible? How do I have to rewrite this method to get the described functionality?

Cilenco
  • Return `iter(result)`? Or something like that... – OneCricketeer Jul 17 '16 at 17:34
  • What's with all the semicolons? – kindall Jul 17 '16 at 17:35
  • That will still fetch the files all at once, though. If the API does not provide generator functions itself, then there isn't much you can change to fix that – OneCricketeer Jul 17 '16 at 17:36
  • I think you need parallelism for that. Basically you can turn this function into a generator by simply replacing `result.extend(files["files"]);` with `for f in files["files"]: yield f` and removing the `return result`. But what you want is to prefetch the next files while the others are still being processed. `yield` would turn this into a generator which simply suspends at that position until another item is requested by the iterator. – Michael Hoff Jul 17 '16 at 17:36
  • You can rewrite your function to return a [Queue](https://docs.python.org/2/library/queue.html) which gets filled by a worker thread you spawn inside your function. The requesting thread would `get` items until all items are processed. – Michael Hoff Jul 17 '16 at 17:40
  • @MichaelHoff this sounds great. Can you explain this in more detail? – Cilenco Jul 17 '16 at 17:45

3 Answers


You can rewrite your function to act as a generator by simply yielding single files.

Untested:

def listAllFiles(self):
    page_token = None

    while True:
        try:
            param = {"q" : "trashed=false", "orderBy": "createdTime"}
            if page_token:
                param['pageToken'] = page_token
            files = self.service.files().list(**param).execute()

            # call future to load the next bunch of files here!
            for f in files["files"]:
                yield f
            page_token = files.get('nextPageToken')
            if not page_token: break

        except errors.HttpError as error:
            print('An error occurred:', error)
            break

If you do not parallelize further, use chapelo's answer instead. Yielding the whole list of available files allows the generator to continue and thus begin fetching the next list of files concurrently.


Preloading the next bunch with futures

Now, you are still not loading the next bunch of files concurrently. For this, as mentioned in the code above, you could submit a future which already gathers the next list of files concurrently. When your yielded items are consumed (and your function continues to execute), you check your future to see whether the result is already there. If not, you have to wait (as before) until the result arrives.

As I don't have your code available I cannot say whether this code works (or is even syntactically correct), but you can use it as a starting point:

import concurrent.futures

def load_next_page(self, page_token=None):
    param = {"q" : "trashed=false", "orderBy": "createdTime"}
    if page_token:
        param['pageToken'] = page_token

    result = None
    try:
        files = self.service.files().list(**param).execute()
        result = (files.get('nextPageToken'), files["files"])
    except errors.HttpError as error:
        print('An error occurred:', error)
    return result

def listAllFiles(self):
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as executor:

        future = executor.submit(self.load_next_page)

        while future:
            try:
                result = future.result()
                future = None
                if not result:
                    break
                (next_page_token, files) = result            
            except Exception as error:
                print('An error occured:', error)
                break
            if next_page_token:
                future = executor.submit(self.load_next_page, next_page_token)
            # yield from files
            for f in files:
                yield f

Producer/Consumer parallelization with Queues

Another option, as also mentioned in the comments, is to use a Queue. You can modify your function to return a queue which is filled by a thread spawned by your function. This should be faster than only preloading the next list, but it also comes with higher implementation overhead.
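
A rough, untested sketch of that producer/consumer idea, reusing the `load_next_page` helper from above (the queue size and the sentinel object are arbitrary choices of this sketch, not part of the question's code):

import queue
import threading

_SENTINEL = object()  # marks the end of the stream

def listAllFiles(self):
    q = queue.Queue(maxsize=2)  # the worker stays at most two pages ahead

    def worker():
        page_token = None
        while True:
            result = self.load_next_page(page_token)
            if not result:           # an HttpError occurred
                break
            page_token, files = result
            q.put(files)             # blocks while the buffer is full
            if not page_token:       # no more pages
                break
        q.put(_SENTINEL)

    threading.Thread(target=worker, daemon=True).start()

    while True:
        files = q.get()
        if files is _SENTINEL:
            break
        for f in files:
            yield f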

Personally, I would recommend going with the futures approach, if the performance is adequate.

Michael Hoff

If you yield one file at a time, you are blocking the generator. But if you yield the whole list that the generator has prepared, then while you process that list of files the generator will have another list ready for you:

Instead of Michael's suggestion

for f in files["files"]:
    yield f

Try to yield the whole list at once, and process the whole list of files when you receive it:

yield files["files"]

Consider this simple example:

from string import ascii_uppercase as letters, digits
lst_of_lsts = [[l+d for d in digits] for l in letters]

def get_a_list(list_of_lists):
    for lst in list_of_lists:
        yield lst  # the whole list, not each element at a time

gen = get_a_list(lst_of_lsts)

print(next(gen)) # ['A0', 'A1', 'A2', 'A3', 'A4', 'A5', 'A6', 'A7', 'A8', 'A9']

print(next(gen)) # ['B0', 'B1', 'B2', 'B3', 'B4', 'B5', 'B6', 'B7', 'B8', 'B9']

print(next(gen)) # ['C0', 'C1', 'C2', 'C3', 'C4', 'C5', 'C6', 'C7', 'C8', 'C9']

# And so on...
chapelo
  • Good point! This is just slightly less efficient than only preloading the next list (my future starts to preload before the sub-list has been yielded to the consumer), and it is much less complex. Usability is different, as you yield whole file lists; however, I think you could even add an itertools wrapper around that to provide individual items without losing the advantage, as sketched below. – Michael Hoff Jul 17 '16 at 18:25
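
For example, `itertools.chain.from_iterable` could serve as such a wrapper, flattening the yielded lists back into single files without giving up the batch-at-a-time fetching (an untested sketch; `drive` stands for whatever object provides the batched `listAllFiles`, and `"name"` is just an example field):

import itertools

# drive.listAllFiles() yields whole lists of files, as in chapelo's answer
for f in itertools.chain.from_iterable(drive.listAllFiles()):
    print(f["name"])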

You're going to have to change the flow of your script. Instead of returning all the files at once, you're going to need to yield individual files. This will allow you to handle the fetching of results in the background as well.

Edit: The fetching of subsequent results would be transparent to the calling function; it would simply appear to take a bit longer. Essentially, once the current list of files has all been yielded to the calling function, you would fetch the next list and start yielding from that, repeating until there are no more files to list from Google Drive.
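
For illustration, with a generator version of `listAllFiles` such as the one in Michael Hoff's answer, the calling code would simply iterate (`drive` and the `"name"` field are placeholders, not part of the question's code):

# each file dict becomes available as soon as the page it belongs to has been fetched
for f in drive.listAllFiles():
    print(f["name"])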

I highly suggest reading What does the "yield" keyword do in Python? to understand the concept behind generators & the yield statement.

Kyle
  • Can you explain how a function which is on hold (when yielding) can load the next list of files concurrently? – Michael Hoff Jul 17 '16 at 17:43
  • It can't. Would "It'd be able to silently fetch the next list on demand" be a better way of explaining it? – Kyle Jul 17 '16 at 17:47