
I am communicating with a Google API via batch requests through its google-api-python-client. Batch requests have the following limitations:

  • A batch request cannot contain more than 1000 requests,
  • A batch request cannot contain more than 1 MB of payload.

I have a random number of random-length strings in a list, from which I need to construct a batch request while keeping the aforementioned limitations in mind.

Does anyone know a good way to efficiently build chunks of that original list that can be submitted to the Google API? By 'efficiently' I mean not iterating through all the elements one by one while counting the payload size.

So far, this is what I had in mind: take at most 1000 of the items, build the request, and check the payload size. If it's bigger than 1 MB, take 500 and check the size again. If that payload is still bigger, take the first 250 items; if it's smaller, take 750 items, and so on; you get the logic. This way one could find the right number of elements in far fewer iterations than by building the payload and checking it after each addition.
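
To illustrate, here is a minimal sketch of that halving idea as a plain binary search over the item count; payload_size() is a hypothetical helper (not part of any library) that would build a candidate batch from the given items and return its serialized size:

def max_items_that_fit(items, payload_size, max_items=1000, max_bytes=1000000):
    # Largest n <= max_items such that payload_size(items[:n]) <= max_bytes.
    lo, hi = 0, min(len(items), max_items)
    while lo < hi:
        mid = (lo + hi + 1) // 2  # bias upwards so the loop always terminates
        if payload_size(items[:mid]) <= max_bytes:
            lo = mid  # mid items still fit, search higher
        else:
            hi = mid - 1  # too big, search lower
    return lo

Assuming the payload size grows with the number of items, this needs only O(log n) payload measurements instead of one per item.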

I really don't want to reinvent the wheel, so if anyone knows an efficient builtin/module for that, please let me know.

The body payload size can be calculated by calling _serialize_request once you've added the desired number of requests to the instantiated BatchHttpRequest.
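
As a rough sketch (and an assumption on my part): _serialize_request is a private method of BatchHttpRequest that serializes a single queued request, so summing the serialized sizes gives an estimate of the body, ignoring the MIME multipart boundaries and headers that execute() adds around each part:

from googleapiclient.http import BatchHttpRequest

def estimated_batch_body_size(batch, requests):
    # Approximate body size: the sum of each request serialized on its own.
    # _serialize_request is private API, so this may break between versions.
    return sum(len(batch._serialize_request(request)) for request in requests)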

See also the Python API Client Library documentation on making batch requests.

karolyi
  • How does the request calculate the payload size? – John La Rooy Jun 22 '15 at 11:44
  • You may want to read [ask]. – boardrider Jun 22 '15 at 11:56
  • @JohnLaRooy I've added it to the question. – karolyi Jun 22 '15 at 12:37
  • Why do you think calling `_serialize_request` over and over will be more efficient than iterating once yourself over the same items? – John La Rooy Jun 22 '15 at 13:24
  • I think you don't get the problem. In order to calculate the right number of items to put into the batch the iterative way, you would have to create a request item for the next string in the list, then call `_serialize_request` to see whether the payload is still within the maximum allowed size. As you've probably seen, `_serialize_request` iterates through all the requests in the batch object to compile a MIME-encoded body payload. Iteratively adding one more object and fetching the payload that way is a huge overhead, which I'd like to avoid, hence my question. – karolyi Jun 22 '15 at 13:31

1 Answer


Okay, it seems I've created something that solves this issue. Here's a draft of the idea in Python:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import random
import string
import sys

MAX_LENGTH = 20  # maximum number of items per batch (1000 in the real API)
MAX_SIZE = 11111  # maximum payload size (1 MB in the real API)


def get_random():
    return ''.join([
        random.choice(string.ascii_letters) for i in range(
            random.randrange(10, 1000))])


def get_random_list():
    return [get_random() for i in range(random.randrange(50, 1000))]


def get_str_length(rnd_list, item_idx):
    # Stand-in for measuring the payload size of the first item_idx items.
    return len(''.join(rnd_list[:item_idx]))

rnd_list = get_random_list()


def calculate_ideal_amount(rnd_list):
    # Binary search bounds (item counts) between which the cut-off index lies.
    list_bounds = {
        'first': 1,
        'last': len(rnd_list)
    }
    print ('ORIG_SIZE: %s, ORIG_LEN: %s' % (
        get_str_length(rnd_list, len(rnd_list)), len(rnd_list)))
    # Even the first item alone exceeds the size limit: nothing fits.
    if get_str_length(rnd_list, list_bounds['first']) > MAX_SIZE:
        return 0
    # The whole list already fits both limits: take everything.
    if get_str_length(rnd_list, list_bounds['last']) <= MAX_SIZE and \
            list_bounds['last'] <= MAX_LENGTH:
        return list_bounds['last']
    while True:
        # Halve the interval and measure the payload at its midpoint.
        difference = round((list_bounds['last'] - list_bounds['first']) / 2)
        middle_item_idx = list_bounds['first'] + difference
        str_len = get_str_length(
            rnd_list, middle_item_idx)
        print(
            'MAX_SIZE: %s, list_bounds: %s, '
            'middle_item_idx: %s, diff: %s, str_len: %s,' % (
                MAX_SIZE, list_bounds, middle_item_idx, difference, str_len))
        # sys.stdin.readline()
        if str_len > MAX_SIZE:
            # Midpoint payload is too big: move the upper bound down.
            list_bounds['last'] = middle_item_idx
            continue
        if middle_item_idx > MAX_LENGTH:
            # Payload fits, but the item-count limit is reached first.
            return MAX_LENGTH
        # Midpoint payload fits: move the lower bound up.
        list_bounds['first'] = middle_item_idx
        if difference == 0:
            # Bounds have converged; return the largest index that still fits.
            if get_str_length(rnd_list, list_bounds['last']) <= MAX_SIZE:
                if list_bounds['last'] > MAX_LENGTH:
                    return MAX_LENGTH
                return list_bounds['last']
            return list_bounds['first']

ideal_idx = calculate_ideal_amount(rnd_list)

# Show that taking one more item than ideal_idx would exceed the limit.
print(
    len(rnd_list), get_str_length(rnd_list, len(rnd_list)),
    get_str_length(rnd_list, ideal_idx), ideal_idx,
    get_str_length(rnd_list, ideal_idx + 1))

This code does exactly what I tried to describe: it finds and adjusts the bounds of the list while measuring the size of the concatenated slice, and then returns the index at which the list should be sliced to get the largest payload that still fits. This method avoids the CPU overhead of building and measuring the payload item by item. Running this code will show you the iterations it performs on the list.

get_str_length, the lists and the other functions can be replaced with the corresponding functionality of the API client, but this is the rough idea behind it.
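
For instance, if the serialized size of each individual request is cheap to obtain once up front, get_str_length could be replaced by a lookup into precomputed running totals, and the standard library's bisect module can then find the cut-off directly (a sketch under that assumption; sizes is a hypothetical list of per-request payload sizes):

from bisect import bisect_right
from itertools import accumulate

def ideal_amount_from_sizes(sizes, max_size=MAX_SIZE, max_length=MAX_LENGTH):
    # Running totals are monotonic, so bisect_right finds the largest prefix
    # whose total payload stays within max_size; also cap it at the item limit.
    totals = list(accumulate(sizes))
    return min(bisect_right(totals, max_size), max_length, len(sizes))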

However, the code is not foolproof; the solution should be something along these lines.

karolyi
  • I've never understood why people like to take a language designed for readability and turn it into obfuscated PHP. I hope this code makes sense to you next week. – msw Jun 22 '15 at 17:14