When accessing HDF5 via pandas, I sometimes hit the documented bug that no more than 31 select conditions can be used in a single query. To work around this, I decided to split the conditions into chunks, create a batch of iterators, and concatenate the results at the end.

Here's my approach:

generators = []
for idChunk in grouper(conditionList, args.chunksizeGrouper, None):
    # grouper creates chunks of length args.chunksizeGrouper,
    # padding shorter chunks with None
    where = 'myId in [{}]'.format(
        ', '.join([str(id) for id in idChunk if id is not None]))
    iteratorGenerator = lambda: hdfStore.select(
        'table',
        where=where,
        iterator=True,
        chunksize=args.chunksize
        )
    generators.append(iteratorGenerator)
# finally
doSomethingWithGenerators(generators, df, args)

That is, I want to create a list of generator factories: calling generators[i]() should return the select iterator corresponding to the i-th where condition. I then store all of these factories in a list.

For completeness, here's `grouper()`:

from itertools import zip_longest

def grouper(iterable, n, fillvalue=None):
    '''
    Collects an iterable into fixed-length chunks;
    the last chunk is padded with fillvalue if needed.
    '''
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)
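
For example, with a chunk size of 2, the last chunk is padded with the default fill value:

>>> list(grouper([1, 2, 3, 4, 5], 2))
[(1, 2), (3, 4), (5, None)]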

One of the problems that arises is that my definition of where affects the lambdas retroactively: once I am in the second round of the loop, the newly defined where (based on the second idChunk) is the one used by the lambda stored in generators[0].
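
As far as I can tell, this is Python's usual late-binding behavior for closures: the lambda looks up where when it is called, not when it is defined. A minimal example of the same effect, with no HDF5 involved:

funcs = []
for i in range(3):
    funcs.append(lambda: i)  # i is looked up at call time, not capture time

print([f() for f in funcs])  # prints [2, 2, 2], not [0, 1, 2]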

How can I circumvent this issue? Is there something else I need to be aware of with this structure, or is there a better way of dealing with this?
