When accessing HDF5 via pandas, I sometimes face the documented bug that one cannot use more than 31 select conditions. To circumvent this, I decided to split the select conditions into chunks, create a batch of iterators, and then concatenate the results at the end.
Here's my approach:
    generators = []
    for idChunk in grouper(conditionList, args.chunksizeGrouper, None):
        # grouper yields chunks of length args.chunksizeGrouper,
        # padding shorter chunks with None
        where = 'myId in [{}]'.format(
            ', '.join(str(id) for id in idChunk if id is not None))
        iteratorGenerator = lambda: hdfStore.select(
            'table',
            where=where,
            iterator=True,
            chunksize=args.chunksize
        )
        generators.append(iteratorGenerator)

    # finally
    doSomethingWithGenerators(generators, df, args)
That is, I want to create a list of generators: calling one of them will return the select iterator corresponding to its where condition. I then store all of these in a list.
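doSomethingWithGenerators is a placeholder; the idea is simply to drain every iterator and concatenate the chunks, roughly like this sketch (assuming pandas is imported as pd):

    import pandas as pd

    def doSomethingWithGenerators(generators, df, args):
        # placeholder sketch: pull every chunk from every select iterator
        # and concatenate them into one frame
        chunks = [chunk for gen in generators for chunk in gen()]
        return pd.concat(chunks)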
For completeness, here's `grouper()`:
    from itertools import zip_longest

    def grouper(iterable, n, fillvalue=None):
        '''
        Collects data from an iterable into fixed-length chunks;
        the last chunk is padded with fillvalue.
        '''
        args = [iter(iterable)] * n
        return zip_longest(*args, fillvalue=fillvalue)
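For example, with a chunk size of 2:

    list(grouper([1, 2, 3, 4, 5], 2))
    # -> [(1, 2), (3, 4), (5, None)]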
One of the problems that arises is that my definition of where rewrites the lambda retroactively: by the time the loop is in its second round, the newly defined where (based on the second idChunk) replaces the one inside the lambda stored in generators[0].
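A minimal example of the behavior I mean, independent of HDF5 (names made up):

    makers = []
    for where in ['a', 'b']:
        makers.append(lambda: where)  # captures the variable, not its value

    print([m() for m in makers])
    # prints ['b', 'b'] -- both lambdas see the final value of where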
How can I circumvent this issue? Is there something else I need to be aware of with this structure, or is there a better way of dealing with this?