I have a query like:
query = HistoryLogs.query()
query = query.filter(HistoryLogs.exec_id == exec_id)
iter = query.iter()
for ent in iter:
    # write log to file, nothing memory intensive
I added logging inside the for loop and found that reading 10K rows increases memory usage by about 200MB, reading the next 10K rows adds another 200MB, and so on. Reading 100K rows requires 2GB, which exceeds the highmem memory limit.
I tried clearing the NDB in-context cache in the for loop, after every 10K rows, by adding:
# clear ndb cache in order to reduce memory footprint
context = ndb.get_context()
context.clear_cache()
in the for loop, on every 10K-th iteration, but it resulted in the query timing out with BadRequestError: The requested query has expired. Please restart it with the last cursor to read more results.
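For reference, the cache clearing sits inside the loop roughly like this (the counter name and batch constant are illustrative, not the exact code):

from google.appengine.ext import ndb

BATCH_SIZE = 10000  # clear the cache every 10K entities

query = HistoryLogs.query()
query = query.filter(HistoryLogs.exec_id == exec_id)
for count, ent in enumerate(query.iter(), start=1):
    # write log to file, nothing memory intensive
    if count % BATCH_SIZE == 0:
        # clear ndb in-context cache in order to reduce memory footprint
        ndb.get_context().clear_cache()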
My initial expectation was that by using query.iter() instead of query.fetch() I wouldn't face any memory issues and memory usage would stay roughly constant, but that isn't the case. Is there a way to read the data with an iterator without exceeding the time or memory limits? With the context cache cleared the memory consumption is pretty much constant, but then I run into trouble with the time it takes to retrieve all the rows.
BTW, there are a lot of rows to retrieve, up to 150K. Is it possible to get this done with some simple tweaks, or do I need a more complex solution, e.g. one that uses some parallelization?
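For example, would something along these lines, restarting from the last cursor as the error message suggests, be the way to go? This is only a sketch of what I have in mind; the page size is a guess and I haven't verified it stays within the limits:

from google.appengine.ext import ndb

PAGE_SIZE = 1000  # guessed page size, not tuned

query = HistoryLogs.query()
query = query.filter(HistoryLogs.exec_id == exec_id)
cursor = None
more = True
while more:
    # fetch one page, then continue from the cursor returned with it
    if cursor is None:
        ents, cursor, more = query.fetch_page(PAGE_SIZE)
    else:
        ents, cursor, more = query.fetch_page(PAGE_SIZE, start_cursor=cursor)
    for ent in ents:
        # write log to file, nothing memory intensive
        pass
    # clear the in-context cache between pages to keep memory roughly flat
    ndb.get_context().clear_cache()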