
I currently have an application running in the Google App Engine Standard Environment, which, among other things, contains a large database of weather data and a frontend endpoint that generates graphs of this data. The database lives in Google Cloud Datastore, and the Python Flask application accesses it via the NDB library.

My issue is as follows: when I try to generate graphs for WeatherData spanning more than about a week (the data is stored at five-minute intervals), my application exceeds GAE's soft private memory limit and crashes. However, each WeatherData entity stores the fields I want to graph alongside a very large JSON string of forecast data that this graphing feature does not need. In other words, the part of each WeatherData entity that pushes the application over the soft private memory limit is not even used here.

My question is thus as follows: is there any way to query only certain properties of an entity, the way you can select specific columns in a SQL-style query? Again, I don't need the entire forecast JSON string for graphing, only a few of the other fields stored in the entity. The other approach I tried was to fetch only a couple of entities at a time, splitting the query into multiple API calls, but it ended up taking so long that the page would time out and I couldn't get it to work properly.

Below is my current implementation, which is what breaks. Any input is much appreciated:

# wData maps each property name to a list of [time, value] pairs
wDataCsv = 'Time,' + ','.join(wData.keys())
qry = WeatherData.time_ordered_query(ndb.Key('Location', loc),
                                     start=start_date, end=end_date)
for acct in qry.fetch():
    d = [acct.time.strftime(date_string)]
    for attr in wData.keys():
        d.append(str(acct.dict_access(attr)))
        wData[attr].append([acct.time.strftime(date_string), acct.dict_access(attr)])
    wDataCsv += '\n' + ','.join(d)

# Child entity - log of weather readings at a parent Location
class WeatherData(ndb.Model):
    # model for data to save
    ...
    # Query data under a given ancestor between two optional times,
    # most recent first; skip either time filter when it is not given
    @classmethod
    def time_ordered_query(cls, ancestor_key, start=None, end=None):
        qry = cls.query(ancestor=ancestor_key)
        if start is not None:
            qry = qry.filter(cls.time >= start)
        if end is not None:
            qry = qry.filter(cls.time <= end)
        return qry.order(-cls.time)

EDIT: I tried the iterative page fetching strategy described in the link from the answer below. My code was updated to the following:

wDataCsv = 'Time,' + ','.join(wData.keys())
qry = WeatherData.time_ordered_query(ndb.Key('Location', loc),
                                     start=start_date, end=end_date)
cursor = None
while True:
    gc.collect()
    # fetch one page of results at a time, resuming from the cursor
    fetched, next_cursor, more = qry.fetch_page(FETCHNUM, start_cursor=cursor)
    if fetched:
        for acct in fetched:
            d = [acct.time.strftime(date_string)]
            for attr in wData.keys():
                d.append(str(acct.dict_access(attr)))
                wData[attr].append([acct.time.strftime(date_string), acct.dict_access(attr)])
            wDataCsv += '\n' + ','.join(d)
    if more and next_cursor:
        cursor = next_cursor
    else:
        break

where FETCHNUM = 500. With this version I still exceed the soft private memory limit for queries of the same length as before, and the query takes much, much longer to run. I suspect the problem is that already-processed entities are never released because something still holds references to them, but even calling gc.collect() on every iteration makes no difference.
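
For reference, legacy NDB keeps every fetched entity in a per-request in-context cache, which would explain why gc.collect() has no visible effect: the entities are still referenced. A minimal sketch of ruling that out (use_cache and use_memcache are NDB context options; whether this is the actual leak here is an assumption) would change the fetch to:

# Sketch: disable NDB's per-request caches so fetched entities can
# actually be garbage-collected once the page has been processed
fetched, next_cursor, more = qry.fetch_page(
    FETCHNUM,
    start_cursor=cursor,
    use_cache=False,     # skip NDB's in-context cache
    use_memcache=False)  # skip the memcache layer as well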

EDIT:

Following the advice below, I fixed the problem using projection queries. Rather than build a separate projection for each custom query, I simply run the same projection every time: all properties of the entity except the JSON string. This is not ideal, since it still pulls unneeded data from the database on every request, but defining a separate projection for each specific query is not scalable because the number of required composite indexes grows combinatorially. For this application, since every property other than that JSON string adds negligible memory, it works!
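
For concreteness, a minimal sketch of that fix, assuming the big string lives in a property named forecast_json (a placeholder name) and using NDB's internal _properties map to build the exclusion list:

# Project every property except the large JSON string.
# 'forecast_json' is a placeholder, and _properties is an internal
# NDB attribute, so treat this as a sketch rather than the exact code.
projection = [prop for name, prop in WeatherData._properties.items()
              if name != 'forecast_json']
qry = WeatherData.time_ordered_query(ndb.Key('Location', loc),
                                     start=start_date, end=end_date)
for acct in qry.fetch(projection=projection):
    # acct carries only the projected values; the JSON string is
    # never loaded into instance memory
    ...

Note that projected properties must be indexed, and this combination of ancestor, inequality filter, sort order, and projection will generally need a matching composite index in index.yaml.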


1 Answer


You can use projection queries to fetch only the properties of interest from each entity. Watch out for the limitations, though. And this still can't scale indefinitely.

You can split your queries across multiple requests (more scalable), but use bigger chunks, not just a couple at a time (you can fetch 500 at a time), and use cursors. Check out the examples in How to delete all the entries from google datastore?

You can bump your instance class to one with more memory (if not done already).
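
For reference, a sketch of what that looks like in app.yaml (class names and memory sizes here are assumptions based on the GAE Standard scaling docs and may have changed):

# app.yaml: F1 (128 MB) is the default for automatic scaling;
# F4 raises the soft private memory limit to 512 MB
instance_class: F4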

You can prepare intermediate results (also in the datastore) from the big entities ahead of time and use these intermediate pre-computed values in the final stage.
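
As an illustration of that idea, a hypothetical sketch: a background job (cron or a task queue) copies just the graphable fields into slim summary entities, so the user-facing graphing request never loads the big JSON string. All names here (GraphPoint, temperature, precompute_points) are made up:

# Hypothetical pre-computation sketch; model and property names are
# placeholders, not from the question.
class GraphPoint(ndb.Model):
    time = ndb.DateTimeProperty()
    temperature = ndb.FloatProperty()

def precompute_points(loc, start, end):
    parent = ndb.Key('Location', loc)
    qry = WeatherData.time_ordered_query(parent, start=start, end=end)
    points = [GraphPoint(parent=parent, time=w.time,
                         temperature=w.temperature)
              for w in qry.iter(use_cache=False)]
    ndb.put_multi(points)  # graphing handlers then query GraphPoint only

The pre-compute step still reads the big entities, but it can run in the background, where a memory spike or timeout is less costly than in a user-facing request.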

Finally, you could try to create and store just portions of the graphs and stitch them together at the end (only if it comes down to that; I'm not sure exactly how it would be done, and I imagine it wouldn't be trivial).

  • Thanks so much! Projection queries definitely answer my question. The only problem is that my webpage uses checkboxes, so the user can decide which properties they want to graph. With 21 checkboxes, that makes 21 choose 1 + 21 choose 2 + ... + 21 choose 21 ways to generate graphs, and projection queries need an index specified for each one. Is there a way to create flexible indexes, or otherwise get around this limitation? Thanks! – awerchniak Feb 24 '18 at 18:46
  • To get around this index problem I actually wrote a quick script to generate all of those combinations, but as expected, searching that unsorted index list to run queries takes too long to be usable here. I also tried the iterative approach, as edited into the original question. – awerchniak Feb 24 '18 at 20:52
  • The properties need to be indexed, but that doesn't necessarily mean you need a distinct composite index for each one; there *might* be ways around it. This might be of interest: https://stackoverflow.com/questions/48388128/increasing-google-cloud-datastore-composite-index-limit/48389683#48389683 – Dan Cornilescu Feb 25 '18 at 06:50