
How to do this on Google App Engine (Python):

SELECT COUNT(DISTINCT user) FROM event WHERE event_type = "PAGEVIEW" 
AND t >= start_time AND t <= end_time

Long version:

I have a Python Google App Engine application with users that generate events, such as pageviews. I would like to know in a given timespan how many unique users generated a pageview event. The timespan I am most interested in is one week, and there are about a million such events in a given week. I want to run this in a cron job.

My event entities look like this:

from google.appengine.ext import db

class Event(db.Model):
    t = db.DateTimeProperty(auto_now_add=True)
    user = db.StringProperty(required=True)
    event_type = db.StringProperty(required=True)

With an SQL database, I would do something like

SELECT COUNT(DISTINCT user) FROM event WHERE event_type = "PAGEVIEW" 
AND t >= start_time AND t <= end_time

The first thought that occurs is to get all PAGEVIEW events and filter out duplicate users. Something like:

query = Event.all()
query.filter("event_type =", "PAGEVIEW")
query.filter("t >=", start_time)
query.filter("t <=", end_time)
usernames = []
for event in query:
    usernames.append(event.user)
answer = len(set(usernames))

But this won't work, because a single query will only return up to 1000 results. The next thing that occurs to me is to fetch 1000 events, then when those run out fetch the next thousand, and so on. But that won't work either, because going through a thousand queries and retrieving a million entities would take over 30 seconds, which is the request time limit.
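For concreteness, that batched version would look roughly like this (a sketch only, assuming query cursor support in the SDK; untested). It still has to touch every entity, which is why it cannot fit inside one 30-second request:

def count_distinct_users(start_time, end_time):
    usernames = set()
    query = Event.all()
    query.filter("event_type =", "PAGEVIEW")
    query.filter("t >=", start_time)
    query.filter("t <=", end_time)
    cursor = None
    while True:
        if cursor:
            query.with_cursor(cursor)  # resume where the last batch stopped
        batch = query.fetch(1000)
        if not batch:
            break
        usernames.update(event.user for event in batch)
        cursor = query.cursor()  # remember the position for the next batch
    return len(usernames)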

Then I thought I could ORDER BY user to skip over duplicates faster. But that is not allowed, because I am already using the inequality filter "t >= start_time AND t <= end_time", and the datastore requires the first sort order to be on the property used in the inequality.

It seems clear this cannot be accomplished in under 30 seconds, so it needs to be fragmented. But finding distinct items doesn't seem to split well into subtasks. The best I can think of is, on every cron job call, to fetch 1000 pageview events, get the distinct usernames from those, and put them in an entity like Chard (a sketch of that cron task follows the merging code below). It could look something like

class Chard(db.Model):
    usernames = db.StringListProperty(required=True)

So each Chard would have up to 1000 usernames in it, fewer if there were duplicates that got removed. After about 16 hours (which is fine) I would have all the chards and could do something like:

chards = Chard.all()
all_usernames = set()
for chard in chards:
    all_usernames = all_usernames.union(chard.usernames)
answer = len(all_usernames)
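The cron-side task that builds each Chard could be sketched roughly like this (untested; it assumes a query cursor saved in memcache between runs, and the key name is made up):

from google.appengine.api import memcache

def build_next_chard(start_time, end_time):
    # One cron invocation: turn the next 1000 pageview events into a Chard.
    query = Event.all()
    query.filter("event_type =", "PAGEVIEW")
    query.filter("t >=", start_time)
    query.filter("t <=", end_time)

    cursor = memcache.get("chard_cursor")  # hypothetical key for resuming
    if cursor:
        query.with_cursor(cursor)

    events = query.fetch(1000)
    if not events:
        return  # all events in the timespan have been processed

    usernames = list(set(event.user for event in events))
    Chard(usernames=usernames).put()
    memcache.set("chard_cursor", query.cursor())

Since memcache can evict the cursor at any time, storing it in a small datastore entity would be safer, but this shows the shape of the task.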

It seems like it might work, but it's hardly a beautiful solution. And with enough unique users this loop might take too long. I haven't tested it in the hope that someone will come up with a better suggestion, so I don't know if this loop would turn out to be fast enough.

Is there any prettier solution to my problem?

Of course all of this unique user counting could be accomplished easily with Google Analytics, but I am constructing a dashboard of application specific metrics, and intend this to be the first of many stats.

Bemmu
  • Is it important to be able to have rolling weeks, that is to say, the past seven days from any given date, or would you be OK with counting them in calendar weeks? – Adam Crossland Jan 29 '10 at 14:29
  • Have you looked into getting your data out of Analytics using the Analytics Feed API? http://code.google.com/apis/analytics/docs/gdata/gdataReferenceDimensionsMetrics.html – Adam Crossland Jan 29 '10 at 14:33
  • I guess you are wanting to suggest adding a field like "calendar_week" so I could do SELECT stuff WHERE calendar_week = some_week? That would be one solution, perhaps I could have that in addition to the timestamp, so I could still do any timespan I want later. Still, I'm curious if there is any other more flexible solution. On the second point, I don't really want to get my data out of analytics, because there are other tasks I need to do that Analytics doesn't support which need info I have in the data store. Need set of unique users for histograms of behavior etc. – Bemmu Jan 29 '10 at 14:44
  • With the first question, I was thinking that you could create a sharded counter that would record activity by weeks. You would decide whether or not to increment the counter by keeping a 'last_activity' date field for each User. When the user had activity, if their last activity was more than a week ago, you could increment the sharded counter for the current week. – Adam Crossland Jan 29 '10 at 14:47
  • With the Google Analytics Feed suggestion, I am thinking that you could use the feed to integrate the data into your dashboard without ever visiting Analytics directly, and it could live side-by-side with data gathered and managed in your app itself. I make this suggestion mainly because your options for handling this type of task with only AppEngine resources are going to be extremely limited and extremely difficult to implement. – Adam Crossland Jan 29 '10 at 14:50

4 Answers


As of SDK v1.7.4, there is now experimental support for the DISTINCT function.

See: https://developers.google.com/appengine/docs/python/datastore/gqlreference
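If that works for this data, the query from the question might be written roughly as follows (a sketch; the mix of an equality filter, an inequality filter and a projection may still run into datastore restrictions, so treat it as untested):

query = db.GqlQuery(
    "SELECT DISTINCT user FROM Event "
    "WHERE event_type = 'PAGEVIEW' AND t >= :1 AND t <= :2",
    start_time, end_time)
answer = query.count()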

matt burns

NDB still does not support DISTINCT. I have written a small utility method to be able to use distinct with GAE.

See here: http://verysimplescripts.blogspot.jp/2013/01/getting-distinct-properties-with-ndb.html

specialscope

Here is a possibly-workable solution. It relies to an extent on using memcache, so there is always the possibility that your data would get evicted in an unpredictable fashion. Caveat emptor.

You would have a memcache variable called unique_visits_today or something similar. Every time that a user had their first pageview of the day, you would use the .incr() function to increment that counter.

Determining that this is the user's first visit of the day is accomplished by looking at a last_activity_day field attached to the user. When the user visits, you look at that field, and if it is before today, you update it to today and increment your memcache counter.
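A rough sketch of that check (untested; the field and key names are just for illustration):

import datetime
from google.appengine.api import memcache

def record_pageview(user_record):
    # user_record is assumed to have a last_activity_day db.DateProperty
    today = datetime.date.today()
    if user_record.last_activity_day != today:
        user_record.last_activity_day = today
        user_record.put()
        # initial_value makes the first increment of the day start from 0
        memcache.incr("unique_visits_today", initial_value=0)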

At midnight each day, a cron job would take the current value in the memcache counter and write it to the datastore while setting the counter to zero. You would have a model like this:

class UniqueVisitsRecord(db.Model):
    # be careful setting date correctly if processing at midnight
    activity_date = db.DateProperty()
    event_count = db.IntegerProperty()

You could then simply and quickly get all of the UniqueVisitsRecord entities that match any date range and add up the numbers in their event_count fields.
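For example, the weekly roll-up then reduces to something like this (a sketch):

def unique_visits_between(start_date, end_date):
    query = UniqueVisitsRecord.all()
    query.filter("activity_date >=", start_date)
    query.filter("activity_date <=", end_date)
    return sum(record.event_count for record in query)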

Adam Crossland
    This relies on your value staying in memcache for an entire day. memcache is a cache, not reliable storage; this is only a good answer if you're happy to lose your count all the time. – Wooble Mar 30 '10 at 12:45

Google App Engine, and more particularly GQL, does not support a DISTINCT function.

But you can use Python's built-in set type, as described in this blog and in this SO question.
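In this case that amounts to something along these lines (a sketch, and as the question notes, it only helps while the result set is small enough to iterate in one request):

pageviews = Event.all().filter("event_type =", "PAGEVIEW")
distinct_users = set(event.user for event in pageviews)
answer = len(distinct_users)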

Tom van Enckevort
  • Thanks. I was aware of that SO question and the blogpost, but they do not apply to this situation due to the size of the task. – Bemmu Jan 29 '10 at 14:40