30

I figure one way to do a count is like this:

foo = db.GqlQuery("SELECT * FROM bar WHERE baz = 'baz'")
my_count = foo.count()

What I don't like is my count will be limited to 1000 max and my query will probably be slow. Anyone out there with a workaround? I have one in mind, but it doesn't feel clean. If only GQL had a real COUNT Function...

barneytron
  • Why do you need to know the count? When I saw your post I upvoted it because I need this functionality too. Then I realized I don't really need it if I present things in a different way. – asterite Jan 07 '09 at 20:07
  • I'm trying to gather user input, say thumbs up/down ratings. But I want to track them with detail like timestamp and others. I could've used an entity with 2 columns: up and down, and just incr accordingly, but then I'd need 2 entities. And gql supports updating only 1 entity in a trans. – barneytron Jan 07 '09 at 20:31
  • I'm hoping that someone can point me to some documentation that I missed on how to do this. Or that someone has the scoop on if this functionality is on the gql roadmap. – barneytron Jan 07 '09 at 20:32

9 Answers

20

You have to flip your thinking when working with a scalable datastore like GAE: do your calculations up front. In this case that means keeping a counter for each baz and incrementing it whenever you add a new bar, instead of counting at display time.

class CategoryCounter(db.Model):
    category = db.StringProperty()
    count = db.IntegerProperty(default=0)

then when creating a Bar object, increment the counter

def createNewBar(category_name):
  bar = Bar(..., baz=category_name)

  counter = CategoryCounter.all().filter('category =', category_name).get()
  if not counter:
    # first bar in this category, so the counter starts at 1
    counter = CategoryCounter(category=category_name, count=1)
  else:
    counter.count += 1
  bar.put()
  counter.put()

db.run_in_transaction(createNewBar,'asdf')

now you have an easy way to get the count for any specific category

CategoryCounter.all().filter('category =', category_name).get().count
Jehiah
  • I see that there are 2 .put() calls on 2 different entities in your trans. If I remember, I got an error telling me I could not update two root entities in the same trans when I tried something similar before. Maybe put both (bar & counter) in the same group? I'll try again when I have more time. – barneytron Jan 08 '09 at 20:45
  • I confirmed the error I got doing the two .put() calls in a transaction: "Cannot operate on different entity groups in a transaction" I still like the idea of using two entities though. – barneytron Jan 11 '09 at 01:26
  • Shouldn't the default counter value be 1? – dave paola Aug 09 '10 at 20:04
17

+1 to Jehiah's response.

The official and blessed method for keeping object counters on GAE is to build a sharded counter. Despite the heavy-sounding name, it is pretty straightforward.
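The recipe boils down to spreading writes across N counter entities and summing them on read. Here is a datastore-free sketch of the idea (the shard count and helper names are illustrative, not from the official recipe; in a real app each shard would be its own model instance):

```python
import random

NUM_SHARDS = 20  # more shards -> more concurrent write throughput, slower reads

# In the real recipe each shard is a separate datastore entity;
# a plain list stands in for them here.
shards = [0] * NUM_SHARDS

def increment():
    # Each write lands on a random shard, so no single entity
    # becomes a transaction-contention hotspot.
    shards[random.randrange(NUM_SHARDS)] += 1

def get_count():
    # A read sums every shard (typically cached in memcache in practice).
    return sum(shards)

for _ in range(1000):
    increment()
print(get_count())  # 1000
```

The trade-off is exactly what the name suggests: writes scale with the number of shards, while reads pay for N gets instead of one.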

zgoda
  • Losing the ability to track counters on a per-user basis makes it a lot harder to weed out spammers though. How can you tackle this issue when using sharded counters? – Luke May 08 '09 at 07:43
7

Count functions in all databases are slow (e.g., O(n)) - the GAE datastore just makes that more obvious. As Jehiah suggests, you need to store the computed count in an entity and refer to that if you want scalability.

This isn't unique to App Engine - other databases just hide it better, up until the point where you're trying to count tens of thousands of records with each request, and your page render time starts to increase exponentially...

Nick Johnson
  • It's not always slow -- there are some that maintain the sharded counter for you, expecting queries like SELECT COUNT(*) FROM MyTable. (IIRC, MyISAM in MySQL does that.) And your page render times would only increase linearly, not exponentially. If it were really exponential, you'd see it after hundreds of records, and not have to wait for tens of thousands. – me22 Dec 21 '10 at 05:02
2

According to the GqlQuery.count() documentation, you can set the limit to be some number greater than 1000:

from models import Troll
troll_count = Troll.all(keys_only=True).count(limit=31337)

Sharded counters are the right way to keep track of numbers like this, as folks have said, but if you figure this out late in the game (like me) then you'll need to initialize the counters from an actual count of objects. But this is a great way to burn through your free quota of Datastore Small Operations (50,000 I think). Every time you run the code, it will use up as many ops as there are model objects.

rescdsk
0

We now have Datastore Statistics that can be used to query entity counts and other data. These values do not always reflect the most recent changes as they are updated once every 24-48 hours. Check out the documentation (see link below) for more details:

Datastore Statistics
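To show how you would consume such a stats query, here is a sketch using plain dicts as hypothetical stand-ins for the returned stat entities (only the property names `kind_name` and `count` come from the stats schema; the rows and helper are made up):

```python
# Stand-ins for entities returned by a stats query: one row per kind,
# refreshed by the platform every 24-48 hours, so counts may be stale.
stat_rows = [
    {"kind_name": "Person", "count": 1200, "bytes": 48000},
    {"kind_name": "Bar", "count": 350, "bytes": 9100},
]

def estimated_count(rows, kind):
    """Pick out the (possibly stale) stored count for one kind."""
    for row in rows:
        if row["kind_name"] == kind:
            return row["count"]
    return 0  # no stats row yet for this kind

print(estimated_count(stat_rows, "Person"))   # 1200
print(estimated_count(stat_rows, "Missing"))  # 0
```

The point is that a stats lookup is a single cheap read, unlike counting the entities themselves.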

TheAddonDepot
0

I haven't tried it, and this is an utter resource hog, but perhaps iterating with .fetch() and specifying the offset would work?

LIMIT = 1000
def count(query):
    result = offset = 0
    gql_query = db.GqlQuery(query)
    while True:
        count = gql_query.fetch(LIMIT, offset)
        if count < LIMIT:
            return result
        result += count
        offset += LIMIT
orip
0

orip's solution works with a little tweaking:

LIMIT=1000
def count(query):
    result = offset = 0
    gql_query = db.GqlQuery(query)
    while True:
        count = len(gql_query.fetch(LIMIT, offset))
        result += count
        offset += LIMIT
        if count < LIMIT:
            return result
dfichter
0

As pointed out by @Dimu, the stats computed by Google on a periodic basis are a decent go-to resource when precise counts are not needed and the percentage of records changing during any given day is not drastic.

To query the statistics for a given Kind, you can use the following GQL structure:

select * from __Stat_Kind__ where kind_name = 'Person'

There are a number of properties returned by this which are helpful:

  • count -- the number of Entities of this Kind
  • bytes -- total size of all Entities stored of this Kind
  • timestamp -- an as of date/time for when the stats were last computed

Example Code

To answer a follow-up question posted as a comment to my answer, I am now providing some sample C# code that I am using, which admittedly may not be as robust as it should be, but seems to work OK for me:

/// <summary>Returns an *estimated* number of entities of a given kind</summary>
public static long GetEstimatedEntityCount(this DatastoreDb database, string kind)
{
    var query = new GqlQuery
    {
        QueryString = $"select * from __Stat_Kind__ where kind_name = '{kind}'",
        AllowLiterals = true
    };
    var result = database.RunQuery(query);
    return (long) (result?.Entities?[0]?["count"] ?? 0L);
}
Jonathan B.
  • How do you access those properties? We tried query_result.count but it was empty – Praxiteles Mar 16 '18 at 00:02
  • @Praxiteles, what library or language are you running your code from? I'm using the .NET / C# library. I'll update my answer with the code I'm using. – Jonathan B. Mar 18 '18 at 03:29
-1

The best workaround might seem a little counter-intuitive, but it works great in all my appengine apps. Rather than relying on the integer KEY and count() methods, you add an integer field of your own to the datatype. It might seem wasteful until you actually have more than 1000 records, and you suddenly discover that fetch() and limit() DO NOT WORK PAST THE 1000 RECORD BOUNDARY.

class MyObj(db.Model):
  num = db.IntegerProperty()

When you create a new object, you must manually retrieve the highest key:

max = MyObj.all().order('-num').get()
if max:
    max = max.num + 1
else:
    max = 0
newObj = MyObj(num=max)
newObj.put()

This may seem like a waste of a query, but get() returns a single record off the top of the index. It is very fast.

Then, when you want to fetch past the 1000th object limit, you simply do:

MyObj.all().filter('num >', 2345).fetch(67)

I had already done this when I read Aral Balkan's scathing review: http://aralbalkan.com/1504 . It's frustrating, but when you get used to it and you realize how much faster this is than count() on a relational db, you won't mind...

  • Offset doesn't work past 1000 limit because doing that means you're using a http://en.wikipedia.org/wiki/Schlemiel_the_painter%27s_Algorithm . And while using .filter('foo >', bar) is a good way to page through all the results, there's no need to add a pointless field to your record, since they're ordered on key already. Just grab the last key you looked at, and start from (after) there. – me22 Dec 21 '10 at 05:11