7

Imagine you got an entity in the Google App Engine datastore, storing links for anonymous users. You would like to perform the following SQL query, which is not supported:

SELECT DISTINCT user_hash FROM links

Instead you could use:

user = db.GqlQuery("SELECT user_hash FROM links")

How to use Python most efficiently to filter the result, so it returns a DISTINCT result set? How to count the DISTINCT result set?

Federico Elles
  • 891
  • 3
  • 14
  • 24

4 Answers4

5

Reviving this question for completion:

The DISTINCT keyword has been introduced in release 1.7.4.

You can find the updated GQL reference (for example for Python) here.

Bernd Verst
  • 836
  • 8
  • 12
3

A set is good way to deal with that:

>>> a = ['google.com', 'livejournal.com', 'livejournal.com', 'google.com', 'stackoverflow.com']
>>> b = set(a)
>>> b
set(['livejournal.com', 'google.com', 'stackoverflow.com'])
>>> 

One suggestion w/r/t the first answer, is that sets and dicts are better at retrieving unique results quickly, membership in lists is O(n) versus O(1) for the other types, so if you want to store additional data, or do something like create the mentioned unique_results list, it may be better to do something like:

unique_results = {}
>>> for item in a:
    unique_results[item] = ''


>>> unique_results
{'livejournal.com': '', 'google.com': '', 'stackoverflow.com': ''}
unmounted
  • 33,530
  • 16
  • 61
  • 61
  • A set object is an unordered collection of distinct hashable objects.(...) New in version 2.4. http://www.python.org/doc/2.5.2/lib/types-set.html – Federico Elles Oct 28 '08 at 08:27
  • 1
    Set is okay if the number of records is relatively small. But if you have gazillions of records in the datastore, it would be quite inefficient! A much better strategy would be to pre-calculate and store the result at the insert/update time. – sudarkoff Jan 19 '09 at 03:38
1

One option would be to put the results into a set object:

http://www.python.org/doc/2.6/library/sets.html#sets.Set

The resulting set will consist only of the distinct values passed into it.

Failing that, building up a new list containing only the unique objects would work. Something like:

unique_results = []
for obj in user:
    if obj not in unique_results:
        unique_results.append(obj)

That for loop can be condensed into a list comprehension as well.

James Bennett
  • 10,903
  • 4
  • 35
  • 24
0

Sorry to dig this question up but in GAE I cannot compare objects like that, I must use .key() for comparison like that:

Beware, this is very inefficient :

def unique_result(array):
    urk={} #unique results with key
    for c in array:
        if c.key() not in urwk:
            urk[str(c.key())]=c
    return urk.values()

If anyone has a better solution, please share.

Carlos Ricardo
  • 2,058
  • 25
  • 32
  • There's another question asking how to do this in Datastore, and the basic answer is that you can't: http://stackoverflow.com/questions/1183102/how-to-get-the-distinct-value-of-one-of-my-models-in-google-app-engine. I tried to suggest some vague ideas how to denormalize, but it's possible the state of the art has moved on since then. – Steve Jessop Sep 28 '11 at 23:28