1

I have two data sets in google app engine datastore.

class First_Set(db.Model):
  start_time = db.DateTimeProperty()
  end_time = db.DateTimeProperty()
  data1 = db.FloatProperty()
  ...

class Second_Set(db.Model):
  start_time = db.DateTimeProperty()
  end_time = db.DateTimeProperty()
  data2 = db.FloatProperty()
  ...

(They have other different data that's why they're in different datasets.)

I'd like to find the datastore IDs all the overlapping start_time and end_time across two datasets, ideally without pulling results from one and iterating the first results over the other.

A great visualization of the initial dataset is from here (it also has the problem solved in SQL):

1     |-----| 
2        |-----| 
3                 |--| 
4                       |-----| 
5                          |-----| 
6                                  |---| 
7                                        |---|  
8                           |---| 
9                                       |-----|

End result I need is something in the tune of (from the same example):

+----+---------------------+----+---------------------+ 
| id | start               | id | end                 | 
+----+---------------------+----+---------------------+ 
|  2 | 2008-09-01 15:02:00 |  1 | 2008-09-01 15:04:00 | 
|  5 | 2008-09-01 16:19:00 |  4 | 2008-09-01 16:23:00 | 
|  8 | 2008-09-01 16:20:00 |  4 | 2008-09-01 16:22:00 | 
|  8 | 2008-09-01 16:20:00 |  5 | 2008-09-01 16:22:00 | 
|  7 | 2008-09-01 18:18:00 |  9 | 2008-09-01 18:22:00 | 
+----+---------------------+----+---------------------+ 

SQL solution is described in the example as below but I couldn't do this in datastore because of lack of JOIN:

SELECT v1.id, v1.start, v2.id, LEAST(v1.end,v2.end) AS end 
FROM visits v1 
JOIN visits v2 ON v1.id <> v2.id and v1.start >= v2.start and v1.start < v2.end  
ORDER BY v1.start;

I understand that one-to-many version of this is rather straightforward using a ListProperty() (from this question).

Can anyone think of a solution to find the overlapping times (ideally in Python)?

Community
  • 1
  • 1
Saint
  • 391
  • 3
  • 10

2 Answers2

2

Look into Marzullo's algorithm its time efficiency is O(n log n).
There are also many question on Stackoverflow that cover overlapping intervals which can be used to solve your problem on AppEngine.

Community
  • 1
  • 1
Shay Erlichmen
  • 31,691
  • 7
  • 68
  • 87
  • Sweet, this is actually much more scalable/manageable than what I had in mind as it works with any number of datasets. Thanks, am posting my quick and dirty answer using this methodology. – Saint Jun 23 '12 at 14:45
1

Posting my solution with no JOINs, thanks to Shay's direction. Should be able to find overlaps over any number of datasets with minor edits (at least that's the theory).

My Python isn't that great but below should give the idea:

from operator import itemgetter

class Find_Overlaps(webapp2.RequestHandler):
    def get(self):
        all_dates = []
        first_dates = db.GqlQuery("SELECT * FROM First_Set")
        for date in first_dates:
            row = {'dataset':'First_Set', 'dbkey':date.key(), 'offset':date.start_time, 'type': -1}
            all_dates.append(row)
            row = {'dataset':'First_Set', 'dbkey':date.key(), 'offset':date.end_time, 'type': 1}
            all_dates.append(row)

        second_dates = db.GqlQuery("SELECT * FROM Second_Set")
        for date in second_dates:
            row = {'dataset':'Second_Set', 'dbkey':date.key(), 'offset':date.start_time, 'type': -1}
            all_dates.append(row)
            row = {'dataset':'Second_Set', 'dbkey':date.key(), 'offset':date.end_time, 'type': 1}
            all_dates.append(row)

        newlist = sorted(all_dates, key=itemgetter('offset','type'))
        number_datasets = 2 #goal is to find overlaps in all sets not only the best overlaps, that's why this is needed
        loopcnt = 0
        update_bestend = 0
        overlaps = []
        for row in newlist: #Below is mostly from Marzullo's alghorithm
            loopcnt = loopcnt - row['type']#this is to keep track of overall tally
            if update_bestend == 1:
                if loopcnt == (number_datasets - 1):
                    bestend = row['offset']
                    end_set = row['dataset']
                    end_key = row['dbkey']
                    overlaps.append({'start':beststart,'start_set':start_set,'start_key':start_key,'end':bestend,'end_set':end_set,'end_key':end_key})
                    update_bestend = 0
            if loopcnt == number_datasets:
                beststart = row['offset']
                start_set = row['dataset']
                start_key = row['dbkey']
                update_bestend = 1

        for overlap in overlaps: #just to see what the outcome is
            self.response.out.write('start: %s, start_set: %s, end: %s, end_set: %s<br>' % (overlap['start'], overlap['start_set'], overlap['end'], overlap['end_set']))
Saint
  • 391
  • 3
  • 10
  • You may run into problems with this in app engine for large datasets. Each query will return up to 1000 results, so if you have more than 1000 entries in a set, you'll run into problems. You can do multiple queries to get the full sets, but then you may also run into request timeouts if the process takes to long. In that case, you may want to use a backend or mapreduce to handle the operation. – dragonx Jul 09 '12 at 01:21