0

Hi I am building a simple movie website and i am using google datastore as my backend.

I am in need to design top movie downloads per day, week, month , all time.

Now I am tracking only number of downloads eg: database is like a single row for a movie with num_of_downloads as a column, i am incrementing this field whenever a user downloads the movie.

Movie Id | Title | Description | num_of_downloads

Now what would be the best optimal way to show top movie downloads per day, week, month, all time.

I am thinking to implement a new table with movie key and date field to keep track of number of downloads per day with which i can calculate for rest of the problem.

Movie ID, Date,      Downloads
M1        01/01/2017  1000
M2        01/01/2017   100
M1        02/01/2017  2000

I am not sure this is the best way to design for appengine datastore also writing queries to get top 10 downloads this week, month is difficult as its not as easy as sql queries.

Community
  • 1
  • 1
Rinsen S
  • 461
  • 3
  • 10
  • 26

3 Answers3

0

That is not an easy problem to solve in datastore if you want the totals to be 100% correct. Involves manually updating such counters.

A better approach for your case is to also export or stream the downloads data to bigquery or google analytics as you serve those requests. Then, a single daily cron can calculate the stats and cache them somewhere like datastore or memcached.

Analytics is free but stats are not exact. Bigquery has costs but for your case they are low because its mostly storage (cheap) and 3 queries a day (cheap)

Zig Mandel
  • 19,571
  • 5
  • 26
  • 36
  • Hi Zig, thanks for the response, ya initially thought to get data from google analytics but it cannot be accurate on time as it depends on when the cron job is getting data from GA and commuting them. Also if we have more data then sampling would be a problem. Is there any other optimal way to design and solve the problem ?? – Rinsen S Jun 11 '17 at 09:38
  • that other way was already in my answer. I wrote two options, – Zig Mandel Jun 11 '17 at 14:39
0

Datastore is very limited and not a good match for statistics. I agree with Zig that you should go with an alternative service to solve this.

I suggest that you simply use Cloud SQL and keep one record for each downloaded movie and add all information that you got to that record. Since this relates to number of downloaded movies the numbers would be limited. As long as you use proper indexing and caching I have a hard time to imagine that this could cause any latency, pricing or quotas issues.

Having all information available in sql could come handy later on, both if you want to show more metrics to our users, or if you want it for internal purposes.

Arne S
  • 1,004
  • 14
  • 41
  • Hi thanks for the response Arne, but is it good to store all the info eg: movie name, no_of_downloads, timestamp for each download into cloud sql from appengine ?? I think it would be overhead, because whenever a movie is download we get request to appengine where we are updating counter_value , if we are going to use cloud sql then we need to insert one record with timestamp and if movie gets download 10K times there would be 10K records and with 1000 movies it would cross over 10M rows – Rinsen S Jun 11 '17 at 09:43
  • Wen are planning to build something that could handle 10m movie downloads keeping track of each download would not be the issue in terms of cost and complexity. If all you want is to be able to count number of downloads within set time intervals you could probably do it without maintaining information about each movie downloaded. However since the cost maintaining proper records are negligible and the benefits are major this is really a no-brainer. – Arne S Jun 11 '17 at 10:57
  • 10m rows for cloud sql (mysql) should not be an issue ref https://stackoverflow.com/a/5351306/1010991. You should probably do some level of normalization (movie_name = movie_id) and keep track of user_id, client platform etc. – Arne S Jun 11 '17 at 11:10
  • Ya i am pretty much able to get your solution and i think maintaining each record will give us lot of informations in future for data analysis. It's the trade off between cost and storage and computation. But what if we can use datastore to maintain the records instead of cloud sql ?? – Rinsen S Jun 11 '17 at 11:47
0

I'll offer a dissenting opinion over the other two answers and suggest a possible app engine solution.

Create a entity for accumulating stats for each day:

class MovieStats(ndb.Model):
    day = ndb.DateTimeProperty()
    movies = ndb.KeyProperty(repeated=True)

You then update the MovieStats entity for the day for each movie download. You'll need to do this with transactions of course, and do so in a way that avoids contention. One solution is sharded counters, but I really like the solution described by Nick Johnson in two blog posts:

I don't want to repeat the posts here, but the basic idea is that for each movie download, you add a task to a pull queue to update the counter for the movie. Then, once a minute (or some other interval), you process all the tasks in the pull queue and update your MovieStats entity in a transaction.

For processing the pull queue tasks, you could have a cron job that launches every minute, or use the "on-demand" CRON jobs described above. Each time you add a task to the pull queue, you create a worker task (regular push queue) to process tasks in the pull queue. For the worker task, you add a task name corresponding to a time interval to make sure you only have one worker task in the time interval. It allows you to get the benefit of 1-minute CRON job but the added performance bonus that it only runs when needed so you don't have a CRON job running when not needed.

new name
  • 15,861
  • 19
  • 68
  • 114