
I have a series of events flowing through a system (e.g. a pizza ordering system), and I want to count certain properties of those events over time. For example, I might want to see how many unique people ordered a pepperoni pizza in the last 5 minutes, or how many pizzas John Doe ordered in the past week.

It is a LOT of events, so we're using something like Cassandra or HBase, because even the counts are too big to hold in memory. And since we also need to track set membership (to count the unique people ordering a particular kind of pizza, for example), the data gets bigger still.

We could store a list of orders and query it to count, but that is slow. We mostly don't care who ordered a pepperoni pizza, just how many unique orders were placed in a given time window.

What's the best way to store this information, for example in Cassandra, so that it can be retrieved for arbitrary time intervals?

I first tried Redis + Bloom filters, but storing a Bloom filter's bit vector would require transactions to avoid race conditions, so I switched to Redis sets instead.
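
A minimal sketch of that Redis-set approach, assuming one set per (pizza type, minute bucket); the key scheme, bucket size, and retention window are only illustrative:

    import time
    import redis  # redis-py client

    r = redis.Redis()

    def record_order(pizza_type, customer_id, retention_seconds=3600):
        """Add the customer to this minute's set for the given pizza type."""
        bucket = int(time.time() // 60)                  # 1-minute buckets
        key = f"orders:{pizza_type}:{bucket}"            # illustrative key scheme
        r.sadd(key, customer_id)
        r.expire(key, retention_seconds)                 # old buckets expire on their own

    def unique_orderers(pizza_type, minutes):
        """Count unique customers across the last `minutes` buckets."""
        now = int(time.time() // 60)
        keys = [f"orders:{pizza_type}:{b}" for b in range(now - minutes + 1, now + 1)]
        return len(r.sunion(*keys))                      # exact union, counted client-side

The exact sets are what grow with the number of distinct customers, which is what eventually pushed this out of memory.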

Then I realized the whole thing was too big to keep in memory, so I decided to switch to a disk-backed store. However, such stores have no native sets like Redis does.

I looked at sketches / streaming algorithms like HyperLogLog, but the conclusion was that to save the HyperLogLog object I'd need to store its bit array (or pickle the object, or similar). Is that kosher, and what are the best practices for it, if this is indeed the solution?
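
Serializing one sketch per time bucket is the pattern I had in mind. A rough illustration, assuming the datasketch library (any HyperLogLog implementation with merge support would do) and pickle for the bytes, which could then live in a blob column or on disk:

    import pickle
    from datasketch import HyperLogLog  # assumed library, not the only option

    # Build a sketch for one (pizza type, minute bucket) and serialize it.
    hll = HyperLogLog(p=12)                       # ~1.6% relative error
    for customer_id in ("alice", "bob", "alice"):
        hll.update(customer_id.encode("utf8"))
    blob = pickle.dumps(hll)                      # bytes to store in a blob column

    # Later: load the per-bucket sketches for a window and merge them.
    merged = HyperLogLog(p=12)
    for stored in (blob,):                        # normally one blob per minute bucket
        merged.merge(pickle.loads(stored))
    print(int(merged.count()))                    # approximate unique customers (~2 here)

Pickle ties the stored bytes to Python; dumping just the register array in a language-neutral format is the more portable variant of the same idea.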

I was tempted to save each event individually with a timestamp, then query and count on demand, but this is slow. I'm looking for something better, if it exists.

Example Requests:

  • How many unique people ordered a pepperoni pizza in the past 10 minutes?
  • How many pepperoni pizzas were ordered by a given person (John Doe) in the past 30 minutes?
  • How much pizza are you moving?!? – VoronoiPotato Aug 26 '13 at 21:16
  • let's say around 200 pizzas per second...and each pizza has like 6-7 different attributes we're counting (who ordered, who's the parent of the person who ordered...etc humor me) – Sam Aug 26 '13 at 21:20
  • So let's say Jim Doe orders 3 pizzas, then comes back 20 minutes later and orders 2 pizzas (he has a binge eating problem). Would those be considered 2 unique orders? Or is an order being unique predicated on the who? – VoronoiPotato Aug 26 '13 at 21:22
  • 1
    now I am feeling hungry :D – Aditya Aug 26 '13 at 21:22
  • Yes, the orders are considered unique. I would want to know things like, in the last 30 minutes, how many pizzas did Jim Doe order? In this case, the query would return 5. Also, say he ordered 3 pineapple pizzas the first time, and 2 pepperoni the second time. Then I'd have a pepperoni-pizza-count of 2 for the past 30 minutes, and a pineapple-pizza-count of 3 (independent of Jim Doe). If Jane Doe orders 2 more pepperoni pizzas right now, and I query for the last 30 min, pepperoni-pizza-count should be 4. – Sam Aug 26 '13 at 22:18
  • (You can imagine I have ~20 kinds of pizza, and customers on the order of 100 million) – Sam Aug 26 '13 at 22:30
  • 1
    You'd probably have to go into *a lot* more detail about the type of queries you wish to support. Either way, I'd probably just recommend an RDBMS. – Bernhard Barker Aug 27 '13 at 14:04
  • Added some example requests – Sam Aug 28 '13 at 16:06
  • I would be curious to know how you ended up solving this! FWIW, since the question was asked, the HyperLogLog data structure seems to have become a native part of Redis. – polesen Jan 17 '15 at 09:45

2 Answers


There are a few ways to approach this problem, from what I have learned.

  1. Use locking + a set-membership / counting data structure, e.g. HyperLogLog or a Bloom filter. As long as there isn't much contention on any particular lock, things should be okay.
  2. Use a database that has built-in set/collection support. Such databases pretty much implement #1 internally (see the sketch after this list).
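
As an illustration of #2: Redis now ships HyperLogLog natively (PFADD / PFCOUNT / PFMERGE), which caps each bucket at roughly 12 KB regardless of cardinality and handles concurrent updates for you; the key names and bucket size below are only an example:

    import time
    import redis  # redis-py client

    r = redis.Redis()

    def record_order(pizza_type, customer_id):
        bucket = int(time.time() // 60)                  # 1-minute buckets
        r.pfadd(f"hll:{pizza_type}:{bucket}", customer_id)

    def approx_unique_orderers(pizza_type, minutes):
        now = int(time.time() // 60)
        keys = [f"hll:{pizza_type}:{b}" for b in range(now - minutes + 1, now + 1)]
        return r.pfcount(*keys)                          # merged estimate across buckets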
– Sam

My guesses:

  • Cassandra supports counters - I think I saw an incr operation that should work concurrently. By keeping a free-running counter per event, you just need something that samples all the counters at a specified interval (every 5 minutes?), and then you can give estimates between any two samples (http://wiki.apache.org/cassandra/Counters). A minimal sketch follows this list.
  • Cassandra can also expire a column (TTL)... I never really used it, but it might be worth a try.
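
A rough sketch of the counter idea using the DataStax Python driver; the keyspace, table, and minute bucketing are made up for illustration:

    from datetime import datetime, timezone
    from cassandra.cluster import Cluster  # DataStax Python driver

    session = Cluster(["127.0.0.1"]).connect("pizzeria")   # illustrative keyspace

    # Assumed schema, one counter row per (pizza type, minute bucket):
    # CREATE TABLE pizza_counts (
    #     pizza_type text,
    #     bucket     timestamp,
    #     orders     counter,
    #     PRIMARY KEY (pizza_type, bucket)
    # );

    def record_order(pizza_type, quantity=1):
        bucket = datetime.now(timezone.utc).replace(second=0, microsecond=0)
        session.execute(
            "UPDATE pizza_counts SET orders = orders + %s "
            "WHERE pizza_type = %s AND bucket = %s",
            (quantity, pizza_type, bucket),
        )

Counter updates are increments, so concurrent writers don't clobber each other; answering "last 10 minutes" then means reading the relevant bucket rows and summing them client-side. Note the caveat in the comment below: counter columns don't support TTL, so old buckets need separate cleanup.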
– Zoltán Haindrich
  • So incrementing a Cassandra counter is the obvious way to actually count, but how does one handle set membership in O(1) time and sublinear space? Bloom filter / HyperLogLog were the things that came to mind, but the question becomes how to use them in the context of a distributed DB like Cassandra. – Sam Aug 26 '13 at 23:24
  • Also you'll have a better life if you pretend that counter columns don't exist in Cassandra. It's VERY easy for them to get stuff wrong. Also, they don't support TTLs, and you can't mix counters and non-counters in the same column family (except for the partition key). – Matt Mar 23 '16 at 12:52