3

I would like to write a custom web analytics and action tracking solution and use it as a feedback mechanism eg. for searches or content suggestion.

If it were only for short lived data, I would use some NOSQL engine with a limited data retention. But ideally I would like to keep a long history.

One nice aproach I have seen in the past, was to use mysql for storage, one table per month, with older tables being converted to MySQL ARCHIVE format. For browsing the archives and aggregate data, MySQL views were implemented.

My question: How does something like Google Analytics store its data? In a structured database or something else. Which way would you suggest to avoid a long-term memory hog, while keeping query possibilities flexible?

(I am not concerned about writing speeds into the DB, this will happen in async batches and not in realtime)

Martin
  • 243
  • 3
  • 16
  • You can try some BI tools. For example, Infobright has a Community edition, that might be suitable for your work. It stores data internally in column-based format but you can use standard SQL to gather statistics. – ypercubeᵀᴹ Feb 18 '12 at 12:07
  • Thanks ypercube, this could maybe have afaster learning curve for me. – Martin Feb 18 '12 at 12:30

2 Answers2

2

Google uses its own Big Table implementation to store its data. If you are interested in big data solutions and utilizing big data you should take a look at this. For an open source implementation built off of google's Big Table check out Hbase/Hadoop. I will post back some links in a minute.

The analytics themselves done on this type of data utilize map/reduce operations.

j0k
  • 22,600
  • 28
  • 79
  • 90
Ryan
  • 2,755
  • 16
  • 30
  • Thank you very much for pointing me to Hbase. I knew Hadoop existed as I probably will have to go that way for another thing I am working on, so this fits into the strategy. – Martin Feb 18 '12 at 12:14
1

I think that Urchin was originally using it's own custom built multi-dimensional database, but I'm not sure if Google Analytics is still using it. Anyway, analytics systems often use Cube based schemas for fast OLAP browsing.

I know Microsoft and Oracle have this type of functionality in their database tools, but they're very expensive. I'm not aware of any open-source cube databases; however if you're using Java then I think Mondrian works in the same way, but using a relational database for storage.

Ewan Heming
  • 4,628
  • 2
  • 21
  • 20
  • I spend a lot of time yesterday, trying to get Oracle running on Debian to test a part of their Enterprise edition. It was such a pain and I think I would not use it in a productive environment on a not officially compatible OS. But thank you very much for the keywords. – Martin Feb 19 '12 at 12:46