
There is a similar question, Large data - storage and query, but I think my question is even harder.

I have a huge data set. To simplify the problem, suppose each record has three fields:

Date, Location, Object

meaning on which day a given object visited a given place.

Each day there can be 50 million new records, and I need to keep 100 days of data.

The queries are like these:

  1. Given a date range, list all the objects that have visited a specific location X;
  2. Given a date range, list all the locations a specific object has visited.

How should I design for this? Is there a workable solution? Which tools should I try? Could I use MongoDB or Hadoop?

What's more, the queries should not take too long, since they will be issued from a front end.

Thanks.

taox

2 Answers


The queries you describe are perfectly suited for a relational database. Whilst you will have a large amount of data, the queries lend themselves well to a fairly simple index scheme.

Some commercial databases have geo-spatial extensions which would allow you to extend the queries to "given a date range, tell me which objects have been within 20 kilometers of location x".

It also seems that whilst you have a large number of rows, the actual data size is fairly limited; it's not unreasonable to expect it to fit into memory on a high-end machine.

Most database systems can handle very large tables - there's no logical limit to the number of records an RDBMS holds, though there are obviously practical limits. Oracle has a solid reputation for performance with large data sets, though it's definitely worth getting an experienced Oracle DBA to help. A common strategy when handling huge amounts of data is "sharding" - putting different records in different tables and/or servers. If all your queries are date-based, you might put the data for each month on different physical servers, for instance.

I'd start with an RDBMS and create a test data set to work out whether it meets your scalability needs by running and tuning sample queries. Tune the hardware, and add more if you can afford to.
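As a minimal sketch of such a test harness (in Python with SQLite purely for illustration; the table, index, and column names are my own placeholders, and a real benchmark would use your target RDBMS and far more rows):

```python
import sqlite3
from datetime import date, timedelta
from random import randint

# Hypothetical schema: one row per (day, location, object) visit,
# with a composite index for each of the two query patterns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (day TEXT, location INTEGER, object INTEGER)")
conn.execute("CREATE INDEX idx_loc_day_obj ON visits (location, day, object)")
conn.execute("CREATE INDEX idx_obj_day_loc ON visits (object, day, location)")

# Load a small synthetic data set; scale this up when benchmarking.
start = date(2013, 1, 1)
rows = [((start + timedelta(days=randint(0, 99))).isoformat(),
         randint(1, 1000), randint(1, 100000))
        for _ in range(100000)]
conn.executemany("INSERT INTO visits VALUES (?, ?, ?)", rows)

# Query 1: objects that visited location X within a date range.
objects = conn.execute(
    "SELECT DISTINCT object FROM visits"
    " WHERE location = ? AND day BETWEEN ? AND ?",
    (42, "2013-01-10", "2013-02-10")).fetchall()

# Query 2: locations a given object visited within a date range.
locations = conn.execute(
    "SELECT DISTINCT location FROM visits"
    " WHERE object = ? AND day BETWEEN ? AND ?",
    (7, "2013-01-10", "2013-02-10")).fetchall()
```

Both queries are answered by a range scan on one of the two composite indexes, which is what makes the scheme cheap even at billions of rows.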

I don't think you will get much benefit from Hadoop - you're not doing much processing, you're just searching a large dataset.

MongoDB is designed to work with document-style data; your data seems relational in nature, rather than being a document. You could build this in MongoDB, but I'm not sure you'd get much benefit.
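If you did want to try it, the equivalent design translates fairly directly; here is a hedged sketch with pymongo, where the connection string, database, and collection names are placeholders of my own:

```python
from datetime import datetime
from pymongo import MongoClient, ASCENDING

# Placeholder connection and names; adjust for your deployment.
client = MongoClient("mongodb://localhost:27017")
visits = client.tracking.visits

# Same two compound indexes as in the relational design.
visits.create_index([("location", ASCENDING), ("date", ASCENDING), ("object", ASCENDING)])
visits.create_index([("object", ASCENDING), ("date", ASCENDING), ("location", ASCENDING)])

# Query 1: objects seen at location X within a date range.
objects = visits.distinct("object", {
    "location": 42,
    "date": {"$gte": datetime(2013, 1, 10), "$lte": datetime(2013, 2, 10)},
})

# Query 2: locations a given object visited within a date range.
locations = visits.distinct("location", {
    "object": 7,
    "date": {"$gte": datetime(2013, 1, 10), "$lte": datetime(2013, 2, 10)},
})
```

As you can see, it buys you little over the relational version for this access pattern.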

Neville Kuyt
  • If I use a relational DB, my question is: for a typical database like Sybase or DB2, is there a limit on the number of records in a table? I will eventually have billions of records. – taox Jan 18 '13 at 13:58
  • Yes, an RDBMS should be able to handle that number of records - I've updated the answer. You may need to get a DBA to help tune your database, though. – Neville Kuyt Jan 20 '13 at 22:11
  • Thanks, I will have a try! – taox Jan 25 '13 at 06:42

That problem is not as hard as it sounds. You need two indexes:

(Location, Date, Object)
(Object, Date, Location)

Those two indexes satisfy your queries optimally. There will be no inherent scalability limit that you need to worry about.

You might consider partitioning on Date so that you can drop old partitions efficiently.
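For illustration, here is a minimal sketch of date-based partitioning with a 100-day retention window, written in Python with one SQLite table per day standing in for real partitions (all names are hypothetical; a production RDBMS would use its native partitioning syntax and simply drop the expired partition):

```python
import sqlite3
from datetime import date, timedelta

RETENTION_DAYS = 100
conn = sqlite3.connect(":memory:")

def table_for(day: date) -> str:
    # e.g. visits_20130118
    return "visits_" + day.strftime("%Y%m%d")

def ensure_partition(day: date) -> None:
    # Create the day's "partition" with both composite indexes.
    name = table_for(day)
    conn.execute(f"CREATE TABLE IF NOT EXISTS {name} "
                 "(day TEXT, location INTEGER, object INTEGER)")
    conn.execute(f"CREATE INDEX IF NOT EXISTS {name}_loc ON {name} (location, day, object)")
    conn.execute(f"CREATE INDEX IF NOT EXISTS {name}_obj ON {name} (object, day, location)")

def drop_expired(today: date) -> None:
    # Dropping a whole partition is far cheaper than deleting rows one by one.
    cutoff = table_for(today - timedelta(days=RETENTION_DAYS))
    names = conn.execute(
        "SELECT name FROM sqlite_master WHERE name LIKE 'visits_2%'").fetchall()
    for (name,) in names:
        if name < cutoff:
            conn.execute(f"DROP TABLE {name}")

ensure_partition(date(2013, 1, 18))
drop_expired(date(2013, 1, 18))
```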

usr