
I have the following problem. I need to store 150 MM records/day for 10 years. That is a total of 150MM * 365 * 10 = 547,500,000,000 records. Database records have a unique key {date, id}. I need to retrieve 40MM records daily from this database, and I will always use the key {date, id} to search. The process can run in batch. I thought about using a key-value database, such as HBase, sharding my database by date (I'm not sure whether HBase lets you choose how to partition the records within the cluster), or simply leaving the sharding to HBase.

I saw a similar question that uses MySQL partitioning (Efficiently storing 7.300.000.000 rows). I don't know whether MySQL can partition across multiple machines, or whether I could handle this problem on just one machine.

Do you believe that this architecture will work? If not, what would be another way to solve the problem? Suggestions and tips are welcome!

  • What on _Earth_ requires you to store that much data? Are you logging CERN data or something? – Bojangles Jul 18 '14 at 14:21
  • More than 4 TB just for the keys. Not bad. Now my question: if the data is ordered by date, are you **sure** you need a database? A fixed-layout storage, maybe split somehow (per year)? Both searching and load/store can be faster if you don't really need SQL _features_ – Adriano Repetti Jul 18 '14 at 14:24
  • Indeed, is your data *relational*? If not, then a *relational* database management system (such as MySQL) is definitely not the right tool for the task, for this volume of data. – RandomSeed Jul 18 '14 at 14:28
  • @Bojangles well it's not SO much. Imagine (just to pick an example) you're writing control software for a power plant. Even aggregating some data you'll easily have more than 1000 sensors (probably 4x/5x more). Now imagine a 1 Hz sampling rate. The worst case (values always changing) is 1000 * 60 * 60 * 24 * 365 = 31,536,000,000 samples per year (but probably it'll be 4 or 5 times more). – Adriano Repetti Jul 18 '14 at 14:30
  • Anyway IMO if (with a big IF AND ONLY IF) you need to store so much data in a DB (you'll use its _features_) then I'd split them across multiple (powerful) machines. Find a way to do it (based on date? id range?), logic can be in each client or you may use a _query coordinator_. – Adriano Repetti Jul 18 '14 at 14:33
  • This looks like an X-Y problem (http://meta.stackexchange.com/questions/66377/what-is-the-xy-problem). What is the actual business problem being addressed? You have given a proposed solution **but not specified the business problem actually being addressed.** – Pieter Geerkens Jul 18 '14 at 14:39
  • @Bojangles: It is data about a credit bureau in Brazil, storing transactions. We have 200 MM inhabitants. – p.magalhaes Jul 18 '14 at 16:06
  • @AdrianoRepetti: I don't need SQL features. Could you please tell me more about your solution? Thanks – p.magalhaes Jul 18 '14 at 16:08
  • @PieterGeerkens: I have 150 MM records of transactions per day, identified with a key {id, date}. This file is generated daily by the mainframe. I need to store these files historically and, given a key {id, date}, retrieve the record from this database. I do not know how I can be more clear. – p.magalhaes Jul 18 '14 at 17:26
  • You may get some inspiration from what Google and Facebook did. Check their **Bigtable** and **Haystack** implementations. Don't forget there are other implementations of such kinds of databases (**Hadoop**, **Cassandra**, **DynamoDB**, but also many others I don't remember). – Adriano Repetti Jul 21 '14 at 11:01
  • In general _standard_ DBs aren't a good choice if you don't need relations, SQL queries and joins. IMO they're not a good choice because of the big overhead; enterprise-level DBs (so I'd exclude MySQL and others) can manage (with a careful setup) such an amount of data, but it _may_ be a waste (in resources, power and management). – Adriano Repetti Jul 21 '14 at 11:22

1 Answer


This is a fairly large amount of data, and there are many potential solutions. HBase should be a better choice than MySQL, as MySQL adds a lot of overhead for transaction guarantees and other things you probably don't care about. You can shard MySQL across many servers, but there is still additional overhead that isn't necessary. HBase supports configurable sharding, so if you shard by date it could work well.
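
For illustration, here is a minimal sketch of a composite row-key design using the post-1.0 HBase Java client; the table name, column family, and qualifier below are assumptions, not anything from the question. Because the key leads with the date, all writes for a given day land in the same region, which is usually acceptable for a batch load but can be salted if it becomes a hotspot.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class TransactionStore {

    // Illustrative names only -- adjust to your schema.
    private static final TableName TABLE = TableName.valueOf("transactions");
    private static final byte[] CF = Bytes.toBytes("d");
    private static final byte[] QUALIFIER = Bytes.toBytes("payload");

    // Row key = yyyyMMdd + "#" + id, so each {date, id} lookup is a single Get
    // and the rows of one day are stored contiguously.
    static byte[] rowKey(String yyyymmdd, long id) {
        return Bytes.toBytes(yyyymmdd + "#" + id);
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TABLE)) {

            // Store one record.
            Put put = new Put(rowKey("20140718", 123456789L));
            put.addColumn(CF, QUALIFIER, Bytes.toBytes("record bytes..."));
            table.put(put);

            // Point lookup by {date, id}.
            Result result = table.get(new Get(rowKey("20140718", 123456789L)));
            byte[] value = result.getValue(CF, QUALIFIER);
        }
    }
}
```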

There is another potential do-it-yourself option if you are a Java developer. MapDB (http://www.mapdb.org) is an open-source Java key-value database, and it has some interesting features that can help. One very strong feature is key compression: the date portion of your key can be stored once, and the ID portion can be the actual unique identifier within a specific date. This would greatly reduce the size of your data, because in a traditional database each row carries its own copy of the date value, replicated 150MM times per day for your data set. At 8 bytes per date value, that is roughly 1.2 GB per day of wasted space, which can also slow queries.
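
A rough sketch of that idea, assuming MapDB 3.x's API (the file name, map name, and serializer choices are illustrative): one MapDB file per date, keyed only by the 8-byte ID, so the date never has to be repeated per row.

```java
import org.mapdb.BTreeMap;
import org.mapdb.DB;
import org.mapdb.DBMaker;
import org.mapdb.Serializer;

public class DailyStore {

    // One MapDB file per date: the date lives in the file name,
    // not in every key.
    static DB openForDate(String yyyymmdd) {
        return DBMaker.fileDB("transactions-" + yyyymmdd + ".db")
                .fileMmapEnableIfSupported()
                .make();
    }

    public static void main(String[] args) throws Exception {
        try (DB db = openForDate("20140718")) {
            // Keys are plain 8-byte IDs; values are the serialized record.
            BTreeMap<Long, byte[]> records = db
                    .treeMap("records", Serializer.LONG, Serializer.BYTE_ARRAY)
                    .createOrOpen();

            records.put(123456789L, "record bytes...".getBytes());
            byte[] value = records.get(123456789L);   // lookup by {date, id}

            db.commit();
        }
    }
}
```

With one file per day, a batch lookup job only needs to open the files for the dates it is actually touching.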

MapDB does not currently have a server implementation, so you would need to wrap it in a process, and you could run that on many servers. This solution is clearly more work than HBase, but it could be optimized to perform very well.
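
To make the "wrap it in a process" part concrete, below is a bare-bones sketch using the JDK's built-in com.sun.net.httpserver; the endpoint, port, and file name are arbitrary assumptions, and a real service would add error handling, keep a cache of open per-date files, and probably batch lookups rather than serving them one at a time.

```java
import com.sun.net.httpserver.HttpServer;
import org.mapdb.BTreeMap;
import org.mapdb.DB;
import org.mapdb.DBMaker;
import org.mapdb.Serializer;

import java.io.OutputStream;
import java.net.InetSocketAddress;

public class LookupServer {

    public static void main(String[] args) throws Exception {
        // One read-only per-date DB per server instance, for simplicity.
        DB db = DBMaker.fileDB("transactions-20140718.db").readOnly().make();
        BTreeMap<Long, byte[]> records = db
                .treeMap("records", Serializer.LONG, Serializer.BYTE_ARRAY)
                .open();

        // GET /record?id=123456789 returns the raw record bytes, or 404.
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/record", exchange -> {
            String query = exchange.getRequestURI().getQuery(); // "id=123456789"
            long id = Long.parseLong(query.substring("id=".length()));
            byte[] value = records.get(id);
            if (value == null) {
                exchange.sendResponseHeaders(404, -1);
            } else {
                exchange.sendResponseHeaders(200, value.length);
                try (OutputStream out = exchange.getResponseBody()) {
                    out.write(value);
                }
            }
            exchange.close();
        });
        server.start();
    }
}
```

A thin client (or the batch job itself) would then route each {date, id} lookup to whichever server holds that date's file.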

There are many other ideas being worked on around MapDB that will be available in the future to make something like this easier.

In summary, HBase is likely the easy way to do this, and it should work fine for your volumes and queries. If you want to experiment with a lower-level approach that gives you greater control, consider MapDB. A traditional relational DBMS like MySQL will add a lot of overhead you don't require and will still need a sharding setup, so it is not a very good fit.

dbschwartz