I have about 60 billion records, all in a text file of about 980 GB in total (the lines look similar to MD5 hashes). Is it possible to transfer them to a database? I wonder if I'll run into problems!

2 Answers

Of course it's possible. You will need to write a script to load the data from whatever format your text file uses into SQL statements that can be executed against the database server.

The problems you are likely to run into:

  1. If you are using an auto-increment primary key, make sure you make it a BIGINT; 60 billion rows is far beyond the range of a 32-bit INT (about 4.3 billion unsigned).

  2. It will take a long time to load.

To achieve reasonable performance, your loading program should use a prepared statement with bound parameters and commit in batches of 1000-10000.
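
For illustration, here is roughly what that pattern looks like expressed directly in SQL; the table name hashes, the column name, and the sample value are placeholders I made up, and in practice you would drive the loop from a client program through its prepared-statement API rather than typing statements by hand.

-- Placeholder single-column table for the examples below.
CREATE TABLE IF NOT EXISTS hashes (hash CHAR(32) PRIMARY KEY) ENGINE=InnoDB;

-- Prepare the INSERT once, then reuse it with bound parameters.
PREPARE ins FROM 'INSERT INTO hashes (hash) VALUES (?)';

-- One batch: a single transaction wrapping a few thousand EXECUTEs.
START TRANSACTION;
SET @h = 'd41d8cd98f00b204e9800998ecf8427e';
EXECUTE ins USING @h;
-- ... repeat SET/EXECUTE for the rest of the batch ...
COMMIT;

DEALLOCATE PREPARE ins;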

On a modern server with a high clock speed and NVMe storage, you might see around 10,000 INSERTs per second. At that speed, 60 billion rows ÷ 10,000 rows per second is about 6 million seconds, or roughly 70 days to load your data set.

To mitigate failures, you should probably look into creating the table partitioned and splitting the loading into batches in such a way that each batch goes into a different partition. That way if a batch fails, you can truncate that one partition and re-run the failed batch.
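
As a sketch of how that could be set up (the table name, the number of partitions, and splitting on the first hex character of each hash are all my assumptions, not something from the question), you could range-partition on the hash itself and split the input files by the same prefix:

-- Standalone sketch: a partitioned variant of the placeholder table above.
DROP TABLE IF EXISTS hashes;

CREATE TABLE hashes (
    hash CHAR(32) NOT NULL,
    PRIMARY KEY (hash)
) ENGINE=InnoDB
PARTITION BY RANGE COLUMNS (hash) (
    PARTITION p0 VALUES LESS THAN ('4'),      -- hashes starting with 0-3
    PARTITION p1 VALUES LESS THAN ('8'),      -- 4-7
    PARTITION p2 VALUES LESS THAN ('c'),      -- 8-b
    PARTITION p3 VALUES LESS THAN (MAXVALUE)  -- c-f
);

-- If the batch destined for p2 fails partway through, discard only that
-- partition and re-run the batch:
ALTER TABLE hashes TRUNCATE PARTITION p2;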

If you are not familiar with databases and with working with data at this scale, you should seriously consider hiring a professional with that experience to help you out for the ~3 months (or possibly as little as a few weeks, depending on how well stocked their bag of tricks is) it will take to parse and load all of that data.

Gordan Bobić
  • I noticed that you can load very fast with the LOAD DATA INFILE method. My server has 24 cores / 48 threads, 192 GB RAM and 1 TB NVMe, and I'll do it there. Before I started I doubted MySQL would allow this – Robet Kuruzo Sep 20 '20 at 14:30
  • @RobetKuruzo Don't be surprised if MySQL utterly chokes on this operation with 60bn rows and a terabyte of data. Additionally, you will find that it slows down dramatically once your transaction log fills up. – Gordan Bobić Sep 20 '20 at 16:14

MySQL can store a table of 60 billion rows. See my answer to Maximum number of records in a MySQL database table

But I would be very surprised if 980GB of CSV data fits on a 1TB storage volume after you import it to a MySQL table. I did an experiment: I created an InnoDB table for MD5 hashes and loaded it with 1,000,000 rows.

create table myhashes (hash char(32) primary key);

The result was a tablespace that is 68MB in size. You might expect 1 million strings of 32 characters each to take exactly 32 × 1,000,000 bytes, about 32 MB, but the table uses more than twice that amount of storage.
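
If you want to reproduce the measurement, something along these lines should work on MySQL 8.0; the recursive CTE for generating test rows and the information_schema query are just one way to do it, not necessarily how the experiment above was run, and the reported size is approximate.

-- Generate and load 1,000,000 distinct MD5-style hashes (MySQL 8.0 syntax).
SET SESSION cte_max_recursion_depth = 1000000;

INSERT INTO myhashes (hash)
WITH RECURSIVE seq (n) AS (
    SELECT 1
    UNION ALL
    SELECT n + 1 FROM seq WHERE n < 1000000
)
SELECT MD5(n) FROM seq;

-- Refresh the cached table statistics, then read the size MySQL reports
-- for the table's data and indexes, in megabytes.
ANALYZE TABLE myhashes;

SELECT ROUND((data_length + index_length) / 1024 / 1024) AS size_mb
FROM information_schema.tables
WHERE table_schema = DATABASE() AND table_name = 'myhashes';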

InnoDB stores a table as a B-tree of the primary key values. So there's a lot of overhead. Also it stores data in pages, but does not fill the pages 100% full. There is also some metadata per row. In short, InnoDB takes more space than your CSV data.

Even if your CSV data were loaded 1-for-1 into the database, you would be filling your storage volume dangerously close to capacity. Solid-state storage performance degrades if you fill it too much. At my job, engineers are alerted if any volume goes over 90% full, and then we need to take action to move some data to another server. Not only because we don't want the volume to fill to 100%, but also because of the degraded performance issue.

If your storage were larger, could MySQL handle 60 billion rows? Yes it could, but it probably wouldn't be happy. When we see a table with one-tenth of that number of records, it's time to think about splitting it up, probably storing it on several servers. We have some very large databases, for example one app that splits a 424 terabyte dataset over 400 servers. We chose 1TB as an easy round number for the practical limit for a dataset. There's nothing magic about 1TB, but the larger a database gets, the harder it is to manage. So we require large datasets to be split up.

I have to wonder if MySQL is the best database product for you. I don't know what you're going to do with 60 billion md5 hashes. What kinds of queries do you need to do on this dataset? It's not clear if you have more columns in your CSV data, to associate each md5 hash with some other attribute data. Do you need to index those columns for searches?

Depending on what kinds of queries you need to do, it could be more efficient to use a scalable key/value store like Cassandra or ScyllaDB or DynamoDB. These databases automatically split the data for you, so you can query it as if it's in one database, even though it's stored on multiple servers.

I have to comment that if this is your first beginner project with database management, you might have bitten off more than you can chew. You should start by learning how to manage a modestly sized database before you push the boundaries of scale.

It's like you're just learning to ride your first motorcycle, and you want to jump over 100 buses on your first ride. Can the motorcycle do it? Maybe, but you're probably not ready to do it.

Bill Karwin
  • I figured it was some kind of a rainbow table. – Gordan Bobić Sep 20 '20 at 18:55
  • @BillKarwin Please forgive my inexperience. MD5 was just one example. As far as I understand, yes, MySQL can store 60 billion records, of course, provided I have the necessary hardware. I understood what you were talking about and the information you shared was very useful for me. I am grateful for your help. – Robet Kuruzo Sep 21 '20 at 02:34