
We have to ingest and store 150 billion records in our MySQL InnoDB database. One field in particular, a VARCHAR, is taking up a lot of space. Its characteristics:

  • Can be NULL
  • Highly duplicated but we can't de-dupe because it increases ingestion time exponentially
  • Average length is about 75 characters
  • It has to have an index as it will have to join with another table
  • We don't need to store it in human readable format but we need to be able to match it to another table which would have to have the same format for this column

I've tried the following:

  • Compressing the table, this helps with space but dramatically increases ingestion time, so I'm not sure compression is going to work for us
  • Tried hashing to SHA2, which reduced the string length to 56 characters; that gives us a reasonable space saving but just not quite enough. Also, I'm not sure SHA2 will generate unique values for this sort of data
  • Was thinking about MD5, which would further reduce the string length to probably the right level, but I'm not sure whether MD5 is strong enough to generate unique values to be able to match with another table
Rick James
L4zl0w
  • MD5 may well cause you issues with collisions (two separate values giving the same MD5 hash). What sort of space savings are you trying to achieve? – Martin Jul 09 '16 at 14:18
  • is `ENUM` an option depending on how many duplicates you expect to have? – Martin Jul 09 '16 at 14:19
  • 90% duplication, so there are still 15 billion unique values. ENUM is not an option. If MD5 is problematic, how about SHA2? – L4zl0w Jul 09 '16 at 16:13
  • with that quantity of unique values there's a strong chance that MD5 hashing will have collisions and so will not be suitable. I don't know enough about SHA2, but you can research the probability of SHA2 collisions and [see this post too](http://crypto.stackexchange.com/questions/3049/are-there-any-known-collisions-for-the-sha-1-2-family-of-hash-functions). – Martin Jul 09 '16 at 16:20
  • What are the maximum and minimum sizes of the string (when not null)? It might be useful to know what percentage of strings are <50,50-120,>120 characters, say. – Strawberry Jul 09 '16 at 17:27
  • I think the majority are around 70-75 characters long. – L4zl0w Jul 09 '16 at 17:33
  • MD5 collision is extremely unlikely. If you have 9 trillion MD5s, there is only one chance in 9 trillion that there will be a collision. (You have a mere 15 billion, so the odds are much less.) – Rick James Jul 12 '16 at 00:09
  • How much RAM are you working with? – Rick James Jul 12 '16 at 01:01
  • What percentage are NULL? Is the avg of 75 over all rows? Or just the non-null rows? – Rick James Jul 12 '16 at 01:16
  • We have been using SHA2 but MD5 would save space. Although we have 150 billion rows. We have 768GB RAM. NULL fields are rare, probably half a million. – L4zl0w Jul 12 '16 at 07:10
  • I found that you can UNHEX() the hash and store it in half the bytes. Is UNHEX safe though if I need to do joins on this value? – L4zl0w Jul 12 '16 at 07:12
  • `JOINing` on a `BINARY(nn)` is fine. – Rick James Jul 13 '16 at 16:25
  • To normalize 15 billion values (avg len 75), plus id and overhead, will take more than a terabyte. That is, the ingestion will slow to a crawl beginning about one-third of the way through, because of I/O. The 768GB will keep it humming for a long time, but not the entire time. – Rick James Jul 13 '16 at 16:33

3 Answers

1

A hash function like MD5 produces a 128-bit hash in a string of 32 hex characters, but you can use UNHEX() to cut that in half to 16 binary characters, and store the result in a column of type BINARY(16). See my answer to What data type to use for hashed password field and what length?
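For instance, a minimal sketch of that layout (the table and column names here are hypothetical, just to illustrate the BINARY(16) storage and the join):

```sql
-- Hypothetical names; the point is the BINARY(16) column holding UNHEX(MD5(...)).
CREATE TABLE hashed_table (
  addr_md5 BINARY(16) NULL,   -- NULL when the original field is NULL
  -- ...other columns...
  INDEX (addr_md5)
) ENGINE=InnoDB;

-- Store the 16-byte hash instead of the ~75-character string:
INSERT INTO hashed_table (addr_md5) VALUES (UNHEX(MD5('string1@string2')));

-- If the other table stores the same UNHEX(MD5(...)) value, you can join on it:
-- SELECT ... FROM hashed_table h JOIN other_table o ON o.addr_md5 = h.addr_md5;
```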

MD5 has 2^128 distinct hashes, or 340,282,366,920,938,463,463,374,607,431,768,211,456. The chance of two different strings resulting in a collision is reasonably low, even if you have 15 billion distinct inputs. See How many random elements before MD5 produces collisions? If you're still concerned, use SHA1 or SHA2.
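As a rough back-of-the-envelope check (a sketch using the standard birthday approximation and the 15 billion figure from the comments):

```latex
P(\text{collision}) \approx \frac{n^2}{2 \cdot 2^{128}}
  = \frac{(1.5 \times 10^{10})^2}{2^{129}}
  \approx \frac{2.25 \times 10^{20}}{6.8 \times 10^{38}}
  \approx 3 \times 10^{-19}
```

In other words, a collision among 15 billion MD5s is vanishingly unlikely, which is consistent with Rick James's comment above.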

I'm a bit puzzled by your attempts to use a hash function, though. You must not care what the original string is, since you must understand that hashing is not reversible. That is, you can't get the original string from a hash.

I like the answer from @Data Mechanics, that you should enumerate the unique string inputs in a lookup table, and use a BIGINT primary key (an INT has only 4+ billion values, so it isn't large enough for 15 billion rows).

I understand what you mean that you'd have to look up the strings to get the primary key. What you'll have to do is write your own program to do this data input. Your program will do the following:

  1. Create an in-memory hash table to map strings to integer primary keys.
  2. Read a line of your input
  3. If the hash table does not yet have an entry for the input, insert that string into the lookup table and fetch the generated insert id. Store this as a new entry in your hash table, with the string as the key and the insert id as the value of that entry. (See the SQL sketch after this list for doing the insert-or-fetch in one statement.)
  4. Otherwise, the hash table already has an entry; just read the primary key bigint from the hash table.
  5. Insert the bigint into your real data table, as a foreign key, along with other data you want to load.
  6. Loop to step 2.
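Steps 3 and 4 can be collapsed into a single round trip on the MySQL side using the INSERT ... ON DUPLICATE KEY UPDATE / LAST_INSERT_ID() idiom. A minimal sketch, with hypothetical table and column names:

```sql
-- Hypothetical lookup table for the normalization approach.
CREATE TABLE addr_lookup (
  id   BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  addr VARCHAR(255) NOT NULL,   -- assumes the strings fit in an indexable VARCHAR
  UNIQUE KEY (addr)
) ENGINE=InnoDB;

-- "Get or create" in one statement: a new string is inserted; an existing one
-- makes LAST_INSERT_ID() return the already-assigned id.
INSERT INTO addr_lookup (addr)
VALUES ('string1@string2')
ON DUPLICATE KEY UPDATE id = LAST_INSERT_ID(id);

SELECT LAST_INSERT_ID();   -- the BIGINT to write into the real data table (step 5)
```

That avoids a separate SELECT per row, but it is still one statement per value, which is why the in-memory cache matters so much at this scale.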

Unfortunately it would take over 1 TB of memory to hold a HashMap of 15 billion entries, even if you MD5 the string before using it as the key of your HashMap.

So I would recommend putting the full collection of mappings into a database table, and keeping a subset of it in memory. You then need an extra step around step 3 above: if the in-memory HashMap doesn't have an entry for your string, first check the database. If it's in the database, load it into the HashMap. If it isn't in the database, insert it into the database and then into the HashMap.

You might be interested in using a class like LruHashMap. It's a HashMap with a maximum size (which you choose according to how much memory you can dedicate to it). If you put a new element when it's full, it kicks out the least recently referenced element. I found an implementation of this in Apache Lucene, but there are other implementations too. Just Google for it.

Bill Karwin
  • thanks for your answer. Very interesting. We are not interested in the value of the field at all; we need it only to be able to match with other entities based on this field. I guess UNHEX()ing the MD5 would save a lot of space but create even more invalid matches. You gave me a bit more hope on using MD5, but I'm still nervous about it. I like the LruHashMap idea as well, but we won't be able to hold the entire map in memory, and querying a database of that size for missing values will take ages even if indexing is done properly. – L4zl0w Jul 09 '16 at 18:04
  • Yeah, that's the cost of working on a data set that's 100x the size that people normally think of as "big." – Bill Karwin Jul 09 '16 at 21:58
  • Your in-memory hash will work fine until it overflows. The rest will be miserably slow due to the random nature of hashes and the need for I/O. – Rick James Jul 12 '16 at 01:02
  • The `UNHEX` of MD5 or SHA2 is the only practical way to store them. It uses half the bytes, without any loss of precision. – Rick James Jul 12 '16 at 01:05
0

Is the varchar ordinary text? Such text is typically compressible about 3:1. Compressing just the one field may get it down to 25-30 bytes; then use something like VARBINARY(99).
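A sketch of that idea using MySQL's built-in COMPRESS()/UNCOMPRESS() (hypothetical names; note that zlib adds a few bytes of framing, so the 3:1 figure is optimistic for ~75-byte strings, and compressing in the client instead, as suggested in the comments below, keeps the CPU load off mysqld):

```sql
-- Sketch only: compress just the one field rather than the whole table.
CREATE TABLE big_table (
  addr_z VARBINARY(99) NULL   -- compressed form of the original string
) ENGINE=InnoDB;

INSERT INTO big_table (addr_z) VALUES (COMPRESS('string1@string2'));

-- Recover the original text when needed:
SELECT UNCOMPRESS(addr_z) FROM big_table;
```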

INT (4 bytes) is not big enough for normalizing 15 billion distinct values, so you need something bigger. BIGINT takes 8 bytes. BINARY(5) and DECIMAL(11,0) are 5 bytes each, but are messier to deal with.

But you are concerned by the normalization speed. I would be more concerned by the ingestion speed, especially if you need to index this column!

How long does it take to build the table? You haven't said what the schema is; I'll guess that you can put 100 rows in an InnoDB block. I'll say you are using SSDs and can get 10K IOPs. 1.5B blocks / 10K blocks/sec = 150K seconds = 2 days. This assumes no index other than an ordered PRIMARY KEY. (If it is not ordered, then you will be jumping around the table, and you will need a lot more IOPs; change the estimate to 6 months.)

The index on the column will effectively be a table of 150 billion 'rows' -- it will take several terabytes just for the index BTree. You can either index the field as you insert the rows, or you can build the index later.

  • Building index as you insert, even with the benefit of InnoDB's "change buffer", will eventually slow down to not much faster than 1 disk hit per row inserted. Are you using SSDs? (Spinning drives are rated about 10ms/hit.) Let's say you can get 10K hits (inserts) per second. That works out to 15M seconds, which is 6 months.
  • Building the index after loading the entire table... This effectively builds a file with 150 billion lines, sorts it, then constructs the index in order. This may take a week, not months. But... It will require enough disk space for a second copy of the table (probably more) during the index-building. (See the short DDL sketch after this list.)
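For the second option, the DDL itself is simple (a sketch; `big_table` and `addr_z` are the hypothetical names from the sketch above -- the real costs are the time and the temporary disk space, not the statement):

```sql
-- After loading the table, build the secondary index in one pass:
ALTER TABLE big_table ADD INDEX idx_addr (addr_z);
-- equivalently: CREATE INDEX idx_addr ON big_table (addr_z);
```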

So, maybe we can do the normalization in a similar way? But wait. You said the column was so big that you can't even get the table loaded? So we have to compress or normalize that column?

How will the load be done?

  • Multiple LOAD DATA calls (probably best)? Single-row INSERTs (change "2 days" to "2 weeks" at least)? Multi-row INSERTs (100-1000 is good)?
  • autocommit? Short transactions? One huge transaction (this is deadly)? (Recommend 1K-10K rows per COMMIT; see the sketch after this list.)
  • Single threaded (perhaps cannot go fast enough)? Multi-threaded (other issues)?
  • My discussion of high-speed-ingestion.
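As a reference point for the batching options above, a sketch (hypothetical names; batch sizes are illustrative):

```sql
-- Multi-row INSERTs, committing every few thousand rows:
SET autocommit = 0;

INSERT INTO big_table (addr_z) VALUES
  (COMPRESS('a@example.com')),
  (COMPRESS('b@example.com')),
  (COMPRESS('c@example.com'));
-- ...100-1000 rows per INSERT statement, and COMMIT once ~1K-10K rows are pending:
COMMIT;
```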

Or will the table be MyISAM? The disk footprint will be significantly smaller. Most of my other comments still apply.

Back to MD5/SHA2. Building the normalization table, assuming it is much bigger than can be cached in RAM, will be a killer, too, no matter how you do it. But, let's get some of the other details ironed out first.

See also TokuDB (available with newer versions of MariaDB) for good high-speed ingestion and indexing. TokuDB will slow down some for your table size, whereas InnoDB/MyISAM will slow to a crawl, as I already explained. TokuDB also compresses automatically; some say by 10x. I don't have any speed or space estimates, but I see TokuDB as very promising.

Plan B

It seems that the real problem is in compressing or normalizing the 'router address'. To recap: Of the 150 billion rows, there are about 15 billion distinct values, plus a small percentage of NULLs. The strings average 75 bytes. Compressing may be ineffective because of the nature of the strings. So, let's focus on normalizing.

The id needs to be at least 5 bytes (to handle 15B distinct values); the string averages 75 bytes. (I assume that is bytes, not characters.) Add on some overhead for BTree, etc, and the total ends up somewhere around 2TB.
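Roughly (a sketch of the arithmetic; the overhead multiplier is a loose assumption):

```latex
15 \times 10^{9} \times (75 + 5)\ \text{bytes} \approx 1.2\ \text{TB (raw)},
\qquad
1.2\ \text{TB} \times (1.5\text{ to }2\ \text{for BTree/row overhead}) \approx 2\ \text{TB}
```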

I assume the router addresses are rather random during the load of the table, so lookups for the 'next' address to insert is a random lookup in the ever-growing index BTree. Once the index grows past what can fit in the buffer_pool (less than 768GB), I/O will be needed more and more often. By the end of the load, approximately 3 out of 4 rows inserted will have to wait for a read from that index BTree to check for the row already existing. We are looking at a load time of months, even with SSDs.

So, what can be done? Consider the following:

  • Hash the address with MD5 and UNHEX it -- 16 bytes. Leave that in the table.
  • Meanwhile, write a file with the hex value of the MD5 plus the router address -- 150B lines (skipping the NULLs).
  • Sort that file on the MD5, with deduplication.
  • Build the normalization table from the sorted file (15B lines).

Result: The load is reasonably fast (but complex). The router address is not 75 bytes (nor 5 bytes), but 16. The normalization table exists and works.
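A sketch of that pipeline (file path and table/column names are hypothetical; the sort/de-dup step happens outside MySQL, e.g. with the Unix `sort` utility):

```sql
-- Normalization table, keyed by the same 16-byte MD5 that the big table stores
-- in its BINARY(16) column.
CREATE TABLE addr_norm (
  addr_md5 BINARY(16) NOT NULL PRIMARY KEY,
  addr     VARCHAR(255) NOT NULL
) ENGINE=InnoDB;

-- During the main load, each row stores UNHEX(MD5(addr)) and the client appends
-- "hex_md5<TAB>addr" to a flat file. After sorting that file on the hex MD5 with
-- de-duplication (e.g. `sort -u`), build the normalization table from it:
LOAD DATA INFILE '/path/to/md5_addr.sorted.dedup.tsv'
INTO TABLE addr_norm
FIELDS TERMINATED BY '\t'
(@hex_md5, addr)
SET addr_md5 = UNHEX(@hex_md5);
```

Because the file arrives pre-sorted on the PRIMARY KEY, the inserts are sequential rather than random, which is what keeps this step from thrashing the buffer pool.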

Rick James
  • I tried compressing the table but ingestion was too slow. We are doing LOAD DATA INFILE. I haven't tried compressing the field only though, this I will try. – L4zl0w Jul 12 '16 at 07:23
  • Also my approach is LOAD DATA INFILE, then build primary key index, then put this table into a partitioned table with EXCHANGE PARTITION. This will become the table with 150b rows. – L4zl0w Jul 12 '16 at 07:27
  • Looked at TokuDB and RocksDB but decided against them. – L4zl0w Jul 12 '16 at 07:27
  • Not `ROW_FORMAT=compressed`; instead compress that one field in the _client_. (I assume the client is on a different machine than mysqld?) – Rick James Jul 12 '16 at 15:08
  • If you are `LOADing` into InnoDB, it is better to (1) sort according to the desired PK, (2) have the PK defined, (3) do the load. What is the PK? – Rick James Jul 12 '16 at 15:10
  • Downvoting as indexing duplicates and compression techniques to reduce size is inefficient compared to storing a value once where it is duplicated and simply providing an integer index. – Datadimension Jul 12 '16 at 19:49
  • @DataMechanics I agree with what you are saying in the normal case. But I have 150 billion records, and if I want to load this data within a reasonable timeframe I have to bite the bullet and store duplicates. Rick has very valid points. – L4zl0w Jul 12 '16 at 20:28
  • A database doesn't know a 'normal' case -- no such thing. It processes as its underlying programming dictates. It's up to the design to complement what the specific database architecture offers. Maybe you could provide a sample row(s) to look at exactly what data is there. I previously inherited a project processing (relatively small in this case, 35M records) from a CSV dump to a database, which took 10 mins to locate a record in raw C. It took a redesign into about 5 relational tables which, with appropriate indexing, allowed PHP to use the data in a webpage in about 2s. Efficiency is designed IN. – Datadimension Jul 12 '16 at 20:47
  • @DataMechanics - 15B rows dedupped in the normalization table will blow out RAM and become I/O-bound before finishing the load. That is, the dataset is too big for "simply providing an integer index". – Rick James Jul 12 '16 at 21:59
  • Exactly, the dataset is too big. Way too big. While you give some slight (and wise) suggestions, this issue is still massive data. My point is that the dataset is stated as very much duplicated. How can you not agree that efficiency is best served by a value of say '3' instead of indexing a standard hash length of (at simplest) 40 chars -- CHARS, not an integer value? The time taken for a redesign is not asked here -- the question was EFFICIENCY -- so both query-processing and resource efficiency. Therefore duplication is NOT an option, and nor is hashing, when a plain value can be referenced relationally – Datadimension Jul 12 '16 at 22:09
  • So it will take time to redesign and optimise the data, but that is surely the choice of the designer -- more important should be the end result and the efficiency of accessing the data by each and every process. So yes, it depends on how much work the EU wants to do to get a better design or the best design. There is no 'shortcut' optimisation I am aware of. – Datadimension Jul 12 '16 at 22:14
  • I do agree that "3" is better. That's a given. Beyond that, I am looking at how difficult it will be to get there. – Rick James Jul 12 '16 at 23:47
  • We have already tested breaking out the entity, and it would simply take too long to insert the data, i.e. over 100 days. We also tested what duplicated records would mean in storage. We are doing UNHEX(SHA2(string,224)) and storing it in BINARY(28). This is the best approach for us so far. In terms of indexes, we don't have a PRIMARY key as we will only need to join this table on this duplicated field. What we've done, though, is break the table into separate tables and the tables into partitions. – L4zl0w Jul 13 '16 at 06:44
  • The string we are talking about is a mail router address in the format string1@string2; both strings can have different lengths and very random values. – L4zl0w Jul 13 '16 at 06:47
  • Thanks a lot for your help Rick, I think I have enough to go with. – L4zl0w Jul 13 '16 at 22:28
-2

You state it's highly duplicated? My first thought would be to create another table with the actual varchar value and a primary int key pointing to this value.

Then the existing table can simply change to contain, as a foreign key, the reference to this value (and additionally be efficiently indexable).
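A minimal sketch of that layout (hypothetical names; note Rick James's comment below that the FOREIGN KEY constraint itself adds overhead, so an indexed plain column may be preferable):

```sql
-- Dictionary table: each distinct string stored once.
CREATE TABLE addr_lookup (
  id   BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,   -- BIGINT: 15B values exceed INT's range
  addr VARCHAR(255) NOT NULL,
  UNIQUE KEY (addr)
) ENGINE=InnoDB;

-- The big table stores only the integer reference.
CREATE TABLE main_table (
  addr_id BIGINT UNSIGNED NULL,   -- NULL when the source field is NULL
  -- ...other columns...
  INDEX (addr_id)
  -- optionally: , FOREIGN KEY (addr_id) REFERENCES addr_lookup(id)
) ENGINE=InnoDB;
```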

Datadimension
  • We've tried this but we then need to look up the int key when we ingest the current table which dramatically slows down the ingestion process. – L4zl0w Jul 09 '16 at 16:46
  • Then your ingestion process is badly flawed. For a start, if the database has duplicate values then it needs a better design. Reducing space by compression/encryption(?) methods multiple times for the same value will do nothing other than needlessly increase ingestion time. You will never get a compression/encryption algorithm that is quicker than simply using an int as a value -- it's impossible. – Datadimension Jul 09 '16 at 17:11
  • For example I can copy and paste this entire page into a database TEXT field and reference it by another integer field indexed by a single integer. The TEXT field is stored ONCE. All references are a single indexed integer. You will not improve on that for space or efficiency. – Datadimension Jul 09 '16 at 17:21
  • Could you give an example of what you said in your answer? Do I need to first INSERT all the unique occurrences of this VARCHAR into a separate table and then, when I INSERT into my main table, look the integer value up and insert it into the main table? Or is there a better way of doing it? – L4zl0w Jul 09 '16 at 17:37
  • This is getting silly -- it's basic relational database design -- how can I give an example when you have provided none of the required varchar fields as an example yourself? Hence downvoting. – Datadimension Jul 09 '16 at 17:40
  • Lol. Why are you here if you don't want to help? – L4zl0w Jul 09 '16 at 17:41
  • It's not that I don't want to help -- indeed, I have given you standard database design info -- but you are looking at this incorrectly; interestingly, this is agreed by your second answer from Bill Karwin. We both agree on the answer. Indeed, if you encrypt it you will create overhead but also not be able to retrieve the varchar without even more unnecessary overhead by having to decrypt it for each duplicate entry. If you hash it you will not get the value back. The downvote therefore still stands. – Datadimension Jul 09 '16 at 17:48
  • @DataMechanics - Encryption does not shrink the size. It is of no use in this context. MD5 is a "cryptographic hash function". – Rick James Jul 12 '16 at 01:10
  • `FOREIGN KEY` constraints add overhead. I recommend not using them for this application. – Rick James Jul 12 '16 at 01:13