Index on column with 70% of empty values: Use null or empty value?

Question

We are currently optimizing a MySQL table (InnoDB) that will eventially have more than 100 million rows.

In one column, we are storing IP addresses (VARCHAR 45). We need to put an index on this column, as we have to be able to retrieve all rows per specific IP address.

70% of all rows, however, will not store an IP address (empty).

Our question: Shall we store those empty values as NULL and thus ALLOW NULL on this column (will add 1 byte to each row). Or shall we NOT ALLOW NULL and store those empty values as '' (empty string)?

What is best for performance?

We will never have to search rows that are empty (= '') or null (IS NULL), only search for specific IP addresses (= '123.456.789.123').

Update: There are indeed many questions on SO that address similar scenarios. However, some answers seem to be contradictory or say "it depends". We will run some tests and post our findings for our specific scenario here.

I would imagine the empty string would be slightly more performant purely as it uses less storage space. The index would be basically the same either way. The best solution is the one you TEST and verify is quicker — Grantly, Dec 19 '15 at 14:24
Possible duplicate of [MySQL: NULL vs ""](http://stackoverflow.com/questions/1106258/mysql-null-vs) — Shadow, Dec 19 '15 at 14:30
@Shadow Yes, seems like a similar question - but at first glance it seems to me the two highest-scoring answers say the opposite? One says "use null", the other one says "don't use null!". — Mörre, Dec 19 '15 at 14:32
The 2 highest scoring answers actually don't say definite yes or no. The 3rd answer is definite about indexing. — Shadow, Dec 19 '15 at 14:38
@Lionel why don't you make the IP address field varbinary(16) and use mysql's inet6_aton() function to convert string representations into binary representations? http://dev.mysql.com/doc/refman/5.7/en/miscellaneous-functions.html#function_inet6-aton — Shadow, Dec 19 '15 at 14:42
@Shadow Because we are stuck with MySQL 5.6 at the moment. Also, we want the IPs to remain human readible. We thus decided to go with VARCHAR for now. — Lionel, Dec 19 '15 at 14:44
@Shadow The 3rd answer in the article says "However, if you’re planning to index columns, avoid making them nullable if possible. There are exceptions, of course. For example, it’s worth mentioning that InnoDB stores NULL with a single bit, so it can be pretty space-efficient for sparsely populated data." So it seems to still not be clear in our case. — Lionel, Dec 19 '15 at 14:46
@Lionel 1. Inet6_aton() is available in v5.6 and using inet6_ntoa() you can easily convert the numeric form back to human readable. 2. In optimization related questions you very rarely get straight answers. You are not going to get one here either. The other topic lists all points you need to consider, then you need to evaluate in your specific environment and with your data what works better. 3. What is more important to you: speed or data storage? — Shadow, Dec 19 '15 at 14:51
@Shadow 1. Yes, it is indeed available starting from MySQL 5.6.3, you are right. We are stuck at 5.6.19 though. 2. Ok, we will do some tests and post our results to our specific case here. Thanks! — Lionel, Dec 19 '15 at 14:56
Do check InnoDB vs MyISAM as they are handling the NULL's differently — Lavi Avigdor, Dec 21 '15 at 12:15

score 2 · Answer 1 · answered Dec 19 '15 at 23:42

2

VARCHAR(39) is sufficient for both IPv4 (the old format, for which there are no more values available) and IPv6.

The optimizer may screw up if 70% of the values are the same ('' or NULL). I suggest you have another table with the IP and an ID for JOINing back to your original table. By having no 'empty' IPs in the second table, the optimizer is more likely to "do the right thing".

With that, LEFT JOIN can be used to see if there is an IP.

IPv6 can be stored in BINARY(16) to save space.

answered Dec 19 '15 at 23:42

Rick James

135,179
13
127
222

If the second table didn't have any empty IP addresses you would have to use null as the foreign key to it, which gets you back where you started. – user207421 Dec 20 '15 at 01:20
That's an argument against FKs. They aren't useful in all situations. – Rick James Dec 20 '15 at 22:32
1

@EJP You misunderstood. Rick is suggesting a has one relationship, where the new table has a reference back to the original. There would be no IP or IP_id column in the original table. – Arth May 24 '16 at 12:29

score 1 · Answer 2 · answered Mar 25 '16 at 16:56

1

Go with NULL values. InnoDB has no space cost for NULLs, and NULL values are excluded from indexes, so you'll have a faster index lookup for the values which are present.

As far as how you store the IP itself (string verus number), that seems like a far less important point of optimization.

answered Mar 25 '16 at 16:56

manchicken

156
6

1

Interesting claim, but without proper references, it's for me hard approve. – Jonathan Parent Lévesque Jul 11 '18 at 14:01

score 0 · Answer 3 · answered Dec 21 '15 at 12:04

The main difference between NULL and an empty string is related to comparing values. Two empty strings are considered equal. Two NULL values are not. For example, if you want to join two tables based on IP-value columns, the result will be quite different for NULL and empty strings, and most likely you want the behavior of NULL.

If you only are going to search for specific IP-adresses, using NULL or empty string should not matter. If the IP-value column is indexed, the optimizer will obtain an estimate from InnoDB on the number of rows with the specific value. The general statistics on number of rows per value will not be used in this case.

Avoiding NULL values will save you 30 MB on 100 million rows when 70% of the rows are NULL. (For rows where the value is an empty string, you will not save any space since you will need one byte to store the length information instead.) Compared to what you can save by storing IP values as a binary string, this is nothing, and I do not think storage overhead is a valid concern.

The space cost of `NULL` values is only relevant in MyISAM. InnoDB has no space cost for `NULL`s. — manchicken, Mar 25 '16 at 16:54
InnoDB row headers contains a bit vector over columns that are NULL. If there are no NULL columns, the row header will not contain this bit vector. Hence, a table without NULL columns will use 1 byte less per row than the same table with 1-8 NULL columns. See https://dev.mysql.com/doc/refman/5.7/en/innodb-physical-record.html — Øystein Grøvlen, Apr 05 '16 at 07:32

Index on column with 70% of empty values: Use null or empty value?

3 Answers3