26

We've got a healthy debate going on in the office this week. We're creating a Db to store proxy information, for the most part we have the schema worked out except for how we should store IPs. One camp wants to use 4 smallints, one for each octet and the other wants to use a 1 big int,INET_ATON.

These tables are going to be huge so performance is key. I am in middle here as I normally use MS SQL and 4 small ints in my world. I don't have enough experience with this type of volume storing IPs.

We'll be using perl and python scripts to access the database to further normalize the data into several other tables for top talkers, interesting traffic etc.

I am sure there are some here in the community that have done something simular to what we are doing and I am interested in hearing about their experiences and which route is best, 1 big int, or 4 small ints for IP addresses.

EDIT - One of our concerns is space, this database is going to be huge like in 500,000,000 records a day. So we are trying to weigh the space issue along with the performance issue.

EDIT 2 Some of the conversation has turned over to the volume of data we are going to store...that's not my question. The question is which is the preferable way to store an IP address and why. Like I've said in my comments, we work for a large fortune 50 company. Our log files contain usage data from our users. This data in turn will be used within a security context to drive some metrics and to drive several security tools.

brian d foy
  • 129,424
  • 31
  • 207
  • 592
OhioDude
  • 1,060
  • 2
  • 12
  • 16
  • 5
    @OhioDude: As for now, `Google` gets about `200,000,000` hits a day. I think preparing for `500,000,000` hits a day is exactly what's called "premature optimization". – Quassnoi Jul 10 '09 at 12:09
  • 1
    Yea, but one hit to a web page does not produce one line in a log file. Each image and resource on the web-site has it's own line in the proxy logs. – OhioDude Jul 10 '09 at 12:29
  • 1
    Even if you have 100 image/css/javascript files per page (not considering that those kind of resources tend to be heavily cached by the browser), you'd still be only one or two orders of magnitude away from Google. Storing IP numbers would be one of your smaller problems... – Benjamin Wohlwend Jul 10 '09 at 12:37
  • This doesn't sound like the sort of thing that SQL databases are the best choice for. Can you give any information about why you want to use an SQL database as opposed to some other storage method? – SpoonMeiser Jul 10 '09 at 12:45
  • 1
    @SpoonMeiser: 500M rows a day makes about 6K rows a second which is not that impossible, provided they are batched right. – Quassnoi Jul 10 '09 at 12:52
  • Trust me on the volume guys. Consider all the proxy logs for a fortune 50 company all coming together in one single place. The logs we are getting are a day old. Once we get them we do some awk vodoo to clean them up and normalize them as much as possible. Then off to perl to write into the database. – OhioDude Jul 10 '09 at 14:25
  • @piquadrat - we're not storing hits on an internal site, we're storing proxy data from everyone in our company's web activity to external sites. Keep in mind, I work for a large fortune 50 company with offices all over the world. – OhioDude Jul 10 '09 at 14:30

7 Answers7

28

I would suggest looking at what type of queries you will be running to decide which format you adopt.

Only if you need to pull out or compare individual octets would you have to consider splitting them up into separate fields.

Otherwise, store it as a 4 byte integer. That also has the bonus of allowing you to use the MySQL built-in INET_ATON() and INET_NTOA() functions.

Performance vs. Space

Storage:

If you are only going to support IPv4 addresses then your datatype in MySQL can be an UNSIGNED INT which only uses 4 bytes of storage.

To store the individual octets you would only need to use UNSIGNED TINYINT datatypes, not SMALLINTS, which would use up 1 byte each of storage.

Both methods would use similar storage with perhaps slightly more for separate fields for some overhead.

More info:

Performance:

Using a single field will yield much better performance, it's a single comparison instead of 4. You mentioned that you will only run queries against the whole IP address, so there should be no need to keep the octets separate. Using the INET_* functions of MySQL will do the conversion between the text and integer representations once for the comparison.

Nisse Engström
  • 4,738
  • 23
  • 27
  • 42
Andre Miller
  • 15,255
  • 6
  • 55
  • 53
  • Most of our queries will pull aggregates of the ip addresses. For example, how may distinct IP address hit this remote site. Or how many sites did this specific host hit within this 8 hour period. – OhioDude Jul 10 '09 at 11:42
  • If you store it as a 4 byte integer, be sure not to hard code that assumption anywhere else. Why? [IPv6](https://stackoverflow.com/questions/420680/how-to-store-ipv6-compatible-address-in-a-relational-database). Make sure you have a clear upgrade path. – Schwern Jan 01 '15 at 18:35
14

A BIGINT is 8 bytes in MySQL.

To store IPv4 addresses, an UNSINGED INT is enough, which I think is what you shoud use.

I can't imagine a scenario where 4 octets would gain more performance than a single INT, and the latter is much more convenient.

Also note that if you are going to issue queries like this:

SELECT  *
FROM    ips
WHERE   ? BETWEEN start_ip AND end_ip

, where start_ip and end_ip are columns in your table, the performance will be poor.

These queries are used to find out if a given IP is within a subnet range (usually to ban it).

To make these queries efficient, you should store the whole range as a LineString object with a SPATIAL index on it, and query like this:

SELECT  *
FROM    ips
WHERE   MBRContains(?, ip_range)

See this entry in my blog for more detail on how to do it:

John Topley
  • 113,588
  • 46
  • 195
  • 237
Quassnoi
  • 413,100
  • 91
  • 616
  • 614
5

Use PostgreSQL, there's a native data type for that.

More seriously, I would fall into the "one 32-bit integer" camp. An IP address only makes sense when all four octets are considered together, so there's no reason to store the octets in separate columns in the database. Would you store a phone number using three (or more) different fields?

Greg Hewgill
  • 951,095
  • 183
  • 1,149
  • 1,285
3

Having seperate fields doesn't sound particularly sensible to me - much like splitting a zipcode into sections or a phone number.

Might be useful if you wanted specific info on the sections, but I see no real reason to not use a 32 bit int.

Rich Bradshaw
  • 71,795
  • 44
  • 182
  • 241
3

for both ipv4 and ipv6 compatibility, use VARBINARY(16) , ipv4's will always be BINARY(4) and ipv6 will always be BINARY(16), so VARBINARY(16) seems like the most efficient way to support both. and to convert them from the normal readable format to binary, use INET6_ATON('127.0.0.1'), and to reverse that, use INET6_NTOA(binary)

hanshenrik
  • 19,904
  • 4
  • 43
  • 89
1

Efficient transformation of ip to int and int to ip (could be useful to you): (PERL)

sub ip2dec {
    my @octs = split /\./,shift;
    return ($octs[0] << 24) + ($octs[1] << 16) + ($octs[2] << 8) + $octs[3];
}

sub dec2ip {
    my $number = shift;
    my $first_oct = $number >> 24;
    my $reverse_1_ = $number - ($first_oct << 24);
    my $secon_oct = $reverse_1_ >> 16;
    my $reverse_2_ = $reverse_1_ - ($secon_oct << 16);
    my $third_oct = $reverse_2_ >> 8;
    my $fourt_oct = $reverse_2_ - ($third_oct << 8);
    return "$first_oct.$secon_oct.$third_oct.$fourt_oct";
}
user105033
  • 18,800
  • 19
  • 58
  • 69
  • 2
    I didn't give you the downvote, but I guess its a bit superfluous to do this in Perl if his database already has this functionality. – Andre Miller Jul 10 '09 at 15:50
0

Old thread, but for the benefit of readers, consider using ip2long. It translates ip into an integer.

Basically, you will be converting with ip2long when storing into DB then converting back with long2ip when retrieving from DB. The field type in DB will INT, so you will save space and gain better performance compared to storing ip as a string.

Roger
  • 63
  • 6