-1

I have a .txt file, It contains the ipv6 address range and some other information, like this:

minip                     maxip                  border      ...
2001::  2001:0:ffff:ffff:ffff:ffff:ffff:ffff    domestic    nan Hurricane Electric LLC  nan nan nan nan nan nan
2001:1::    2001:1:ffff:ffff:ffff:ffff:ffff:ffff    outbound    nan nan nan nan nan nan nan nan
2001:2::    2001:3:ffff:ffff:ffff:ffff:ffff:ffff    outbound    nan nan nan nan nan nan nan nan
2001:4::    2001:4:ff:ffff:ffff:ffff:ffff:ffff  outbound    nan nan nan nan nan nan nan nan
2001:4:1000::   2001:4:1fff:ffff:ffff:ffff:ffff:ffff    outbound    nan nan nan nan nan nan nan nan

It contains millions of records, I want to create a hash map use the records. And other modules can query the hash table and get corresponding information by given an ipv6 address. The problem is How can i generate hash key from the minip and maxip, eg: minip = 2001::, maxip = 2001:0:ffff:ffff:ffff:ffff:ffff:ffff, and the query ipv6 address is 2001:0:50:124:70:65:80:214. If I generate the hash key with high 4 Bytes, there can be a long chain of conflicts and the query efficiency is low. eg:

2001:250:2::    2001:250:2:ffff:ffff:ffff:ffff:ffff domestic    nan  nan  nan  nan  86  110000  110000  51
2001:250:3::    2001:250:3:ffff:ffff:ffff:ffff:ffff domestic    nan  nan  nan  nan  86  110000  110000  51
2001:250:4::    2001:250:4:ffff:ffff:ffff:ffff:ffff domestic    nan  nan  nan  nan  86  110000  110000  51
2001:250:5::    2001:250:5:ffff:ffff:ffff:ffff:ffff domestic    nan  nan  nan  nan  86  110000  110000  51
2001:250:6::    2001:250:7:ffff:ffff:ffff:ffff:ffff domestic    nan  nan  nan  nan  86  110000  110000  51
2001:250:8::    2001:250:f:ffff:ffff:ffff:ffff:ffff domestic    nan  nan  nan  nan  86  110000  110000  51
2001:250:10::   2001:250:1f:ffff:ffff:ffff:ffff:ffff    domestic    nan  nan  nan  nan  86  110000  110000  51
2001:250:20::   2001:250:3f:ffff:ffff:ffff:ffff:ffff    domestic    nan  nan  nan  nan  86  110000  110000  51
2001:250:40::   2001:250:5f:ffff:ffff:ffff:ffff:ffff    domestic    nan  nan  nan  nan  86  110000  110000  51

Also, I can't use every class C network number as a Hash key like IPv4 does. IPv6 address is 128 bits, and ipv6 does not have a clear network division. If you use the first 8 Bytes as network numbers, there may be many network numbers in each record belonging to [minip, maxip]. At the same time, this does not guarantee that the conflict chain is relatively short, and will cause a lot of memory consumption. How can I design a hash map so that memory consumption is within acceptable limits and queries are as efficient as possible?

Da Wang
  • 100
  • 7
  • "_ipv6 does not have a clear network division_" That is backwards. IPv4 does not have a clear network division, but IPv6 networks (with a couple of exceptions for loopbacks and point-to-point links) use `/64` networks (64-bit Network and 64-bit IID). – Ron Maupin Aug 26 '23 at 14:26
  • @Da Wang, IMHO, treat the address as a 128-bit unsigned (or `^` the 2 halves to form a 64-bit one) and then `%` by the hash table size - be sure the size is a [prime](https://stackoverflow.com/a/32915815/2410359). – chux - Reinstate Monica Aug 26 '23 at 17:22
  • "_Also, I can't use every class C network number as a Hash key like IPv4 does._" Network address classes are dead (please let them rest in peace), killed in 1993 (two years _before_ the commercial Internet in 1995) by RFCs1517, 1518, and 1519, which defined CIDR (_Classless_ Inter-Domain Routing). There have been no network address classes in this century. Also, the Class C range was only a small part of the IPv4 address range (`192.0.0.0` to `223.255.255.255`), so how do you hash other IPv4 addresses? – Ron Maupin Aug 26 '23 at 19:32
  • You seem to have some IP address misinformation. Also, remember that you are discussing the text representations of IP addresses, not the actual IP addresses, which are 32-bit unsigned integers for IPv4 and 128-bit unsigned integers for IPv6. It is perfectly easy to hash string text as a key in a hash table. – Ron Maupin Aug 26 '23 at 19:36
  • @ Ron Maupin "Also, I can't use every class C network number as a Hash key like IPv4 does." means that for ipv4 address, we use the higher three bytes serve as the hash key – Da Wang Aug 27 '23 at 05:20
  • @RonMaupin yes, i treat ip addresses as as 32-bit unsigned integers for IPv4 and 128-bit unsigned integers for IPv6. not the text representations of IP addresses. – Da Wang Aug 27 '23 at 05:23

1 Answers1

1

I think you're trying to use the IP address directly as a hash index. As you surmised, this might be doable for ipv4, but with ipv6 it becomes unwieldy.

I'd treat the ipv6 address (16 bytes) as a fixed 16 byte string, not worrying [too much] about which part is the network part and the address part.

Then, the problem is [somewhat] similar to a word/dictionary lookup.

Just so we're on the same page, here is one of my cs50 speller answers: cs50 pset5 Speller optimisation

Note that my final version there uses an allocation pool of node descriptors. This helps minimize memory fragmentation.

And, each node retains the hash value, so when searching the linked list of nodes for a given hash table entry, it first compares the 32 bit hash value rather than the full string, only comparing the full string if the hash values match.


The hash function in the link is somewhat simple. The quality of the hash function is key: we want a hash that evenly distributes the values across the buckets. And, we want the hash function to be fast.

A very useful resource was eternallyconfuzzled.com It no longer exists but is archived:

  1. Tutorials
  2. Hash Tables
  3. Introduction to Hashing

One of the better hash functions I've found comes from perl [as it does a lot of hashing, it needs to be good]. It is actually the one-at-a-time hash (by Bob Jenkins). And, it is considered to be state of the art. Here is a reimplementation from some of my own code:

// strhash -- calculate a string hash
uint32_t
strhash(const void *bf,size_t bflen)
{
    const uint8_t *bp;
    const uint8_t *bfe;
    uint32_t hash;

    bp = bf;
    bfe = bp + bflen;

    hash = 0x37FB26C5;

    for (;  bp < bfe;  ++bp) {
        hash += *bp;
        hash += hash << 10;
        hash ^= hash >> 6;
    }

    hash += hash << 3;
    hash ^= hash >> 11;
    hash += hash << 15;

    return hash;
}

All of the above uses a single hash table lookup.

We might do better with nested hash tables. A top hash table using the network part. Then, each network entry points to a separate hash table for node part.

Then, each hash node only needs to hold the hash entry and an 8 byte value (vs. a 16 byte value).


UPDATE:

thanks! But eg: minip = 2001::,maxip = 2001:0:ffff:ffff:ffff:ffff:ffff:ffff, and the query ip address = 2001:0:50:124:70:65:80:214. It is observed that query ip address Is within the scope of [minip, maxip]. if i treat the ipv6 address (16 bytes) as a fixed 16 byte string,Can I query the result correctly? “2001:0:50:124:70:65:80:214” is not match "2001::" or "2001:0:ffff:ffff:ffff:ffff:ffff:ffff" . – Da Wang

Remember that the 16 byte "fixed string" that I was referring to is the binary address (e.g. returned by inet_pton) and not the ASCII [human readable] representation from the file (e.g. 2001::).

If we need to match on an IP address range, we can not use a hash lookup.

We'd need to use another organization. Probably a balanced binary tree approach (e.g. AVL or red-black).

Personally, I just use red-black trees because they work well even with a lot of dynamic updates to the tree. Supposedly, AVL trees are better if the actual tree is initialized once and not updated dynamically.

But, it either case, you'll need to read in the file data and put it into some sort of struct. Then, you'll need to provide a compare function to the tree implementation.

Here's a sample implementation:

struct iprange {
    uint8_t minip[16];
    uint8_t maxip[16];
    byte isaddr;
    ...
};

// ipcmp -- compare IP addresses against range
// RETURNS: -1=addr<minip, 0=match, +1=addr>maxip
int
ipcmp(const struct iprange *addr,const struct iprange *rnge)
// addr -- pointer to address we wish to find the range for
// rnge -- pointer to address range
{
    int cmp;

    do {
        // compare address against lowest address of the range
        cmp = memcmp(addr->minip,rnge->minip,16);
        if (cmp < 0)
            break;

        // compare address against highest address of the range
        cmp = memcmp(addr->minip,rnge->maxip,16);
        if (cmp > 0)
            break;

        // match -- we are within the specified range
        cmp = 0;
    } while (0);

    return cmp;
}
Craig Estey
  • 30,627
  • 4
  • 24
  • 48
  • thanks! But eg: minip = 2001::,maxip = 2001:0:ffff:ffff:ffff:ffff:ffff:ffff, and the query ip address = 2001:0:50:124:70:65:80:214. It is observed that query ip address Is within the scope of [minip, maxip]. if i treat the ipv6 address (16 bytes) as a fixed 16 byte string,Can I query the result correctly? “2001:0:50:124:70:65:80:214” is not match "2001::" or "2001:0:ffff:ffff:ffff:ffff:ffff:ffff" . – Da Wang Aug 27 '23 at 05:45