6

I have a list of memory addresses from 0xc0003000 to 0xc04a0144 there are many gaps and < 4096 entries in the list. It is known at compile time and I want to make a perfect hash for it.

However looking up perfect hashing online gives me information mostly related to hashing strings and they don't seem to translate well.

To be clear I want to be able to obtain the memory address at run time and check that it is in the hash quickly. Currently I am using a binary search which is on average about 8 loops to find the answer.

Any ideas what tree I should bark up?

Digital Powers
  • 460
  • 6
  • 23
  • How about balanced trees, like B-tree or red-black ? – Rsh Jul 26 '12 at 20:23
  • I think the radix tree is the best search tree for sparse integer value search. – Dmitry Poroh Jul 26 '12 at 20:24
  • 4
    Can't you just use the address itself as a 4-byte string? gperf has support for working on non-NUL-terminated strings: http://www.gnu.org/software/gperf/manual/gperf.html#Binary-Strings – Alan Curry Jul 26 '12 at 20:28
  • I have a list of integer values and I want to check if a run time generated value is in that list. I will look at a radix tree, bitset is c++ this is strait c. I believe a balanced tree is the same as a binary search for performance, which is what i am hoping to beat. – Digital Powers Jul 26 '12 at 20:52
  • 2
    `gperf` sounds like exactly what you need. – pmr Jul 26 '12 at 20:54
  • Radix tree works good with memory pages (all addresses within the page have the same address prefix). – Dmitry Poroh Jul 26 '12 at 21:13
  • I am trying to use gperf in a binary manner but I don't see how that is supposed to be done, any ideas? the -l flag doesnt seem to do what I expect – Digital Powers Jul 26 '12 at 22:49
  • I've done a little test run of the `gperf -l` approach. I got it to work, but I'm not sure it'll be what you want. It does extra comparisons because it expects to work with strings of various lengths. There doesn't seem to be a way to tell it that the string will always be exactly 4 bytes. It might still be more efficient than some of the alternatives though. Should I post my code as an answer? – Alan Curry Jul 27 '12 at 03:13
  • yeah please do, I got some okish results from -l but the problem I was running into was that my binary data occasionally contains 0x0a which is a newline and it sees that as a delimiter. I tried changing the delimiter with the -e flag to no avail. – Digital Powers Jul 27 '12 at 15:04
  • As a side note I am going to give a radix tree a shot, i will benchmark and see how the performance compares to the binary search. – Digital Powers Jul 27 '12 at 16:04

1 Answers1

3

Here's a sample gperf program. I included a NUL and a newline in the sample data to prove that they don't cause it to fail.

%{
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <inttypes.h>
#include <arpa/inet.h>
%}
%%
"\xc0\x01\x02\x03"
"\xc0\xff\xff\xff"
"\xc0\xff\x00\xff"
"\xc0\x0a\xff\xff"
%%
int main(int argc, const char **argv)
{
    int i;

    for(i=1;i<argc;++i) {
        uint32_t addr = ntohl(strtoul(argv[i], 0, 16));
        if(in_word_set((char *)&addr, 4))
            printf("0x%08"PRIx32" is in the list.\n", htonl(addr));
        else
            printf("0x%08"PRIx32" is not in the list.\n", htonl(addr));
    }
    return 0;
}

Save as addrs.gperf, compile and test with

gperf -l addrs.gperf > addrs.c
gcc addrs.c -o addrs
./addrs c0000000 c0010203 c0ffffff c00affff c0ff0aff c0ffff00 c0ff00ff
Alan Curry
  • 14,255
  • 3
  • 32
  • 33
  • It would look a lot cleaner, and run a little bit faster, if gperf was actually designed to be used for this purpose. – Alan Curry Jul 27 '12 at 20:41
  • 1
    This works well for what I was doing and is about 40% faster than the binary search (10,000,000 loops). The radix tree ended up being about equal to the binary search, it was marginally better. – Digital Powers Jul 27 '12 at 21:50