0

Given a set of byte-representable symbols (e.g. characters, short strings, etc), is there a way to find a mapping from that set to a set of consecutive natural numbers that includes 0? For example, suppose there is the following unordered set of characters (not necessarily in any particular character set).

'a', '(', ''

Is there a way to find a "hash" function of sorts that would map each symbol (e.g. by means of its byte representation) uniquely to one of the integers 0, 1, and 2, in any order? For example, 'a'=0, '('=1, ''=2 is just as valid as 'a'=2, '('=0, ''=1.

Why?

Because I am developing something for a memory-constrained (think on the order of kiB) embedded target that has a lot of fixed reverse-lookup tables, so something like std::unordered_map would be out of the question. The ETL equivalent etl::unordered_map would be getting there, but there's quite a bit of size overhead, and collisions can happen, so lookup timings could differ. A sparse lookup table would work, where the byte representation of the symbol would be the index, but that would be a lot of wasted space, and there are many different tables.

There's also the chance that the "hash" function may end up costing more than the above alternatives, but my curiosity is a strong driving force. Also, although both C and C++ are tagged, this question is specific to neither of them. I just happen to be using C/C++.

Mona the Monad
  • 2,265
  • 3
  • 19
  • 30
  • 2
    What is "C/C++"? – Evg Sep 01 '22 at 02:42
  • How big is the range of possible values for your symbols, and how big is the range for your possible mapped integers? Are you just mapping between [0,# of symbols)? Because, depending on the ratio of those two ranges, you might just want to use modulo (assuming the entire range is filled, or at least uniformly sparse) – William Fleetwood Sep 01 '22 at 02:59
  • *"is there a way to find a mapping"* - you are wording this like the mapping is something to "find" rather than just create arbitrarily. You have a finite set of elements, it will always be possible to create a mapping of such kind. The problem is the complexity you want this function to have. If you want it to be O(1) that'd be pretty hard to achieve, and in general impossible if you cannot use hashtables/maps. Not sure you can do much better than an `unordered_map`. – Marco Bonelli Sep 01 '22 at 03:31
  • 3
    If you know them all ahead of time you can use something like `gperf` to generate a perfect hash table with no collisions. – Shawn Sep 01 '22 at 03:38
  • Or maybe use a radix tree, ham trie or other bit based lookup structure. – Shawn Sep 01 '22 at 03:42
  • Looks like `gperf` is pretty much exactly what I'm looking for. I'd have to play around with it to see whether it can generate something appropriate for my target, though. – Mona the Monad Sep 01 '22 at 13:38
  • @Evg By "C/C++" I mean "C but a little C++ for some language conveniences". – Mona the Monad Sep 01 '22 at 13:57
  • My question might be a duplicate of https://stackoverflow.com/questions/16141178/is-it-possible-to-map-string-to-int-faster-than-using-hashmap, though the `asso_values` table in that one seems a bit large. Then again, that OP has strings, and I have fixed-length "keys". – Mona the Monad Sep 01 '22 at 15:02

2 Answers2

0

The normal way to do things like this, for example when coding a font for a custom display, is to map everything to a sorted, read-only look-up table array with indices 0 to 127 or 0 to 255. Where symbols corresponding to the old ASCII table are mapped to their respective index. And other things like your banana symbol could be mapped beyond index 127.

So when you use FONT [97] or FONT ['a'], you end up with the symbol corresponding to 'a'. That way you can translate from ASCII strings to your custom table, or from your source editor font to the custom table.

Using any other data type such as a hash table sounds like muddy program design to me. Embedded systems should by their nature be deterministic, so overly complex data structures don't make sense most of the time. If you for some reason unknown must have the data unordered, then you should describe the reason why in detail, or otherwise you are surely asking an "XY question".

Lundin
  • 195,001
  • 40
  • 254
  • 396
  • I envision the "hash table" in an ideal case to be only as large as `N * sizeof(table_element)`, where `N` is the symbol count. I used quotes there because it'd be more like passing a symbol (not necessarily a 8-bit character) to a function to get an index into that table. I could have used a `switch` and called it a day, because timing isn't *so* strict that I can't afford variable-time lookup like that, but I was curious whether constant-time lookup would be possible. `gperf` looks promising, but I'd have to see whether it'd be more overheady than keeping it simple by using a `switch`. – Mona the Monad Sep 01 '22 at 13:55
  • @MonatheMonad Again, _why_ do you need a huge dynamic table/ADT created in run-time? Instead of focusing on various solutions, focus on the actual problem you are trying to solve. – Lundin Sep 01 '22 at 14:01
  • I don't need it to be created at runtime (my bad if I made it sound that way), because all the "keys" are known ahead of time, and are fixed. If you're asking about what all this is for, it's a sort of state machine (technically a parser, but not really). – Mona the Monad Sep 01 '22 at 14:08
  • @MonatheMonad Any you can't use a sorted `const` array because...? – Lundin Sep 01 '22 at 14:10
  • Size? For example, if the symbol size were 2 bytes (which can happen), and the table element size were 16 bytes, that'd be `65536 * 16`, or `max_key * 16` to omit keys above the maximum. For a bunch of sparse tables with only a handful of possible values, that'd be huge waste of space (in fact, it wouldn't fit in memory). – Mona the Monad Sep 01 '22 at 14:24
  • I'm being vague about what the keys are because I don't know what they are, at the moment. I just know that they're few and fixed, but also that there are many separate symbol sets. – Mona the Monad Sep 01 '22 at 14:25
  • @MonatheMonad I still have no clue what you are trying to do, but it sounds like you are mixing up the problem of storing your data with the problem of storing some integer value corresponding to each data entry. You can have two tables. Index 0 in the first table contains magic number 12345, index 0 in the second table contains a banana. And there you have it, 12345 equals the banana. – Lundin Sep 01 '22 at 14:33
  • That would imply the need for a search, and therefore would be variable-time, no? I don't really know how else to describe the problem, other than that [minimal] perfect hashing seems very much like what I'm looking for, if it can be constant-time. – Mona the Monad Sep 01 '22 at 14:43
0

Yes, there is such a map. Just put all of them in an array of strings... then sort it, and make a function that searchs for the word in the array and returns the index in the array.

static char *strings[] = {
    "word1", "word2", "hello", "world", NULL, /* to end the array */
};

int word_number(const char *word)
{
    for(int i = 0; strings[i] != NULL; i++) {
        if (strcmp(strings[i], word) == 0)
            return i;
    }
    return -1;  /* not found */
}

The cost of this (in space terms) is very low (considering that the compiler assigning pointers can optimice string allocation based on common suffixes (making a string overlap others if it is a common suffix of them) and if you give the compiler an already sorted array of literals, you can use bsearch() algorithm (which is O(log(n)) of the number of elements in the table)

static char *strings[] = { /* this time sorted */
    "Hello",
    "rella", /* this and the next are merged into positions on the same literal below
              * This can be controlled with a compiler option. */
    "umbrella",
    "world"
};
const int strings_len = 4;
int string_cmp(const void *_s1, const void *_s2)
{
    const char *s1 = _s1, *s2 = _s2;
    return strcmp(s1, s2);
}

int word_number(const char *word)
{
    char *result = bsearch(strings, 4, sizeof *strings, string_cmp);
    return result ? result - strings : -1;
}

If you want a function that gives you a number for any string, and maps biyectively that string with that number... It's even easier. First start with zero. For each byte in the string, just multiply your number by 256 (the number of byte values) and add the next byte to the result, then return back that result once you have done this operation with every char in the string. You will get a different number for each possible string, covering all possible strings and all possible numbers. But I think this is not what you want.

super_long_integer char2number(const unsigned char *s)
{
    super_long_integer result = 0;
    int c;
    while ((c = *s++) != 0) {
        result *= 256;
        result += c;
    } 
    return result;
}

But that integer must be capable of supporting numbers in the range [0...256^(maximum lenght of accepted string)] which is a very large number.

Luis Colorado
  • 10,974
  • 1
  • 16
  • 31