3

I have fixed-length arrays like this:

char x[10] = "abcdefghij";   // could also contain non-printable char

How to efficiently test if this array is present or not in a fixed set of 10,000 arrays of the same length? (note: these arrays are either all of them binary data arrays, or all of them null-terminated strings, this is fixed at the beginning)

In Python, I would use a set/hashtable/dict and not a list (because of very fast O(1) lookup):

s = "abcdefghij"
S = {"foo1234567", "bar1234567", ..., "baz9876543"}
print(s in S)  # True or False

How to do the equivalent in C? (not C++)


Note: the linked question How to check if a string is in an array of strings in C? is about a naive way to do it, with no performance requirement (they loop over all strings and use strcmp). Here it's different since there are 10k arrays, one needs to use another method for performance (a hashtable maybe?).

Basj
  • 41,386
  • 99
  • 383
  • 673
  • 1
    There is no standard hashtable/set/dictionary in C or any similar alternative. You have a choice: a) implement it yourself from scratch (hashtable, trie, whatever); b) use an array of strings and manually check whether the needle is somewhere in it; c) find a library which provides a "set of strings" abstraction efficiently. – yeputons May 14 '22 at 17:49
  • Does this answer your question? [How to check if a string is in an array of strings in C?](https://stackoverflow.com/questions/13677890/how-to-check-if-a-string-is-in-an-array-of-strings-in-c) – niry May 14 '22 at 17:50
  • @niry No, this linked question is a naive way to do it, with no performance requirement (they loop over all strings and strcmp). Here it's really different, since there are 10k strings, one needs to use another method for performance (hashtable maybe?) – Basj May 14 '22 at 17:57
  • 1
    Note that `char x[10] = "abcdefghij";` lacks a `NUL` terminator, so it isn't a "string". It would have one if you didn't restrict the array size - don't do that. You can't use string handling functions on this array. – Weather Vane May 14 '22 at 17:59
  • [How should character arrays be used as strings?](https://stackoverflow.com/questions/58526131/how-should-character-arrays-be-used-as-strings) – Lundin May 14 '22 at 18:08
  • @WeatherVane Yes I know, I spoke about strings to simplify, but more generally it can be fixed-length char arrays (non necessarily printable characters). In the case of strings, of course, I use null termination. – Basj May 14 '22 at 18:17
  • If you have a sorted array, a binary search using `memcmp()` would be reasonably efficient. Or if you packed it down to 64 bits, more so. – Weather Vane May 14 '22 at 18:33
  • @Basj In the answers of https://stackoverflow.com/questions/838404/implementing-a-hashmap-in-c you might find some inspiration. – Franck May 14 '22 at 18:38
  • @WeatherVane Good idea. Generally it takes 128 bits, can we do an efficient binary search to compare if a given 128-bit element is present in a sorted array of 10k 128-bit elements? – Basj May 14 '22 at 18:38
  • 3
    A binary search of 10000 elements only needs about 13 data comparisons, so it's not going to make a huge difference. – Weather Vane May 14 '22 at 18:41
  • @WeatherVane Good idea. I think you can post an answer with this, I'll accept it. On 64bit architecture, can we compare 128 bit in just 1 function call? or does it have to loop over each char of the array? – Basj May 14 '22 at 18:43
  • @WeatherVane Note: some chars in the array can be `\0x00`, it's binary data, so it's not null-termination. – Basj May 14 '22 at 18:53
  • That's why I suggested `memcmp()` but you could make two 64-bits comparisons (of union members) without calling any functions. – Weather Vane May 14 '22 at 18:59
  • @WeatherVane I did it with `memcmp` in the meantime too, but I'm interested for your second solution (two 64 bit comparison of union members), would be great if you post an answer with this+binary search! – Basj May 14 '22 at 19:02
  • @WeatherVane I posted an answer, feel free to comment if you see a better way (without `memcmp`?) – Basj May 14 '22 at 20:17
  • If you are looking for specifically strings, maybe a [prefix-tree](https://stackoverflow.com/a/71867641/2472827) would be appropriate? – Neil May 18 '22 at 07:16
  • @Neil Yes good idea. Here I'm looking primarily for a general algorithm working for binary data arrays. – Basj May 18 '22 at 07:32

3 Answers3

2

Note about storing both null-terminated strings and binary data in an array

In your question, you specify that the fixed-length arrays could represent

  • null-terminated strings or
  • binary data.

If your program has no way of knowing which one of these two possibilities is represented by the array contents, then, if the array contents is supposed to represent a null-terminated string, you will have to fill the remainder of the array with null characters (for example using the function strncpy). It won't be sufficient to only write a single null terminating character.

This is important, because if you don't do this, your program will have no way of knowing how to interpret the remaining data after it encounters a byte with the value 0. It won't know whether to

  • interpret this byte as a null-terminating character of a string and ignore all remaining bytes of the string, or

  • interpret this byte as binary data and treat the remaining bytes also as binary data (i.e. not ignore the remaining bytes).

However, if you always fill all remaining bytes after a terminating null character of a string with null characters, then your program won't habe to worry about whether the array contents represents a string or binary data, because it can treat both the same way.

Therefore, in the remainder of my answer, I will assume that if the array contents represents a string, then all remaining bytes after the end of the string will be filled with bytes with the value 0. That way, I can assume that no bytes should ever be ignored.

Hash table solution

In Python, I would use a set/hashtable/dict and not a list (because of very fast O(1) lookup):

It is also possible to use a hash table in C, although you will have to program it yourself or use an already existing library. The C standard library does not provide any hashing functions.

The following code will randomly generate an array of 10,000 fixed-length (not null-terminated) strings of length 10 with the characters a to z, A to Z and 0 to 9, but it will also place three hard-coded words into the array in different places, so you can search for these words later, in order to test the search function. The program will then insert all words into a hash table, and then perform several hard-coded lookups into the hash table.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <string.h>
#include <stdbool.h>

#define NUM_ARRAYS 10000
#define ARRAY_LENGTH 10
#define HASH_TABLE_SIZE 10000

struct hash_table_entry
{
    struct hash_table_entry *next;
    unsigned char data[ARRAY_LENGTH];
};

//function prototype declarations
void fill_with_random_data( unsigned char data[NUM_ARRAYS][ARRAY_LENGTH] );
size_t hash_array( const unsigned char array[ARRAY_LENGTH] );
void build_hash_table_from_data( unsigned char data[][ARRAY_LENGTH], size_t data_length, struct hash_table_entry *hash_table[HASH_TABLE_SIZE] );
void cleanup_hash_table( struct hash_table_entry *hash_table[HASH_TABLE_SIZE] );
bool lookup( const unsigned char array[ARRAY_LENGTH], struct hash_table_entry *hash_table[HASH_TABLE_SIZE] );
bool verbose_lookup( const unsigned char array[ARRAY_LENGTH], struct hash_table_entry *hash_table[HASH_TABLE_SIZE] );

int main( void )
{
    //declare 2D array for the input data
    static unsigned char data[NUM_ARRAYS][ARRAY_LENGTH];

    //declare hash table and initialize all fields to zero
    static struct hash_table_entry *hash_table[HASH_TABLE_SIZE] = {NULL};

    //seed random number generator
    srand( (unsigned)time( NULL ) );

    //fill array "data" with random data
    printf( "Generating random data..." );
    fflush( stdout );
    fill_with_random_data( data );
    printf( "done\n" );
    fflush( stdout );

    //overwrite a few strings, so that we can search for
    //them later
    _Static_assert( NUM_ARRAYS > 500, "unexpected value" );
    _Static_assert( ARRAY_LENGTH == 10, "unexpected value" );
    memcpy( data[47], "abcdefghij", 10 );
    memcpy( data[218], "klmnopqrst", 10 );
    memcpy( data[419], "uvwxyz0123", 10 );

    //build hash table
    printf( "Building hash table..." );
    fflush( stdout );
    build_hash_table_from_data( data, NUM_ARRAYS, hash_table );
    printf( "done\n" );
    fflush( stdout );

    //perform lookups
    verbose_lookup( (unsigned char *)"uvwxyz0123", hash_table );
    verbose_lookup( (unsigned char *)"hsdkesldhg", hash_table );
    verbose_lookup( (unsigned char *)"abcdefghij", hash_table );
    verbose_lookup( (unsigned char *)"erlsodn3ls", hash_table );
    verbose_lookup( (unsigned char *)"klmnopqrst", hash_table );

    //cleanup
    printf( "Performing cleanup..." );
    fflush( stdout );
    cleanup_hash_table( hash_table );
    printf( "done\n" );
    fflush( stdout );
}

void fill_with_random_data( unsigned char data[NUM_ARRAYS][ARRAY_LENGTH] )
{
    static const unsigned char random_chars[62] =
        "abcdefghijklmnopqrstuvwxyz"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "0123456789";

    for ( size_t i = 0; i < NUM_ARRAYS; i++ )
    {
        for ( size_t j = 0; j < ARRAY_LENGTH; j++ )
        {
            data[i][j] = random_chars[rand()%sizeof random_chars];
        }
    }
}

//This is a simple hash function. Depending on the type of data, a
//more complex hashing function may be required, in order to prevent
//hash collisions.
size_t hash_array( const unsigned char array[ARRAY_LENGTH] )
{
    size_t hash = 0;

    for ( size_t i = 0; i < ARRAY_LENGTH; i++ )
    {
        hash = hash * 7 + array[i];
    }

    return hash % HASH_TABLE_SIZE;
}

void build_hash_table_from_data( unsigned char data[][ARRAY_LENGTH], size_t data_length, struct hash_table_entry *hash_table[HASH_TABLE_SIZE] )
{
    for ( size_t i = 0; i < data_length; i++ )
    {
        struct hash_table_entry **pp, *p;

        //hash the array and store the result
        size_t hash = hash_array( data[i] );

        //perform hash table lookup
        pp = &hash_table[hash];

        //make pp point to the last null pointer in the list
        while ( ( p = *pp ) != NULL )
            pp = &p->next;

        //allocate memory for linked list node
        p = malloc( sizeof *p );
        if ( p == NULL )
        {
            fprintf( stderr, "Memory allocation failure!\n" );
            exit( EXIT_FAILURE );
        }

        //fill in data for new node
        p->next = NULL;
        memcpy( p->data, data[i], ARRAY_LENGTH );

        //insert new node into linked list
        *pp = p;
    }
}

void cleanup_hash_table( struct hash_table_entry *hash_table[HASH_TABLE_SIZE] )
{
    for ( size_t i = 0; i < HASH_TABLE_SIZE; i++ )
    {
        struct hash_table_entry *p;

        p = hash_table[i];

        while ( p != NULL )
        {
            struct hash_table_entry *q = p;

            p = p->next;

            free( q );
        }
    }
}

bool lookup( const unsigned char array[ARRAY_LENGTH], struct hash_table_entry *hash_table[HASH_TABLE_SIZE] )
{
    size_t hash;
    struct hash_table_entry *p;

    //find hash
    hash = hash_array( array );

    //index into hash table
    p = hash_table[hash];

    //attempt to find array in linked list
    while ( p != NULL )
    {
        if ( memcmp( array, p->data, ARRAY_LENGTH ) == 0 )
        {
            //array was found
            return true;
        }

        p = p->next;
    }

    //array was not found
    return false;
}

//This function serves as a wrapper function for the
//function "lookup". It prints to "stdout" what it is
//looking for and what the result of the lookup was.
bool verbose_lookup( const unsigned char array[ARRAY_LENGTH], struct hash_table_entry *hash_table[HASH_TABLE_SIZE] )
{
    //print message
    printf( "Searching for %.*s...", ARRAY_LENGTH, array );
    fflush( stdout );

    if ( lookup( array, hash_table ) )
    {
        printf( "found\n" );
        fflush( stdout );
        return true;
    }
    else
    {
        printf( "not found\n" );
        fflush( stdout );
        return false;
    }
}

This program has the following output:

Generating random data...done
Building hash table...done
Searching for uvwxyz0123...found
Searching for hsdkesldhg...not found
Searching for abcdefghij...found
Searching for erlsodn3ls...not found
Searching for klmnopqrst...found
Performing cleanup...done

As you can see, it found all 3 strings that were explicitly inserted, and no other strings were found.

In the code, I am using unsigned char instead of char, because in C, a char * is usually used for passing null-terminated strings. Therefore, it seemed more appropriate to use unsigned char * for passing data that could be binary or not null-terminated.

bsearch solution

For comparison, here is a solution which generates the input array in the same way, but uses bsearch instead of a hash table for searching it:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <string.h>
#include <stdbool.h>

#define NUM_ARRAYS 10000
#define ARRAY_LENGTH 10

//function prototype declarations
void fill_with_random_data( unsigned char data[NUM_ARRAYS][ARRAY_LENGTH] );
int compare_func( const void *a, const void *b );
bool verbose_search( const unsigned char array[ARRAY_LENGTH], unsigned char data[NUM_ARRAYS][ARRAY_LENGTH] );

int main( void )
{
    //declare 2D array for the input data
    static unsigned char data[NUM_ARRAYS][ARRAY_LENGTH];

    //seed random number generator
    srand( (unsigned)time( NULL ) );

    //fill array "data" with random data
    printf( "Generating random data..." );
    fflush( stdout );
    fill_with_random_data( data );
    printf( "done\n" );
    fflush( stdout );

    //overwrite a few strings, so that we can search for
    //them later
    _Static_assert( NUM_ARRAYS > 500, "unexpected value" );
    _Static_assert( ARRAY_LENGTH == 10, "unexpected value" );
    memcpy( data[47], "abcdefghij", 10 );
    memcpy( data[218], "klmnopqrst", 10 );
    memcpy( data[419], "uvwxyz0123", 10 );

    //sort the array
    printf( "Sorting array..." );
    fflush( stdout );
    qsort( data, NUM_ARRAYS, ARRAY_LENGTH, compare_func );
    printf( "done\n" );
    fflush( stdout );

    //perform lookups
    verbose_search( (unsigned char *)"uvwxyz0123", data );
    verbose_search( (unsigned char *)"hsdkesldhg", data );
    verbose_search( (unsigned char *)"abcdefghij", data );
    verbose_search( (unsigned char *)"erlsodn3ls", data );
    verbose_search( (unsigned char *)"klmnopqrst", data );
}

void fill_with_random_data( unsigned char data[NUM_ARRAYS][ARRAY_LENGTH] )
{
    static const unsigned char random_chars[62] =
        "abcdefghijklmnopqrstuvwxyz"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "0123456789";

    for ( size_t i = 0; i < NUM_ARRAYS; i++ )
    {
        for ( size_t j = 0; j < ARRAY_LENGTH; j++ )
        {
            data[i][j] = random_chars[rand()%sizeof random_chars];
        }
    }
}

int compare_func( const void *a, const void *b )
{
    return memcmp( a, b, ARRAY_LENGTH );
}

//This function serves as a wrapper function for the
//function "bsearch". It prints to "stdout" what it is
//looking for and what the result of the search was.
bool verbose_search( const unsigned char array[ARRAY_LENGTH], unsigned char data[NUM_ARRAYS][ARRAY_LENGTH] )
{
    //print message
    printf( "Searching for %.*s...", ARRAY_LENGTH, array );
    fflush( stdout );

    if ( bsearch( array, data, NUM_ARRAYS, ARRAY_LENGTH, compare_func ) != NULL )
    {
        printf( "found\n" );
        fflush( stdout );
        return true;
    }
    else
    {
        printf( "not found\n" );
        fflush( stdout );
        return false;
    }
}

This program has the following output:

Generating random data...done
Sorting array...done
Searching for uvwxyz0123...found
Searching for hsdkesldhg...not found
Searching for abcdefghij...found
Searching for erlsodn3ls...not found
Searching for klmnopqrst...found

The hash table solution is probably faster than the binary search solution though, assuming that a good hash function is selected for the input, so that the number of hash collisions is minimal.

Andreas Wenzel
  • 22,760
  • 4
  • 24
  • 39
  • Thanks a lot. I'm going to test this! PS: the arrays are either all of them binary data, or all of them null-terminated strings, this is a fixed decision at the beginning, so this is no problem in reality. – Basj May 18 '22 at 06:56
  • @Basj: I have now changed `struct hash_table_entry` to contain the actual data, instead of only a pointer to the data. That way, this pointer does not have to be dereferenced. This should improve performance a bit. – Andreas Wenzel May 18 '22 at 07:12
  • Thanks @AndreasWenzel. (In my use case, I only need to test if it is present or not very fast. In the case it is present, I can perform another longer test to actually retrieve the array) – Basj May 18 '22 at 07:31
  • @Basj: I have now added a solution which uses `bsearch`. – Andreas Wenzel May 18 '22 at 07:56
  • Thanks @AndreasWenzel, great solutions here! Hashtable is 50% faster than binary search for my application! Do you think there is even faster? Without storing the actual `data` in the `hash_table_entry` ; not because of RAM usage which is negligible for 10k elements but for performance reason. Or a way to quickly check if it is present or not (with a few false positive, bloom-filter style)? When positive (very rarely) I do a double check with this hashtable search or binary search. – Basj May 18 '22 at 12:32
  • @Basj: I believe the current hashtable solution, which stores the actual data in the hash table instead of a pointer, should generally be faster, even if `struct hash_table_entry` is now 50% larger (due to padding) than it was before. My previous solution in [revision 5](https://stackoverflow.com/revisions/72283036/5) of my answer used a pointer instead, which was smaller, but this pointer had to be dereferenced every time, so that an additional memory access was necessary. Therefore, I believe that this version should generally be a bit slower. [...] – Andreas Wenzel May 18 '22 at 23:32
  • @Basj: [continuation of previous comment] However, I was unable to confirm this in actual benchmarking. It probably depends on the layout of the memory assigned by `malloc`. – Andreas Wenzel May 18 '22 at 23:33
  • @Basj: When dealing with a hash table, the most important thing is to ensure that the hash function is good, i.e. that the hash function does not create too many hash collisions. Otherwise, if you have many hash collisions, you will be traversing long linked lists, which is highly inefficient. In my tests of my posted code, the maximum number of hash collisions for a single hash table entry were 6 to 8 (depending on the random data), which I believe is an acceptable value. However, depending on the input, the hash function may have to be modified. – Andreas Wenzel May 18 '22 at 23:42
  • For a slight improvement you should make HASH_TABLE_SIZE a multiple of 2. That way you can return hash & (HASH_TABLE_SIZE -1) instead of a slower modulo (hash % HASH_TABLE_SIZE) operation. – jmq May 19 '22 at 00:54
  • @jmq: I'm not sure if it is generally a good idea to make the hash table size a power of 2, as that could increase the chance of a hash collision, depending on the type of input and what kind of hash function you choose. – Andreas Wenzel May 19 '22 at 02:23
  • 1
    @Basj: Your idea of using a probabalistic data structure such as a bloom filter is already how a hash table works. If there is no hash table entry (i.e. the hash table pointer points to `NULL` for that hash value), then you can be sure that the word does not exist in the hash table (no chance of a false negative). However, if the hash table pointer points to non-`NULL` for that hash value, then that word may exist in the hash table, but there is a chance of a false positive. [continued in next comment] – Andreas Wenzel May 19 '22 at 02:42
  • 1
    @Basj: [continued from previous comment] That is why it is necessary to traverse the entire linked list for that hash value in order to determine whether it is a false positive or not. Using a [bloom filter](https://en.wikipedia.org/wiki/Bloom_filter) has the advantage that it requires less space than a hash table. – Andreas Wenzel May 19 '22 at 02:42
  • Thanks @AndreasWenzel and jmq for your great comments. – Basj May 19 '22 at 07:12
1

The signature of the bsearch function is as follows:

   void *bsearch(const void *key, const void *base,
                 size_t nmemb, size_t size,
                 int (*compar)(const void *, const void *));

Here, key would be either t1 or t2 and base would be T. nmemb and size would be 9 and 4 respectively.

compar is a pointer to a callback function to do the comparison. This can just be a wrapper around memcmp:

int compare_char4(const void *p1, const void *p2)
{
    return memcmp(p1, p2, 4);
}

Then you call bsearch with these parameters:

printf("%s\n", bsearch(t1, T, 9, 4, compare_char4) ? "found" : "not found");
printf("%s\n", bsearch(t2, T, 9, 4, compare_char4) ? "found" : "not found");
dbush
  • 205,898
  • 23
  • 218
  • 273
  • Thanks @dbush! It works great, and much simpler than my solution (which tries to reinvent the wheel)! In terms of performance it seems to be similar to my solution. – Basj May 18 '22 at 07:39
0

I think the following binary search works (thanks to @WeatherVane) if the array of arrays is sorted, then the complexity is O(log n). We can use memcmp for the comparison.

int binarysearch(unsigned char *T, unsigned char *t, int num, int size)  // https://en.wikipedia.org/wiki/Binary_search_algorithm#Algorithm
{
    int a = 0;
    int b = num-1;
    int m, res;
    while (a <= b) {
        m = (a + b) / 2;
        res = memcmp(&T[0] + m*size, &t[0], size);
        if (res < 0) {
            a = m + 1;
        } else if (res > 0) {
            b = m - 1;
        } else { 
            return 1;
        } 
    }
    return 0;
}
int main()
{
    unsigned char T[][4] = { 
        { 0x00, 0x6e, 0xab, 0xcd },
        { 0xea, 0x6e, 0xab, 0xcd },
        { 0xeb, 0x6e, 0xab, 0xcd },
        { 0xec, 0x6e, 0xab, 0xcd },
        { 0xed, 0x6e, 0xab, 0xcd },
        { 0xef, 0x6e, 0xab, 0xcd },
        { 0xfa, 0x6e, 0xab, 0xcd },
        { 0xfb, 0x6e, 0xab, 0xcd },
        { 0xff, 0x6e, 0xab, 0xcd }
    };
    unsigned char t1[4] = { 0x33, 0x6e, 0xab, 0xcd };
    unsigned char t2[4] = { 0xfb, 0x6e, 0xab, 0xcd };
    printf("%s\n", binarysearch(T[0], t1, 9, 4) ? "found" : "not found");
    printf("%s\n", binarysearch(T[0], t2, 9, 4) ? "found" : "not found");
}    

Result:

not found
found

Basj
  • 41,386
  • 99
  • 383
  • 673