Organise multiple fixed-length-string dictionaries for "sparse" string-lengths

Question

I need to load some fixed-length-string dictionaries from disk: count of word-lengths ( 1 .. 18 ), word-lengths ( 3 .. 20 ), word-counts ( 1 .. N ) for each word-length ( N < 50,000 ) are all unknown until run-time.

I'm new to C and just about coping with array-of-string:

char (*words5)[6]; // my word-lengths don't count \0 ...to preserve sanity
words5 = calloc(wordcnt * sizeof *words5);

That's fine for one dictionary, but I will have 1 .. 18 different dictionaries. It seems "untidy" to have:

char (*words3)[4];
char (*words4)[5];
...
char (*words20)[21];

when most will be unused for most runs of the programme.

What is a better way of "organising" these dictionaries?

Chris

@Craig Estey: Not sure that trie does what I want! My intended use of the dictionaries is along the lines of grab the next N (N = +/-10) 5-letter words that match: regex-like filter: ".T[EA].." (e.g. "STACK", "STAKE", "STABS", etc) && "character-budget" filter: ( max 2*'A', 0*'B', 0*'C', 1*'D' ...etc ) "STACK", "STABS" eliminated. && other filters: e.g. wannabe candidate word not in hashmap of disallowed words. I understand the storage efficiency of a trie, but "next N": how do I arrange that: how do I mark my next "index" in a trie? Chris — chrisM, Apr 15 '22 at 20:49
It _can_ be done with trie _if_, in addition to the the parent/child[26] links, you have next/prev for all trie nodes at a given depth. But ... what about `struct word { char *str; struct word *next; };` and then have: `struct word *wordlist[100];` indexed by length of word? So, for all 5 letter words: `for (struct word *word = wordlist[5]; word != NULL; word = word->next) // do stuff` (i.e. each `wordlist` entry is a pointer to a single linked list)? That's just one [simple] example. What are your needs? — Craig Estey, Apr 15 '22 at 20:53
If you don't mind tree (trie) _traversal_, for (e.g.) a 5 letter word the resume point is the address of the last trie node at the correct depth (e.g. depth 4) that you saved into the output list. For example, if the last word you added to the 10 word output list was "batch", then remember the address of the "h" node. When you resume, at "h" go to parent (i.e. "c") and then down to the next [valid] child of "c" — Craig Estey, Apr 15 '22 at 21:23
@CraigEstey Wow! I've been fascinated by the trie for quite a while but never implemented one in any language. Seriously interesting but also seriously scary for a C newbie. I desperately need my dictionary loader running so I can work on the main algorithm so I will lash up an array of pointers for now, and get back to trie later. Thank you for the idea and the "next" suggestions: I will pursue it. — chrisM, Apr 16 '22 at 07:31
I do have some simple trie examples lying around: https://stackoverflow.com/questions/62235829/i-am-trying-to-initialize-a-glogal-structuretrie-node-as-the-head-of-a-trie-in/62236036#62236036 and https://stackoverflow.com/questions/62121662/implement-n-ary-tree-in-c/62159300#62159300 — Craig Estey, Apr 16 '22 at 14:50

chrisM · Answer 1 · 2022-05-04T20:31:20.760

I finally got it going!

edit: OH NO I DIDN'T! -- this seems to do ONE wordlen and segfaults when I do more stuff that generates (e.g.): thislen = 7; int thiscnt = 10;

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_WORDLEN 10 // maximum word-length for test ( sans \0 )

// bounds: index is word-length ( sans \0 ), int is word-count
int wordcnts[MAX_WORDLEN + 1] = {0}; // +1 to get index = MAX_WORDLEN

// dictionaries
char *dicts[MAX_WORDLEN + 1];   // +1 to get an index = MAX_WORDLEN
    
int main(void) {
    
    // do stuff that generates:
    int thislen = 4; // word-length of interest ( sans \0 )
    int thiscnt = 9; // count of thislen words surviving the filters
    

    char (*dicts[thislen])[thislen + 1]; // +1 for \0
    dicts[thislen] = malloc(thiscnt * sizeof(*dicts[thislen]));
    if ( dicts[thislen] == NULL ) {
        printf("malloc fail! len %d, cnt %d\n", thislen, thiscnt);
        exit(EXIT_FAILURE); 
    }
    wordcnts[thislen] = thiscnt;                                            
    

    // copy some dummy strings into dicts[thislen]
    strncpy(dicts[thislen][0], "ABLE", thislen + 1);
    strncpy(dicts[thislen][1], "BODY", thislen + 1);
    strncpy(dicts[thislen][2], "COPE", thislen + 1);
    strncpy(dicts[thislen][3], "DEER", thislen + 1);
    strncpy(dicts[thislen][4], "MANY", thislen + 1);
    strncpy(dicts[thislen][5], "MORE", thislen + 1);
    strncpy(dicts[thislen][6], "XIPE", thislen + 1);
    strncpy(dicts[thislen][7], "YELL", thislen + 1);
    strncpy(dicts[thislen][8], "ZOOS", thislen + 1);
    
    for (int i = 0; i < wordcnts[thislen]; ++i) {
        printf("[%d]: %s\n", i, dicts[thislen][i]);
    }

    // free when done
    for (int i = 0; i <= MAX_WORDLEN; ++i) {
        if (wordcnts[i] > 0) {
            free(dicts[i]);
        }
    }
    return 0;
}

/* OUTPUT
me@testbox Tests % ./malloc_strings1   
[0]: ABLE
[1]: BODY
[2]: COPE
[3]: DEER
[4]: MANY
[5]: MORE
[6]: XIPE
[7]: YELL
[8]: ZOOS
me@testbox Tests %
*/

Organise multiple fixed-length-string dictionaries for "sparse" string-lengths

1 Answers1