2

I'm looking into writing a program that takes a text, some wildcard patterns and then shows me the words that match those wildcards. The wildcard pattern contains only ., which represents only one character, or *, which can be anything (including a white space, a new line or ?!,./\ etc.).

Until now, I managed to dynamically read the text and I'm thinking into creating a list with all the words with strtok. Only with ., it'd be easy, but I have no idea how to work with *. And expressions, of course, can be a combination of . and *, like this: h.*. (which can be a match for harry for example).

I'd like you to share some ideas with me. I'm not asking for explicit full code, but for ideas to implement it myself.

Adrian Pop
  • 1,879
  • 5
  • 28
  • 40
  • 1
    The character `*` by itself is not a valid regular expression, in a regex it's a modifier character that means "zero or more of the previous expression". It's a valid expression when [*globbing*](https://en.wikipedia.org/wiki/Glob_%28programming%29) on the other hand. – Some programmer dude Dec 01 '15 at 13:10
  • In this problem, if an expression would be only *, it should show me all the words. It's a combination of regular expression and globbing, but only with two wildcards. – Adrian Pop Dec 01 '15 at 13:12
  • 1
    And don't try to make a regular expression engine all by your self, not unless you want to get your hands dirty with finite state machines and automatas and regular languages. Not even a simple globbing system is, well, simple. For globbing, there are functions available, either as part of the C runtime library or as external libraries (depending on system). On POSIX systems (like e.g. Linux or OSX) you could use [`fnmatch`](http://man7.org/linux/man-pages/man3/fnmatch.3.html) for globbing. – Some programmer dude Dec 01 '15 at 13:16
  • You can build a simple *wildcard* matching scheme for * and ? in a few dozen lines of C. A *regular expression* search is considerably harder to implement. I'm sure your university task is the former. Check the exact requirements before delving in. – Bathsheba Dec 01 '15 at 13:22
  • Okay, for only `.` (or `?`) and `*` it should be a little simpler. But there are still lots of cases that will be hard, like will `"* bar"` match e.g. `"hello foo bar"` or only `"foo bar"`, for example. You could actually find some source for a `scanf` function, and see how it handles pattern matching. – Some programmer dude Dec 01 '15 at 13:22
  • I tried to translate what I need in English from my language. Wildcard matching scheme sounds much better :) – Adrian Pop Dec 01 '15 at 13:30
  • (Meandering thought only) If `.` stands for 'one single character' and `*` for 'one or more', then your sample sequence `h.*.` can be simplified to just `h*`. Do you want to have suffix functionality as well? I.e., `h*y` – literal characters *after* a wildcard? – Jongware Dec 01 '15 at 13:47
  • 1
    @Jongware - No, the first requires at least three characters and the second only two. – owacoder Dec 01 '15 at 13:48
  • The expressions I have to search for are purely random. Of course, h.*. and h* are equivalent for harry, it was just an example :) Thanks anyone for your help! – Adrian Pop Dec 01 '15 at 14:00
  • I wrote an implementation in C# you way wish to see. It has numerous features such as multiple queries on the trot, case insensitive support, and efficiency: https://stackoverflow.com/a/54820605/848344 – Dan W Feb 22 '19 at 06:12

3 Answers3

4

There's a 2001 IOCCC one-liner (schweikh, ahem) that does globbing with * for "zero or more characters" and ? for "exactly one character".

Figuring out how that one works is probably enlightening and gives you plenty of ideas.

Jens
  • 69,818
  • 15
  • 125
  • 179
3

Here is pretty simple implementation:

int WildcardCompare(const char* wild, const char* string) 
{
    const char* cp = NULL, *mp = NULL;
    
    while ((*string) && (*wild != '*')) 
    {
        if ((*wild != *string) && (*wild != '?')) 
        {
            return 0;
        }
        wild++;
        string++;
    }
        
    while (*string) 
    {
        if (*wild == '*') 
        {
            if (!*++wild) 
            {
                return 1;
            }
            mp = wild;
            cp = string+1;
        } 
        else if ((*wild == *string) || (*wild == '?')) 
        {
            wild++;
            string++;
        } 
        else 
        {
            wild = mp;
            string = cp++;
        }
    }
        
    while (*wild == '*') 
    {
        wild++;
    }
    return !*wild;
}
Dada
  • 6,313
  • 7
  • 24
  • 43
Andrew Komiagin
  • 6,446
  • 1
  • 13
  • 23
  • Thanks! I'll take a look at this as well :) – Adrian Pop Dec 01 '15 at 13:58
  • I just tried it and there are some...errors. First of all, I get "error: assignment of read-only variable ‘mp’". If I remove "const", it's ok, but a combination of stars and question-marks are fatal for the program. For example, I tried with string = "lampa" and wild="??*mpa" which should return 1, but I get an Segmentation fault error. I'll try to fix something if I can :) – Adrian Pop Dec 02 '15 at 03:45
0

Take a look at flex and bison. They should help you with tokenizing and then parsing the regex. From there it should be fairly easy to perform the matching.

doron
  • 27,972
  • 12
  • 65
  • 103