How do I write a simple regular expression pattern matching function in C or C++?

Question

This is a question in my paper test today, the function signature is

int is_match(char* pattern,char* string)

The pattern is limited to only ASCII chars and the quantification * and ?, so it is relatively simple. is_match should return 1 if matched, otherwise 0.

How do I do this?

Are there parentheses in your regex or are the * and ? only applied to single characters? — A. Rex, Oct 28 '10 at 08:02
@A.Rex,no parentheses.the * and ? is the same meaning as defined by regex — Tracy, Oct 28 '10 at 08:06
@Tracy: There are a lot of different interpretations and implementations of regular expression matchers, and they don't all work identically. It is important that you understand which interpretations was meant, in order to answer this question correctly. — Merlyn Morgan-Graham, Oct 28 '10 at 08:07
Very interesting question! =) I thought it was simple for a second but then there are so many possibilities and combinations right? (I mean you can use combinations of * and ? as many times you like in the pattern correct?) — gideon, Oct 28 '10 at 08:14
@Tracy: This wasn't nearly as easy as I first thought it would be :) I wish you luck — Merlyn Morgan-Graham, Oct 28 '10 at 22:29
@Merlyn,yes,i just thought it is hard for me to do it,well,at least at that moment. — Tracy, Oct 29 '10 at 03:18

score 14 · Answer 1 · answered Jan 16 '16 at 05:59

Brian Kernighan provided a short article on A Regular Expression Matcher that Rob Pike wrote as a demonstration program for a book they were working on. The article is a very nice read explaining a bit about the code and regular expressions in general.

I have played with this code, making a few changes to experiment with some extensions such as to also return where in the string the pattern matches so that the substring matching the pattern can be copied from the original text.

From the article:

I suggested to Rob that we needed to find the smallest regular expression package that would illustrate the basic ideas while still recognizing a useful and non-trivial class of patterns. Ideally, the code would fit on a single page.

Rob disappeared into his office, and at least as I remember it now, appeared again in no more than an hour or two with the 30 lines of C code that subsequently appeared in Chapter 9 of TPOP. That code implements a regular expression matcher that handles these constructs:
c    matches any literal character c
.    matches any single character
^    matches the beginning of the input string
$    matches the end of the input string
*    matches zero or more occurrences of the previous character
This is quite a useful class; in my own experience of using regular expressions on a day-to-day basis, it easily accounts for 95 percent of all instances. In many situations, solving the right problem is a big step on the road to a beautiful program. Rob deserves great credit for choosing so wisely, from among a wide set of options, a very small yet important, well-defined and extensible set of features.

Rob's implementation itself is a superb example of beautiful code: compact, elegant, efficient, and useful. It's one of the best examples of recursion that I have ever seen, and it shows the power of C pointers. Although at the time we were most interested in conveying the important role of a good notation in making a program easier to use and perhaps easier to write as well, the regular expression code has also been an excellent way to illustrate algorithms, data structures, testing, performance enhancement, and other important topics.

The actual C source code from the article is very very nice.

/* match: search for regexp anywhere in text */
int match(char *regexp, char *text)
{
    if (regexp[0] == '^')
        return matchhere(regexp+1, text);
    do {    /* must look even if string is empty */
        if (matchhere(regexp, text))
            return 1;
    } while (*text++ != '\0');
    return 0;
}

/* matchhere: search for regexp at beginning of text */
int matchhere(char *regexp, char *text)
{
    if (regexp[0] == '\0')
        return 1;
    if (regexp[1] == '*')
        return matchstar(regexp[0], regexp+2, text);
    if (regexp[0] == '$' && regexp[1] == '\0')
        return *text == '\0';
    if (*text!='\0' && (regexp[0]=='.' || regexp[0]==*text))
        return matchhere(regexp+1, text+1);
    return 0;
}

/* matchstar: search for c*regexp at beginning of text */
int matchstar(int c, char *regexp, char *text)
{
    do {    /* a * matches zero or more instances */
        if (matchhere(regexp, text))
            return 1;
    } while (*text != '\0' && (*text++ == c || c == '.'));
    return 0;
}

score 6 · Answer 2 · edited May 23 '17 at 12:09

6

See This Question for a solution you can not submit. See this paper for a description of how to implement a more readable one.

edited May 23 '17 at 12:09

Community

1
1

answered Oct 28 '10 at 08:52

AShelly

34,686
15
91
152

1

wow... everyone of those answers' code makes my eyes bleed.. then Im unable to cry because I feel so insignificant and small! – gideon Oct 28 '10 at 09:18

Basilevs · Answer 3 · 2010-10-28T12:08:53.327

Here is recursive extendable implementation. Tested for first order of pattern complexity.

#include <string.h>
#include <string>
#include <vector>
#include <iostream>

struct Match {
  Match():_next(0) {}
  virtual bool match(const char * pattern, const char * input) const {
    return !std::strcmp(pattern, input);
  }
  bool next(const char * pattern, const char * input) const {
    if (!_next) return false;
    return _next->match(pattern, input);
  }
  const Match * _next;
};

class MatchSet: public Match {
  typedef std::vector<Match *> Set;
  Set toTry;
public:
  virtual bool match(const char * pattern, const char * input) const {
    for (Set::const_iterator i = toTry.begin(); i !=toTry.end(); ++i) {
      if ((*i)->match(pattern, input)) return true;
    }
    return false;
  }
  void add(Match * m) {
    toTry.push_back(m);
    m->_next = this;
  }
  ~MatchSet() {
    for (Set::const_iterator i = toTry.begin(); i !=toTry.end(); ++i)
      if ((*i)->_next==this) (*i)->_next = 0;
  }
};

struct MatchQuestion: public Match  {
  virtual bool match(const char * pattern, const char * input) const {
    if (pattern[0] != '?')
      return false;
    if (next(pattern+1, input))
      return true;
    if (next(pattern+1, input+1))
      return true;
    return false;
  }
};

struct MatchEmpty: public Match {
  virtual bool match(const char * pattern, const char * input) const {
    if (pattern[0]==0 && input[0]==0)
      return true;
    return false;
  }
};

struct MatchAsterisk: public Match {
  virtual bool match(const char * pattern, const char * input) const {
    if (pattern[0] != '*')
      return false;
    if (pattern[1] == 0) {
      return true;
    }
    for (int i = 0; input[i] != 0; ++i) {
      if (next(pattern+1, input+i))
        return true;
    }
    return false;
  }
};

struct MatchSymbol: public Match {
  virtual bool match(const char * pattern, const char * input) const {
    // TODO: consider cycle here to prevent unnecessary recursion
    // Cycle should detect special characters and call next on them
    // Current implementation abstracts from that
    if (pattern[0] != input[0])
      return false;
    return next(pattern+1, input+1);
  }
};

class DefaultMatch: public MatchSet {
  MatchEmpty empty;
  MatchQuestion question;
  MatchAsterisk asterisk;
  MatchSymbol symbol;
public:
  DefaultMatch() {
    add(&empty);
    add(&question);
    add(&asterisk);
    add(&symbol);
  }
  void test(const char * p, const char * input) const {
    testOneWay(p, input);
    if (!std::strcmp(p, input)) return;
    testOneWay(input, p);
  }
  bool testOneWay(const char * p, const char * input) const {
    const char * eqStr = " == ";
    bool rv = match(p, input);
    if (!rv) eqStr = " != ";
    std::cout << p << eqStr << input << std::endl;
    return rv;
  }

};


int _tmain(int argc, _TCHAR* argv[])
{
  using namespace std;

  typedef vector<string> Strings;
  Strings patterns;

  patterns.push_back("*");
  patterns.push_back("*hw");
  patterns.push_back("h*w");
  patterns.push_back("hw*");

  patterns.push_back("?");
  patterns.push_back("?ab");
  patterns.push_back("a?b");
  patterns.push_back("ab?");

  patterns.push_back("c");
  patterns.push_back("cab");
  patterns.push_back("acb");
  patterns.push_back("abc");

  patterns.push_back("*this homework?");
  patterns.push_back("Is this homework?");
  patterns.push_back("This is homework!");
  patterns.push_back("How is this homework?");

  patterns.push_back("hw");
  patterns.push_back("homework");
  patterns.push_back("howork");

  DefaultMatch d;
  for (unsigned i = 0; i < patterns.size(); ++i)
    for (unsigned j =i; j < patterns.size(); ++j)
      d.test(patterns[i].c_str(), patterns[j].c_str());

    return 0;
}

If something is unclear, ask.

The `*` matches the previous symbol zero-or-more times, and `?` matches the previous symbol one-or-more times. In several places here, you have `?` and `*` with no preceding symbol. Is this intentional? — Merlyn Morgan-Graham, Oct 29 '10 at 02:25
* matches any amount of any symbols, ? matches zero or one symbol — Basilevs, Oct 29 '10 at 03:44
Now that I think about it, bash matches ? as exactly one symbol, I should've probably do the same — Basilevs, Oct 29 '10 at 03:56

score 2 · Answer 4 · edited Oct 28 '10 at 08:20

2

try to make a list of interesting test cases:

is_match("dummy","dummy") should return true;

is_match("dumm?y","dummy") should return true;

is_match("dum?y","dummy") should return false;

is_match("dum*y","dummy") should return true;

and so on ...

then see how to make the easier test pass, then the next one ...

edited Oct 28 '10 at 08:20

Chubsdad

24,777
4
73
129

answered Oct 28 '10 at 08:00

siukurnin

2,862
17
20

You trying to teach them TDD at the same time? ;) Why not post some simple instructions on how to use Google Test, while your at it: http://code.google.com/p/googletest/ – Merlyn Morgan-Graham Oct 28 '10 at 08:03
1

Plz correct me if Im wrong but...shouldn't `s_match("dum?y","dummy")` return **true**. `m?` matches the letter **m** zero or more times right? – gideon Oct 28 '10 at 08:21
@giddy: Zero or one times, see e.g. [here](http://www.regular-expressions.info/reference.html). `*` would mean *"zero or more times"*. – Georg Fritzsche Oct 28 '10 at 08:26
@gideon, no, `ab?` means that either `a` or `ab`. `ab?` will match `abbbb`, and `ab?` will match `ac`, but `ab?` will not match `aab` because there is no two `a` in the `ab?`. – FaranAiki Jan 24 '22 at 08:42

score 2 · Answer 5 · answered Oct 28 '10 at 08:06

2

Cheat. Use #include <boost/regex/regex.hpp>.

answered Oct 28 '10 at 08:06

Benoit

76,634
23
210
236

6

would be a perfect answer for a real world project : 1) never trust your boss when he says "we will never have a use for all other regex stuff, just ? and *, promised" 2) don't reinvent warm water when you have hot water on the tap. My answer was based on the asumption this is homework ... – siukurnin Oct 28 '10 at 08:15
Also, the questio is tagged C as well as C++. – JeremyP Oct 28 '10 at 09:58

score 1 · Answer 6 · answered Oct 28 '10 at 08:37

1

Didn't test this, actually code it, or debug it, but this might get you a start...

for each character in the pattern
  if pattern character after the current one is *
    // enter * state
    while current character from target == current pattern char, and not at end
      get next character from target
    skip a char from the pattern
  else if pattern character after the current one is ?
    // enter ? state
    if current character from target == current pattern char
      get next char from target
    skip a char from the pattern
  else
    // enter character state
    if current character from target == current pattern character
      get next character from target
    else
      return false
return true

answered Oct 28 '10 at 08:37

Merlyn Morgan-Graham

58,163
16
128
183

One important thing missing here is the ability to backtrack, otherwise the regex `ab?bc` couldn't match `abc`. – Tim Pietzcker Oct 28 '10 at 09:58
@Tim: I attempted to solve that by matching the `?` forward - "if the pattern character after the current one is ?" – Merlyn Morgan-Graham Oct 28 '10 at 16:15
But that doesn't work. The `b?` matches the `b`, then the next `b` can't match the `c` and the regex fails. There has to be a way to remember that the `b?` was optional, and to exercise that option. That's probably what Basilevs means by "add a stack here" - you need to remember all positions where there are several ways of applying the current regex token and try all possible combinations before declaring failure. – Tim Pietzcker Oct 28 '10 at 20:08
@Tim: I think there's some ambiguity to what I meant by "get next" vs "skip a char" vs "current". Maybe I'll just rewrite the darn thing in C =P I am fairly certain you don't need a stack, as long as you're willing to live with hacks that won't apply to more complicated features (that we currently aren't implementing) in the pattern. – Merlyn Morgan-Graham Oct 28 '10 at 22:23
@Tim: Oh, I get what you're saying now... Maybe I should get off my butt and go read a book on compilers, like I've been wanting to :) – Merlyn Morgan-Graham Oct 28 '10 at 22:28

Ivanvi · Answer 7 · 2010-10-28T11:12:13.300

The full power of regular expressions and finite state machines are not needed to solve this problem. As an alternative there is a relatively simple dynamic programming solution.

Let match(i, j) be 1 if it is possible to match the the sub-string string[i..n-1] with the sub-pattern pattern[j, m - 1], where n and m are the lengths of string and pattern respectively. Otherwise let match(i, j) be 0.

The base cases are:

match(n, m) = 1, you can match an empty string with an empty pattern;
match(i, m) = 0, you can't match a non-empty string with an empty pattern;

The transition is divided into 3 cases depending on whether the current sub-pattern starts with a character followed by a '*', or a character followed by a '?' or just starts with a character with no special symbol after it.

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

int is_match(char* pattern, char* string)
{
  int n = strlen(string);
  int m = strlen(pattern);

  int i, j;
  int **match;

  match = (int **) malloc((n + 1) * sizeof(int *));
  for(i = 0; i <= n; i++) {
    match[i] = (int *) malloc((m + 1) * sizeof(int));
  }

  for(i = n; i >= 0; i--) {
    for(j = m; j >= 0; j--) {
      if(i == n && j == m) {
        match[i][j] = 1;
      }
      else if(i < n && j == m) {
        match[i][j] = 0;
      }
      else {
        match[i][j] = 0;
        if(pattern[j + 1] == '*') {
          if(match[i][j + 2]) match[i][j] = 1;
          if(i < n && pattern[j] == string[i] && match[i + 1][j]) match[i][j] = 1;
        }
        else if(pattern[j + 1] == '?') {
          if(match[i][j + 2]) match[i][j] = 1;
          if(i < n && pattern[j] == string[i] && match[i + 1][j + 2]) match[i][j] = 1;
        }
        else if(i < n && pattern[j] == string[i] && match[i + 1][j + 1]) {
          match[i][j] = 1;
        }
      }
    }
  }

  int result = match[0][0];

  for(i = 0; i <= n; i++) {
    free(match[i]);
  }

  free(match);

  return result;
}

int main(void)
{
  printf("is_match(dummy, dummy)  = %d\n", is_match("dummy","dummy"));
  printf("is_match(dumm?y, dummy) = %d\n", is_match("dumm?y","dummy"));
  printf("is_match(dum?y, dummy)  = %d\n", is_match("dum?y","dummy"));
  printf("is_match(dum*y, dummy)  = %d\n", is_match("dum*y","dummy")); 

  system("pause");

  return 0;
}

The time complexity of this approach is O(n * m). The memory complexity is also O(n * m) but with a simple modification can be reduced to O(m).

score 0 · Answer 8 · answered Oct 28 '10 at 16:42

0

Simple recursive implementation. It's slow but easy to understand:

int is_match(char *pattern, char *string)
{
    if (!pattern[0]) {
        return !string[0];
    } else if (pattern[1] == '?') {
        return (pattern[0] == string[0] && is_match(pattern+2, string+1))
            || is_match(pattern+2, string);
    } else if (pattern[1] == '*') {
        size_t i;
        for (i=0; string[i] == pattern[0]; i++)
            if (is_match(pattern+2, string+i)) return 1;
        return 0;
    } else {
        return pattern[0] == string[0] && is_match(pattern+1, string+1);
    }
}

Hope I got it all right.

answered Oct 28 '10 at 16:42

R.. GitHub STOP HELPING ICE

208,859
35
376
711

i think the for loop should be for(i=0;string[i]==pattern[0];i++);(note the trailing ";"),did you make a typo? – Tracy Oct 30 '10 at 13:35
actually the else if(pattern[1]=='*') branch could be replace by for( i = 0 ; string[i]==pattern[0];i++) ;//note this semi-colon return is_match(pattern+2,string+i); so how do you think? – Tracy Oct 30 '10 at 13:51
@Tracy: nope. Then "x*xxxxxy" will fail to match "xxxxxy". As I said my answer is slow but simple because handling these cases correctly is not simple unless you're willing to convert the pattern to a state machine. – R.. GitHub STOP HELPING ICE Oct 30 '10 at 17:13

score 0 · Answer 9 · answered Aug 31 '17 at 10:12

A C program to find the index,from where the sub-string in the main string is going to start. enter code here

#include<stdio.h>
int mystrstr (const char *,const char *);
int mystrcmp(char *,char *);
int main()
{
    char *s1,*s2;//enter the strings, s1 is main string and s2 is substring.
    printf("Index is %d\n",mystrstr(s1,s2));
    //print the index of the string if string is found
}
//search for the sub-string in the main string
int mystrstr (const char *ps1,const char *ps2) 
{
    int i=0,j=0,c=0,l,m;char *x,*y;
    x=ps1;
    y=ps2;
    while(*ps1++)i++;
    while(*ps2++)j++;
    ps1=x;
    ps2=y;
    char z[j];
    for(l=0;l<i-j;l++)
    {
        for(m=l;m<j+l;m++)
            //store the sub-string of similar size from main string
            z[c++]=ps1[m];
        z[c]='\0'
        c=0;
        if(mystrcmp(z,ps2)==0)
        break;
    }
    return l;
}


int mystrcmp(char *ps3,char *ps4) //compare two strings
{
    int i=0;char *x,*y;
    x=ps3;y=ps4;
    while((*ps3!=0)&&(*ps3++==*ps4++))i++;      
    ps3=x;ps4=y;
    if(ps3[i]==ps4[i])
        return 0;
    if(ps3[i]>ps4[i])
        return +1;
    else
        return -1;
}

How do I write a simple regular expression pattern matching function in C or C++?

9 Answers9

Linked