
I would like to write a regular expression that matches any word (I use `[a-zA-Z]*`) except for certain words, for example WORD1 and WORD2.

So `somethingsomething` should match, but the words WORD1 and WORD2 should not. Is this possible in flex?

I have tried:

`[a-zA-Z]*|[^"WORD1""WORD2]` and `[a-zA-Z]*{-}["WORD1""WORD2"]`, but neither works.

(Now I know why they don't work but I still don't know the solution.)

Nathaniel Ford
Miklos
  • [Lookaheads](http://www.regular-expressions.info/lookaround.html) are not available in flex regex, are they? – bobble bubble Feb 29 '16 at 10:16
  • @bobblebubble: [It doesn't look good.](http://stackoverflow.com/q/22326399/20670) – Tim Pietzcker Feb 29 '16 at 10:41
  • I really don't understand what you're asking here. `[a-zA-Z]*` won't match `WORD1`, although it will match `WORD`. What did you want to happen when `WORD1` is encountered? An error? A different token type? Two tokens? Only if we know what you want the result of scanning `WORD1` to be can we provide a suggestion for how to implement it. – rici Mar 01 '16 at 00:28
  • What do you mean by "any word"? As in, any English word, in which case you need to allow for apostrophes (or you *won't* match words like "don't")? – nnnnnn Mar 02 '16 at 09:46

2 Answers


The usual approach in lex/flex is to use a combination of patterns and rules to select the desired behavior:

  • You could make a lexer which matches (and ignores) those words and then falls through to the expression for other identifiers, or
  • Simply match all identifiers and filter out the results with a lookup table.
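
The lookup-table option can be sketched in C, the language flex actions are written in. This is only an illustration under assumptions, not code from the answer: the table contents come from the question, and the names `excluded_words` and `is_excluded` are made up here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical table of words that should not be reported as ordinary words. */
static const char *const excluded_words[] = { "WORD1", "WORD2" };

/* Returns true if `text` is exactly one of the excluded words (case-sensitive). */
bool is_excluded(const char *text)
{
    assert(text != NULL);
    size_t count = sizeof excluded_words / sizeof excluded_words[0];
    for (size_t i = 0; i < count; i++) {
        if (strcmp(text, excluded_words[i]) == 0)
            return true;
    }
    return false;
}
```

A rule action could then call `is_excluded(yytext)` and skip or re-tag the token when it returns true. For the filter to ever see `WORD1`, the identifier pattern would have to admit digits (for example `[a-zA-Z][a-zA-Z0-9]*`), since `[a-zA-Z]*` stops before the `1`.
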
Thomas Dickey

It is possible to write a regular expression for the situation you presented.

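To match every lowercase word except `word`, you can write the following. The "any letter but this one" parts are spelled as explicit ranges so that they cannot match non-letters such as digits or newlines:

w|wo|wor|word[a-z]+|([a-vx-z]|w[a-np-z]|wo[a-qs-z]|wor[a-ce-z])[a-z]*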
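
The expression can be checked outside flex. The sketch below is not part of the answer: it assumes a POSIX system with `regcomp`/`regexec`, anchors the expression to the whole string, and writes the negated classes as explicit lowercase ranges. The function name `matches_any_word_but_word` is made up for illustration.

```c
#include <assert.h>
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* Whole-string form of the pattern: any lowercase word except "word". */
static const char *const NOT_WORD_PATTERN =
    "^(w|wo|wor|word[a-z]+|"
    "([a-vx-z]|w[a-np-z]|wo[a-qs-z]|wor[a-ce-z])[a-z]*)$";

/* Returns true if `text` is a lowercase word other than "word". */
bool matches_any_word_but_word(const char *text)
{
    assert(text != NULL);
    regex_t re;
    /* REG_EXTENDED enables alternation and grouping; REG_NOSUB skips captures. */
    if (regcomp(&re, NOT_WORD_PATTERN, REG_EXTENDED | REG_NOSUB) != 0)
        return false;
    bool matched = regexec(&re, text, 0, NULL, 0) == 0;
    regfree(&re);
    return matched;
}
```

Note that inside a flex scanner there is no implicit anchoring: given the input `word`, this single pattern would match the prefix `wor` and leave `d` for the next token, which is one reason the rule-ordering approach is usually preferred.
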

But as @Thomas and @rici pointed out, there are much better solutions (especially once you define a concrete problem).

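Example: "count all words except the word `word`" is in fact very simple (using Thomas's proposal). When two rules match the same length of input, flex picks the one listed first, so `word` is swallowed by the first rule while longer words such as `words` still reach the second: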

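%option noyywrap
%{
#include <stdio.h>
%}
%%
    int i = 0;   /* local to yylex(); counts the matched words */

word        { /* excluded word: ignore it */ }
[a-z]+      { i++; }

.|\n        { /* skip everything else */ }
<<EOF>>     { printf("%d\n", i); return 0; }
%%
int main(void) { return yylex(); }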

(untested)

JJoao