Is there a regex for not including a given word, but matching another pattern?

I have a simple pattern like the following for grabbing words in a parser I'm using.

field = re.compile(r"[a-zA-Z0-9]+")

It works fine for letting the parser determine whether something is a variable or a function, but I'm running into an issue where it also grabs the end keyword that closes code blocks.

    foo = 3
end if <-- end is a keyword and should not be counted as a variable

Is there a way to update the regex to match all the words it currently matches except for the word end?

foo would be a match.

en would be a match.

end would not be a match.

endx would be a match.

voodoogiant

1 Answer

In the comments, @phylogenesis provided a working answer:

\b(?!end\b)[a-zA-Z0-9]+

I'll explain why/how this regex answers your question.

The key is the negative lookahead (?!end\b) with the word boundaries \b performing a crucial supporting role.

The leading \b ensures that the pattern only starts matching at the beginning of a word. Without it, the engine would skip the e and match 'nd' from 'end'. The negative lookahead (?!end\b) then allows the match only if the next three characters are not e, n, d immediately followed by a word boundary, that is, only if the word is not exactly 'end'. The word boundary inside the lookahead makes sure that longer words starting with 'end', such as 'endive' or your 'endx', are not weeded out.
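As a quick sanity check, here is a minimal sketch in Python using the four cases from the question. The name field comes from the question; the sample string is made up for illustration.

```python
import re

# Character class from the question, guarded by the negative lookahead.
field = re.compile(r"\b(?!end\b)[a-zA-Z0-9]+")

# Illustrative input containing the cases from the question plus "end if".
sample = "foo en end endx end if"

matches = field.findall(sample)
print(matches)  # ['foo', 'en', 'endx', 'if']
```

Both occurrences of end are skipped, while en and endx still match.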

Will Barnwell