
I'm trying to use the git diff --word-diff-regex= option, and it seems to reject every kind of lookahead and lookbehind. I'm having trouble pinning down which flavor of regex git uses. For example:

git diff --word-diff-regex='([.\w]+)(?!>)'

This comes back as an invalid regular expression.

I am trying to match all the words that are not HTML tags. For the string below, the regex should match 'Hello', 'World', 'Foo', and 'Bar':

<p> Hello World </p><p> Foo Bar </p>
Papajohn000

2 Answers


The Git source uses regcomp and regexec, which are defined by POSIX 1003.2. The code to compile a diff regexp is:

            if (regcomp(ecbdata->diff_words->word_regex,
                        o->word_regex,
                        REG_EXTENDED | REG_NEWLINE))

which in POSIX means that these are "extended" regular expressions as defined here.

(Not every C library actually implements the same POSIX REG_EXTENDED. Git includes its own implementation, which can be built in place of the system's.)

Edit (per updated question): POSIX EREs have neither lookahead nor lookbehind, nor do they have \w (but [_[:alnum:]] is probably close enough for most purposes).

snipsnipsnip
torek
  • No wonder. I was banging my head over why `\w+` won't work. Thanks to the hint in this answer, `[[:alnum:]]+` now seems to work. I still haven't made up my mind to learn and remember a new set of regex rules, though. – RayLuo Apr 20 '20 at 08:04
  • @RayLuo: there are too many to keep them all straight, but fortunately there are web sites for that. See [this question](https://stackoverflow.com/q/3226325/1256452) and its links, including [regular-expressions.info](http://www.regular-expressions.info/refflavors.html) and [Wikipedia](https://en.wikipedia.org/wiki/Comparison_of_regular-expression_engines). I think I like [this wikipedia page](https://en.wikipedia.org/wiki/Regular_expression) more though. – torek Apr 20 '20 at 08:09

Thanks to the hints in @torek's answer above, I now realize that there are different flavors of regular expression engines, and that they can even differ in syntax.

Even a single program, such as git, can be compiled with different regex engines. For example, this blog post hints that \w would be supported by git, contradicting both what I observed on my machine and what the OP here asked about.

I ended up finding this section of the Wikipedia page you recommended most helpful, because it presents the different syntaxes in one table. That let me "translate" between, for example, [:alnum:] and \w, [:digit:] and \d, [:space:] and \s, etc.

RayLuo