Sed regexp Match only non-valid c++ identifier characters to rename a variable

Question

I want to use sed to rename variables (identifiers). I want to do it for C++, though it should be similar for other languages. Say we have a code sample like this one, example.cpp:

int hi;
int bye;
...//a lot of code with many occurrences of hi

Suppose that, for whatever reason, I want to rename hi to hello. The problem is that hi can occur as part of other words. In C++ a valid identifier follows this recipe: [[[:alpha:]]_]+[[[:alnum:]]_] (Setting aside extended characters like ä or 龍. I do not know whether alnum includes these, but they should be no problem, except perhaps for extended punctuation characters, but who uses those?)

There must be a character that does not belong to this expression next to a valid identifier to distinguish it from other identifiers. So directly before and after hi, an [[[:alnum:]]_] is not allowed, while any other character is. Another problem is strings in "". All of this only works if strings are always one-liners. Then we would have to check for an odd number of unescaped " occurrences, and it may be a mathematical question whether regular expressions can do that at all. However, I did not get that far; my first attempt, without string recognition, was:

sed -i -e 'hi/\([^[[:alnum:]]_]\)hello\([^[[:alnum:]]_]\)/\1r\2/g' example.cpp

It didn't change anything.

It's not really feasible to do this with a regular expression and `sed`, it can't determine all the context properly. Most IDEs have a "rename variable" operation, they know how to parse the language and find actual variable uses. — Barmar, Aug 03 '23 at 17:11
Your IDE might have a function to rename a variable (the compiler has the context to not replace any `n` by `r`) — Jarod42, Aug 03 '23 at 17:12
Instead of beginning to start with replacement, just *search* for it, to make sure the regex is correct. When it is, then you do the replacement, but not in place, let `sed` output the new file to make sure that it does the correct thing. And finally you do the actual replacement (but keep the original!). — Some programmer dude, Aug 03 '23 at 17:13
Ideally you would not use regex but a parse tree and replace occurrences from there. You could have a look at using LLVM and AST (abstract syntax tree): e.g. [introduction to the clang AST](https://releases.llvm.org/3.3/tools/clang/docs/IntroductionToTheClangAST.html) — Pepijn Kramer, Aug 03 '23 at 17:17
Could I talk you into giving the variables descriptive names while performing the substitution? Might not make your life easier right now, but the next sucker to have to deal with this code, and that could be future you, will love you. — user4581301, Aug 03 '23 at 18:10
*"It doesnt changed anything"* ... I think using `[[:alpha:]_]+[[:alnum:]_]` will help with at least one part of your problem. Good points above, but good luck! — shellter, Aug 04 '23 at 00:05

stevesliva · Answer 1 · 2023-08-28T16:38:22.980

Your sed command is garbled: there's no s/// substitution.

Anyway, all you need are word boundaries (\b) on the match side of the substitution:

sed 's/\bhi\b/hello/' example.cpp
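A quick way to check this on a throwaway sample is sketched below. It assumes GNU sed (\b is a GNU extension) and uses a hypothetical scratch file created with mktemp. The g flag is added here because without it only the first match on each line is replaced:

```shell
# Build a small sample file (hypothetical contents, for illustration only).
sample=$(mktemp)
printf '%s\n' 'int hi;' 'int hint = hi + hi;' 'int this_hi;' > "$sample"

# \b matches at word boundaries, so "hint" and "this_hi" are left alone.
# No -i here: the result goes to stdout and the file is untouched.
renamed=$(sed 's/\bhi\b/hello/g' "$sample")
printf '%s\n' "$renamed"

rm -f "$sample"
```

Only the standalone hi occurrences should come out as hello; hint and this_hi keep their names.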

The above does almost the same as this:

sed -E 's/([^[:alnum:]_])hi([^[:alnum:]_])/\1hello\2/' example.cpp

... except that this second form needs an actual nonword character on each side of hi, because both match groups must consume one character.

More discussion of word boundary here.

Note also that your character classes have more square brackets than needed. The negation of [[:alnum:]] is [^[:alnum:]], so your non-word character class should be [^[:alnum:]_]. In GNU sed that is equivalent to \W, so with sed -E (extended regexp, ERE) you can also write:

sed -E 's/(\W)hi(\W)/\1hello\2/' example.cpp

... again with the caveat that hi has to have a nonword character both before and after it (which is maybe a safe assumption for a C variable).

To fix that, you can add the line beginning ^ and end $ cases to this, too, which allows a zero-size match in those cases:

sed -E 's/(^|\W)hi(\W|$)/\1hello\2/' example.cpp

(The above likely works perfectly well, the same as sed 's/\bhi\b/hello/'.)
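To see why the ^ and $ alternatives matter, the sketch below compares the two ERE variants on made-up input lines where hi sits at the very start or end of a line. It assumes GNU sed, since \W is a GNU extension:

```shell
# Made-up input: hi at the start of a line, at the end of a line, and inside a word.
input=$(printf '%s\n' 'hi = 1;' 'return hi' 'chip = 2;')

# Without anchors, each group must consume a real character,
# so hi at the start or end of a line is missed.
without_anchors=$(printf '%s\n' "$input" | sed -E 's/(\W)hi(\W)/\1hello\2/')

# With ^ and $ as alternatives, the groups may match the empty string there.
with_anchors=$(printf '%s\n' "$input" | sed -E 's/(^|\W)hi(\W|$)/\1hello\2/')

printf '%s\n' '--- without anchors ---' "$without_anchors"
printf '%s\n' '--- with anchors ---' "$with_anchors"
```

The first variant should leave all three lines unchanged, while the second renames hi on the first two lines and still leaves chip alone.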

Or you can use perl regex (PCRE) to make the match groups nonconsuming lookbehind (?<=) and lookahead (?=):

perl -pe 's/(?<=\W)hi(?=\W)/hello/' example.cpp

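Nearly the same as this, which inverts the character classes and negates the lookbehind and lookahead (the negated form also matches when hi sits at the very start or end of the input):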

perl -pe 's/(?<!\w)hi(?!\w)/hello/' example.cpp

As you climb the scale of the GNU regex feature set, you can test each pattern with grep first:

$ grep --color '\bhi\b' example.cpp
$ grep -E --color '(^|\W)hi(\W|$)' example.cpp
$ grep -P --color '(?<!\w)hi(?!\w)' example.cpp

... so you will see hi highlighted in color using basic, extended (ERE), and perl (PCRE) regex, all supported by grep. (The ERE above also highlights the nonword chars, if any, before or after)
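Building on the grep check, a cautious workflow might look like the sketch below: count the matches first, write the result to a new file rather than editing in place, then inspect the diff. The file names and contents are hypothetical, and GNU grep and sed are assumed.

```shell
workdir=$(mktemp -d)
printf '%s\n' 'int hi;' 'int high = hi;' > "$workdir/example.cpp"

# 1. Search only: -o prints each match on its own line, wc -l counts them.
matches=$(grep -o '\bhi\b' "$workdir/example.cpp" | wc -l)

# 2. Substitute into a new file; the original stays untouched.
sed 's/\bhi\b/hello/g' "$workdir/example.cpp" > "$workdir/example.new.cpp"

# 3. Review what changed (diff exits non-zero when files differ, hence || true).
changes=$(diff "$workdir/example.cpp" "$workdir/example.new.cpp" || true)
printf 'matches: %s\n%s\n' "$matches" "$changes"

rm -rf "$workdir"
```

Once the diff looks right, the new file can replace the original, or the same substitution can be rerun with sed -i.bak to keep a backup copy.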

But all of these GNU regexp styles support the always-convenient zero-size match of \b for word boundaries. So, use it.
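One concrete reason to prefer the zero-size match shows up once the g flag is added and two occurrences are separated by a single character. A minimal sketch, assuming GNU sed and a made-up input line:

```shell
line='x = hi+hi;'

# Consuming groups: the first match swallows the "+", so the second hi has
# no nonword character left in front of it and is skipped.
consuming=$(printf '%s\n' "$line" | sed -E 's/(^|\W)hi(\W|$)/\1hello\2/g')

# \b is zero-width, so both occurrences are found.
boundary=$(printf '%s\n' "$line" | sed 's/\bhi\b/hello/g')

printf '%s\n' "$consuming" "$boundary"
```

The consuming version should print x = hello+hi; while the \b version prints x = hello+hello;.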
