
I am a beginner with regular expressions, although I know how to use them for searching and replacing.

I want to write a program that detects valid C++ identifiers, e.g.:

_ _f _8 my_name age_ var55 x_ a

And so on...

So I've tried this:

std::string str = "9_var 57age my_marks cat33 fit*ell +bin set_";
std::string pat = "[_a-z]+[[:alnum:]]*";
std::regex reg(pat, std::regex::icase);
std::smatch sm;
if(std::regex_search(str, sm, reg))
    std::cout << sm.str() << '\n';
else
    std::cout << "no valid C++ identifier found!\n";

The output:

_var

But as we know, a C++ identifier must not start with a digit, so 9_var must not be a candidate for a match. What I see here is that the regex takes only the substring _var from 9_var and treats it as a match. I want to discard a whole word such as "9_var". I need some way to get only the matches that start with an alphabetic character or an underscore.

So how can I write a program that detects valid identifiers? Thank you!

Wiktor Stribiżew
Itachi Uchiwa
  • Read several chapters of the [Dragon book](https://en.wikipedia.org/wiki/Compilers:_Principles,_Techniques,_and_Tools). Consider using [GNU bison](https://www.gnu.org/software/bison/) or [ANTLR](https://antlr.org/). Study the source code of [GCC](http://gcc.gnu.org/) – Basile Starynkevitch Jan 15 '21 at 20:48
  • @BasileStarynkevitch: Thanks for the link. However, I am new to the regex library, so could you just answer the question? Later, when I finish my book, C++ Primer, I'll read that compiler book, since many people have recommended it to me. – Itachi Uchiwa Jan 15 '21 at 20:51
  • I am unable to answer in a few paragraphs something that takes a hundred pages to explain. Read also [n3337](http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3337.pdf) or some newer C++ standard – Basile Starynkevitch Jan 15 '21 at 20:52
  • @BasileStarynkevitch: You mean it is difficult for me to achieve? – Itachi Uchiwa Jan 15 '21 at 20:53
  • I mean that you need to read several hundred pages. I don't know you well enough to tell whether it is difficult for you or not. I recommend following university courses on compilation. Both [GCC](http://gcc.gnu.org/) and [Clang](http://clang.llvm.org/) are open source C++ compilers. You are allowed to download their source code and study it. You could be interested in [GNU flex](https://en.wikipedia.org/wiki/Flex_(lexical_analyser_generator)), which is also [open source software](https://en.wikipedia.org/wiki/Open-source_software). – Basile Starynkevitch Jan 15 '21 at 20:54
  • You could also study for inspiration the source code of [fish](http://fishshell.com/) or of [ninja](http://ninja-build.org/). My opinion is that regular expressions might not be the best approach ... I would recommend [finite-state machine](https://en.wikipedia.org/wiki/Finite-state_machine) generators... or [parser generator](https://en.wikipedia.org/wiki/Compiler-compiler)s – Basile Starynkevitch Jan 15 '21 at 21:00
  • Don't forget you need to exclude keywords. (Not sure if `std::regex` allows for lookahead/lookbehind, but good luck writing that as a regex if it's not supported; if it does support lookahead/lookbehind, simply add a positive lookbehind matching space (and any other separator chars) and start of sequence at the start, and a positive lookahead that matches space or end of sequence at the end) – fabian Jan 15 '21 at 21:00
  • @fabian but also include tabs, not just spaces, also newlines, etc. – lionkor Jan 15 '21 at 22:24
  • Does this answer your question? [How do I find a complete word (not part of it) in a string in C++](https://stackoverflow.com/questions/22516463/how-do-i-find-a-complete-word-not-part-of-it-in-a-string-in-c) – Wiktor Stribiżew Jan 23 '21 at 21:46
  • If `\b` does not work, if you also need to count `_` as a non-word char, you may simply use a group: `std::string pat = "(?:^|[^\\W_])([_a-z]\\w*)\\b";` – Wiktor Stribiżew Jan 23 '21 at 21:51

1 Answer

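Your pattern doesn't check for word boundaries, so it can match part of a word. It also allows underscores only before the first digit, so an identifier such as a1_b would be missed once boundaries are required. An updated regex looks like this:

std::string pat = "\\b[_a-z]\\w*\\b";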

With only that updated, the match is the first valid identifier in your string.

$ ./a.out 
my_marks

If you want to find all the valid identifiers, you'll need to loop. You'll also need to filter out reserved words, but regex isn't a good solution for that.

Stephen Newell