I have source code with Cyrillic characters in comments and strings. MSVC also allows Cyrillic characters in identifiers. How can I find all the Cyrillic characters that are outside comments and strings? I want to do this without gcc or scripts, ideally with a simple regex search. It is not difficult to find comments with /*.*?*/ , but how do I find something that is not in a comment and not in the ASCII character set?
-
Does it have to be using regex? – Benjamin Lindley May 23 '12 at 21:12
-
Um. `// this is a comment in C++`, and `" this\" is a string \\"` so (as always with regexes), there's a bit more to it than you say ;-) – Steve Jessop May 23 '12 at 21:12
-
Between comment delimiters in strings, strings in comments, `#if 0`, digraphs, trigraphs, etc., it's going to be difficult to get meaningful results from regexes unless your code base is quite restricted, or you're willing to put up with quite a few incorrect results. IOW, @SteveJessop is right, but it's really even much worse than he implies. – Jerry Coffin May 23 '12 at 21:17
-
That said, identifying the non-ASCII characters should be easy, at least in some regex engines (http://stackoverflow.com/questions/2124010/grep-regex-to-match-non-ascii-characters). It's just that, contrary to what the question says, detecting the comments and strings is the difficult part. – Steve Jessop May 23 '12 at 21:19
-
Steve, yes :) a little bit more. It's not difficult to google a regex that catches a C++ string. But I need to exclude those matches from the search, and I do not know how to do that. If I have regex A and regex B, how do I write a regex of the form "A but not [a part of] B"? – ZAB May 23 '12 at 21:20
-
It won't involve regexes (at all), but you could use the code I posted in a [previous answer](http://stackoverflow.com/a/2319255/179910), and check the characters in identifiers. – Jerry Coffin May 23 '12 at 21:24
-
So if you have a regex you're happy with for stuff to ignore, then you want to check whether the file matches `((ignorable_stuff)|(ascii_character))*`, don't you? Or am I missing something important about the way that greedily matches? – Steve Jessop May 23 '12 at 21:24
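A minimal Python sketch of that whole-file check, under stated assumptions: the `IGNORABLE` patterns below are hypothetical and simplified (no raw strings, trigraphs, line continuations, digit separators, or `#if 0`), and the name `code_is_ascii_only` is made up for illustration. On the greedy-matching worry: a backtracking engine will happily re-tokenize the file to make the match succeed, for example by treating the text between two separate string literals as one string. The sketch guards against that by forbidding the plain-ASCII branch from consuming a character that starts a comment or literal, and by not letting a comment stop early or run past its first terminator.

```python
import re

# Hypothetical, simplified "ignorable stuff": comments and literals only.
IGNORABLE = (
    r'//[^\n]*(?![^\n])'        # line comment, forced to run to end of line
    r'|/\*(?:(?!\*/).)*\*/'     # block comment that stops at its first */
    r'|"(?:\\.|[^"\\\n])*"'     # string literal with backslash escapes
    r"|'(?:\\.|[^'\\\n])*'"     # character literal
)
# An ASCII character that does not begin one of the constructs above.
ASCII_CODE_CHAR = r'''(?!["']|/[/*])[\x00-\x7f]'''

ONLY_ASCII_CODE = re.compile(rf'(?:{IGNORABLE}|{ASCII_CODE_CHAR})*', re.DOTALL)

def code_is_ascii_only(source: str) -> bool:
    """True if everything outside comments and literals is ASCII."""
    return ONLY_ASCII_CODE.fullmatch(source) is not None
```

A `False` result only says that something outside the recognized comments and literals could not be matched; it does not say where, and it can also be triggered by an unterminated comment or quote that these simplified patterns fail to tokenize.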
-
Steve, it would then be difficult to find where in the file an identifier contains a Cyrillic character. The idea was to write a regex and search the whole solution for it right from MSVC. MSVC can search by regex. – ZAB May 23 '12 at 21:32
1 Answer
Let's assume that all comments, including the '/* comment */' kind, behave like '//' comments in the sense that once a comment starts, no more code follows it on the same line. Try piping your source file through this:
perl -lne 'print $1 if m{^([^/]+)(?:/[/*])?}'
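That prints the part of each line that precedes the first '/', which under the assumption above gets you everything but the comments. Note that it also cuts a line at a division operator and does not skip the inner lines of a multi-line comment.
The remaining problem depends on the character encoding. If it is Windows-1251, you can look for patterns like this: '[^\x00-\x7f]+', which matches any run of bytes outside the ASCII range.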
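For comparison, here is a rough Python sketch that does both steps in one pass instead of piping through two commands. It is a different technique from the perl one-liner: a single alternation in which the comment and literal branches come first and simply swallow their text, while only the last branch captures. The names `SCANNER` and `find_non_ascii_in_code` are made up for illustration, the input is assumed to be already decoded to text (for example from Windows-1251), and the patterns ignore raw strings, trigraphs, line continuations, and `#if 0`.

```python
import re

# The ignorable branches come first, so at any position a comment or literal
# wins over the capturing branch and hides its contents from it.
SCANNER = re.compile(
    r'//[^\n]*'                 # line comment
    r'|/\*.*?\*/'               # block comment (non-greedy)
    r'|"(?:\\.|[^"\\\n])*"'     # string literal with backslash escapes
    r"|'(?:\\.|[^'\\\n])*'"     # character literal
    r'|([^\x00-\x7f]+)',        # group 1: non-ASCII run left over in code
    re.DOTALL,
)

def find_non_ascii_in_code(source):
    """Return (line_number, text) for non-ASCII runs outside comments/literals."""
    hits = []
    for m in SCANNER.finditer(source):
        if m.group(1) is not None:
            line = source.count('\n', 0, m.start()) + 1
            hits.append((line, m.group(1)))
    return hits

sample = (
    'int счёт = 1; // счётчик\n'
    'const char* s = "строка";\n'
    '/* комментарий */ double вес;\n'
)
print(find_non_ascii_in_code(sample))
```

The same alternation could in principle be pasted into an editor's regex search, but the editor would then also stop on every comment and string, since it cannot filter on which group matched.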

cheapwax
-
The question was not to pipe the source through one regex, cut the comments, and then match the result against another regex. The question was to cut and match at the same time in one regex. That is useful when you search by regex in MSVC, for example. – ZAB Nov 28 '12 at 06:17