8

For non-MATLAB-savvy readers: I'm not sure what family they belong to, but the MATLAB regexes are described here in full detail. MATLAB's comment character is % (percent) and its string delimiter is ' (apostrophe). A string delimiter inside a string is written as a doubled apostrophe ('this is how you write "it''s" in a string.'). To complicate matters further, the matrix transpose operators also use apostrophes: A' (Hermitian) and A.' (regular).
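As a quick illustration of the three roles of the apostrophe described above, here is a minimal sketch (the variable names are only for illustration):

```matlab
s = 'it''s';      % doubled apostrophe inside a string: the 4 characters it's
A = [1+2i, 3];    % 1-by-2 complex row vector
B = A';           % complex conjugate (Hermitian) transpose: entries are conjugated
C = A.';          % plain transpose: entries are left unchanged
```

Both `B` and `C` are 2-by-1 columns; they differ only in the sign of the imaginary part.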

Now, for dark reasons (that I will not elaborate on :), I'm trying to interpret MATLAB code in MATLAB's own language.

Currently I'm trying to remove all trailing comments in a cell-array of strings, each containing a line of MATLAB code. At first glance, this might seem simple:

>> str = 'simpleCommand(); % simple trailing comment';
>> regexprep(str, '%.*$', '')
ans =
    simpleCommand(); 

But of course, something like this might come along:

>> str = ' fprintf(''%d%*c%3.0f\n'', value, args{:}); % Let''s do this! ';
>> regexprep(str, '%.*$', '') 
ans = 
    fprintf('        %//   <-- WRONG!

Obviously, we need to exclude all comment characters that reside inside strings from the match, while also taking into account that a single apostrophe (or a dot-apostrophe) directly following a statement is an operator, not a string delimiter.

Based on the assumption that the number of string opening/closing characters before the comment character must be even (which I know is incomplete, because of the matrix-transpose operator), I conjured up the following dynamic regex to handle this sort of case:

>> str = {
       'myFun( {''test'' ''%''}); % let''s '                 
       'sprintf(str, ''%*8.0f%*s%c%3d\n''); % it''s '        
       'sprintf(str, ''%*8.0f%*s%c%3d\n''); % let''s '       
       'sprintf(str, ''%*8.0f%*s%c%3d\n'');  '
       'A = A.'';%tight trailing comment'
   };
>> 
>> C = regexprep(str, '(^.*)(?@mod(sum(\1==''''''''),2)==0;)(%.*$)', '$1')

However,

C = 
    'myFun( {'test' '%'}); '              %// success
    'sprintf(str, '%*8.0f%*s%c%3d\n'); '  %// success
    'sprintf(str, '%*8.0f%*s%c%3d\n'); '  %// success
    'sprintf(str, '%*8.0f%*s%c'           %// FAIL
    'A = A.';'                            %// success (although I'm not sure why)

so I'm almost there, but not quite yet :)
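To make the weakness of the even-count assumption concrete, here is a small sketch (the names `s`, `idx` and `nQuotes` are only for illustration). A transpose before the comment makes the apostrophe count odd even though the % really does start a comment:

```matlab
s = 'A = A''; % it''s';              % the code line:  A = A'; % it's
idx = find(s == '%', 1);             % position of the first percent sign
nQuotes = sum(s(1:idx-1) == '''');   % apostrophes before it: only the transpose
isEven = mod(nQuotes, 2) == 0;       % false, so a pure parity test rejects this comment
```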

Unfortunately I've exhausted the amount of time I can spend thinking about this and need to continue with other things, so perhaps someone else who has more time is friendly enough to think about these questions:

  1. Are comment characters inside strings the only exception I need to look out for?
  2. What is the correct and/or more efficient way to do this?
Rody Oldenhuis
  • I don't know matlab but how about using non greedy quantifier: `%.*?$` – Toto Jun 28 '13 at 07:36
  • @M42: well that seems to fix the issue, at least for my small subset of tests...Can you post as an answer? – Rody Oldenhuis Jun 28 '13 at 07:41
  • @M42: no wait, I didn't look close enough -- it *doesn't* fix the issue...in fact, it doesn't do anything anymore :) – Rody Oldenhuis Jun 28 '13 at 07:42
  • Well, if instructions end with semicolon, may be this will work: substitute `;\s*%.*?$` by `;` – Toto Jun 28 '13 at 07:49
  • @M42: unfortunately, not all lines end in semicolon (for example, `if (condition) % trailing comment` is a commonly occurring pattern) – Rody Oldenhuis Jun 28 '13 at 07:52
  • Too bad, so I guess the best way is a parser. – Toto Jun 28 '13 at 07:57
  • Are you going to keep directives like `%#ok` and `%#codegen`, or it is OK to remove them as well? – Mohsen Nosratinia Jun 28 '13 at 08:10
  • @MohsenNosratinia: I'm only interested in the statements, so it's OK to remove them – Rody Oldenhuis Jun 28 '13 at 08:21
  • Somehow I feel that it would be easier to write a finite state machine parser than to write its corresponding regex. – John Dvorak Jun 28 '13 at 08:37
  • @JanDvorak: Probably you are right...I'm just exploring the possibility that I've overlooked something obvious (or not-so-obvious-and-easy-to-miss) – Rody Oldenhuis Jun 28 '13 at 08:46
  • @RodyOldenhuis if Matlab supports unbounded lookbehind, you can convert that FSM into a regex, and use that as the lookbehind. The starting and ending state for the lookbehind would be the "normal" context. The part after the lookbehind would be easy. – John Dvorak Jun 28 '13 at 08:50
  • @JanDvorak: AFAIK, MATLAB indeed has support for this. It's then "just" a matter of finding the regex that finds: "the longest substring where everything before the comment character contains an even number of string enclosing characters, where the string enclosing characters are counted *only* if they are not directly attached to a non-space character that is not in a string itself."...sounds like fun :) – Rody Oldenhuis Jun 28 '13 at 09:10
  • So, every apostrophe that is not directly after `A`, `A.` or commented out is a string delimiter? – John Dvorak Jun 28 '13 at 09:13
  • How does matlab treat newlines inside strings? – John Dvorak Jun 28 '13 at 09:15
  • @JanDvorak: indeed (of course, your '`A`' is any valid statement, and '`'`' and '`.'`' are the transpose operators). Newlines in strings are not supported directly (you'd construct a 2D array of strings to accomplish that, *or* use `char(10)` in a 1D array, *or* use `\n` in `fprintf()` and friends.) – Rody Oldenhuis Jun 28 '13 at 09:37
  • umm... then I need the grammar for a valid statement, or at least a grammar disambiguating between the valid context for a matrix transposition and the valid context for a string literal. I don't know the syntax of Matlab, but I do understand the theory of parsing a specific grammar. – John Dvorak Jun 28 '13 at 09:40
  • @JanDvorak: the details are of course a bit too long for a comment, but I think you'll find [this question](http://stackoverflow.com/questions/3627107/how-can-i-index-a-matlab-array-returned-by-a-function-without-first-assigning-it) useful. For the transposes: every apostrophe preceded by a non-whitespace and non-apostrophe character is a matrix transpose operator, provided it is not inside an apostrophe-delimited string. – Rody Oldenhuis Jun 28 '13 at 10:01
  • @JanDvorak: An apostrophe not preceded by a string itself, and *directly* preceded by `\s`, `\(`, `\[` or `\{`, is a string opening character. The next odd apostrophe, when not counting *double* apostrophes, is a string terminator (as you may notice, I'm having trouble even defining it :)) – Rody Oldenhuis Jun 28 '13 at 10:02
  • by "not preceded by a string itself", do you mean that in `'hello' '`, the last `'` denotes a matrix transpose? What about `'hello' op 'world'`? – John Dvorak Jun 28 '13 at 10:05
  • @JanDvorak: `'hello''` is not valid (directly preceded by apostrophe), while `'hello'.'` is indeed a valid transpose of the character array. Something like `'hello'.^ 'world'` is a valid statement (`.^` is element-wise exponentiation, and a character array is (like C) just an array of numbers) – Rody Oldenhuis Jun 28 '13 at 10:11

5 Answers

5

How do you feel about using undocumented features? If you don't object, you can use the mtree function to parse the code and strip the comments. No regexps involved, and we all know that we shouldn't try to parse context-free grammars using regular expressions.

This function is a full parser of MATLAB code written in pure M-code. As far as I can tell, it is an experimental implementation, but it's already used by MathWorks in a few places (this is the same function used by MATLAB Cody and Contests to measure code length), and can be used for other useful things.

If the input is a cell array of strings, we do:

>> str = {..};
>> C = deblank(cellfun(@(s) tree2str(mtree(s)), str, 'UniformOutput',false))
C = 
    'myFun( { 'test', '%' } );'
    'sprintf( str, '%*8.0f%*s%c%3d\n' );'
    'sprintf( str, '%*8.0f%*s%c%3d\n' );'
    'sprintf( str, '%*8.0f%*s%c%3d\n' );'
    'A = A.';'

If you already have an M-file stored on disk, you can strip the comments simply as:

s = tree2str(mtree('myfile.m', '-file'))

If you want to keep the comments in the output, add the '-comments' option: mtree(.., '-comments')

Community
Amro
  • to be exact, `mtree` calls a builtin function `mtreemex`, so it's not a pure M-code function – Amro Jun 28 '13 at 18:35
4

This handles the conjugate-transpose case by checking which characters are allowed immediately before a transpose apostrophe:

  1. Numbers 2'
  2. Letters A'
  3. Dot A.'
  4. Closing parenthesis, brace and bracket A(1)', A{1}' and [1 2 3]'

These are the only cases I can think of now.

C = regexprep(str, '^(([^'']*''[^'']*''|[^'']*[\.a-zA-Z0-9\)\}\]]''[^'']*)*[^'']*)%.*$', '$1')

On your example it returns

>> C = regexprep(str, '^(([^'']*''[^'']*''|[^'']*[\.a-zA-Z0-9\)\}\]]''[^'']*)*[^'']*)%.*$', '$1')

C = 

    'myFun( {'test' '%'}); '
    'sprintf(str, '%*8.0f%*s%c%3d\n'); '
    'sprintf(str, '%*8.0f%*s%c%3d\n'); '
    'sprintf(str, '%*8.0f%*s%c%3d\n');  '
    'A = A.';'
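
For comparison, the same rule can be sketched as an explicit character scan instead of a regex. This is only a sketch under stated assumptions: `stripTrailingComment` is a hypothetical helper name, it additionally treats an underscore or a preceding transpose as valid predecessors of a transpose, and it ignores double-quoted strings and `%{ ... %}` block comments.

```matlab
function out = stripTrailingComment(codeLine)
% Hypothetical helper: remove a trailing % comment from one line of MATLAB code.
inString = false;                  % true while inside a '...' string literal
out = codeLine;
k = 1;
while k <= numel(codeLine)
    c = codeLine(k);
    if inString
        if c == ''''
            if k < numel(codeLine) && codeLine(k+1) == ''''
                k = k + 1;         % doubled apostrophe: escaped quote, stay in string
            else
                inString = false;  % closing delimiter
            end
        end
    elseif c == ''''
        % Right after a word character, dot, closing bracket or another
        % transpose, an apostrophe is a transpose; otherwise it opens a string.
        isTranspose = k > 1 && ...
            ~isempty(regexp(codeLine(k-1), '[\w.\)\}\]'']', 'once'));
        inString = ~isTranspose;
    elseif c == '%'
        out = codeLine(1:k-1);     % first % outside a string starts the comment
        return
    end
    k = k + 1;
end
end
```

A cell array can be processed with `cellfun(@stripTrailingComment, str, 'UniformOutput', false)`.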
Mohsen Nosratinia
  • +1 (again): starting to look good...further testing required though. – Rody Oldenhuis Jun 28 '13 at 10:03
  • @M42 Your comment is superfluous since he's using `[^'']*`. – Oleg Jun 28 '13 at 13:10
  • @M42 It's not the case. You see `[^'']` but it's not the actual string. Matlab will convert it to `[^']`. If you want to have an apostrophe in a MATLAB string you have to repeat it. If you run `disp('It''s a string')`, you get `It's a string`. – Mohsen Nosratinia Jun 28 '13 at 13:31
  • @MohsenNosratinia: Ok, sorry, my bad. I don't know MATLAB syntax. – Toto Jun 28 '13 at 13:38
  • Look at my answer :) Can you see if you can find a construct on which it fails? – Rody Oldenhuis Jun 28 '13 at 14:09
  • I'll still accept your answer, since also your solution has withstood my tests up to now, and your hacky efforts are greatly appreciated :) – Rody Oldenhuis Jun 28 '13 at 14:13
  • Hmm...something that hasn't occurred to me yet: multi-line comments (`%{ ... %}`) should ideally *also* be removed (completely)... – Rody Oldenhuis Jun 28 '13 at 14:26
  • Thanks for choosing this one. I think if you want to handle multi-lines you need to change your data structure. Cell array will cause more trouble. You need to bring it back to a long multi-line string. – Mohsen Nosratinia Jun 28 '13 at 14:43
  • Well, I'll treat multiline comments separately (pretty easy, although not in a one-liner AFAIK)...I think it's pretty much impossible to strip *also those* in a one-liner :) – Rody Oldenhuis Jun 28 '13 at 15:20
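
A separate pass for the multi-line comments mentioned in the comments above could look like the following sketch. It assumes the code is held as a cell array with one line per cell and relies on the MATLAB rule that `%{` and `%}` must stand alone on their lines; `stripBlockComments` is a hypothetical helper name.

```matlab
function out = stripBlockComments(codeLines)
% Hypothetical helper: drop %{ ... %} block comments from a cell array of lines.
depth = 0;                           % nesting level of open block comments
keep = true(size(codeLines));
for k = 1:numel(codeLines)
    t = strtrim(codeLines{k});
    if strcmp(t, '%{')
        depth = depth + 1;           % block opens: drop the marker line
        keep(k) = false;
    elseif strcmp(t, '%}') && depth > 0
        depth = depth - 1;           % block closes: drop the marker line
        keep(k) = false;
    else
        keep(k) = (depth == 0);      % keep only lines outside any block
    end
end
out = codeLines(keep);
end
```

Trailing comments on the remaining lines can then be stripped line by line with any of the approaches in this thread.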
4

Look what I found! :)

The comment stripping toolbox, by Peter J. Acklam.

For m-code, it contains the following regex:

mainregex = [ ...
     ' (                   ' ... % Grouping parenthesis (content goes to $1).
     '   ( ^ | \n )        ' ... % Beginning of string or beginning of line.
     '   (                 ' ... % Grouping parenthesis.
     '                     ' ...
     '' ... % Match anything that is neither a comment nor a string...
     '       (             ' ... % Grouping parenthesis.
     '           [\]\)}\w.]' ... % Either a character followed by
     '           ''+       ' ... %    one or more transpose operators
     '         |           ' ... % or else
     '           [^''%]    ' ... %   any character except single quote (which
     '                     ' ... %   starts a string) or a percent sign (which
     '                     ' ... %   starts a comment).
     '       )+            ' ... % Match one or more times.
     '                     ' ...
     '' ...  % ...or...
     '     |               ' ...
     '                     ' ...
     '' ...  % ...match a string.
     '       ''            ' ... % Opening single quote that starts the string.
     '         [^''\n]*    ' ... % Zero or more chars that are neither single
     '                     ' ... %   quotes (special) nor newlines (illegal).
     '         (           ' ... % Grouping parenthesis.
     '           ''''      ' ... % An embedded (literal) single quote character.
     '           [^''\n]*  ' ... % Again, zero or more chars that are neither
     '                     ' ... %   single quotes nor newlines.
     '         )*          ' ... % Match zero or more times.
     '       ''            ' ... % Closing single quote that ends the string.
     '                     ' ...
     '   )*                ' ... % Match zero or more times.
     ' )                   ' ...
     ' [^\n]*              ' ... % What remains must be a comment.
              ];

  % Remove all the blanks from the regex.
  mainregex = mainregex(~isspace(mainregex));

Which becomes

mainregex  = '((^|\n)(([\]\)}\w.]''+|[^''%])+|''[^''\n]*(''''[^''\n]*)*'')*)[^\n]*'

and should be used as

C = regexprep(str, mainregex, '$1')

So far, it's withstood all of my tests, so I think this should solve my problem quite nicely :)
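
Because the pattern accounts for newlines, it can apparently also be applied to a whole block of code at once. Here is a minimal sketch, assuming the lines are joined with `\n` first (`codeLines`, `txt` and `out` are names chosen for illustration):

```matlab
mainregex = '((^|\n)(([\]\)}\w.]''+|[^''%])+|''[^''\n]*(''''[^''\n]*)*'')*)[^\n]*';

codeLines = {'x = A''; % it''s a transpose', 'disp(''100%''); % done'};
txt = sprintf('%s\n', codeLines{:});     % join into one newline-terminated string
out = regexprep(txt, mainregex, '$1');   % strip the trailing comment on every line
```

The percent sign inside `'100%'` survives, while both trailing comments are removed.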

Rody Oldenhuis
2

I prefer to abuse checkcode (the replacement for the old mlint) to do the parsing. Here is a suggestion:

function strNC = removeComments(str)
if iscell(str)
    strNC = cellfun(@removeComments, str, 'UniformOutput', false);
elseif regexp(str, '%', 'once')
    err = getCheckCodeId(str);
    strNC = regexprep(str, '%[^%]*$', '');
    errNC = getCheckCodeId(strNC);
    if strcmp(err, errNC),
        strNC = removeComments(strNC);
    else
        strNC = str;
    end
else
    strNC = str;
end
end

function errid = getCheckCodeId(line)
fName = 'someTempFileName.m';
fh = fopen(fName, 'w');
fprintf(fh, '%s\n', line);
fclose(fh);
if exist('checkcode')
    structRep = checkcode(fName, '-id');
else
    structRep = mlint(fName, '-id');
end
delete(fName);
if isempty(structRep)
    errid = '';
else
    errid = structRep.id;
end
end

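For each line, it checks whether trimming the line from the last % to the end of the line changes the message reported by checkcode. If it does not, the trimmed part was a comment and the function recurses on the shortened line.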

For your example it returns:

>> removeComments(str)

ans = 

    'myFun( {'test' '%'}); '
    'sprintf(str, '%*8.0f%*s%c%3d\n'); '
    'sprintf(str, '%*8.0f%*s%c%3d\n'); '
    'sprintf(str, '%*8.0f%*s%c%3d\n');  '
    'A = A.';'

It does not remove the suppression directive, %#ok, so you get:

>> removeComments('a=1; %#ok')

ans =

a=1; %#ok

Which probably is a good thing.

Mohsen Nosratinia
  • +1: you, my good sir, are quickly becoming my new hero! You're so full of dirty tricks :p Anyway, although I agree that this is probably the more robust and "better" way to go, it feels awfully verbose...plus, it creates a version dependency: I'm on R2010a, so I'll have to use `mlint` instead of `checkcode` (which, ideally, I'd have to check for). So I'll leave the question open for a while longer, see if a regex one-liner is *really* not possible. – Rody Oldenhuis Jun 28 '13 at 09:02
  • Thanks for the kind words! Indeed not as succinct as a one-liner `regexprep`. I'm also curious about a solution with regex, but I see many special cases to treat, mostly due to the fact that `'` can be used both for strings and for the conjugate transpose. – Mohsen Nosratinia Jun 28 '13 at 09:20
  • Yes, that apostrophe is a nasty one isn't it? – Rody Oldenhuis Jun 28 '13 at 09:38
1

How about making sure all apostrophes before the comment come in pairs, like this:

>> str = {
       'myFun( {''test'' ''%''}); % let''s '                 
       'sprintf(str, ''%*8.0f%*s%c%3d\n''); % it''s '        
       'sprintf(str, ''%*8.0f%*s%c%3d\n''); % let''s '       
       'sprintf(str, ''%*8.0f%*s%c%3d\n'');  '
   };

>> C = regexprep(str, '^(([^'']*''[^'']*'')*[^'']*)%.*$', '$1')

C = 
    myFun( {'test' '%'}); 
    sprintf(str, '%*8.0f%*s%c%3d\n'); 
    sprintf(str, '%*8.0f%*s%c%3d\n'); 
    sprintf(str, '%*8.0f%*s%c%3d\n'); 
Rody Oldenhuis
ahilsend