Can someone explain what MATLAB is doing with NUL bytes (\x00) in regular expressions?

Examples:

>> regexp(char([0 0 0 0 0 0 0 1 0 0 10 0 0 0]),char([0 0 0 0 46 0 0 10]))
ans =
      1  % current
      4  % expected

>> regexp(char([0 0 0 1 0 0 0 1 0 0 10 0 0 0]),char([1 0 0 0 46 0 0 10]))
ans =
      4  % current
      4  % expected

>> regexp(char([0 0 0 1 0 0 0 1 0 0 10 0 0 0]),char([0 0 0 0 46 0 0 10]))
ans =
      [] % current
      [] % expected

>> regexp(char([0 0 0 0 10 0 0 1 0 0 10 0 0 0]),char([0 0 0 0 46 0 0 10]))
ans =
      1  % current
      [] % expected

>> regexp(char([0 0 0 0 0 0 0 1 0 0 10 0 0 0]),char([1 0 0 0 46 0 0 10]))
ans =
      [] % current
      [] % expected

The answer might simply be that MATLAB's regular expressions aren't meant to handle non-printable characters, but I would expect an error if that were the case.

EDIT: The 46 is expected to be '.' as in the regex wildcard.
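As a cross-check of the "expected" values in the five examples above, here is a minimal sketch that runs the same byte sequences through a different engine, Python's `re` module. This is not MATLAB, and the helper name `starts` is made up for illustration. It assumes `.` should match any byte (hence `re.DOTALL`) and adds 1 to each offset to mimic MATLAB's 1-based indices.

```python
import re

def starts(subject, pattern):
    """Return 1-based start indices of non-overlapping matches."""
    subj = bytes(subject)
    pat = bytes(pattern)  # byte 46 is '.', the only metacharacter used here
    # DOTALL: assume '.' may match any byte, including 10 (newline).
    return [m.start() + 1 for m in re.finditer(pat, subj, re.DOTALL)]

# (subject bytes, pattern bytes) for the five examples, in order.
cases = [
    ([0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 10, 0, 0, 0], [0, 0, 0, 0, 46, 0, 0, 10]),
    ([0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 10, 0, 0, 0], [1, 0, 0, 0, 46, 0, 0, 10]),
    ([0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 10, 0, 0, 0], [0, 0, 0, 0, 46, 0, 0, 10]),
    ([0, 0, 0, 0, 10, 0, 0, 1, 0, 0, 10, 0, 0, 0], [0, 0, 0, 0, 46, 0, 0, 10]),
    ([0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 10, 0, 0, 0], [1, 0, 0, 0, 46, 0, 0, 10]),
]

results = [starts(s, p) for s, p in cases]
for i, r in enumerate(results, start=1):
    print(i, r)
# prints:
# 1 [4]
# 2 [4]
# 3 []
# 4 []
# 5 []
```

Python's engine agrees with the "expected" column in all five cases, which suggests the expectations are reasonable and the odd results are specific to MATLAB's engine.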

EDIT2:

>> regexp(char([0 0 0 0 50 0 0 100 0 0 90 0 0 0]),char([0 0 46 0 0 90]))
ans =
     1    9

I realized the 10 might be treated as a special character, so this example uses only printable characters and NUL bytes. I would expect this one to only match 9, because the fifth character (50) does not match 0.

zessx
horriblyUnpythonic
  • Why do you expect your first example to return `4` and your fourth example to return `[]`? They seem to make sense to me. In the first case, the pattern `NUL NUL NUL NUL . NUL NUL` will match `NUL NUL NUL NUL NUL NUL NUL` at the beginning of the string. – eigenchris Mar 23 '15 at 17:53
  • @eigenchris I would think it would only match 4, because the 10 at the end of the pattern shouldn't match the 1 in either case, right? – horriblyUnpythonic Mar 23 '15 at 17:56
  • You're correct. The newline character wasn't behaving the way I first expected. So far this has me stumped as well. – eigenchris Mar 23 '15 at 18:12
  • Escaping the `Null` does not change the behaviour `regexp(char([0 0 0 0 50 0 0 100 0 0 90 0 0 0]),char(['\x00\x00',46,'\x00\x00',90]))` – Daniel Mar 23 '15 at 18:20
  • It appears that prefixing the regex pattern with `0 46 0` will cause an automatic match for any subsequent characters in the pattern, regardless of what they are: `regexp(char([0 0 0 1 2 3 4]),char([0 46 0 11 22 33 44]))` – eigenchris Mar 23 '15 at 18:35
  • I can't find a way to make sense of this. Octave seems to show different behaviour than MATLAB for the examples given in the original question. This may just be a quirk for MATLAB's particular implementation of a regex engine. If this is a real problem for the project you're working on, you could try asking over at [MATLAB Answers](http://www.mathworks.com/matlabcentral/answers/?s_tid=gn_mlc_an) and see if someone there can figure it out. – eigenchris Mar 23 '15 at 18:57
  • Asked here: http://www.mathworks.com/matlabcentral/answers/184656-nul-characters-and-wildcards-in-regexp – horriblyUnpythonic Mar 23 '15 at 19:26
  • Adding another 46 later in the string changes this behavior, e.g., `regexp( char([0 15 0 105 105]), char([0 46 0 106 106]) )` matches but `regexp( char([0 15 0 105 105]), char([0 46 0 46 106]) )` doesn't. – Tokkot Aug 07 '15 at 22:18
  • What `version` of Matlab are you using? I can't replicate this in R2015b, but I can in R2015a. Seems like it may be a bug that has been fixed. – horchler Sep 14 '15 at 21:28
  • @horchler Same thing for me! With 8.6.0.267246 (R2015b) I get the expected results, while R2015a gives me the weird ones. – Matthew Gunn Nov 09 '15 at 10:37

1 Answer

This bug has probably already been fixed. I tested your example from MATLAB Central in several versions:

in R2013b:

>> regexp(char([0 0 1 0  41 41 41 41 41 41]),char([0 '.' 0  40 40 40 40]))    
ans =

     2

in R2015a:

>> regexp(char([0 0 1 0  41 41 41 41 41 41]),char([0 '.' 0  40 40 40 40]))   
ans =  

     2

in R2016a:

>> regexp(char([0 0 1 0  41 41 41 41 41 41]),char([0 '.' 0  40 40 40 40]))
ans = 

     []
tvo