Consequences of Inserting Positive Lookbehind into Arbitrary Regex to Simulate Byte Offset

Question

What would be the consequences of inserting a positive lookbehind for n-bytes, (?<=\C{n}), into the beginning of any arbitrary regular expression, particularly when used for replacement operations?

At least within PHP, the regex match functions, preg_match and preg_match_all, allow for matching to begin after a given byte offset. There is no corresponding feature in any of the other PCRE PHP functions - you can specify a limit to the number of replacements done by preg_replace for instance, but not that those replacements' matches must occur after n-bytes.

There would obviously be some (let's call them trivial) consequences for performance and readability, but would there be any non-trivial impacts, such as matches becoming non-matches (other than those that start before the n-byte offset) or replacements becoming malformed?

Some examples:

/some expression/ becomes /(?<=\C{4})some expression/ for a 4-byte offset

/(this) has (groups)/i becomes /(?<=\C{2})(this) has (groups)/i for a 2-byte offset

As far as I can tell, and from the limited tests that I've run, adding in this lookbehind effectively simulates this offset parameter and doesn't mess with any other lookbehinds, substitutions, or other control patterns; but I'm also not an expert on Regex.

I'm trying to determine if there are any likely consequences to building replace/filter function extensions by inserting the n-byte lookbehind into patterns. It should operate just as the match functions' offset parameter works - so simply running the expression against substr( $subject, $offset ) won't work for the same reasons it doesn't for preg_match (most notably it cuts off any lookbehinds and ^ then incorrectly matches the start of the substring, not the original string).
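
To make the comparison concrete, here is a minimal sketch of the three approaches, assuming non-UTF mode (no u flag) and an 8-bit PCRE build in which \C is available. The subject strings and variable names are made up for illustration.

```php
<?php
$subject = 'foobar';
$offset  = 3;

// 1. Offset parameter: ^ still means the real start of $subject, so no match.
$viaOffset = preg_match('/^bar/', $subject, $m, 0, $offset);   // 0

// 2. substr(): ^ now matches the start of the substring.
$viaSubstr = preg_match('/^bar/', substr($subject, $offset));  // 1

// 3. Injected lookbehind: the whole subject stays visible, so ^ fails again.
$viaLookbehind = preg_match('/(?<=\C{3})^bar/', $subject);     // 0

// The same injection in a replacement: only matches that start at or after
// byte 4 are replaced.
$replaced = preg_replace('/(?<=\C{4})a/', 'X', 'aaaaaa');
echo $replaced, "\n";                                          // aaaaXX
```

In this small case the injected lookbehind agrees with the offset parameter, while the substr() approach does not.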

If you only deal with ASCII, or with UTF-8 containing only ASCII characters, then it is probably safe (since 1 byte = 1 char). Otherwise, there WILL be correctness problems. — nhahtdh, Nov 28 '14 at 07:40
That's actually the point of using the `\C` byte pattern over something like `.` in that it will exactly match the behavior of `preg_match`'s offset parameter, which offsets in bytes, not characters. It's sometimes an annoyance, but by using `\C` and going by byte not character it is at least consistent with the `preg_match` and `preg_match_all` — Brian North, Nov 28 '14 at 07:46
Maybe some additional info from chat starting from [here](http://chat.stackoverflow.com/transcript/message/20295694#20295694) probably until [here](http://chat.stackoverflow.com/transcript/message/20296739#20296739) — HamZa, Dec 09 '14 at 10:41

nhahtdh · Accepted Answer · 2015-05-25T08:31:02.530

Short answer

In non-UTF mode, UTF-8 library

Assuming the PCRE library bundled with your PHP is compiled as the 8-bit library (UTF-8), then in non-UTF mode

\C

is equivalent to

[\x00-\xff]

and

(?s:.)

Any of them can be used in a look-behind as a replacement for the offset parameter of the preg_match and preg_match_all functions.

In non-UTF mode, each of them matches exactly 1 data unit, which is 1 byte in the 8-bit (UTF-8) PCRE library, and each of them matches all 256 possible byte values.
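
A quick way to check this equivalence is to run all three constructs over a subject that contains every byte value once. This is only a sketch, assuming non-UTF mode and a PCRE build that has not disabled \C; the array keys are arbitrary labels.

```php
<?php
// Build a 256-byte subject holding every possible byte value exactly once.
$subject = '';
for ($b = 0; $b < 256; $b++) {
    $subject .= chr($b);
}

$patterns = [
    'C'     => '/\C/',
    'class' => '/[\x00-\xff]/',
    'dot'   => '/(?s:.)/',
];

$counts = [];
foreach ($patterns as $name => $pattern) {
    // No u flag, so each match should consume exactly one byte.
    $counts[$name] = preg_match_all($pattern, $subject, $matches);
}

print_r($counts); // each count is 256 when every byte is matched individually
```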

In UTF mode, UTF-8 library

UTF mode can be activated by the u flag in the pattern passed to a preg_* function, or by specifying one of the (*UTF), (*UTF8), (*UTF16), (*UTF32) verbs at the beginning of the pattern.

In UTF mode, a character class [] and the dot metacharacter . each match one code point that lies within the valid Unicode range and is not a surrogate. Since one code point is encoded as 1 to 4 bytes in UTF-8, the encoding scheme makes it impossible to use a character class to match a single byte with a value in the range 0x80 to 0xFF.
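
The difference is easy to see on a two-byte character. A minimal sketch, using "é" (U+00E9, encoded as the bytes C3 A9 in UTF-8) as an arbitrary test subject:

```php
<?php
$subject = "\xC3\xA9"; // "é" in UTF-8: one code point, two bytes

preg_match('/./',  $subject, $byteMode);          // non-UTF mode
preg_match('/./u', $subject, $utfMode);           // UTF mode
preg_match('/[\x80-\xff]/u', $subject, $utfClass); // class in UTF mode

var_dump(strlen($byteMode[0])); // int(1): the dot consumed one byte
var_dump(strlen($utfMode[0]));  // int(2): the dot consumed one code point
var_dump(strlen($utfClass[0])); // int(2): \x80-\xff means U+0080 to U+00FF here
```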

While \C is specifically designed to match one data unit (which is one byte in UTF-8) regardless of whether UTF mode is on, it is not supported inside a look-behind in UTF mode.
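
A small sketch of what this looks like from PHP, assuming a PCRE build that rejects \C in a look-behind in UTF mode as the documentation describes. The pattern fails to compile, so preg_match() returns false and raises a warning (suppressed with @ below).

```php
<?php
$subject = 'aab';

// Non-UTF mode: the look-behind compiles and "b" is found at byte 2.
$plain = preg_match('/(?<=\C{2})b/', $subject);

// UTF mode: compilation is expected to fail, so the call returns false
// instead of 0 or 1.
$utf = @preg_match('/(?<=\C{2})b/u', $subject);

var_dump($plain); // int(1)
var_dump($utf);   // bool(false) when the pattern is rejected
```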

UTF-16 and UTF-32 library

I don't know whether anyone actually compiles the 16-bit or 32-bit PCRE library, bundles it with PHP, and makes it work. If anyone knows of such a build being widely used in the wild, please ping me. I have no idea how the string and the offset would be passed from PHP to the C API of PCRE in that case, and the result of the preg_* functions may differ depending on that.

More details

At the C API level of the PCRE library, you can only work with data units, which are 8-bit units for the 8-bit library, 16-bit units for the 16-bit library, and 32-bit units for the 32-bit library.

For the 8-bit library (UTF-8), 1 data unit is 8 bits, or 1 byte, so there is little obstacle to specifying an offset in bytes, whether as a function parameter or as a regex construct.

Regex constructs

In non-UTF mode, a character class [], the dot . and \C each match exactly 1 data unit.

\C matches 1 data unit regardless of whether UTF mode is on. It can't be used in a look-behind in UTF mode, though.

MATCHING A SINGLE DATA UNIT

Outside a character class, the escape sequence \C matches any one data unit, whether or not a UTF mode is set.

. matches 1 data unit in non-UTF mode.

General comments about UTF modes

[...]
1. The dot metacharacter matches one UTF character instead of a single data unit.

A character class matches 1 data unit in non-UTF mode. The documentation doesn't explicitly state this, but it's implied by the wording.

SQUARE BRACKETS AND CHARACTER CLASSES

[...]

A character class matches a single character in the subject. In a UTF mode, the character may be more than one data unit long.

The same conclusion can be reached by looking at the upper limit of the \x{hh...} syntax for specifying a character by hex code in non-UTF mode. In my testing, the last clause about surrogates doesn't seem to apply in non-UTF mode.

Characters that are specified using octal or hexadecimal numbers are limited to certain values, as follows:
```
 8-bit non-UTF mode    less than 0x100
 8-bit UTF-8 mode      less than 0x10ffff and a valid codepoint
 16-bit non-UTF mode   less than 0x10000
 16-bit UTF-16 mode    less than 0x10ffff and a valid codepoint
 32-bit non-UTF mode   less than 0x100000000
 32-bit UTF-32 mode    less than 0x10ffff and a valid codepoint
```
Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so-called "surrogate" codepoints), and 0xffef.

Offset

All offsets supplied and returned are counted in data units:

The string to be matched by pcre_exec()

The subject string is passed to pcre_exec() as a pointer in subject, a length in length, and a starting offset in startoffset. The units for length and startoffset are bytes for the 8-bit library, 16-bit data items for the 16-bit library, and 32-bit data items for the 32-bit library.

How pcre_exec() returns captured substrings

[...]

When a match is successful, information about captured substrings is returned in pairs of integers, starting at the beginning of ovector, and continuing up to two-thirds of its length at the most. The first element of each pair is set to the offset of the first character in a substring, and the second is set to the offset of the first character after the end of a substring. These values are always data unit offsets, even in UTF mode.
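
On the PHP side this shows up as byte offsets from PREG_OFFSET_CAPTURE and in the offset parameter, even with the u flag. A minimal sketch with an arbitrary two-character subject:

```php
<?php
$subject = "\xC3\xA9b"; // "éb": é takes 2 bytes in UTF-8, b takes 1

// Returned offsets are in bytes, not characters.
preg_match('/b/u', $subject, $m, PREG_OFFSET_CAPTURE);
var_dump($m[0][1]); // int(2), although "b" is the second character

// The supplied offset is in bytes as well: \G anchors at the start offset,
// and byte 2 is exactly where "b" begins.
$found = preg_match('/\Gb/u', $subject, $m2, 0, 2);
var_dump($found);   // int(1)
```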

A small note for anyone who wants to test: use either pcretest or a tester that runs on a PHP server; don't use regex101 (since it assumes UTF-16 input and a 16-bit data unit). For `pcretest` (available by installing the pcre library, via Cygwin or another channel), I only managed to get it working with UTF-8 input. In non-UTF mode, each byte of the UTF-8 input is copied to 1 data unit (which is weird for UTF-16 and UTF-32). In UTF mode, the UTF-8 input is conceptually converted to code points, then to the proper representation in UTF-8|16|32 (which is correct). — nhahtdh, Dec 15 '14 at 11:32
Great write-up! Just a note that in UTF-8 mode, a _valid_ code point can have greater than 1 data unit (as you say). Like chars \x80-\xFF have two data units. In that case, it seems generally then, characters `[\x00-\xFF]` would be excluded as a predictor of _byte_ offset. — , Dec 15 '14 at 22:47
@sln: In UTF-8 mode, none of them would work, including `\C`, since it is not allowed in look-behind. — nhahtdh, Dec 16 '14 at 02:16
That's what the docs say. In Perl 5.20 though when forced UTF-8 target, `\C` works in lookbehind on the valid codepoints tested (only tested a few). On seemingly invalid codepoints, `\C` won't go to the left (in data units) of invalid ones. Amazing and scary. — , Dec 16 '14 at 13:04
@sln: This is where PCRE demonstrates that it is not Perl :P — nhahtdh, Dec 16 '14 at 13:24

score 1 · Answer 2 · answered Dec 09 '14 at 22:35

You could try /(?<=[\x00-\xFF]{n})some expression/ for an n-byte offset. Add anchors, or some other soft anchors, to do the start alignment.
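
As a rough sketch of how this could be packaged, here is a hypothetical helper (with_byte_offset is not a PHP function, just an illustrative name). It assumes non-UTF mode, a pattern body with no unescaped ~ delimiter, and no leading (*VERB) items. Wrapping the body in a non-capturing group is my own precaution so that the look-behind applies to every top-level alternative without changing group numbers.

```php
<?php
// Hypothetical helper: builds a pattern whose matches may only start at or
// after byte $n. Assumes non-UTF mode and a body without unescaped "~".
function with_byte_offset(string $body, int $n, string $flags = ''): string
{
    // (?:...) keeps the look-behind attached to all alternatives of $body
    // and leaves capture group numbering untouched.
    return '~(?<=[\x00-\xFF]{' . $n . '})(?:' . $body . ')~' . $flags;
}

// Groups keep their numbers; the match starting at byte 0 is skipped.
$swapped = preg_replace(
    with_byte_offset('(this) has (groups)', 2, 'i'),
    '$2 has $1',
    'This has groups; this has groups'
);
echo $swapped, "\n"; // This has groups; groups has this

// Alternation: without the group, "b" at byte 0 would match via the second
// alternative and bypass the look-behind.
$grouped   = preg_replace(with_byte_offset('a|b', 1), 'X', 'ba');
$ungrouped = preg_replace('~(?<=[\x00-\xFF]{1})a|b~', 'X', 'ba');
echo $grouped, ' ', $ungrouped, "\n"; // bX XX
```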

It seems to me that using `[\x00-\xFF]` would be the same as using `\C`. – HamZa Dec 09 '14 at 23:14
@HamZa - I know it's tagged 'php'. I was going by the perlre docs, which say `\C` is unsupported in lookbehinds and classes (versions 5.10/5.20), and deprecated (version 5.20). Amazingly, `\C` works in lookbehinds in both versions and not at all in classes. I would use x00 - xFF because I don't trust \C. – Dec 10 '14 at 17:35
@nhahtdh - One thing is for sure with `\C`: the data unit is 8, 16 or 32 bits depending on whether it's ASCII or UTF. That dispels the _byte_ theory for anything other than ASCII or UTF-8. I don't really trust `\C` in a UTF context. There is always that character boundary problem in DFA paths, and injecting \C into expressions is problematic at the least. Therefore, I believe the regex and target should be ASCII only. I'm sticking with `[\x00-\xFF]` – Dec 11 '14 at 18:31