Why is regex search in substring "not completely equivalent to slicing the string" in Python?

Question

As the documentation states, regex.search(string, pos, endpos) is not completely equivalent to slicing the string, i.e. regex.search(string[pos:endpos]). Matching is not performed as if the string started at pos, so ^ does not match at the beginning of the substring, only at the real beginning of the whole string. However, $ matches at the end of the substring (endpos) as well as at the end of the whole string.

    >>> re.compile('^am').findall('I am falling in code', 2, 12)
    []        # am is not at the beginning
    >>> re.compile('^am').findall('I am falling in code'[2:12])
    ['am']    # am is the beginning
    >>> re.compile('ing$').findall('I am falling in code', 2, 12)
    ['ing']   # ing is the ending
    >>> re.compile('ing$').findall('I am falling in code'[2:12])
    ['ing']   # ing is the ending

    >>> re.compile('(?<= )am').findall('I am falling in code', 2, 12)
    ['am']    # before am there is a space
    >>> re.compile('(?<= )am').findall('I am falling in code'[2:12])
    []        # before am there is no space
    >>> re.compile('ing(?= )').findall('I am falling in code', 2, 12)
    []        # after ing there is no space
    >>> re.compile('ing(?= )').findall('I am falling in code'[2:12])
    []        # after ing there is no space

    >>> re.compile(r'\bm.....').findall('I am falling in code', 3, 11)
    []
    >>> re.compile(r'\bm.....').findall('I am falling in code'[3:11])
    ['m fall']
    >>> re.compile(r'.....n\b').findall('I am falling in code', 3, 11)
    ['fallin']
    >>> re.compile(r'.....n\b').findall('I am falling in code'[3:11])
    ['fallin']

My questions are: why is the behaviour inconsistent between the beginning and the end of the match range? Why do pos and endpos treat endpos as the real end of the string, while pos is not treated as the real beginning?

Is there any way to make pos and endpos imitate slicing? Because Python copies the string when slicing instead of just referencing the original, it would be more efficient to use pos and endpos than slicing when working with a big string many times.

Very strange, it seems that the new regex module has the same behaviour. — Casimir et Hippolyte, Jun 23 '15 at 10:32
It looks worth a bug report to python: http://bugs.python.org/ — Armin Rigo, Jun 23 '15 at 11:00
@ArminRigo But the documentation told it, so it might be a "feature" :) — fikr4n, Jun 24 '15 at 02:09
This is consistent with what the documentation says (it says that using `endpos` is equivalent to slicing). It is however very weird behaviour but I suspect that a bug report would be rejected on the grounds that changing this would break backwards compatibility. — Raniz, Jun 24 '15 at 02:16
A different point of view is that the start "pos" argument is meant for doing multiple search() to locate several matches in a left-to-right manner. The "endpos" on the other hand is meant to pretend the string is really sliced to this length. I suppose it is consistent with the fact that there is no "search_rightmost()" function to do right-to-left multiple search. — Armin Rigo, Jun 24 '15 at 08:16
I'm guessing you expect the second from bottom statement to return an empty list. Is that right? Can you clarify what you expect to happen in these examples? — krethika, Jun 24 '15 at 17:53
@mehtunguh About `re.compile(r'.....n\b').findall('I am falling in code', 3, 11)`, I expect nothing because the documentation has stated that, I give up :D. I am just curious why they are inconsistent. Additionally, I would like to know is there any approach to make using `pos/endpos` imitate slicing. — fikr4n, Jun 25 '15 at 01:25
If you specifically want to use `^` to match the beginning of the string, could you use `re.match()` instead? I know this is not a general case solution - but maybe it's good enough for what you want? — Brian L, Jun 25 '15 at 02:17

Antti Haapala -- Слава Україні · Answer 1 · 2015-07-22T12:33:20.080

The starting position argument pos is especially useful for writing lexical analysers, for example. The performance difference between slicing a string with [pos:] and using the pos parameter might seem insignificant, but it certainly is not; see for example this bug report in the JsLex lexer.

Indeed, ^ matches at the real beginning of the string or, if MULTILINE is specified, also at the beginning of a line. This is by design, so that a scanner based on regular expressions can easily distinguish between the real beginning of a line or of the input and some other point on a line or within the input.
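A small sketch of that distinction, using a made-up two-line string (the text and variable names are only for illustration): with pos=2 the plain ^ never matches, while the MULTILINE ^ still matches after the real newline.

```python
import re

# Illustrative input: 'am' occurs at index 2 (mid-line) and at index 5
# (directly after a newline).
text = 'I am\nam falling'

plain = re.compile('^am')
multiline = re.compile('^am', re.MULTILINE)

# Start searching at pos=2 in both cases.
plain_hits = [m.start() for m in plain.finditer(text, 2)]
multiline_hits = [m.start() for m in multiline.finditer(text, 2)]

print(plain_hits)      # []  : pos=2 is not the real beginning of the string
print(multiline_hits)  # [5] : index 5 follows a real newline
```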

Do note that you can also use the regex.match(string[, pos[, endpos]]) function to anchor the match to the beginning of the string, or to the position specified by pos; thus instead of doing

>>> re.compile('^am').findall('I am falling in code', 2, 12)
[]

you'd generally implement a scanner as

>>> match = re.compile('am').match('I am falling in code', 2, 12)
>>> match
<_sre.SRE_Match object; span=(2, 4), match='am'>

and then set the pos to match.end() (which in this case returns 4) for the successive matching operations.
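The loop described above could look roughly like the following sketch. The token pattern and the tokenize helper are hypothetical, invented for this illustration; only pattern.match(string, pos, endpos), match.end() and match.lastgroup are real re features.

```python
import re

# Hypothetical token pattern: a run of letters or a run of whitespace.
TOKEN = re.compile(r'(?P<word>[A-Za-z]+)|(?P<space>\s+)')

def tokenize(text, pos=0, endpos=None):
    """Yield (kind, value) pairs by matching repeatedly at an advancing pos."""
    if endpos is None:
        endpos = len(text)
    while pos < endpos:
        # match() is anchored at pos, so no ^ is needed and nothing is copied.
        match = TOKEN.match(text, pos, endpos)
        if match is None:
            raise ValueError('unexpected character at position %d' % pos)
        yield match.lastgroup, match.group()
        pos = match.end()  # continue right after the previous token

tokens = list(tokenize('I am falling in code', 2, 12))
print(tokens)  # [('word', 'am'), ('space', ' '), ('word', 'falling')]
```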

The match must be found starting exactly at the pos:

>>> re.compile('am').match('I am falling in code', 1, 12)
>>>

(Notice how the .match is anchored at the beginning of the input as if by implicit ^ but not to the end of the input; indeed this is often a source of errors as people believe the match has both implicit ^ and $ - Python 3.4 added the regex.fullmatch that does this)
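To make the difference concrete, here is a short comparison on the same example string (Python 3.4 or later is assumed for fullmatch):

```python
import re

pattern = re.compile('am')
text = 'I am falling in code'

# match() only anchors the start: 'am' at index 2 is enough.
prefix = pattern.match(text, 2, 12)

# fullmatch() must consume everything from pos up to endpos.
whole_wide = pattern.fullmatch(text, 2, 12)   # 'am falling' is not just 'am'
whole_exact = pattern.fullmatch(text, 2, 4)   # the range is exactly 'am'

print(prefix.span())       # (2, 4)
print(whole_wide)          # None
print(whole_exact.span())  # (2, 4)
```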

As for why the endpos parameter is not consistent with the pos - that I do not know exactly, but it also makes some sense to me, as in Python 2 there is no fullmatch and there anchoring with $ is the only way to ensure that the entire span must be matched.
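As an illustration of that last point, match() combined with $ and endpos behaves like a full match of the range, which works without fullmatch (the variable names here are only for the example):

```python
import re

text = 'I am falling in code'
anchored = re.compile(r'am$')

# endpos=4 makes $ match right after 'am', so the whole range must be 'am'.
exact = anchored.match(text, 2, 4)

# With endpos=12 the range is 'am falling', so $ cannot match after 'am'.
too_wide = anchored.match(text, 2, 12)

print(exact.span())  # (2, 4)
print(too_wide)      # None
```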

Samuel O'Malley · Answer 2 · 2015-06-27T02:56:50.237

This sounds like a bug in Python, but if you want to slice by reference instead of copying the string, you can use the Python 2 builtin buffer.

For example:

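s = "long string" * 100
# Python 2 only. buffer(object, offset, size) views s[5:15] without copying;
# slicing a buffer (buf[5:15]) would return a new str instead.
substr = buffer(s, 5, 10)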

This creates a substring without copying the data, so should allow for efficient splitting of large strings.
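For Python 3, where buffer no longer exists, a rough analogue is sketched below. It assumes the data can be handled as bytes: memoryview works on bytes-like objects, not on str, and the pattern must then be a bytes pattern too. Treat it as an illustration under those assumptions rather than a drop-in replacement.

```python
import re

data = b'long string' * 100        # bytes, not str
view = memoryview(data)[5:15]      # views data[5:15] without copying it

pattern = re.compile(b'^string')   # bytes pattern for bytes-like input
match = pattern.search(view)

# Offsets are relative to the view, so ^ matches at the start of the slice.
print(match.span())  # (0, 6)
```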

Nice info about `buffer`, unfortunately it is not available in Python 3. – fikr4n Jun 29 '15 at 02:04
@BornToCode: Unfortunately, I haven't managed to get the Python3 replacement `memoryview` to work properly with string slices. I thought I'd mention buffer anyway because there are still many Python2 users, and the question didn't specify a version. – Samuel O'Malley Jun 29 '15 at 02:15
I'd be careful about making assumptions about the relationship between strings and buffers. Buffers are byte-oriented, while strings (and regexes) operate on full Unicode, especially in Python 3. So the slice `s[2:4]` refers to two *characters*, but they may occupy more than two *bytes* in memory. That means you need to be careful about assuming you can do raw memory operations on strings. This suggests that the original poster should consider the big picture a bit more, and look for an algorithmic solution rather than delving too deep into micro-efficiencies of Python code. – Gary Wisniewski Jul 07 '15 at 22:17
