
When performing a regex search in Python, even when `re.MULTILINE` isn't enabled, the expression `A[\s]B` will match against

A
B

since a newline matches `\s`.
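A minimal sketch of this behavior, assuming Python 3 and `str` patterns (the variable names are only illustrative). `re.MULTILINE` changes where `^` and `$` may match, so it makes no difference to `\s`:

```python
import re

text = "A\nB"

# re.MULTILINE only affects ^ and $; \s matches the newline either way.
plain = re.search(r"A[\s]B", text)
multi = re.search(r"A[\s]B", text, re.MULTILINE)

print(plain is not None, multi is not None)  # prints: True True
```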

Besides splitting the string into lines and operating on each one, is there an efficient way to make expressions stop at newlines?


Edit: I know it's possible to use `[\t ]` or `[^\S\r\n]`. The issue is that I don't control the input in this case: users will enter `\s` and won't expect it to span lines. I'm not interested in trying to tell the users they are wrong; from their perspective this is a bug.
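Note that the two classes are not interchangeable. The sketch below assumes Python 3 `str` patterns, where `\s` is Unicode-aware; `matches_one` is a made-up helper name used only for illustration:

```python
import re


def matches_one(char_class, ch):
    """Return True if the single character ch is matched by char_class."""
    return re.fullmatch(char_class, ch) is not None


# Compare the two candidate classes on a few whitespace characters.
for ch in (" ", "\t", "\f", "\u00a0", "\n", "\r"):
    print(repr(ch), matches_one(r"[\t ]", ch), matches_one(r"[^\S\r\n]", ch))
```

Here `[\t ]` accepts only space and tab, while `[^\S\r\n]` also accepts characters such as form feed and the non-breaking space. Both reject `\n` and `\r`, but `[^\S\r\n]` still accepts other vertical whitespace such as `\v`.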

So if the answer is "it can't be done without splitting lines" - so be it.


Note that operating on a file line by line is approximately twice as slow in my tests.

ideasman42
  • Do you ask how to match only *horizontal whitespace*? – Wiktor Stribiżew Jul 24 '17 at 16:10
  • I think so, yes. – ideasman42 Jul 24 '17 at 16:11
  • Use `[^\S\r\n]` – Wiktor Stribiżew Jul 24 '17 at 16:11
  • The issue is I'm writing a program where the expressions are input from the users, so I would like `\s` to behave usefully and not catch newlines. I could replace `\s` with something else, but this seems risky. – ideasman42 Jul 24 '17 at 16:13
  • No, ask the users to use correct patterns if they are allowed to input regular expressions. That is the correct way. – Wiktor Stribiżew Jul 24 '17 at 16:13
  • In this context the behavior they would want is to have \s match only horizontal whitespace (I think), so probably I just need to split the file by lines and match against those. – ideasman42 Jul 24 '17 at 16:14
  • Yes, then it is the only way. However, it is much better to let users know how the tool works, and let them use what they want. They can't expect `\s` not to match line breaks, it is what it always does. – Wiktor Stribiżew Jul 24 '17 at 16:17
  • The other question is about perl, also I know *how* to match non-vertical whitespace, I want to know how to use `\s` without it matching vertical whitespace. – ideasman42 Jul 24 '17 at 16:28
  • And [here is your answer](https://stackoverflow.com/a/17752989/3832970). And [here is another one](https://stackoverflow.com/a/25955005/3832970). All in one and the same thread. Python `re` is also based on a Perl regex flavor. To make `\s` fail to match a line break char, you need to modify the pattern: 1) use a negated character class with an opposite shorthand character class and the line break chars or 2) use a negative lookahead that will restrict the `\s` pattern. – Wiktor Stribiżew Jul 24 '17 at 16:30
  • I've figured out my own answer, which allows using `\s`, avoids splitting lines to match, and is not on those threads, which is a hint to me that this isn't a duplicate. Would like to reopen so I can post it. – ideasman42 Jul 24 '17 at 16:42
  • Please post, let's revise it. – Wiktor Stribiżew Jul 24 '17 at 17:39
  • Here you go, inform users that both `\s` and `\h` are available and tell them the difference. Then import the _`regex`_ library into your program. Or, if you're determined to use `re`, then find `(?<!\\)((?:\\\\)*)\\h` and replace `$1[^\S\r\n]` before compiling and running the regex. –  Jul 24 '17 at 18:41

2 Answers


Technically, for ASCII matching `\s` is just shorthand for `[ \t\n\r\f\v]` (with `str` patterns in Python 3 it also matches other Unicode whitespace).

This means that replacing every match of `([^\\]|^)((?:\\\\)*)\\s` with `$1$2[ \t\n\r\f\v]` in the regex pattern will have no effect on ASCII input. (The leading groups are needed so that an escaped backslash followed by `s` is not rewritten.) So technically, you can just narrow the replacement so that `\s` becomes only `[ \t]`. Note that a plain textual replacement like this goes wrong when `\s` already sits inside a character class, as in `A[\s]B`, because the result would contain nested brackets.

Of course, as others have said, changing how regex works without telling the end user is a very bad idea, and it would probably be easier to explain and implement replacing all spaces in the regex with the character class `[ \t]` (as this is a smaller change to the base rule set). If there is a special reason the end user thinks `\s` can't capture newlines, then you should probably parse the file the same way the end user expects, so that the code logic matches the end user's logic.
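A pattern rewrite along these lines could be sketched as follows. This is only a rough sketch under stated assumptions, not a complete regex parser: `restrict_whitespace`, `HSPACE_CLASS` and `HSPACE_ITEMS` are hypothetical names, and inside a character class the replacement covers ASCII horizontal whitespace only.

```python
import re

HSPACE_CLASS = r"[^\S\r\n]"  # used outside a character class
HSPACE_ITEMS = r" \t\f\v"    # used inside one (ASCII only, an assumption)


def restrict_whitespace(pattern):
    """Rewrite unescaped \\s so that it no longer matches \\r or \\n."""
    out = []
    i = 0
    in_class = False
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            # Consume the escape as a pair so that "\\\\s" is left alone.
            nxt = pattern[i + 1]
            if nxt == "s":
                out.append(HSPACE_ITEMS if in_class else HSPACE_CLASS)
            else:
                out.append(ch + nxt)
            i += 2
            continue
        if ch == "[" and not in_class:
            in_class = True
            out.append(ch)
            i += 1
            # A leading ^ and a ] right after the opening belong to the class.
            if pattern.startswith("^", i):
                out.append("^")
                i += 1
            if pattern.startswith("]", i):
                out.append("]")
                i += 1
            continue
        if ch == "]" and in_class:
            in_class = False
        out.append(ch)
        i += 1
    return "".join(out)


print(restrict_whitespace(r"A\sB"))    # prints: A[^\S\r\n]B
print(restrict_whitespace(r"A[\s]B"))  # prints: A[ \t\f\v]B
```

One design caveat: inside a negated class such as `[^\s]`, this rewrite makes the class match line breaks, which may or may not be what users expect.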

Tezra

The short answer is no: Python's `re` can't be configured so that `\s` won't match `\n`.

What you can do is detect '\n' in the matches and skip over those.

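import re


def finditer_delimit_newlines(pattern, string, delimit_newlines=True):
    """Yield matches of pattern, optionally skipping those that span lines."""
    matches = list(re.finditer(pattern, string))
    if not matches:
        return

    # Map each newline offset to its 1-based index. Scan up to the first
    # newline at or after the end of the last match, so every lookup
    # below is guaranteed to find its key.
    end = matches[-1].end()
    newline_table = {-1: 0}
    for i, m in enumerate(re.finditer(r'\n', string), 1):
        offset = m.start()
        newline_table[offset] = i
        if offset >= end:
            break

    for m in matches:
        m_start = m.start()
        m_end = m.end()
        # Last newline before the match and first newline at or after its end.
        newline_offset = string.rfind('\n', 0, m_start)
        newline_end = string.find('\n', m_end)
        if delimit_newlines:
            # The two newlines are consecutive only if none lies in between,
            # that is, only if the match stays on a single line.
            if ((newline_table[newline_offset] + 1) !=
                (newline_table[newline_end]
                 if newline_end != -1 else len(newline_table))
            ):
                continue
        yield m


search = """A
B

A B"""

for delimit_newlines in (False, True):
    print("Test:", delimit_newlines)
    for a in finditer_delimit_newlines(r'[A-Z]\s[A-Z]', search, delimit_newlines):
        print(a)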

This test outputs

Test: False
<_sre.SRE_Match object; span=(0, 3), match='A\nB'>
<_sre.SRE_Match object; span=(5, 8), match='A B'>
Test: True
<_sre.SRE_Match object; span=(5, 8), match='A B'>

Edit: a match can capture trailing newlines as part of regular whitespace. While it's possible to detect this, it might be simpler to use a similar method that re-matches the results on limited ranges if newlines exist.
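The re-matching idea could be sketched roughly as follows. This is an illustrative sketch only: `finditer_single_line` is a hypothetical name, and it relies on the `pos` and `endpos` arguments of a compiled pattern's `search` method to retry a match within the line where it started. Its handling of empty matches and of lookarounds that peek across a newline is not guaranteed to agree with `re.finditer`.

```python
import re


def finditer_single_line(pattern, string, flags=0):
    """Yield matches of pattern that do not contain a newline."""
    regex = re.compile(pattern, flags)
    pos = 0
    while pos <= len(string):
        m = regex.search(string, pos)
        if m is None:
            break
        if "\n" in m.group():
            # Retry, limited to the rest of the line where the match started.
            line_end = string.find("\n", m.start())
            m = regex.search(string, m.start(), line_end)
            if m is None:
                # Nothing fits on that line; continue on the next one.
                pos = line_end + 1
                continue
        yield m
        # Step past the match; advance by one after an empty match.
        pos = m.end() if m.end() > m.start() else m.end() + 1


search = "A\nB\n\nA B"
print([m.span() for m in finditer_single_line(r"[A-Z]\s[A-Z]", search)])
# prints: [(5, 8)]
```

Because the retry stops at the end of the line, a pattern such as `\w\s+` applied to `"A \nB"` yields `'A '` rather than a match that swallows the trailing newline.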

ideasman42