I am looking for a better way to get my result. I use a regex pattern to match all text of the form (DD+ some text DDDD some other text), but only if it is not preceded by certain terms. Those terms vary in length, so a fixed-width lookbehind cannot express them. How can I include these terms in my regex pattern?

import re
import pandas as pd

aa = pd.DataFrame({"test": ["45 python 00222 sometext",
                            "python white 45 regex 00 222 somewhere",
                            "php noise 45 python 65000 sm",
                            "otherword 45 python 50000 sm"]})
# Raw string so that \d, \s and \W are not treated as string escapes
pattern = re.compile(r"(((\d+)\s?([^\W\d_]+)\s?)?(\d{2}\s?\d{3})\s?(\w.+))")
aa["result"] = aa["test"].apply(lambda x: pattern.search(x)[0] if pattern.search(x) else None)
lookbehind = ['python', 'php']
# Blank out the result when a forbidden term appears in the text outside the match
aa.apply(lambda x: "" if any(look in x["test"].replace(x["result"], "") for look in lookbehind) else x["result"], axis=1)

The output is what I expected:

0    45 python 00222 sometext
1                            
2                            
3          45 python 50000 sm
J. Doe
  • You would like to get rid of the whitespace in the "expected output", or what's the matter? Use https://pypi.org/project/regex/ – wp78de Oct 09 '18 at 15:41
  • Is trying an alternative regular expression package an option? For example, the `regex` package supports variable-length lookbehind. See: https://stackoverflow.com/questions/24987403/variable-width-lookbehind-issue-in-python – Tomalak Oct 09 '18 at 15:41
  • @wp78de No, I would like to get rid of the last 2 lines of my code. That is, I would like to allow non-fixed-width lookbehind in my regex pattern. – J. Doe Oct 09 '18 at 15:45
  • @Tomalak Yes, it is. I am checking this option :) – J. Doe Oct 09 '18 at 15:46
  • @Tomalak Are quantifiers accepted in the lookbehind pattern for this package? – J. Doe Oct 09 '18 at 15:48
  • The linked answer suggests so. – Tomalak Oct 09 '18 at 15:48
  • @J.Doe yes they are: "A lookbehind can match a variable-length string." https://pypi.org/project/regex/ – wp78de Oct 09 '18 at 15:49
  • Try `pattern = re.compile(r"(?:(php|python).*?)?((?:\d+\s?[^\W\d_]+\s?)?\d{2}\s?\d{3}\s?\w.+)")` and then `aa["test"].apply(lambda x: pattern.search(x).group(2) if pattern.search(x) and not pattern.search(x).group(1) else "")`. – Wiktor Stribiżew Oct 09 '18 at 19:02

2 Answers

One workaround is to capture php or python before the expected match. If that group matched, discard the current match; otherwise, the match is valid.

Use this pattern:

pattern = re.compile(r"(?:(php|python).*?)?((?:\d+\s?[^\W\d_]+\s?)?\d{2}\s?\d{3}\s?\w.+)")

The pattern has two parts, each holding one capturing group:

  • (?:(php|python).*?)? - an optional non-capturing group (the trailing ? makes it optional) that captures php or python into Group 1 and then matches 0+ chars, as few as possible
  • ((?:\d+\s?[^\W\d_]+\s?)?\d{2}\s?\d{3}\s?\w.+) - Group 2, which is basically your pattern without the redundant groups.
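
To see what each group holds, a quick check with the standard `re` module on the four sample strings from the question may help. This is only a sketch: `samples` and `groups_for` are hypothetical names introduced for illustration, and pandas is left out on purpose.

```python
import re

pattern = re.compile(r"(?:(php|python).*?)?((?:\d+\s?[^\W\d_]+\s?)?\d{2}\s?\d{3}\s?\w.+)")

# The four test strings from the question
samples = [
    "45 python 00222 sometext",
    "python white 45 regex 00 222 somewhere",
    "php noise 45 python 65000 sm",
    "otherword 45 python 50000 sm",
]

def groups_for(text):
    """Return (Group 1, Group 2) for the first match, or None if nothing matches."""
    m = pattern.search(text)
    return (m.group(1), m.group(2)) if m else None

for s in samples:
    # Group 1 is None when no forbidden term precedes the match
    print(repr(s), "->", groups_for(s))
```

For the first and last samples Group 1 is None; for the two middle samples it holds python and php respectively, which is the signal used to reject the match.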

If Group 1 matches, we need to return an empty result, else, Group 2 value:

def callback(v):
    m = pattern.search(v)
    if m and not m.group(1):
        return m.group(2)
    return ""

aa["test"].apply(callback)

Result:

0    45 python 00222 sometext
1                            
2                            
3          45 python 50000 sm
Wiktor Stribiżew

Because a negative lookbehind in Python's re module must be of fixed length, use a negative lookahead instead, anchored to the start of the string, that checks the part before the first digit.

It should include:

  • A sequence of non-digits (possibly empty).
  • Either of your "forbidden" strings.

This way, if the string to check contains python or php before the first digit, this lookahead will fail, preventing this string from further processing.

Because of the ^ anchor, the rest of the regex must first match a sequence of non-digits (whatever comes before the "DD+" part), followed by your regex.

So the regex to use is as follows:

^(?!\D*(?:python|php))\D*(\d+)\s?([^\W\d_]+)\s?(\d{2}\s?\d{3})\s?(\w+)

Details:

  • ^(?! - Start of string and negative lookahead for:
    • \D* - A sequence of non-digits (may be empty).
    • (?:python|php) - Either of the "forbidden" strings, as a non-capturing group (no need to capture it).
  • ) - End of negative lookahead.
  • \D* - A sequence of non-digits (before what you want to match).
  • (\d+)\s? - The first sequence of digits + optional space.
  • ([^\W\d_]+)\s? - Some text No 1 + optional space.
  • (\d{2}\s?\d{3})\s? - The second sequence of digits (with optional space in the middle) + optional space.
  • (\w+) - Some text No 2.

The advantage of my solution over the other one is that you do not need to check whether the first group matched. Only "positive" cases match at all, so no further check is required.

For a working example see https://regex101.com/r/gl9nWx/1
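
A minimal sketch of how this regex could be applied with Python's standard `re` module, without pandas. The helper name `extract` and the slicing from the start of Group 1 are assumptions made for illustration: because of the leading `\D*`, the full match also contains any non-digit prefix such as "otherword ", so the sketch cuts that prefix off.

```python
import re

pattern = re.compile(
    r"^(?!\D*(?:python|php))\D*(\d+)\s?([^\W\d_]+)\s?(\d{2}\s?\d{3})\s?(\w+)"
)

def extract(text):
    """Return the text from the first digit run to the end of the match, or ""."""
    m = pattern.search(text)
    if m is None:
        # Either a forbidden term precedes the first digit, or the shape is wrong
        return ""
    # Skip the non-digit prefix consumed by \D* by starting at Group 1
    return text[m.start(1):m.end()]

for s in ["45 python 00222 sometext",
          "python white 45 regex 00 222 somewhere",
          "php noise 45 python 65000 sm",
          "otherword 45 python 50000 sm"]:
    print(repr(s), "->", repr(extract(s)))
```

Note that an input with a digit before the expected match, such as `1otherword 45 python 50000 sm`, yields an empty string here, since everything is anchored to the first digit in the string.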

Valdi_Bo
  • I also thought of this approach, but what if there is a digit before the expected match? `1otherword 45 python 50000 sm` would not match, but it seems valid. See [demo](https://regex101.com/r/obUVqM/2). – Wiktor Stribiżew Oct 10 '18 at 07:09