
I am trying to pull a certain number out of various strings. The number has to be standalone, or directly before ' or (. The regex I came up with was: \b(?<!\()(x)\b(,|\(|'|$), where x is the number to find.

If x is 2, this matches the string below (almost) correctly, except that it also matches the 2 in 2'abd'. Any advice on what I did wrong here?

2(2'Abf',3),212,2'abc',2(1,2'abd',3)
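
For reference, the behaviour described above can be reproduced with Python's standard re module. This is only a minimal sketch; the variable names are illustrative. It also hints at the cause: the lookbehind (?<!\() only inspects the single character immediately before the 2, so a 2 that sits deeper inside the parentheses (after a comma) still passes.

```python
import re

# Original pattern from the question, with x = 2.
pattern = re.compile(r"\b(?<!\()(2)\b(,|\(|'|$)")
text = "2(2'Abf',3),212,2'abc',2(1,2'abd',3)"

# Start offset of every matched 2 (capture group 1).
starts = [m.start(1) for m in pattern.finditer(text)]
print(starts)  # [0, 16, 23, 27]
```

Offset 27 is the unwanted 2 in 2'abd': it is preceded by a comma rather than by (, so the lookbehind does not reject it.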

wp78de
Jia Liu
  • Here's the regex pattern: `\b(?<!\()(2)\b(,|\(|'|$)`. Here's the string to search: `2(2'Abf',3),212,2'abc',2(1,2'abd',3)`. The matches are shown in **bold**: `**2**(2'Abf',3),212,**2**'abc',**2**(1,**2**'abd',3)`. The last one, `**2**'abd'`, should not have been matched. Thanks! – Jia Liu May 27 '18 at 23:54
  • Ok, and which numbers do you want to extract now? I'm not sure what you mean by "pull the string", to be honest. – Andrey Tyukin May 27 '18 at 23:55
  • Yes, it's not pulling a string but identifying a number in a string. The above was an example. The goal is to find whether a number exists in a string under the criteria stated. In the example, we want to identify 2. However, the regex above will also find the 2 in (1,2'abd',3). – Jia Liu May 28 '18 at 00:04

1 Answer


As I understand it, your actual question is how to match a specific number except where it occurs inside parentheses.

To do so, I suggest using the skip_what_to_avoid|what_i_want pattern, like this:

(\((?>[^()\\]++|\\.|(?1))*+\))|\b(2)(?=\b(?:,|\(|'|$))

The idea is to disregard the overall matches entirely. The first alternative, with capture group 1, uses the recursive pattern (\((?>[^()\\]++|\\.|(?1))*+\)) to consume everything between balanced parentheses: that's the trash bin. We only need to check capture group 2, which, when set, contains the number found outside of parentheses.
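
To illustrate the trash-bin idea on its own, here is a minimal sketch using only the standard re module. It assumes the parentheses are never nested, as in the sample string, so the recursive part is replaced by the simpler \([^()]*\). That substitution is an assumption for illustration and is not the pattern above; note that the wanted number is therefore in group 1 here, not group 2.

```python
import re

# Left alternative: the "trash bin", one flat (...) group (nesting is NOT handled).
# Right alternative: the wanted number, captured in group 1.
pattern = re.compile(r"\([^()]*\)|\b(2)(?=\b(?:,|\(|'|$))")
text = "2(2'Abf',3),212,2'abc',2(1,2'abd',3)"

# Keep only matches where group 1 participated, i.e. the number was
# found outside a parenthesised block.
hits = [(m.start(1), m.end(1)) for m in pattern.finditer(text)
        if m.group(1) is not None]
print(hits)  # [(0, 1), (16, 17), (23, 24)]
```

Because the parenthesised blocks are consumed by the left alternative, the 2s inside them are never offered to the right alternative at all.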


Sample Code:

import regex as re

regex = r"(\((?>[^()\\]++|\\.|(?1))*+\))|\b(2)(?=\b(?:,|\(|'|$))"
test_str = "2(2'Abf',3),212,2'abc',2(1,2'abd',3)"

matches = re.finditer(regex, test_str, re.MULTILINE)

for match in matches:
    # Group 2 is set only for a number found outside parentheses;
    # matches that only filled group 1 (the trash bin) are skipped.
    if match.group(2) is not None:
        print("Found at {start}-{end}: {group}".format(start=match.start(2), end=match.end(2), group=match.group(2)))

Output:

Found at 0-1: 2
Found at 16-17: 2
Found at 23-24: 2

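This solution requires the alternative Python regex package (pip install regex), because the standard re module does not support recursive subpatterns such as (?1).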
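
If installing a third-party package is not an option, one possible workaround is to find candidates with the standard re module and filter them by parenthesis depth. The sketch below is an assumption-laden alternative, not the recursive pattern: find_top_level is a hypothetical helper name, and unlike the pattern above it does not treat backslash-escaped parentheses specially.

```python
import re

def find_top_level(number, text):
    """Return start offsets of `number` that lie outside all parentheses.

    Hypothetical helper for illustration; handles nested parentheses by
    counting depth instead of using a recursive regex.
    """
    # Same follow-up criteria as in the answer: , ( ' or end of string.
    candidate = re.compile(r"\b" + re.escape(number) + r"(?=\b(?:,|\(|'|$))")

    # depth_before[i] is the nesting depth just before text[i].
    depth_before = []
    depth = 0
    for ch in text:
        depth_before.append(depth)
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth = max(depth - 1, 0)  # tolerate a stray ")"

    return [m.start() for m in candidate.finditer(text)
            if depth_before[m.start()] == 0]

print(find_top_level("2", "2(2'Abf',3),212,2'abc',2(1,2'abd',3)"))  # [0, 16, 23]
print(find_top_level("2", "2((2),2'x'),2"))                         # [0, 12]
```

The second call shows a nested case: the 2'x' at depth 1 is skipped, while the trailing top-level 2 is kept.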

wp78de
  • I'm still missing something. I'm using regex101 site. https://regex101.com/r/7SzHVh/7. Thank you very much. – Jia Liu May 28 '18 at 01:01
  • What are you missing? This is a PCRE pattern. It does not work with an online tester that uses Python's standard re package. Executable online [demo](https://repl.it/repls/StunningAnimatedCryptos). – wp78de May 28 '18 at 01:05
  • Many thanks for the help. I probably shouldn't have used the Python tag. The primary goal is to use the regex to pull data from a Denodo REST service and consume it in Angular. The regex helps with other Python work as well. I ended up using: `\b(2)(\s|,|\([\d,\s]*|'.*?'|$)`, which may not be perfect, but it seems to do what we want. – Jia Liu May 29 '18 at 01:06