
I am trying to extract tokens from a string, such that these tokens meet certain conditions. In my particular case, I want to extract symbols such as +,=,-, etc.

I have created the following regex:

reg = re.compile(r"[\{\}\(\)\[\]\.,;\+\-\*\/\&\|<>=~]")

However, when I apply:

reg.findall('x += "hello + world"')

It also matches the + between quotes, so it outputs:

['+', '=', '+']

My expected output is:

['+', '=']

My question is: how do I achieve this, and is it even possible? I have searched online, but only found how to match everything except double quotes, and similar questions.

dpalma
  • It's not possible to do it in a single pass. You'd need to eliminate all the quoted segments first (and deal correctly with any nested quotes). But what is the purpose of this? It looks like you're trying to parse source code, or perhaps arithmetical statements. – ekhumoro Oct 09 '17 at 18:46
  • Indeed, I am trying to do lexical analysis on source code, so I just want the tokens, in this case tokens of the symbol type that I defined. The problem is that when a string is defined, I do not know how to handle it. My guess is to play with groups, but I don't know if that is the right path... – dpalma Oct 09 '17 at 18:47
  • What language is the source code? Use [tokenize](https://docs.python.org/2/library/tokenize.html#module-tokenize) if it's Python. – ekhumoro Oct 09 '17 at 18:49
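
The two suggestions in the comments above can be sketched roughly as follows. This is only an illustration, not the commenters' own code: the helper names symbols_two_pass and python_operators are made up here, the first helper assumes strings are double-quoted and contain no escaped quotes, and the second assumes the input is valid Python source.

```python
import io
import re
import tokenize

# Assumption: string literals are double-quoted and contain no escaped quotes.
QUOTED = re.compile(r'"[^"]*"')
SYMBOL = re.compile(r'[-\[\]{}().,;+*/&|<>=~]')


def symbols_two_pass(text):
    """First pass: blank out quoted segments. Second pass: collect symbols."""
    without_strings = QUOTED.sub(' ', text)
    return SYMBOL.findall(without_strings)


def python_operators(source):
    """Return the operator tokens reported by the standard tokenize module."""
    readline = io.StringIO(source).readline
    return [tok.string
            for tok in tokenize.generate_tokens(readline)
            if tok.type == tokenize.OP]


line = 'x += "hello + world"'
print(symbols_two_pass(line))         # ['+', '=']
print(python_operators(line + '\n'))  # ['+=']
```

Note that tokenize reports += as a single operator token, so it does not reproduce the ['+', '='] output asked for in the question; whether that is desirable depends on how the lexer is meant to treat compound operators.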

2 Answers


First, you do not need to escape every special character inside a character class (apart from [ and ]; a - is also safe unescaped when it is placed first). So your initial expression becomes something like:

[-\[\]{}().,;+*/&|<>=~]

Now to the second requirement: matching only in certain positions and skipping the others. Here, you could either use the newer third-party regex module and write (demo on regex101.com):

"[^"]*"(*SKIP)(*FAIL)|[-\[\]{}().,;+*/&|<>=~]


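Or use a capturing group with the standard-library re module and some programming logic:

import re

# Quoted strings are matched first but not captured; only symbols land in group 1.
# [^"]* (rather than [^"]+) also consumes empty strings such as "".
rx = re.compile(r'"[^"]*"|([-\[\]{}().,;+*/&|<>=~])')

string = 'x += "hello + world"'

# Keep only the matches in which group 1 (a symbol) took part.
symbols = [match.group(1) for match in rx.finditer(string) if match.group(1)]
print(symbols)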


Both will yield:

['+', '=']


Both approaches follow the same mechanism:

match_this_but_dont_save_it | (keep_this)

You might want to read more on (*SKIP)(*FAIL) here.
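
The same mechanism extends to other things you want to skip. As a rough sketch with the re module, assuming (this is not stated in the question) that string literals may be single- or double-quoted and may contain backslash escapes; the names SKIP, KEEP and extract_symbols are made up for illustration:

```python
import re

# Alternatives to match but not keep: double- or single-quoted literals,
# where a backslash escapes the character that follows it.
SKIP = r'"(?:\\.|[^"\\])*"' + '|' + r"'(?:\\.|[^'\\])*'"

# The only capturing group: the symbols we actually want.
KEEP = r'([-\[\]{}().,;+*/&|<>=~])'

TOKEN_RE = re.compile('(?:' + SKIP + ')|' + KEEP)


def extract_symbols(text):
    """Return the symbols in text that lie outside string literals."""
    return [m.group(1) for m in TOKEN_RE.finditer(text) if m.group(1)]


print(extract_symbols('x += "hello + world"'))  # ['+', '=']
print(extract_symbols(r'y = "a \" + b" - 1'))   # ['=', '-']
```

Because the skipped alternatives come first in the alternation, a quote character always starts a string literal before any symbol inside it can be captured.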

Jan

I think one thing you can do is stop applying the regex as soon as a " is encountered, and not resume checking until the next occurrence of " comes.
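
A minimal sketch of that idea, written as a plain character scan instead of a regex. It assumes only double quotes delimit strings and that quotes are never escaped; the names SYMBOLS and symbols_outside_quotes are hypothetical:

```python
# The same symbols as in the character class from the question.
SYMBOLS = set('{}()[].,;+-*/&|<>=~')


def symbols_outside_quotes(text):
    """Collect symbols, ignoring everything between pairs of double quotes."""
    inside_quotes = False
    found = []
    for ch in text:
        if ch == '"':
            # Each quote toggles between "inside a string" and "outside".
            inside_quotes = not inside_quotes
        elif not inside_quotes and ch in SYMBOLS:
            found.append(ch)
    return found


print(symbols_outside_quotes('x += "hello + world"'))  # ['+', '=']
```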

Aniruddh Agarwal