Regex to match string literal enclosed in quotes without matching others

Question

So I am writing a simple lexer for a subset of the C language in Python. I am using re to match and find all my tokens, but I am having issues matching string literals. To match a string literal I am using r'(?<=").*(?=")'. I am making the match non-inclusive because I want to match the double quotes as quote tokens and the content in between them as a string literal token. It works fine if a string literal appears only once in a line, but if I have "hello" int i "what is up" I end up matching hello correctly but then also matching int i, because it is in between double quotes too. How can I prevent this? Right now all of my input is read in at once as one line.

EDIT: I found my possible issue. I was using the greedy expression .* and switched it to .*?, and it is matching correctly. It had been matching hello" int i "what is up, which is how I discovered it was being greedy. My new regex is r'(?<=").*?(?=")'. Does anyone see any possible conflicts now?

score 3 · Accepted Answer · answered Jan 31 '18 at 01:29

Instead of using a lookahead, you could try this (which consumes the closing " so that it cannot start a new match):

import re
text = '"hello"  int i "what is up"'
# The quotes are part of each match; the group captures only the text between them.
print(re.findall(r'"(.*?)"', text))
# ['hello', 'what is up']
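
To see why consuming the closing quote matters, here is a small comparison of the lazy lookaround pattern from the question's edit against the consuming pattern above. It is only a sketch run on the sample input from the question; the variable names are made up for illustration.

```python
import re

text = '"hello" int i "what is up"'

# Lazy lookarounds: the quotes are never consumed, so the closing quote of
# "hello" can also act as the opening quote of the next match.
lookaround = re.findall(r'(?<=").*?(?=")', text)
print(lookaround)  # ['hello', ' int i ', 'what is up']

# Consuming pattern: each match takes both of its quotes with it, so the
# text between two separate literals is never matched.
consuming = re.findall(r'"(.*?)"', text)
print(consuming)  # ['hello', 'what is up']
```

So even with .*? the lookaround version still reports the text between the two literals, because a quote that was only looked at is still available to the next match.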

answered Jan 31 '18 at 01:29

nicolas

I found this out too, but I don't want to consume the quotes. I think I have come up with a solution. I will edit. – Reaperr Jan 31 '18 at 01:35
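
Consuming the quotes does not have to mean losing them as tokens. One possible approach, sketched below, is to match the whole literal once and then emit three tokens from that single match using the match offsets. The token names, the string_tokens helper, and the escape handling for \" are assumptions for illustration, not something taken from the question.

```python
import re

# Hypothetical pattern: a quoted string whose body may contain backslash
# escapes such as \" but no raw newline.
STRING_RE = re.compile(r'"((?:[^"\\\n]|\\.)*)"')

def string_tokens(source):
    """Yield (kind, text, start) triples for every quoted string in source."""
    for m in STRING_RE.finditer(source):
        yield ("QUOTE", '"', m.start())            # opening quote
        yield ("STRING", m.group(1), m.start(1))   # text between the quotes
        yield ("QUOTE", '"', m.end() - 1)          # closing quote

tokens = list(string_tokens('"hello" int i "what is up"'))
for tok in tokens:
    print(tok)
```

On the sample input this gives a QUOTE, STRING, QUOTE triple for each literal and nothing for int i, which would be left to the lexer's other rules.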
