
I am writing a simple lexer for a subset of the C language in Python. I am using re to find all my tokens, but I am having trouble matching string literals. To match a string literal I am using r'(?<=").*(?=")'. The match is non-inclusive because I want the double quotes to be matched as quote tokens and the content between them as a string literal token. It works fine if a line contains only one string literal, but with "hello" int i "what is up" I match hello correctly and then also match int i, because it is between double quotes as well. How can I prevent this? Right now all of my input is read in at once as a single line.

EDIT: I found my likely issue. I was using a greedy expression, .*, and after switching it to .*? it is matching correctly. The greedy version matched hello" int i "what is up, which is how I discovered it was being greedy. My new regex is r'(?<=").*?(?=")'. Does anyone see any possible conflicts now?

Reaperr

1 Answer


Instead of using lookarounds, you could try this, which consumes the closing " so that it cannot start a new match:

import re
text = '"hello"  int i "what is up"'
print(re.findall(r'"(.*?)"', text))
# ['hello', 'what is up']
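To see why consuming the closing quote matters, here is a small side-by-side sketch on the sample input from the question (with a single space after "hello"). The variable names lookaround and consuming are just illustrative. Because the lookaround version leaves every quote unconsumed, the closing quote of one literal can serve as the opening quote of the next match, so the text between two literals is picked up as well:

```python
import re

text = '"hello" int i "what is up"'

# Lookaround version from the question's edit: the quotes are never consumed,
# so a closing quote can be reused as the "opening" quote of the next match.
lookaround = re.findall(r'(?<=").*?(?=")', text)

# Consuming version from this answer: each match takes both of its quotes,
# so the next search starts after the closing quote.
consuming = re.findall(r'"(.*?)"', text)

print(lookaround)  # ['hello', ' int i ', 'what is up']
print(consuming)   # ['hello', 'what is up']
```

So the non-greedy lookaround pattern still appears to have a conflict on this input, even though it no longer swallows everything between the first and last quote.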
nicolas
  • I found this out too, but I want to avoid consuming the quotes. I think I have come up with a solution. I will edit. – Reaperr Jan 31 '18 at 01:35
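If the goal is to keep the double quotes as their own tokens, one possible approach is to consume them but capture them in separate groups, then emit one token per group. This is only a sketch: the token names QUOTE and STRING_LITERAL and the helper string_tokens are hypothetical, since the question does not show the lexer's actual token types, and escaped quotes (\") inside a literal are not handled.

```python
import re

# Three capture groups: opening quote, literal body, closing quote.
# The quotes are consumed, so a closing quote cannot start a new match.
STRING_RE = re.compile(r'(")(.*?)(")')

def string_tokens(text):
    """Yield (kind, value) pairs for each double-quoted literal in text.

    The token kind names are hypothetical placeholders.
    """
    for m in STRING_RE.finditer(text):
        yield ("QUOTE", m.group(1))
        yield ("STRING_LITERAL", m.group(2))
        yield ("QUOTE", m.group(3))

tokens = list(string_tokens('"hello" int i "what is up"'))
for kind, value in tokens:
    print(kind, repr(value))
```

The quotes still come out as separate tokens, but int i is skipped because both quotes of each literal are consumed as part of the match.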