So I am writing a simple lexer for a subset of the C language in Python. I am using re to match and find all my tokens, but I am having issues with my string literal token matching. To match my string literal I am using: r'(?<=").*(?=")'
I am doing this non-inclusively because I want to match my double quotes as quote tokens and the content in between them as a string literal token. It works fine if a string literal is used only once in a line, but if I do "hello" int i "what is up" I end up matching hello correctly but then also matching int i, because it is in between double quotes too. How can I prevent this? Right now all of my input is read in at once as one line.
EDIT:
I found my possible issue. I was using a greedy expression with .* and I switched it to .*? and it is now matching correctly. It had started matching hello" int i "what is up as a single token, and that is where I discovered it was being greedy. My new regex is: r'(?<=").*?(?=")'
Does anyone see any possible conflicts now?
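One way to check for conflicts is to run both patterns over the example input. This is only a minimal sketch, assuming the tokens are collected with re.findall (how the pattern is actually applied in the lexer is not shown above); the names source, greedy and lazy are made up for illustration.

```python
import re

# Example input from the question, read in as a single line.
source = '"hello" int i "what is up"'

# Original greedy pattern and the lazy replacement.
greedy = re.compile(r'(?<=").*(?=")')
lazy = re.compile(r'(?<=").*?(?=")')

# findall returns every non-overlapping match, scanning left to right.
greedy_matches = greedy.findall(source)
lazy_matches = lazy.findall(source)

print(greedy_matches)  # ['hello" int i "what is up']
print(lazy_matches)    # ['hello', ' int i ', 'what is up']
```

Under that assumption, the lazy pattern still returns the text between the closing quote of the first literal and the opening quote of the second, because the lookbehind cannot tell an opening quote from a closing one.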