It's a pretty standard scanning exercise. Depending on how close you intend to be to the LOLCODE specification (which I can't seem to reach right now, so this is from memory), you've got a few ways to go.
Write a lexer by hand
It's not as hard as it sounds. You just want to analyze your input one character at a time, while maintaining a bit of context information. In your case, the important context consists of two flags:
- one to remember you're currently lexing a string. It'll be set when reading an opening " and cleared when reading the matching, unescaped ".
- one to remember the previous character was an escape. It'll be set when reading \ and cleared when reading the character after that, no matter what it is.
Then the general algorithm looks like this (pseudocode):
    loop on: c ← read next character
        if not inString
            if c is '"' then clear buf; set inString
            else [out of scope here]
        else if inEscape then append c to buf; clear inEscape
        else if c is '"' then clear inString; return buf as result
        else if c is '\' then set inEscape
        else append c to buf
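You might want to refine the inEscape case should you want to implement \r, \n and the like. (If I remember the spec correctly, LOLCODE itself uses : rather than \ as its escape character, as in :) for a newline and :" for a literal quote; the algorithm is the same either way.)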
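To make the two-flag idea concrete, here is a minimal Python sketch of the loop described above. The function name read_string is made up for illustration, it only handles the string case (everything outside a string is skipped), and it keeps \ as the escape character to match the pseudocode.

```python
def read_string(chars):
    """Return the contents of the first double-quoted string in `chars`.

    Hypothetical helper for illustration; characters outside a string
    are simply skipped, since that part is out of scope here.
    """
    in_string = False  # set between the opening and the closing '"'
    in_escape = False  # set right after a '\', for exactly one character
    buf = []
    for c in chars:
        if not in_string:
            if c == '"':
                buf.clear()
                in_string = True
        elif in_escape:
            buf.append(c)  # keep the escaped character verbatim
            in_escape = False
        elif c == '"':
            return "".join(buf)  # closing quote: the string is complete
        elif c == "\\":
            in_escape = True
        else:
            buf.append(c)
    raise ValueError("unterminated string literal")
```

For instance, feeding it the text VISIBLE "HAI \"WORLD\"" should yield HAI "WORLD", because each escaped quote is appended to the buffer instead of ending the string.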
Use a lexer generator
The traditional tools here are lex and flex.
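With a generator, the string token is described by a regular expression instead of hand-written flags. The sketch below is not lex or flex; it uses Python's re module only to show the kind of pattern you would put in such a rule. The names STRING_LITERAL and find_string are made up for illustration, and \ is assumed to be the escape character.

```python
import re

# An opening quote, then any run of "ordinary character" or
# "backslash followed by any character", then the closing quote.
STRING_LITERAL = re.compile(r'"((?:[^"\\]|\\.)*)"', re.DOTALL)


def find_string(source):
    """Return the unescaped contents of the first string literal, or None."""
    match = STRING_LITERAL.search(source)
    if match is None:
        return None
    # Drop each escaping backslash and keep the character after it.
    return re.sub(r"\\(.)", r"\1", match.group(1), flags=re.DOTALL)
```

The alternation inside the group does the job of the inEscape flag: a backslash always consumes the next character, so an escaped quote can never be mistaken for the end of the string.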
Get inspiration
You're not the first one to write a LOLCODE interpreter. There's nothing wrong with peeking at how the others did it. For example, here's the string parsing code from lci.