
I've a rather simple regex that works perfectly fine in my Ruby code but refuses to work in my Lisp code. I'm just trying to match a URL (slash followed by a word, and no more). Here's the regex I have that works in Ruby: ^\/\w*$

I'd like this to match "/" or "/foo" but not "/foo/bar"

I've tried the following:

(cl-ppcre:scan "^/\w*$" "/") ;works
(cl-ppcre:scan "^/\w*$" "/foo") ;doesn't work!
(cl-ppcre:scan "^/\w*$" "/foo/bar") ;works, ie doesn't match

Can someone help?

morgan121
Sunder

2 Answers


In Common Lisp, the backslash (\) is, by default, the single escape character: it prevents any special processing of the character that follows it, so it can be used to include a double quote (") inside a string literal, like this: "\"".

Thus, when you pass the literal string "^/\w*$" to cl-ppcre:scan, the reader drops the backslash and the string that is actually passed is "^/w*$". You can verify this by evaluating (cl-ppcre:scan "^/\w*$" "/w"), which matches.
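The reader behaviour described above can be checked without CL-PPCRE at all. This is a minimal sketch in plain Common Lisp; the variable names *unescaped* and *escaped* are made up for illustration.

```lisp
;; Inside a string literal, the reader treats \ as an escape and keeps
;; only the character that follows it.
(defparameter *unescaped* "^/\w*$")   ; hypothetical name; backslash is dropped
(defparameter *escaped*   "^/\\w*$")  ; hypothetical name; one real backslash

(length *unescaped*)            ; => 5
(string= *unescaped* "^/w*$")   ; => T
(length *escaped*)              ; => 6
(char *escaped* 2)              ; => #\\  (the backslash character)
```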

To include the backslash character in your regular expression, you need to escape it with a second backslash, like so: "^/\\w*$".
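With the backslash doubled, the three cases from the question should behave as intended. The sketch below assumes Quicklisp is available for loading CL-PPCRE; simple-path-p is a hypothetical helper name, not part of the library.

```lisp
;; Assumption: Quicklisp is installed and can load CL-PPCRE.
(eval-when (:compile-toplevel :load-toplevel :execute)
  (ql:quickload :cl-ppcre))

;; Hypothetical helper: true if PATH is "/" optionally followed by word
;; characters and nothing else.
(defun simple-path-p (path)
  (and (cl-ppcre:scan "^/\\w*$" path) t))

(simple-path-p "/")         ; => T
(simple-path-p "/foo")      ; => T
(simple-path-p "/foo/bar")  ; => NIL
```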

If you work with literal regular expressions a lot, the required quoting of strings can become tedious and hard to read. Have a look at CL-INTERPOL for a library that adds a nicer syntax for regular expressions to the Lisp reader.

hans23

If you are in doubt about your regular expression, you can also check how it is parsed with ppcre:parse-string:

CL-USER> (ppcre:parse-string "^/\w*$")
(:SEQUENCE :START-ANCHOR #\/ (:GREEDY-REPETITION 0 NIL #\w) :END-ANCHOR)

The above tells us that backslash-w was interpreted as a literal w character.

Compare this with the expression you wanted to use:

CL-USER> (ppcre:parse-string "^/\\w*$")
(:SEQUENCE :START-ANCHOR #\/ (:GREEDY-REPETITION 0 NIL :WORD-CHAR-CLASS) :END-ANCHOR)

The returned value is a tree that represents a regular expression. You can in fact use the same representation anywhere CL-PPCRE expects a regular expression. Even though it is somewhat verbose, it helps when combining values into regexes, because you do not have to worry about nesting strings or escaping special characters inside strings:

(defun maybe (regex)
  `(:greedy-repetition 0 1 ,regex))

(defparameter *simple-floats*
  (let ((digits '(:register (:greedy-repetition 1 nil :digit-class))))
    (ppcre:create-scanner `(:sequence
                             (:register (:regex "[+-]?"))
                             ,digits
                             ,(maybe `(:sequence "." ,digits))))))

In the example above, the dot "." is read literally, not as a regular expression metacharacter. That means you can match strings like "(^.^)" or "[]" that could be hard to write and read with escaped characters in string-only regexes. You can fall back to regular expressions as strings by using the (:regex "...") expression.
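As a small illustration of both points, here is a sketch that passes parse trees directly to ppcre:scan. It assumes Quicklisp is available for loading CL-PPCRE; *single-segment* is a made-up name for the tree form of the regex from the question.

```lisp
;; Assumption: Quicklisp is installed and can load CL-PPCRE.
(eval-when (:compile-toplevel :load-toplevel :execute)
  (ql:quickload :cl-ppcre))

;; Hypothetical name: the tree that parse-string returns for "^/\\w*$".
(defparameter *single-segment*
  '(:sequence :start-anchor #\/
    (:greedy-repetition 0 nil :word-char-class)
    :end-anchor))

(ppcre:scan *single-segment* "/foo")      ; first two values => 0, 4
(ppcre:scan *single-segment* "/foo/bar")  ; => NIL

;; A string inside a tree is matched literally, so no escaping is needed.
(ppcre:scan '(:sequence :start-anchor "(^.^)" :end-anchor) "(^.^)") ; => 0, 5
(ppcre:scan '(:sequence :start-anchor "(^.^)" :end-anchor) "(^x^)") ; => NIL

;; The string-only equivalent needs every metacharacter escaped twice over.
(ppcre:scan "^\\(\\^\\.\\^\\)$" "(^.^)")  ; first two values => 0, 5
```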

CL-PPCRE has an optimization where constant regular expressions are precomputed at load time, using load-time-value. That optimization might not be applied if your regular expression is not trivially constant, so you may want to wrap your own scanners in load-time-value forms. Just make sure that the definitions they depend on, such as the auxiliary maybe function, are available at load time.
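One way to do that wrapping might look like the following sketch. It assumes Quicklisp is available for loading CL-PPCRE; single-segment-p is a hypothetical function name, and the tree is written inline so that nothing else needs to be defined at load time.

```lisp
;; Assumption: Quicklisp is installed; CL-PPCRE must be loaded before the
;; load-time-value form below is evaluated.
(eval-when (:compile-toplevel :load-toplevel :execute)
  (ql:quickload :cl-ppcre))

;; Hypothetical helper: the scanner is built once, when the code is loaded,
;; rather than on every call.
(defun single-segment-p (path)
  (let ((scanner (load-time-value
                  (ppcre:create-scanner
                   '(:sequence :start-anchor #\/
                     (:greedy-repetition 0 nil :word-char-class)
                     :end-anchor)))))
    (and (ppcre:scan scanner path) t)))

(single-segment-p "/foo")      ; => T
(single-segment-p "/foo/bar")  ; => NIL
```

If the tree were built by a helper function instead of being written inline, that helper would have to be defined before this form is loaded, for example inside an eval-when or in a file loaded earlier.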

coredump