In this answer I focus on the errors in your code and try to explain how you could make it work. As explained by @Svante, this might not be the best course of action for your use case. In particular, your regex might be too tailored to your known test inputs and might miss cases that could arise later.
For example, your regex considers a field to be either a string delimited by double quotes with no inner double quotes (not even escaped ones), or a sequence of characters other than commas. If, however, a field starts with a normal letter and then contains a double quote, that quote will be part of the field.
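To make the second case concrete, here is a minimal sketch. The pattern is only my approximation of the field part of your regex (a quoted string, or a run of non-commas), the input string and the function name are made up, and loading CL-PPCRE through Quicklisp is an assumption about your setup:

```lisp
;; Assumption: Quicklisp is installed and can load CL-PPCRE.
(ql:quickload "cl-ppcre" :silent t)

;; Hypothetical helper: the pattern approximates "quoted string or
;; run of non-comma characters", without the trailing separator.
(defun approximate-fields (line)
  "Return every match of the approximated field pattern in LINE."
  (ppcre:all-matches-as-strings "\"[^\"]+\"|[^,]+" line))

(approximate-fields "ab\"c,d")
;; => ("ab\"c" "d")
```

The first field does not start with a double quote, so only the second branch can match, and it happily swallows the quote in the middle.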
Fixing the test string
Maybe something went wrong when formatting your question, but the form that introduces bads is malformed.
Here is a fixed definition for *bads* (notice the asterisks around the name of the special variable: this is a useful convention, also known as "earmuffs", that helps distinguish special variables from lexical ones):
(defparameter *bads*
"\"AER\",\"BenderlyZwick\",\"Benderly and Zwick Data: Inflation, Growth and Stock returns\",31,5,0,0,0,0,5,\"https://vincentarelbundock.github.io/Rdatasets/csv/AER/BenderlyZwick.csv\",\"https://vincentarelbundock.github.io/Rdatasets/doc/AER/BenderlyZwick.html\"")
Escape characters in regex
The parse tree you obtain contains this:
(... (:GREEDY-REPETITION 0 NIL #\s) ...)
There is a literal character #\s in your parse tree. To understand why, let's define two auxiliary functions:
(defun chars (string)
  "Convert a string to a list of char names"
  (map 'list #'char-name string))
(defun test (s)
  (list :parse (chars s)
        :as (ppcre:parse-string s)))
For example, here is how the different strings below are parsed:
(test "s")
=> (:PARSE ("LATIN_SMALL_LETTER_S") :AS #\s)
(test "\s")
=> (:PARSE ("LATIN_SMALL_LETTER_S") :AS #\s)
(test "\\s")
=> (:PARSE ("REVERSE_SOLIDUS" "LATIN_SMALL_LETTER_S")
:AS :WHITESPACE-CHAR-CLASS)
Only in the last case, where the backslash (reverse solidus) is itself escaped, does the PPCRE parser see both the backslash and the following character #\s and interpret the sequence as :WHITESPACE-CHAR-CLASS. In the second case, the Lisp reader reads \s as s, because inside a Lisp string a backslash only makes the next character literal; there are no special escape sequences such as \s.
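You can check what the reader does without involving PPCRE at all. This is a small sketch in plain Common Lisp; the variable names are only for illustration:

```lisp
;; The reader drops the backslash in "\s" but keeps one in "\\s".
(defparameter *unescaped* "\s")   ; one character: s
(defparameter *escaped* "\\s")    ; two characters: backslash, s

(list (length *unescaped*) (length *escaped*))
;; => (1 2)

(string= *unescaped* "s")
;; => T

(char *escaped* 0)
;; => #\\
```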
I tend to work with parse trees directly because a lot of the headaches with escaping go away (and in my opinion they are exacerbated by \Q and \E). A fixed parse tree is, for example, the following one, where I replaced the #\s with the desired keyword and removed the :register nodes that were not useful:
(:sequence
 (:alternation
  (:sequence #\"
             (:greedy-repetition 1 nil
              (:inverted-char-class #\"))
             #\")
  (:greedy-repetition 1 nil (:inverted-char-class #\,)))
 (:greedy-repetition 0 1
  (:group
   (:sequence #\,
              (:greedy-repetition 0 nil :whitespace-char-class)))))
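A parse tree like this one can be passed wherever PPCRE expects a regex. The following sketch tries it on a short made-up line; the names *field-tree* and first-field-match are mine, and loading through Quicklisp is again an assumption:

```lisp
;; Assumption: Quicklisp is installed and can load CL-PPCRE.
(ql:quickload "cl-ppcre" :silent t)

;; Hypothetical name for the fixed parse tree shown above.
(defparameter *field-tree*
  '(:sequence
    (:alternation
     (:sequence #\"
                (:greedy-repetition 1 nil (:inverted-char-class #\"))
                #\")
     (:greedy-repetition 1 nil (:inverted-char-class #\,)))
    (:greedy-repetition 0 1
     (:group
      (:sequence #\,
                 (:greedy-repetition 0 nil :whitespace-char-class))))))

(defun first-field-match (line)
  "Return the first substring of LINE matched by *FIELD-TREE*."
  (values (ppcre:scan-to-strings *field-tree* line)))

(first-field-match "\"AER\", 31")
;; => "\"AER\", "
```

The match includes the comma and the space that follow the field, which matters when you later try to split with this regex.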
Why the result is NIL
Remember that you are trying to split the string with this regex, but the regex actually describes a field and the following comma. You get NIL because, from the point of view of split, your string is just a sequence of separators, like in this example:
(split #\, ",,,,,,")
=> NIL
With a simpler example, you can see that splitting with words as separators gives:
(split "[a-z]+" "abc0def1z3")
=> ("" "0" "1" "3")
But if the separators also include digits, the whole string is one big separator, only empty strings remain, and the result is NIL:
(split "[a-z0-9]+" "abc0def1z3")
=> NIL
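The empty strings are still there conceptually; split just removes trailing empty strings when no :limit is given. As a rough sketch (the helper name is mine, Quicklisp is assumed, and I rely on the documented behaviour that a positive :limit keeps trailing empty strings), you can make them visible like this:

```lisp
;; Assumption: Quicklisp is installed and can load CL-PPCRE.
(ql:quickload "cl-ppcre" :silent t)

;; Hypothetical helper: a positive :limit that is large enough for any
;; split of STRING keeps the trailing empty strings in the result.
(defun split-keeping-empties (regex string)
  "Split STRING on REGEX without dropping trailing empty strings."
  (ppcre:split regex string :limit (1+ (length string))))

(ppcre:split "," ",,,,,,")
;; => NIL

(split-keeping-empties "," ",,,,,,")
;; => ("" "" "" "" "" "" "")
```

Six separators delimit seven empty pieces, and without :limit all of them count as trailing empty strings.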
Looping over fields
With the regex you defined, it is easier to use do-register-groups. It is a loop construct that iterates over the string by matching the regex repeatedly, binding each (:register ...) in the regex to a variable.
If you put (:register ...) around the first (:alternation ...), you will sometimes capture the double quotes (first branch of the alternation):
(do-register-groups (field)
    ('(:SEQUENCE
       (:register
        (:ALTERNATION
         (:SEQUENCE #\"
                    (:GREEDY-REPETITION 1 NIL
                     (:INVERTED-CHAR-CLASS #\"))
                    #\")
         (:GREEDY-REPETITION 1 NIL (:INVERTED-CHAR-CLASS #\,))))
       (:GREEDY-REPETITION 0 1
        (:GROUP
         (:SEQUENCE #\,
                    (:GREEDY-REPETITION 0 NIL :whitespace-char-class)))))
     *bads*)
  (print field))
"\"AER\""
"\"BenderlyZwick\""
"\"Benderly and Zwick Data: Inflation, Growth and Stock returns\""
"31"
"5"
"0"
"0"
"0"
"0"
"5"
"\"https://vincentarelbundock.github.io/Rdatasets/csv/AER/BenderlyZwick.csv\""
"\"https://vincentarelbundock.github.io/Rdatasets/doc/AER/BenderlyZwick.html\""
Another option is to add two :register nodes, one for each branch of the alternation; that means binding two variables, one of which is NIL for each successful match:
(do-register-groups (quoted simple)
    ('(:SEQUENCE
       (:ALTERNATION
        (:SEQUENCE #\"
                   (:register ;; <- quoted (first register)
                    (:GREEDY-REPETITION 1 NIL
                     (:INVERTED-CHAR-CLASS #\")))
                   #\")
        (:register ;; <- simple (second register)
         (:GREEDY-REPETITION 1 NIL (:INVERTED-CHAR-CLASS #\,))))
       (:GREEDY-REPETITION 0 1
        (:GROUP
         (:SEQUENCE #\,
                    (:GREEDY-REPETITION 0 NIL :whitespace-char-class)))))
     *bads*)
  (print (or quoted simple)))
"AER"
"BenderlyZwick"
"Benderly and Zwick Data: Inflation, Growth and Stock returns"
"31"
"5"
"0"
"0"
"0"
"0"
"5"
"https://vincentarelbundock.github.io/Rdatasets/csv/AER/BenderlyZwick.csv"
"https://vincentarelbundock.github.io/Rdatasets/doc/AER/BenderlyZwick.html"
Inside the loop you could push each field onto a list, or add it to a vector, to be processed later.
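For instance, collecting the fields of one line into a list could look like the sketch below. The function name parse-fields is made up, Quicklisp is assumed, and the parse tree is the two-register variant from above; the third element of the do-register-groups header is its optional result form:

```lisp
;; Assumption: Quicklisp is installed and can load CL-PPCRE.
(ql:quickload "cl-ppcre" :silent t)

;; Hypothetical helper built on the two-register parse tree.
(defun parse-fields (line)
  "Return the fields of LINE as a list of strings, without the quotes."
  (let ((fields '()))
    (ppcre:do-register-groups (quoted simple)
        ('(:sequence
           (:alternation
            (:sequence #\"
                       (:register
                        (:greedy-repetition 1 nil
                         (:inverted-char-class #\")))
                       #\")
            (:register
             (:greedy-repetition 1 nil (:inverted-char-class #\,))))
           (:greedy-repetition 0 1
            (:group
             (:sequence #\,
                        (:greedy-repetition 0 nil :whitespace-char-class)))))
         line
         ;; Result form: PUSH builds the list backwards, so reverse it.
         (nreverse fields))
      (push (or quoted simple) fields))))

(parse-fields "\"a, b\",12, c")
;; => ("a, b" "12" "c")
```

Keep in mind that this inherits the limitations of the regex itself, so it is not a general CSV parser.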