I would like to take a string representing options to a spark-submit command and format them with --conf interspersed between the options. This

concatConf :: String -> String
concatConf = foldl (\acc c -> acc ++ " --conf " ++ c) "" . words

works for most collections of options, e.g.,

λ => concatConf "spark.yarn.memoryOverhead=3g spark.default.parallelism=1000 spark.yarn.executor.memoryOverhead=2000"
" --conf spark.yarn.memoryOverhead=3g --conf spark.default.parallelism=1000 --conf spark.yarn.executor.memoryOverhead=2000"

But on occasion there can be a spark.executor.extraJavaOptions option, whose value is a space-separated list of additional options enclosed in escaped quotes; for example,

"spark.yarn.memoryOverhead=3g spark.executor.extraJavaOptions=\"-verbose:gc -XX:+UseSerialGC -XX:+PrintGCDetails -XX:+PrintAdaptiveSizePolicy\" spark.default.parallelism=1000 spark.yarn.executor.memoryOverhead=2000"

and the concatConf function above obviously breaks down.
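To make the failure concrete: plain `words` has no notion of quoting, so it splits inside the escaped quotes (the binding name `broken` below is just for illustration):

```haskell
-- words splits on every space, so the quoted Java options are torn apart
-- and each fragment would end up with its own --conf.
broken :: [String]
broken = words "spark.yarn.memoryOverhead=3g spark.executor.extraJavaOptions=\"-verbose:gc -XX:+UseSerialGC\""
-- ["spark.yarn.memoryOverhead=3g","spark.executor.extraJavaOptions=\"-verbose:gc","-XX:+UseSerialGC\""]
```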

The following function, using the regex-compat library works for this example

import Data.Monoid ((<>))
import Text.Regex (mkRegex, matchRegexAll)

concatConf :: String -> String
concatConf conf = let regex = mkRegex "(\\ *.*extraJavaOptions=\\\".*\\\")"
                  in case matchRegexAll regex conf of
                    Just (x, y, z, _) -> (insConf x) <> " --conf " <> y <> (insConf z)
                    Nothing           -> ""
                  where insConf = foldl (\acc c -> acc ++ " --conf " ++ c) "" . words

until you figure out that there's a similar spark.driver.extraJavaOptions that comes in the same format. In any case, this function doesn't work when there isn't such an option. Now I'm struggling with many cases: none, one, or both of these options may be present; either one may come first in the string; and so on.

This sort of makes me feel like regex isn't the right tool for the job, hence my question, what is the right tool for this job?

user4601931
  • I think I'd write a modified `words` that keeps track of when it sees quote characters, and doesn't break the string when it's between them. – DarthFennec May 09 '18 at 18:59
  • A regex actually can do this properly. – Willem Van Onsem May 09 '18 at 19:09
  • Your function is fine, except use `foldl1` instead of `foldl` and remove the `""` just before `. words`. The initial value is what is causing the extra `" --conf "` at the beginning. – fp_mora May 10 '18 at 16:36
  • 1
    @fp_mora Thanks, but that's not the issue (and in fact, the "--conf" at the beginning is desired). The issue is that `words` splits the options inside of the escaped quotes, which is not desired. See the answer of @wp78de below, which seems like it would work, except for the fact that the regex engine I'm using (`regex-compat`) doesn't seem to like the `++` and `?:` symbols. – user4601931 May 10 '18 at 16:39

3 Answers


A split is not the right weapon of choice here. Inspired by Jan Goyvaerts' answer here, I suggest substituting with a match pattern instead, one that:

  1. matches a run of characters that are neither spaces nor quotes,
  2. or such a run followed by a group that begins and ends with a quote, with no quotes in between.
[^\s"]+|\s[^\s"]++"(?:[^"]*)"\s

Output after substitution: --conf $0

 --conf spark.yarn.memoryOverhead=3g --conf  spark.executor.extraJavaOptions=\"-verbose:gc -XX:+UseSerialGC
-XX:+PrintGCDetails -XX:+PrintAdaptiveSizePolicy\"  --conf spark.default.parallelism=1000  --conf spark.yarn.executor.memoryOverhead=2000


I hope this is useful to you.

Note: There are some unnecessary spaces in the output since I had to add surrounding spaces to the second pattern. I haven't treated them since it would make the regex even more complicated and your CLI app won't complain, I guess.

wp78de
  • Thank you for your answer; this is really promising. I had been trying to use `subRegex` but could not get the right pattern. I had to escape the backslashes and quotes in your pattern to get it to not complain about "lexical errors", but now I'm getting "repetition-operator operand invalid". E.g., with `pattern = "[^\\s\"]+|\\s[^\\s\"]++\"(?:[^\"]*)\"\\s"` and `x = "spark.yarn.memoryOverhead=3g spark.executor.extraJavaOptions=\"-verbose:gc -XX:+UseSerialGC -XX:+PrintGCDetails -XX:+PrintAdaptiveSizePolicy\" spark.default.parallelism=1000 spark.yarn.executor.memoryOverhead=2000"` – user4601931 May 10 '18 at 14:49
  • ... I write `subRegex (mkRegex pattern) x "--conf \0"` and get the following error: ""*** Exception: user error (Text.Regex.Posix.String died: (ReturnCode 13,"repetition-operator operand invalid"))" Any idea which operand it's talking about? – user4601931 May 10 '18 at 14:50
  • @user4601931 Sorry, I cannot test this end-to-end since I don't code in Haskell and there are too many regex flavors. Try a single `+` instead of the `++`. – wp78de May 10 '18 at 18:12
  • Thanks for your reply, understandable as this now seems less like a Haskell question and more like a regex question. Changing `++` to `+` works, but it doesn't like `?:`. Removing this symbol altogether works (as in, it doesn't error out), but now it doesn't actually match anything it seems. Which is strange, given what I now understand `?:` to mean. – user4601931 May 10 '18 at 20:55
  • Have you tried it like that: `[^\s"]+|\s[^\s"]+"[^"]*"\s` ? I am sorry, there is not much else I can do. – wp78de May 11 '18 at 00:56
  • 1
    This is the one that ended up working: `[^\\ \"]+| [^\\ \"]+\"([^\"]*)\"`. Thanks for putting me on the right track! – user4601931 May 11 '18 at 04:00

I have a useless but interesting partial solution. It reassembles the strings that `words` tears apart. Since it uses `scanl1`, it also outputs each partially assembled string before the final one.

strl = "spark.yarn.memoryOverhead=3g spark.executor.extraJavaOptions=\"-verbose:gc -XX:+UseSerialGC -XX:+PrintGCDetails -XX:+PrintAdaptiveSizePolicy\" spark.default.parallelism=1000 spark.yarn.executor.memoryOverhead=2000"

Now the function

t = scanl1 (\acc l -> if (take 5 l) == "spark" then "--conf " ++ l else  acc ++ l ) $ words strl

Now the results, with three excess records:

t !! 0
"spark.yarn.memoryOverhead=3g"
t !! 1
"spark.executor.extraJavaOptions=\"-verbose:gc"
t !! 2
"spark.executor.extraJavaOptions=\"-verbose:gc-XX:+UseSerialGC"
t !! 3
"spark.executor.extraJavaOptions=\"-verbose:gc-XX:+UseSerialGC-XX:+PrintGCDetails"
t !! 4
"spark.executor.extraJavaOptions=\"-verbose:gc-XX:+UseSerialGC-XX:+PrintGCDetails-XX:+PrintAdaptiveSizePolicy\""
t !! 5
"spark.default.parallelism=1000"
t !! 6
"spark.yarn.executor.memoryOverhead=2000"

The only thing going for this is that it assembles correctly, if excessively. I added the `--conf` after I ran it; otherwise it would appear before each line.
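For what it's worth, the same "starts with spark" heuristic can be completed in a single pass with `break`, which also avoids the excess records and restores the spaces that the `scanl1` version drops. The names `regroup` and `concatConf''` are mine:

```haskell
-- Group each "spark..." word together with the following words that do not
-- start with "spark", rejoining them with the spaces that words removed.
regroup :: [String] -> [String]
regroup []     = []
regroup (w:ws) =
  let (cont, rest) = break (\x -> take 5 x == "spark") ws
  in unwords (w : cont) : regroup rest

-- The --conf prefix can then be applied per group.
concatConf'' :: String -> String
concatConf'' = concatMap (" --conf " ++) . regroup . words
```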

fp_mora

This sort of makes me feel like regex isn't the right tool for the job, hence my question, what is the right tool for this job?

The right tool for this job is monadic parsers.

{-# LANGUAGE TypeFamilies #-}

import Text.Megaparsec
import Text.Megaparsec.Char
import Replace.Megaparsec
import Data.Void
import Data.Either

-- | Invert a single-token parser “character class”.
--
-- For example, match any single token except a letter or whitespace:
--
-- > anySingleExcept (letterChar <|> spaceChar)
-- |
anySingleExcept :: (MonadParsec e s m, Token s ~ Char) => m (Token s) -> m (Token s)
anySingleExcept p = notFollowedBy p *> anySingle

nonSpaceQuoted :: Parsec Void String String
nonSpaceQuoted = 
  ((chunk "\\\"") *> manyTill anySingle (chunk "\\\"")) -- match anything between escaped quotes
  <|> -- or
  (pure <$> anySingleExcept spaceChar) -- match anything that's not a space

wordsQuoted :: Parsec Void String String
wordsQuoted = fst <$> match (some nonSpaceQuoted)

input = "spark.yarn.memoryOverhead=3g spark.executor.extraJavaOptions=\\\"-verbose:gc -XX:+UseSerialGC -XX:+PrintGCDetails -XX:+PrintAdaptiveSizePolicy\\\" spark.default.parallelism=1000 spark.yarn.executor.memoryOverhead=2000"

putStrLn $ unlines $ fmap ("--conf " <>) $ rights $ splitCap wordsQuoted input

Here's the output, printed with unlines instead of unwords for clarity:

--conf spark.yarn.memoryOverhead=3g
--conf spark.executor.extraJavaOptions=\"-verbose:gc -XX:+UseSerialGC -XX:+PrintGCDetails -XX:+PrintAdaptiveSizePolicy\"
--conf spark.default.parallelism=1000
--conf spark.yarn.executor.memoryOverhead=2000
James Brock