0

I know Scala can split strings on regexes, like this simple split on whitespace:

myString.split("\\s+").foreach(println)

What if I want to split on whitespace, but the input may contain a quoted string that I want treated as a single token?

"""This is a "very complex" test"""

In this example I want the resulting substrings to be:

This
is
a
very complex
test
ashawley
Greg

4 Answers

5

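While handling quoted expressions with split can be tricky, doing so with Regex matches is quite easy. We just need to match every quoted character sequence with \"(.*?)\" and every other run of non-whitespace characters with ([^\\s]+). The quoted alternative comes first in the pattern because alternatives are tried left to right. findAllMatchIn returns a single-use iterator, so toList is added to let the matches be traversed more than once: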

import scala.util.matching._

val text = """This is a "very complex" test"""
val regex = new Regex("\"(.*?)\"|([^\\s]+)")
val matches = regex.findAllMatchIn(text).toList
val words = matches.map { _.subgroups.flatMap(Option(_)).fold("")(_ ++ _) }

words.foreach(println)

/*
This
is
a
very complex
test
*/

Note that this solution treats the closing quote as a word boundary too, so text immediately following a quoted string becomes a separate word. If you want to inline quoted strings into the surrounding expressions, you'll need to add [^\\s]* on both sides of the quoted case and adjust the group boundaries correspondingly:

...
val text = """This is a ["very complex"] test"""
val regex = new Regex("([^\\s]*\".*?\"[^\\s]*)|([^\\s]+)")
...
/*
This
is
a
["very complex"]
test
*/

You can also drop the quote characters when inlining a string by splitting that regex group into three, so the quotes fall outside every group:

...
val text = """This is a ["very complex"] test"""
val regex = new Regex("([^\\s]*)\"(.*?)\"([^\\s]*)|([^\\s]+)")
...
/*
This
is
a
[very complex]
test
*/
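
Since the three variants above differ only in the pattern, they can be compared side by side. This is a minimal sketch, assuming a hypothetical helper named tokenize (not part of the original answer) that wraps the match-and-concatenate step; the bracketed sample also shows why the original pattern is not enough when a quote sits in the middle of a token:

```scala
import scala.util.matching.Regex

// Hypothetical helper: apply a pattern and join whichever groups took part
// in each match (groups that did not participate come back as null).
def tokenize(regex: Regex, text: String): List[String] =
  regex.findAllMatchIn(text).map(_.subgroups.filter(_ != null).mkString).toList

val sample = """This is a ["very complex"] test"""

// Original pattern: a quoted string is only recognized where a token starts
// with a quote, so the bracketed sample is split at the inner space.
val boundary = tokenize(new Regex("\"(.*?)\"|([^\\s]+)"), sample)

// Inlined pattern: the quoted part stays attached to its neighbours, quotes kept.
val inlined = tokenize(new Regex("([^\\s]*\".*?\"[^\\s]*)|([^\\s]+)"), sample)

// Split-group pattern: same tokens, but the quote characters are dropped.
val unquoted = tokenize(new Regex("([^\\s]*)\"(.*?)\"([^\\s]*)|([^\\s]+)"), sample)
```

Here boundary contains ["very and complex"] as two separate entries, while inlined and unquoted each keep the bracketed phrase as one entry.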
P. Frolov
  • As Wiktor Stribiżew noted, the regex won't handle back-to-back quoted strings. A fix seems to be to add a question mark to make the quantifier lazy rather than greedy: new Regex("\"(.*?)\"|([^\\s]+)") – Greg Apr 28 '17 at 13:11
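
To illustrate the point raised in the comment, here is a small sketch comparing a greedy and a lazy quantifier on two adjacent quoted strings. The input and the value names are made up for the illustration:

```scala
import scala.util.matching.Regex

// Hypothetical input with two quoted strings back to back.
val twoQuoted = "\"a b\" \"c d\""

// Greedy: .* runs on to the last quote in the input, fusing both strings.
val greedyRegex = new Regex("\"(.*)\"|([^\\s]+)")
val fused = greedyRegex.findAllMatchIn(twoQuoted)
  .map(_.subgroups.filter(_ != null).mkString).toList

// Lazy: .*? stops at the first closing quote, so each string is matched alone.
val lazyRegex = new Regex("\"(.*?)\"|([^\\s]+)")
val separate = lazyRegex.findAllMatchIn(twoQuoted)
  .map(_.subgroups.filter(_ != null).mkString).toList
```

With the greedy pattern, fused holds a single entry that still contains the inner quotes; with the lazy pattern, separate holds a b and c d as two entries.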
1

In more complex scenarios, such as CSV strings, you are better off using a CSV parser (e.g. scala-csv).

For a string like the one in question, when you do not have to deal with escaped quotation marks, nor with any "wild" quotes appearing in the middle of the fields, you may adapt a known Java solution (see Regex for splitting a string using space when not surrounded by single or double quotes):

val text = """This is a "very complex" test"""
val p = "\"([^\"]*)\"|[^\"\\s]+".r
val allMatches = p.findAllMatchIn(text).map(
    m => if (m.group(1) != null) m.group(1) else m.group(0)
)
println(allMatches.mkString("\n"))

See the online Scala demo, output:

This
is
a
very complex
test

The regex is rather basic: it contains two alternatives, a single capturing group, and two negated character classes. Here are its details:

  • \"([^\"]*)\" - a ", followed by 0+ chars other than " (captured into Group 1), and then a "
  • | - or
  • [^\"\\s]+ - 1+ chars other than " and whitespace.

You grab .group(1) only if Group 1 participated in the match; otherwise, you grab the whole match value (.group(0)).
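
The same group check can be written with Option, which turns the null from a non-participating group into None. This is a sketch only; splitQuoted is a hypothetical wrapper name, and the second call shows what this pattern does with an unbalanced quote:

```scala
// Hypothetical wrapper around the pattern from this answer.
def splitQuoted(text: String): List[String] = {
  val p = "\"([^\"]*)\"|[^\"\\s]+".r
  p.findAllMatchIn(text)
    // group(1) is null when the unquoted alternative matched; Option(null) is None.
    .map(m => Option(m.group(1)).getOrElse(m.matched))
    .toList
}

val balanced = splitQuoted("""This is a "very complex" test""")

// A quote with no partner matches neither alternative, so it is skipped
// and the words after it are returned as ordinary tokens.
val unbalanced = splitQuoted("""a "b c""")
```

So an unclosed quote is silently dropped rather than reported, which is worth keeping in mind if the input is not guaranteed to be well formed.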

Wiktor Stribiżew
0

This should work:

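val xx = """This is a "very complex" test"""
val tokens = xx.split("\\s+").filter(_.nonEmpty)
val merged = scala.collection.mutable.ArrayBuffer.empty[String]
var inQuote = false
for (t <- tokens) {
  // Inside an open quote, glue the token onto the previous entry.
  if (inQuote) merged(merged.length - 1) = merged.last + " " + t
  else merged += t
  // An odd number of quote characters opens or closes a quoted run.
  if (t.count(_ == '"') % 2 == 1) inQuote = !inQuote
}
for (w <- merged) println(w.replace("\"", ""))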
Zzrot
0

Rather than using split, I used a recursive approach. Treat the input string as a List[Char], then step through, inspecting the head of the list to see if it is a quote or whitespace, and handle accordingly.

def fancySplit(s: String): List[String] = {
    def recurse(s: List[Char]): List[String] = s match {
        case Nil => Nil
        case '"' :: tail =>
            val (quoted, theRest) = tail.span(_ != '"')
            quoted.mkString :: recurse(theRest drop 1)
        case c :: tail if c.isWhitespace => recurse(tail)
        case chars =>
            val (word, theRest) = chars.span(c => !c.isWhitespace && c != '"')
            word.mkString :: recurse(theRest)
    }
    recurse(s.toList)
}
  • If the list is empty, you've finished recursion
  • If the first character is a ", grab everything up to the next quote, and recurse with what's left (after throwing out that second quote).
  • If the first character is whitespace, throw it out and recurse from the next character
  • In any other case, grab everything up to the next split character, then recurse with what's left

Results:

scala> fancySplit("""This is a "very complex" test""") foreach println
This
is
a
very complex
test
Dylan
  • "Treat the input string as a List[Char]" This has really, really, really, really large memory overhead. – Alexey Romanov Apr 27 '17 at 21:30
  • True, not a good idea if your input is huge. But the concept stands: if your input is large, you could do operations in terms of "charAt" rather than List ops. – Dylan Apr 28 '17 at 13:25
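
The charAt idea from the last comment could look roughly like the following. It is a sketch only: fancySplitIndexed is a hypothetical name, and it is meant to follow the same four cases as the recursive version while walking the String by index instead of building a List[Char]:

```scala
import scala.collection.mutable.ListBuffer

// Hypothetical index-based variant of fancySplit.
def fancySplitIndexed(s: String): List[String] = {
  val out = ListBuffer.empty[String]
  var i = 0
  while (i < s.length) {
    val c = s.charAt(i)
    if (c == '"') {
      // Quoted token: take everything up to the next quote (or end of input).
      val close = s.indexOf('"', i + 1)
      val end = if (close < 0) s.length else close
      out += s.substring(i + 1, end)
      i = end + 1 // step past the closing quote
    } else if (c.isWhitespace) {
      i += 1 // discard separators
    } else {
      // Bare word: runs until whitespace or a quote.
      var j = i
      while (j < s.length && !s.charAt(j).isWhitespace && s.charAt(j) != '"') j += 1
      out += s.substring(i, j)
      i = j
    }
  }
  out.toList
}

val indexed = fancySplitIndexed("""This is a "very complex" test""")
```

Being a loop rather than a non-tail recursion, this variant also does not grow the call stack with the number of tokens.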