8

I'm trying to capture parts of a multi-lined string with a regex in Scala. The input is of the form:

val input = """some text
              |begin {
              |  content to extract
              |  content to extract
              |}
              |some text
              |begin {
              |  other content to extract
              |}
              |some text""".stripMargin

I've tried several possibilities that should get me the text out of the begin { } blocks. One of them:

val Block = """(?s).*begin \{(.*)\}""".r

input match {
  case Block(content) => println(content)
  case _ => println("NO MATCH")
}

I get NO MATCH. If I drop the \}, the regex becomes (?s).*begin \{(.*) and it matches the last block, including the unwanted } and "some text". I checked my regex at rubular.com with /.*begin \{(.*)\}/m, and there it matches at least one block. I thought that once my Scala regex matched the same way, I could start using findAllIn to match all blocks. What am I doing wrong?

I had a look at Scala Regex enable Multiline option but I could not manage to capture all the occurrences of the text blocks in, for example, a Seq[String]. Any help is appreciated.

Thomas Rawyler

3 Answers

11

As Alex has said, when using pattern matching to extract fields from regular expressions, the pattern acts as if it were anchored (i.e., wrapped in ^ and $), so it has to match the whole input. The usual way to avoid this problem is to use findAllIn first. This way:

val input = """some text
              |begin {
              |  content to extract
              |  content to extract
              |}
              |some text
              |begin {
              |  other content to extract
              |}
              |some text""".stripMargin

val Block = """(?s)begin \{(.*)\}""".r

Block findAllIn input foreach (_ match {
  case Block(content) => println(content)
  case _ => println("NO MATCH")
})

Otherwise, you can use .* at the beginning and end to get around that restriction:

val Block = """(?s).*begin \{(.*)\}.*""".r

input match {
  case Block(content) => println(content)
  case _ => println("NO MATCH")
}
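One thing to keep in mind with this variant: it still matches the whole input exactly once, so it can return at most one block. A minimal sketch of what it captures on the sample input (the name `Whole` is only illustrative, and I trim the result for readability):

```scala
val input = """some text
              |begin {
              |  content to extract
              |  content to extract
              |}
              |some text
              |begin {
              |  other content to extract
              |}
              |some text""".stripMargin

// Hypothetical name; same pattern as above, padded with .* on both sides.
val Whole = """(?s).*begin \{(.*)\}.*""".r

// The leading greedy .* swallows as much as it can, so the capture
// group ends up holding only the LAST block.
val captured: Option[String] = input match {
  case Whole(content) => Some(content.trim)
  case _              => None
}

println(captured) // Some(other content to extract)
```

So this form is fine for checking that a block exists, but getting every block needs something like findAllIn.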

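By the way, you probably want a non-greedy (reluctant) quantifier, .*? instead of .*, so that each match stops at the first closing } instead of running on to the last one: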

val Block = """(?s)begin \{(.*?)\}""".r

Block findAllIn input foreach (_ match {
  case Block(content) => println(content)
  case _ => println("NO MATCH")
})
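Since the question asks for the blocks as a Seq[String], here is a rough sketch of collecting them instead of printing. It assumes a Scala version whose Regex provides findAllMatchIn, and the trim call is my own addition to drop the newlines around each block body:

```scala
val input = """some text
              |begin {
              |  content to extract
              |  content to extract
              |}
              |some text
              |begin {
              |  other content to extract
              |}
              |some text""".stripMargin

val Block = """(?s)begin \{(.*?)\}""".r

// findAllMatchIn yields one Regex.Match per block;
// group(1) is the text captured between the braces.
val blocks: Seq[String] =
  Block.findAllMatchIn(input).map(_.group(1).trim).toList

blocks.foreach(println)
```

Note that the reluctant .*? stops at the first }, so this would cut a block short if its content itself contained a closing brace.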
Daniel C. Sobral
  • Do you know if this is documented anywhere? – Alex Neth Nov 07 '09 at 05:43
  • Alex, at this point, I'm not sure. I did so much with Regex, even extending the library, that I can't even recall what the library provides or not! For instance, I was going to write `Block findAllMatchesIn input map (_ group 0)`, when I discovered this method doesn't exist in the library as it is. – Daniel C. Sobral Nov 07 '09 at 19:45
  • 2
    for(Block(content) <- Block findAllIn input) println(content) – Ken Bloom Nov 08 '09 at 03:56
  • 1
    Thanks, Ken. I should have thought of providing at least one for-comprehension example. And doing the pattern match on the LHS is clever. – Daniel C. Sobral Nov 10 '09 at 13:16
1

When doing a match, I believe a full match is implicitly required. Your match is equivalent to:

val Block = """^(?s).*begin \{(.*)\}$""".r

It works if you add .* to the end:

val Block = """(?s).*begin \{(.*)\}.*""".r

I haven't been able to find any documentation on this, but I have encountered this same issue.
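For what it's worth, the behaviour can be seen side by side in a small sketch. It assumes a Scala version where Regex has the `unanchored` method, which makes the extractor succeed when the pattern is found anywhere in the input; the names `Anchored` and `Unanchored` are just for illustration:

```scala
val input = """some text
              |begin {
              |  content to extract
              |}
              |some text""".stripMargin

// Default extractor: the pattern has to cover the entire input.
val Anchored = """(?s)begin \{(.*?)\}""".r
val anchoredResult: Option[String] = input match {
  case Anchored(content) => Some(content.trim)
  case _                 => None
}

// Unanchored extractor: a match anywhere in the input is enough.
val Unanchored = Anchored.unanchored
val unanchoredResult: Option[String] = input match {
  case Unanchored(content) => Some(content.trim)
  case _                   => None
}

println(anchoredResult)   // None
println(unanchoredResult) // Some(content to extract)
```

Like the .* padding, the unanchored extractor only gives you the first block it finds, not all of them.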

Alex Neth
0

As a complement to the other answers, I wanted to point out the existence of kantan.regex, which lets you write the following:

import kantan.regex.ops._

// The type parameter is the type as which to decode results,
// the value parameters are the regular expression to apply and the group to
// extract data from.
input.evalRegex[String]("""(?s)begin \{(.*?)\}""", 1).toList

This yields:

List(Success(
  content to extract
  content to extract
), Success(
  other content to extract
))
Nicolas Rinaudo