6

My question is the same as "Split string including regular expression match", but for Scala. Unfortunately, the JavaScript solution doesn't work in Scala.

I am parsing some text. Let's say I have some string:

"hello wold <1> this is some random text <3> foo <12>"

I would like to get the following Seq: "hello wold" :: "<1>" :: "this is some random text" :: "<3>" :: "foo" :: "<12>".

Note that I am splitting the string whenever I encounter a <number> sequence.

Noel Yap

2 Answers

6
val s = "hello wold <1> this is some random text <3> foo <12>"
s: java.lang.String = hello wold <1> this is some random text <3> foo <12>

s.split("""((?=<\d{1,3}>)|(?<=<\d{1,3}>))""")
res0: Array[java.lang.String] = Array(hello wold , <1>,  this is some random text , <3>,  foo , <12>)

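Did you actually try out your edit? Using \d+ inside the look-behind doesn't work, because the look-behind group then has no obvious maximum length, which is why the pattern above uses \d{1,3}. See this question.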

s.split("""((?=<\d+>)|(?<=<\d+>))""")
java.util.regex.PatternSyntaxException: Look-behind group does not have an obvious maximum length near index 19
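
If the number inside the brackets can be arbitrarily long, one way around the look-behind limit is to avoid look-behind altogether. The following is only a sketch, not part of the original answer: `SplitKeepingDelimiters` and `splitKeeping` are hypothetical names, and the trimming of whitespace is an assumption based on the output the question asks for. It walks the matches with `findAllMatchIn` and slices out the text between them.

```scala
import scala.util.matching.Regex

object SplitKeepingDelimiters {
  // Hypothetical helper: returns the text between matches and the matches
  // themselves, in order. No look-behind is used, so the delimiter pattern
  // may contain unbounded quantifiers such as \d+.
  def splitKeeping(input: String, delimiter: Regex): Seq[String] = {
    val parts = Seq.newBuilder[String]
    var last = 0 // index just past the previous match
    for (m <- delimiter.findAllMatchIn(input)) {
      if (m.start > last) parts += input.substring(last, m.start) // text before the delimiter
      parts += m.matched                                           // the delimiter itself
      last = m.end
    }
    if (last < input.length) parts += input.substring(last)        // trailing text, if any
    // Assumption: surrounding whitespace is unwanted, as in the question's expected Seq.
    parts.result().map(_.trim).filter(_.nonEmpty)
  }
}
```

Called as `SplitKeepingDelimiters.splitKeeping(s, "<\\d+>".r)`, this should give the six elements from the question without the surrounding spaces.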
Akos Krivachy
  • Can you explain what the `?=` and `?<=` are doing or point me to a page that does? – Noel Yap Nov 14 '13 at 01:03
  • Sure, they're called lookarounds, part of regular expressions. You can read more about them here: http://www.rexegg.com/regex-lookarounds.html – Akos Krivachy Nov 14 '13 at 01:12
  • No problem, we're here to learn. I ran into the same problem and had to research the issue also. ;) You can quickly try out Scala code blocks with these online tools: http://www.simplyscala.com/ and http://www.compileonline.com/compile_scala_online.php – Akos Krivachy Nov 14 '13 at 10:12
  • I don't understand, what part of the above code causes the delimiter not to disappear? Is it the fact that the regular expression is matching lookahead and lookbehind? – ErikAGriffin Nov 13 '15 at 14:22
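
On the last comment: yes, as far as I understand it, that is the reason. `split` throws away whatever the pattern matches, and a lookaround matches a position rather than any characters, so there is nothing to throw away. A small sketch to illustrate (the object name `LookaroundDemo` and the sample string are made up for this example):

```scala
object LookaroundDemo {
  val s = "a<1>b"

  // Consuming pattern: the matched "<1>" is the separator and is discarded.
  val consuming: List[String] = s.split("""<\d{1,3}>""").toList        // List(a, b)

  // Look-ahead only: splits at the empty position just before "<1>".
  val before: List[String] = s.split("""(?=<\d{1,3}>)""").toList       // List(a, <1>b)

  // Look-behind only: splits at the empty position just after "<1>".
  val after: List[String] = s.split("""(?<=<\d{1,3}>)""").toList       // List(a<1>, b)

  // Either position: the delimiter ends up as an element of its own.
  val both: List[String] =
    s.split("""(?=<\d{1,3}>)|(?<=<\d{1,3}>)""").toList                 // List(a, <1>, b)
}
```

Combining the two zero-width positions with `|` is what makes the delimiter come out as a separate array element.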
1

Here's a quick but slightly hacky solution:

scala> val str = "hello wold <1> this is some random text <3> foo <12>"
str: String = hello wold <1> this is some random text <3> foo <12>

scala> str.replaceAll("<\\d+>", "_$0_").split("_")
res0: Array[String] = Array("hello wold ", <1>, " this is some random text ", <3>, " foo ", <12>)

Of course, the problem with this solution is that it gives the underscore character a special meaning. If an underscore occurs naturally in the original string, you'll get bad results. So you either have to choose another magic character sequence that you are sure won't occur in the original string, or add some escaping/unescaping.
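
To make the caveat concrete, here is a rough sketch with the marker passed in as a parameter. `SentinelSplit` and `splitWithSentinel` are hypothetical names, and using the NUL character as the marker is just one assumption about what "won't occur in the original string" might mean for your data.

```scala
import java.util.regex.{Matcher, Pattern}

object SentinelSplit {
  // Hypothetical helper: wraps every <number> in `sentinel` and then splits on it.
  // The sentinel is quoted so that it is treated literally both in the
  // replacement string and in the split pattern.
  // Assumption: `sentinel` does not occur anywhere in `input`.
  def splitWithSentinel(input: String, sentinel: String): List[String] = {
    val marker = Matcher.quoteReplacement(sentinel)
    input
      .replaceAll("<\\d+>", marker + "$0" + marker) // $0 is the whole match
      .split(Pattern.quote(sentinel))
      .toList
  }
}
```

With `"_"` as the sentinel, `"snake_case <1> x"` is also broken at the underscore inside `snake_case`; with `"\u0000"` it is split only around `<1>`.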

Another solution uses lookahead and lookbehind patterns, as described in this question.

ghik