3

Typing this in Scala (pattern matching with a regexp to find the value of the id field):

val str = """<path sodipodi:nodetypes="csszsscsscssssscssssscc" inkscape:connector-curvature="0" id="basarbre" d="M 111.11111,111.11111 C 101.11111,111.1001 111.11111,111.11111 111.1011,101.01111 111.11111,111.1111 111.11111,110.11111 111.10111,111.11101 110.01111,111.11111 110.11111,111.11101 111.11111,111.01111 110.11111,111.1111 101.11111,111.10111 111.11111,111.11111 111.11111,101.11111 111.11111,111.11111 111.11111,111.11111 111.11111,111.11101 111.11111,101.11111 111.11111,101.11111 111.11111,101.11111 111.111,111.11101 101.01111,110.11111 111.11111,111.11111 101.1111,111.11111 101.11101,110.11111 111.10111,110.11101 101.11111,111.11111 101.11111,111.11111 101.11111,111.11111 111.11111,110.1111 111.10111,111.11111 111.11011,111.11111 111.11101,111.11111 111.01111,111.11111 110.11111,111.11111 111.11111,111.11111 110.01111,111.11111 111.11111,111.11111 111.11111,111.11111 111.01111,101.11111 111.11111,111.11101 110.11011,110.11111 101.11111,111.01111 11.111111,111.11111 11.111111,111.11111 11.111111,111.11111 11.111111,111.11111 11.111111,111.1111 10.111111,111.11111 11.111111,101.11111 11.010111,100.11111 11.111111,110.11111 11.111111,110.11111 11.111111,111.11111 11.111111,111.11111 11.010111,111.1111 11.101111,111.01111 11.11011,101.11111 -11.111111,110.11111 11.011111,111.11111 11.111111,111.10101 11.11111,111.11111 111.11101,111.01011 111.11101,111.01011 z" style="fill:#511b00;fill-opacity:1;stroke:none" xmlns:inkscape="http://www.inkscape.org/namespaces/inkscape" xmlns:sodipodi="http://sodipodi.sourceforge.net/DTD/sodipodi-0.dtd" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://www.w3.org/2000/svg" xmlns:svg="http://www.w3.org/2000/svg" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:cc="http://creativecommons.org/ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:osb="http://www.openswatchbook.org/uri/2009/osb"/>"""

val Idpattern = """.*id="([^"]*)"(?:[\n\r\t]|.)*""".r

str match {
  case Idpattern(id) => id
  case _ => "no id"
}

Yields the following exception trace:

at java.util.regex.Pattern$GroupTail.match(Pattern.java:4615)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4466)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3694)
at java.util.regex.Pattern$Branch.match(Pattern.java:4502)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4556)
at java.util.regex.Pattern$Loop.match(Pattern.java:4683)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4615)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4466)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3694)
...

How can I overcome this problem? I could parse the XML with a library, but I would rather not use something that complicated. I thought a regexp would be fast and reliable.

Mikaël Mayer
  • You might not think you need the complexity of an xml parser, but unless you control your inputs carefully, you could find issues. For example, if your input includes the string "id=" somewhere (like this comment does), it could break unexpectedly. – kmorris Sep 14 '13 at 18:42
  • It might not be able to recognise which is the right `id` to extract the value from. The `id` in kmorris' comment won't break it though, but something like `id=""` in the string (after the first `id="basarbre"`) will. – Jerry Sep 14 '13 at 18:55

3 Answers

5

Actually, Scala provides native XML manipulation. If you remove the """ at the beginning and end of str, it becomes an XML literal (a NodeSeq) that you can easily manipulate, like:

val str = <path sodipodi:nodetypes="csszsscsscssssscssssscc" inkscape:connector-curvature="0" id="basarbre" d="M 111.11111,111.11111 C 101.11111,111.1001 111.11111,111.11111 111.1011,101.01111 111.11111,111.1111 111.11111,110.11111 111.10111,111.11101 110.01111,111.11111 110.11111,111.11101 111.11111,111.01111 110.11111,111.1111 101.11111,111.10111 111.11111,111.11111 111.11111,101.11111 111.11111,111.11111 111.11111,111.11111 111.11111,111.11101 111.11111,101.11111 111.11111,101.11111 111.11111,101.11111 111.111,111.11101 101.01111,110.11111 111.11111,111.11111 101.1111,111.11111 101.11101,110.11111 111.10111,110.11101 101.11111,111.11111 101.11111,111.11111 101.11111,111.11111 111.11111,110.1111 111.10111,111.11111 111.11011,111.11111 111.11101,111.11111 111.01111,111.11111 110.11111,111.11111 111.11111,111.11111 110.01111,111.11111 111.11111,111.11111 111.11111,111.11111 111.01111,101.11111 111.11111,111.11101 110.11011,110.11111 101.11111,111.01111 11.111111,111.11111 11.111111,111.11111 11.111111,111.11111 11.111111,111.11111 11.111111,111.1111 10.111111,111.11111 11.111111,101.11111 11.010111,100.11111 11.111111,110.11111 11.111111,110.11111 11.111111,111.11111 11.111111,111.11111 11.010111,111.1111 11.101111,111.01111 11.11011,101.11111 -11.111111,110.11111 11.011111,111.11111 11.111111,111.10101 11.11111,111.11111 111.11101,111.01011 111.11101,111.01011 z" style="fill:#511b00;fill-opacity:1;stroke:none" xmlns:inkscape="http://www.inkscape.org/namespaces/inkscape" xmlns:sodipodi="http://sodipodi.sourceforge.net/DTD/sodipodi-0.dtd" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://www.w3.org/2000/svg" xmlns:svg="http://www.w3.org/2000/svg" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:cc="http://creativecommons.org/ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:osb="http://www.openswatchbook.org/uri/2009/osb"/>

val idAttribute = str \\ "@id"

val id = if (idAttribute.isEmpty) "no id" else idAttribute.text

You can read more here
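
If the markup only arrives as a String at runtime, the literal approach above does not apply directly. Here is a minimal sketch, assuming the input is well-formed XML with a single root element. It swaps in the JDK's built-in DOM parser instead of scala.xml so that no extra dependency is needed; the names IdFromXmlString and idOf are made up for illustration.

```scala
import java.io.StringReader
import javax.xml.parsers.DocumentBuilderFactory
import org.xml.sax.InputSource

// Hypothetical helper, not part of the original answer.
object IdFromXmlString {
  // Parses the string as XML and returns the root element's id attribute,
  // or "no id" when the attribute is absent or empty.
  def idOf(xml: String): String = {
    val builder = DocumentBuilderFactory.newInstance().newDocumentBuilder()
    val doc = builder.parse(new InputSource(new StringReader(xml)))
    // Element.getAttribute returns an empty string for a missing attribute.
    val id = doc.getDocumentElement.getAttribute("id")
    if (id.isEmpty) "no id" else id
  }
}
```

Calling IdFromXmlString.idOf(str) on the original triple-quoted string should work the same way, since that string declares its namespaces on the element itself.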

Chirlo
  • Nice trick. I knew this possibility but thank you for reminding. – Mikaël Mayer Sep 14 '13 at 20:51
  • Not really a trick. A recent answer elsewhere referenced the notorious http://stackoverflow.com/a/1732454/1296806 . And see http://stackoverflow.com/q/590747/1296806 . But core xml has been split into a subproject; maybe all the more reason to pick your xml poison. – som-snytt Sep 14 '13 at 21:31
2

For a task like this, it's better to write a regex that only matches part of the string:

scala> val Idpattern = """id="([^"]*)"""".r
scala> Idpattern.findFirstMatchIn(str).map(_.group(1))
res10: Option[String] = Some(basarbre)

This way, the regex engine can start by scanning through the string for an 'i'. With your original regex, the greedy .* will match the entire string, and then the regex engine will start backtracking from the end. As for why your regex blew the stack, I think this might be a problem with Java's handling of the alternation at the end of the regex, but I'm not really sure. The shorter regex gives less opportunity for recursion.
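
A small sketch of the same idea with the "no id" fallback from the question folded in. The object and method names are hypothetical, and the leading \s is an added assumption that the attribute is preceded by whitespace, so that names merely ending in "id" are not picked up.

```scala
// Hypothetical wrapper around the partial-match approach.
object FirstId {
  // Matches only the attribute, never the whole string, so there is no
  // leading or trailing ".*" for the engine to backtrack through.
  private val IdPattern = """\sid="([^"]*)"""".r

  // Returns the first id value found, or "no id" when nothing matches.
  def firstId(s: String): String =
    IdPattern.findFirstMatchIn(s).map(_.group(1)).getOrElse("no id")
}
```

Returning the Option directly instead of calling getOrElse may be preferable when the caller needs to tell a missing id apart from a literal "no id" value.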

wingedsubmariner
2

Here is a correction to your regex, where you are trying to consume line endings. The (?s) flag turns on DOTALL, so that the dot also matches line terminators.

scala> val Idpattern = """.*id="([^"]*)"(?s).*""".r
Idpattern: scala.util.matching.Regex = .*id="([^"]*)"(?s).*

scala> str match { case Idpattern(id) => id }
res6: String = basarbre

Here's the better way to find the pattern in Scala:

scala> val Idpattern = """ id="([^"]*)" """.r.unanchored
Idpattern: scala.util.matching.UnanchoredRegex =  id="([^"]*)" 

scala> str match { case Idpattern(id) => id }
res7: String = basarbre
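
Note that the matches above have no fallback case, so an input without an id would throw a MatchError. A sketch that restores the question's "no id" branch is given below; the names are hypothetical, and \s replaces the literal spaces on the assumption that the attribute may be preceded by a newline or tab and may be the last attribute in the tag.

```scala
// Hypothetical wrapper around an unanchored pattern match.
object UnanchoredId {
  // unanchored makes the extractor search for the pattern anywhere in the
  // input instead of requiring the whole string to match.
  private val IdPattern = """\sid="([^"]*)"""".r.unanchored

  def idOrDefault(s: String): String = s match {
    case IdPattern(id) => id
    case _             => "no id"
  }
}
```

Because nothing in this pattern uses the dot, it does not need the (?s) flag to cope with input that spans several lines.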
som-snytt