
My problem is virtually identical to this previous StackOverflow question, but I'm re-asking it because I don't like the accepted answer.

I've got a file of concatenated XML documents:

<?xml version="1.0" encoding="UTF-8"?>
<someData>...</someData>
<?xml version="1.0" encoding="UTF-8"?>
<someData>...</someData>
...
<?xml version="1.0" encoding="UTF-8"?>
<someData>...</someData>

I'd like to parse out each one.

As far as I can tell, I can't use scala.xml.XML, since that depends on the one document per file/string model.

Is there a subclass of Parser I can use for parsing XML documents from an input source? Because then I could just do something like many1 xmldoc or some such.

rampion
  • This question is a duplicate unless you explain _why_ you don't like the other answers. Stating that there is not a parser of the type you suggested is not enough IMO for a full question/answer. – Rex Kerr Apr 17 '12 at 19:01
  • @RexKerr: Fair point. I find the accepted answer there unacceptable as "breaking on ` – rampion Apr 17 '12 at 20:59

2 Answers


If your concern is safety, you can wrap your chunks with unique tags:

def mkTag = "block" + util.Random.alphanumeric.take(20).mkString
// a line that is exactly an XML declaration marks the start of a new document
def isDecl(s: String) = s.startsWith("<?xml") && s.endsWith("?>")
val reader = io.Source.fromFile("my.xml")
def mkChunk(it: Iterator[String], chunks: Vector[String] = Vector.empty): Vector[String] = {
  // skip the declaration line(s), then collect everything up to the next declaration
  val (chunk, extra) = it.dropWhile(isDecl).span(s => !isDecl(s))
  val body = chunk.mkString("\n")
  val tag = mkTag
  // wrap each non-empty chunk in its own randomly named tag
  val next = if (body.isEmpty) chunks else chunks :+ ("<" + tag + ">" + body + "</" + tag + ">")
  if (extra.hasNext) mkChunk(extra, next) else next
}
val chunks = mkChunk(reader.getLines())
reader.close()
val answers = xml.XML.loadString("<everything>" + chunks.mkString + "</everything>")
// Now take apart the resulting parse

Because each chunk gets its own unique enclosing tag, you may still get a parse error if someone has embedded a literal <?xml ... ?> line in the middle of a document, but you won't accidentally end up with the wrong number of parsed documents.

(Warning: code typed but not checked at all--it's to give the idea, not exactly correct behavior.)
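
The "take apart the resulting parse" step could look roughly like the sketch below. It is an illustration only: it uses the JDK's DOM parser rather than scala.xml so that it runs without extra dependencies, and the names WrapSketch and rootsOf are made up for this example. It assumes the chunks have already had their XML declarations stripped.

```scala
import java.io.StringReader
import javax.xml.parsers.DocumentBuilderFactory
import org.w3c.dom.Element
import org.xml.sax.InputSource
import scala.util.Random

object WrapSketch {
  // Hypothetical helper: a fresh, randomly named tag for each chunk.
  private def mkTag: String = "block" + Random.alphanumeric.take(20).mkString

  // Hypothetical helper: wraps every chunk in its own tag, parses the combined
  // text as a single document, and returns the first element inside each wrapper.
  def rootsOf(chunks: Seq[String]): Vector[Element] = {
    val wrapped = chunks.map { c =>
      val tag = mkTag
      "<" + tag + ">" + c + "</" + tag + ">"
    }.mkString("<everything>", "", "</everything>")

    val builder = DocumentBuilderFactory.newInstance().newDocumentBuilder()
    val doc = builder.parse(new InputSource(new StringReader(wrapped)))

    // Every child of <everything> is one wrapper; its first element child
    // is the root element of the original document.
    val blocks = doc.getDocumentElement.getChildNodes
    (0 until blocks.getLength).toVector.flatMap { i =>
      val kids = blocks.item(i).getChildNodes
      (0 until kids.getLength)
        .map(j => kids.item(j))
        .collectFirst { case e: Element => e }
    }
  }
}
```

Calling WrapSketch.rootsOf(Seq("<a>1</a>", "<b><c/></b>")) yields two elements, one per wrapper, so the number of results always matches the number of chunks that were wrapped.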

Rex Kerr

OK, I came up with an answer I'm happier with.

Basically I try to parse the XML using a SAXParser, just like scala.xml.XML.load does, but watch for SAXParseExceptions that indicate that the parser encountered a <?xml in the wrong place.
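
As a minimal sketch of just that detection step (not the full solution), the snippet below feeds concatenated input to the JDK's SAX parser and reports where it gave up. DetectSketch and firstFailure are hypothetical names, and the sketch assumes the JDK's bundled parser, which raises a SAXParseException when it meets a second <?xml after the root element has closed.

```scala
import java.io.StringReader
import javax.xml.parsers.SAXParserFactory
import org.xml.sax.{InputSource, SAXParseException}
import org.xml.sax.helpers.DefaultHandler

object DetectSketch {
  // Hypothetical helper: returns the (line, column) reported by the parser
  // when it fails, or None if the input parsed as a single document.
  def firstFailure(input: String): Option[(Int, Int)] = {
    val parser = SAXParserFactory.newInstance().newSAXParser()
    try {
      // DefaultHandler ignores all content; only the failure position matters here.
      parser.parse(new InputSource(new StringReader(input)), new DefaultHandler)
      None
    } catch {
      case e: SAXParseException => Some((e.getLineNumber, e.getColumnNumber))
    }
  }
}
```

For an input whose second declaration sits on line 3, the reported line number is 3; the column depends on how much of the declaration the parser had consumed before it stopped.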

Then, I grab whatever root element has been parsed already, rewind the input just enough, and restart the parse from there.
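
The rewinding itself rests on the mark/reset/skip methods of java.io.Reader. Here is a small sketch of that mechanism on its own, under the assumption that the underlying reader supports mark (StringReader and BufferedReader do); RewindSketch and rereadFrom are hypothetical names used only for illustration.

```scala
import java.io.{Reader, StringReader}

object RewindSketch {
  // Hypothetical helper: lets a "parser" consume `consumed` chars, then rewinds
  // to `restart` chars past the mark and returns everything from there on.
  def rereadFrom(re: Reader, consumed: Int, restart: Int): String = {
    re.mark(consumed)               // remember the current position
    val buf = new Array[Char](consumed)
    re.read(buf, 0, consumed)       // stands in for the parser reading ahead
    re.reset()                      // jump back to the mark
    re.skip(restart.toLong)         // advance to where the next document starts

    // Read whatever is left so the caller can see where the reader ended up.
    val out = new StringBuilder
    var c = re.read()
    while (c != -1) { out.append(c.toChar); c = re.read() }
    out.toString
  }
}
```

The Relocator class in the full code applies the same idea, but it also tracks line starts so that the line and column from the exception can be turned into a skip distance.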

// An input stream that can recover from a SAXParseException 
object ConcatenatedXML {
  // A reader that can be rolled back to the location of an exception
  class Relocator(val re : java.io.Reader)  extends java.io.Reader {
    var marked = 0
    var firstLine : Int = 1
    var lineStarts : IndexedSeq[Int] = Vector(0)
    override def read(arr : Array[Char], off : Int, len : Int) = { 
      // forget everything but the start of the last line in the
      // previously marked area
      val pos = lineStarts(lineStarts.length - 1) - marked
      firstLine += lineStarts.length - 1

      // read the next chunk of data into the given array
      re.mark(len)
      marked = re.read(arr,off,len)

      // find the line starts for the lines in the array
      lineStarts = pos +: (for (i <- 0 until marked if arr(i+off) == '\n') yield (i+1))

      marked
    }
    override def close(): Unit = re.close()
    override def markSupported = false
    def relocate(line : Int, col : Int , off : Int) : Unit = {
      re.reset
      val skip = lineStarts( line - firstLine ) + col + off
      re.skip(skip)
      marked = 0
      firstLine = 1
      lineStarts = Vector(0)
    }
  }

  def parse( str : String ) : List[scala.xml.Node] = parse(new java.io.StringReader(str))
  def parse( re : java.io.Reader ) : List[scala.xml.Node] = parse(new Relocator(re))

  // parse all the concatenated XML docs out of a file
  def parse( src : Relocator ) : List[scala.xml.Node] = {
    val parser = javax.xml.parsers.SAXParserFactory.newInstance.newSAXParser
    val adapter = new scala.xml.parsing.NoBindingFactoryAdapter

    adapter.scopeStack.push(scala.xml.TopScope)
    try {

      // parse this, assuming it's the last XML doc in the string
      parser.parse( new org.xml.sax.InputSource(src), adapter )
      adapter.scopeStack.pop
      adapter.rootElem.asInstanceOf[scala.xml.Node] :: Nil

    } catch {
      case (e : org.xml.sax.SAXParseException) => {
        // we found the start of another xmldoc
        if (e.getMessage != """The processing instruction target matching "[xX][mM][lL]" is not allowed."""
            || adapter.hStack.length != 1 || adapter.hStack(0) == null){
          throw(e)
        }

        // tell the adapter we reached the end of a document
        adapter.endDocument

        // grab the current root node
        adapter.scopeStack.pop
        val node = adapter.rootElem.asInstanceOf[scala.xml.Node]

        // rewind to the '<' of the "<?xml" that raised the exception,
        // i.e. the start of the next doc (the -6 backs up over "<?xml")
        src.relocate(e.getLineNumber, e.getColumnNumber, -6)

        // and parse the next doc
        node :: parse( src )
      }
    }
  }
}

println(ConcatenatedXML.parse(new java.io.BufferedReader(
  new java.io.FileReader("temp.xml")
)))
println(ConcatenatedXML.parse(
  """|<?xml version="1.0" encoding="UTF-8"?>
     |<firstDoc><inner><innerer><innermost></innermost></innerer></inner></firstDoc>
     |<?xml version="1.0" encoding="UTF-8"?>
     |<secondDoc></secondDoc>
     |<?xml version="1.0" encoding="UTF-8"?>
     |<thirdDoc>...</thirdDoc>
     |<?xml version="1.0" encoding="UTF-8"?>
     |<lastDoc>...</lastDoc>""".stripMargin
))
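val rejected =
  try {
    ConcatenatedXML.parse(
      """|<?xml version="1.0" encoding="UTF-8"?>
         |<firstDoc>
         |<?xml version="1.0" encoding="UTF-8"?>
         |</firstDoc>""".stripMargin
    )
    false
  } catch {
    // only a parse failure counts; a bare `case _` would also swallow
    // the "should have failed" exception and hide a broken parser
    case _ : org.xml.sax.SAXParseException => true
  }
if (rejected) println("catches really incomplete docs")
else throw new Exception("That should have failed")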
rampion