I need to parse an XML document in which some elements contain SVG (Scalable Vector Graphics). SVG is itself XML, but I need to extract each SVG part as a whole string, without parsing its contents.

Example:

 ...
 <symbol>
   <svg> [arbitrary svg/xml content here ...] </svg>
 </symbol>
 ...

I'd like to parse the document and extract the strings between the symbol tags.

I'm not very familiar with the Java XML APIs. Which one would you recommend for this task: DOM, SAX, or StAX? A recipe would be appreciated. I understand the differences between them, so there is no need to explain the basics. None of them seems perfect for the task, though, since I need to obtain the XML as a string.

Scrontch
  • What have you tried already? I think the choice of XML API depends on the size of the XML being parsed. You could also try Jsoup, an HTML parser that can also parse XML and is easy to use. – Georgy Gobozov Nov 24 '13 at 23:50
  • I can't really understand why this was put on hold. The answer proposed below is the kind of answer that helps me a lot, so why inhibit further useful answers? I can see that answers may be opinion-based here, but that's exactly what I'd like to have: different, well-argued opinions on how to solve the task in my particular context. Note that I'm not asking for the 'best' XML API in general, but for the most appropriate one for the given task of extracting an XML sub-document, which isn't that trivial imho. – Scrontch Nov 29 '13 at 20:42

1 Answer

As @Georgy said, the choice between DOM, SAX and StAX depends on the size of your XML. Most of the time a DOM parser is the simplest option, and it works well for small to mid-sized XML documents. Suppose your document structure is:

<?xml version="1.0" encoding="UTF-8"?>
<rootElement>
    <someElement>
        <symbol>
            <svg>[arbitrary svg/xml content here ...]</svg>
        </symbol>
    </someElement>
</rootElement>

then you can query your document using the DOM and XPath APIs like this:

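import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.*;
import javax.xml.xpath.*;
import org.w3c.dom.*;

//    Parsing XML document
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
dbFactory.setIgnoringElementContentWhitespace(true);
dbFactory.setNamespaceAware(true);
DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
//    Encode explicitly as UTF-8 to match the encoding declared in the XML prolog
byte[] xmlDATA = yourXMLAsString.getBytes(StandardCharsets.UTF_8);
ByteArrayInputStream in = new ByteArrayInputStream(xmlDATA);
Document doc = dBuilder.parse(in);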

//    Accessing SVG element using XPath
XPathFactory factory = XPathFactory.newInstance();
XPath xpath = factory.newXPath();
String xpathQuery = "/rootElement/someElement/symbol/svg";
XPathExpression expr = xpath.compile(xpathQuery);
Node svgNode = (Node) expr.evaluate(doc, XPathConstants.NODE);

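If you want the text content of the svg element, you can use the getTextContent() method of the retrieved node. Note that this returns only the concatenated text nodes of the subtree, with the markup stripped: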

String svgContent = svgNode.getTextContent();
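
To recover the whole sub-document as a string instead, one common approach is to serialize the node with an identity Transformer from javax.xml.transform. The following is a minimal sketch, not a definitive implementation: the class name SvgSubtreeToString, the helper names nodeToString and extractSvg, and the sample XML are assumptions made for illustration, and the XPath matches the example structure above with svg in no namespace.

```java
import java.io.ByteArrayInputStream;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Hypothetical helper class, named for illustration only.
public class SvgSubtreeToString {

    // Serializes the subtree rooted at the given node back into an XML string.
    static String nodeToString(Node node) throws Exception {
        // A Transformer created without a stylesheet performs an identity copy.
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        // Leave out the <?xml ...?> declaration, since this is only a fragment.
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter writer = new StringWriter();
        transformer.transform(new DOMSource(node), new StreamResult(writer));
        return writer.toString();
    }

    // Parses the document, locates the svg element and returns it as a string,
    // or null when the XPath expression matches nothing.
    static String extractSvg(String xml) throws Exception {
        DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
        dbFactory.setNamespaceAware(true);
        DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
        Document doc = dBuilder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        Node svgNode = (Node) xpath.evaluate(
                "/rootElement/someElement/symbol/svg", doc, XPathConstants.NODE);
        return svgNode == null ? null : nodeToString(svgNode);
    }

    public static void main(String[] args) throws Exception {
        // Sample input, assumed for this sketch.
        String xml = "<rootElement><someElement><symbol>"
                + "<svg><circle r=\"5\"/></svg>"
                + "</symbol></someElement></rootElement>";
        System.out.println(extractSvg(xml));
    }
}
```

The serialized string is equivalent XML rather than a byte-for-byte copy of the input: attribute quoting, whitespace and empty-element syntax may be normalized by the serializer.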
zaerymoghaddam
  • Thanks, this is the kind of answer I was hoping for. I tried your solution, but unfortunately expr.evaluate returns a null Node and I am unable to figure out why. My code is a bit lengthy, so I won't post it here, but I'd be glad to send it to you for review. I don't know if that's possible, though, since I can't find your email address (which is probably intentional for security reasons). – Scrontch Nov 29 '13 at 20:37
  • Can you post your full XML document (or at least its schema)? The number one suspect is your XPath expression. It may refer to a path that does not exist in your document. – zaerymoghaddam Nov 30 '13 at 04:41
  • Probably a namespace issue: if the svg is in its normal `http://www.w3.org/2000/svg` namespace then you can't match it with a plain `svg` in an xpath expression, you need to provide a [namespace context](http://stackoverflow.com/questions/6390339/how-to-query-xml-using-namespaces-in-java-with-xpath/6392700#6392700) mapping the uri to a prefix, and use the prefix in your expressions. – Ian Roberts Nov 30 '13 at 08:57
  • If you don't want to write your own NamespaceContext implementation, [Spring has a simple one you can use](http://docs.spring.io/spring-framework/docs/3.2.0.M2/api/org/springframework/util/xml/SimpleNamespaceContext.html) – Ian Roberts Nov 30 '13 at 09:01
  • OK, so I set dbFactory.setNamespaceAware(false); to get rid of the namespace problems (I hope). Now expr.evaluate returns a Node, but getTextContent returns an empty string. I doubt that is the right function anyway, since what I want is not the text content (in the XML DOM sense) of the element (there is none), but the whole XML sub-document rooted at the node, as a string. I hope someone will understand. I'm currently browsing the docs at http://docs.oracle.com/javase/7/docs/api/org/w3c/dom/Node.html, but there doesn't seem to be a function like this. – Scrontch Nov 30 '13 at 15:33
  • As you said in your question, you "need to obtain the XML string", so you need to call getTextContent. If you'd like to access the content as a node, you can call getFirstChild, getLastChild or getChildNodes. – zaerymoghaddam Nov 30 '13 at 16:29
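
The namespace issue raised in the comments can be handled without turning namespace awareness off. The following is a hedged sketch of a hand-written NamespaceContext that maps a prefix to the SVG namespace URI so the XPath expression can address the element as svg:svg. The class names NamespacedSvgLookup and SvgNamespaceContext, the method findSvg, the choice of the prefix "svg" and the sample XML are all assumptions made for illustration; the element names follow the example structure in the answer.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Hypothetical class, named for illustration only.
public class NamespacedSvgLookup {

    static final String SVG_NS = "http://www.w3.org/2000/svg";

    // Minimal NamespaceContext that only knows the prefix "svg".
    static class SvgNamespaceContext implements NamespaceContext {
        @Override
        public String getNamespaceURI(String prefix) {
            return "svg".equals(prefix) ? SVG_NS : XMLConstants.NULL_NS_URI;
        }

        @Override
        public String getPrefix(String namespaceURI) {
            return SVG_NS.equals(namespaceURI) ? "svg" : null;
        }

        @Override
        public Iterator<String> getPrefixes(String namespaceURI) {
            String prefix = getPrefix(namespaceURI);
            return prefix == null
                    ? Collections.<String>emptyList().iterator()
                    : Collections.singletonList(prefix).iterator();
        }
    }

    // Evaluates the given XPath expression against a namespace-aware parse of the XML.
    static Node findSvg(String xml, String expression) throws Exception {
        DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
        dbFactory.setNamespaceAware(true);
        Document doc = dbFactory.newDocumentBuilder().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setNamespaceContext(new SvgNamespaceContext());
        return (Node) xpath.evaluate(expression, doc, XPathConstants.NODE);
    }

    public static void main(String[] args) throws Exception {
        // Sample input, assumed for this sketch: svg declares the SVG default namespace.
        String xml = "<rootElement><someElement><symbol>"
                + "<svg xmlns=\"" + SVG_NS + "\"><circle r=\"5\"/></svg>"
                + "</symbol></someElement></rootElement>";

        // The unprefixed step looks for an svg element in no namespace and finds none.
        System.out.println(findSvg(xml, "/rootElement/someElement/symbol/svg"));
        // The prefixed step is resolved through SvgNamespaceContext.
        System.out.println(findSvg(xml, "/rootElement/someElement/symbol/svg:svg").getNodeName());
    }
}
```

The prefix used in the XPath expression does not have to match any prefix in the document; only the namespace URI it maps to matters.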