1

I am trying to extract the <dc:title> element from an EPUB file in Java. I tried using

doc.getDocumentElement().getElementsByTagName("dc:title");

but it only printed 2nd element :com.sun.org.apache.xerces.internal.dom.DeepNodeListImpl. How can I extract the text of <dc:title>?

Here is my code:

File fXmlFile = new File("file directory");
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
Document doc = dBuilder.parse(fXmlFile);
doc.getDocumentElement().normalize();

System.out.println("1st element :" + doc.getElementsByTagName("dc"));
System.out.println("2nd element :" + doc.getDocumentElement().getElementsByTagName("dc:title"));

System output:

1st element : com.sun.org.apache.xerces.internal.dom.DeepNodeListImpl@4f53e9be
2nd element :com.sun.org.apache.xerces.internal.dom.DeepNodeListImpl@e16e1a2

Added Sample Data

<dc:title>
  <![CDATA[someData]]>
</dc:title>
<dc:creator>
  <![CDATA[someData]]>
</dc:creator>
<dc:language>someData</dc:language>
Drew
  • The `dc:` part is a namespace prefix. You should parse the XML document with namespace awareness. Example: https://stackoverflow.com/questions/11644994/parse-xml-with-namespaces-in-java-using-xpath – vanje Jan 22 '18 at 10:40

2 Answers

0

The method getElementsByTagName(String) returns a NodeList of all matching elements (note the plural "Elements"), which is why printing it only shows the list object. You need to pick the element you want, for example with .item(index), which gives you a Node instance. You can then call getNodeValue() on that Node, but note that for an Element node getNodeValue() returns null.

EDITED: because the text is wrapped in a CDATA section, use Node.getTextContent() instead:

NodeList elems = doc.getElementsByTagName("dc:title");
Node item = elems.item(0);
System.out.println(item.getTextContent());
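
Here is a slightly fuller sketch of the same idea that also follows the namespace-awareness advice from the comments. It assumes the `dc` prefix is bound to the Dublin Core namespace `http://purl.org/dc/elements/1.1/`, as is usual in EPUB package files. The class name `DcTitleDom`, the helper methods, and the inline sample XML with its `metadata` root are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

class DcTitleDom {
    // Assumption: "dc" is bound to the Dublin Core namespace in the parsed file.
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    // Parses an XML string with namespace awareness switched on (it is off by default).
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        return builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    // Returns the trimmed text of the first dc:title element, or null if there is none.
    static String firstTitle(Document doc) {
        NodeList titles = doc.getElementsByTagNameNS(DC_NS, "title");
        if (titles.getLength() == 0) {
            return null;
        }
        // getTextContent() concatenates text and CDATA children of the element.
        return titles.item(0).getTextContent().trim();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample document modelled on the data in the question.
        String xml = "<metadata xmlns:dc=\"http://purl.org/dc/elements/1.1/\">"
                + "<dc:title>\n  <![CDATA[someData]]>\n</dc:title>"
                + "<dc:language>someData</dc:language>"
                + "</metadata>";
        System.out.println(firstTitle(parse(xml)));
    }
}
```

The length check avoids the NullPointerException you get when item(0) is called on an empty list, and trim() drops the line breaks and indentation around the CDATA section.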
Andre Albert
  • I tested your method but it still doesn't work. For index(0) = NULL , index(1) = java.lang.NullPointerException . I have added some sample data that I want to display to my post. – Drew Jan 22 '18 at 09:09
  • Your selectors look wrong. "dc" is the namespace prefix, not the tag name. Does it work using `doc.getDocumentElement().getElementsByTagNameNS("*", "title");`? See [Namespaced tagnames](https://docs.oracle.com/javase/6/docs/api/org/w3c/dom/Element.html#getElementsByTagNameNS%28java.lang.String,%20java.lang.String%29). Also, your example is not well formed; you should add a common root element. – Andre Albert Jan 22 '18 at 09:21
  • I tried your suggestion. It still returns the same kind of result, `2nd element :com.sun.org.apache.xerces.internal.dom.DeepNodeListImpl@7e0d8db1`. The example is only a partial snippet. It has a root element, and the root element is something like this ` <![CDATA[someData]]> ` – Drew Jan 22 '18 at 09:44
  • But are you now able to call `.item(0).getNodeValue()` on the returned NodeList object? – Andre Albert Jan 22 '18 at 09:56
  • @Drew - OK, I forgot the CDATA. I edited my answer, maybe this helps. – Andre Albert Jan 22 '18 at 11:45
-1

I would suggest using XPath to get the desired output. See the following link for more examples: https://www.journaldev.com/1194/java-xpath-example-tutorial For example:

XPath xPath = XPathFactory.newInstance().newXPath();
String expression = "//dc:title/text()";
NodeList nodes = (NodeList) xPath.compile(expression).evaluate(doc, XPathConstants.NODESET);
System.out.println(nodes.item(0).getNodeValue());
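
Note that an XPath expression with the `dc:` prefix normally needs a NamespaceContext so the prefix can be resolved, and the document has to be parsed with namespace awareness. Selecting `text()` can also return the whitespace before a CDATA section as the first node. The following is a minimal sketch of one way to handle both, assuming `dc` maps to the Dublin Core namespace `http://purl.org/dc/elements/1.1/`. The class name `DcTitleXPath`, the helpers, and the sample XML are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

class DcTitleXPath {
    // Assumption: "dc" is bound to the Dublin Core namespace in the parsed file.
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    // Minimal namespace context that only knows the "dc" prefix.
    static final NamespaceContext DC_CONTEXT = new NamespaceContext() {
        @Override
        public String getNamespaceURI(String prefix) {
            return "dc".equals(prefix) ? DC_NS : XMLConstants.NULL_NS_URI;
        }

        @Override
        public String getPrefix(String namespaceURI) {
            return DC_NS.equals(namespaceURI) ? "dc" : null;
        }

        @Override
        public Iterator<String> getPrefixes(String namespaceURI) {
            return DC_NS.equals(namespaceURI)
                    ? Collections.singletonList("dc").iterator()
                    : Collections.<String>emptyIterator();
        }
    };

    // Parses an XML string with namespace awareness switched on.
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    // Returns the trimmed string value of the first dc:title, or "" if none exists.
    static String firstTitle(Document doc) throws Exception {
        XPath xPath = XPathFactory.newInstance().newXPath();
        xPath.setNamespaceContext(DC_CONTEXT);
        // string(...) takes the first matching element and joins its text and CDATA.
        return xPath.evaluate("string(//dc:title)", doc).trim();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample document modelled on the data in the question.
        String xml = "<metadata xmlns:dc=\"http://purl.org/dc/elements/1.1/\">"
                + "<dc:title>\n  <![CDATA[someData]]>\n</dc:title>"
                + "</metadata>";
        System.out.println(firstTitle(parse(xml)));
    }
}
```

Using `string(//dc:title)` instead of `//dc:title/text()` avoids picking up the leading whitespace node and returns an empty string rather than failing when no title is present.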
akshaya pandey