How to extract data from a tag in Java?

Question

I am currently trying to extract the <dc:title> element from an EPUB in Java. I tried using

doc.getDocumentElement().getElementsByTagName("dc:title");

but it only printed 2nd element :com.sun.org.apache.xerces.internal.dom.DeepNodeListImpl. How can I extract the content of <dc:title>?

Here is my code:

File fXmlFile = new File("file directory");
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
Document doc = dBuilder.parse(fXmlFile);
doc.getDocumentElement().normalize();

System.out.println("1st element :" + doc.getElementsByTagName("dc"));
System.out.println("2nd element :" + doc.getDocumentElement().getElementsByTagName("dc:title"));

System output:

1st element : com.sun.org.apache.xerces.internal.dom.DeepNodeListImpl@4f53e9be
2nd element :com.sun.org.apache.xerces.internal.dom.DeepNodeListImpl@e16e1a2

Added Sample Data

<dc:title>
  <![CDATA[someData]]>
</dc:title>
<dc:creator>
  <![CDATA[someData]]>
</dc:creator>
<dc:language>someData</dc:language>

The `dc:` part is a namespace prefix. You should parse the XML document with namespace awareness. Example: https://stackoverflow.com/questions/11644994/parse-xml-with-namespaces-in-java-using-xpath — vanje, Jan 22 '18 at 10:40

Andre Albert · Accepted Answer · 2018-01-22T11:44:18.743

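The method getElementsByTagName(String) returns a list of matching elements (note the plural 's'). Printing that NodeList directly only shows its default toString() output, which is the DeepNodeListImpl@... text you are seeing. You need to pick the element you want, for example with .item(index), which gives you a Node instance. You can then call getNodeValue() on that Node object.

EDITED: getNodeValue() returns null for an element node, and the text here sits in a CDATA section, so use Node.getTextContent() instead: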

NodeList elems = doc.getElementsByTagName("dc:title");
Node item = elems.item(0);
System.out.println(item.getTextContent());
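
To make the snippet above easier to try, here is a minimal self-contained sketch. It assumes a small inline XML string (the `SAMPLE` constant, a hypothetical stand-in for the EPUB's OPF file, with the Dublin Core namespace URI assumed) and a hypothetical helper `textOf`. It loops over the whole NodeList instead of reading only item(0), and trims the whitespace that surrounds the CDATA section.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DcTitleDomSketch {

    // Hypothetical sample modelled on the question's data; not the asker's real file.
    static final String SAMPLE =
        "<package xmlns=\"http://www.idpf.org/2007/opf\">"
      + "<metadata xmlns:dc=\"http://purl.org/dc/elements/1.1/\">"
      + "<dc:title>\n  <![CDATA[someData]]>\n</dc:title>"
      + "<dc:language>someData</dc:language>"
      + "</metadata></package>";

    // Returns the trimmed text of every element whose qualified name equals tagName.
    static List<String> textOf(String xml, String tagName) {
        try {
            // Default factory: not namespace aware, so "dc:title" is matched as a plain name.
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));

            NodeList elems = doc.getElementsByTagName(tagName);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < elems.getLength(); i++) {
                // getTextContent() concatenates text and CDATA children of the element.
                result.add(elems.item(i).getTextContent().trim());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(textOf(SAMPLE, "dc:title")); // prints [someData]
    }
}
```

Checking getLength() (here via the loop bound) before calling item(i) avoids the NullPointerException that comes from dereferencing item(1) when only one element matches.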

edited Jan 22 '18 at 11:44

answered Jan 22 '18 at 08:23


I tested your method but it still doesn't work. For index 0 I get null, and for index 1 a java.lang.NullPointerException. I have added some sample data that I want to display to my post. – Drew Jan 22 '18 at 09:09
Your selectors look wrong. "dc" is the namespace prefix, not the tag name. Does it work using `doc.getDocumentElement().getElementsByTagNameNS("*", "title");` ([Namespaced tagnames](https://docs.oracle.com/javase/6/docs/api/org/w3c/dom/Element.html#getElementsByTagNameNS%28java.lang.String,%20java.lang.String%29))? Also, your example does not look well formed; you should add a common root element. – Andre Albert Jan 22 '18 at 09:21
I tried your suggestion. It still returns the same kind of result, like `2nd element :com.sun.org.apache.xerces.internal.dom.DeepNodeListImpl@7e0d8db1`. The example is just partial code. It has a root element, and the root element is something like this ` <![CDATA[someData]]> ` – Drew Jan 22 '18 at 09:44
But are you now able to call `.item(0).getNodeValue()` on the returned NodeList object? – Andre Albert Jan 22 '18 at 09:56
@Drew - OK, I forgot the CDATA. I edited my answer; maybe this helps. – Andre Albert Jan 22 '18 at 11:45
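
For the `getElementsByTagNameNS` route discussed in the comments, note that DocumentBuilderFactory is not namespace aware by default, so the parser has to be switched with setNamespaceAware(true) before namespace lookups can match. The following is a minimal sketch under assumptions: the class name, the helper `firstDcText`, and the inline sample are hypothetical, and the namespace URI is the usual Dublin Core one, which the asker's file is assumed (not confirmed) to declare for the `dc` prefix.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DcTitleNamespaceSketch {

    // Assumption: the dc prefix in the EPUB is bound to the Dublin Core namespace.
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    // Returns the trimmed text of the first Dublin Core element with the given
    // local name, or null when no such element exists.
    static String firstDcText(String xml, String localName) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for the *NS lookups below
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            // Match on namespace URI + local name, independent of the prefix used in the file.
            NodeList elems = doc.getElementsByTagNameNS(DC_NS, localName);
            if (elems.getLength() == 0) {
                return null;
            }
            return elems.item(0).getTextContent().trim();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample data, not the asker's real file.
        String xml =
            "<metadata xmlns:dc=\"http://purl.org/dc/elements/1.1/\">"
          + "<dc:title><![CDATA[someData]]></dc:title>"
          + "</metadata>";
        System.out.println(firstDcText(xml, "title")); // prints someData
    }
}
```

Because the lookup uses the namespace URI, it keeps working if a file binds the same namespace to a prefix other than `dc`.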

score -1 · Answer 2 · answered Jan 22 '18 at 08:40

I would suggest using XPath to get the desired output. See the following link for examples: https://www.journaldev.com/1194/java-xpath-example-tutorial For example:

XPath xPath = XPathFactory.newInstance().newXPath();
String expression = "//dc:title/text()";
NodeList nodes = (NodeList) xPath.compile(expression).evaluate(doc, XPathConstants.NODESET);
System.out.println(nodes.item(0).getNodeValue());
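
An expression with a prefix such as `dc:` needs the XPath object to know which namespace the prefix stands for, and the document should be parsed with namespace awareness. Below is a minimal sketch of that setup, not a definitive implementation: the class name, the helper `title`, and the inline sample are hypothetical, and the Dublin Core URI is assumed to be what the EPUB declares. It uses `normalize-space(//dc:title)` rather than `text()`, because the first text node of the element may be only the whitespace in front of the CDATA section.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class DcTitleXPathSketch {

    // Assumption: the dc prefix in the EPUB is bound to the Dublin Core namespace.
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    // Returns the whitespace-normalized text of the first dc:title, or "" if there is none.
    static String title(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            XPath xPath = XPathFactory.newInstance().newXPath();
            // Tell XPath what the "dc" prefix in the expression means.
            xPath.setNamespaceContext(new NamespaceContext() {
                @Override
                public String getNamespaceURI(String prefix) {
                    return "dc".equals(prefix) ? DC_NS : XMLConstants.NULL_NS_URI;
                }

                @Override
                public String getPrefix(String uri) {
                    return DC_NS.equals(uri) ? "dc" : null;
                }

                @Override
                public Iterator<String> getPrefixes(String uri) {
                    return DC_NS.equals(uri)
                            ? Collections.singletonList("dc").iterator()
                            : Collections.<String>emptyIterator();
                }
            });

            // Evaluating to a string takes the string value of the first matching node.
            return xPath.evaluate("normalize-space(//dc:title)", doc);
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample data, not the asker's real file.
        String xml =
            "<metadata xmlns:dc=\"http://purl.org/dc/elements/1.1/\">"
          + "<dc:title>\n  <![CDATA[someData]]>\n</dc:title>"
          + "</metadata>";
        System.out.println(title(xml)); // prints someData
    }
}
```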
