XPath was designed for exactly this. Java provides support for it in the javax.xml.xpath package.
To do what you want, the code will look something like this:
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.StringJoiner;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

List<String> findRelations(String word, Path xmlFile)
throws XPathException {
    // Each evaluate() call below re-parses the file from this location.
    String xmlLocation = xmlFile.toUri().toASCIIString();
    XPath xpath = XPathFactory.newInstance().newXPath();

    // Step 1: find the synset id for the word. Without a return type,
    // evaluate() yields the string value of the first matching node.
    xpath.setXPathVariableResolver(
        name -> (name.getLocalPart().equals("word") ? word : null));
    String id = xpath.evaluate(
        "//LexicalEntry[WordForm/@writtenForm=$word or Lemma/@writtenForm=$word]/Sense/@synset",
        new InputSource(xmlLocation));

    // Step 2: find every SynsetRelation element of that synset.
    xpath.setXPathVariableResolver(
        name -> (name.getLocalPart().equals("id") ? id : null));
    NodeList matches = (NodeList) xpath.evaluate(
        "//Synset[@id=$id]/SynsetRelations/SynsetRelation",
        new InputSource(xmlLocation),
        XPathConstants.NODESET);

    List<String> relations = new ArrayList<>();
    int matchCount = matches.getLength();
    for (int i = 0; i < matchCount; i++) {
        Element match = (Element) matches.item(i);
        String relType = match.getAttribute("relType");
        String synset = match.getAttribute("targets");

        // Step 3: collect the written forms of the target synset.
        xpath.setXPathVariableResolver(
            name -> (name.getLocalPart().equals("synset") ? synset : null));
        NodeList formNodes = (NodeList) xpath.evaluate(
            "//LexicalEntry[Sense/@synset=$synset]/WordForm/@writtenForm",
            new InputSource(xmlLocation),
            XPathConstants.NODESET);

        int formCount = formNodes.getLength();
        StringJoiner forms = new StringJoiner(",");
        for (int j = 0; j < formCount; j++) {
            forms.add(formNodes.item(j).getNodeValue());
        }

        relations.add(
            String.format("%s %s %s", word, relType, forms));
    }

    return relations;
}
Some basic XPath information:
- XPath uses a single file-path-like string to match parts of an XML document. The parts can be any structural part of the document: text, elements, attributes, even things like comments.
- A Java XPath expression can be evaluated to return a single matching part, all matching parts, or a String (the string value of the first matching part, in document order).
- In an XPath expression, a name by itself represents an element. For example, WordForm in XPath means any <WordForm …> element in the XML document.
- A name starting with @ represents an attribute. For example, @writtenForm refers to any writtenForm=… attribute in the XML document.
- A slash indicates a parent and child in an XML document. LexicalEntry/Lemma means any <Lemma> element which is a direct child of a <LexicalEntry> element. Synset/@id means the id=… attribute of any <Synset> element.
- Just as a path starting with / indicates an absolute (root-relative) path in Unix, an XPath starting with a slash indicates an expression relative to the root of an XML document.
- Two slashes mean a descendant, which may be a direct child, a grandchild, a great-grandchild, etc. Thus, //LexicalEntry means any LexicalEntry in the document; /LexicalEntry only matches a LexicalEntry element which is the root element.
- Square brackets indicate match qualifiers. Synset[@baseConcept='3'] matches any <Synset> element with a baseConcept attribute whose value is the string "3".
- XPath can refer to variables, which are defined externally, using Unix-shell-like $ substitutions, like $word. How those variables are passed to an XPath expression depends on the engine. Java uses the setXPathVariableResolver method. Variable names are in a completely separate namespace from node names, so it is of no consequence if a variable name is the same as an element name or attribute name in the XML document.
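As a rough illustration of the points above, the following sketch evaluates a few of these expressions against a tiny in-memory document. The sample XML, the class name XPathBasicsDemo, and the helper methods are made up for this example and are only assumptions about what the real file might look like.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class XPathBasicsDemo {
    // Hypothetical sample document, invented for this illustration.
    static final String XML =
        "<Lexicon>"
        + "<LexicalEntry><Lemma writtenForm='dog'/><WordForm writtenForm='dogs'/></LexicalEntry>"
        + "<LexicalEntry><Lemma writtenForm='cat'/><WordForm writtenForm='cats'/></LexicalEntry>"
        + "<Synset id='s1' baseConcept='3'/>"
        + "<Synset id='s2' baseConcept='1'/>"
        + "</Lexicon>";

    private static InputSource source() {
        return new InputSource(new StringReader(XML));
    }

    /** Counts the nodes an expression matches (NODESET evaluation). */
    static int countMatches(String expression) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            NodeList nodes = (NodeList) xpath.evaluate(
                expression, source(), XPathConstants.NODESET);
            return nodes.getLength();
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Evaluates an expression as a String, with $word bound to the given value. */
    static String firstValue(String expression, String word) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setXPathVariableResolver(
            name -> name.getLocalPart().equals("word") ? word : null);
        try {
            return xpath.evaluate(expression, source());
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(countMatches("//LexicalEntry"));             // 2
        System.out.println(countMatches("/LexicalEntry"));              // 0, the root is <Lexicon>
        System.out.println(countMatches("//Synset[@baseConcept='3']")); // 1
        System.out.println(firstValue("//Synset/@id", null));           // s1 (first match only)
        System.out.println(firstValue(
            "//LexicalEntry[Lemma/@writtenForm=$word]/WordForm/@writtenForm",
            "cat"));                                                    // cats
    }
}
```

Note that the String evaluation of //Synset/@id yields only the first id, even though two <Synset> elements match; asking for XPathConstants.NODESET is what returns all of them.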
So, the XPath expressions in the code mean:
//LexicalEntry[WordForm/@writtenForm=$word or Lemma/@writtenForm=$word]/Sense/@synset
Match any <LexicalEntry> element anywhere in the XML document which has either
- a WordForm child with a writtenForm attribute whose value is equal to the word variable, or
- a Lemma child with a writtenForm attribute whose value is equal to the word variable
and for every such <LexicalEntry> element, select the synset attribute of any <Sense> element which is a direct child of the <LexicalEntry> element. Because the code evaluates this expression as a String, only the value of the first such attribute is returned.
The word variable is defined externally, by an xpath.setXPathVariableResolver call, right before the XPath expression is evaluated.
//Synset[@id=$id]/SynsetRelations/SynsetRelation
Match any <Synset> element anywhere in the XML document whose id attribute is equal to the id variable. For each such <Synset> element, look for any direct SynsetRelations child element, and return each of its direct SynsetRelation children.
The id variable is defined externally, by an xpath.setXPathVariableResolver call, right before the XPath expression is evaluated.
//LexicalEntry[Sense/@synset=$synset]/WordForm/@writtenForm
Match any <LexicalEntry> element anywhere in the XML document which has a <Sense> child element which has a synset attribute whose value is identical to the synset variable. For each matched element, find any <WordForm> child element and return that element’s writtenForm attribute.
The synset variable is defined externally, by an xpath.setXPathVariableResolver call, right before the XPath expression is evaluated.
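Since each expression uses a different variable, the resolver gets replaced before every evaluation. One possible alternative, sketched here purely as an assumption about what might be convenient, is to install a single resolver backed by a map and update the map instead. The class name MapVariableResolverDemo, the lookUp helper, and the sample XML are all hypothetical.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

final class MapVariableResolverDemo {
    // Hypothetical sample document, invented for this illustration.
    static final String XML =
        "<Lexicon>"
        + "<LexicalEntry><Lemma writtenForm='dog'/><Sense synset='s1'/></LexicalEntry>"
        + "<Synset id='s1' baseConcept='3'/>"
        + "</Lexicon>";

    /** Returns { synset id, baseConcept } for a word; empty strings if nothing matches. */
    static String[] lookUp(String word) {
        Map<String, Object> variables = new HashMap<>();
        XPath xpath = XPathFactory.newInstance().newXPath();
        // One resolver for all expressions: $name is looked up in the map.
        xpath.setXPathVariableResolver(
            name -> variables.get(name.getLocalPart()));
        try {
            variables.put("word", word);
            String id = xpath.evaluate(
                "//LexicalEntry[Lemma/@writtenForm=$word]/Sense/@synset",
                new InputSource(new StringReader(XML)));

            variables.put("id", id);
            String baseConcept = xpath.evaluate(
                "//Synset[@id=$id]/@baseConcept",
                new InputSource(new StringReader(XML)));

            return new String[] { id, baseConcept };
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String[] result = lookUp("dog");
        System.out.println(result[0] + " " + result[1]); // s1 3
    }
}
```

Whether this reads better than three separate lambdas is a matter of taste; the behavior of the expressions themselves is the same either way.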
Logically, what the above should amount to is:
- Locate the synset value for the requested word.
- Use the synset value to locate SynsetRelation elements.
- Locate writtenForm values corresponding to the targets value of each matched SynsetRelation.
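The three steps above can be tried out on a small in-memory document. The sketch below is only an illustration: the sample XML is guessed from the element and attribute names used in the expressions, the class name RelationStepsDemo is hypothetical, and, unlike the code earlier in this answer, it parses the document once into a DOM tree and evaluates every expression against that tree.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class RelationStepsDemo {
    // Hypothetical sample document; the real file's layout is an assumption.
    static final String XML =
        "<Lexicon>"
        + "<LexicalEntry><Lemma writtenForm='dog'/><WordForm writtenForm='dogs'/>"
        + "<Sense synset='s1'/></LexicalEntry>"
        + "<LexicalEntry><Lemma writtenForm='animal'/><WordForm writtenForm='animals'/>"
        + "<Sense synset='s2'/></LexicalEntry>"
        + "<Synset id='s1'><SynsetRelations>"
        + "<SynsetRelation relType='hypernym' targets='s2'/>"
        + "</SynsetRelations></Synset>"
        + "<Synset id='s2'/>"
        + "</Lexicon>";

    static List<String> findRelations(String word) {
        try {
            // Parse once; every expression below runs against this DOM tree.
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));

            Map<String, Object> variables = new HashMap<>();
            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setXPathVariableResolver(
                name -> variables.get(name.getLocalPart()));

            // Step 1: locate the synset value for the requested word.
            variables.put("word", word);
            String id = xpath.evaluate(
                "//LexicalEntry[WordForm/@writtenForm=$word or Lemma/@writtenForm=$word]/Sense/@synset",
                doc);

            // Step 2: use the synset value to locate SynsetRelation elements.
            variables.put("id", id);
            NodeList matches = (NodeList) xpath.evaluate(
                "//Synset[@id=$id]/SynsetRelations/SynsetRelation",
                doc, XPathConstants.NODESET);

            List<String> relations = new ArrayList<>();
            for (int i = 0; i < matches.getLength(); i++) {
                Element match = (Element) matches.item(i);

                // Step 3: locate writtenForm values for the targets value.
                variables.put("synset", match.getAttribute("targets"));
                NodeList formNodes = (NodeList) xpath.evaluate(
                    "//LexicalEntry[Sense/@synset=$synset]/WordForm/@writtenForm",
                    doc, XPathConstants.NODESET);

                List<String> forms = new ArrayList<>();
                for (int j = 0; j < formNodes.getLength(); j++) {
                    forms.add(formNodes.item(j).getNodeValue());
                }
                relations.add(String.format("%s %s %s",
                    word, match.getAttribute("relType"), String.join(",", forms)));
            }
            return relations;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(findRelations("dog"));    // [dog hypernym animals]
        System.out.println(findRelations("animal")); // []
    }
}
```

Parsing once is likely to matter for a large lexicon file, because evaluating against an InputSource reads and parses the whole file on every call.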