
I have a large XML file and below is an extract from it:

...
<LexicalEntry id="Ait~ifAq_1">
  <Lemma partOfSpeech="n" writtenForm="اِتِّفاق"/>
  <Sense id="Ait~ifAq_1_tawaAfuq_n1AR" synset="tawaAfuq_n1AR"/>
  <WordForm formType="root" writtenForm="وفق"/>
</LexicalEntry>
<LexicalEntry id="tawaA&amp;um__1">
  <Lemma partOfSpeech="n" writtenForm="تَوَاؤُم"/>
  <Sense id="tawaA&amp;um__1_AinosijaAm_n1AR" synset="AinosijaAm_n1AR"/>
  <WordForm formType="root" writtenForm="وأم"/>
</LexicalEntry>    
<LexicalEntry id="tanaAgum_2">
  <Lemma partOfSpeech="n" writtenForm="تناغُم"/>
  <Sense id="tanaAgum_2_AinosijaAm_n1AR" synset="AinosijaAm_n1AR"/>
  <WordForm formType="root" writtenForm="نغم"/>
</LexicalEntry>


<Synset baseConcept="3" id="tawaAfuq_n1AR">
  <SynsetRelations>
    <SynsetRelation relType="hyponym" targets="AinosijaAm_n1AR"/>
    <SynsetRelation relType="hyponym" targets="AinosijaAm_n1AR"/>
    <SynsetRelation relType="hypernym" targets="ext_noun_NP_420"/>
  </SynsetRelations>
  <MonolingualExternalRefs>
    <MonolingualExternalRef externalReference="13971065-n" externalSystem="PWN30"/>
  </MonolingualExternalRefs>
</Synset>
...

I want to extract specific information from it. For a given writtenForm, whether from <Lemma> or <WordForm>, the program takes the synset value from the <Sense> of the same <LexicalEntry> and finds every <Synset> whose id equals that value. It then lists all the relations of that Synset: for each relation it displays the relType, goes back to the <LexicalEntry> elements, finds those whose <Sense> has a synset equal to the relation's targets value, and displays their writtenForm.

I think it's a little bit complicated but the result should be like this:

اِتِّفاق hyponym تَوَاؤُم, اِنْسِجام

One possible solution is to use a stream reader, because of memory consumption, but I don't know how to proceed to get what I want. Please help.

bttX

3 Answers


The SAX parser differs from a DOM parser: it looks only at the current item and cannot see later items until they become the current item. Because it never loads the whole document, it suits extremely large XML files. It is one of several parsers available. To name a few:

  • SAX PARSER
  • DOM PARSER
  • JDOM PARSER
  • DOM4J PARSER
  • STAX PARSER

You can find tutorials for all of them here.

In my opinion, after learning it, go straight to DOM4J or JDOM for a commercial product.

The logic of the SAX parser is that you have a handler class (UserHandler below) that extends DefaultHandler and overrides some of its methods:

XML FILE:

<?xml version="1.0"?>
<class>
   <student rollno="393">
      <firstname>dinkar</firstname>
      <lastname>kad</lastname>
      <nickname>dinkar</nickname>
      <marks>85</marks>
   </student>
   <student rollno="493">
      <firstname>Vaneet</firstname>
      <lastname>Gupta</lastname>
      <nickname>vinni</nickname>
      <marks>95</marks>
   </student>
   <student rollno="593">
      <firstname>jasvir</firstname>
      <lastname>singn</lastname>
      <nickname>jazz</nickname>
      <marks>90</marks>
   </student>
</class>

Handler class:

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class UserHandler extends DefaultHandler {

   boolean bFirstName = false;
   boolean bLastName = false;
   boolean bNickName = false;
   boolean bMarks = false;

   @Override
   public void startElement(String uri, 
   String localName, String qName, Attributes attributes)
      throws SAXException {
      if (qName.equalsIgnoreCase("student")) {
         String rollNo = attributes.getValue("rollno");
         System.out.println("Roll No : " + rollNo);
      } else if (qName.equalsIgnoreCase("firstname")) {
         bFirstName = true;
      } else if (qName.equalsIgnoreCase("lastname")) {
         bLastName = true;
      } else if (qName.equalsIgnoreCase("nickname")) {
         bNickName = true;
      }
      else if (qName.equalsIgnoreCase("marks")) {
         bMarks = true;
      }
   }

   @Override
   public void endElement(String uri, 
   String localName, String qName) throws SAXException {
      if (qName.equalsIgnoreCase("student")) {
         System.out.println("End Element :" + qName);
      }
   }

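   // Note: a parser may deliver one text node in several characters() calls,
   // so this prints only the first chunk of each element's text.
   @Override
   public void characters(char[] ch, 
      int start, int length) throws SAXException {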
      if (bFirstName) {
         System.out.println("First Name: " 
            + new String(ch, start, length));
         bFirstName = false;
      } else if (bLastName) {
         System.out.println("Last Name: " 
            + new String(ch, start, length));
         bLastName = false;
      } else if (bNickName) {
         System.out.println("Nick Name: " 
            + new String(ch, start, length));
         bNickName = false;
      } else if (bMarks) {
         System.out.println("Marks: " 
            + new String(ch, start, length));
         bMarks = false;
      }
   }
}

Main Class :

import java.io.File;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

public class SAXParserDemo {
   public static void main(String[] args){

      try { 
         File inputFile = new File("input.txt");
         SAXParserFactory factory = SAXParserFactory.newInstance();
         SAXParser saxParser = factory.newSAXParser();
         UserHandler userhandler = new UserHandler();
         saxParser.parse(inputFile, userhandler);     
      } catch (Exception e) {
         e.printStackTrace();
      }
   }   
}
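
In the question's XML the data sits in attributes rather than in element text, so most of the work would happen in startElement. The following is only a sketch under assumptions, not a definitive implementation: it assumes a hypothetical <Lexicon> root element, assumes <Lemma> comes before <Sense> inside each <LexicalEntry> (as in the extract), looks words up by lemma only, and uses Latin placeholder written forms instead of the real data. The class name LexiconHandler and its methods are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects only the attributes needed for the lookup.
class LexiconHandler extends DefaultHandler {
    final Map<String, Set<String>> synsetsByLemma = new HashMap<>();
    final Map<String, Set<String>> lemmasBySynset = new HashMap<>();
    // synset id -> (relType -> target synset ids); sets drop repeated relations
    final Map<String, Map<String, Set<String>>> relationsBySynset = new HashMap<>();

    private String currentLemma;   // lemma of the LexicalEntry being read
    private String currentSynset;  // id of the Synset being read

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        switch (qName) {
            case "Lemma":
                currentLemma = attrs.getValue("writtenForm");
                break;
            case "Sense":
                if (currentLemma != null) {
                    String synset = attrs.getValue("synset");
                    synsetsByLemma.computeIfAbsent(currentLemma, k -> new LinkedHashSet<>()).add(synset);
                    lemmasBySynset.computeIfAbsent(synset, k -> new LinkedHashSet<>()).add(currentLemma);
                }
                break;
            case "Synset":
                currentSynset = attrs.getValue("id");
                break;
            case "SynsetRelation":
                if (currentSynset != null) {
                    relationsBySynset
                        .computeIfAbsent(currentSynset, k -> new LinkedHashMap<>())
                        .computeIfAbsent(attrs.getValue("relType"), k -> new LinkedHashSet<>())
                        .add(attrs.getValue("targets"));
                }
                break;
            default:
                break;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals("LexicalEntry")) currentLemma = null;
        else if (qName.equals("Synset")) currentSynset = null;
    }

    // Builds one "lemma relType word, word" line per relation type.
    List<String> relationsOf(String lemma) {
        List<String> lines = new ArrayList<>();
        for (String synset : synsetsByLemma.getOrDefault(lemma, Collections.emptySet())) {
            relationsBySynset.getOrDefault(synset, Collections.emptyMap()).forEach((relType, targets) -> {
                Set<String> words = new LinkedHashSet<>();
                for (String target : targets) {
                    words.addAll(lemmasBySynset.getOrDefault(target, Collections.emptySet()));
                }
                lines.add(lemma + " " + relType + " " + String.join(", ", words));
            });
        }
        return lines;
    }

    // Parses an XML string; a real file would be passed to parse(File, handler) instead.
    static LexiconHandler parse(String xml) {
        try {
            LexiconHandler handler = new LexiconHandler();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
            return handler;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<Lexicon>"
            + "<LexicalEntry id='e1'><Lemma writtenForm='agreement'/><Sense synset='s1'/></LexicalEntry>"
            + "<LexicalEntry id='e2'><Lemma writtenForm='concord'/><Sense synset='s2'/></LexicalEntry>"
            + "<Synset id='s1'><SynsetRelations>"
            + "<SynsetRelation relType='hyponym' targets='s2'/>"
            + "</SynsetRelations></Synset></Lexicon>";
        parse(xml).relationsOf("agreement").forEach(System.out::println);
    }
}
```

The maps hold only the strings needed for the lookup, which is usually far less than a full DOM tree, but they still grow with the size of the lexicon. A relation whose target has no entry in the file produces a line with no words after the relType.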
GOXR3PLUS
  • Thank you. I think it is clear now. So if I want to extract information from my file, I should adapt the `startElement` method to my needs? – bttX Dec 22 '16 at 16:44
  • @bttX You have to adapt all the methods in the handler; each one has its own purpose. You can detect everything from `@properties` and `comments` to every single `Element`. I recommend that you watch the tutorials first (you need them). – GOXR3PLUS Dec 22 '16 at 16:54
  • Thank you. Are there any tutorials you would advise me to begin with? – bttX Dec 22 '16 at 20:19

XPath was designed for exactly this. Java provides support for it in the javax.xml.xpath package.

To do what you want, the code will look something like this:

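import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.StringJoiner;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

List<String> findRelations(String word,
                           Path xmlFile)
throws XPathException {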

    String xmlLocation = xmlFile.toUri().toASCIIString();

    XPath xpath = XPathFactory.newInstance().newXPath();

    xpath.setXPathVariableResolver(
        name -> (name.getLocalPart().equals("word") ? word : null));
    String id = xpath.evaluate(
        "//LexicalEntry[WordForm/@writtenForm=$word or Lemma/@writtenForm=$word]/Sense/@synset",
        new InputSource(xmlLocation));

    xpath.setXPathVariableResolver(
        name -> (name.getLocalPart().equals("id") ? id : null));
    NodeList matches = (NodeList) xpath.evaluate(
        "//Synset[@id=$id]/SynsetRelations/SynsetRelation",
        new InputSource(xmlLocation),
        XPathConstants.NODESET);

    List<String> relations = new ArrayList<>();

    int matchCount = matches.getLength();
    for (int i = 0; i < matchCount; i++) {
        Element match = (Element) matches.item(i);

        String relType = match.getAttribute("relType");
        String synset = match.getAttribute("targets");

        xpath.setXPathVariableResolver(
            name -> (name.getLocalPart().equals("synset") ? synset : null));
        NodeList formNodes = (NodeList) xpath.evaluate(
            "//LexicalEntry[Sense/@synset=$synset]/WordForm/@writtenForm",
            new InputSource(xmlLocation),
            XPathConstants.NODESET);

        int formCount = formNodes.getLength();
        StringJoiner forms = new StringJoiner(",");
        for (int j = 0; j < formCount; j++) {
            forms.add(
                formNodes.item(j).getNodeValue());
        }

        relations.add(
            String.format("%s %s %s", word, relType, forms));
    }

    return relations;
}

Some basic XPath information:

  • XPath uses a single file-path-like string to match parts of an XML document. The parts can be any structural part of the document: text, elements, attributes, even things like comments.
  • A Java XPath expression can attempt to match exactly one part, or several parts, or can even concatenate all matched parts as a String.
  • In an XPath expression, a name by itself represents an element. For example, WordForm in XPath means any <WordForm …> element in the XML document.
  • A name starting with @ represents an attribute. For example, @writtenForm refers to any writtenForm=… attribute in the XML document.
  • A slash indicates a parent and child in an XML document. LexicalEntry/Lemma means any <Lemma> element which is a direct child of a <LexicalEntry> element. Synset/@id means the id=… attribute of any <Synset> element.
  • Just as a path starting with / indicates an absolute (root-relative) path in Unix, an XPath starting with a slash indicates an expression relative to the root of an XML document.
  • Two slashes means a descendant which may be a direct child, a grandchild, a great-grandchild, etc. Thus, //LexicalEntry means any LexicalEntry in the document; /LexicalEntry only matches a LexicalEntry element which is the root element.
  • Square brackets indicate match qualifiers. Synset[@baseConcept='3'] matches any <Synset> element with a baseConcept attribute whose value is the string "3".
  • XPath can refer to variables, which are defined externally, using Unix-shell-like $ substitutions, like $word. How those variables are passed to an XPath expression depends on the engine. Java uses the setXPathVariableResolver method. Variable names are in a completely separate namespace from node names, so it is of no consequence if a variable name is the same as an element name or attribute name in the XML document.
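
To make the points above concrete, here is a minimal sketch that evaluates a few of these expressions against a tiny stand-in document. The <Lexicon> root, the placeholder attribute values, and the class and method names are assumptions for illustration only; they are not taken from the real file.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathBasics {
    // Tiny stand-in document; root element and values are made up.
    static final String XML =
        "<Lexicon>"
      + "<LexicalEntry id='e1'><Lemma writtenForm='alpha'/><Sense synset='s1'/></LexicalEntry>"
      + "<LexicalEntry id='e2'><Lemma writtenForm='beta'/><Sense synset='s2'/></LexicalEntry>"
      + "<Synset baseConcept='3' id='s1'/>"
      + "</Lexicon>";

    // Creates an XPath whose $word variable resolves to the given value.
    private static XPath newXPath(String word) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setXPathVariableResolver(
            name -> name.getLocalPart().equals("word") ? word : null);
        return xpath;
    }

    // Returns the string value of the first match ("" if nothing matches).
    static String evalString(String expr, String word) {
        try {
            return newXPath(word).evaluate(expr, new InputSource(new StringReader(XML)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Returns how many nodes the expression matches.
    static int count(String expr) {
        try {
            NodeList nodes = (NodeList) newXPath(null).evaluate(
                expr, new InputSource(new StringReader(XML)), XPathConstants.NODESET);
            return nodes.getLength();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // Two slashes: any LexicalEntry anywhere in the document.
        System.out.println(count("//LexicalEntry"));
        // One leading slash: only a root element named LexicalEntry.
        System.out.println(count("/LexicalEntry"));
        // Qualifier plus attribute step.
        System.out.println(evalString("//Synset[@baseConcept='3']/@id", null));
        // Variable substitution.
        System.out.println(evalString(
            "//LexicalEntry[Lemma/@writtenForm=$word]/Sense/@synset", "beta"));
    }
}
```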

So, the XPath expressions in the code mean:

//LexicalEntry[WordForm/@writtenForm=$word or Lemma/@writtenForm=$word]/Sense/@synset

Match any <LexicalEntry> element anywhere in the XML document which has either

  • a WordForm child with a writtenForm attribute whose value is equal to the word variable
  • a Lemma child with a writtenForm attribute whose value is equal to the word variable

and for every such <LexicalEntry> element, return the value of the synset attribute of any <Sense> element which is a direct child of the <LexicalEntry> element.

The word variable is defined externally, by an xpath.setXPathVariableResolver, right before the XPath expression is evaluated.

//Synset[@id=$id]/SynsetRelations/SynsetRelation

Match any <Synset> element anywhere in the XML document whose id attribute is equal to the id variable. For each such <Synset> element, look for any direct SynsetRelations child element, and return each of its direct SynsetRelation children.

The id variable is defined externally, by an xpath.setXPathVariableResolver, right before the XPath expression is evaluated.

//LexicalEntry[Sense/@synset=$synset]/WordForm/@writtenForm

Match any <LexicalEntry> element anywhere in the XML document which has a <Sense> child element which has a synset attribute whose value is identical to the synset variable. For each matched element, find any <WordForm> child element and return that element’s writtenForm attribute.

The synset variable is defined externally, by an xpath.setXPathVariableResolver, right before the XPath expression is evaluated.


Logically, what the above should amount to is:

  • Locate the synset value for the requested word.
  • Use the synset value to locate SynsetRelation elements.
  • Locate writtenForm values corresponding to the targets value of each matched SynsetRelation.
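
A variation on these three steps, offered only as a sketch: it parses the XML once into a DOM Document instead of re-reading the source for every expression, and collects results in a LinkedHashSet so that a repeated SynsetRelation yields a single line. Two things differ from the code above and are assumptions on my part: the last expression reads Lemma/@writtenForm rather than WordForm/@writtenForm (the question's expected output lists lemma forms), and the class RelationFinder, the <Lexicon> root, and the Latin placeholder words are hypothetical. Holding the whole Document in memory may not be feasible for a very large file.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class RelationFinder {
    private final Document doc;
    private final XPath xpath = XPathFactory.newInstance().newXPath();
    private final Map<String, Object> vars = new HashMap<>();

    RelationFinder(Document doc) {
        this.doc = doc;
        // One resolver for all variables; values are looked up by name.
        xpath.setXPathVariableResolver(name -> vars.get(name.getLocalPart()));
    }

    // Parses an XML string once; a file could be parsed the same way.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    private NodeList nodes(String expr) throws XPathExpressionException {
        return (NodeList) xpath.evaluate(expr, doc, XPathConstants.NODESET);
    }

    Set<String> findRelations(String word) {
        try {
            // Step 1: synset value for the requested word (first match only).
            vars.put("word", word);
            vars.put("id", xpath.evaluate(
                "//LexicalEntry[WordForm/@writtenForm=$word or Lemma/@writtenForm=$word]"
                + "/Sense/@synset", doc));

            // Step 2: SynsetRelation elements of that synset.
            NodeList relations = nodes("//Synset[@id=$id]/SynsetRelations/SynsetRelation");

            Set<String> lines = new LinkedHashSet<>();  // drops duplicate lines
            for (int i = 0; i < relations.getLength(); i++) {
                Element relation = (Element) relations.item(i);

                // Step 3: written forms of the entries that point at the target.
                vars.put("synset", relation.getAttribute("targets"));
                NodeList formNodes =
                    nodes("//LexicalEntry[Sense/@synset=$synset]/Lemma/@writtenForm");
                Set<String> forms = new LinkedHashSet<>();
                for (int j = 0; j < formNodes.getLength(); j++) {
                    forms.add(formNodes.item(j).getNodeValue());
                }
                lines.add(word + " " + relation.getAttribute("relType")
                    + " " + String.join(", ", forms));
            }
            return lines;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<Lexicon>"
            + "<LexicalEntry id='e1'><Lemma writtenForm='agreement'/><Sense synset='s1'/></LexicalEntry>"
            + "<LexicalEntry id='e2'><Lemma writtenForm='concord'/><Sense synset='s2'/></LexicalEntry>"
            + "<Synset id='s1'><SynsetRelations>"
            + "<SynsetRelation relType='hyponym' targets='s2'/>"
            + "<SynsetRelation relType='hyponym' targets='s2'/>"
            + "</SynsetRelations></Synset></Lexicon>");
        new RelationFinder(doc).findRelations("agreement").forEach(System.out::println);
    }
}
```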
VGR
  • Thank you, but this code gives me duplicate results `[اِتِّفاق hyponym سجم,نغم,نسق,وأم,وأم, اِتِّفاق hyponym سجم,نغم,نسق,وأم,وأم, اِتِّفاق hyponym سجم,نغم,نسق,وأم,وأم, اِتِّفاق hyponym سجم,نغم,نسق,وأم,وأم, اِتِّفاق hypernym ]`. I am not familiar with `XPATH`, so can you add some comments to the code? – bttX Dec 23 '16 at 08:48
  • @bttX Updated answer with explanation of XPath expressions. – VGR Dec 23 '16 at 15:50
  • Now that makes sense. Thank you. So, from now on, if I want to locate a specific attribute value, I should use `XPath`? – bttX Dec 24 '16 at 10:50
  • Unless your XML document is very simple, XPath is often the easiest way to locate an element or attribute, yes. – VGR Dec 26 '16 at 17:41
  • Define simple :) what do you mean by that? – bttX Dec 26 '16 at 17:43
  • “Simple” in this case is a document which is small and easily traversed with org.w3c.dom classes. A document consisting of a root element and one level of child elements would be an example of that. – VGR Dec 26 '16 at 17:51
  • So that is not my case, because if you look at the extract I posted you will see that my `XML` file is a rather complicated one. – bttX Dec 26 '16 at 17:53

If this XML file is too large to represent in memory, use SAX.

You will want your SAX handler to keep track of its location. To do this, I typically use a StringBuffer, but a Stack of Strings would work just as nicely. This matters because it records the path back to the root of the document, so at any point you know where in the document you are (useful when you only want to extract a little information).

The main logic flow looks like:

 1. When entering a node, add the node's name to the stack.
 2. When exiting a node, pop the node's name (top element) off the stack.
 3. To know your location, read your current branch of the XML from the bottom of the stack to the top of the stack.
 4. When entering a region you care about, clear the buffer you will capture the characters into.
 5. When exiting a region you care about, flush the buffer into the data structure you will return as your output.

This way you can efficiently skip over all the branches of the XML tree that you don't care about.
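
The steps above can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the class PathTrackingHandler and its extract helper are hypothetical names, it uses an ArrayDeque as the stack, and the sample path /root/list/item/name is only an example of a region of interest.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class PathTrackingHandler extends DefaultHandler {
    private final String wantedPath;                       // e.g. "/root/list/item/name"
    private final Deque<String> stack = new ArrayDeque<>();
    private final StringBuilder buffer = new StringBuilder();
    private boolean capturing;
    final List<String> captured = new ArrayList<>();

    PathTrackingHandler(String wantedPath) {
        this.wantedPath = wantedPath;
    }

    // Step 3: read the stack from bottom (root) to top (current element).
    private String currentPath() {
        StringBuilder path = new StringBuilder();
        for (Iterator<String> it = stack.descendingIterator(); it.hasNext();) {
            path.append('/').append(it.next());
        }
        return path.toString();
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        stack.push(qName);                                 // step 1
        if (currentPath().equals(wantedPath)) {            // step 4
            buffer.setLength(0);
            capturing = true;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times per text node, so append rather than assign.
        if (capturing) {
            buffer.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (currentPath().equals(wantedPath)) {            // step 5
            captured.add(buffer.toString());
            capturing = false;
        }
        stack.pop();                                       // step 2
    }

    // Convenience wrapper: returns the text of every element at the given path.
    static List<String> extract(String xml, String path) {
        try {
            PathTrackingHandler handler = new PathTrackingHandler(path);
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
            return handler.captured;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<root><list><item><name>first</name><other>x</other></item>"
            + "<item><name>second</name></item></list><name>ignored</name></root>";
        System.out.println(extract(xml, "/root/list/item/name"));
    }
}
```

Everything outside the wanted path costs only a push and a pop, which is what makes skipping the uninteresting branches cheap.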

Edwin Buck
  • At the beginning, I create my `parser` and my `handle`, but when I start to override the `startElement` method I don't know what to put in it or where to begin! – bttX Dec 22 '16 at 15:20
  • https://docs.oracle.com/javase/tutorial/jaxp/sax/parsing.html is a good place to start. You can't get around learning the toolkits if you wish to use them, and after you learn the toolkit, you'll realize that with SAX you have to do a lot yourself (which is where my comments above will provide you with the most utility). – Edwin Buck Dec 22 '16 at 15:30
  • Excuse me, but I did not understand your comments. Should I use them with `SAX`, or is it another method? I am confused here. – bttX Dec 22 '16 at 15:39
  • Once you know a bit about SAX, then my instructions will provide you a higher level direction on how to use SAX to solve your problem. Basically I'm talking about how to build an arch out of bricks, but you first need to know how to stick bricks together before you attempt your first arch. :) – Edwin Buck Dec 23 '16 at 17:18
  • I read the doc that you gave me and some others too, and I found that I have to put what I need in the `startElement` method. Am I correct? Do I also need to follow your instructions to do so? – bttX Dec 24 '16 at 10:08
  • Some of what you need will go in startElement. There are ways to get what you need without my instructions, and if you find a different path gets you there faster, take it. My instructions add a "path" that is updated as the document is being parsed, so you can do things dependent on being at the right element (say /root/list/item/name, for example) but maybe your need doesn't require different behavior based on your position within the document. – Edwin Buck Dec 28 '16 at 20:54