I'm using SAX (Simple API for XML) to parse an XML document. My goal is to extract the entities from the XML and build an ER diagram from them (I will draw the diagram manually once I have all the entities the file contains). I'm at a very early stage of coding all of this, but I'm stuck on one particular problem right now.

Here is my code:

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class Parser extends DefaultHandler {

  public void getXml() {
    try {
      SAXParserFactory saxParserFactory = SAXParserFactory.newInstance();
      SAXParser saxParser = saxParserFactory.newSAXParser();
      final MySet openingTagList = new MySet();
      final MySet closingTagList = new MySet();
      DefaultHandler defaultHandler = new DefaultHandler() {

        public void startDocument() throws SAXException {
          System.out.println("Starting Parsing...\n");
        }

        public void endDocument() throws SAXException {
          System.out.print("\n\nDone Parsing!");
        }

        public void startElement(String uri, String localName, String qName,
          Attributes attributes) throws SAXException {
          if (!openingTagList.contains(qName)) {
            openingTagList.add(qName);
            System.out.print("<" + qName + ">");
          }
        }

        public void characters(char ch[], int start, int length)
        throws SAXException {
          for (int i = start; i < (start + length); i++) {
            System.out.print(ch[i]);
          }
        }

        public void endElement(String uri, String localName, String qName)
        throws SAXException {
          if (!closingTagList.contains(qName)) {
            closingTagList.add(qName);
            System.out.print("</" + qName + ">");
          }
        }
      };

      saxParser.parse("student.xml", defaultHandler);
    } catch (Exception e) {
      e.printStackTrace();
    }
  }

  public static void main(String args[]) {
    Parser readXml = new Parser();
    readXml.getXml();
  }
}

What I'm trying to achieve: when the startElement method detects that a tag has already been traversed, it should skip that tag as well as all the other entities inside it, but I'm confused about how to implement that part.

Note: The purpose is to read the tags; I don't care about the records in between them. MySet is just an abstraction with methods like contains (which checks whether the set already holds the passed data), nothing more.

Any help would be appreciated. Thanks.

Subhan
  • What is the exact problem? Any output? What is the content of your sets? – chris Mar 30 '15 at 19:15
  • The set is just a Vector I implemented myself, with a few extra methods such as checking whether a value is already present. Anyway, the problem is how to implement the functionality to skip all the inner tags when an already traversed tag is found. – Subhan Mar 30 '15 at 19:21
  • Ah ok. Why do you want to do this? Is it a huge file? I think you want to collect the tag names - this should already work with your code. – chris Mar 30 '15 at 19:23
  • Yes, it is a file from DBLP (1.46 GB), but I'm testing on small files first. As for why: some companies do this, that's all I can say. But the same tags are repeated over and over, so my question is how to skip them. – Subhan Mar 30 '15 at 19:25
  • After some reading on the net: I'm afraid that's not possible, as SAX (must) visit all nodes. Here is a similar question: http://stackoverflow.com/questions/18064716/sax-parser-to-skip-some-elements-which-are-not-to-be-parsed. Note the StAX link in one answer. Maybe this would help you. – chris Mar 30 '15 at 19:32
  • In your opinion, is there a better way of traversing such a huge file for entities only? – Subhan Mar 30 '15 at 19:35

1 Answer

Due to the nature of XML, it's not possible to know which tags will appear later in the file, so there is no 'skip the next x bytes' trick.
Just ask for reasonably sized files; maybe there is a possibility to split the data.
In my opinion, reading an XML file larger than 1 GB is no fun, regardless of the library used.
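
That said, while the parser itself still has to visit every node, the handler can ignore everything inside an element whose name it has already seen. Here is a minimal sketch of that idea using a depth counter. It is not a tested solution for the DBLP file; the class name UniqueTagCollector, the collect helper, and the use of a LinkedHashSet in place of MySet are my own assumptions for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: records each tag name once and ignores the
// whole subtree of any element whose name was already recorded.
public class UniqueTagCollector extends DefaultHandler {

  // LinkedHashSet (an assumption, standing in for MySet) keeps first-seen order.
  private final Set<String> seenTags = new LinkedHashSet<>();

  // 0 means "not skipping"; otherwise the nesting depth inside a skipped element.
  private int skipDepth = 0;

  @Override
  public void startElement(String uri, String localName, String qName,
      Attributes attributes) {
    if (skipDepth > 0) {
      // Already inside a skipped subtree: just track the nesting.
      skipDepth++;
    } else if (!seenTags.add(qName)) {
      // add() returns false if the name was already present: start skipping.
      skipDepth = 1;
    }
  }

  @Override
  public void endElement(String uri, String localName, String qName) {
    if (skipDepth > 0) {
      // Leaving one level of the skipped subtree; reaching 0 ends the skip.
      skipDepth--;
    }
  }

  // Hypothetical helper: parses an XML string and returns the recorded tag names.
  public static List<String> collect(String xml) throws Exception {
    SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
    UniqueTagCollector handler = new UniqueTagCollector();
    parser.parse(
        new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
    return new ArrayList<>(handler.seenTags);
  }

  public static void main(String[] args) throws Exception {
    // The second <b> and its child <c/> are ignored by the handler.
    System.out.println(collect("<a><b>x</b><b><c/></b><d/></a>"));
  }
}
```

Note the trade-off: a tag that only ever appears inside a repeated element (like c above) is never recorded, so this is only safe if repeated elements always have the same structure. For a real file you would pass the file name to parse instead of a string.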

chris
  • So how do you suggest I deal with it? Because I have to do it anyway. – Subhan Mar 30 '15 at 19:46
  • How fast does your code run? How long do you need to parse the file? What is the exact problem? As far as I can see, your code should work fine. – chris Mar 30 '15 at 19:48
  • The code works fine because I'm testing it on a small file, but the intention is to parse a huge file (1.46 GB). So all I'm asking for is a better way, because I'm quite a newbie at XML and parsing. – Subhan Mar 30 '15 at 19:50
  • SAX is a good way to parse big files because you don't have to store all the data in RAM (as is done with DOM). I would use your code. It's quite a smart solution. – chris Mar 30 '15 at 19:58
  • Okay, I worked on it and got some ideas. Can you suggest how I can create a parent-child hierarchy in SAX, just like in DOM? – Subhan Mar 30 '15 at 21:00
  • I would recommend asking a new question about that. – chris Mar 31 '15 at 03:59
  • Okay, please answer it here: http://stackoverflow.com/questions/29360901/getting-parent-child-hierarchy-in-sax-xml-parser – Subhan Mar 31 '15 at 05:07