5

I need to read the first 15 lines from about 100 XML files that are up to 200,000 lines long. Is there a way to use something like BufferedReader to do this efficiently? The steps outlined in this question use DocumentBuilder.parse(String); this tries to parse the entire file at once.

EDIT: The first 15 elements contain metadata about the file (page names, last edited dates, etc) that I would like to parse into a table.

AnthonyW
  • 2
    DocumentBuilder (DOM) tries to parse everything. If you want to read **lines** you should actually use `BufferedReader`. If you want to read **tags** then you should use a SAX (org.xml.sax) reader (or a XML Reader) which will allow you to read the XML sequentially and respond to events caused by tags found. – helderdarocha Apr 28 '14 at 15:16
  • 1
    Once you have XML, try to read it as XML. I'm not sure if that's possible, but I would suggest to modify SAX parser (http://www.mkyong.com/java/how-to-read-xml-file-in-java-sax-parser/) to end when you read first 15 elements, but be aware even long XML can be in just one line... – Betlista Apr 28 '14 at 15:17
  • 1
    You can count the number of elements read inside the `startElement` method, and stop when you read a certain number (of elements, not lines). – helderdarocha Apr 28 '14 at 15:18
  • I was hoping to take advantage of the xml friendly methods that a parser brings. Wouldn't I have to manually separate my elements if I used only BufferedReader? – AnthonyW Apr 28 '14 at 15:21
  • @AnthonyW: Yes, you will have to implement your own XML parser, and that's for sure something you do not want to do (to reinvent the wheel)... – Betlista Apr 28 '14 at 15:22
  • 1
    You can probably use a SAX parser and in the `characters()` method count the newlines. But if you actually want to extract something from the beginning of the file, you could simply stop when you find it. – helderdarocha Apr 28 '14 at 15:23
  • Perhaps you could add an example (code) of the beginning of the file you want to read. You might get better answers. – helderdarocha Apr 28 '14 at 15:25
  • Looks like a SAX parser is what I am looking for. – AnthonyW Apr 28 '14 at 15:29

5 Answers

8

This is probably what you want: as I wrote in my comment, use a SAX parser and, once your stopping condition is met, abort the parse as described in this question:

How to stop parsing xml document with SAX at any time?

edit:

test.xml

<?xml version="1.0" encoding="UTF-8"?>
<root>
    <first>
        <inner>data</inner>
    </first>
    <second>second</second>
    <third>third</third>
    <next>next</next>
</root>

ReadXmlUpToSomeElementSaxParser.java

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class ReadXmlUpToSomeElementSaxParser extends DefaultHandler {

    private final String lastElementToRead;

    public ReadXmlUpToSomeElementSaxParser(String lastElementToRead) {
        this.lastElementToRead = lastElementToRead;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
        // just for showing what is parsed
        System.out.println("startElement: " + qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (lastElementToRead.equals(qName)) {
            throw new MySaxTerminatorException();
        }
    }

    public static void main(String[] args) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        SAXParser saxParser = factory.newSAXParser();

        try {
            saxParser.parse("src/test.xml", new ReadXmlUpToSomeElementSaxParser("second"));
        } catch (MySaxTerminatorException exp) {
            // nothing to do, expected
        }
    }

    // static, so the exception does not need an enclosing handler instance
    public static class MySaxTerminatorException extends SAXException {
    }

}

output

startElement: root
startElement: first
startElement: inner
startElement: second

Why is that better? Simply because some application can send you

<?xml version="1.0" encoding="UTF-8"?>
<root>
    <first><inner>data</inner></first>
    <second>second</second>
    <third>third</third>
    <next>next</next>
</root>

and a line-oriented approach will fail...

I deliberately provided a parser that does not count elements, to show that the stopping condition can be whatever your business logic requires.

characters() warning

To read the data inside an element you can use the characters() method, but please be aware that

SAX parsers may return all contiguous character data in a single chunk, or they may split it into several chunks

read more in JavaDoc
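
The chunking warning above can be handled by buffering text until the end tag arrives. The following is only a minimal sketch, not part of the original answer: the class name `TextCollectingHandler`, the `getValues()` accessor, and the in-memory sample XML are all made up for illustration, and it assumes simple elements without mixed content.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects "name=text" pairs for elements that contain text.
public class TextCollectingHandler extends DefaultHandler {

    // Text of the element currently being read; filled across characters() calls.
    private final StringBuilder text = new StringBuilder();
    // Completed "name=text" pairs, in document order.
    private final List<String> values = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        text.setLength(0); // a new element starts: discard whatever was buffered
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times for one element, so append instead of assigning.
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String value = text.toString().trim();
        if (!value.isEmpty()) {
            values.add(qName + "=" + value);
        }
        text.setLength(0);
    }

    public List<String> getValues() {
        return values;
    }

    public static void main(String[] args) throws Exception {
        // Sample input, assumed for the demo only.
        String xml = "<root><second>second</second><third>third</third></root>";
        TextCollectingHandler handler = new TextCollectingHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        System.out.println(handler.getValues()); // prints [second=second, third=third]
    }
}
```

Reading the buffer only in endElement() means it does not matter how the parser splits the character data.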

Betlista
4

Here's a simple solution that reads your file line by line until it has stored 15 lines of data in the lines variable (or fewer than 15 if your file is shorter).

File f = new File("your path");
// StringBuilder avoids creating a new String on every appended line
StringBuilder lines = new StringBuilder();
// try-with-resources closes the reader even if reading fails
try (BufferedReader br = new BufferedReader(new FileReader(f)))
{
    String line;
    int lineCount = 0;
    // stop after 15 lines or at end of file, whichever comes first
    while(lineCount < 15 && (line = br.readLine()) != null)
    {
        lineCount++;
        lines.append(line).append("\n");
    }
}
catch(IOException e)
{
    e.printStackTrace();
}
4

I suggest looking into a streaming XML parser; the use case for streaming APIs extends to reading files of hundreds of GB, which obviously cannot fit in memory.

In Java, the StAX API is a (fairly large) evolution of native SAX APIs. Look through the tutorial here on parsing "on the fly":

http://tutorials.jenkov.com/java-xml/stax.html
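
As a rough illustration of the pull-style approach (not taken from the tutorial), here is a minimal sketch using the standard `javax.xml.stream` API. The class name `FirstElementsStax`, the method `firstElementNames`, and the sample XML are hypothetical; the limit of 4 is arbitrary.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class FirstElementsStax {

    /** Returns the names of at most maxElements start tags, in document order. */
    public static List<String> firstElementNames(Reader in, int maxElements)
            throws XMLStreamException {
        List<String> names = new ArrayList<>();
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
        try {
            // The caller drives the parser, so stopping early is just leaving the loop.
            while (names.size() < maxElements && reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    names.add(reader.getLocalName());
                }
            }
        } finally {
            reader.close(); // frees parser resources; does not close the underlying Reader
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        // In-memory sample, assumed for the demo; a real file would be opened with a Reader.
        String xml = "<root><first><inner>data</inner></first>"
                + "<second>second</second><third>third</third></root>";
        System.out.println(firstElementNames(new StringReader(xml), 4));
        // prints [root, first, inner, second]
    }
}
```

Because the application pulls events one at a time, no exception is needed to stop: the rest of the document is simply never read.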

Ishan Chatterjee
2

It is better to read the file manually, as shown below. A DOM parser will be expensive in your case. If you really want to parse the XML and extract nodes, you can use a SAX parser.

try (BufferedReader br = new BufferedReader(new FileReader("C:\\testing.txt")))
{

    String sCurrentLine;
    int linesRead = 0;

    // read at most the first 15 lines instead of the whole file
    while (linesRead < 15 && (sCurrentLine = br.readLine()) != null) {
        System.out.println(sCurrentLine);
        linesRead++;
    }

} catch (IOException e) {
    e.printStackTrace();
}
niiraj874u
  • Well it depends what OP intends to do with the first 15 lines. If they want to parse the XML, they should use a streamed parser, i.e. SAX, which doesn't load the entire document like a DOM parser does. – Zoltán Apr 28 '14 at 15:18
2

Suppose you want to read something like this:

<?xml ...?>
<root>
    <element>data</element>
    ...
    <otherElement>more data</otherElement>
    <ignoredElement> ... </ignoredElement>
    ... more ignored Elements
</root>

And you want only the first 13 child elements inside root (which happen to be within the first 15 lines of your very large file).

You can use a SAX parser to read the file and abort it as soon as it has read those elements.

You can set up a SAX parser using standard J2SE:

SAXParserFactory spf = SAXParserFactory.newInstance();
SAXParser sp = spf.newSAXParser();
XMLReader reader = sp.getXMLReader();

Then you need to create a ContentHandler class that will be your data handler; I will call it DataSaxHandler. If you extend DefaultHandler, you only need to implement the methods you are interested in. The example below detects the start and end of each element and prints it out. It counts 15 end tags (it won't generate well-formed output) and ignores attributes. Use it as a starting point (I didn't test it):

public class DataSaxHandler extends DefaultHandler {

    private int countTags = 0;
    private boolean inElement = false;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) throws SAXException {
        System.out.println("<" + qName + ">");
        inElement = true;
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        countTags++;
        System.out.println("</" + qName + ">");
        inElement = false;

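        if(countTags >= 15) {
            // stop parsing; the caller should catch this exception and treat it as expected
            throw new SAXException("Stopped after " + countTags + " end tags");
        }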
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if(inElement) {
            System.out.println(new String(ch, start, length));
        }
    }
}

You register it with your SAX reader and use it to parse the file.

    reader.setContentHandler(new DataSaxHandler());
    reader.parse(new InputSource(new FileInputStream(new File(PATH, "data.xml"))));
helderdarocha