3

I am trying to change a single value in a large (5 MB) XML file. The value is always within the first 10 lines, so I do not need to read 99% of the file. Yet doing a partial XML read in Java seems quite tricky.

In this picture you can see the single value I need to access.

I have read a lot about XML in Java and the best practices for handling it. However, in this case I am unsure which approach is best: DOM, StAX and SAX parsers each seem to have different ideal use cases, and I am not sure which one suits this problem, since all I need to do is edit one value.

Perhaps I shouldn't even use an XML parser and should just go with regex, but it seems like a pretty bad idea to use regex on XML.

Hoping someone could point me in the right direction, Thanks!

nmu
  • 2
    Well, there is no real such thing as a "partial read": if you do not have the whole file, then the portion you have is likely improperly formatted and thus will not parse, and if it will not parse then you cannot access its attributes. For such a small edit, the best bet may be to load the entire file as a string (significantly faster than trying to deserialize it) and just do a string replace/pattern search. – Wobbles May 17 '16 at 17:53
  • 2
    @Wobbles well, that's just not true. SAX and StAX parsers are built for the exact scenario where you don't want to load the entire document into memory. – ug_ May 17 '16 at 17:55
  • Well, you can of course do a partial read. But there's no such thing as a partial write (except when appending to the end of a file). – Kayaman May 17 '16 at 18:02
  • Partial read is easy - in this case StAX will work perfectly. The issue is with writing - there is no such thing as a partial write; for example if you change "10" to "100" you will have to shift all the bytes in the file along by one ASCII character. So I would suggest you use SAX or StAX to stream the file from one location to another, changing that single value. Definitely don't read the entire `String` into memory. – Boris the Spider May 17 '16 at 18:16
  • 3
    @Wobbles that's wrong and a terrible idea. – Boris the Spider May 17 '16 at 18:16

2 Answers

2

I would choose DOM over SAX or StAX simply for the (relative) simplicity of the API. Yes, there is some boilerplate code to get the DOM populated, but once you get past that it is fairly straightforward.

Having said that, if your XML source is 100s or 1000s of megabytes, one of the streaming APIs would be better suited. As it is, 5MB is not what I would consider a large dataset, so go ahead and use DOM and call it a day:

import java.io.File;
import javax.xml.parsers.*;
import javax.xml.transform.*;
import javax.xml.transform.dom.*;
import javax.xml.transform.stream.*;
import javax.xml.xpath.*;
import org.w3c.dom.*;

public class ChangeVersion
{
    public static void main(String[] args)
            throws Exception
    {
        if (args.length < 3) {
            System.err.println("Usage: ChangeVersion <input> <output> <new version>");
            System.exit(1);
        }

        File inputFile = new File(args[0]);
        File outputFile = new File(args[1]);
        int updatedVersion = Integer.parseInt(args[2], 10);

        DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
        DocumentBuilder docBuilder = domFactory.newDocumentBuilder();
        Document doc = docBuilder.parse(inputFile);

        XPathFactory xpathFactory = XPathFactory.newInstance();
        XPath xpath = xpathFactory.newXPath();
        XPathExpression expr = xpath.compile("/PremiereData/Project/@Version");

        NodeList versionAttrNodes = (NodeList) expr.evaluate(doc, XPathConstants.NODESET);

        for (int i = 0; i < versionAttrNodes.getLength(); i++) {
            Attr versionAttr = (Attr) versionAttrNodes.item(i);
            versionAttr.setNodeValue(String.valueOf(updatedVersion));
        }

        TransformerFactory transformerFactory = TransformerFactory.newInstance();
        Transformer transformer = transformerFactory.newTransformer();

        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        transformer.transform(new DOMSource(doc), new StreamResult(outputFile));
    }
}
Sean Bright
  • 1
    Fair enough, and +1 for the complete and working example. I would be wary of reading even 5mb into DOM, but I'm sure you're right - the performance gain from streaming is probably dwarfed by the learning curve... – Boris the Spider May 17 '16 at 18:29
  • Completely right, the program barely even stuttered. Don't know what I was worried about - and this works like a charm, thanks! – nmu May 17 '16 at 18:51
2

You can use a StAX parser to write the XML out as you read it, replacing content as the events stream through. A StAX parser keeps only part of the XML in memory at any given time.

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

public class ChangeVersionStax { // class name is arbitrary
    public static void main(String[] args) throws Exception {
        final String newVersion = "888";

        File inputFile = new File("in.xml");
        File outputFile = new File("out.xml");
        System.out.println("Reading " + inputFile);
        System.out.println("Writing " + outputFile);

        XMLEventFactory eventFactory = XMLEventFactory.newInstance();

        // Byte streams (rather than a FileWriter) keep the platform default charset out of the output.
        try (InputStream in = new FileInputStream(inputFile);
             OutputStream out = new FileOutputStream(outputFile)) {
            XMLEventReader eventReader = XMLInputFactory.newInstance().createXMLEventReader(in);
            // UTF-8 output is an assumption here; adjust if the file uses another encoding
            XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out, "UTF-8");

            while (eventReader.hasNext()) {
                XMLEvent event = eventReader.nextEvent();
                boolean useExistingEvent = true; // whether to write the event exactly as it was read

                // look for our Project element
                if (event.isStartElement()) {
                    StartElement elemEvent = event.asStartElement();
                    Attribute attr = elemEvent.getAttributeByName(QName.valueOf("ObjectID"));
                    // check to see if this is the project we want
                    // TODO: put what logic you want here
                    if ("Project".equals(elemEvent.getName().getLocalPart()) && attr != null && attr.getValue().equals("1")) {
                        // build a new attribute list for this element that leaves out the old Version attribute
                        List<Attribute> newAttrs = new ArrayList<>();
                        Iterator<?> existingAttrs = elemEvent.getAttributes();
                        while (existingAttrs.hasNext()) {
                            Attribute existing = (Attribute) existingAttrs.next();
                            // copy over everything but the Version attribute
                            if (!existing.getName().getLocalPart().equals("Version")) {
                                newAttrs.add(existing);
                            }
                        }
                        // add the replacement Version attribute (no NullPointerException if the original had none)
                        newAttrs.add(eventFactory.createAttribute("Version", newVersion));

                        // we're writing our own start element instead of the one that was read
                        useExistingEvent = false;
                        writer.add(eventFactory.createStartElement(elemEvent.getName(), newAttrs.iterator(), elemEvent.getNamespaces()));
                    }
                }

                // persist the existing event
                if (useExistingEvent) {
                    writer.add(event);
                }
            }
            writer.close();
            eventReader.close();
        }
    }
}
ug_
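
Several comments above note that a file can be read partially but not written partially, so editing "in place" really means streaming to a second location and then swapping the files. The following is a minimal sketch of that idea, not taken from either answer: it streams the XML with StAX into a temporary file beside the original and then moves it over the original. The names `VersionRewriter`, `rewrite` and `rewriteFile` are hypothetical, and the sketch assumes the attribute to change is `Version` on the first `Project` element encountered and that UTF-8 output is acceptable.

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

public class VersionRewriter { // hypothetical name, not from the answers above

    /** Copies the XML from in to out, changing Version on the first Project element seen. */
    public static void rewrite(InputStream in, OutputStream out, String newVersion)
            throws XMLStreamException {
        XMLEventFactory events = XMLEventFactory.newInstance();
        XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
        // Assumption: the output is written as UTF-8.
        XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out, "UTF-8");
        boolean done = false;
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (!done && event.isStartElement()
                    && "Project".equals(event.asStartElement().getName().getLocalPart())) {
                StartElement start = event.asStartElement();
                List<Attribute> attrs = new ArrayList<>();
                for (Iterator<?> it = start.getAttributes(); it.hasNext(); ) {
                    Attribute a = (Attribute) it.next();
                    // Swap in a new Version attribute; every other attribute is copied unchanged.
                    attrs.add("Version".equals(a.getName().getLocalPart())
                            ? events.createAttribute(a.getName(), newVersion)
                            : a);
                }
                event = events.createStartElement(
                        start.getName(), attrs.iterator(), start.getNamespaces());
                done = true; // all later events are passed through untouched
            }
            writer.add(event);
        }
        writer.close();
        reader.close();
    }

    /** Streams the file to a temp file in the same directory, then moves it over the original. */
    public static void rewriteFile(Path file, String newVersion) throws Exception {
        Path tmp = Files.createTempFile(file.toAbsolutePath().getParent(), "rewrite", ".xml");
        try (InputStream in = Files.newInputStream(file);
             OutputStream out = Files.newOutputStream(tmp)) {
            rewrite(in, out, newVersion);
        }
        Files.move(tmp, file, StandardCopyOption.REPLACE_EXISTING);
    }
}
```

Calling `VersionRewriter.rewriteFile(Paths.get("in.xml"), "100")` would then update the file without ever holding the whole document in memory. Creating the temporary file in the same directory keeps the final move on one filesystem; whether that move is atomic depends on the platform, so treat this as a starting point only.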