
Is there a way to remove the comments from a huge XML file (>200 MB) parsed by VTD-XML?

I want to remove both comments before the root element

<!-- comment -->
<rootElement>
.
.
.
 </rootElement>

and comments within

<rootElement>
<book>
<!-- comment -->
</book>
</rootElement>

The best solution would use XPath. I tried

//comment()

which works with DOM but not with VTD-XML.

Here is my code for selecting comments:

String xPath = "//comment()";
XMLModifier xm = new XMLModifier();
VTDGen vg = new VTDGen();
if (vg.parseFile(fnIn, true)) {
    VTDNav vn = vg.getNav();
    xm.bind(vn);
    nodeXpath(xPath, vn);
}

private void nodeXpath(String xPath, VTDNav vn) throws Exception{
    int result;

    AutoPilot ap = new AutoPilot();
    ap.selectXPath(xPath);
    ap.bind(vn);
    while((result = ap.evalXPath())!=-1){
        int p = vn.getText();

        if (p!=-1) {                
            System.out.println(vn.getText() + ", " + vn.toString(p));               
        }
    }
}

But nothing is printed to the screen.

Is there a way to do this with VTD-XML?

Thanks for your help.

Kepler
  • You say that the XPath expression doesn't "work" with VTD-XML. Exactly what did you try, and what was the result? Maybe looking at http://stackoverflow.com/a/22161292/423105 will help you get to more concrete questions. – LarsH Aug 17 '15 at 17:40
  • I know how to use XPath expressions and the XMLModifier. I tried `//comment()` as the expression, but it doesn't work. In DOM it selected the correct texts. Maybe there is a way in VTD-XML to identify all comments in the document, no matter where they are. – Kepler Aug 18 '15 at 08:04
  • Yes the same `//comment()` expression should select all comments in VTD-XML. The question then is how you use that to remove them all. Show us your VTD-XML code (the Java, C# or whatever) and maybe we can help you figure out why it didn't work. – LarsH Aug 18 '15 at 15:13
  • Hi LarsH, I added the code in my question. – Kepler Aug 18 '15 at 17:09
  • You mentioned that your code prints nothing to the screen... not even commas? I wouldn't expect it to necessarily print anything from `getText()`, since the doc for getText() seems to indicate that it returns "the type character data or CDATA", which I don't think includes the content of a comment. A good test would be to print something in every iteration of your while loop **before** `p = vn.getText()`, so you'll know whether it's finding the comments at all. – LarsH Aug 18 '15 at 18:05
  • If it is finding the comments, I think you'll want to call xm.removeToken(result) on each one. – LarsH Aug 18 '15 at 18:08
  • Thanks very much ! Is there also a way to remove the empty lines from the output, where the comments were placed? – Kepler Aug 18 '15 at 19:22
  • I agree with LarsH. You should not use `getText()` to get a comment node. You may want to use the output of the XPath evaluation directly to obtain the comment nodes. Can you try it and see whether it works? – vtd-xml-author Aug 19 '15 at 03:54
  • I already tried and it works. Thanks – Kepler Aug 19 '15 at 15:17
  • Regarding removal of empty lines, see this question. http://stackoverflow.com/questions/17441393/remove-the-remaining-new-line-after-using-vtd-xml-to-remove-an-element Also, I put my suggestion into an actual "answer", so this question can be removed from the "unanswered" list. Please consider clicking the "accept" checkmark if it solved your original problem. – LarsH Aug 19 '15 at 16:34

1 Answer

You mentioned that your code prints nothing to the screen... not even commas? I wouldn't expect it to necessarily print anything from `getText()`, since the doc for `getText()` seems to indicate that it returns "the type character data or CDATA", which I don't think includes the content of a comment. (Thank you, @vtd-xml-author, for confirming that.)

A good test would be to print something in every iteration of your while loop **before** `p = vn.getText()`, so you'll know whether it's finding the comments at all.

If it is finding the comments, I think you'll want to call `xm.removeToken(result)` on each one.
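
To make that concrete, here is a minimal sketch of the loop, assuming the VTD-XML Java library (`com.ximpleware`) is on the classpath. The class name `CommentStripper` and the helper `stripComments` are made up for illustration. It works on an in-memory byte array to stay self-contained; for a large file you would presumably keep `vg.parseFile(fnIn, true)` as in the question and write the result with `xm.output(...)` to a file instead. I have not checked how comments before the root element behave, so treat that case as untested.

```java
import com.ximpleware.AutoPilot;
import com.ximpleware.VTDGen;
import com.ximpleware.VTDNav;
import com.ximpleware.XMLModifier;

import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class CommentStripper {

    // Hypothetical helper: returns a copy of the document with every comment
    // selected by //comment() removed.
    static byte[] stripComments(byte[] xml) throws Exception {
        VTDGen vg = new VTDGen();
        vg.setDoc(xml);
        vg.parse(true); // namespace-aware; throws if the XML is not well-formed

        VTDNav vn = vg.getNav();
        XMLModifier xm = new XMLModifier(vn);
        AutoPilot ap = new AutoPilot(vn);
        ap.selectXPath("//comment()");

        int i;
        while ((i = ap.evalXPath()) != -1) {
            // i is the token index of the comment itself; use it directly
            // instead of calling vn.getText(), which looks for text/CDATA.
            xm.removeToken(i);
        }

        // The modifier only records the deletions; output() applies them.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        xm.output(out);
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        String in = "<rootElement><book><!-- comment --><title>t</title></book></rootElement>";
        byte[] result = stripComments(in.getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(result, StandardCharsets.UTF_8));
    }
}
```

Removing a comment this way leaves the surrounding whitespace untouched, which is why blank lines can remain where the comments used to be.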

LarsH