I have been searching for hours for a solution to my problem, but nothing has come up. Here is my code snippet, followed by the problem:

Pattern forKeys = Pattern.compile("^<feature>\\s*<name>Deviation</name>.*?</feature>", Pattern.DOTALL | Pattern.MULTILINE);
Matcher n = forKeys.matcher("");
String aLine = null;
while ((aLine = in.readLine()) != null) {
    n.reset(aLine);
    String result = n.replaceAll("");
    out.write(result);
    out.newLine();
}

Let's assume the undeclared variables (in and out) are already declared.

My point is that my regex (and maybe the matcher too) is not working properly.

I want to remove every part matching "<feature><name>Deviation</name>*any characters here*</feature>" from the following lines:

<feature>
    <name>Deviation</name>
            <more words here>
</feature>
<feature>
    <name>Average</name>
</feature>
    <feature>
    <name>Deviation</name>
            sample words
</feature>

I think my problem is with the repetition operators (how to match across line breaks, tabs, etc.), but I can't seem to find the correct expression.

Any ideas? Thanks in advance.

legaicy

1 Answer

Parsing HTML or XML with regex is evil and error-prone.

Use an XML parser and things will work much better.
Here's a solution for your problem using Dom4J:

// parse XML source
Document document = DocumentHelper.parseText(yourXmlText);

Iterator<Element> featureIterator =
    // get an iterator for all <feature> elements
    document.getRootElement().elementIterator("feature");

while(featureIterator.hasNext()){
    Element featureElement = featureIterator.next();
    // if <feature> has a child <name> with Content "Deviation"
    if ("Deviation".equals(featureElement.elementTextTrim("name"))) {
        // remove this <feature> element
        featureIterator.remove();
    }
}

// write modified XML back to file
new XMLWriter(
    new FileOutputStream(yourXmlFile), OutputFormat.createPrettyPrint()
).write(document);

Apart from that you are also making a mistake (see my comments):

// aLine is just a single line
while((aLine = in.readLine()) != null) {
     n.reset(aLine);
     // yet you want to replace a multi-line pattern
     String result = n.replaceAll("");
     out.write(result);
     out.newLine();
}

Your regex might or might not work if you read the entire file into a String, but it can't work if you apply it to individual lines, because no single line ever contains both the opening <feature> and the closing </feature>.
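
To illustrate that last point, here is a minimal sketch of applying the pattern to the whole text at once instead of line by line. The class name DeviationStripper and the method strip are hypothetical. The pattern is also adjusted from the question, as an assumption: the ^ anchor is dropped and leading spaces or tabs are allowed, because the third <feature> in the sample is indented. It is still a regex over XML, so it is only as reliable as the input is regular.

```java
import java.util.regex.Pattern;

class DeviationStripper {
    // Assumption: a <feature> block may be indented, and <name> is its first child.
    // DOTALL lets ".*?" cross line breaks; the lazy quantifier stops at the
    // first </feature>. The trailing "\\R?" also swallows the line break
    // after the removed block (Java 8+).
    private static final Pattern DEVIATION_FEATURE = Pattern.compile(
            "[ \\t]*<feature>\\s*<name>Deviation</name>.*?</feature>[ \\t]*\\R?",
            Pattern.DOTALL);

    // Removes every matching block from the complete text.
    static String strip(String wholeText) {
        return DEVIATION_FEATURE.matcher(wholeText).replaceAll("");
    }

    public static void main(String[] args) {
        String sample = String.join("\n",
                "<feature>",
                "    <name>Deviation</name>",
                "            <more words here>",
                "</feature>",
                "<feature>",
                "    <name>Average</name>",
                "</feature>",
                "    <feature>",
                "    <name>Deviation</name>",
                "            sample words",
                "</feature>",
                "");
        // Prints only the <feature> block whose name is Average.
        System.out.print(strip(sample));
    }
}
```

For a real file, the text could first be loaded with new String(Files.readAllBytes(path), StandardCharsets.UTF_8), which holds the whole file in memory just as a DOM-style parser would.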

Sean Patrick Floyd
  • Thanks for the quick reply! I'll look into your suggestion since XML parsers have not entered my mind (I'm inexperienced in using Java, and so the limited knowledge). Will let you know as soon as I solved my problem. Thanks again! – legaicy Mar 17 '11 at 02:02
  • Just a follow-up. I studied XML parsing and, instead of using what you suggested, I tried DOM and it's working seamlessly! Thanks for giving me the right ideas. :) – legaicy Mar 17 '11 at 09:36
  • Got some not-so-good news: after successfully testing DOM on small files, it seems that it cannot process megabyte-sized XML files efficiently (it either crashes or takes too long if the heap is increased). So again, I'm looking into two parsers: StAX and your suggestion, dom4j. I will update this thread with any news. – legaicy Mar 17 '11 at 15:47
  • @legaicy Next time, try to give more context. Yes, dom4j is inefficient for huge files because of all the sugar it adds. StAX or XPP are a lot more efficient, but also less comfortable. – Sean Patrick Floyd Mar 17 '11 at 15:55
  • Hey, I actually used dom4j as it looks more user-friendly. It turns out that it can process 25MB XML files, and I think it took just a few seconds to process one file. It took only some minutes to modify and create 100+ XML files. Again, thank you very much! Edit: actually, I think it took less than one minute to process all of my files. :) – legaicy Mar 20 '11 at 07:37
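
For the large-file case raised in the comments, the streaming alternative mentioned there (StAX) might look roughly like the sketch below. It is not code from the thread: the class name StaxFeatureFilter and the method filter are hypothetical, and it assumes well-formed XML with a single root element in which <name> is a direct child of <feature>. Only one <feature> is buffered at a time, so memory use does not grow with file size.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

class StaxFeatureFilter {
    // Copies the XML to the output, skipping every <feature> whose direct
    // <name> child has the text "Deviation". Strings are used only to keep
    // the sketch self-contained; streams would be used for real files.
    static String filter(String xml) throws XMLStreamException {
        XMLInputFactory inputFactory = XMLInputFactory.newInstance();
        // Deliver adjacent text as one Characters event.
        inputFactory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        XMLEventReader reader = inputFactory.createXMLEventReader(new StringReader(xml));
        StringWriter sink = new StringWriter();
        XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(sink);

        List<XMLEvent> buffer = new ArrayList<>(); // events of the current <feature>
        int depth = 0;          // 0 = outside any <feature>
        boolean inName = false; // inside a direct <name> child
        boolean drop = false;   // current <feature> is a Deviation feature

        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            boolean featureStart = event.isStartElement()
                    && "feature".equals(event.asStartElement().getName().getLocalPart());
            if (depth == 0 && !featureStart) {
                writer.add(event); // outside <feature>: pass through unchanged
                continue;
            }
            buffer.add(event);
            if (event.isStartElement()) {
                depth++;
                inName = depth == 2
                        && "name".equals(event.asStartElement().getName().getLocalPart());
            } else if (event.isCharacters() && inName) {
                if ("Deviation".equals(event.asCharacters().getData().trim())) {
                    drop = true;
                }
            } else if (event.isEndElement()) {
                depth--;
                inName = false;
                if (depth == 0) { // closing </feature>: flush or discard
                    if (!drop) {
                        for (XMLEvent buffered : buffer) {
                            writer.add(buffered);
                        }
                    }
                    buffer.clear();
                    drop = false;
                }
            }
        }
        writer.close();
        return sink.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<features>"
                + "<feature><name>Deviation</name><value>1</value></feature>"
                + "<feature><name>Average</name></feature>"
                + "</features>";
        System.out.println(filter(xml));
    }
}
```

As the comments note, this trades convenience for efficiency: the dom4j version is shorter, and it turned out to be fast enough for the 25MB files in question.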