
I'm looking for a regular expression, but can't find one that works.

I'm parsing a text file that looks like this:

    <resource name="/_op_sox/Project/Default/ICDocumentation/Evaluation/Allianz/Allianz SE/Eval_01241.txt"
              inheritAcls="true">
        <bundle name="AZEvaluation">
            <property name="End Date">
            </property>
            <property name="Evaluation Type">
                <propertyValue name="RCSA"/>
            </property>
        </bundle>
    </resource>
    <resource name="/_op_sox/Project/Default/ICDocumentation/Evaluation/Allianz/Allianz SE/Eval_01481.txt"
              inheritAcls="true">
        <bundle name="AZEvaluation">
            <property name="End Date">
            </property>
            <property name="Evaluation Type">
                <propertyValue name="TRA"/>
            </property>
        </bundle>
    </resource>
    <resource name="/_op_sox/Project/Default/ICDocumentation/Evaluation/Allianz/Allianz SE/Eval_01362.txt"
              inheritAcls="true">
        <bundle name="AZEvaluation">
            <property name="End Date">
            </property>
            <property name="Evaluation Type">
                <propertyValue name="RCSA"/>
            </property>
        </bundle>
    </resource>

My current regex matches too much:

    <resource.+?<propertyValue name="RCSA".+?</resource>

It matches the first resource element, and then the second and third together as one match. Can somebody change the regex so that it really stops at the first `</resource>`?

I use this Java code:

    Pattern.compile("<resource.+?<propertyValue name=\"RCSA\".+?</resource>", Pattern.MULTILINE | Pattern.DOTALL)
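To see where the over-match comes from, the pattern can be run against a cut-down copy of the sample. The sketch below is only an illustration: the class name `OverMatchDemo`, the `resource` helper and the shortened file names are assumptions made for the example, not part of the original program. It counts how many `</resource>` closing tags end up inside each match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class OverMatchDemo {
    // Hypothetical helper: builds one <resource> element shaped like the sample.
    static String resource(String file, String type) {
        return "<resource name=\"" + file + "\" inheritAcls=\"true\">\n"
             + "  <bundle name=\"AZEvaluation\">\n"
             + "    <property name=\"Evaluation Type\">\n"
             + "      <propertyValue name=\"" + type + "\"/>\n"
             + "    </property>\n"
             + "  </bundle>\n"
             + "</resource>\n";
    }

    static final String SAMPLE = resource("Eval_01241.txt", "RCSA")
                               + resource("Eval_01481.txt", "TRA")
                               + resource("Eval_01362.txt", "RCSA");

    // The pattern from the question, unchanged apart from dropping MULTILINE,
    // which has no effect here because the pattern contains no ^ or $.
    static final Pattern NAIVE = Pattern.compile(
            "<resource.+?<propertyValue name=\"RCSA\".+?</resource>",
            Pattern.DOTALL);

    // For each match, count the </resource> closing tags it contains.
    static List<Integer> closingTagsPerMatch(Pattern p, String input) {
        List<Integer> counts = new ArrayList<>();
        Matcher m = p.matcher(input);
        while (m.find()) {
            counts.add(m.group().split("</resource>", -1).length - 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(closingTagsPerMatch(NAIVE, SAMPLE)); // prints [1, 2]
    }
}
```

The second match starts at the TRA resource. Its lazy `.+?` keeps expanding across that element's `</resource>` until it finds an RCSA value in the next resource, because nothing in the pattern forbids crossing an element boundary.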
High Performance Mark
Nabor
  • 13
    Use an XML parser. – YXD Feb 20 '12 at 11:09
  • Simple question: why are you not using XML tools for this? Regexp isn't good over XML or HTML – SergeS Feb 20 '12 at 11:09
  • Why not use Jsoup? It would be trivial to find the first `resource` element. –  Feb 20 '12 at 11:09
  • 2
    Is there a specific reason you are not using an XML parser and XPath for that matter? – stryba Feb 20 '12 at 11:09
  • 4
    http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Mat Feb 20 '12 at 11:10
  • Yes, there is. Using regex for some search and replace over a 200MB text file is quick to write, without the overhead of implementing a SAX parser. A DOM parser does not work because of memory usage. So my question was not "How can I solve my problem with an XML parser?", it was how I can change the regex so that it does what I want ;) – Nabor Feb 20 '12 at 11:17
  • I got it... `` – Nabor Feb 20 '12 at 12:07

2 Answers


As Mr E points out, this is not the best way to read data from an XML file at all, not to mention what happens if you suddenly have to deal with nested elements! However, this will match the name attribute of the propertyValue inside a resource:

    <resource.+?<propertyValue name=(["'])([^"']*)\1.+?</resource>
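A rough sketch of how that expression could be used from Java: group 1 captures the quote character and group 2 the value of the `name` attribute, so the caller can inspect group 2 for every matched resource. The class name `PropertyValueNames`, the method `namesPerResource` and the tiny input are assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PropertyValueNames {
    // The expression from the answer: group 1 = quote character, group 2 = name value.
    static final Pattern RESOURCE_WITH_NAME = Pattern.compile(
            "<resource.+?<propertyValue name=([\"'])([^\"']*)\\1.+?</resource>",
            Pattern.DOTALL);

    // Collects the first propertyValue name found in each matched resource.
    static List<String> namesPerResource(String xml) {
        List<String> names = new ArrayList<>();
        Matcher m = RESOURCE_WITH_NAME.matcher(xml);
        while (m.find()) {
            names.add(m.group(2));
        }
        return names;
    }

    public static void main(String[] args) {
        // Hypothetical, shortened input in the shape of the question's sample.
        String xml =
              "<resource name=\"a.txt\">\n  <propertyValue name=\"RCSA\"/>\n</resource>\n"
            + "<resource name=\"b.txt\">\n  <propertyValue name=\"TRA\"/>\n</resource>\n"
            + "<resource name=\"c.txt\">\n  <propertyValue name='RCSA'/>\n</resource>\n";
        System.out.println(namesPerResource(xml)); // prints [RCSA, TRA, RCSA]
    }
}
```

This only stays inside one element as long as every resource contains at least one propertyValue. If a resource has none, the lazy `.+?` runs on into the following resource, just as in the question.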
Aram Kocharyan
  • I don't need the content of the name attribute. I want to replace the whole resource element if the name attribute of the property value is RCSA. – Nabor Feb 20 '12 at 11:19
  • 1
    Ah I see. In any case, consider using an XML parser and traversing the children. Even if you get it working this way, it isn't a long-term solution and is doomed to fail on some XML files eventually. – Aram Kocharyan Feb 20 '12 at 11:26
  • The XML file is 200MB large. It has a lot of different tags that I have not mentioned here. So far I have used 5 different regexes to reduce the file or change some content. Writing an XML parser will take me hours... – Nabor Feb 20 '12 at 11:31
  • 1
    Luckily it's been done before: http://stackoverflow.com/questions/373833/best-xml-parser-for-java – Aram Kocharyan Feb 20 '12 at 11:33
  • 2
    Tempted to down-vote just for mentioning any option other than "use an xml parser" :) – Dmitri Feb 20 '12 at 11:39
  • Consider even the slightest change, say they have a space between the `/` or something, then you'd need a more complicated regex. Definitely leave it to the parsers for this. – Aram Kocharyan Feb 20 '12 at 13:00

I solved it with this expression: `<resource(?:(?!<propertyValue).)+<propertyValue name="RCSA"(?:(?!<resource).)+</resource>`, but it's too slow. So I looked around at what else can be done in Java and found an easy and fast solution.
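For reference, the behaviour of that first expression can be checked with a small sketch before moving on to the faster approach. The class name `TemperedDotDemo`, the `resource` helper and the shortened sample are assumptions made for this illustration. Each `(?:(?!...).)+` group consumes one character at a time, but only while the lookahead does not see the forbidden text, which keeps a match inside one element. It is also a likely reason for the slowness, since the lookahead runs at every character.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TemperedDotDemo {
    // The expression from the answer, written as a Java string literal.
    static final Pattern TEMPERED = Pattern.compile(
            "<resource(?:(?!<propertyValue).)+<propertyValue name=\"RCSA\""
          + "(?:(?!<resource).)+</resource>",
            Pattern.DOTALL);

    // Hypothetical helper: builds one <resource> element shaped like the sample.
    static String resource(String file, String type) {
        return "<resource name=\"" + file + "\" inheritAcls=\"true\">\n"
             + "  <property name=\"Evaluation Type\">\n"
             + "    <propertyValue name=\"" + type + "\"/>\n"
             + "  </property>\n"
             + "</resource>\n";
    }

    // Returns the value of the name attribute of every matched resource.
    static List<String> matchedFiles(String xml) {
        List<String> files = new ArrayList<>();
        Matcher m = TEMPERED.matcher(xml);
        while (m.find()) {
            String match = m.group();
            int start = match.indexOf("name=\"") + "name=\"".length();
            files.add(match.substring(start, match.indexOf('"', start)));
        }
        return files;
    }

    public static void main(String[] args) {
        String xml = resource("Eval_01241.txt", "RCSA")
                   + resource("Eval_01481.txt", "TRA")
                   + resource("Eval_01362.txt", "RCSA");
        System.out.println(matchedFiles(xml)); // prints [Eval_01241.txt, Eval_01362.txt]
    }
}
```

Because the first group refuses to step over any `<propertyValue`, the expression only matches when the RCSA value is the first propertyValue in its resource.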

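    // needs: import java.util.regex.Matcher; import java.util.regex.Pattern;
    // Stage 1: a cheap pattern that only isolates each <resource> element.
    Pattern p = Pattern.compile("<resource name=.+?</resource>",
            Pattern.MULTILINE | Pattern.DOTALL);
    String in = getStringFromFile(path, name, pre, count);
    System.out.println("Length: " + in.length());
    Matcher m = p.matcher(in);
    StringBuffer sb = new StringBuffer();
    int c = 0;
    while (m.find()) {
        // quoteReplacement keeps any '$' or '\' in the kept text literal
        m.appendReplacement(sb,
                Matcher.quoteReplacement(getReplacementStage1(m, c++)));
    }
    // copy whatever follows the last match
    m.appendTail(sb);
    writeStringToFile(path, name, pre, count, sb.toString());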

So first I use a simpler and faster regex that only isolates each resource element. Then, instead of using String.replaceAll, I drive the Matcher myself, which gives me the chance to calculate the replacement for every match.

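    // Compiled once; compiling inside the method would repeat the work for every match.
    private static final Pattern EVALUATION_RESOURCE = Pattern.compile(
            "<resource[^>]*?contentType=\"Evaluation\"",
            Pattern.MULTILINE | Pattern.DOTALL);
    private static final Pattern RCSA_OR_TRA = Pattern.compile(
            "<propertyValue name=\"(?:RCSA|TRA)\"",
            Pattern.MULTILINE | Pattern.DOTALL);

    // c is the running match index; it is not used in this stage.
    private static String getReplacementStage1(Matcher m, int c) {
        String resource = m.group();
        if (!EVALUATION_RESOURCE.matcher(resource).find()) {
            // not an Evaluation resource: remove
            return "";
        }
        if (RCSA_OR_TRA.matcher(resource).find()) {
            // contains an RCSA or TRA propertyValue: remove
            return "";
        }
        // no change, return the matched text
        return resource;
    }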
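The two snippets above can be condensed into a self-contained sketch that runs without the file helpers. Everything here is an assumption made for illustration: the class name `TwoStageReplace`, the `rewrite` and `dropUnwanted` methods, and the in-memory input. The `contentType` check is left out to keep it short. The stray `$1` in the input shows why the replacement goes through `Matcher.quoteReplacement`: without it, `appendReplacement` would read `$1` as a reference to capture group 1 instead of literal text.

```java
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TwoStageReplace {
    // Stage 1: isolate each <resource> element with a cheap lazy match.
    private static final Pattern RESOURCE =
            Pattern.compile("<resource name=.+?</resource>", Pattern.DOTALL);
    // Stage 2: inspect a single element.
    private static final Pattern UNWANTED =
            Pattern.compile("<propertyValue name=\"(?:RCSA|TRA)\"");

    // Replaces every resource element with whatever decide returns for it.
    static String rewrite(String input, Function<String, String> decide) {
        Matcher m = RESOURCE.matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            m.appendReplacement(sb, Matcher.quoteReplacement(decide.apply(m.group())));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    // Drops elements that contain an RCSA or TRA value, keeps the rest unchanged.
    static String dropUnwanted(String resource) {
        return UNWANTED.matcher(resource).find() ? "" : resource;
    }

    public static void main(String[] args) {
        String in = "<resource name=\"a\"><propertyValue name=\"RCSA\"/></resource>\n"
                  + "<resource name=\"b\"><propertyValue name=\"OTHER $1\"/></resource>\n";
        // Only resource "b" survives; the line break after the removed element stays.
        System.out.print(rewrite(in, TwoStageReplace::dropUnwanted));
    }
}
```

Because the stage 2 pattern only ever sees one element at a time, it cannot run across a `</resource>` boundary, which is what the single large expression had to guard against.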

Maybe this solution helps somebody with a similar problem who doesn't want or need an XML parser...

Nabor