Hi, I need to write a regular expression in Java that will find all instances of:

wsp:rsidP="005816D6" wsp:rsidR="005816D6" wsp:rsidRDefault="005816D6" 

attributes in an XML string and strip them out.

So I need to rip out every attribute that starts with wsp:rsid and ends with a double quote (").

Thoughts on this:

  1. String str = xmlstring.replaceAll("wsp:rsid/w", "");
  2. String str = xmlstring.replaceAll("wsp:rsid[]\\"", "");
Bohemian
nicordesigns

4 Answers

xmlstring.replaceAll( "wsp:rsid\\w*?=\".*?\"", "" );

This works in my tests...

public void testReplaceAll() throws Exception {
    String regex = "wsp:rsid\\w*?=\".*?\"";

    assertEquals( "", "wsp:rsidP=\"005816D6\"".replaceAll( regex, "" ) );
    assertEquals( "", "wsp:rsidR=\"005816D6\"".replaceAll( regex, "" ) );
    assertEquals( "", "wsp:rsidRDefault=\"005816D6\"".replaceAll( regex, "" ) );
    assertEquals( "a=\"1\" >", "a=\"1\" wsp:rsidP=\"005816D6\">".replaceAll( regex, "" ) );
    assertEquals(
            "bob   kuhar",
            "bob wsp:rsidP=\"005816D6\" wsp:rsidRDefault=\"005816D6\" kuhar".replaceAll( regex, "" ) );
    assertEquals(
            " keepme=\"yes\" ",
            "wsp:rsidP=\"005816D6\" keepme=\"yes\" wsp:rsidR=\"005816D6\"".replaceAll( regex, "" ) );
    assertEquals(
            "<node a=\"l\"  b=\"m\"  c=\"r\">",
            "<node a=\"l\" wsp:rsidP=\"0\" b=\"m\" wsp:rsidR=\"0\" c=\"r\">".replaceAll( regex, "" ) );
    // Sadly doesn't handle the embedded \" case...
    // assertEquals( "", "wsp:rsidR=\"hello\\\"world\"".replaceAll( regex, "" ) );
}
Bob Kuhar
  • it works now. regex made non-greedy as per Bohemian derision er suggestion. – Bob Kuhar Dec 31 '11 at 01:29
  • I'll remove the -1 when a) you remove yours, and b) you fix your answer (the first regex is still the old broken one) – Bohemian Dec 31 '11 at 03:55
  • My answer works, or at least meets the requirements of the original question. I don't see a need for it being edited. I really don't care much about the -1. – Bob Kuhar Dec 31 '11 at 17:58
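
For readers wondering what "made non-greedy" changes in practice, here is a minimal sketch. It assumes the earlier version of the regex used a greedy `.*` for the value (the thread does not show it), and the class name `GreedyVsLazyDemo` and its members are made up for illustration. The input is the multi-attribute string from the test above.

```java
public class GreedyVsLazyDemo {
    // Assumed earlier form: greedy ".*" runs on to the LAST double quote in the input.
    public static final String GREEDY = "wsp:rsid\\w*=\".*\"";
    // Form used in this answer: lazy ".*?" stops at the first closing quote.
    public static final String LAZY = "wsp:rsid\\w*?=\".*?\"";

    public static String strip(String xml, String regex) {
        return xml.replaceAll(regex, "");
    }

    public static void main(String[] args) {
        String xml = "<node a=\"l\" wsp:rsidP=\"0\" b=\"m\" wsp:rsidR=\"0\" c=\"r\">";
        // Greedy swallows b="m" and c="r" along with the rsid attributes.
        System.out.println(strip(xml, GREEDY)); // <node a="l" >
        // Lazy removes only the two rsid attributes.
        System.out.println(strip(xml, LAZY));   // <node a="l"  b="m"  c="r">
    }
}
```

The greedy form deletes everything between the first `wsp:rsid` attribute and the last quote in the string, which is why the non-greedy quantifier (or a negated class such as `[^\"]*`) matters here.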

Try:

xmlstring.replaceAll("\\bwsp:rsid\\w*=\"[^\"]+(\\\\\"[^\"]*)*\"", "");

Also, your regexes are wrong. I suggest you go and plough through http://regular-expressions.info ;)

fge
  • don't you mean `"\\bwsp:rsid=\"[^\"]+\""`? – Bohemian Dec 30 '11 at 19:08
  • No, since `rsid` can be followed by `R` for instance. – fge Dec 30 '11 at 20:09
  • Let me put it this way... your regex doesn't work. Test it yourself to see. – Bohemian Dec 30 '11 at 23:55
  • And your regex won't work with `wsp:rsidR="hello\"world"` either. Meh. My edited one will, however. – fge Dec 31 '11 at 00:32
  • FYI, to work with `"hello\"world"` you'd need an XML parser, which is beyond the scope of this question. See [You shouldn't try to parse HTML with regex](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) for why – Bohemian Dec 31 '11 at 03:53
  • Accommodating backslash-escaped quotes in the attribute value is no problem, but you need a lot more backslashes (this being Java): `\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"`. But I don't think the OP was asking about that; `\"[^\"]*\"` should suffice. – Alan Moore Dec 31 '11 at 06:11
  • @Bohemian no, it does not need an XML parser: `"[^"]*(\\"[^"]*)*"` will match it no problem. Try it. – fge Dec 31 '11 at 10:02
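
To make the escaped-quote discussion in these comments concrete, here is a small sketch comparing the simple `\"[^\"]*\"` value pattern with the escape-aware one Alan Moore quotes. The class name `EscapedQuoteDemo` and its members are hypothetical, and the input assumes backslash-escaped quotes really occur in the data; well-formed XML writes a quote inside a double-quoted attribute value as `&quot;` instead.

```java
public class EscapedQuoteDemo {
    // Simple form: the value ends at the first double quote, escaped or not.
    public static final String SIMPLE = "\\bwsp:rsid\\w*=\"[^\"]*\"";
    // Escape-aware form: a backslash plus the character after it is consumed as one unit.
    public static final String ESCAPE_AWARE =
            "\\bwsp:rsid\\w*=\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"";

    public static String strip(String xml, String regex) {
        return xml.replaceAll(regex, "");
    }

    public static void main(String[] args) {
        // The text being processed is: <w:p wsp:rsidR="hello\"world" keep="yes">
        String xml = "<w:p wsp:rsidR=\"hello\\\"world\" keep=\"yes\">";
        // Simple form stops at the escaped quote and leaves debris behind.
        System.out.println(strip(xml, SIMPLE));       // <w:p world" keep="yes">
        // Escape-aware form removes the whole attribute.
        System.out.println(strip(xml, ESCAPE_AWARE)); // <w:p  keep="yes">
    }
}
```

If the values are plain hex identifiers like the ones in the question, the simple form should be enough.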

Here are two functions: `clean` does the replacement, and `extract` pulls out the attribute data (in case you want it).

Please excuse the style, I wanted you to be able to cut and paste the functions.

import java.util.HashMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;


public class Answer {

    // Compiled once and shared by both methods: group 1 is the attribute-name
    // suffix after "wsp:rsid", group 2 is the quoted value.
    private static final Pattern RSID_ATTRIBUTE = Pattern.compile("wsp:rsid(.+?)=\"(.+?)\"");

    public static HashMap<String, String> extract(String s){
        Matcher matcher = RSID_ATTRIBUTE.matcher(s);
        HashMap<String, String> hm = new HashMap<String, String>();

        // Record each suffix (e.g. "P", "R", "RDefault") together with its value
        while (matcher.find()){
            hm.put(matcher.group(1), matcher.group(2));
        }

        return hm;
    }

    public static String clean(String s){
        // Remove every matched attribute; the surrounding whitespace is left in place
        return RSID_ATTRIBUTE.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {

        System.out.print(clean("sadfasdfchri wsp:rsidP=\"005816D6\" foo=\"bar\" wsp:rsidR=\"005816D6\" wsp:rsidRDefault=\"005816D6\""));
        HashMap<String, String> m = extract("sadfasdfchri wsp:rsidP=\"005816D6\" foo=\"bar\" wsp:rsidR=\"005816D6\" wsp:rsidRDefault=\"005816D6\"");
        System.out.println("");

        //ripped off of http://stackoverflow.com/questions/1066589/java-iterate-through-hashmap
        for (String key : m.keySet()) {
            System.out.println("Key: " + key + ", Value: " + m.get(key));
        }

    }   

}

prints:

sadfasdfchri  foo="bar"

Key: RDefault, Value: 005816D6

Key: P, Value: 005816D6

Key: R, Value: 005816D6
Chris Everitt
  • This is an appalling solution... "more code" does not mean "better code". The "correct" answer is a one liner. – Bohemian Dec 30 '11 at 23:58
  • Most of it is boilerplate. Of course the answer is one line, actually just one regex. The correct implementation of the answer is not one line of code. We don't know what that is. I provided mine. – Chris Everitt Jan 06 '12 at 19:29

Unlike all other answers, this answer actually works!

xmlstring.replaceAll("\\bwsp:rsid\\w*?=\"[^\"]*\"", "");

Here's a test that fails with all other answers:

public static void main(String[] args) {
    String xmlstring = "<tag wsp:rsidR=\"005816D6\" foo=\"bar\" wsp:rsidRDefault=\"005816D6\">hello</tag>";
    System.out.println(xmlstring);
    System.out.println(xmlstring.replaceAll("\\bwsp:rsid\\w*?=\"[^\"]*\"", ""));
}

Output:

<tag wsp:rsidR="005816D6" foo="bar" wsp:rsidRDefault="005816D6">hello</tag>
<tag  foo="bar" >hello</tag>
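
The output above still contains the whitespace that used to sit in front of each removed attribute. If that matters, one possible variant (a sketch, not part of the original answer) replaces the leading `\\b` with `\\s+` so the preceding whitespace is removed too. It assumes every attribute is preceded by whitespace, which holds inside a well-formed XML tag; the class name `WhitespaceAwareStrip` is made up for the example.

```java
public class WhitespaceAwareStrip {
    // "\\s+" consumes the whitespace in front of each wsp:rsid attribute.
    public static final String REGEX = "\\s+wsp:rsid\\w*=\"[^\"]*\"";

    public static String strip(String xml) {
        return xml.replaceAll(REGEX, "");
    }

    public static void main(String[] args) {
        String xmlstring = "<tag wsp:rsidR=\"005816D6\" foo=\"bar\" wsp:rsidRDefault=\"005816D6\">hello</tag>";
        System.out.println(strip(xmlstring)); // <tag foo="bar">hello</tag>
    }
}
```

Like the original regex, this works on raw text, so it would also remove a matching sequence that happens to appear in element content rather than inside a tag.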
Bohemian
  • SO won't let me remove the -1 unless the answer takes an edit. Make an edit and I'll put it back. I still think your style lacks the objectivity that makes SO work so well. – Bob Kuhar Dec 31 '11 at 15:47