
We receive an XML string in which we need to sanitize the value of only one attribute before unmarshalling it. The problem is that the XML is very loosely typed, so there is no guarantee that the attributes appear in any particular order or are present at all.

<message>
 <set name=".." value="garbled string" type="name" />
 <set age=".." value="32" />
 <set something=".." value="value=\"\"\"\"" />
 ..
</message>

In this string I need to apply a pattern that picks out only the contents of each value attribute, escapes any special characters in it (StringEscapeUtils.escapeXml()) and writes the escaped text back in place. A value that itself contains the text "value" should not cause the pattern to mismatch.

Please help.

teobais
Ashish
  • Applying regex to XML (or similar non-regular problem domains) is a recipe for disaster. Better use an XML parser. – Thomas Nov 20 '15 at 09:08
  • Thanks @Thomas, but an XML parser will simply either accept or reject the string it is given. What I need to do is escape any special characters within the value attribute and then parse it. Do you have an example I can use that shows how to do this without regex? – Ashish Nov 20 '15 at 09:12
  • If the XML you get is really that crappy, it's _really_ hard to come up with a regex that works in all cases (assume things like `name="value="` etc.). So if the XML isn't even valid and thus makes parsers fail, I'd first try to talk to the sender. – Thomas Nov 20 '15 at 09:17
  • I agree with Thomas. What you are receiving is _not_ XML at all, it seems; [this answer](http://stackoverflow.com/a/6023837/1987598), for example, elaborates on this. – Mathias Müller Nov 20 '15 at 09:56

2 Answers


You can use the regular expression (?<=value\=")(?:[^"\\<]|\\"|\\\\|\\<)++(?=") in combination with Matcher#find() to find the contents of every value attribute in the XML.

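import java.util.regex.Matcher;
import java.util.regex.Pattern;
// Assumption: Commons Lang 3; with Commons Lang 2.x the package is org.apache.commons.lang.
import org.apache.commons.lang3.StringEscapeUtils;

String input = "<message>\n <set name=\"..\" value=\"garbled string\" type=\"name\" />\n <set age=\"..\" value=\"32\" />\n <set something=\"..\" value=\"value=\\\"\\\"\\\"\\\"\" />\n ..\n</message>";
// Matches the contents of value="...", allowing \" and \\ and \< as backslash escapes.
Pattern pattern = Pattern.compile("(?<=value\\=\")(?:[^\"\\\\<]|\\\\\"|\\\\\\\\|\\\\<)++(?=\")");
Matcher matcher = pattern.matcher(input);
StringBuilder convertedInput = new StringBuilder();

// End index of the previous match; the text between matches is copied unchanged.
int trailing = 0;
while (matcher.find()) {
    String value = matcher.group();
    String convertedValue = StringEscapeUtils.escapeXml(value);

    convertedInput.append(input, trailing, matcher.start());
    convertedInput.append(convertedValue);

    trailing = matcher.end();
}

// Copy whatever follows the last match (nothing is appended if there is none).
convertedInput.append(input, trailing, input.length());

System.out.println(convertedInput);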

When run, convertedInput should contain the input with the contents of every value attribute escaped; what exactly gets escaped depends on StringEscapeUtils#escapeXml(String). I added < to the characters that must not appear in a value without a backslash escape because otherwise attributes like name="value=" (thanks to @Thomas for pointing this out in a comment) would cause the regular expression to go haywire.

For details on the regular expression used, please visit this link.
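For readers who want to try the approach without the Commons Lang dependency, here is a minimal sketch of the same loop packaged as a helper. The class name ValueAttributeEscaper and both method names are made up for illustration, and the local escapeXml is only an assumed stand-in that escapes the five predefined XML entities; it is not claimed to behave exactly like StringEscapeUtils.escapeXml(). The pattern is the one from the code above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; names are illustrative only.
class ValueAttributeEscaper {

    // Same pattern as in the answer: contents of value="..." with \" \\ \< allowed.
    private static final Pattern VALUE = Pattern.compile(
            "(?<=value\\=\")(?:[^\"\\\\<]|\\\\\"|\\\\\\\\|\\\\<)++(?=\")");

    // Assumed stand-in for StringEscapeUtils.escapeXml: escapes & < > " ' only.
    static String escapeXml(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    // Replaces the contents of every matched value attribute with its escaped form.
    static String escapeValueAttributes(String input) {
        Matcher m = VALUE.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // quoteReplacement keeps '\' and '$' in the escaped text literal.
            m.appendReplacement(out, Matcher.quoteReplacement(escapeXml(m.group())));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeValueAttributes("<set name=\"..\" value=\"a & b\" />"));
    }
}
```

Matcher#appendReplacement and Matcher#appendTail do the same bookkeeping as the trailing index in the loop above. Note that backslashes inside a value are left as they are; only the characters handled by escapeXml change.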

mezzodrinker

I had to do something like that recently (i.e. encode special characters so that the unmarshaller/parser can do its job). The solution I came up with is the following:

  • Use a streaming parser (I used Woodstox)
  • Give the streaming parser a custom java.io.FilterReader
  • Implement the FilterReader's read method so that it encodes the special characters as they are read, i.e. something like this:

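    // Assumes a scratch field in the subclass, e.g. private final char[] myBuffer = new char[1024];
    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
    
        // Worst case every character read is '&', which expands to the 5 characters "&amp;",
        // so read at most len / 5 characters. Caveat: for len < 5 this reads nothing and returns 0.
        int charsWithoutEntity = Math.min(len / 5, myBuffer.length);
        // Fill the scratch buffer from its start; 'off' only applies to cbuf.
        int read = super.read(myBuffer, 0, charsWithoutEntity);
        int j = off;
    
        for (int i = 0; i < read; i++, j++) {
    
            cbuf[j] = myBuffer[i];
            if (myBuffer[i] == '&') {
                cbuf[++j] = 'a';
                cbuf[++j] = 'm';
                cbuf[++j] = 'p';
                cbuf[++j] = ';';
            }
        }
    
        // Number of characters written to cbuf, or -1 at end of stream.
        return read > 0 ? j - off : read;
    }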
    

The reasons I chose a streaming parser are unrelated to this problem, and I'm pretty sure you can pass the FilterReader to JAXB's Unmarshaller, so the same solution may also apply if you don't want or need to use a parser.
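As a rough, self-contained sketch of the FilterReader idea, the class below replaces every '&' coming from the wrapped reader with "&amp;". The class name AmpersandEscapingReader and the readAll helper are hypothetical, and this is a variation rather than the code I actually used: instead of sizing the read to the worst case, it keeps the not-yet-delivered part of an entity in a small pending field, so it also works when the caller's buffer is tiny. Like the snippet above, it escapes every ampersand, including ones that already start a valid entity, and it does not adjust skip() or mark()/reset().

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical class name; replaces every '&' from the wrapped reader with "&amp;".
class AmpersandEscapingReader extends FilterReader {
    private static final String ENTITY = "&amp;";
    // Entity currently being emitted, and how much of it has been handed out.
    private String pending = "";
    private int pendingPos = 0;

    AmpersandEscapingReader(Reader in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        if (pendingPos < pending.length()) {
            return pending.charAt(pendingPos++); // remainder of "&amp;"
        }
        int c = super.read();
        if (c == '&') {
            pending = ENTITY;
            pendingPos = 1; // the '&' itself is returned right now
        }
        return c;
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        int written = 0;
        while (written < len) {
            // After the first character, stop rather than block on the source.
            boolean entityDone = pendingPos >= pending.length();
            if (written > 0 && entityDone && !in.ready()) {
                break;
            }
            int c = read();
            if (c == -1) {
                break;
            }
            cbuf[off + written++] = (char) c;
        }
        return written == 0 ? -1 : written;
    }

    // Demo helper: drains a reader through a deliberately tiny buffer.
    static String readAll(Reader reader) {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[3];
        try {
            int n;
            while ((n = reader.read(buf, 0, buf.length)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Reader source = new StringReader("<set value=\"Tom & Jerry\" />");
        System.out.println(readAll(new AmpersandEscapingReader(source)));
    }
}
```

An instance can be handed to anything that accepts a java.io.Reader, which is why the parser or unmarshaller does not need to know about the escaping.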

francesco foresti