I am trying to replace elements within an XML string in the fastest, most efficient way possible. Consider the code:

final String rawXml = "
<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?>
<cnpOnlineRequest merchantId=\"017872345\" merchantSdk=\"Java;12.0.0\" version=\"12.0\" xmlns=\"http://www.vantivcnp.com/schema\">
    <authentication>
        <user>AUSER</user>
        <password>pa5Sw0rd!</password>
    </authentication>
    <authorization reportGroup=\"Default Report Group\" id=\"87654321\">
        <orderId>Merchant Order Id</orderId>
        <amount>1299</amount>
        <orderSource>ecommerce</orderSource>
        <billToAddress>
            <addressLine1>5 Some Road</addressLine1>
            <city>Townsville</city>
            <state>Alabama</state>
            <zip>31431</zip>
            <country>US</country>
        </billToAddress>
        <card>
            <type>VI</type>
            <number>1234123412341234</number>
            <expDate>0718</expDate>
            <cardValidationNum>999</cardValidationNum>
            <pin>1234</pin>
        </card>
    </authorization>
</cnpOnlineRequest>";

final String expectedXml = "
<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?>
<cnpOnlineRequest merchantId=\"017872345\" merchantSdk=\"Java;12.0.0\" version=\"12.0\" xmlns=\"http://www.vantivcnp.com/schema\">
    <authentication>
        <user>AUSER</user>
        <password>---sanitised---</password>
    </authentication>
    <authorization reportGroup=\"Default Report Group\" id=\"87654321\">
        <orderId>Merchant Order Id</orderId>
        <amount>1299</amount>
        <orderSource>ecommerce</orderSource>
        <billToAddress>
            <addressLine1>5 Some Road</addressLine1>
            <city>Townsville</city>
            <state>Alabama</state>
            <zip>31431</zip>
            <country>US</country>
        </billToAddress>
        <card>
            <type>VI</type>
            <number>---sanitised---</number>
            <expDate>0718</expDate>
            <cardValidationNum>---sanitised---</cardValidationNum>
            <pin>---sanitised---</pin>
        </card>
    </authorization>
</cnpOnlineRequest>";

final String[] elements = { "password", "number", "cardValidationNum", "pin" };
final Map<String,String> replacements = new LinkedHashMap<>();
for (final String element : elements) {
    final String regexp = String.format("<%s>.*</%s>", element, element);
    final String replacement = String.format("<%s>---sanitised---</%s>", element, element);
    replacements.put(regexp, replacement);
}
final String regexp = "%(" + StringUtils.join(replacements.keySet(), "|") + ")%";

final Pattern pattern = Pattern.compile(regexp, Pattern.DOTALL);
final Matcher matcher = pattern.matcher(rawXml);
final StringBuffer buffer = new StringBuffer();
while (matcher.find()) {
    matcher.appendReplacement(buffer, replacements.get(matcher.group(1)));
}
final String sanitisedXml = buffer.toString();

assertThat(sanitisedXml, equalTo(expectedXml));

The problem is that find() never matches anything, so the buffer stays empty and the assert fails. I have also tried replacing "%(" and ")%" with ".*(" and ").*"; find() then matches, but there is only one group, and it contains the entire string.

Clarification: This must be fast, and there are more elements than those listed. I want to parse the string only once, so calling replaceAll once per regexp and replacement is not an option. Neither is unmarshalling the XML into an object, replacing the values in code, and then marshalling the object back into XML.
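
For reference, here is a minimal sketch of a single-pass variant of the code above. Two things appear to go wrong in the original: "%" has no special meaning in a Java regex, so "%(" and ")%" demand literal percent signs that the XML does not contain, and replacements.get(matcher.group(1)) looks up the matched text rather than the pattern that produced it. The sketch captures the element name in group 1 and reuses it through a back-reference. The class name XmlSanitiser and the method sanitise are hypothetical, and it assumes the listed elements carry no attributes and hold plain text with no "<" inside the value.

```java
import java.util.regex.Pattern;

final class XmlSanitiser {
    // Element names from the example; assumed to contain no regex metacharacters.
    private static final String[] ELEMENTS = { "password", "number", "cardValidationNum", "pin" };

    // One alternation over all element names. Group 1 captures the name and
    // \1 requires the closing tag to use the same name. [^<]* stops at the
    // first '<', so a match cannot run past the element's own closing tag.
    private static final Pattern SENSITIVE = Pattern.compile(
            "<(" + String.join("|", ELEMENTS) + ")>[^<]*</\\1>");

    // Scans the input once and rewrites every matching element.
    static String sanitise(final String xml) {
        return SENSITIVE.matcher(xml).replaceAll("<$1>---sanitised---</$1>");
    }
}
```

Called as XmlSanitiser.sanitise(rawXml), this scans the string once no matter how many element names are in the list. Compiling the pattern once and reusing it avoids repeating that cost on every call.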

Marcus MacWilliam
  • Please look into using an XML parser. Regex will not get you very far with nested XML content like this. – Tim Biegeleisen Feb 18 '18 at 14:44
  • It must be fast and efficient, I cannot unmarshal the XML into an object, then replace the values, then marshal the object back into XML. – Marcus MacWilliam Feb 18 '18 at 14:49
  • I could have simply used rawXml.replaceAll(regexp, replacement); for each of the elements I want to sanitise, but this is not efficient, as each replaceAll will parse the entire string. Also there are quite a few more elements than are listed in the small example. – Marcus MacWilliam Feb 18 '18 at 14:51
  • Also, forget that the string is XML. I am attempting to parse a string, find multiple matching substrings, and replace them. If it helps, the list of elements is in the order that they appear in the string. – Marcus MacWilliam Feb 18 '18 at 14:59
  • Really? 4 comments and no one has posted this yet? OK, I'll do the honours... [H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ](https://stackoverflow.com/a/1732454/1954610) Parsing XML is not really a slow thing to do... Only use regex if it's a very well-defined "quick hack". – Tom Lord Feb 18 '18 at 15:14
  • Do you have to use Java? The right tool for XML transformations is [XSLT](https://stackoverflow.com/questions/11772125/how-to-modify-xml-file-using-xslt) – Sharon Ben Asher Feb 18 '18 at 15:20
  • You can’t parse XML with regular expressions. XML is not simple enough for it. See https://stackoverflow.com/questions/701166/can-you-provide-some-examples-of-why-it-is-hard-to-parse-xml-and-html-with-a-reg. – VGR Feb 18 '18 at 15:57
  • @SharonBenAsher Java [supports XSLT](https://docs.oracle.com/javase/9/docs/api/javax/xml/transform/TransformerFactory.html#newTransformer-javax.xml.transform.Source-). – VGR Feb 18 '18 at 15:59
  • OK, if regex is not suitable to do what I need to do, can someone provide a simple example of how I can sanitise some elements in my XML string, very quickly. – Marcus MacWilliam Feb 18 '18 at 16:21
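
As one possible shape for the parser-based route suggested in the comments, here is a minimal sketch using the JDK's streaming StAX API (javax.xml.stream). It copies events from a reader to a writer and swaps the text of the listed elements, so no object model is built. The class name StaxSanitiser is hypothetical, the sketch assumes the sensitive elements hold only text, and the output may differ from the input in details such as the XML declaration. Whether it is fast enough would need measuring against the real payloads.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

final class StaxSanitiser {
    // Local names of the elements to mask, taken from the example.
    private static final Set<String> SENSITIVE =
            new HashSet<>(Arrays.asList("password", "number", "cardValidationNum", "pin"));

    static String sanitise(final String xml) throws XMLStreamException {
        final XMLEventReader reader =
                XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
        final StringWriter out = new StringWriter(xml.length());
        final XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out);
        final XMLEventFactory events = XMLEventFactory.newInstance();

        boolean masking = false; // true while inside a sensitive element
        while (reader.hasNext()) {
            final XMLEvent event = reader.nextEvent();
            if (event.isStartElement()) {
                masking = SENSITIVE.contains(event.asStartElement().getName().getLocalPart());
                writer.add(event);
                if (masking) {
                    // Emit the placeholder in place of the original text.
                    writer.add(events.createCharacters("---sanitised---"));
                }
            } else if (event.isEndElement()) {
                masking = false;
                writer.add(event);
            } else if (!(masking && event.isCharacters())) {
                // Copy everything except the text of a sensitive element.
                writer.add(event);
            }
        }
        writer.close();
        return out.toString();
    }
}
```

Unlike a regex, this matches on the element's local name, so it tolerates attributes on the sensitive elements and namespace prefixes.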

0 Answers