Below is a sample of the XML I receive. I need to strip a few special characters from the values before sending it on (xmlString.replaceAll("[^A-Za-z0-9#&',-.]", "")); see the last element in the sample (Addr1) for an example.

Is there a way to iterate over each node (the XML attribute/node names are not fixed), apply the regex only to the value part, and rebuild the XML?

Converting the XML to a string and applying the regex doesn't always work.

I'm open to any approach in Java.

<AccountNumberId>JY00000830</AccountNumberId>
<XYZ:CompanyCd>DOC</XYZ:CompanyCd>
<XYZ:MultiPolicyDiscountCd>0</XYZ:MultiPolicyDiscountCd>
<QuestionAnswer>
<QuestionCd>XYZ:1</QuestionCd>
<YesNoCd>No</YesNoCd>
</QuestionAnswer>
<TransactionSeqNumber/>
<PersApplicationInfo>
<ApplicationWrittenDt>2023-02-26</ApplicationWrittenDt>
<KnownSinceDt>2007-02-05</KnownSinceDt>
</PersApplicationInfo>
<XYZ:TaxExemptionInd>0</XYZ:TaxExemptionInd>
</PersPolicy>
<Location id="LOC-1">
<ItemIdInfo>
<XYZ:FixedId>8001</XYZ:FixedId>
</ItemIdInfo>
<Addr>
<Addr1>1234 $$$RIVERWOOD !!<GATE SUITE> 136</Addr1>
...
var escapedXml = StringEscapeUtils.escapeXml(xmlString);
var replaceSplChars = escapedXml
  .replaceAll("[^A-Za-z0-9#&',-.\n</>]", "")
  .replace("\t", "");
var toXML = StringEscapeUtils.unescapeXml(replaceSplChars);

The approach above doesn't help: the XML has prefixed names such as "<XYZ:", and I end up removing the ":".

rzwitserloot
manrk

1 Answer

The first argument to replaceAll is a regular expression pattern. The 'regular' in 'regular expression' refers to an entire class of grammars. The point being:

If a grammar is not regular, then regular expressions cannot be used to read/modify anything written in that grammar!

And XML is not regular. Hence, you can't do this. At all. No matter what regex you come up with, I can create valid XML, conforming to whatever XML-based spec you use, that your regex will not properly parse or modify.

The solution involves one of two options:

  1. Use an actual XML parser to read this data. Here is a tutorial that covers all the popular ones.
  2. Narrow the stated purpose of your code. 'Reads in some XML and makes these modifications to it' cannot be done with regular expressions, so be specific instead: 'Reads one very specifically formatted kind of XML and makes these modifications to it, but will do arbitrary weird stuff and mangle your XML if you send valid XML that doesn't fit the rules set forth in this spec.'

Perhaps you want option 2, but then you have to update the question and list precisely what you have in mind. Option 2 is a very bad idea, though: saying your app takes XML strongly suggests that any valid XML is fine, so it is going to end up confusing somebody if the app actually accepts only very specifically formatted XML.
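For option 1, here is a minimal sketch using only the DOM and Transformer classes that ship with the JDK. It walks every node, applies the question's character class to text and attribute values only, and serializes the tree again, so element names such as XYZ:CompanyCd are never touched. The class and method names (XmlValueCleaner, cleanXml) are made up for illustration, and it assumes the input is well-formed and that the XYZ prefix is declared somewhere (the urn:example URI in main is a placeholder). Note that the question's pattern also strips spaces, since a space is not in the allowed set.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library.
class XmlValueCleaner {
    // Character class taken from the question: anything outside it is removed.
    private static final Pattern DISALLOWED = Pattern.compile("[^A-Za-z0-9#&',-.]");

    static String clean(String value) {
        return DISALLOWED.matcher(value).replaceAll("");
    }

    // Recursively visits every node; only text and attribute values are rewritten.
    static void cleanValues(Node node) {
        short type = node.getNodeType();
        if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
            node.setNodeValue(clean(node.getNodeValue()));
        }
        NamedNodeMap attrs = node.getAttributes(); // null for non-element nodes
        if (attrs != null) {
            for (int i = 0; i < attrs.getLength(); i++) {
                Node attr = attrs.item(i);
                String name = attr.getNodeName();
                // Leave namespace declarations alone; their values are URIs.
                if (name.equals("xmlns") || name.startsWith("xmlns:")) {
                    continue;
                }
                attr.setNodeValue(clean(attr.getNodeValue()));
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            cleanValues(child);
        }
    }

    static String cleanXml(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            cleanValues(doc.getDocumentElement());

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            // Input that is not well-formed ends up here; the parser rejects it.
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<Root xmlns:XYZ=\"urn:example\"><Location id=\"LOC-1!\">"
                + "<XYZ:FixedId>80$01</XYZ:FixedId>"
                + "<Addr1>1234 $$$RIVERWOOD !!</Addr1></Location></Root>";
        System.out.println(cleanXml(xml));
    }
}
```

This only works on XML a parser will accept. The sample in the question, with its raw <GATE SUITE> inside Addr1, would make cleanXml throw.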

Note that the XML you pasted isn't valid; that <GATE SUITE> part is wrong. Whatever code is making this XML is broken, probably because you made the same mistake there: using basic text processing code such as .substring, string concatenation, and regular expressions to make XML. With an actual XML builder, this would never happen. Instead of layering mistake on top of mistake, go back to the erroneous code that made this broken XML and fix it there.
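To illustrate what "an actual XML builder" buys you, here is a small sketch that builds the Addr element with the JDK's DOM API rather than string concatenation. The class and method names (AddrXmlBuilder, buildAddr, readAddr1) are hypothetical, and I'm assuming the producing side is Java too. The serializer escapes the < in the address, so the result is well-formed and the original text comes back unchanged when parsed.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library.
class AddrXmlBuilder {
    // Builds <Addr><Addr1>...</Addr1></Addr>; the text is escaped on output.
    static String buildAddr(String addr1) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element addr = doc.createElement("Addr");
            Element line = doc.createElement("Addr1");
            line.setTextContent(addr1); // raw text, no manual escaping needed
            addr.appendChild(line);
            doc.appendChild(addr);

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not build XML", e);
        }
    }

    // Parses the generated XML back and returns the Addr1 text.
    static String readAddr1(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getElementsByTagName("Addr1").item(0).getTextContent();
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = buildAddr("1234 $$$RIVERWOOD !!<GATE SUITE> 136");
        System.out.println(xml);            // the '<' is written as an entity
        System.out.println(readAddr1(xml)); // the original address text
    }
}
```

If the characters must also be removed, not just escaped, do that on the plain string before handing it to setTextContent.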

If you must fix this specific stuff, your only real option is to scan for <Addr1> and </Addr1> and apply your modifications solely to the stuff within, using e.g. .substring. Given that the XML is invalid, you can't parse it as XML (the parser would just throw an exception at you, correctly), and this way at least you've confined your likely-going-to-cause-issues modifications to a smaller section. That's a matter of last resort and requires a ton of comments explaining that you're working around an existing large problem and that this code is likely to break sooner rather than later.
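A rough sketch of that last-resort workaround, using indexOf and substring with the question's character class. The names (Addr1Workaround, cleanAddr1) are made up, and it assumes the tags are spelled exactly <Addr1> and </Addr1>, with no attributes and no nesting.

```java
import java.util.regex.Pattern;

// WORKAROUND: the upstream producer emits malformed XML (raw '<' and '>'
// inside Addr1), so the input cannot be parsed as XML. This is plain text
// processing and is expected to break if the upstream format changes.
class Addr1Workaround {
    private static final String OPEN = "<Addr1>";
    private static final String CLOSE = "</Addr1>";
    // Character class taken from the question: anything outside it is removed.
    private static final Pattern DISALLOWED = Pattern.compile("[^A-Za-z0-9#&',-.]");

    static String cleanAddr1(String raw) {
        StringBuilder out = new StringBuilder(raw.length());
        int pos = 0;
        while (true) {
            int start = raw.indexOf(OPEN, pos);
            if (start < 0) {
                break; // no more Addr1 elements
            }
            int contentStart = start + OPEN.length();
            int end = raw.indexOf(CLOSE, contentStart);
            if (end < 0) {
                break; // unterminated element: leave the rest untouched
            }
            // Copy everything up to and including the opening tag unchanged.
            out.append(raw, pos, contentStart);
            // Clean only the text between the tags.
            out.append(DISALLOWED.matcher(raw.substring(contentStart, end)).replaceAll(""));
            out.append(CLOSE);
            pos = end + CLOSE.length();
        }
        out.append(raw, pos, raw.length()); // remainder after the last match
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "<Addr>\n<Addr1>1234 $$$RIVERWOOD !!<GATE SUITE> 136</Addr1>\n"
                + "<XYZ:FixedId>8001</XYZ:FixedId></Addr>";
        System.out.println(cleanAddr1(raw));
    }
}
```

Everything outside Addr1, including the XYZ: prefixes, is copied through verbatim. Be aware that the question's pattern keeps &, so a bare & inside Addr1 would still leave the output malformed.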

rzwitserloot