
I've got XML that has additional information, BLAH, in each tag. When creating the tags, I separate the extra info from the tag name with a constant (XML_SPLITTER, whose value is XMLSPLIT). I need to do this because I'm generating my XML from a JSON object and I can't have multiple keys with the same name, but the XML output can't contain that superfluous text.

For example:

....
<SetXMLSPLITBLAH>
    <Value>9</Value>
    <SetType>
        <Name>Foo</Name>
    </SetType>
</SetXMLSPLITBLAH>
...

So, after generating the XML, I go through and clean it. I'm trying to do it with a regex. I figure I want to remove everything on a line from the splitter onward and replace it with just the >.

let reg = new RegExp("<Set"+XML_SPLITTER+"(.*)\/g");
cleanXML = dirtyXML.replace(reg, "<Set>")

This fails to work.

I will note that I tried reg = /<Set(.*)/g; and that worked just fine... but it also captures "SetType" and any other tag that starts with "<Set".

  • Use a parser please. [H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ](https://stackoverflow.com/a/1732454/3600709) – ctwheels Sep 27 '19 at 19:51
  • The meaning of the XML tags in this instance doesn't matter... I just need to find a way to change one string to another... As far as the question is concerned, it doesn't matter whether it's XML, HTML, or Lorem Ipsum for that matter... – lowcrawler Sep 27 '19 at 19:58
  • That clears it a bit; is the `XMLSPLITBLAH` portion always uppercase? – ctwheels Sep 27 '19 at 20:06
  • The `XMLSPLIT` text comes from the constant `XML_SPLITTER`. The stuff that comes after will be a mix of alphanumerics (upper, lower, numbers). I really just need to remove everything between a constant (`XML_SPLITTER`) and a `>`. – lowcrawler Sep 27 '19 at 20:11

2 Answers


It's because `^` is a regex special character that indicates "beginning of line". You'd need to escape it like `\^` for this to work. Something like `/<Set\^\^[^>]*>/g` should do the trick.

Small note: The above regex assumes that the "BLAH" string in your example will never contain the > character... but if it does, then your XML is super malformed anyway.
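
If the splitter lives in a constant, one way to avoid hand-escaping is to escape it programmatically before building the regex. The following is a minimal sketch, assuming a splitter of `^^` (as the regex above suggests); `escapeRegExp` and the sample `dirtyXML` string are hypothetical names introduced only for illustration.

```javascript
// Hypothetical helper (not from the original post): put a backslash in front
// of every character that has special meaning in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

const XML_SPLITTER = "^^"; // assumed splitter containing special characters
const dirtyXML = "<Set^^BLAH><Value>9</Value><SetType/></Set^^BLAH>";

// Match the splitter plus everything up to and including the next ">".
const reg = new RegExp(escapeRegExp(XML_SPLITTER) + "[^>]*>", "g");
const cleanXML = dirtyXML.replace(reg, ">");

console.log(cleanXML); // <Set><Value>9</Value><SetType/></Set>
```

With this approach the splitter can be changed later without touching the regex-building code.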

  • I can just change the XML_SPLITTER to "XMLSPLIT" so as to avoid any special characters. I then have the regex ... built like this: `reg = new RegExp("<Set"+XML_SPLITTER+"[^>]*>\/g");` It doesn't seem to work. Help? – lowcrawler Sep 27 '19 at 20:01
  • You can't put the global flag inside the string like that. Correct way is `new RegExp("<\/?Set"+XML_SPLITTER+"[^>]*>", "g");` See this fiddle: https://jsfiddle.net/fouzrk6h/ – Declan McKelvey-Hembree Sep 27 '19 at 20:10
  • Oh, whoops, created a slight bug. Fixed the code to `reg = new RegExp(XML_SPLITTER+"[^>]*>", "g"); cleanXML = dirtyXML.replace(reg, ">");` and updated fiddle: https://jsfiddle.net/zcnkt7qs/. That should work. – Declan McKelvey-Hembree Sep 27 '19 at 20:13
  • My hero. That worked. I had spit out the reg as a string and saw it had an odd extra trailing slash. Your switch fixed it. – lowcrawler Sep 27 '19 at 20:26
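
The point made in these comments about where the flag goes can be checked directly. This is a small sketch, assuming the `XMLSPLIT` splitter; the variable names `wrong`, `right`, and `tag` are made up for the demonstration.

```javascript
const XML_SPLITTER = "XMLSPLIT";

// "\/g" inside the string is not a flag: it becomes the literal text "/g"
// at the end of the pattern.
const wrong = new RegExp("<Set" + XML_SPLITTER + "[^>]*>\/g");

// Flags belong in the second argument of the RegExp constructor.
const right = new RegExp("<Set" + XML_SPLITTER + "[^>]*>", "g");

console.log(wrong.source, "| flags:", JSON.stringify(wrong.flags));
console.log(right.source, "| flags:", JSON.stringify(right.flags));

const tag = "<SetXMLSPLITBLAH>";
const wrongMatches = wrong.test(tag); // false: the tag is not followed by "/g"
const rightMatches = right.test(tag); // true
console.log(wrongMatches, rightMatches);
```

Printing `source` is a quick way to spot the stray trailing slash mentioned above.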

Using .* will match > and if - for some reason - your XML file is not broken up into multiple lines (i.e. minified), you'll match more than you should. To avoid this, you can use [^>]* to match everything up to the >.
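
To illustrate the difference, here is a short sketch on a single-line (minified) input. The sample string and the names `greedy` and `bounded` are assumptions made for the example.

```javascript
const XML_SPLITTER = "XMLSPLIT";
const minified = "<SetXMLSPLITBLAH><Value>9</Value></SetXMLSPLITBLAH>";

// ".*" is greedy and happily crosses ">" characters, so on one line it
// swallows everything up to the last ">".
const greedy = new RegExp(XML_SPLITTER + ".*>", "g");

// "[^>]*" cannot cross a ">", so each match stops at the end of its own tag.
const bounded = new RegExp(XML_SPLITTER + "[^>]*>", "g");

const greedyResult = minified.replace(greedy, ">");
const boundedResult = minified.replace(bounded, ">");

console.log(greedyResult);  // <Set>
console.log(boundedResult); // <Set><Value>9</Value></Set>
```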

Since you've helpfully included a splitter, matching becomes much easier and much more predictable (as you mentioned, a plain <Set(.*) pattern also matches SetType, which has no splitter).

Without a splitter, you'd have to use a regex pattern that resembles <Set(?!Type>)[^>]* or <Set(?!(?:Type|SomethingElse)>)[^>]* if you had more than just one suffix to Set that should remain. These methods use a negative lookahead to assert what follows does not match.
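
A minimal sketch of the negative-lookahead variant, assuming tags with no splitter at all (`<SetBLAH>`); the `keep` list and sample string are hypothetical and only there to show the idea.

```javascript
// Assumed input without any splitter: "BLAH" is glued directly onto "Set".
const xml = "<SetBLAH><Value>9</Value><SetType><Name>Foo</Name></SetType></SetBLAH>";

// Hypothetical list of suffixes that are real tag names and must be kept.
const keep = ["Type"];

// (</?) captures "<" or "</"; the lookahead rejects Set followed by a kept
// suffix and ">"; [^>]* consumes the unwanted suffix.
const pattern = new RegExp(`(</?)Set(?!(?:${keep.join("|")})>)[^>]*`, "g");
const cleaned = xml.replace(pattern, "$1Set");

console.log(cleaned);
// <Set><Value>9</Value><SetType><Name>Foo</Name></SetType></Set>
```

Every new `Set...` tag name has to be added to the list by hand, which is why the splitter approach is easier to maintain.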

var str = `<SetXMLSPLITBLAH>
    <Value>9</Value>
    <SetType>
        <Name>Foo</Name>
    </SetType>
</SetXMLSPLITBLAH>`

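var XML_SPLITTER = 'XMLSPLIT'
// Capture the leading "<" or "</" so closing tags stay closing tags
var p = `(</?)Set${XML_SPLITTER}[^>]*`
var r = new RegExp(p,'g')
// $1 puts the captured "<" or "</" back in front of "Set"
var x = str.replace(r,'$1Set')

console.log(x)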
  • This also works. Thank you for the additional explanations! – lowcrawler Sep 27 '19 at 20:28
  • Just wanted to point out that this suffers from the same bug my answer previously did -- you're replacing instances of ` – Declan McKelvey-Hembree Sep 27 '19 at 20:30
  • @DeclanMcKelvey-Hembree Thank you, fixed. – ctwheels Sep 27 '19 at 20:33
  • It seems to me (perhaps I'm wrong) that if I'm just conscientious about using XML_SPLITTER, I should be able to just grab everything from that constant out through the next `>` and take care of everything. For example: `let reg = new RegExp(XML_SPLITTER+"[^>]*>", "g");` – lowcrawler Sep 27 '19 at 20:35
  • Yep, that's what I ended up doing in my fiddle. @ctwheels' solution of using capture groups is also totally valid though. Depends on whether you want it to work on Set XML elements with XMLSPLIT only or all elements with XMLSPLIT. – Declan McKelvey-Hembree Sep 27 '19 at 20:36
  • @lowcrawler yes. You don't even need the `>` at the end of the regex since it will stop there anyway. If you don't care about it being explicitly `<Set`, you can use `${XML_SPLITTER}[^>]*`; if you wanted to add support for `/>` (self-terminating tags) you can change it to `${XML_SPLITTER}(?:(?!\/?>)[^>])*` as seen [here](https://regex101.com/r/qCXNMg/1) – ctwheels Sep 27 '19 at 20:37
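
The self-terminating-tag variant from the last comment can be tried out like this. It is a sketch under the assumption that the splitter is `XMLSPLIT`; the sample input is invented for the demonstration.

```javascript
const XML_SPLITTER = "XMLSPLIT";

// Assumed input containing both a self-terminating tag and a normal pair.
const xml = "<SetXMLSPLITBLAH/><SetXMLSPLITFOO>9</SetXMLSPLITFOO>";

// Consume characters after the splitter one at a time, stopping as soon as
// the upcoming text is ">" or "/>". Nothing past the suffix is matched,
// so the replacement is simply the empty string.
const reg = new RegExp(`${XML_SPLITTER}(?:(?!/?>)[^>])*`, "g");
const cleaned = xml.replace(reg, "");

console.log(cleaned); // <Set/><Set>9</Set>
```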