Remove everything after constant using regex

Question

I've got XML that has additional information, BLAH, in each tag. When creating the tags, I've separated the extra info from the tag name with a constant (XMLSPLIT as constant XML_SPLITTER)... I needed to do this because I'm generating my XML from a JSON object and I can't have multiple keys that are the same thing... but in the XML output, can't have that superfluous stuff.

For example:

....
<SetXMLSPLITBLAH>
    <Value>9</Value>
    <SetType>
        <Name>Foo</Name>
    </SetType>
</SetXMLSPLITBLAH>
...

So, after generating the XML, I go through and clean it. I'm trying to do it with a regex. I figure, I want to remove anything on a line after the splitter and replace it with just the >.

let reg = new RegExp("<Set"+XML_SPLITTER+"(.*)\/g");
cleanXML = dirtyXML.replace(reg, "<Set>")

This fails to work.

I will note that I tried reg = /<Set(.*)/g; and that worked just fine... but it also captures "SetType" and any other tag that starts with "<Set".

Use a parser please. [H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ](https://stackoverflow.com/a/1732454/3600709) — ctwheels, Sep 27 '19 at 19:51
The meaning of the XML tags in this instance don't matter... I just need to find a way to change one string to another... As far as the question is concerned, it doesn't matter that it's XML, HTML, or Lorem Ipsum for that matter... — lowcrawler, Sep 27 '19 at 19:58
That clears it a bit; is the `XMLSPLITBLAH` portion always uppercase? — ctwheels, Sep 27 '19 at 20:06
The `XMLSPLIT` text comes from the constant `XML_SPLITTER`. The stuff that comes after will be a mix of alphanumerics (upper, lower, numbers). I really just need to remove everything between a constant (`XML_SPLITTER`) and a `>`. — lowcrawler, Sep 27 '19 at 20:11

score 1 · Accepted Answer · answered Sep 27 '19 at 19:53

It's because ^ is a Regex special character that indicates "beginning of line". You'd need to escape it like \^ for this to work. Something like /<Set\^\^[^>]*>/g should do the trick.

Small note: The above regex assumes that the "BLAH" string in your example will never contain the > character... but if it does, then your XML is super malformed anyway.
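If the splitter has to keep its special characters, an alternative to escaping them by hand is to escape the constant programmatically before building the pattern. This is only a rough sketch: `escapeRegExp` is a hypothetical helper (not a built-in), and the `^^` splitter value and the sample input are assumed for illustration.

```javascript
// Hypothetical helper: backslash-escape every regex metacharacter in a string.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Assumed splitter value for illustration; it contains the special character ^.
const XML_SPLITTER = "^^";
const dirtyXML = "<Set^^BLAH>\n    <Value>9</Value>\n</Set^^BLAH>";

// Remove everything from the (escaped) splitter through the next ">",
// then put the ">" back. The "g" flag goes in the second argument.
const reg = new RegExp(escapeRegExp(XML_SPLITTER) + "[^>]*>", "g");
const cleanXML = dirtyXML.replace(reg, ">");

console.log(cleanXML);
```

With the escaping in place, the splitter can be any string without the pattern silently changing meaning.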

Declan McKelvey-Hembree

I can just change the XML_SPLITTER to "XMLSPLIT" so as to avoid any special characters. I then have the regex ... built like this: `reg = new RegExp("<Set"+XML_SPLITTER+"[^>]*>\/g");` It doesn't seem to work. Help? – lowcrawler Sep 27 '19 at 20:01
You can't put the global flag inside the string like that. Correct way is `new RegExp("<\/?Set"+XML_SPLITTER+"[^>]*>", "g");` See this fiddle: https://jsfiddle.net/fouzrk6h/ – Declan McKelvey-Hembree Sep 27 '19 at 20:10
Oh, whoops, created a slight bug. Fixed the code to `reg = new RegExp(XML_SPLITTER+"[^>]*>", "g"); cleanXML = dirtyXML.replace(reg, ">");` and updated fiddle: https://jsfiddle.net/zcnkt7qs/. That should work. – Declan McKelvey-Hembree Sep 27 '19 at 20:13
My hero. That worked. I had spit out the reg as a string and saw it had an odd extra trailing slash. Your switch fixed it. – lowcrawler Sep 27 '19 at 20:26
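To make the fix from these comments concrete, here is a small sketch using the sample XML from the question. It shows why putting `\/g` inside the pattern string fails (it becomes literal text to match, not a flag) next to the working two-argument form. The variable names `broken` and `working` are just for illustration.

```javascript
const XML_SPLITTER = "XMLSPLIT";
const dirtyXML = [
  "<SetXMLSPLITBLAH>",
  "    <Value>9</Value>",
  "    <SetType>",
  "        <Name>Foo</Name>",
  "    </SetType>",
  "</SetXMLSPLITBLAH>",
].join("\n");

// Broken: "\/g" inside the string is not a flag. The pattern now requires
// the literal characters "/g" right after the ">", so nothing matches.
const broken = new RegExp(XML_SPLITTER + "[^>]*>\/g");
console.log(broken.flags === "");                      // true: no flags were set
console.log(dirtyXML.replace(broken, ">") === dirtyXML); // true: input unchanged

// Working: flags are passed as the second argument.
const working = new RegExp(XML_SPLITTER + "[^>]*>", "g");
const cleanXML = dirtyXML.replace(working, ">");
console.log(cleanXML);
```

Because the pattern starts at the splitter rather than at `<Set`, it cleans both the opening and the closing tag and leaves `<SetType>` alone.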

ctwheels · Answer 2 · 2019-09-27T20:33:25.007

Using .* will also match >, so if, for some reason, your XML is not broken up into multiple lines (i.e. it's minified), you'll match more than you should. To avoid this, you can use [^>]* to match everything up to the >.

Since you've gracefully included a splitter, it'll make matching much easier and much more predictable (as you mentioned, you match SetType without a splitter).

Without a splitter, you'd have to use a regex pattern that resembles <Set(?!Type>)[^>]* or <Set(?!(?:Type|SomethingElse)>)[^>]* if you had more than just one suffix to Set that should remain. These methods use a negative lookahead to assert what follows does not match.

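var str = `<SetXMLSPLITBLAH>
    <Value>9</Value>
    <SetType>
        <Name>Foo</Name>
    </SetType>
</SetXMLSPLITBLAH>`

var XML_SPLITTER = 'XMLSPLIT'
// (</?) captures "<" or "</" so it can be restored; [^>]* stops before the ">"
var p = `(</?)Set${XML_SPLITTER}[^>]*`
var r = new RegExp(p,'g')
// declare x instead of leaking an implicit global
var x = str.replace(r,'$1Set')

console.log(x)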
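For comparison, the splitter-free variant with a negative lookahead might look roughly like this. The minified input and the `BLAH` suffix glued directly onto `Set` are assumptions for illustration, and the `(</?)` capture group is added so closing tags are handled too.

```javascript
// Assumed input: no splitter, so the extra info sits directly after "Set".
const str = "<SetBLAH><Value>9</Value><SetType><Name>Foo</Name></SetType></SetBLAH>";

// (</?) keeps the opening "<" or "</"; the negative lookahead (?!Type>) skips
// the legitimate SetType tag; [^>]* consumes the extra info up to the ">".
const r = /(<\/?)Set(?!Type>)[^>]*/g;
const x = str.replace(r, "$1Set");

console.log(x);
// <Set><Value>9</Value><SetType><Name>Foo</Name></SetType></Set>
```

Every additional tag name that starts with `Set` has to be listed in the lookahead, which is why the splitter approach is easier to keep predictable.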

edited Sep 27 '19 at 20:33

answered Sep 27 '19 at 20:26


This also works. Thank you for the additional explanations! – lowcrawler Sep 27 '19 at 20:28
Just wanted to point out that this suffers from the same bug my answer previously did -- you're replacing instances of ` – Declan McKelvey-Hembree Sep 27 '19 at 20:30
@DeclanMcKelvey-Hembree Thank you, fixed. – ctwheels Sep 27 '19 at 20:33
It seems to me (perhaps I'm wrong) that if I'm just conscientious about using XML_SPLITTER, I should be able to just grab everything from that constant out through the next `>` and take care of everything. For example: `let reg = new RegExp(XML_SPLITTER+"[^>]*>", "g");` – lowcrawler Sep 27 '19 at 20:35
Yep, that's what I ended up doing in my fiddle. @ctwheels' solution of using capture groups is also totally valid though. Depends on whether you want it to work on Set XML elements with XMLSPLIT only or all elements with XMLSPLIT. – Declan McKelvey-Hembree Sep 27 '19 at 20:36
@lowcrawler yes. You don't even need the `>` at the end of the regex since it will stop there anyway. If you don't care about it being explicitly `<Set`, you can use `${XML_SPLITTER}[^>]*`; if you wanted to add support for `/>` (self-terminating tags) you can change it to `${XML_SPLITTER}(?:(?!\/?>)[^>])*` as seen [here](https://regex101.com/r/qCXNMg/1) – ctwheels Sep 27 '19 at 20:37
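A quick sketch of the difference the self-terminating-tag pattern makes. The self-closing sample input and the names `simple` and `selfClosingAware` are assumptions for illustration. Since neither pattern consumes the `>`, the replacement string is empty.

```javascript
const XML_SPLITTER = "XMLSPLIT";
// Assumed sample: one self-closing tag and one regular open/close pair.
const str = "<SetXMLSPLITBLAH/><SetXMLSPLITFOO></SetXMLSPLITFOO>";

// Simple form: [^>]* also swallows the "/" of a self-closing tag.
const simple = new RegExp(XML_SPLITTER + "[^>]*", "g");

// Lookahead form: stop before either ">" or "/>", so the "/" survives.
const selfClosingAware = new RegExp(XML_SPLITTER + "(?:(?!/?>)[^>])*", "g");

console.log(str.replace(simple, ""));           // <Set><Set></Set>
console.log(str.replace(selfClosingAware, "")); // <Set/><Set></Set>
```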
