Regexp replace in XML

Question

I'm new to using XML and have not had much training. I am trying to correctly format the text in a custom report. I have this line:

.replace(/(&lt;([^>]+)>)/ig, "\n")

and would like to fully understand what it is doing. I know that a newline replaces whatever the parentheses match. Specifically, what is this looking for?

([^>]+)>)

Edit (from comments):

Here is the full expression (reformatted for readability).

<expression name="expression" type="javascript">
  (
    dataSetRow["Question_Employee_Comment"] +
    dataSetRow["Question_Manager_Comment"]
  )
    .replace(/(&lt;([^>]+)>)/ig, "\n")
    .replace(/null/ig, "")
    .replace(/&amp;amp;/g, "&amp;")
    .replace(/&amp;#39;/g,"'")
    .replace(/&amp;nbsp;/g," ")
    .replace(/•/g,'\n•')
</expression>

And here is the XML that this expression is looking at (wrapped for readability):

<wd:Question_Employee_Comment>
    &lt;p>I don't even know where to start... Cupid wasn't @ his desk on 2/14/2015
    and I'm really upset because I've been really patient with his personal needs.
    Santa &amp;amp; I sat him down and have discussed why his attendance is important
    to success.&lt;/p>&lt;p>&lt;/p>&lt;p>He's been absent
    on:&lt;/p>&lt;ul>&lt;li>3/19/15&lt;/li>&lt;li>March 20,
    2015&lt;/li>&lt;li>05/01/2015&lt;/li>/ul>&lt;p>&lt;/p>&lt;p>All
    additional dates will be documented.&lt;/p>
</wd:Question_Employee_Comment>

Is .replace(/(&lt;([^>]+)>)/ig, "\n") right? Shouldn't it be .replace(/(<([^>]+)>)/ig, "\n")? — Polak, Jan 04 '16 at 23:26

Dan Lowe · Accepted Answer · 2016-01-05T14:38:24.130

This regular expression (or regexp) can be broken down as follows.

(&lt;([^>]+)>)

The parentheses are for grouping.

Sometimes they are used to capture what was matched so it can be reused later, though I see no evidence that is happening in this limited sample of the code.

Sometimes they are used to allow multiple alternative choices (e.g. (a|b|c)), but I don't see that here either.
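To make the point about capturing concrete, here is a minimal sketch. The sample string is hypothetical and only imitates the escaped-tag style of the question. It shows what each pair of parentheses would capture if the replacement referred to them, and that removing them changes nothing when the replacement is just "\n".

```javascript
// Hypothetical input, written in the same escaped-tag style as the question.
const sample = "&lt;p>Hello&lt;/p>";

// Same pattern as in the question: group 1 is the whole tag,
// group 2 is the text between "&lt;" and ">".
const pattern = /(&lt;([^>]+)>)/ig;

// Referring to $1 and $2 in the replacement shows what each group captured.
const labelled = sample.replace(pattern, "[$1|$2]");
console.log(labelled); // [&lt;p>|p]Hello[&lt;/p>|/p]

// With "\n" as the replacement the groups are never referenced,
// so the pattern without parentheses gives the same result.
const withGroups = sample.replace(/(&lt;([^>]+)>)/ig, "\n");
const withoutGroups = sample.replace(/&lt;[^>]+>/ig, "\n");
console.log(withGroups === withoutGroups); // true
```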

Since the parentheses don't do anything in this expression, at least not as far as matching, let's ignore them. That leaves this:

&lt;[^>]+>

Half of this is just literal characters to match. The match must begin with the literal 4-character string &lt; and end with the literal character >. In the middle is the only regexp bit.

[^>]+

The square brackets denote a character class. Inside a character class, if ^ is the first character, as it is here, then it is an inverse character class, that is, it means "match things that are not these things". So, this character class says "match things that are not a >."

The + after the character class is called a quantifier, and it means "one or more of this thing".

So, taken together it means "one or more things that are not a >."

The entire expression means: match &lt;, followed by one or more things that are not >, followed by a >.
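The breakdown above can be checked with a few test strings. This is only a sketch: the strings are made up for illustration and are not taken from the report data.

```javascript
// The pattern without the redundant parentheses.
const tag = /&lt;[^>]+>/;

// Hypothetical test strings.
console.log(tag.test("&lt;p>"));   // true: "&lt;", then "p", then ">"
console.log(tag.test("&lt;/li>")); // true: "/li" is three characters that are not ">"
console.log(tag.test("&lt;>"));    // false: "+" needs at least one character before ">"
console.log(tag.test("/ul>"));     // false: there is no leading "&lt;"
console.log(tag.test("&lt;p"));    // false: there is no closing ">"
```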

After the expression are two flags, i and g. The i means match case-insensitively. It doesn't do anything here, because your expression has no match characters that are alphabetic. The g flag means to match globally, that is, if there is more than one match against the input, match them all instead of matching only in the first case.
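A short sketch of what the two flags change. The input strings are hypothetical. The /null/ig pattern is borrowed from the full expression in the question, because it contains letters and so shows a case where i does make a difference.

```javascript
// Hypothetical input with four escaped tags.
const text = "&lt;p>one&lt;/p>&lt;p>two&lt;/p>";

// Without g, only the first match is replaced.
const firstOnly = text.replace(/&lt;[^>]+>/, "\n");
// With g, every match is replaced.
const all = text.replace(/&lt;[^>]+>/g, "\n");

console.log(JSON.stringify(firstOnly)); // "\none&lt;/p>&lt;p>two&lt;/p>"
console.log(JSON.stringify(all));       // "\none\n\ntwo\n"

// The i flag matters once the pattern contains letters.
const caseSensitive = "Null and NULL".replace(/null/g, "");
const caseInsensitive = "Null and NULL".replace(/null/ig, "");
console.log(JSON.stringify(caseSensitive));   // "Null and NULL"
console.log(JSON.stringify(caseInsensitive)); // " and "
```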

Now, looking at your example XML, I believe the expression would make a number of edits. Note that you posted the content of <wd:Question_Employee_Comment> only, but the expression is actually operating on both that and the content of <wd:Question_Manager_Comment>, if that has a value. I won't remark on <wd:Question_Manager_Comment> here, because you didn't post what it contains.

The leading &lt;p> just before I don't even would be replaced by a newline.
Just after important to success, the &lt;/p>&lt;p>&lt;/p>&lt;p> would be replaced by 4 newlines.
Just after absent on:, the &lt;/p>&lt;ul>&lt;li> would be replaced by 3 newlines.
Just after 3/19/15, the &lt;/li>&lt;li> would be replaced by 2 newlines.
Just after March 20, 2015, the &lt;/li>&lt;li> would be replaced by 2 newlines.
Just after 05/01/2015, the &lt;/li> would be replaced by a newline.
Just before All additional, the &lt;p>&lt;/p>&lt;p> would be replaced by 3 newlines.
At the end, the &lt;/p> would be replaced by a newline.

Note that there is a partial tag in there that is missed by the expression, /ul>.

Result:

<wd:Question_Employee_Comment>
    \nI don't even know where to start... Cupid wasn't @ his desk on 2/14/2015
    and I'm really upset because I've been really patient with his personal needs.
    Santa &amp;amp; I sat him down and have discussed why his attendance is important
    to success.\n\n\n\nHe's been absent
    on:\n\n\n3/19/15\n\nMarch 20,
    2015\n\n05/01/2015\n/ul>\n\n\nAll
    additional dates will be documented.\n
</wd:Question_Employee_Comment>

That's from the .replace() you specifically asked about. The full expression does further work, such as fixing &amp; to be &. I haven't applied those transformations here since they weren't part of the core question you asked, but I could elaborate if you don't understand those parts.
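The walkthrough can be reproduced with a short sketch. It assumes the comment text reaches the expression exactly as posted, with the tags still written as &lt;...>, and it uses an abbreviated copy of that text unwrapped onto one line. Only the first .replace() is applied.

```javascript
// Abbreviated copy of the posted comment text, unwrapped onto one line.
const comment =
  "&lt;p>to success.&lt;/p>&lt;p>&lt;/p>&lt;p>He's been absent on:&lt;/p>" +
  "&lt;ul>&lt;li>3/19/15&lt;/li>&lt;li>March 20, 2015&lt;/li>" +
  "&lt;li>05/01/2015&lt;/li>/ul>&lt;p>&lt;/p>&lt;p>All additional dates will be documented.&lt;/p>";

// The replace from the question: every escaped tag becomes a newline.
const stripped = comment.replace(/(&lt;([^>]+)>)/ig, "\n");

// JSON.stringify makes the inserted newlines visible as \n.
console.log(JSON.stringify(stripped));

// The malformed closing tag has no leading "&lt;", so it survives.
console.log(stripped.includes("/ul>")); // true

// One newline per matched tag: 1 + 4 + 3 + 2 + 2 + 1 + 3 + 1.
const newlineCount = (stripped.match(/\n/g) || []).length;
console.log(newlineCount); // 17
```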

Thank you Dan for such a quick and detailed response. That piece of code makes better sense now. — Sonya B, Jan 05 '16 at 13:48
Here is the full expression: (dataSetRow["Question_Employee_Comment"]+dataSetRow["Question_Manager_Comment"]).replace(/(&lt;([^>]+)>)/ig, "\n").replace(/null/ig, "").replace(/&amp;amp;/g, "&amp;").replace(/&amp;#39;/g,"'").replace(/&amp;nbsp;/g," ").replace(/•/g,'\n•') — Sonya B, Jan 05 '16 at 13:55
Here is the xml that the expression is looking at: &lt;p>I don't even know where to start... Cupid wasn't @ his desk on 2/14/2015 and I'm really upset because I've been really patient with his personal needs. Santa &amp;amp; I sat him down and have discussed why his attendance is important to success.&lt;/p>&lt;p>&lt;/p>&lt;p>He's been absent on:&lt;/p>&lt;ul>&lt;li>3/19/15&lt;/li>&lt;li>March 20, 2015&lt;/li>&lt;li>05/01/2015&lt;/li>/ul>&lt;p>&lt;/p>&lt;p>All additional dates will be documented.&lt;/p> — Sonya B, Jan 05 '16 at 14:00
so based on my understanding of your explanation, Dan.....the .replace(/<([^>]+)>/ig, "\n") will add 2 new line characters in front of the text "He's been absent on"...is that correct? and 3 new line characters in front of the "March 20, 2015" text? — Sonya B, Jan 05 '16 at 14:05
@Sonya Yes, but it would do more than that - see my expanded answer. — Dan Lowe, Jan 05 '16 at 14:39
Many Many Many Thanks Dan! This is a perfect answer! I appreciate your willingness to help a newbie. I did enough research on the web to figure out the other .replace transformations and added a couple to the expression myself. I am still stumped on an issue with bullets when the text is keyed directly into the form versus when it's pasted into the form BUT I will create a new question because this one has been fully answered. — Sonya B, Jan 05 '16 at 14:53

kjhughes · Answer 2 · 2017-12-09T15:32:33.323

That replace function will replace all XML tags with new line characters, leaving behind pure text without any markup.

Notes:

The replace function is meant to be applied to XML; it is not XML itself.
It uses a regular expression to match an XML tag. See Dan's answer for a great description of the constructs in the regular expression.
Regex is fundamentally the wrong way to process XML. Use a real XML parser or XPath instead.
