Here is the problem. I have a block of pasted HTML text, and I need to remove trailing line breaks and white space from it, even ones that are followed by closing tags. The text below is simply an example, but it closely represents the real text I'm dealing with.

EG:

This:

<span>Here is some<br></span><br> <span><span>Here is some text</span><br><span><br>&nbsp; </span></span><br><br>

Becomes this:

<span>Here is some<br></span><br> <span><span>Here is some text</span><span></span></span>

As a first pass, I use Regex.Replace(htmlString, @"(?:\<br\s*?\>)*$", "") to get rid of the trailing line breaks. All I have left now is the line breaks and white space stuck behind closing tags.

I'm attempting to use this:

while (Regex.IsMatch(htmlString, @"(<br>|\s|&nbsp;)*(<[^>]*>)*$")) { htmlString = Regex.Replace(htmlString, @"(<br>|\s|&nbsp;)*(<[^>]*>)*$", "$2"); }

The regex pattern is actually working great; the problem is that the substitution with matched group 2 gives back only a single closing span, so I end up with the below:

<span>Here is some<br></span><br> <span><span>Here is some text</span></span>

michcoth
  • maaaaaaaaaaaaan ... i so expected to see [this answer](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) here ... – Noctis May 18 '17 at 00:58

2 Answers

I guess you can use:

resultString = Regex.Replace(subjectString, @"<br>|&nbsp;|\n", "");

Pedro Lobito

The regular expression is @"(<br>|\s|&nbsp;)*(<[^>]*>)*$". The second group is followed by a *, meaning the group is repeated, so $2 yields only one repetition of the group (in .NET, the last one matched).

Putting the repetition in a group will capture the whole repetition. Change the regular expression to be @"(<br>|\s|&nbsp;)*((<[^>]*>)*)$".
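
To make the difference concrete, here is a minimal sketch comparing group 2 under both patterns. The class name GroupCaptureDemo, the helper GroupTwo, and the sample input are made up for illustration and are not from the question; the input ends in two different closing tags so that it is visible which repetition survives.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GroupCaptureDemo
{
    // Pattern from the question: group 2 is itself repeated by the trailing *.
    public const string Repeated = @"(<br>|\s|&nbsp;)*(<[^>]*>)*$";

    // Same pattern with the repetition wrapped in an outer group, which becomes group 2.
    public const string Wrapped = @"(<br>|\s|&nbsp;)*((<[^>]*>)*)$";

    // Returns the text held by group 2 for the first match of pattern in input.
    public static string GroupTwo(string input, string pattern)
    {
        return Regex.Match(input, pattern).Groups[2].Value;
    }

    public static void Main()
    {
        // Hypothetical sample: a line break and white space followed by two closing tags.
        string tail = "text<br>&nbsp; </b></span>";

        Console.WriteLine(GroupTwo(tail, Repeated)); // prints "</span>"
        Console.WriteLine(GroupTwo(tail, Wrapped));  // prints "</b></span>"
    }
}
```

With the repeated group, only the last tag it matched is available through $2; with the wrapping group, $2 holds the whole run of tags.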

Note that repeating the first group with a * may make the code spin forever on some input strings, as there is no guarantee that the Replace will change the text to a different string. As the first group is optional (i.e. zero or more repeats), the Replace might replace a string with exactly the same string. So I suggest changing the regular expression to @"(<br>|\s|&nbsp;)+((<[^>]*>)*)$", meaning that one or more occurrences of the first group are required.
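
Putting the pieces together, the loop might look something like the sketch below. The class and method names (TrailingBreakTrimmer, Trim) are invented for the example, and it assumes line breaks are always written as a literal <br>; variants such as <br/> or <br /> would need the first group extended.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TrailingBreakTrimmer
{
    // One or more <br>, white space, or &nbsp; items, then any run of tags up to the end.
    // Group 2 wraps the whole run of tags, so "$2" puts all of them back.
    private static readonly Regex Trailing =
        new Regex(@"(<br>|\s|&nbsp;)+((<[^>]*>)*)$");

    public static string Trim(string html)
    {
        // Each replacement drops at least one item matched by group 1,
        // so the string gets shorter on every pass and the loop terminates.
        while (Trailing.IsMatch(html))
        {
            html = Trailing.Replace(html, "$2");
        }
        return html;
    }

    public static void Main()
    {
        // The example text from the question.
        string input = "<span>Here is some<br></span><br> <span><span>Here is some text</span><br><span><br>&nbsp; </span></span><br><br>";
        Console.WriteLine(Trim(input));
    }
}
```

Because the + requires at least one <br>, white space character, or &nbsp; per match, a string with nothing left to strip no longer matches and the while condition becomes false.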

AdrianHHH
  • Thank you Adriann! Seriously, you saved the day for me. That part about grouping the repetition was what I really needed, and I did also experience a looping problem until I added the one-or-more occurrence. Seriously, I couldn't thank you enough. – michcoth May 18 '17 at 13:10