C# regex to replace only the last occurrence of a pattern

Question

I built an extension to convert HTML-formatted text into something better suited to a list view. It removes all HTML tags, except that closing <h> and <p> tags are replaced with <br /> to keep the list view readable. It also truncates the text of longer posts. I render the result in my Razor view with Html.Raw(model.text).

public static string FixHTML(string input, int? strLen)
{
    string s = input.Trim();
    // Turn closing </p...> and </h...> tags into line breaks.
    s = Regex.Replace(s, "</p.*?>", "<br />");
    s = Regex.Replace(s, "</h.*?>", "<br />");
    // Protect the line breaks with a placeholder while all other tags are stripped.
    s = s.Replace("<br />", "*ret$990^&");
    s = Regex.Replace(s, "<.*?>", String.Empty);
    s = Regex.Replace(s, "</.*", String.Empty);
    s = s.Replace("*ret$990^&", "<br />");
    // Truncate to strLen characters when a limit is given.
    int i = (strLen ?? s.Length);
    s = s.Substring(0, (i > s.Length ? s.Length : i));
    return s;
}

PROBLEM: if the text gets cut off in the middle of a <br />, it messes up the displayed text. For example, if it is cut off at "blah blah blah <br", the display isn't nice. How can I use a regex (or even a string replace) to find only the last occurrence of <b..., and only if it doesn't have a closing >?

I was thinking of something like:

s = string.Format(s.Substring(0, s.Length-6) + Regex.Replace(s.Substring(s.Length - 6), "<.*", string.Empty));

That will probably work, but my whole converter seems to use a lot of code to do something that should be relatively simple.

How can I do this?

Is there anything that IS recommended to "clean" HTML? What I am doing above works, but I agree its not pretty. — dave317, Jan 18 '18 at 20:40
Possible duplicate of [RegEx match open tags except XHTML self-contained tags](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) — Lews Therin, Jan 18 '18 at 21:26
I would suggest a library such as [HtmlAgilityPack](https://www.nuget.org/packages/HtmlAgilityPack) to parse through and change your HTML — Mike Kuenzi, Jan 18 '18 at 22:21

score 2 · Accepted Answer · answered Jan 18 '18 at 22:15

Try this:

s = Regex.Replace(s, "(<|<b|<br|<br/)$", "", RegexOptions.None);
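
One gap worth noting: FixHTML emits `<br />` with a space, so a cut can also leave `<br ` or `<br /` at the end, and the alternation above would not match those. Below is a minimal sketch of a variant that covers every proper prefix of `<br />`. The class and method names (`TagTrimmer`, `TrimPartialBr`) are hypothetical, and the pattern is an assumption that extends the answer's regex rather than something taken from the original post.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original post.
public static class TagTrimmer
{
    // Matches "<", "<b", "<br", "<br " or "<br /" only when it sits at the
    // end of the string. A complete "<br />" is left alone because the ">"
    // prevents the end anchor from matching.
    private static readonly Regex PartialBr = new Regex("<(b(r( /?)?)?)?$");

    public static string TrimPartialBr(string s)
    {
        return PartialBr.Replace(s, string.Empty);
    }
}
```

In FixHTML this would presumably be called on the result of the Substring call, just before the return. Note that it also removes a lone trailing `<`, just as the answer's pattern does.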

SBFrancies

An alternate regex that would catch all incomplete html tags (not just `br`) at the end of a string would be `"<[^>]*$"`. – Rudism Jan 19 '18 at 00:41
@Rudism - definitely a good solution, the only problem might be if the "<" character appeared in the text not as part of a tag – SBFrancies Jan 19 '18 at 00:53
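
The trade-off discussed in these two comments can be seen in a short sketch. The pattern `<[^>]*$` removes any unclosed tag at the end of the string, but it equally removes ordinary text that follows a literal `<` which was never part of a tag. The names below (`TailCleaner`, `StripUnclosedTag`) are hypothetical and only wrap the pattern from the comment.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical wrapper around the pattern suggested in the comment.
public static class TailCleaner
{
    // Removes a trailing run that starts with '<' and contains no '>'
    // anywhere between that '<' and the end of the string.
    public static string StripUnclosedTag(string s)
    {
        return Regex.Replace(s, "<[^>]*$", string.Empty);
    }
}

// TailCleaner.StripUnclosedTag("one<br />two<b")  returns "one<br />two"
// TailCleaner.StripUnclosedTag("one<br />two")    returns "one<br />two"
// TailCleaner.StripUnclosedTag("if a < b then")   returns "if a "
```

The last case is the caveat raised in the reply: if the cleaned text can legitimately contain `<`, the narrower `br`-only pattern is the safer choice.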
