I built an extension that converts HTML-formatted text into something better suited to a list view. It removes all HTML tags, except that it replaces closing heading (<h...>) and <p> tags with <br /> to keep the text readable in the list view. It also shortens the text of longer posts. I render the result in my Razor view with Html.Raw(model.text).

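// Requires: using System; using System.Text.RegularExpressions;
public static string FixHTML(string input, int? strLen)
{
    string s = input.Trim();

    // Turn closing <p> and heading tags into line breaks.
    s = Regex.Replace(s, "</p.*?>", "<br />");
    s = Regex.Replace(s, "</h.*?>", "<br />");

    // Protect the line breaks with a placeholder, strip every remaining tag
    // (plus anything from a leftover "</" to the end of the line), then
    // restore the line breaks.
    s = s.Replace("<br />", "*ret$990^&");
    s = Regex.Replace(s, "<.*?>", String.Empty);
    s = Regex.Replace(s, "</.*", String.Empty);
    s = s.Replace("*ret$990^&", "<br />");

    // Cut to at most strLen characters; this can split a "<br />" in half.
    int i = (strLen ?? s.Length);
    s = s.Substring(0, (i > s.Length ? s.Length : i));
    return s;
}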

PROBLEM: if the text gets cut off in the middle of a <br />, it messes up the displayed text. For example, if it is cut off at "blah blah blah <br", the display isn't nice. How can I use a regex (or even a string replace) to find only the last occurrence of <b... and only if it doesn't have a closing >?

I was thinking of something like:

s = string.Format(s.Substring(0, s.Length-6) + Regex.Replace(s.Substring(s.Length - 6), "<.*", string.Empty));

That will probably work, but my whole converter seems to use a lot of code to do something that should be relatively simple.

How can I do this?

NetMage
dave317
  • Using regex to parse HTML is not recommended. – Jan 18 '18 at 20:29
  • Is there anything that IS recommended to "clean" HTML? What I am doing above works, but I agree it's not pretty. – dave317 Jan 18 '18 at 20:40
  • Possible duplicate of [RegEx match open tags except XHTML self-contained tags](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – Lews Therin Jan 18 '18 at 21:26
  • I would suggest a library such as [HtmlAgilityPack](https://www.nuget.org/packages/HtmlAgilityPack) to parse through and change your HTML – Mike Kuenzi Jan 18 '18 at 22:21

1 Answer

Try this:

s = Regex.Replace(s, "(<|<b|<br ?/?)$", "", RegexOptions.None);
SBFrancies
  • An alternate regex that would catch all incomplete html tags (not just `br`) at the end of a string would be `"<[^>]*$"`. – Rudism Jan 19 '18 at 00:41
  • @Rudism - definitely a good solution, the only problem might be if the "<" character appeared in the text not as part of a tag – SBFrancies Jan 19 '18 at 00:53
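
To make the two comments above concrete, here is a minimal sketch that truncates the cleaned-up text and then applies the `"<[^>]*$"` pattern to drop any tag the cut left unfinished. The class name `HtmlPreview`, the method name `Truncate`, and its parameter names are hypothetical and not part of the original post; the sketch assumes the input has already been through the tag-stripping steps in the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HtmlPreview
{
    // Hypothetical helper (not from the original post): cuts the text to at
    // most maxLength characters, then removes a tag left unfinished by the cut.
    public static string Truncate(string cleaned, int? maxLength)
    {
        // Clamp the requested length to the range [0, cleaned.Length].
        int length = Math.Max(0, Math.Min(maxLength ?? cleaned.Length, cleaned.Length));
        string cut = cleaned.Substring(0, length);

        // "<[^>]*$" matches a '<' that has no closing '>' before the end of
        // the string, together with everything after it.
        return Regex.Replace(cut, "<[^>]*$", string.Empty);
    }
}
```

With this sketch, `HtmlPreview.Truncate("blah blah blah <br />more", 18)` cuts the text at `blah blah blah <br` and then returns `"blah blah blah "`, while a complete `<br />` at the end is left alone. The caveat from the last comment still applies: `HtmlPreview.Truncate("a < b and more", 5)` returns `"a "`, because the literal `<` looks like the start of an unfinished tag.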