I use the following code to convert a string to sentence case.

var sentenceRegex = new Regex(@"(^[a-z])|[?!.:;]\s+(.)", RegexOptions.ExplicitCapture);
var result = sentenceRegex.Replace(toConvert.ToLower(), s => s.Value.ToUpper());

However, it fails when a sentence starts with an HTML tag, as in the example below.

I want to skip the HTML tags and convert only the text to sentence case. Current text:

<BOLD_HTML_TAG>lorem ipsum is simply dummy</BOLD_HTML_TAG> text of the printing and typesetting industry.
<PARAGRAPH_TAG>LOREM ipsum has been the industry's standard dummy
textever since the 1500s</PARAGRAPH_TAG>.

After sentence casing, the output should be as follows:

<BOLD_HTML_TAG>Lorem ipsum is simply dummy</BOLD_HTML_TAG> text of the
printing and typesetting industry. <PARAGRAPH_TAG>Lorem ipsum has been
the industry's standard dummy textever since the
1500s</PARAGRAPH_TAG>.

I would appreciate help with the regex I should use to ignore the HTML tags (not remove them) and convert the string to sentence case.

Wiktor Stribiżew
Tub
  • Something like [`(^[a-z])|[?!.:;]\s+((?:<[^<]*>)?.)`](http://regexstorm.net/tester?p=(%5e%5ba-z%5d)%7c%5b%3f!.%3a%3b%5d%5cs%2b((%3f%3a%3c%5b%5e%3c%5d*%3e)%3f.)&i=%3cBOLD_HTML_TAG%3elorem+ipsum+is+simply+dummy%3c%2fBOLD_HTML_TAG%3e+text+of+the+printing+and+typesetting+industry.%0d%0a%3cPARAGRAPH_TAG%3elorem+ipsum+has+been+the+industry%27s+standard+dummy%0d%0atextever+since+the+1500s%3c%2fPARAGRAPH_TAG%3e.)? This assumes your tags are always uppercase. And does not account for more than 1 but it is easy to fix by adding `(?:\s*<[^<]*>)*`. – Wiktor Stribiżew Aug 13 '15 at 11:27
  • Are those genuinely the only tags that could possibly appear, or are there a load of other tags that you might need to handle (e.g., say, ``) – Matthew Watson Aug 13 '15 at 11:27
  • Following tags can appear p|b|br|li|ul|ol|u|i|strong|h1|h2|h3|h4|h5|h6 – Tub Aug 13 '15 at 11:29
  • (I don't think this question is a duplicate of [the proposed duplicate](http://stackoverflow.com/questions/3141426/net-method-to-convert-a-string-to-sentence-case) since this question references the answer to that question.) – Matthew Watson Aug 13 '15 at 11:30
  • @MatthewWatson Indeed, I missed the HTML tags part. – DavidG Aug 13 '15 at 11:30
  • @Tub What would you want to happen with this `Lorem ipsum dummy text`? – DavidG Aug 13 '15 at 11:32
  • Refer to this answer: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 Seriously though, You're probably better off picking out the fragments of text from the html using html parsing techniques, THEN applying regex to those fragments. – Flynn1179 Aug 13 '15 at 12:02
  • @MatthewWatson : Thank you for marking it as not duplicate. – Tub Aug 13 '15 at 13:40
  • @DavidG : The text will continue to be as **Lorem ipsum dummy text** Considering I wanted to convert this to Sentence case – Tub Aug 13 '15 at 14:01
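The idea from the first comment (allow any number of tags between the sentence boundary and the first letter) can be sketched roughly as follows. This is only an illustration, not code from the thread: the class name `SentenceCaseSketch`, the method `ToSentenceCase`, and the group names `lead`, `tags`, and `first` are made up here, and a "tag" is assumed to be anything of the form `<...>`. Unlike the pattern in the comment, it upper-cases only the captured letter, so tag names keep whatever casing they have.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SentenceCaseSketch
{
    // Start of string, or sentence-ending punctuation plus whitespace,
    // then any number of tags (kept verbatim), then the first letter.
    private static readonly Regex SentenceStart = new Regex(
        @"(?<lead>^|[?!.:;]\s+)(?<tags>(?:\s*<[^<>]*>)*)(?<first>[a-z])");

    public static string ToSentenceCase(string html)
    {
        // Lower-case only the text between tags; tags are returned unchanged.
        string lowered = Regex.Replace(html, @"<[^<>]*>|[^<]+",
            m => m.Value[0] == '<' ? m.Value : m.Value.ToLowerInvariant());

        // Re-emit the boundary and the tags as-is, upper-casing just the letter.
        return SentenceStart.Replace(lowered, m =>
            m.Groups["lead"].Value +
            m.Groups["tags"].Value +
            m.Groups["first"].Value.ToUpperInvariant());
    }
}
```

Like the original regex, this only treats punctuation followed by whitespace as a sentence boundary, so a closing tag sitting directly after the full stop (for example `.</b> <p>`) is not recognised, and neither is whitespace between the last tag and the first letter.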

1 Answer

Might not be pretty, but it works ;)

using System;
using System.Text.RegularExpressions;

public class Program
{
    public static void Main()
    {
        string toConvert = "<BOLD_HTML_TAG>lorem ipsum is simply dummy</BOLD_HTML_TAG> text of the printing and typesetting industry."+
                "<PARAGRAPH_TAG>LOREM ipsum has been the industry's standard dummy "+
                "text ever since the 1500s</PARAGRAPH_TAG>.";
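        // Match the text between <tag> and its matching </tag>. The tags are only asserted
        // by the lookbehind/lookahead, so they are not part of the match and stay untouched.
        var sentenceRegex = new Regex(@"(?<=<(?<tag>\w+)>).*?(?=</\k<tag>>)", RegexOptions.ExplicitCapture);
        // Upper-case the first character and lower-case the rest; an empty match
        // (e.g. <b></b>) is returned as-is so that Substring(0,1) cannot throw.
        var result = sentenceRegex.Replace(toConvert, s => s.Value.Length == 0 ? s.Value : s.Value.Substring(0,1).ToUpper()+s.Value.ToLower().Substring(1));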

        Console.WriteLine(toConvert + "\r\n" + result);
    }
}

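The regex uses a named group in a lookbehind to match the opening tag and a backreference in a lookahead to match the corresponding closing tag, so only the text between them is matched. The replacement then upper-cases the first letter of that text and lower-cases the rest.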

Regards

SamWhan
  • Thanks for the reply, stripping off the HTML is not what I am expecting. I want to retain the HTML tags and also convert the sentence to SENTENCE CASE. To be specific, I would like to skip the HTML Tag and not remove them. Example : **Before** : lorem ipsum is simply dummy **After** : Lorem ipsum is simply dummy – Tub Aug 13 '15 at 13:52
  • @Tub I'm not sure I follow you... That's exactly what it does. Check [Fiddle](https://dotnetfiddle.net/ILcyh2). – SamWhan Aug 13 '15 at 14:07
  • I am sorry, guess there was an issue executing at my end and so the results failed to show as expected. Let me give it another try. Will let you know ASAP. – Tub Aug 13 '15 at 14:37
  • I appreciate your efforts, the code works partially but would fail in scenarios where there are multiple tags for example [Fiddle](https://dotnetfiddle.net/QLT144) – Tub Aug 13 '15 at 14:54
  • I am really sorry, I am not good enough at regex to see how it can be adapted for multiple tags. – Tub Aug 13 '15 at 15:10
  • Also I just realized, the regex you suggested capitalizes the letter after HTML Tag even if it is not the start of the next sentence. Example : Lorem – Tub Aug 13 '15 at 16:26
  • Well, as you'll find written here at SO a billion times, regex isn't the best way to parse HTML. It works for the simple example you had in your question, but as soon as the complexity goes up, you probably won't be able to do it without a "real" HTML parser. – SamWhan Aug 13 '15 at 16:46
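The two problems raised in the comments above (several tags in a row, and a tag in the middle of a sentence) both come from treating the tag as the sentence boundary. One way around that, sketched below, is to split the input into tag tokens and text tokens, copy the tags through unchanged, and carry the "next letter starts a sentence" state across them. This is a minimal sketch under stated assumptions, not a replacement for a real HTML parser: the class `TagAwareSentenceCase` and its `Apply` method are hypothetical names, a tag is assumed to be anything of the form `<...>`, and the sentence terminators are the same `?!.:;` set used in the question.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class TagAwareSentenceCase
{
    // A token is a tag (<...>), a run of text, or a stray '<' that opens no tag.
    private static readonly Regex Token = new Regex(@"<[^<>]*>|[^<]+|<");

    public static string Apply(string html)
    {
        var output = new StringBuilder(html.Length);
        bool capitalizeNext = true;    // the first letter of the input starts a sentence
        bool afterTerminator = false;  // the last text character was one of ?!.:;

        foreach (Match token in Token.Matches(html))
        {
            string value = token.Value;
            if (value.Length > 1 && value[0] == '<')
            {
                output.Append(value);  // tags are copied verbatim and leave the state alone
                continue;
            }

            foreach (char c in value)
            {
                if (char.IsWhiteSpace(c))
                {
                    // Terminator followed by whitespace: the next letter opens a sentence.
                    if (afterTerminator) capitalizeNext = true;
                    afterTerminator = false;
                    output.Append(c);
                }
                else if ("?!.:;".IndexOf(c) >= 0)
                {
                    afterTerminator = true;
                    output.Append(c);
                }
                else
                {
                    afterTerminator = false;
                    bool upper = capitalizeNext && char.IsLetter(c);
                    output.Append(upper ? char.ToUpperInvariant(c) : char.ToLowerInvariant(c));
                    // A letter or digit consumes the pending "start of sentence".
                    if (char.IsLetterOrDigit(c)) capitalizeNext = false;
                }
            }
        }
        return output.ToString();
    }
}
```

With this sketch, `TagAwareSentenceCase.Apply("<b>lorem ipsum.</b> <p><i>DOLOR</i> sit amet</p>")` returns `<b>Lorem ipsum.</b> <p><i>Dolor</i> sit amet</p>`: nested tags are skipped, and a tag in mid-sentence does not trigger capitalisation. Attribute values that contain `>` and HTML comments would still confuse the tokenizer, which is where a proper parser becomes necessary.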