
I have a string preview class that takes an HTML string (from the database, or just a plain HTML string) and outputs a preview of x characters. Now my boss has asked me to convert it to use regex, and I have been hitting a wall for a while. Can anyone help me with that?

The part that concerns me most is getting x characters without including the tags in the count, but without stripping the tags either.

I would love it if anyone has something I could read on this, or a CodePlex project.

Reza
  • Why do you want to use regex? Are there problems with your current implementation? – driis Feb 21 '11 at 17:14
  • Please choose a different boss. This task is impossible to do with regexes. – Tim Pietzcker Feb 21 '11 at 17:15
  • You may propose the following read to your boss: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 I am sure he will enjoy it, and it will make him think twice the next time he asks someone to parse HTML with regex. – Darin Dimitrov Feb 21 '11 at 17:15
  • +1 for posting the obligatory parse-html-with-regex answer. I couldn't find it :-) – driis Feb 21 '11 at 17:18
  • I've never had a boss who understands HTML tags, regex, or how to count characters in a string. I'd love a tech savvy boss like yours! – Mikael Svenson Feb 21 '11 at 19:46
  • @Mikael Svenson, you call this a *tech savvy* boss? I wouldn't :-) – Darin Dimitrov Feb 21 '11 at 21:15
  • @Darin, of course he is tech savvy. He knows the language of grunts ;) With this knowledge he will make sure the turnover of his employees is around 6 months. New fresh guys are cheaper :) – Mikael Svenson Feb 22 '11 at 10:36
  • Can you post an example, please? It's difficult to understand what you want. Do you want to preserve the tags and offer a preview of the text only? – Stephan Mar 25 '11 at 12:47

1 Answer


The task is simple my friend... sounds like an interesting boss.

void Main()
{
    string test = "<html>wowzers description: none <div>description:a1fj391</div></html>";
    string result = getFirstChars(test, 15);
    Console.WriteLine(result);  

    //result: wowzers descrip
}

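// Matches an opening or closing tag, including any attributes.
static Regex MyRegex = new Regex(
      "(?<tag></?\\s*\\w+[^>]*>)",
    RegexOptions.Compiled);

static string getFirstChars(string html, int count)
{
    // Strip the tags, then take at most 'count' characters of the remaining text.
    string nonTagText = MyRegex.Replace(html,"");
    return nonTagText.Substring(0, Math.Min(count, nonTagText.Length));
}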

If you want to keep the tags, you could do this:

void Main()
{
    string test = "<html><b>wowzers</b> description: none <div>description:a1fj391</div></html>";
    string result = getFirstChars(test, 15);
    Console.WriteLine(result);  

    //result: <html><b>wowzers</b> descrip
}

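// An optional tag (attributes allowed) followed by the text up to the next tag.
// The tag is optional so that text before the first tag is counted as well.
static Regex MyRegex = new Regex(
       "(?<tag></?\\s*\\w+[^>]*>)?(?<content>[^<]*)",
    RegexOptions.Compiled);

static string getFirstChars(string html, int count)
{
    int end = 0;            // index in html where the preview stops
    int contentCount = 0;   // non-tag characters seen so far
    foreach(Match match in MyRegex.Matches(html))
    {
        contentCount += match.Groups["content"].Length;
        end = match.Index + match.Length;
        if(contentCount >= count)
        {
            // Back up over the content characters beyond the limit.
            end -= contentCount - count;
            break;
        }
    }

    // Note: tags opened before the cut are left unclosed.
    return html.Substring(0, end);
}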
Marcel Valdez Orozco
  • But if you want to "keep the tags" in the output, it's more complex: you'd need to create 'tag groups' and include only the tags that made it into the char count. – Marcel Valdez Orozco Oct 25 '11 at 11:46
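
One way to read the comment above is to track which tags are still open when the character budget runs out and close them afterwards. Below is a rough sketch of that idea. The `HtmlPreview` class, the `Truncate` method, and the short list of void elements are made up for illustration, and it only aims at simple, well-formed markup: comments, entities, and stray `<` characters are not handled, which is a large part of why the commenters advise against regex for HTML.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

public static class HtmlPreview
{
    // Either a tag (attributes allowed) or a run of text between tags.
    private static readonly Regex Token = new Regex(
        @"(?<tag></?\s*(?<name>\w+)[^>]*>)|(?<text>[^<]+)",
        RegexOptions.Compiled);

    // Elements that never take a closing tag (illustrative, not exhaustive).
    private static readonly HashSet<string> VoidElements =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        { "br", "hr", "img", "input", "meta", "link" };

    public static string Truncate(string html, int count)
    {
        var output = new StringBuilder();
        var open = new Stack<string>();   // names of tags still open
        int remaining = count;            // text characters left to emit

        foreach (Match m in Token.Matches(html))
        {
            if (remaining <= 0) break;

            if (m.Groups["text"].Success)
            {
                // Copy as much text as the budget allows.
                int take = Math.Min(m.Length, remaining);
                output.Append(m.Value, 0, take);
                remaining -= take;
                continue;
            }

            // Tags are copied verbatim and do not count toward the limit.
            output.Append(m.Value);
            string name = m.Groups["name"].Value;
            bool closing = m.Value.StartsWith("</", StringComparison.Ordinal);
            bool selfClosing = m.Value.EndsWith("/>", StringComparison.Ordinal);

            if (closing)
            {
                if (open.Count > 0 &&
                    string.Equals(open.Peek(), name, StringComparison.OrdinalIgnoreCase))
                {
                    open.Pop();
                }
            }
            else if (!selfClosing && !VoidElements.Contains(name))
            {
                open.Push(name);
            }
        }

        // Close whatever was still open at the cut, innermost first.
        while (open.Count > 0)
        {
            output.Append("</").Append(open.Pop()).Append('>');
        }
        return output.ToString();
    }
}
```

With the second example input from the answer and a limit of 15, `HtmlPreview.Truncate` should return `<html><b>wowzers</b> descrip</html>`, so the preview stays balanced instead of ending inside an open `<html>` element.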