1

I have a field in my database that holds input from an HTML input, so the column data contains HTML markup. I need to extract this and display a short version of it as an intro, maybe even just the first paragraph if possible.

Kenyana

4 Answers

1

The Html Agility Pack is usually the recommended way to strip out the HTML. After that it would just be a matter of doing a String.Substring to get the bit of it that you want.

If you need the first 2000 words, you could either call IndexOf in a loop until it has found whitespace 2000 times, or walk the string character by character and count word starts. Either way you end up with the index to pass to Substring.

Edit: Add sample method

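// Requires: using System;
public int GetIndex(string str, int numberWanted)
{
    // Counts word starts (a letter, digit or punctuation character that
    // directly follows whitespace) and stops at the numberWanted-th one.
    int count = 0;
    int index = 1;
    for (; index < str.Length; index++)
    {
        if (char.IsWhiteSpace(str[index - 1]) &&
            (char.IsLetterOrDigit(str[index]) || char.IsPunctuation(str[index])))
        {
            count++;
            if (count >= numberWanted)
                break;
        }
    }
    // Clamp so an empty string does not produce an out-of-range index
    return Math.Min(index, str.Length);
}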

And call it like:

string wordList = "This is a list of a lot of words";
int i = GetIndex(wordList, 5);
string result = wordList.Substring(0, i);
Hans Olsson
  • I am using the HTML Agility pack and it does strip all the HTML, all I need now would be a code sample to loop thru the string and get the first 2000 words. – Kenyana Jul 05 '10 at 11:18
  • @Kenyana: Added a sample method for that with a sample for how to call it. Not sure if it's very efficient and might not count completely correctly but should at least give you an idea. – Hans Olsson Jul 05 '10 at 11:32
  • This is my sample code! I have it within a class which seems to strip and add back the html elements to display on page. But it doesn't limit to the words I want. NB: I got that code from another thread on this site. How do I post code samples on this page? – Kenyana Jul 05 '10 at 12:17
  • @Kenyana: Doesn't surprise me, I think if you asked a lot of people to do this many would come up with very similar code. Just post the code as text, but prefix each line with 4 spaces. There's a button in the editor that will do it for you if you select all the text first. – Hans Olsson Jul 05 '10 at 12:27
1

Something like this maybe?

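    public string Get(string text, int maxWordCount)
    {
        // Nothing to cut; also avoids an out-of-range start index below
        if (string.IsNullOrEmpty(text))
            return text;

        int wordCounter = 0;
        int stringIndex = 0;
        char[] delimiters = new[] { '\n', ' ', ',', '.' };

        while (wordCounter < maxWordCount)
        {
            // Find the next delimiter after the previous one
            stringIndex = text.IndexOfAny(delimiters, stringIndex + 1);
            if (stringIndex == -1)
                return text; // fewer words than requested

            ++wordCounter;
        }

        return text.Substring(0, stringIndex);
    }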

It's quite simplified and doesn't handle multiple delimiters in a row (for instance ", "), so each of them counts as a word boundary. You might just want to use space as the only delimiter.
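One way around the repeated-delimiter problem is to split the text and drop the empty entries instead of counting delimiter positions. This is only a minimal sketch: the `WordTruncator` class and `FirstWords` method are names made up for illustration, and it assumes that collapsing runs of whitespace into single spaces is acceptable for an intro text.

```csharp
using System;
using System.Linq;

public static class WordTruncator   // hypothetical helper, not from the answers above
{
    // Whitespace only, so punctuation stays attached to its word
    private static readonly char[] Delimiters = { ' ', '\t', '\r', '\n' };

    // Returns at most maxWordCount words, joined by single spaces.
    public static string FirstWords(string text, int maxWordCount)
    {
        if (string.IsNullOrEmpty(text) || maxWordCount <= 0)
            return string.Empty;

        // RemoveEmptyEntries discards the empty strings that runs of
        // delimiters would otherwise produce
        var words = text.Split(Delimiters, StringSplitOptions.RemoveEmptyEntries)
                        .Take(maxWordCount);

        return string.Join(" ", words);
    }
}
```

For example, `WordTruncator.FirstWords("one,  two   three four", 3)` returns `"one, two three"`. The trade-off is that the original spacing and line breaks are not preserved.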

If you want just the first paragraph, search for "\r\n\r\n" (two line breaks):

    public string GetFirstParagraph(string text)
    {
        int pos = text.IndexOf("\r\n\r\n");
        return pos == -1 ? text : text.Substring(0, pos);
    }
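Since the column in the question holds HTML, the first paragraph may be marked by a `<p>` element rather than by blank lines. A rough sketch of that case with the Html Agility Pack follows. The `IntroExtractor` class and `FirstParagraphText` method are hypothetical names, and the code assumes the markup uses `<p>` tags at all; when none is found it falls back to the text of the whole document.

```csharp
using HtmlAgilityPack;

public static class IntroExtractor   // hypothetical helper name
{
    // Returns the text of the first <p> element, or all text if there is none.
    public static string FirstParagraphText(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html ?? string.Empty);

        // SelectSingleNode returns null when the XPath matches nothing
        HtmlNode paragraph = document.DocumentNode.SelectSingleNode("//p");
        string text = paragraph != null
            ? paragraph.InnerText
            : document.DocumentNode.InnerText;

        // Decode entities such as &amp; and trim surrounding whitespace
        return HtmlEntity.DeEntitize(text).Trim();
    }
}
```

The result is plain text, so it can be passed on to any of the word-limiting methods in this thread.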

Edit:

A very simplistic way to strip HTML:

return Regex.Replace(text, @"<(.|\n)*?>", string.Empty);
jgauffin
0

I had the same problem and combined a few Stack Overflow answers into this class. It uses the HtmlAgilityPack which is a better tool for the job. Call:

 Words(string html, int n)

To get n words

using HtmlAgilityPack;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;


namespace UmbracoUtilities
{
    public class Text
    {
      /// <summary>
      /// Return the first n words in the html
      /// </summary>
      /// <param name="html"></param>
      /// <param name="n"></param>
      /// <returns></returns>
      public static string Words(string html, int n)
      {
        string words = html, n_words;

        words = StripHtml(html);
        n_words = GetNWords(words, n);

        return n_words;
      }


      /// <summary>
      /// Returns the first n words in text
      /// Assumes text is not a html string
      /// http://stackoverflow.com/questions/13368345/get-first-250-words-of-a-string
      /// </summary>
      /// <param name="text"></param>
      /// <param name="n"></param>
      /// <returns></returns>
      public static string GetNWords(string text, int n)
      {
        //remove multiple spaces
        //http://stackoverflow.com/questions/1279859/how-to-replace-multiple-white-spaces-with-one-white-space
        string cleanedString = System.Text.RegularExpressions.Regex.Replace(text, @"\s+", " ").Trim();

        // Trim() removes the leading space, so Take(n) yields exactly n words
        // (Take(n + 1) returned one word too many for text without a leading space)
        IEnumerable<string> words = cleanedString.Split(' ').Take(n);

        return string.Join(" ", words);
      }


      /// <summary>
      /// Returns a string of html with tags removed
      /// </summary>
      /// <param name="html"></param>
      /// <returns></returns>
      public static string StripHtml(string html)
      {
        HtmlDocument document = new HtmlDocument();
        document.LoadHtml(html);

        var root = document.DocumentNode;
        var stringBuilder = new StringBuilder();

        foreach (var node in root.DescendantsAndSelf())
        {
          if (!node.HasChildNodes)
          {
            string text = node.InnerText;
            if (!string.IsNullOrEmpty(text))
              stringBuilder.Append(" " + text.Trim());
          }
        }

        return stringBuilder.ToString();
      }



    }
}

Merry Christmas!

Petras
0

Once you have your string you would have to count your words. I assume space is a delimiter for words, so the following code should find the first 2000 words in a string (or break out if there are fewer words).

string myString = "la la la";
int lastPosition = 0;
for (int i = 0; i < 2000; i++)
{
    int position = myString.IndexOf(' ', lastPosition + 1);
    if (position == -1)
    {
        // Fewer than 2000 words: keep the whole string instead of dropping the last word
        lastPosition = myString.Length;
        break;
    }
    lastPosition = position;
}
string firstWords = myString.Substring(0, lastPosition);

You can change IndexOf to IndexOfAny to support more characters as delimiters.

Mikael Svenson