I have a database field that holds input from an HTML input, so the column data contains HTML. I need to extract it and display a short version of it as an intro, maybe even just the first paragraph if possible.
4 Answers
The Html Agility Pack is usually the recommended way to strip out the HTML. After that, it is just a matter of calling String.Substring to get the part you want.
If you need the first 2000 words, you could call IndexOf in a loop to find whitespace 2000 times, then use the resulting index in the call to Substring.
Edit: Add sample method
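    // Returns the index where word number (numberWanted + 1) starts, so that
    // Substring(0, index) holds the first numberWanted words.
    // Assumes the string starts with a word rather than with whitespace.
    public int GetIndex(string str, int numberWanted)
    {
        int count = 0;
        int index = 1;
        for (; index < str.Length; index++)
        {
            // A new word starts where whitespace is followed by a letter, digit or punctuation.
            if (char.IsWhiteSpace(str[index - 1]) &&
                (char.IsLetterOrDigit(str[index]) || char.IsPunctuation(str[index])))
            {
                count++;
                if (count >= numberWanted)
                    break;
            }
        }
        // Clamp so that an empty string does not produce an out-of-range index.
        return index > str.Length ? str.Length : index;
    }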
And call it like:
    string wordList = "This is a list of a lot of words";
    int i = GetIndex(wordList, 5);
    string result = wordList.Substring(0, i);

- I am using the HTML Agility pack and it does strip all the HTML, all I need now would be a code sample to loop through the string and get the first 2000 words. – Kenyana Jul 05 '10 at 11:18
- @Kenyana: Added a sample method for that with a sample for how to call it. Not sure if it's very efficient and might not count completely correctly but should at least give you an idea. – Hans Olsson Jul 05 '10 at 11:32
- This is my sample code! I have it within a class which seems to strip and add back the html elements to display on page. But it doesn't limit to the words I want. NB: I got that code from another thread on this site. How do I post code samples on this page? – Kenyana Jul 05 '10 at 12:17
- @Kenyana: Doesn't surprise me, I think if you asked a lot of people to do this many would come up with very similar code. Just post the code as text, but prefix each line with 4 spaces. There's a button in the editor that will do it for you if you select all the text first. – Hans Olsson Jul 05 '10 at 12:27
Something like this maybe?
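    public string Get(string text, int maxWordCount)
    {
        // Nothing to search in an empty string.
        if (string.IsNullOrEmpty(text))
            return text;

        int wordCounter = 0;
        int stringIndex = 0;
        char[] delimiters = new[] { '\n', ' ', ',', '.' };

        // Jump from delimiter to delimiter, counting one word per delimiter found.
        while (wordCounter < maxWordCount)
        {
            stringIndex = text.IndexOfAny(delimiters, stringIndex + 1);
            if (stringIndex == -1)
                return text; // fewer than maxWordCount words: return everything
            ++wordCounter;
        }

        return text.Substring(0, stringIndex);
    }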
It's quite simplified and doesn't handle multiple delimiters in a row (for instance ", "), because each delimiter is counted as one word. You might just want to use a space as the only delimiter.
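One way to avoid counting consecutive delimiters as separate words is to track whether the scan is currently inside a word. The following is only a minimal sketch of that idea, not part of the answer above: the class name `IntroText` and method name `FirstWords` are hypothetical, and it assumes that whitespace is the only delimiter, so punctuation stays attached to the word before it.

```csharp
using System;

public static class IntroText
{
    // Hypothetical helper: returns the prefix of text that contains at most
    // maxWordCount words, treating any run of whitespace as one separator.
    public static string FirstWords(string text, int maxWordCount)
    {
        if (string.IsNullOrEmpty(text) || maxWordCount <= 0)
            return string.Empty;

        int wordCount = 0;
        bool inWord = false;

        for (int i = 0; i < text.Length; i++)
        {
            if (char.IsWhiteSpace(text[i]))
            {
                // A word has just ended; stop once enough words have been seen.
                if (inWord && ++wordCount == maxWordCount)
                    return text.Substring(0, i);
                inWord = false;
            }
            else
            {
                inWord = true;
            }
        }

        // Fewer than maxWordCount complete words: return everything.
        return text;
    }
}
```

Because the original text is cut with Substring rather than rebuilt from split parts, the spacing inside the returned prefix is left exactly as it was in the input.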
If you want just the first paragraph, simply search for "\r\n\r\n" (two consecutive line breaks):
    public string GetFirstParagraph(string text)
    {
        int pos = text.IndexOf("\r\n\r\n");
        return pos == -1 ? text : text.Substring(0, pos);
    }
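The search for "\r\n\r\n" only matches Windows-style line endings. As a rough sketch, assuming the stripped text still separates paragraphs with a blank line, the line endings can be normalized first so that "\n\n" is handled as well. The names `ParagraphText` and `FirstParagraph` are hypothetical and used only for illustration.

```csharp
using System;

public static class ParagraphText
{
    // Hypothetical helper: returns the text up to the first blank line,
    // accepting both "\r\n" and "\n" line endings.
    public static string FirstParagraph(string text)
    {
        if (string.IsNullOrEmpty(text))
            return string.Empty;

        // Normalize Windows line endings and drop surrounding whitespace,
        // so leading blank lines are not mistaken for an empty paragraph.
        string normalized = text.Replace("\r\n", "\n").Trim();

        int pos = normalized.IndexOf("\n\n", StringComparison.Ordinal);
        return pos == -1 ? normalized : normalized.Substring(0, pos);
    }
}
```

Whether blank lines survive at all depends on how the HTML was stripped, so this only helps when the stripping step keeps paragraph breaks.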
Edit:
A very simplistic way to strip HTML:
    return Regex.Replace(text, @"<(.|\n)*?>", string.Empty);

I had the same problem and combined a few Stack Overflow answers into this class. It uses the HtmlAgilityPack, which is a better tool for the job. Call Words(string html, int n) to get the first n words.
    using HtmlAgilityPack;
    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    using System.Threading.Tasks;

    namespace UmbracoUtilities
    {
        public class Text
        {
            /// <summary>
            /// Return the first n words in the html
            /// </summary>
            /// <param name="html"></param>
            /// <param name="n"></param>
            /// <returns></returns>
            public static string Words(string html, int n)
            {
                string words = StripHtml(html);
                return GetNWords(words, n);
            }

            /// <summary>
            /// Returns the first n words in text
            /// Assumes text is not a html string
            /// http://stackoverflow.com/questions/13368345/get-first-250-words-of-a-string
            /// </summary>
            /// <param name="text"></param>
            /// <param name="n"></param>
            /// <returns></returns>
            public static string GetNWords(string text, int n)
            {
                //remove multiple spaces
                //http://stackoverflow.com/questions/1279859/how-to-replace-multiple-white-spaces-with-one-white-space
                // Trim as well, so a leading space does not produce an empty first "word".
                string cleanedString = System.Text.RegularExpressions.Regex.Replace(text, @"\s+", " ").Trim();
                IEnumerable<string> words = cleanedString.Split(' ').Take(n);
                return string.Join(" ", words);
            }

            /// <summary>
            /// Returns a string of html with tags removed
            /// </summary>
            /// <param name="html"></param>
            /// <returns></returns>
            public static string StripHtml(string html)
            {
                HtmlDocument document = new HtmlDocument();
                document.LoadHtml(html);
                var root = document.DocumentNode;
                var stringBuilder = new StringBuilder();
                // Only leaf nodes carry text of their own; collect it separated by spaces.
                foreach (var node in root.DescendantsAndSelf())
                {
                    if (!node.HasChildNodes)
                    {
                        string text = node.InnerText;
                        if (!string.IsNullOrEmpty(text))
                            stringBuilder.Append(" " + text.Trim());
                    }
                }
                return stringBuilder.ToString();
            }
        }
    }
Merry Christmas!

Once you have your string, you have to count the words. I assume a space is the delimiter between words, so the following code finds the first 2000 words in a string (or keeps the whole string if there are fewer words).
    string myString = "la la la";
    int lastPosition = 0;
    for (int i = 0; i < 2000; i++)
    {
        int position = myString.IndexOf(' ', lastPosition + 1);
        if (position == -1)
        {
            // Fewer than 2000 words: keep the whole string, including the last word.
            lastPosition = myString.Length;
            break;
        }
        lastPosition = position;
    }
    string firstWords = myString.Substring(0, lastPosition);
You can change IndexOf to IndexOfAny to support more characters as delimiters.
