Extract meaningful words from text

Question

I want to extract all the meaningful words from HTML page content using regex in C#, as a tokenization step. This is what I have so far, but the output still contains garbage. How can I fix it?

    //Remove Html tags
    content = Regex.Replace(content, @"<.*?>", " ");

    //Decode Html characters
    content = HttpUtility.HtmlDecode(content);

    //Remove everything but letters, numbers and whitespace characters
    content = Regex.Replace(content, @"[^\w\s]", string.Empty);

    //Remove multiple whitespace characters
    content = Regex.Replace(content, @"\s+", " ");

    //remove any digits
    content = Regex.Replace(content, @"[\d-]", " ");

    //remove words less than 2 and more than 20 length
    content = Regex.Replace(content, @"\b\w{2,20}\b", string.Empty);

Your last regex is inverted. It's removing every 2-20 letter word. — Phylogenesis, Apr 13 '15 at 23:41
To extract all meaningful words from web pages in order to build a simple search engine. — Wael, Apr 13 '15 at 23:41
Again, what do you mean by "Meaningful"? Do you have a specific set of words that you are looking for? — Ben, Apr 13 '15 at 23:42
I need all words between 2 and 20 characters long. What should I do to get this? @Phylogenesis — Wael, Apr 13 '15 at 23:43
No, I want all words that have a meaning, to build a simple database of all the words in specific web pages. This is the first step in making a search engine, called "tokenization". @Ben — Wael, Apr 13 '15 at 23:46
So you're looking for any word that might have a definition? Like "cat" or "dog"? You want to pass over slang words like "lol" and "ttyl"? — Ben, Apr 13 '15 at 23:50
You probably want to change the last regex to `\b(\w|\w{21,})\b`. — Phylogenesis, Apr 13 '15 at 23:53
No, those words are OK. The main purpose is to discard special characters, tags, numbers, etc.; I only need words. @Ben — Wael, Apr 13 '15 at 23:53
@Phylogenesis Does this run OK? I ran it with `1cat 2dog` and got `_cat__dog` (the underscores indicate blank spaces). — Ben, Apr 13 '15 at 23:55
Regex is designed to process regular languages, and HTML is not regular, so it is generally advised not to use regex to process HTML; however, I am unaware of a better mechanism. If you are interested, read more at this question: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — Des Horsley, Apr 14 '15 at 01:19

score 1 · Accepted Answer · answered Apr 14 '15 at 00:40

Using a RegEx for HTML processing is usually more trouble than it's worth. Grab the HtmlAgilityPack and use that to walk through the HTML DOM extracting any content inside text nodes. You could use something similar to the class below to gather up all of the text blocks in an HTML string.

using System.Collections.Generic;
using HtmlAgilityPack;

public sealed class HtmlTextExtractor
{
    private readonly string m_html;

    public HtmlTextExtractor(string html)
    {
        m_html = html;
    }

    public IEnumerable<string> GetTextBlocks()
    {
        var htmlDocument = new HtmlDocument();
        htmlDocument.LoadHtml(m_html);

        var text = new List<string>();
        WalkNode(htmlDocument.DocumentNode, text);

        return text;
    }

    private void WalkNode(HtmlNode node, List<string> text)
    {
        switch (node.NodeType)
        {
            case HtmlNodeType.Comment:
                break; // Exclude comments?

            case HtmlNodeType.Document:
            case HtmlNodeType.Element:
                {
                    if (node.HasChildNodes)
                    {
                        foreach (var childNode in node.ChildNodes)
                            WalkNode(childNode, text);
                    }
                }
                break;

            case HtmlNodeType.Text:
                {
                    var html = ((HtmlTextNode)node).Text;
                    if (html.Length <= 0)
                        break;

                    var cleanHtml = HtmlEntity.DeEntitize(html).Trim();
                    if (!string.IsNullOrEmpty(cleanHtml))
                        text.Add(cleanHtml);
                }
                break;
        }
    }
}
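
One caveat with walking every text node: HtmlAgilityPack, as far as I know, also exposes the contents of `<script>` and `<style>` elements as text, so JavaScript and CSS can end up in the word list. Below is a minimal sketch of stripping those elements before the walk. `HtmlNoiseRemover` and `RemoveScriptAndStyle` are names made up for this example, and the choice of which elements count as noise is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper (not part of HtmlAgilityPack).
public static class HtmlNoiseRemover
{
    // Assumption: only script and style are treated as non-content here.
    private static readonly HashSet<string> NoiseElements =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "script", "style" };

    public static void RemoveScriptAndStyle(HtmlDocument document)
    {
        // Snapshot with ToList() first so nodes are not removed
        // while the descendant sequence is still being enumerated.
        var noiseNodes = document.DocumentNode.Descendants()
            .Where(node => NoiseElements.Contains(node.Name))
            .ToList();

        foreach (var node in noiseNodes)
            node.Remove();
    }
}
```

In `GetTextBlocks()` this would be called on `htmlDocument` right after `LoadHtml` and before `WalkNode`.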

You can then focus on splitting/tokenizing the text after that.

var extractor = new HtmlTextExtractor(html);
var textBlocks = extractor.GetTextBlocks();

var words = new List<string>();
foreach (var textBlock in textBlocks)
{
    // A null separator array splits on any whitespace (spaces, tabs, newlines)
    words.AddRange(textBlock.Split((char[])null, StringSplitOptions.RemoveEmptyEntries));
}

// Requires: using System.Linq;
// Keep words of 2 to 20 characters; CleanWord never returns null
var distinctWords = words.Select(word => CleanWord(word))
    .Where(word => word.Length >= 2 && word.Length <= 20)
    .Distinct()
    .OrderBy(word => word);

And finally cleaning up individual words or tokens.

public string CleanWord(string word)
{
    //Remove everything but letters, digits, underscores and whitespace characters
    word = Regex.Replace(word, @"[^\w\s]", string.Empty);

    //Collapse multiple whitespace characters into a single space
    word = Regex.Replace(word, @"\s+", " ");

    //Remove digits and underscores; replacing them with a space would
    //leave a space inside the token (e.g. "ab3cd" -> "ab cd")
    word = Regex.Replace(word, @"[\d_]", string.Empty);

    return word.Trim();
}

Obviously, this is about the simplest implementation imaginable. It is very primitive: it won't work well for languages that don't separate words with spaces, it doesn't handle punctuation well, and so on, but it should give you an idea of the individual parts. You can look at libraries such as Lucene.NET to improve your tokenization, and there are probably many more libraries available if you want to improve the implementation.
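
Since the question asked about regex: once the text is out of the HTML, it may be simpler to match the words you want than to delete everything you don't want. Here is a rough sketch, assuming a "word" is a run of 2 to 20 Unicode letters. `WordTokenizer` and `Tokenize` are illustrative names, not from any library, and lowercasing the tokens is my own assumption about what a search index would want.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, shown only to illustrate the match-based approach.
public static class WordTokenizer
{
    // \p{L} matches any Unicode letter. The lookarounds require that the run
    // is not part of a longer run of letters, so 21+ letter words are skipped
    // entirely instead of being cut into pieces.
    private static readonly Regex WordPattern =
        new Regex(@"(?<!\p{L})\p{L}{2,20}(?!\p{L})", RegexOptions.Compiled);

    public static IEnumerable<string> Tokenize(string text)
    {
        return WordPattern.Matches(text)
            .Cast<Match>()
            .Select(match => match.Value.ToLowerInvariant())
            .Distinct()
            .OrderBy(word => word, StringComparer.Ordinal);
    }
}
```

With this pattern, digits and punctuation simply act as separators, so `1cat 2dog` yields `cat` and `dog`. The trade-off is that contractions such as `don't` are split at the apostrophe, leaving only `don`.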

Thank you, that works great! I also found a useful class to remove unwanted words such as 'the' or 'a'. https://www.dotnetperls.com/stopword-dictionary — Yovav, Feb 21 '17 at 05:49
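
For reference, the stop-word filtering mentioned in the comment above can be sketched with a `HashSet<string>` lookup. This is not the class from the linked page; `StopWordFilter`, `RemoveStopWords` and the tiny word list are assumptions made for illustration, and a real list would be much longer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not the class from the linked article.
public static class StopWordFilter
{
    // Tiny illustrative list; case-insensitive so "The" and "the" both match.
    private static readonly HashSet<string> StopWords =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "a", "an", "and", "of", "the", "to"
        };

    public static IEnumerable<string> RemoveStopWords(IEnumerable<string> words)
    {
        // Keep only the words that are not in the stop-word set.
        return words.Where(word => !StopWords.Contains(word));
    }
}
```

It could be applied to `distinctWords` from the answer as `StopWordFilter.RemoveStopWords(distinctWords)`.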
