I want to extract all the meaningful words from the content of an HTML page using regex in C#, as a tokenization step. This is what I have so far, but the output still contains garbage. How can I fix it?

    // Remove HTML tags
    content = Regex.Replace(content, @"<.*?>", " ");

    // Decode HTML characters
    content = HttpUtility.HtmlDecode(content);

    // Remove everything but letters, numbers and whitespace characters
    content = Regex.Replace(content, @"[^\w\s]", string.Empty);

    // Remove multiple whitespace characters
    content = Regex.Replace(content, @"\s+", " ");

    // Remove any digits
    content = Regex.Replace(content, @"[\d-]"," ");

    // Remove words shorter than 2 or longer than 20 characters
    content = Regex.Replace(content, @"\b\w{2,20}\b", string.Empty);
Wael
  • Meaningful for whom / what purpose? – Alex Apr 13 '15 at 23:40
  • Your last regex is inverted. It's removing every 2-20 letter word. – Phylogenesis Apr 13 '15 at 23:41
  • To extract all the meaningful words from web pages in order to build a simple search engine – Wael Apr 13 '15 at 23:41
  • Again, what do you mean by "Meaningful"? Do you have a specific set of words that you are looking for? – Ben Apr 13 '15 at 23:42
  • I need all words between 2 and 20 characters long. What should I do to get that? @Phylogenesis – Wael Apr 13 '15 at 23:43
  • No, I want all words that have a meaning, to build a simple database of all the words in specific web pages. This is the first step in making a search engine, called "tokenization" @Ben – Wael Apr 13 '15 at 23:46
  • So you're looking for any word that might have a definition? Like "cat" or "dog"? You want to pass over slang words like "lol" and "ttyl"? – Ben Apr 13 '15 at 23:50
  • You probably want to change the last regex to `\b(\w|\w{21,})\b`. – Phylogenesis Apr 13 '15 at 23:53
  • No, those words are OK. The main purpose is to ignore special characters, tags, numbers, etc. I only need words @Ben – Wael Apr 13 '15 at 23:53
  • @Phylogenesis does this run OK? I ran it with `1cat 2dog` and got an outcome of `_cat__dog`... (the underscores indicate blank spaces) – Ben Apr 13 '15 at 23:55
  • Regex is meant for processing regular languages, and HTML is not regular, so it is generally advised not to use regex to process HTML; however, I am unaware of a better mechanism. If you are interested, read more at this question: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Des Horsley Apr 14 '15 at 01:19
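
Putting the comments above together, here is a minimal sketch of the question's pipeline with the last step changed as Phylogenesis suggests, so that it removes one-character words and words longer than 20 characters rather than the words to keep. The names `RegexCleaner` and `Clean` are made up for illustration. Two further changes are assumptions of this sketch and not part of the original code: `WebUtility.HtmlDecode` stands in for `HttpUtility.HtmlDecode` to avoid a System.Web reference, and the whitespace collapse is moved to the end so the later replacements do not leave gaps.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class RegexCleaner // hypothetical name, not from the question
{
    public static string Clean(string content)
    {
        // Remove HTML tags (crude; regex is not a real HTML parser)
        content = Regex.Replace(content, @"<.*?>", " ");

        // Decode HTML entities such as &amp;
        content = WebUtility.HtmlDecode(content);

        // Remove everything but word characters and whitespace
        content = Regex.Replace(content, @"[^\w\s]", string.Empty);

        // Replace digits with a space
        content = Regex.Replace(content, @"\d", " ");

        // Remove one-character words and words longer than 20 characters
        content = Regex.Replace(content, @"\b(\w|\w{21,})\b", " ");

        // Collapse whitespace last, then trim the ends
        return Regex.Replace(content, @"\s+", " ").Trim();
    }
}
```

Calling `RegexCleaner.Clean(html)` returns a single space-separated string; splitting it on `' '` gives the candidate tokens.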

1 Answer

Using a RegEx for HTML processing is usually more trouble than it's worth. Grab the HtmlAgilityPack and use that to walk through the HTML DOM extracting any content inside text nodes. You could use something similar to the class below to gather up all of the text blocks in an HTML string.

using System.Collections.Generic;
using HtmlAgilityPack;

public sealed class HtmlTextExtractor
{
    private readonly string m_html;

    public HtmlTextExtractor(string html)
    {
        m_html = html;
    }

    // Returns the trimmed, entity-decoded contents of every non-empty text node.
    public IEnumerable<string> GetTextBlocks()
    {
        var htmlDocument = new HtmlDocument();
        htmlDocument.LoadHtml(m_html);

        var text = new List<string>();
        WalkNode(htmlDocument.DocumentNode, text);

        return text;
    }

    // Depth-first walk of the DOM, collecting text nodes into the list.
    private void WalkNode(HtmlNode node, List<string> text)
    {
        switch (node.NodeType)
        {
            case HtmlNodeType.Comment:
                break; // Exclude comments?

            case HtmlNodeType.Document:
            case HtmlNodeType.Element:
                {
                    if (node.HasChildNodes)
                    {
                        foreach (var childNode in node.ChildNodes)
                            WalkNode(childNode, text);
                    }
                }
                break;

            case HtmlNodeType.Text:
                {
                    var html = ((HtmlTextNode)node).Text;
                    if (html.Length <= 0)
                        break;

                    // Decode entities such as &amp; and drop surrounding whitespace
                    var cleanHtml = HtmlEntity.DeEntitize(html).Trim();
                    if (!string.IsNullOrEmpty(cleanHtml))
                        text.Add(cleanHtml);
                }
                break;
        }
    }
}

You can then focus on splitting/tokenizing the text after that.

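// Requires: using System; using System.Collections.Generic; using System.Linq;
var extractor = new HtmlTextExtractor(html);
var textBlocks = extractor.GetTextBlocks();

var words = new List<string>();
foreach (var textBlock in textBlocks)
{
    // Split on spaces, tabs and line breaks, since text nodes often span several lines
    words.AddRange(textBlock.Split(new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries));
}

// Keep words of 2 to 20 characters, as asked in the question
var distinctWords = words.Select(word => CleanWord(word))
    .Where(word => word.Length >= 2 && word.Length <= 20)
    .Distinct()
    .OrderBy(word => word);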

And finally cleaning up individual words or tokens.

// Requires: using System.Text.RegularExpressions;
public static string CleanWord(string word)
{
    // Remove everything except word characters (letters, digits, underscore) and whitespace
    word = Regex.Replace(word, @"[^\w\s]", string.Empty);

    // Collapse runs of whitespace into a single space
    word = Regex.Replace(word, @"\s+", " ");

    // Remove digits and underscores outright, so no space ends up inside a token
    word = Regex.Replace(word, @"[\d_]", string.Empty);

    return word.Trim();
}

Obviously this is about the simplest implementation imaginable. It is extremely primitive: it won't work well for languages that don't separate words with spaces, it doesn't handle punctuation well, and so on, but it should give you an idea of the individual parts. You can look at libraries like Lucene.NET to improve your tokenization, and there are probably plenty of others available if you want to improve the implementation.
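
As one small step up from splitting on spaces, without bringing in Lucene.NET, you could match the words directly instead of stripping out everything else. The sketch below is only an illustration under a few assumptions: `SimpleTokenizer` and `Tokenize` are hypothetical names, a "word" is taken to be a run of 2 to 20 Unicode letters, and tokens are lowercased so that "Cat" and "cat" count as the same word.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class SimpleTokenizer // hypothetical name
{
    // A run of 2 to 20 letters; the lookarounds reject pieces of longer letter runs
    private static readonly Regex WordPattern =
        new Regex(@"(?<!\p{L})\p{L}{2,20}(?!\p{L})", RegexOptions.Compiled);

    public static IEnumerable<string> Tokenize(IEnumerable<string> textBlocks)
    {
        return textBlocks
            // Find every word-like match in every text block
            .SelectMany(block => WordPattern.Matches(block).Cast<Match>())
            // Normalize case so duplicates collapse regardless of capitalization
            .Select(match => match.Value.ToLowerInvariant())
            .Distinct()
            .OrderBy(word => word, StringComparer.Ordinal);
    }
}
```

With the extractor above, `SimpleTokenizer.Tokenize(extractor.GetTextBlocks())` would take the place of the split, clean and filter steps. Digits and punctuation act as separators here, so `2dogs` yields `dogs` and `don't` yields only `don`.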

ScheuNZ
  • Thank you, that works great! I also found a useful class to remove unwanted words such as 'the' or 'a'. https://www.dotnetperls.com/stopword-dictionary – Yovav Feb 21 '17 at 05:49