I am looking for a good method to extract relevant keywords from the text on a page using SQL or C#. I intend to use these keywords to link to other parts of the website, so that readers can navigate to related content. This seems pretty common on blogs.
Who determines what the keywords are? Is this some predefined list? – Martin Smith Feb 13 '11 at 15:37
2 Answers
One simple approach might be to download the page into memory using C#, strip out HTML tags, JavaScript, etc. (i.e. identify the real content), and break what is left into individual words. Then filter those words against a list of words that appear with high frequency in any generic written document (stop words such as "the" and "and"), count how often each remaining word occurs in the document, and take the most frequent words as the keywords.
You would need to develop your filtered word list over time.
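The steps above could be sketched roughly as follows. This is only a minimal illustration under several assumptions: the page HTML is already in a string, tags are stripped with a crude regular expression rather than a real HTML parser, and `KeywordExtractor`, `TopKeywords` and the tiny `StopWords` list are hypothetical names and placeholder data, not an established API.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Text.RegularExpressions;

public static class KeywordExtractor
{
    // Tiny illustrative stop-word list (an assumption); a real list would be
    // much longer and would grow over time.
    private static readonly HashSet<string> StopWords =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "a", "an", "and", "the", "of", "to", "in", "is", "it",
            "for", "on", "with", "that", "this"
        };

    public static List<string> TopKeywords(string html, int count)
    {
        // Drop script/style blocks first, then any remaining tags.
        // Crude: a proper HTML parser would be more robust.
        string text = Regex.Replace(html,
            @"<(script|style)\b[^>]*>.*?</\1\s*>", " ",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        text = Regex.Replace(text, @"<[^>]+>", " ");
        text = WebUtility.HtmlDecode(text); // turn &amp; etc. back into characters

        return Regex.Matches(text.ToLowerInvariant(), @"[a-z]+")
            .Cast<Match>()
            .Select(m => m.Value)
            .Where(w => w.Length > 2 && !StopWords.Contains(w)) // skip short and generic words
            .GroupBy(w => w)                                    // one group per distinct word
            .OrderByDescending(g => g.Count())                  // most frequent first
            .ThenBy(g => g.Key, StringComparer.Ordinal)         // stable order for ties
            .Take(count)
            .Select(g => g.Key)
            .ToList();
    }
}
```

Calling `KeywordExtractor.TopKeywords(pageHtml, 5)` would give the five most frequent non-stop words. Note that the `[a-z]+` pattern only handles unaccented English words and that nothing here stems plurals, so "belt" and "belts" are counted separately.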
Depending on your domain, it might be more appropriate to go about this the opposite way: build up a list of domain-specific keywords (or groups of keywords, so that "seatbelt" and "safety belt" etc. would be recognised as the same keyword), and count how many times each keyword or keyword group appears in a given document. Those above a certain threshold, or the top 5 or so, would be the keywords associated with that document.
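A minimal sketch of this keyword-group idea, assuming the plain text has already been extracted from the page. `DomainKeywordMatcher`, `Match` and the shape of the `groups` dictionary (canonical keyword mapped to its variants) are hypothetical choices made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class DomainKeywordMatcher
{
    // groups: canonical keyword -> variants that should count as that keyword.
    // Returns the canonical keywords whose combined count reaches the threshold.
    public static List<string> Match(
        string text, IDictionary<string, string[]> groups, int threshold)
    {
        var counts = new Dictionary<string, int>();
        foreach (var group in groups)
        {
            int total = 0;
            foreach (string variant in group.Value)
            {
                // \b restricts matches to whole words or phrases, so plurals
                // such as "seatbelts" need their own variant entry.
                string pattern = @"\b" + Regex.Escape(variant) + @"\b";
                total += Regex.Matches(text, pattern, RegexOptions.IgnoreCase).Count;
            }
            counts[group.Key] = total;
        }

        return counts
            .Where(c => c.Value >= threshold)
            .OrderByDescending(c => c.Value)
            .ThenBy(c => c.Key, StringComparer.Ordinal)
            .Select(c => c.Key)
            .ToList();
    }
}
```

For example, with a group `"seatbelt"` containing the variants `"seatbelt"` and `"safety belt"`, a document that mentions either phrase three times in total would be tagged with `seatbelt` at a threshold of 2. Replacing the `Where` filter with `Take(5)` would give the "top 5" variant instead.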

There's a good, informative answer from Joseph Turian to a more general version of this question: How do I extract keywords used in text?