Algorithm to find keywords and keyphrases in a string

Question

I need advice or directions on how to write an algorithm which will find keywords or keyphrases in a string.

The string contains:

Technical information written in English (GB)
Words are mostly separated by spaces
A keyword does not contain a space but it may contain a hyphen, apostrophe, colon etc.
A keyphrase may contain a space, a comma or other punctuation
If two or more keywords appear together then it is likely a keyphrase e.g. "inverter drive"
The text also contains HTML but this can be removed beforehand if necessary
Non-keywords would be words like "and", "the", "we", "see", "look" etc.
Keywords are case-insensitive e.g. "Inverter" and "inverter" are the same keyword

The algorithm has the following requirements:

Operate in a batch-processing scenario e.g. run once or twice a day
Process strings varying in length from roughly 200 to 7000 characters
Process 1000 strings in less than 1 hour
Will execute on a server with moderately good power
Written in one of the following: C#, VB.NET or T-SQL, or maybe even F#, Python or Lua etc.
Does not rely on a list of predefined keywords or keyphrases
But can rely on a list of keyword exclusions e.g. "and", "the", "go" etc.
Ideally transferable to other languages, i.e. doesn't rely on language-specific features such as metaprogramming
Output a list of keyphrases (descending order of frequency) followed by a list of keywords (descending order of frequency)

It would be extra cool if it could process up to 8000 characters in a matter of seconds, so that it could be run in real-time, but I'm already asking enough!

Just looking for advice and directions:

Should this be regarded as two separate algorithms?
Are there any established algorithms which I could follow?
Are my requirements feasible?

Many thanks.

P.S. The strings will be retrieved from a SQL Server 2008 R2 database, so ideally the language would have support for this, if not then it must be able to read/write to STDOUT, a pipe, a stream or a file etc.

you might want to look into MSSQL full text searching? - http://blog.sqlauthority.com/2008/09/05/sql-server-creating-full-text-catalog-and-index/ - http://msdn.microsoft.com/en-us/library/ms142571.aspx - it may or may not be able to do exactly what you want, but I would spend a few hours with it to see — house9, Jun 12 '12 at 22:25
For clarification, is point 8 in your list talking about spoken languages or programming languages? — 3Pi, Jun 12 '12 at 22:31
Thanks for pointing out that ambiguity, I am talking about programming languages. — Chris Cannon, Jun 12 '12 at 22:34
@house9 I can see that full-text search would enable me to identify keywords, but I can't see how it would enable me to weight those keywords. — Chris Cannon, Jun 12 '12 at 22:35
I think something like this isn't that difficult a program to write. My first impression would be to write it in a language that already has good string comparison routines, and is also fairly easy to multi-thread. My first thought is JAVA only because the threading model is pretty easy. Anyway, I don't see any reason why this couldn't be possible. — trumpetlicks, Jun 12 '12 at 23:30
See this question, just a bit different, and with php in mind. http://stackoverflow.com/questions/10721836/keyword-analysis-in-php — Matthew Vines, Jun 12 '12 at 22:23

Olivier Jacot-Descombes · Accepted Answer · 2018-08-19T16:12:34.173

The logic involved makes this awkward to implement in T-SQL, so choose a language like C#. Start with a simple desktop application. Later, if you find that loading all the records into this application is too slow, you could write a C# (CLR) stored procedure that is executed on the SQL Server. Depending on the server's security policy, the assembly may need to be signed with a strong name key.

Now to the algorithm. A list of excluded words is commonly called a stop word list. Searching for this term should turn up stop word lists you can start with. Add these stop words to a HashSet<T> (I'll be using C# here):

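// Requires: using System; using System.Collections.Generic; using System.IO;
// Assuming that each line contains one stop word.
// The path is a verbatim string (@"...") so that the backslash is not read as an escape.
HashSet<string> stopWords =
    new HashSet<string>(File.ReadLines(@"C:\stopwords.txt"), StringComparer.OrdinalIgnoreCase);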

Later you can check whether a keyword candidate is in the stop word list with

if (!stopWords.Contains(candidate)) {
    // We have a keyword
}

HashSets are fast. Lookups take O(1) time on average, meaning that the time required to do a lookup does not depend on the number of items the set contains.

Looking for the keywords can easily be done with Regex.

string text = ...; // Load text from DB
MatchCollection matches = Regex.Matches(text, "[a-z]([:']?[a-z])*",
                                        RegexOptions.IgnoreCase);
foreach (Match match in matches) {
    if (!stopWords.Contains(match.Value)) {
        ProcessKeyword(match.Value); // Do whatever you need to do here
    }
}
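The question asks for keywords listed in descending order of frequency, so one possible shape for ProcessKeyword is a counter. The following is only a minimal sketch: the class name KeywordCounter and the method CountKeywords are made up for illustration and are not part of the original answer. It reuses the same regular expression and assumes the stop word set was created with a case-insensitive comparer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class KeywordCounter
{
    // Hypothetical helper (an assumption, not from the answer): counts every
    // non-stop word in the text and returns the counts, most frequent first.
    public static List<KeyValuePair<string, int>> CountKeywords(
        string text, ISet<string> stopWords)
    {
        // Case-insensitive keys, so "Inverter" and "inverter" share one counter.
        var counts = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);

        foreach (Match match in Regex.Matches(text, "[a-z]([:']?[a-z])*",
                                              RegexOptions.IgnoreCase))
        {
            if (stopWords.Contains(match.Value)) continue;

            counts.TryGetValue(match.Value, out int n); // n stays 0 if not seen yet
            counts[match.Value] = n + 1;
        }

        // Descending frequency; ties are ordered alphabetically for stable output.
        return counts
            .OrderByDescending(pair => pair.Value)
            .ThenBy(pair => pair.Key, StringComparer.OrdinalIgnoreCase)
            .ToList();
    }
}
```

Calling KeywordCounter.CountKeywords(text, stopWords) returns key/value pairs that can be written back to the database or printed. A dictionary keeps the whole pass linear in the length of the text, which should sit comfortably within the stated limit of 1000 strings per hour.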

If you find that a-z is too restrictive and you need accented letters, you can change the regular expression to @"\p{L}([:']?\p{L})*". The character class \p{L} matches any Unicode letter, including modifier letters.

The phrases are more complicated. You could try to split the text into phrases first and then apply the keyword search on these phrases instead of searching the keywords in the whole text. This would give you the number of keywords in a phrase at the same time.

Splitting the text into phrases involves searching for sentences ending with "." or "?" or "!" or ":". You should exclude dots and colons that appear within a word.

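// (?:...) is a non-capturing group; with a capturing group, Split would also return the matched whitespace.
string[] phrases = Regex.Split(text, @"[\.\?!:](?:\s|$)");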

This searches for punctuation followed by either whitespace or the end of the text. I must admit that this is not perfect: it might erroneously treat abbreviations as sentence ends. You will have to experiment in order to refine the splitting mechanism.
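To tie the phrase splitting back to the question's rule that two or more keywords appearing together are likely a keyphrase, here is a rough sketch. It is an illustration under assumptions rather than an established algorithm: KeyphraseFinder, CountKeyphrases and Flush are hypothetical names, a stop word or a sentence end is assumed to terminate a keyphrase, and punctuation inside a sentence (such as a comma) is ignored. The split pattern uses a non-capturing group because Regex.Split otherwise includes captured text in its result.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class KeyphraseFinder
{
    // Hypothetical helper (an assumption, not from the answer): treats each run
    // of two or more consecutive non-stop words in a sentence as a keyphrase.
    public static List<KeyValuePair<string, int>> CountKeyphrases(
        string text, ISet<string> stopWords)
    {
        var counts = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);

        // Non-capturing group, so Split returns only the sentences themselves.
        foreach (string sentence in Regex.Split(text, @"[.?!:](?:\s|$)"))
        {
            var run = new List<string>();
            foreach (Match m in Regex.Matches(sentence, "[a-z]([:']?[a-z])*",
                                              RegexOptions.IgnoreCase))
            {
                if (stopWords.Contains(m.Value))
                {
                    Flush(run, counts); // a stop word ends the current run
                }
                else
                {
                    run.Add(m.Value);
                }
            }
            Flush(run, counts); // the end of the sentence ends the run too
        }

        // Descending frequency; ties are ordered alphabetically for stable output.
        return counts
            .OrderByDescending(pair => pair.Value)
            .ThenBy(pair => pair.Key, StringComparer.OrdinalIgnoreCase)
            .ToList();
    }

    // Counts the run as a keyphrase if it has at least two words, then clears it.
    private static void Flush(List<string> run, Dictionary<string, int> counts)
    {
        if (run.Count >= 2)
        {
            string phrase = string.Join(" ", run);
            counts.TryGetValue(phrase, out int n); // n stays 0 if not seen yet
            counts[phrase] = n + 1;
        }
        run.Clear();
    }
}
```

With the stop words "the", "is", "we" and "and", the text "The inverter drive is faulty. We replaced the Inverter Drive and the motor." yields the single keyphrase "inverter drive" with a count of 2. Long runs of non-stop words become one long phrase in this sketch, so a maximum phrase length may be worth adding after some experimentation.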

But doesn't the above code rely on a pre-defined list of keywords "matches"? I was hoping that the keywords would be worked out through the algorithm. — Chris Cannon, Jun 13 '12 at 08:06
No, it excludes non-keywords, the so-called stop words. Everything that is not a stop word is a keyword. — Olivier Jacot-Descombes, Jun 13 '12 at 11:59
You could add additional conditions, like "a keyword must have a minimum length of 3 characters", for example. You also might want to restrict keywords to be nouns. Have a look at the open source C# project [SharpNLP](http://sharpnlp.codeplex.com) on CodePlex and its description [Statistical parsing of English sentences](http://www.codeproject.com/Articles/12109/Statistical-parsing-of-English-sentences) on The Code Project. — Olivier Jacot-Descombes, Jun 13 '12 at 12:54
How would you apply case insensitivity to the HashSet without adding every possible variation? For example, the word "multiply" could be: Multiply, multiply, mUltiply, MulTiply -- 256 variations (2 to the power of the number of letters). The simplest solution, I think, would be to add every word in lowercase, then look up the lowercased candidate in the hash set with an O(1) lookup? — Kraang Prime, Dec 14 '16 at 20:42
I made the HashSet case insensitive by creating it with `new HashSet<string>(StringComparer.OrdinalIgnoreCase);` — Olivier Jacot-Descombes, Dec 14 '16 at 21:06
