I need advice or directions on how to write an algorithm which will find keywords or keyphrases in a string.
The string has the following characteristics:
- Technical information written in English (GB)
- Words are mostly separated by spaces
- A keyword does not contain a space but it may contain a hyphen, apostrophe, colon etc.
- A keyphrase may contain a space, a comma or other punctuation
- If two or more keywords appear together then they are likely to form a keyphrase, e.g. "inverter drive"
- The text also contains HTML but this can be removed beforehand if necessary
- Non-keywords would be words like "and", "the", "we", "see", "look" etc.
- Keywords are case-insensitive e.g. "Inverter" and "inverter" are the same keyword
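To make the rules above concrete, here is a rough Python sketch of how candidates might be pulled out of a string. Everything named in it (`STOPWORDS`, `WORD_RE`, `strip_html`, `extract_candidates`) is made up for illustration, the exclusion list is far too short to be useful, and treating any run of consecutive non-excluded words as one candidate is only a simplifying assumption. It also ignores sentence boundaries, so a run can span a full stop.

```python
import re
from html.parser import HTMLParser

# Assumed exclusion list; a real one would be much longer.
STOPWORDS = {"and", "the", "we", "see", "look", "a", "is", "to", "of"}

# A word: letters/digits, optionally joined by a hyphen, apostrophe or colon.
WORD_RE = re.compile(r"[A-Za-z0-9]+(?:[-':][A-Za-z0-9]+)*")


class _TextOnly(HTMLParser):
    """Collects text nodes and drops the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(text):
    """Return the text content of an HTML fragment."""
    parser = _TextOnly()
    parser.feed(text)
    parser.close()
    return " ".join(parser.parts)


def extract_candidates(text):
    """Return runs of consecutive non-excluded words, lower-cased."""
    candidates, run = [], []
    for word in WORD_RE.findall(strip_html(text).lower()):
        if word in STOPWORDS:
            # An excluded word ends the current run, if there is one.
            if run:
                candidates.append(run)
            run = []
        else:
            run.append(word)
    if run:
        candidates.append(run)
    return candidates


print(extract_candidates("<p>We see the Inverter drive and the three-phase motor</p>"))
```

A run of one word would be a candidate keyword and a run of two or more a candidate keyphrase.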
The algorithm has the following requirements:
- Operate in a batch-processing scenario e.g. run once or twice a day
- Process strings varying in length from roughly 200 to 7000 characters
- Process 1000 strings in less than 1 hour
- Execute on a reasonably powerful server
- Be written in C#, VB.NET or T-SQL, or possibly F#, Python, Lua, etc.
- Not rely on a list of predefined keywords or keyphrases
- But may rely on a list of keyword exclusions e.g. "and", "the", "go" etc.
- Ideally be portable to other programming languages, i.e. not rely on language-specific features such as metaprogramming
- Output a list of keyphrases (descending order of frequency) followed by a list of keywords (descending order of frequency)
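The output requirement could look something like the following Python sketch. It assumes the candidates have already been extracted as lists of lower-cased words; the function name `rank_candidates` is hypothetical. Counting every individual word as a keyword, including words that sit inside a keyphrase, and breaking ties alphabetically are both assumptions rather than part of the requirements.

```python
from collections import Counter


def rank_candidates(candidates):
    """Return (keyphrases, keywords), each sorted by descending frequency.

    `candidates` is a list of word lists, e.g. [["inverter", "drive"], ["motor"]].
    """
    # Runs of two or more words are counted as keyphrases.
    phrases = Counter(" ".join(c) for c in candidates if len(c) > 1)
    # Assumption: every individual word counts as a keyword occurrence.
    words = Counter(w for c in candidates for w in c)

    def by_frequency(counter):
        # Highest count first; ties broken alphabetically for stable output.
        return sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))

    return by_frequency(phrases), by_frequency(words)


keyphrases, keywords = rank_candidates(
    [["inverter", "drive"], ["motor"], ["inverter", "drive"], ["drive"]]
)
print(keyphrases)
print(keywords)
```

Both lists are (text, count) pairs, so they can be written back to a table or printed one per line.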
It would be extra cool if it could process up to 8000 characters in a matter of seconds, so that it could be run in real-time, but I'm already asking enough!
Just looking for advice and directions:
- Should this be regarded as two separate algorithms?
- Are there any established algorithms which I could follow?
- Are my requirements feasible?
Many thanks.
P.S. The strings will be retrieved from a SQL Server 2008 R2 database, so ideally the language would support this directly; if not, it must be able to read from and write to STDIN/STDOUT, a pipe, a stream or a file etc.