Hello I am trying to create a very fast algorithm to detect keywords or list of keywords in a collection.
Before anything, I have read a lot of stackoverflow (and other) posts without being able to improve the performances to the level I expect.
My current solution is able to analyze an input of 200 chars and a collection of 400 list in 0.1825ms (5 inputs analyzed in 1 ms) but this is way too long and I am hoping to improve this performances by at least 5 times (which is the requirement I have).
Solution tested :
- Manual research
- Highly complex Regex (groups, backreferences...)
- Simple regex called multiple times (to match each of the keyword)
- Simple regex to match input keywords followed by an intersect with the tracked keywords (current solution)
- Multi-threading (huge impact on performances (*100) so I am not sure that would be the best solution for this problem)
Current solution :
input (string) : string to parse and analyze to verify the keywords list contained in it. Example : "hello world! How are you Mr #piloupe?".
tracks (string[]) : array of string that we want to match (space means AND). Example : "hello world" matches a string that contains both 'hello' and 'world' whatever their location
keywordList (string[][]) : list of string to match from the input. Example : { { "hello" }, { "#piloupe" }, { "hello", "world" } }
uniqueKeywords (string[]) : array of string representing all the unique keywords of the keywordList. With the previous keywordList that would be : { "hello", "#piloupe", "world" }
All these previous information does not require any performances improvement as they are constructed only once for any input.
Find the tracks algorithm:
// Store in the class performing the queries
readonly Regex _regexToGetAllInputWords = new Regex(@"\#\w+|\w+", RegexOptions.Compiled);
List<string> GetInputMatches(input)
{
// Extract all the words from the input
var inputWordsMatchCollection = _regexToGetAllInputWords.Matches(input.ToLower()).OfType<Match>().Select(x => x.Value).ToArray();
// Get all the words from the input matching the tracked keywords
var matchingKeywords = uniqueKeywords.Intersect(inputWordsMatchCollection).ToArray();
List<string> result = new List<string>();
// For all the tracks check whether they match
for (int i = 0; i < tracksKeywords.Length; ++i)
{
bool trackIsMatching = true;
// For all the keywords of the track check whether they exist
for (int j = 0; j < tracksKeywords[i].Length && trackIsMatching; ++j)
{
trackIsMatching = matchingKeywords.Contains(tracksKeywords[i][j]);
}
if (trackIsMatching)
{
string keyword = tracks[i];
result.Add(keyword);
}
}
return result;
}
Any help will be greatly appreciated.