I am not able to figure out how to search a large UTF-8 encoded text file.
My case: I have a massive UTF-8 encoded file that I need to analyse, containing text with accented letters (in several languages), and I have a certain lookup string.
The lookup string is converted into a fixed byte[] array, while the contents of the source text file are loaded into memory as a series of fixed-length arrays.
Then I have a comparison mechanism that eventually boils down to this code (simplified for the sake of the question):
static int matchesCount = 0;

/// <summary>
/// The inner comparison function
/// </summary>
/// <param name="lookupArrayLength">Length of the lookup array</param>
/// <param name="sourceArrayPointer">Source array pointer, set to the correct position by the outer loop</param>
/// <param name="lookupArrayPointer">Lookup array pointer, set to position zero</param>
static unsafe void compare(int lookupArrayLength, byte* sourceArrayPointer, byte* lookupArrayPointer)
{
    for (int ii = 0; ii < lookupArrayLength; ii++, sourceArrayPointer++, lookupArrayPointer++)
        if (upperLowerCaseMismatch(sourceArrayPointer, lookupArrayPointer))
        {
            // No match; the outer loop advances sourceArrayPointer by one byte
            return;
        }
    // Match found; the outer loop advances sourceArrayPointer by lookupArrayLength
    matchesCount++;
}
static unsafe bool upperLowerCaseMismatch(byte* x1, byte* x2)
{
    return
        *x1 != *x2 &&              // not an exact match...
        (*x1 < 65 || *x1 > 122 ||  // ...and x1 is outside the alphabet range,
         *x2 < 65 || *x2 > 122 ||  // or x2 is outside the alphabet range,
         *x1 + 32 != *x2);         // or x1 is not the uppercase form of x2
}
My aim now is to compare not only case-insensitively, but also to strip accents while comparing, e.g. č => c, ý => y, etc.
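For context, the accented letters involved occupy two bytes in UTF-8 while their ASCII counterparts stay single-byte, which is why the byte-wise comparison above can never equate them. A quick check illustrates this:

```csharp
using System;
using System.Text;

class Utf8BytesDemo
{
    static void Main()
    {
        // Accented letters encode to two bytes in UTF-8, while their
        // ASCII counterparts stay single-byte, so byte-wise equality
        // (even with the +32 case offset) can never match them.
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes("č"))); // C4-8D
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes("c"))); // 63
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes("ý"))); // C3-BD
    }
}
```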
I cannot convert the entire input to a string and Normalize() it, for memory and performance reasons; the analysis must be as fast as possible due to business constraints. I also cannot simply use File.Read(), because the files are very large and that approach brings significant performance loss and GC work.
My idea was to start from what the UTF-8 definition states - that the high bits of the first (lead) byte encode how many bytes the sequence occupies - so perhaps a switch based on the lead byte's value, then read the continuation bytes, combine them into an integer code point, and do another switch for each accented letter?
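To make the idea concrete, here is a rough, untested sketch of that direction (the helper names SequenceLength, DecodeCodePoint and FoldAccent are mine, not an existing API, and it assumes well-formed UTF-8 input): classify the lead byte, fold the continuation bytes into a code point, then map the handful of accented letters the data contains back to ASCII:

```csharp
// Sketch only: assumes well-formed UTF-8 input.
static int SequenceLength(byte lead)
{
    if (lead < 0x80) return 1;            // 0xxxxxxx: plain ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx: two-byte sequence
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx: three-byte sequence
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx: four-byte sequence
    return 1;                             // continuation/invalid byte: skip one
}

static int DecodeCodePoint(byte[] buffer, int offset, int length)
{
    if (length == 1) return buffer[offset];
    int codePoint = buffer[offset] & (0xFF >> (length + 1)); // payload bits of the lead byte
    for (int i = 1; i < length; i++)
        codePoint = (codePoint << 6) | (buffer[offset + i] & 0x3F); // 6 payload bits each
    return codePoint;
}

static byte FoldAccent(int codePoint)
{
    switch (codePoint)
    {
        case 0x010C: case 0x010D: return (byte)'c'; // Č, č
        case 0x00DD: case 0x00FD: return (byte)'y'; // Ý, ý
        // ...one case per accented letter the data actually contains
        default: return 0; // no folding known
    }
}
```

The comparison loop would then advance by SequenceLength(*sourceArrayPointer) instead of one byte whenever it meets a lead byte >= 0x80, and compare the folded result against the lookup byte. Whether a switch beats a flat lookup table at this scale is something I would still need to measure.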