
I am not able to figure out how to search a large UTF-8 encoded text file.

My case is this: I have a massive UTF-8 encoded file that I need to analyse, containing text with accented letters (in different languages), and I have a certain lookup string.

The lookup string is converted into a fixed byte[] array, while the contents of the source text file are loaded into memory as a series of fixed-length arrays.
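
For context, the buffering roughly looks like this (a simplified sketch; the names, the 4 MB chunk size and the boundary handling are illustrative, not the real code):

using System.IO;
using System.Text;

static void scanFile(string path, string lookup)
{
    byte[] lookupBytes = Encoding.UTF8.GetBytes(lookup); //lookup encoded once, up front
    byte[] buffer = new byte[4 * 1024 * 1024];            //reusable fixed-length chunk

    using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read,
                                       FileShare.Read, buffer.Length, FileOptions.SequentialScan))
    {
        int read;
        while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            //hand the filled part of the buffer to the unsafe comparison loop;
            //the last lookupBytes.Length - 1 bytes are carried over to the next
            //chunk so matches straddling a boundary are not lost (not shown)
        }
    }
}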

Then I have a comparison mechanism that eventually boils down to this code (simplified for the sake of the question):

static int matchesCount = 0;

/// <summary>
/// The inner comparison function
/// </summary>
/// <param name="lookupArrayLength">Length of lookup array</param>
/// <param name="sourceArrayPointer">Source array pointer set to correct position by external loop</param>
/// <param name="lookupArrayPointer">Lookup array pointer set to position zero</param>
static unsafe void compare(int lookupArrayLength, byte* sourceArrayPointer, byte* lookupArrayPointer)
{
    for (int ii = 0; ii < lookupArrayLength; ii++, sourceArrayPointer++, lookupArrayPointer++)
        if (upperLowerCaseMismatch(sourceArrayPointer, lookupArrayPointer))
        {
            //No match, outer loop sets sourceArrayPointer to +1, to move a byte forward
            return;
        }

    //Match found, outer loop sets sourceArrayPointer to +lookupArrayLength
    matchesCount++;
}

static unsafe bool upperLowerCaseMismatch(byte* x1, byte* x2)
{
    return (
    *x1 != *x2 && //not an exact byte match
    (*x1 < 65 || *x1 > 122 || //x1 outside the ASCII A-z range
    *x2 < 65 || *x2 > 122 || //x2 outside the ASCII A-z range
    *x1 + 32 != *x2) //not an upper/lower-case pair either
    );
}
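
For completeness, the outer loop mentioned in the comments looks roughly like this (a simplified sketch, not the exact code):

static unsafe void scanBuffer(byte[] buffer, int bytesRead, byte[] lookup)
{
    fixed (byte* bufferStart = buffer)
    fixed (byte* lookupStart = lookup)
    {
        for (int i = 0; i <= bytesRead - lookup.Length; i++)
        {
            int before = matchesCount;
            compare(lookup.Length, bufferStart + i, lookupStart);
            if (matchesCount != before)
                i += lookup.Length - 1; //match found: skip past it (the loop adds the final +1)
        }
    }
}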

My aim now is to compare not only case-insensitively, but also to strip accents while comparing, e.g. č => c, ý => y, etc.

I cannot convert the entire input to a string and Normalize it, for memory and performance reasons; the analysis must be as fast as possible due to business constraints. I also cannot simply use File.Read(), because the files are very large and that approach causes significant performance loss and GC work.

My idea was to start from what the UTF-8 definition states - that the leading byte encodes the length of the byte sequence - so perhaps a switch on the first byte value, then read the remaining bytes, combine them into an integer, and do another switch mapping each accented letter?
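
As a rough illustration of that idea (only a couple of 2-byte mappings are shown; a real table would need every accented letter plus the 3- and 4-byte cases), both the source and the lookup would be folded through the same function and the resulting codes compared:

static unsafe int foldNext(ref byte* p)
{
    byte b0 = *p;
    if (b0 < 0x80) //single-byte ASCII: fold case and advance one byte
    {
        p += 1;
        return char.ToLowerInvariant((char)b0);
    }
    if ((b0 & 0xE0) == 0xC0) //110xxxxx: lead byte of a 2-byte sequence
    {
        ushort pair = (ushort)((b0 << 8) | p[1]); //combine both bytes into one switch key
        p += 2;
        switch (pair)
        {
            case 0xC48D: //č
            case 0xC48C: //Č
                return 'c';
            case 0xC3BD: //ý
            case 0xC39D: //Ý
                return 'y';
            //...remaining accented letters
            default:
                return pair; //no folding known, compare the raw pair
        }
    }
    //3- and 4-byte sequences would get their own switch with an int key
    p += 1;
    return b0;
}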

  • Do the file and the lookup have to be the same, or are you just interested in whether the file contains the lookup at the start? – Haytam Aug 27 '19 at 14:40
  • @Haytam the file is, for instance, a 20 GB text file; the lookup string may only be a single word. The expectation is to analyse the file and see how many times it contains the lookup string (the lookup string is usually a person's surname). – JKurcik Aug 27 '19 at 14:43
  • Okay, now I understand, your `compare` method is called multiple times – Haytam Aug 27 '19 at 14:45
  • @JKurcik Have you thought about preprocessing the input file(s) and creating an index from that? If that's a possibility, maybe look into third-party tools that can do this for you, such as the Windows Indexing Service. – Ian Newson Aug 27 '19 at 14:46
  • BTW, if you aim for performance, you need to avoid calling a method potentially **for every byte of a 20 GB data chunk**. The JIT might inline the method, but that's not guaranteed. – dymanoid Aug 27 '19 at 14:51
  • @IanNewson I have; unfortunately the contents of the files are extremely random, so building an index is not possible. If only the data were structured. – JKurcik Aug 27 '19 at 14:53
  • Is it continuously one "line" of text, or would it be reasonable to go through the file line-by-line? – Corak Aug 27 '19 at 14:57
  • Read the file in chunks as strings. For each chunk, call [`CompareInfo.IndexOf`](https://stackoverflow.com/a/15178962/11683) with options to ignore diacritics, which is [`CompareOptions.IgnoreNonSpace`](https://stackoverflow.com/a/368850/11683). – GSerg Aug 27 '19 at 14:58 (sketched below)
  • @Corak a continuous "one-line" stream of bytes, no newlines, no structure. It's the raw output of a certain data recovery method, sometimes with broken structures, so the entropy is the most limiting factor here. – JKurcik Aug 27 '19 at 15:01
  • Say the first byte is `[110]00011`, if you follow the UTF8 format, this means that the first character is encoded in 2 bytes, so you'll want to read the next byte and let's say it's `10100000`. You'll still need to compare these 2 bytes to all the characters with an accent to see if it's a match and I think that's too much. (btw the character is `à`) – Haytam Aug 27 '19 at 15:05
    @Haytam that's actually the way I was thinking - if it's a 2-byte character, I can do a ushort switch. If it's 3 or 4 bytes, I can do an int switch. Switches are heavily optimized and generally pretty fast, as the compiler creates jump tables. I just don't know if there is anything else I haven't thought of. – JKurcik Aug 27 '19 at 15:10
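
A rough sketch of the chunked `CompareInfo.IndexOf` approach GSerg suggests above (the chunk size, the overlap handling and the invariant culture are assumptions):

using System;
using System.Globalization;
using System.IO;
using System.Text;

static int countMatches(string path, string lookup)
{
    CompareInfo ci = CultureInfo.InvariantCulture.CompareInfo;
    const CompareOptions opts = CompareOptions.IgnoreCase | CompareOptions.IgnoreNonSpace;

    int count = 0;
    string carry = string.Empty;
    char[] buffer = new char[1 << 20];

    using (var reader = new StreamReader(path, Encoding.UTF8))
    {
        int read;
        while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
        {
            string chunk = carry + new string(buffer, 0, read);

            int pos = 0;
            while ((pos = ci.IndexOf(chunk, lookup, pos, opts)) >= 0)
            {
                count++;
                pos += 1;
            }

            //keep a tail shorter than the lookup so boundary matches are found
            //in the next chunk without being counted twice
            int keep = Math.Min(chunk.Length, lookup.Length - 1);
            carry = chunk.Substring(chunk.Length - keep);
        }
    }
    return count;
}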

0 Answers