3

When comparing individual byte values from two separate byte[] sources (arrays / pointers), how would one perform a case INSENSITIVE comparison?

I have one very large array of bytes containing the "haystack" of strings I am accessing through pointer and I am comparing it to a "needle" pattern, but currently it's only returning when there is an exact case sensitive match.

Is it possible to create a lookup dictionary containing the upper-to-lower values and use that in the comparing loop or is there a faster way? (performance-wise)

Edit1:

The strings are UTF8 encoded.

The desired behavior would be: return true when comparing either a,a; A,A; or a,A. But since 'A' in UTF8 has a value of 65 and 'a' has a value of 97, I cannot do case-insensitive comparison.

JKurcik
  • 230
  • 1
  • 12
  • 3
    Convert each array to a string and then do a case-insensitive comparison... –  Dec 12 '18 at 17:37
  • 2
    It would depend on the encoding. Are you dealing with ascii string? – Xiaoy312 Dec 12 '18 at 17:38
  • 1
    What do you mean by the case-sensitivity of a byte? Do the arrays represent ascii-encoded strings? – Lee Dec 12 '18 at 17:38
  • @Xiaoy312 It's UTF8 encoded string – JKurcik Dec 12 '18 at 17:39
  • @Lee I mean that the individual bytes (considering I am using UTF8 actually multiple bytes) represent characters. – JKurcik Dec 12 '18 at 17:42
  • 2
    lowercase and uppercase ACSII codes have an offset of 32, so you can implement a comparison of `x == byte[x] || byte[x+32]` with x=uppercase, if I understand you correctly – Falco Alexander Dec 12 '18 at 17:43
  • @FalcoAlexander OP already stated he isn't using ASCII. He's using UTF8. –  Dec 12 '18 at 17:44
  • 1
    the same offset should be true for UTF8 – Falco Alexander Dec 12 '18 at 17:45
  • I have edited the question. @FalcoAlexander it appears UTF8 is offset by 20 .. will try and check if it applies for entire alphabet, thanks! :) – JKurcik Dec 12 '18 at 17:47
  • 1
    @JKurcik you mean hex 20 (=32)? – Falco Alexander Dec 12 '18 at 17:50
  • @Amy that would be impossible due to the fact the operation is running against 10s of millions of bytes, I wouldn't use pointers if it wasn't that memory expensive. Instantiating new strings every increment would probably kill the GC – JKurcik Dec 12 '18 at 17:51
  • @FalcoAlexander you are right, sorry! Could You post it as an answer please? :) Thanks! – JKurcik Dec 12 '18 at 17:52
  • @JKurcik Then use a stream reader and convert it a few characters at a time. –  Dec 12 '18 at 17:53
  • @Amy I guess that would also work, but for the performance and basically no overhead reasons I will go with FalcoAlexander's answer. Thank you however, in most cases your suggestions would work perfectly and would also have better readability. – JKurcik Dec 12 '18 at 17:58
  • One day this will be baked into .NET, where you'll be able to instantiate a native Utf8String without allocating. https://github.com/dotnet/corefx/issues/30503 – Asik Dec 12 '18 at 18:12

2 Answers2

3

Convert the byte array to a string and then do a case-insensitive comparison. Something like:

bool caseInsensitiveByteArrayComparison(byte[] a, byte[] b) {
    string aString = System.Text.Encoding.UTF8.GetString(a);
    string bString = System.Text.Encoding.UTF8.GetString(b);
    return string.Equals(aString, bString, StringComparison.CurrentCultureIgnoreCase);
}

Code shamelessly stolen from SO. See:

  1. How to convert UTF-8 byte[] to string?
  2. Is there a C# case insensitive equals operator?
Strikegently
  • 2,251
  • 20
  • 23
3

Lowercase and uppercase ACSII and UTF8 code's byte representation have an offset of 32 (or hex20), so you can implement a comparison of x == byte[x] || x == byte[x+32] with x=uppercase char value.

edit:

suppose you really have to deal only with small and capital english letters, you can hack around with bitwise operations to speed things up, as you can handle 8 byte / chars at once, because those only differ by the 3rd most significant bit:

'b' & 0b_1101_1111 == 'B' & 0b_1101_1111

so you could handle the byte array in 8 byte chunks:

void Main()
{
    byte[] a = "ASDADAGF".Select(x => (byte)(x) ).ToArray();
    byte[] b = "asdAdAGF".Select(x => (byte)(x) ).ToArray();
    bitCompared(a,b).Dump();
}

static bool bitCompared( byte[] b1, byte[]b2)
{
    UInt64 a = BitConverter.ToUInt64(b1, 0); //loop over the index
    UInt64 b = BitConverter.ToUInt64(b2, 0);
    UInt64 mask =0b_1101_1111_1101_1111_1101_1111_1101_1111_1101_1111_1101_1111_1101_1111_1101_1111;
    return (a &= mask) == (b &= mask);
}

afaik there are also even more ways to optimize with SIMD and other low level "hacks".....

Falco Alexander
  • 3,092
  • 2
  • 20
  • 39
  • 1
    do you mean `x == byte[i] || x == byte[i]+32`, or how is this supposed to work with logical or? (no criticism, I just don't get it, what about indexoutofrange and wraparound) – Cee McSharpface Dec 12 '18 at 18:03
  • you are right, it was a typo, but it could also be done by bitwise comparison as uppercase and lowercase chars onyl differ by the 2nd most significant bit!! I will add that later as this should even outperform the "normal" byte comparsion! – Falco Alexander Dec 12 '18 at 18:08
  • 2
    @FalcoAlexander is it worth checking if the byte value is outside the alphabet range to avoid false positives? (65-122 in decimal) – JKurcik Dec 12 '18 at 18:24
  • @JKurcik updated the answer to introduce the bitwise operations, even if they may not handle all your needs in special. – Falco Alexander Dec 13 '18 at 10:16
  • 1
    @FalcoAlexander thanks for the effort haha, although i'm affraid using BitConverter will instantiate new arrays and cause a massive performance hit. Try to simulate your operation in a scenario where you need to compare 30mil byte arrays. I managed to squeeze in this condition : `*x1 != *x2 && (*x1 < 65 || *x1 > 122 || *x2 < 65 || *x2 > 122 || *x1 + 32 != *x2)` which first tries exact match, then ensures both x1 and x2 are within alphabet and then tries lowercase comparison. I ran it 30+mil times and got under 5ms using pointers. Thanks for pointing me in the right direction though! – JKurcik Dec 18 '18 at 10:28
  • 1
    my example was only to show that working with 8 byte chunks with binary operations can be done in a first step. have a look at the `System.Numerics.Vectors` to get support for 128 bit and hardware-accelerated numeric types. The check for alphanumeric bytes can also be done on bit mask and shift level. may be the compiler does it already, but check the assembler sites of StackExchange for some very nice hints...I like that low level performance optimizing a lot. May be you make it a challenge with a new question with the optimisation tag and a defined set of data with an example to start with. – Falco Alexander Dec 18 '18 at 10:57
  • @JKurcik I just noticed this PR also handling case processing with performance view! https://github.com/dotnet/coreclr/pull/21615/ with nice benchmarking and insights+discussion – Falco Alexander Dec 22 '18 at 15:01