C# looking for subarrays in large byte array representing strings

Question

In my application I need to open a file, look for a tag, and then do some operation based on that tag. BUT! The file content alternates every char with a \0, so that the text "CODE" becomes 0x43 0x00 0x4F 0x00 0x44 0x00 0x45 0x00 (expressed as hex bytes).

The issue is that the terminator is also a \0, so "CODE123" with the terminator would look something like this:

0x43 0x00 0x4F 0x00 0x44 0x00 0x45 0x00 0x31 0x00 0x32 0x00 0x33 0x00 0x00 0x00

Since \0 is the null string terminator, if I use File.ReadAllText() I get only garbage, so I tried using File.ReadAllBytes() and then purging each byte equal to 0. This gets me readable text, but then I lose the information about where each piece of data ends, i.e. if the file contained CODE123[terminator]PROP456[terminator]blablabla I end up with CODE123PROP456blablabla.

So I decided to get the file content as a byte[] and then look for another byte[] initialized with the CODE-with-\0-inside data. In theory this should work, but since the data array is fairly large (about 1.5 million elements) it takes way too long.

The final cherry on the cake is that I am looking for multiple occurrences of the CODE tag, so I can't just stop as soon as I find the first one.

I tried modifying the LINQ posted as answer here: Find the first occurrence/starting index of the sub-array in C# as follows:

    var indices = (from i in Enumerable.Range(0, 1 + x.Length - y.Length)
                   where x.Skip(i).Take(y.Length).SequenceEqual(y)
                   select (int?)i).ToList();

but as soon as I tried to enumerate the result it just bogs down.

So, my question is: how can I EFFICIENTLY find multiple subarrays in a large array? Thanks.

See my answer elsewhere which explains how to implement a Boyer-Moore search for binary data: https://stackoverflow.com/a/37500883/106159 — Matthew Watson, Nov 09 '21 at 13:41
The nulls don't seem to be null string terminators. You just need to read it with the correct encoding; they are just part of the chars of that encoding. Presumably some kind of UTF-16, but you should know better than us what your file's encoding is. ReadAllText has an overload for the encoding. — Ralf, Nov 09 '21 at 13:42
@Ralf that's exactly the problem: they are not terminators except when they were used as one, so if I try to interpret them I get garbage (the first one is treated as a null string terminator and basically ruins the whole string interpretation), regardless of what encoding I try. — Marcomattia Mocellin, Nov 09 '21 at 13:53
If you read with ReadAllText and with Encoding.Unicode you get a single string with string terminator to separate the individual substring. Then Split will give you an array of the individual strings. — Steve, Nov 09 '21 at 13:56
I'm not quite convinced ;) Have you tried encodings with changed byte order like BigEndianUnicode? — Ralf, Nov 09 '21 at 13:57
@Ralf yeah, I just tried, now I get garble but in other alphabets like Japanese or Chinese and maybe Farsi? XD — Marcomattia Mocellin, Nov 09 '21 at 14:09
@Steve This is not applicable, since I have a \0 every other character, so by doing the split I get the single chars, and I still lose the true termination information. — Marcomattia Mocellin, Nov 09 '21 at 14:11
@MatthewWatson That looks promising! I still need to do A LOT of postprocess but it seems to get the indices really fast, nice! I'll update the question as soon as I manage to get to a working solution — Marcomattia Mocellin, Nov 09 '21 at 14:12
@MarcomattiaMocellin I have tried with a file with a zero to separate each character. No problem loading it passing the Encoding.Unicode to ReadAllText. Did you try it with your data? — Steve, Nov 09 '21 at 14:15
@Steve: Yep, I tried: https://ibb.co/gR3Fhj8 similar result by changing to BigEndianUnicode. That's why I need to go through the byte[] route — Marcomattia Mocellin, Nov 09 '21 at 14:22

score 0 · Accepted Answer · answered Nov 11 '21 at 15:53

The wonderful Boyer-Moore algorithm suggested by Matthew Watson solved my problem amazingly.
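
For anyone landing here, this is a minimal sketch of the idea, not the code from the linked answer. It uses the simpler Boyer-Moore-Horspool variant (bad-character table only) and collects every match instead of stopping at the first one. The names `ByteSearch` and `FindAll` are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class ByteSearch
{
    // Returns the start index of every occurrence of pattern in data
    // (overlapping matches included), using Boyer-Moore-Horspool.
    public static List<int> FindAll(byte[] data, byte[] pattern)
    {
        var hits = new List<int>();
        if (pattern.Length == 0 || data.Length < pattern.Length)
            return hits;

        // Bad-character table: how far the window can move, keyed by the
        // data byte currently aligned with the last pattern position.
        var shift = new int[256];
        for (int i = 0; i < shift.Length; i++)
            shift[i] = pattern.Length;
        for (int i = 0; i < pattern.Length - 1; i++)
            shift[pattern[i]] = pattern.Length - 1 - i;

        int last = pattern.Length - 1;
        int pos = 0;
        while (pos <= data.Length - pattern.Length)
        {
            // Compare the window against the pattern from right to left.
            int j = last;
            while (j >= 0 && data[pos + j] == pattern[j])
                j--;

            if (j < 0)
                hits.Add(pos);

            // Always advance by at least one byte, so the loop terminates.
            pos += shift[data[pos + last]];
        }
        return hits;
    }
}
```

The pattern with a 0x00 after every character is what `Encoding.Unicode.GetBytes("CODE")` produces, so a call could look like `ByteSearch.FindAll(File.ReadAllBytes(path), Encoding.Unicode.GetBytes("CODE"))`, where `path` is whatever file you are reading. For the bytes of "CODE123\0PROP456\0CODE789\0" encoded the same way, this returns the offsets 0 and 32.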

I then had to find a way to locate the actual string terminations. That part looks too application-specific to be useful to anybody else, so I didn't post it. If you think it may be useful, let me know and I'll post it here :)
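
In the meantime, here is a generic sketch of one way to do that step. It is not my application-specific code, and it rests on assumptions: the text is UTF-16LE as in the CODE123 dump in the question, and the terminator is the first 0x00 0x00 pair that falls on a two-byte boundary counted from the match offset. `TaggedStrings` and `ReadTerminated` are hypothetical names.

```csharp
using System;
using System.Text;

public static class TaggedStrings
{
    // Decodes the string that begins at byte offset start and runs up to,
    // but not including, the first 0x00 0x00 pair on a two-byte boundary.
    // If no terminator is found, decodes up to the end of the data.
    public static string ReadTerminated(byte[] data, int start)
    {
        int end = start;
        // Step two bytes at a time so the high byte of one char and the
        // low byte of the next are never mistaken for a terminator.
        while (end + 1 < data.Length && !(data[end] == 0 && data[end + 1] == 0))
            end += 2;

        // Assumption: little-endian UTF-16, i.e. Encoding.Unicode.
        return Encoding.Unicode.GetString(data, start, end - start);
    }
}
```

Stepping in pairs from the match offset means the decoding does not depend on how the rest of the file is aligned, which may be why decoding the whole file at once gave garbage.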
