
I have a UTF-8 byte array of data. I would like to search for a specific string in the array of bytes in C#.

byte[] dataArray = (some UTF-8 byte array of data);

string searchString = "Hello";

How do I find the first occurrence of the word "Hello" in the array dataArray and return an index location where the string begins (where the 'H' from 'Hello' would be located in dataArray)?

Before, I was erroneously using something like:

int helloIndex = Encoding.UTF8.GetString(dataArray).IndexOf("Hello");

Obviously, that code is not guaranteed to work, since `IndexOf` returns a character index into the decoded string, not a byte index into the UTF-8 array — the two diverge whenever a multi-byte character precedes the match. Are there any built-in C# methods or proven, efficient code I can reuse?

Thanks,

Matt

Matthew Steven Monkan

2 Answers


One of the nice features of UTF-8 is that if a sequence of bytes represents a character, then wherever that sequence appears in valid UTF-8 encoded data, it always represents that character.

Knowing this, you can convert the string you are searching for to a byte array and then use the Boyer-Moore string searching algorithm (or any other string searching algorithm you like) adapted slightly to work on byte arrays instead of strings.
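As a sketch of that idea, here is a hypothetical `ByteSearch.IndexOf` helper using the Boyer-Moore-Horspool variant of that algorithm (the class name and structure are illustrative, not a built-in API):

```csharp
static class ByteSearch
{
    // Boyer-Moore-Horspool search over raw bytes.
    // Returns the index of the first occurrence of pattern in data, or -1.
    public static int IndexOf(byte[] data, byte[] pattern)
    {
        if (pattern.Length == 0) return 0;
        if (pattern.Length > data.Length) return -1;

        // Bad-character table: how far we may shift when the window's
        // last byte mismatches.
        var shift = new int[256];
        for (int i = 0; i < shift.Length; i++) shift[i] = pattern.Length;
        for (int i = 0; i < pattern.Length - 1; i++)
            shift[pattern[i]] = pattern.Length - 1 - i;

        int pos = 0;
        while (pos <= data.Length - pattern.Length)
        {
            // Compare from the end of the pattern backwards.
            int j = pattern.Length - 1;
            while (j >= 0 && data[pos + j] == pattern[j]) j--;
            if (j < 0) return pos;               // full match
            pos += shift[data[pos + pattern.Length - 1]];
        }
        return -1;
    }
}
```

With this, `ByteSearch.IndexOf(dataArray, Encoding.UTF8.GetBytes("Hello"))` returns the byte index of the 'H', or -1 if the word does not occur.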

There are a number of answers here that can help you.

Mark Byers
  • I know it's late to point this out, but I personally find a conflict with knowledge I gained from @JoelSpolsky's post [The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)](http://www.joelonsoftware.com/articles/Unicode.html). What you're saying is this: if I have, for instance, the string `var nl = "\n";`, which encodes to exactly one UTF-8 byte (`var bytes = new byte[] { 10 }`), then the byte `10` **can never** represent half or a third of a Chinese character? – Eduard Dumitru Jun 11 '16 at 12:23
  • I'm not trying to say your idea is even partially wrong. I'm just a bit confused about what you're saying and I want to clarify things for myself and possibly others. – Eduard Dumitru Jun 11 '16 at 12:27
  • "it must mean that 10 can never represent a half or a third of a Chinese character" — I think that's correct. See https://en.wikipedia.org/wiki/UTF-8#Description. Every byte of a multi-byte character is at least binary 10000000 (0x80), which is greater than `\n` (0x0A). – Olivier Lalonde Jan 21 '17 at 22:04
  • A valid UTF-8 encoding of any Unicode codepoint can never be a substring (or superstring) of the byte sequence encoding a different codepoint. That's one of the properties of UTF-8. (It could happen with older encodings, like UTF-7 etc.) ASCII chars map 1:1 to bytes <= 127, and all other codepoints map to sequences whose bytes are all > 127, where the first byte encodes the length of the whole sequence. – MarkusSchaber Aug 12 '22 at 12:21
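A quick illustration of the property discussed in the comments above (the character chosen here is just an example):

```csharp
using System;
using System.Text;

// "中" (U+4E2D) encodes to three UTF-8 bytes, every one of them >= 0x80,
// so a single ASCII byte such as '\n' (0x0A) can never occur inside it.
byte[] chinese = Encoding.UTF8.GetBytes("中");

Console.WriteLine(BitConverter.ToString(chinese)); // E4-B8-AD
Console.WriteLine((int)'\n');                      // 10
```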

Try the following snippet:

using System;
using System.Text;

// Set up our little test: the source data starts with a two-byte character.
string sourceText = "ʤhello";
byte[] dataBytes = Encoding.UTF8.GetBytes(sourceText);

// Decode the bytes into a string we can search in.
string searchText = Encoding.UTF8.GetString(dataBytes);
int position = searchText.IndexOf("hello");

// Re-encode the text that precedes the match; its byte length is the
// byte offset in the UTF-8 array, which differs from the character position.
string before = searchText.Substring(0, position);
int bytesBefore = Encoding.UTF8.GetBytes(before).Length;

// This outputs: Position is 1 and before is 2.
Console.WriteLine("Position is {0} and before is {1}", position, bytesBefore);
Pieter van Ginkel