78

I have a string that I need to convert to the equivalent array of bytes in .NET.

This ought to be easy, but I am having a brain cramp.

Peter Mortensen
JonStonecash

4 Answers

103

You need to use an encoding (System.Text.Encoding) to tell .NET what you expect as the output. For example, in UTF-16 (= System.Text.Encoding.Unicode):

    var result = System.Text.Encoding.Unicode.GetBytes(text);
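
To illustrate why the choice of encoding matters, here is a minimal sketch that encodes the same string three ways. The sample string and variable names are only illustrative assumptions; the `GetBytes` calls are the standard `System.Text.Encoding` ones.

```csharp
using System;
using System.Text;

// "héllo", with the accented letter written as an escape so the source file's own encoding does not matter.
string text = "h\u00E9llo";

// Same string, three encodings, three different byte sequences.
byte[] utf16 = Encoding.Unicode.GetBytes(text); // UTF-16 little-endian: 2 bytes per char here
byte[] utf8 = Encoding.UTF8.GetBytes(text);     // 1 byte per ASCII char, 2 bytes for 'é'
byte[] utf32 = Encoding.UTF32.GetBytes(text);   // 4 bytes per code point

Console.WriteLine($"UTF-16: {utf16.Length} bytes"); // 10
Console.WriteLine($"UTF-8:  {utf8.Length} bytes");  // 6
Console.WriteLine($"UTF-32: {utf32.Length} bytes"); // 20
```

Whichever encoding produces the bytes is the one that has to be used to read them back.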
Konrad Rudolph
  • 4
    There are a lot more encodings in System.Text.Encoding than just Unicode: make sure you understand which one you need. – Joel Coehoorn Oct 27 '08 at 21:24
  • 1
    Joel: Hence I wrote “for example”. ;-) But your comment is of course valid. – Konrad Rudolph Oct 27 '08 at 21:27
  • :) Trying to help show where the non-UTF16 encodings are- I probably could have worded it better. – Joel Coehoorn Oct 27 '08 at 21:42
  • can you please see my [question](https://stackoverflow.com/questions/61857579/converting-string-to-equivalent-byte-hex-in-c-sharp/61858072?noredirect=1#comment109444645_61858072) related to it ? – Moeez May 19 '20 at 04:34
43

First work out which encoding you want: you need to know a bit about Unicode first.

Next, work out which System.Text.Encoding that corresponds to. My Core .NET refcard describes most of the common ones and how to get an instance (e.g. via a static property of Encoding or by calling Encoding.GetEncoding).

Finally, work out whether you want all the bytes at once (which is the easiest way of working - call Encoding.GetBytes(string) once and you're done) or whether you need to break it into chunks - in which case you'll want to use Encoding.GetEncoder and then encode a bit at a time. The encoder takes care of keeping the state between calls, in case you need to break off half way through a character, for example.
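
As a rough sketch of the chunked approach, the following encodes a string two chars at a time with an `Encoder` and compares the result with a single `Encoding.GetBytes` call. The chunk size, sample string, and variable names are assumptions chosen so that a chunk boundary falls in the middle of a surrogate pair.

```csharp
using System;
using System.IO;
using System.Text;

// 'a', one emoji (a surrogate pair, so two chars), then 'b': four chars in total.
string text = "a\U0001F600b";
char[] chars = text.ToCharArray();

const int chunkSize = 2; // hypothetical chunk size; splits the surrogate pair across two calls
Encoder encoder = Encoding.UTF8.GetEncoder();
byte[] buffer = new byte[Encoding.UTF8.GetMaxByteCount(chunkSize)];
var output = new MemoryStream();

for (int offset = 0; offset < chars.Length; offset += chunkSize)
{
    int count = Math.Min(chunkSize, chars.Length - offset);
    bool flush = offset + count == chars.Length; // flush only on the final chunk

    // The encoder remembers a trailing high surrogate until the next call supplies its partner.
    int written = encoder.GetBytes(chars, offset, count, buffer, 0, flush);
    output.Write(buffer, 0, written);
}

byte[] chunked = output.ToArray();
byte[] allAtOnce = Encoding.UTF8.GetBytes(text);

bool same = BitConverter.ToString(chunked) == BitConverter.ToString(allAtOnce);
Console.WriteLine(same);
```

Calling `Encoding.UTF8.GetBytes` separately on each two-char chunk would not give the same result, because each call would see half of a surrogate pair in isolation.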

Jon Skeet
  • 2
    @JonSkeet: You don't really need the encoding unless you (or someone else) is actually going to *interpret* the bytes, do you? For tasks like compression, encryption, obfuscation, etc. the encoding seems kind of irrelevant... no reason to go through the trouble if you don't need to.. – user541686 Apr 30 '12 at 07:59
  • 10
    @Mehrdad: You *absolutely* do. An encoding *defines* what the conversion from a string to a byte array does. Compression and encryption are entirely different matters. Otherwise it's like saying the image format doesn't matter when you want to save a picture as a file - many different image formats may be okay, but there has to be *one* involved, by definition. – Jon Skeet Apr 30 '12 at 08:09
  • @JonSkeet: Can't you just say `byte[] bytes = new byte[str.Length * sizeof(char)]; Buffer.BlockCopy(str.ToCharArray(), 0, bytes, 0, bytes.Length)`? Who cares what the encoding is (or if the string even has valid characters in the first place), as long as you know you can get it back in the same form by doing the reverse? – user541686 Apr 30 '12 at 08:14
  • 4
    @Mehrdad: That's using UTF-16 then. It's still an encoding - it's just it's the natural one used internally for `char`. (And you may very much care about the fact that that's twice as large as it needs to be if your string is all ASCII.) – Jon Skeet Apr 30 '12 at 08:18
  • 1
    @JonSkeet: Right, but my point is, the mere fact that the user wants to "get the bytes" doesn't mean that he even needs to *know* what "encoding" *means* at all... that only matters if he's *interpreting* them, not merely working with them as a black box. (Regarding the space issue: yes, that obviously *could* be an issue, but quite often when you "just want the bytes", that is irrelevant, as I would guess the case was here. It's obviously *beneficial* to know about encodings, but you don't *need* to know about them here, do you?) – user541686 Apr 30 '12 at 08:21
  • 9
    @Mehrdad: No, the user *does* need to know the encoding. Just because UTF-16 is in some sense the natural encoding *for .NET* doesn't mean it's the encoding he wants to use. The point of writing data out is so that it can be read again - and that will need to use the same encoding. The fact that the OP referred to "the equivalent array of bytes" suggests that they're unaware that encodings even exist, and it's **vitally** important to understand encodings if you're going to convert between text and binary representations. – Jon Skeet Apr 30 '12 at 08:24
  • 8
    I've seen *countless* people fail to preserve information correctly because they haven't understood encodings. In my experience, educating them about the topic is a much better approach than using `Buffer.BlockCopy` and *assuming* it's what they want. – Jon Skeet Apr 30 '12 at 08:25
  • 1
    @JonSkeet: Then what do you do if some character in the string is invalid in the encoding you want to "get the bytes" for (perhaps because someone *else* gave you the string, and you're not responsible for its contents... maybe it has private-use characters, or maybe they didn't even *tell* you the encoding)? Using any particular encoding makes no sense, because there might not be any conversion for your characters. By contrast, if you just use the method I mentioned, then it doesn't matter whether the characters are valid, because they would work correctly anyway. – user541686 Apr 30 '12 at 08:28
  • 3
    @Mehrdad: A string doesn't *have* an encoding (or it's always UTF-16). If it's read from UTF-8, it still ends up in UTF-16 internally. It's not that your method doesn't use an encoding - it's that it's *implicit*, which is a bad thing IMO. Obviously you need to use an *appropriate* encoding, but just trying to wave away the issue as if it didn't exist is a really, really bad idea IMO. Maintaining ignorance of encodings is *not* the way forward. If you want to use UTF-16, do so explicitly (`Encoding.Unicode`). – Jon Skeet Apr 30 '12 at 08:42
  • @JonSkeet: I don't understand your comment of *"A string doesn't have an encoding (or it's always UTF-16)"*... those two go against each other. Must a `System.String` always contain UTF-16? For that matter, *must* it obey any other particular encoding? – user541686 Apr 30 '12 at 08:48
  • 2
    @Mehrdad: It's always a sequence of `char`, which is itself a UTF-16 code unit. (Not a Unicode code point, note.) But it's meaningless to talk about "a UTF-8 string" for example. You can have "A UTF-8 representation of a string" (which would be a byte array) but that's a different matter. – Jon Skeet Apr 30 '12 at 08:49
  • @JonSkeet: I don't understand. If you claim a `string` must always contain valid UTF-16 data, then that's false (`"\uFFFF\uFFFF"`). And if you're claiming it *doesn't* necessarily contain valid UTF-16 data, and that it could represent data in *more* than one possible encoding, then I beg the question: what sense does it make to use `Encoding.XXX.GetBytes()` on the string, when you don't know what encoding to use? (It's not like people give you the encoding for every single `string` object they pass to you...) – user541686 Apr 30 '12 at 08:58
  • 3
    @Mehrdad: It depends on what you mean by "valid". It always contains UTF-16 code units, by definition. They don't have to map to defined Unicode characters, of course... but they're still UTF-16. So if you want to represent some value in a private range, you do so in UTF-16 - then convert to the UTF-8 (or whatever) encoding of the same private range characters later. If you don't know what encoding to use, you *should not* be converting to bytes at all. It's like asking to save an image without specifying an image format - just say no. – Jon Skeet Apr 30 '12 at 09:51
  • @JonSkeet: Sorry this is from the future, not sure how I missed the comment... but it makes perfect sense to need to encrypt/compress a string for transportation/storage without knowing (or caring) what encoding to use. The encoding need not come into play at all in many scenarios like these. – user541686 Jan 18 '13 at 03:08
  • 3
    @Mehrdad: It's fine to compress then uncompress some binary representation of a string without knowing what encoding it's in. It's *not* fine to treat the compressed binary data as if it were text. Any time you want to convert from a string to binary or vice versa, you *must* know which encoding to use, and be consistent both ways. – Jon Skeet Jan 18 '13 at 07:05
  • @JonSkeet: Yup, that's exactly what I've said too, right? [As long as you don't try to *interpret* the bytes then you don't need to worry about the encoding](http://stackoverflow.com/questions/241405/how-do-you-convert-a-string-to-a-byte-array-in-net/241466#comment13382343_241466). :) – user541686 Jan 18 '13 at 07:18
  • 5
    @Mehrdad: But *someone* is going to interpret the bytes later. You're right in saying that the compression/encryption part doesn't need to care, but whatever's going to later turn it back into a string absolutely does... and if no-one's *ever* going to interpret the data, there's not much point in it being there. So yes, you do still need to choose an encoding, and make sure it's used consistently. Which encoding you decide to use is *somewhat* arbitrary so long as it can encode all your text, although it will affect space etc. Arbitrary isn't the same as irrelevant though. – Jon Skeet Jan 18 '13 at 07:25
  • @JonSkeet: So you're saying I *must* choose an encoding if, for example, ***all*** I'm doing is converting a `string` to a `byte[]`, compressing it, and writing it to a file, so that tomorrow I can read it into a `byte[]` and decompress it into a `string` on the same machine? If so, I find that to be a little shocking of a statement -- why does the encoding matter? Yes, I am "interpreting" the string tomorrow, but how would the encoding be relevant? The only thing that matters is that I'm getting back what I started with... and that's it. – user541686 Jan 18 '13 at 07:28
  • 6
    @Mehrdad: Yes, absolutely. Just like you *must* choose an image format if you want to save a picture to disk. Use that analogy as far as you can. Strings aren't made of bytes (conceptually) so in order to convert *to* bytes, you have to go through some sort of conversion... and that is precisely the encoding. – Jon Skeet Jan 18 '13 at 07:33
  • @JonSkeet: Er... yes, it *must* go through ***some*** conversion, that's true by definition. But *you* don't have to *care* what the particular conversion *is*, *as long as a black box can decode the bytes for you*. Right? I feel like that should be obvious... why do you have to care what's *inside* the box (the particular encoding)? So, you don't have to know **anything** about *how* it works (or what the word "encoding" even *means*!)... *all* you need is `byte[] GetBytes(string)` and `string GetString(byte[])` and that's it! And that's what `BitConverter` does, no encoding hassle. – user541686 Jan 18 '13 at 07:38
  • In other words, it should be perfectly possible and legitimate for a person to know *nothing* about encodings (and never *need* to) and ask for the "`byte[]` representation" of a string, if he is never going to *interpret* the bytes. That's all I'm saying -- an answer that uses `BitConverter` for the conversion (or something similar) would do the job easily, and it would do so without mentioning the word "encoding" even once -- so really, the encoding isn't something the OP *must* have to worry about. – user541686 Jan 18 '13 at 07:42
  • 2
    @Mehrdad: The encoding *is* the black box. There are lots of black boxes to pick from (different encodings). You don't need to know anything about the internals - but you need to pick the same conversion both ways. An answer using `BitConverter` is still picking an encoding - it's just choosing not to call it that. Would you prefer it if I said, "You need to pick a string-to-bytes conversion, usually via `System.Text.Encoding`"? That's exactly the same thing, just more clumsily stated IMO. Again, think about image formats: you need to choose the format to get from pixels to bytes. – Jon Skeet Jan 18 '13 at 07:44
  • The important point is that a user can't ask for **the** `byte[]` representation, because there are lots of different options available. – Jon Skeet Jan 18 '13 at 07:45
  • @JonSkeet: *"Would you prefer it if I said, "You need to pick a string-to-bytes conversion, usually via `System.Text.Encoding`""* -- Yes! Exactly: if you had said that, then the user would need to know **nothing** about Unicode in order to achieve his goal! *That's* the crucial difference between `Text.Encoding` and `BitConverter` -- one of them is for when you *do* care about the encoding, and the other is for when the encoding is 100% irrelevant to your goal. That's why I commented here: you said the OP *needs* to know about Unicode, when in reality it's irrelevant (just use `BitConverter`). – user541686 Jan 18 '13 at 07:45
  • 2
    @Mehrdad: Using `BitConverter` would still be making a choice, just without realizing that there *are* choices. (Also, I can't find which `BitConverter` method you mean, to be honest.) Again, think about the image version: if someone asked you how to save a picture to disk, would you not ask the natural question of which format? I don't see why it should be controversial for someone to know the pretty basic difference between bytes and characters, and the ability to choose different encodings. It's not like they have to *implement* them. – Jon Skeet Jan 18 '13 at 07:50
  • @JonSkeet: Oops, apologies for mentioning `BitConverter`, I meant `System.Buffer.BlockCopy`, which can copy any primitive array (e.g. a `char[]`) to a `byte[]` and vice-versa... I was thinking of the wrong class, sorry for confusing you. – user541686 Jan 18 '13 at 07:52
  • @JonSkeet: As for the picture task: it's the same thing. If `BlockCopy` can perform the encoding/decoding on your `Picture` class, then you need to know **nothing** about the various image formats (or even their *existence*) in order to achieve what you need, if you're never going to be *interpreting* the bytes yourself. There's no need to tell the user to go learn about BMPs. It's a significantly smaller hurdle to jump over (none, actually) than learning about Unicode! – user541686 Jan 18 '13 at 07:53
  • 4
    Do you have an example of a .NET image class which *could* handle `Buffer.BlockCopy`? You don't need to know *much* about Unicode, although obviously the more the better. But you *do* need to make a choice. If you want to write a `StringConverter` class which hides that choice and *always* uses `Encoding.UTF8` (or whatever) then go ahead - but you're still making a choice, and I don't think it actually benefits anyone to hide it. Sooner or later you're bound to run into a situation where you need to understand the very basics of encodings, so why not learn sooner rather than later? – Jon Skeet Jan 18 '13 at 08:20
  • 1
    @Mehrdad by letting the black box arbitrarily decide on an encoding, and especially relying on the underlying .net representation of string in UTF-16, you introduce future potential bugs. What if a next update to .net system changes the way strings are represented in memory? Instead of Little-Endian it could be Big-Endian for instance. Suppose we convert a string to a byte array your way, then compress it. After some months and a .net update, we try to decompress and convert back to string. But this time it will be garbage! All because Encoding wasn't explicitly specified. – Thanasis Ioannidis Jun 27 '18 at 11:53
  • @ThanasisIoannidis: It's been 5 years, but looking back it still seems I made it pretty clear that whether or not you should specify an encoding depends on what precisely you're trying to do. And note that this is **not** "letting a black box decide on an encoding". Nowhere does `BlockCopy` decide on any encoding, and that's the point. e.g. If what you need is lossless transmission on an identical system, you must use the raw bytes regardless of whether they are valid according to any particular encoding. OTOH, if you need interoperability, you encode/decode. – user541686 Jun 27 '18 at 12:06
  • @Mehrdad assuming there are raw bytes in the first place. It happens that .net implements strings with an underlying char array but that is implementation details. Even between identical systems, no-one guarantees you there will be an underlying array to get raw bytes from. It could easily change into a linked list or any other data structure (unlikely, but still you get the point). Still you will need to specify a way to convert that string (with this weird underlying implementation) into a byte sequence, and that way of converting from string to byte is called an encoding. – Thanasis Ioannidis Jun 27 '18 at 12:23
  • @ThanasisIoannidis: First of all, C# lets you pin a string and access the underlying characters directly, so you're wrong right off the bat. Second, even if that wasn't the case, a linked list (or anything else) wouldn't change anything. Whatever the underlying implementation is, you have `Buffer.BlockCopy()` and `string.ToCharArray` that give you raw bytes that can be used for perfect reconstruction. Whether they send someone to climb Mount Everest and radio the characters to the moon and back is up to the framework and not your business, and entirely irrelevant. – user541686 Jun 27 '18 at 12:28
  • @ThanasisIoannidis: Imagine writing a communication library for your program that runs on two machines, maybe with APIs `void Send(string)`, `string Receive()`. You *really ought to be able to transmit a `string` by itself* just like you would transmit a `char[]` or `byte[]`. It really is none of your library's business whether that `string` is UTF-16LE, UTF-16BE, or otherwise. It could be entirely random code units for all you care. Your library can and must do its job of lossless transmission regardless. And assuming an encoding internally isn't just unnecessary; it *loses information*. – user541686 Jun 27 '18 at 12:31
  • @Mehrdad as for the char array, it is not a byte array until **some** kind of encoding is applied to it. `BlockCopy` does this encoding in your case, even if that encoding is just memory copying each byte of the char array. It doesn't need to be one of the `System.Text.Encodings` (so as not to lose information). Whatever way you use to get the byte array is an encoding: a contract on how you get byte[] from string. If the library you mention is to convert back and forth within the same system or identical systems, yes you do not need to specify the encoding. The library does this for you. – Thanasis Ioannidis Jun 27 '18 at 12:53
  • But even with the same library, if it relies on the underlying implementation, bugs could be introduced. You can't guarantee the system will be identical when decoding happens. What if .net changes from Little-Endian to Big-Endian on the receiving part of the transmission? `ToCharArray` will encode in Little-Endian and `FromCharArray` on the receiving part will assume Big-Endian which will result in corrupted data. Clearly your way is a way to convert a `string` to `byte[]` in .net. But explicitly specifying an encoding is also another way to convert a `string` to `byte[]` and seems more robust. – Thanasis Ioannidis Jun 27 '18 at 12:54
  • @ThanasisIoannidis: The question is *who* is providing *what* contract and whether the callee should care or not. But at this point you're just repeating yourself. I don't have anything to add. Feel free to move on. – user541686 Jun 27 '18 at 13:08
  • can you please see my [question](https://stackoverflow.com/questions/61857579/converting-string-to-equivalent-byte-hex-in-c-sharp/61858072?noredirect=1#comment109444645_61858072) related to it ? – Moeez May 19 '20 at 04:34
  • @Faisal: Please don't use comments on old questions (over a decade old in this case) to attract attention to a new question unless the new question has *specifically* come out of discussion in the existing comments. – Jon Skeet May 19 '20 at 07:22
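
The `Buffer.BlockCopy` approach debated in the comments above can be compared directly with `Encoding.Unicode`. This is only a sketch with an assumed sample string: for a well-formed string on a little-endian machine the two byte arrays should match, which is the sense in which the copy "is" UTF-16. For strings containing unpaired surrogates the two can differ, since `Encoding.Unicode` substitutes a replacement character while the raw copy keeps the code units as they are.

```csharp
using System;
using System.Text;

string str = "h\u00E9llo"; // illustrative, well-formed sample string

// Raw copy of the string's chars into bytes, as proposed in the comments.
byte[] copied = new byte[str.Length * sizeof(char)];
Buffer.BlockCopy(str.ToCharArray(), 0, copied, 0, copied.Length);

// Explicit UTF-16 little-endian encoding.
byte[] encoded = Encoding.Unicode.GetBytes(str);

bool same = BitConverter.ToString(copied) == BitConverter.ToString(encoded);
Console.WriteLine($"Little-endian machine: {BitConverter.IsLittleEndian}");
Console.WriteLine($"Raw copy equals Encoding.Unicode bytes: {same}");
```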
20

What Encoding are you using? Konrad's got it pretty much down, but there are others out there and you could get goofy results with the wrong one:

    byte[] bytes = System.Text.Encoding.XXX.GetBytes(text);

Where XXX can be:

ASCII
BigEndianUnicode
Default
Unicode
UTF32
UTF7
UTF8
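
One kind of "goofy result" is silent data loss. The sketch below, using an assumed sample string, encodes the same text with `Encoding.ASCII` and `Encoding.UTF8`; with its default settings, `Encoding.ASCII` replaces any character it cannot represent with `?`.

```csharp
using System;
using System.Text;

string text = "caf\u00E9"; // "café"

byte[] ascii = Encoding.ASCII.GetBytes(text); // 'é' is not ASCII, so it becomes '?' (0x3F)
byte[] utf8 = Encoding.UTF8.GetBytes(text);   // 'é' becomes the two bytes 0xC3 0xA9

string fromAscii = Encoding.ASCII.GetString(ascii);
string fromUtf8 = Encoding.UTF8.GetString(utf8);

Console.WriteLine(fromAscii);        // caf?
Console.WriteLine(fromUtf8 == text); // True
```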
swilliams
10

Like this:

    string test = "text";
    byte[] arr = Encoding.UTF8.GetBytes(test);
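
For completeness, a small sketch of the round trip, assuming the same UTF-8 encoding is used in both directions (the variable `back` is just an illustrative name):

```csharp
using System;
using System.Text;

string test = "text";
byte[] arr = Encoding.UTF8.GetBytes(test);

// Decode with the same encoding that produced the bytes.
string back = Encoding.UTF8.GetString(arr);

Console.WriteLine(BitConverter.ToString(arr)); // 74-65-78-74
Console.WriteLine(back == test);               // True
```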
Igal Tabachnik