
I'm coming back to play around with C++ a little after some years out of college. When I looked up how to read a file as bytes in C++, what I found is that there isn't any magical "readAsBytes" function: you essentially read the file the same way you would a text file, but make sure to store the results through a char*. For instance:

    someIFStream.read(someCharPointer, sizeOfSomeCharPointer);

That being said, even though a char in C++ is usually 8 bits, this isn't guaranteed. Mess around with different platforms and text encodings long enough, and you're going to run into issues if you want a true array of bytes.

You could just use a uint8_t* and copy everything over from the char* . . . but dang, that's wasteful. Why can't we just get everything into a uint8_t* the first time around, while we're still reading the file, in a way that doesn't have to worry about whether it's a 32-bit machine or a 64-bit machine or UTF-8 or UTF-16 or what have you?

So the question is: Is this possible, at least in more modern C++ versions? If so, how? The reason I don't want to go from a char* to a uint8_t* is basically one of not having to waste a bunch of CPU cycles on some 50,000-iteration for loop. Thanks!

EDIT

I'm defining a byte as 8 bits for the purposes of this question, unless somebody strongly suggests otherwise. My understanding is that bytes were originally 6 bits, then became 7, and then finally settled down on 8, but that 32-bit groupings and such are usually thought of as small collections of bytes. If I'm mistaken, or if I should think of this problem differently (either way), please bring it up.

Panzercrisis
  • Possible duplicate of [Is char guaranteed to be exactly 8-bit long?](http://stackoverflow.com/questions/881894/is-char-guaranteed-to-be-exactly-8-bit-long) – jww Sep 20 '14 at 02:34
  • @jww No, I already saw that question and its accepted answer, which basically says that they can occasionally be longer. Since that means you can't truly rely on `chars` to be 8-bit, you have to find another datatype. That's part of what led me to asking this. – Panzercrisis Sep 20 '14 at 02:37
  • "Start messing around with different platforms and text encodings long enough, and you're going to run into issues if you want a true array of bytes." In what platform is the type char in c++ not 8-bit? – thang Sep 20 '14 at 02:58
  • @thang: http://stackoverflow.com/questions/2098149/what-platforms-have-something-other-than-8-bit-char – Crowman Sep 20 '14 at 03:02
  • As it happens, on those platforms where char is not 8 bits, there is no 8-bit data unit. This means that uint8_t is also not defined. It would be just as difficult to process 8-bit data on those platforms as it is to process 12-bit data on standard 8-bit machines. In this case, there's no point trying to swap out char for uint8_t as asked in the question. – thang Sep 20 '14 at 03:15

1 Answer


A char is one byte, and a file is a sequence of bytes. It doesn't matter whether the machine is 32-bit or 64-bit or something else, and it doesn't matter whether text is stored in UTF-8 or UTF-16 or something else. A file contains bytes, and each byte fits in a char. This is required by the standard.

What can vary is how many bits are in a byte on a particular platform. If it's 8, then char is the same size as uint8_t (aside from signedness, which doesn't affect how the data is stored) and you can just read bytes directly into a uint8_t array, casting its pointer to char* for the read call. But if a byte is, say, 10 bits, you're going to have to convert all those chars in a loop, since reading from the file gives you a sequence of 10-bit bytes and you need to chop off two bits from each one.
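
As a rough sketch of the 8-bit case, the following reads a whole file straight into a `std::vector<std::uint8_t>` with no intermediate `char` buffer. The helper name `read_file_bytes` is made up for illustration, and the `static_assert` encodes the assumption that `CHAR_BIT` is 8; on other platforms this would simply refuse to compile.

```cpp
#include <climits>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Assumption: this sketch only targets platforms with 8-bit bytes.
static_assert(CHAR_BIT == 8, "this sketch assumes 8-bit bytes");

// Hypothetical helper: returns the file's contents as uint8_t values,
// or an empty vector if the file cannot be opened or is empty.
std::vector<std::uint8_t> read_file_bytes(const std::string& path) {
    // Open in binary mode, positioned at the end so tellg() gives the size.
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    if (!in) {
        return {};
    }
    const std::streamsize size = static_cast<std::streamsize>(in.tellg());
    if (size <= 0) {
        return {};
    }
    in.seekg(0, std::ios::beg);

    std::vector<std::uint8_t> buffer(static_cast<std::size_t>(size));
    // istream::read takes a char*; char may alias any object, so the
    // cast lets the stream write directly into the uint8_t storage.
    in.read(reinterpret_cast<char*>(buffer.data()), size);
    // Keep only what was actually read, in case the read came up short.
    buffer.resize(static_cast<std::size_t>(in.gcount()));
    return buffer;
}
```

The cast costs nothing at run time, so there is no per-byte copy loop in this case.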

If you want your program to be adaptable to different byte sizes, you could use #if CHAR_BIT == 8 to determine whether to read straight into a uint8_t array or read into a char array and then cast all the bytes into uint8_t afterward.


Since you're "coming back to C++" and concerned about UTF-8 vs. UTF-16 when reading raw char data from a file, I'm guessing you're accustomed to languages like Java and C# where the char type represents a Unicode character. That's not the case in C and C++. A char is a byte, and if you read, say, a multi-byte UTF-8 character from a file, you get each individual byte as a separate char, not the whole Unicode character as a single value.
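
To make that concrete, here is a small sketch. The helper `count_code_points` is hypothetical and assumes its input is valid UTF-8; it counts characters by skipping continuation bytes, which have the bit pattern 10xxxxxx.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts Unicode code points in a UTF-8 string by
// counting every byte that is NOT a continuation byte (10xxxxxx).
// Assumes the input is well-formed UTF-8.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}
```

For example, with `std::string s = "\xC3\xA9";` (the UTF-8 encoding of "é"), `s.size()` is 2 because the string holds two `char` values, while `count_code_points(s)` is 1.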

Wyzard
  • Thanks! For now, I guess I'll just stick to `char` pointers then. – Panzercrisis Sep 20 '14 at 02:56
  • Note that `uint8_t` and friends are optional, and only need to be provided if an implementation provides integer types with those sizes. On a system where `CHAR_BIT` is not 8, `uint8_t` is likely to not be present, so casting to it wouldn't be an option. – Crowman Sep 20 '14 at 02:59
  • @PaulGriffiths, it seems silly for `uint8_t` to exist only on platforms where it's redundant… – Wyzard Sep 20 '14 at 03:09
  • @Wyzard: It is a bit of a head scratcher. I guess if you absolutely positively needed an exact 8 bit integer you could check for its presence and just quit if it wasn't defined. If `int8_t` is present it also has to have a two's complement representation, which isn't true for a `signed char` in general, so there's that too. – Crowman Sep 20 '14 at 03:13
  • @PaulGriffiths but in which platforms for which it is 2 bits of a head scratcher?... i'll let myself out. – thang Sep 20 '14 at 03:25