c++: working with bytes

Question

My problem is, that I need to load a binary file and work with single bits from the file. After that I need to save it out as bytes of course.

My main problem is - what datatype to choose to work in - char or long int? Can I somehow work with chars?

Btw, how long is your file? Is it really necessary to think about optimalization already? And do you have to change single bytes or are the 'single bits' chunks of bytes? — Michel Keijzers, Mar 09 '12 at 14:30
@Deepak: Using ints to parse binary data is just asking for endianness problems. — KillianDS, Mar 09 '12 at 14:30
It depends on what operations he wants to do, ANDing 8 chars is equal to one int operation.(x64) — Deepak, Mar 09 '12 at 14:34
Deepak: `sizeof(long int)` is not always the same as `sizeof(int)`. It's certainly not on the setup I'm typing this on. — Peter, Mar 09 '12 at 14:38
@Deepak: when its the same, then why sizefo(long int) != sizeof(int) here? — PlasmaHH, Mar 09 '12 at 14:41
@Deepak: I suggest you read the answer i link to in my answer. They cover size of types in depth. — daramarak, Mar 09 '12 at 14:48
@Deepak: If you 'and' those chars in a tight loop, it's possible your compiler will shift that off to an SSE instruction (which can handle even more than 8). — KillianDS, Mar 09 '12 at 14:56

score 6 · Answer 1 · answered Mar 09 '12 at 14:45

6

Unless performance is mission-critical here, use whatever makes your code easiest to understand and maintain.

answered Mar 09 '12 at 14:45

Steve Townsend

53,498
9
91
140

1

Disregard my answer, this is rule #1 – daramarak Mar 09 '12 at 14:46
2

+1 And do not reinvent the wheel if possible, if you do not have to work with a predefined serialization format, [don't go invent one](https://en.wikipedia.org/wiki/Comparison_of_data_serialization_formats). – KillianDS Mar 09 '12 at 14:53
Agree, even though it is so fun reinventing the wheel. "Look, mine is squared" – daramarak Mar 09 '12 at 14:56
It's possible that a clarified question could invite a more detailed recommendation. Not clear to me that this needs to be overthought from the info to hand, though. – Steve Townsend Mar 09 '12 at 15:00

score 5 · Accepted Answer · edited May 23 '17 at 12:03

Before beginning to code any thing make sure you understand endianess, c++ type sizes, and how strange they might be.

The unsigned char is the only type that is a fixed size (natural byte of the machine, normally 8 bits). So if you design for portability that is a safe bet. But it isn't hard to just use the unsigned int or even a long long to speed up the process and use size_of to find out how many bits you are getting in each read, although the code gets more complex that way.

You should know that for true portability none of the internal types of c++ is fixed. An unsigned char might have 9 bits, and the int might be as small as in the range of 0 to 65535, as noted in this and this answer

Another alternative, as user1200129 suggests, is to use the boost integer library to reduce all these uncertainties. This is if you have boost available on your platform. Although if going for external libraries there are many serializing libraries to choose from.

But first and foremost before even start optimizing, make something simple that work. Then you can start profiling when you start experiencing timing issues.

Yeah, world of programming gets strange at once you start exploring alien platforms ;) — daramarak, Mar 09 '12 at 15:23
You can use boost integer.hpp for portable int types. For example, if you need to ensure you get 64 signed bits, you can use boost::int64_t across different compilers and operating systems and you'll always get the type you expect. This is especially important when you need to reinterpret_cast data. — 01100110, Mar 09 '12 at 15:28

score 3 · Answer 3 · answered Mar 09 '12 at 14:28

It really just depends on what you are wanting to do, but I would say in general, the best speed will be to stick with the size of integers that your program is compiled in. So if you have a 32 bit program, then choose 32 bit integers, and if you have 64 bit, choose 64 bit.

This could be different if there are some bytes in your file, or if there are integers. Without knowing the exact structure of your file, it's difficult to determine what the optimal value is.

Michel Keijzers · Answer 4 · 2012-03-09T14:34:12.457

1

Your sentences are not really correct English, but as far as I can interpret the question you can beter use unsigned char (which is a byte) type to be able to modify each byte separately.

Edit: changed according to comment.

edited Mar 09 '12 at 14:34

answered Mar 09 '12 at 14:26

Michel Keijzers

15,025
28
93
119

2

What's an unsigned byte? byte is an unsigned char. – MByD Mar 09 '12 at 14:27
1

Now it is somewhat proper English. :) – Prof. Falken Mar 09 '12 at 14:28
Since there is no definition for `byte` in C, you can't say if it's signed or not. – Mr Lister Mar 09 '12 at 14:29
1

@Michel you edited it the wrong way round. you were looking for `unsigned char`. – Mr Lister Mar 09 '12 at 14:30
Fixed (Friday Afternoon Syndrome) – Michel Keijzers Mar 09 '12 at 14:34
There are some platforms (e.g. some TI DSPs) which have 16-bit `char`, since they cannot address bytes. On these, `sizeof(char) == 1`, `sizeof(short) == 1` and `sizeof(int) == 2`... – Mike DeSimone Mar 09 '12 at 14:59

111111 · Answer 5 · 2012-03-09T14:45:02.657

1

If you are dealing with bytes then the best way to do this is to use a size specific type.

#include <algorithm>
#include <iterator>
#include <cinttypes>
#include <vector>
#include <fstream>

int main()
{
     std::vector<int8_t> file_data;
     std::ifstream file("file_name", std::ios::binary);

     //read
     std::copy(std::istream_iterator<int8_t>(file),
               std::istream_iterator<int8_t>(),
               std::back_inserter(file_data));

     //write
     std::ofstream out("outfile");           
     std::copy(file_data.begin(), file_data.end(),
               std::ostream_iterator<int8_t>(out));

}

EDIT fixed bug

edited Mar 09 '12 at 14:45

answered Mar 09 '12 at 14:39

111111

15,686
6
47
62

the uint8_t are not guaranteed to be defined for all systems. But it much more clearly states the intent of the use. – daramarak Mar 09 '12 at 14:45
The C99 standard has been around a long time, and almost all systems have ``. (I can't think of one that doesn't, honestly. It's one of the easiest headers ever to provide.) The C++ equivalent might not be there, but that's easily worked around. – Mike DeSimone Mar 09 '12 at 14:57

score 1 · Answer 6 · answered Mar 09 '12 at 14:54

If you need to enforce how many bits are in an integer type, you need to be using the <stdint.h> header. It is present in both C and C++. It defines type such as uint8_t (8-bit unsigned integer), which are guaranteed to resolve to the proper type on the platform. It also tells other programmers who read your code that the number of bits is important.

If you're worrying about performance, you might want to use the larger-than-8-bits types, such as uint32_t. However, when reading and writing files, you will need to pay attention to the endianess of your system. Notably, if you have a little-endian system (e.g. x86, most all ARM), then the 32-bit value 0x12345678 will be written to the file as the four bytes 0x78 0x56 0x34 0x12, while if you have a big-endian system (e.g. Sparc, PowerPC, Cell, some ARM, and the Internet), it will be written as 0x12 0x34 0x56 0x78. (same goes or reading). You can, of course, work with 8-bit types and avoid this issue entirely.

c++: working with bytes

6 Answers6