
I would like to write a 2D integer array to a binary file in binarySave.cc and then read it back in binaryRead.cc. But executing binaryRead gives: Segmentation fault (core dumped). However, when the contents of binarySave.cc and binaryRead.cc are placed in the same file (binarySaveRead.cc), the reading works as expected.

binarySave.cc

#include <iostream>
#include <fstream>

using namespace std;

int main() 
{
  int** a = new int*[10];
  for(int i=0; i<10; ++i)
  {
    a[i] = new int[2];
  }

  for(int i=0; i<10; ++i)
  {
    a[i][0] = 1;
    a[i][1] = 2;
  }

  ofstream out("test.bin", ios::binary);
  if (out.is_open())
  {
    out.write((char*)a, 10*2*sizeof(int));
  }
  out.close();

  for(int i=0; i<10; ++i)
  {
    delete [] a[i];
  }
  delete [] a;

  return 0;
}

binaryRead.cc

#include <iostream>
#include <fstream>

using namespace std;

int main() 
{
  int** a = new int*[10];
  for(int i=0; i<10; ++i)
  {
    a[i] = new int[2];
  }

  ifstream input("test.bin", ios::binary);
  input.read((char*)a, 10*2*sizeof(int));
  input.close();

  for(int i=0; i<10; ++i)
  {
    std::cout<<a[i][0]<<" "<<a[i][1]<<std::endl; //segfault
  }

  for(int i=0; i<10; ++i)
  {
    delete [] a[i];
  }
  delete [] a;

  return 0;
}

Execution gives

> ./binarySave
> ./binaryRead
Segmentation fault (core dumped)

But putting the exact same code into the same file makes it work.

binarySaveRead.cc

#include <iostream>
#include <fstream>

using namespace std;

int main() 
{
  int** a1 = new int*[10];
  for(int i=0; i<10; ++i)
  {
    a1[i] = new int[2];
  }

  for(int i=0; i<10; ++i)
  {
    a1[i][0] = 1;
    a1[i][1] = 2;
  }

  ofstream out("test2.bin", ios::binary);
  if (out.is_open())
  {
    out.write((char*)a1, 10*2*sizeof(int));
  }
  out.close();

  delete [] a1;

  //-------------------

  int** a2 = new int*[10];
  for(int i=0; i<10; ++i)
  {
    a2[i] = new int[2];
  }

  ifstream input("test2.bin", ios::binary);
  input.read((char*)a2, 10*2*sizeof(int));
  input.close();

  for(int i=0; i<10; ++i)
  {
    std::cout<<a2[i][0]<<" "<<a2[i][1]<<std::endl;
  }

  for(int i=0; i<10; ++i)
  {
    delete [] a2[i];
  }
  delete [] a2;

  return 0;
}

The output:

> ./binarySaveRead
1 2
1 2
1 2
1 2
1 2
1 2
1 2
1 2
1 2
1 2

What is the problem when the write and read are in two files?

I am on openSuse 42.3 using g++ 4.8.5.

3 Answers


You wrote your platform's internal representation of the data to a file. That data likely included references to the program's memory. When you ran a different program and read in the same data, the references pointed to nowhere special.

Say you and I are in the same room and I ask you what color your car is. You might say, "It's exactly the same as the ceiling". That would be perfectly understood by me since we're in the same room. But I can't just describe the color on the Internet that way. It would make no sense to people outdoors or in other rooms.

To store data to a file, you have to serialize it. That is, convert it to a known format that can be understood by other programs. You didn't do that.

You can't assume that what made sense to the first program, in its environment, will continue to make sense in the second program, in a completely different environment. You have to go to the effort of ensuring it can be understood in any environment.

We call the process of converting information into a form that anyone can understand "serialization". And you should learn how to do this so you can write data to files and sockets and be assured that other programs can make sense of the data.

David Schwartz
  • Thanks for the explanation! I will rewrite it. – Máté Csigi Nov 10 '17 at 11:07
  • Although this is partly true (mark endianess etc.) this isn't the real problem here. – Pixelchemist Nov 10 '17 at 11:10
  • @Pixelchemist It's precisely the problem here. The data he wrote contains a reference to other data in his memory space that won't exist in the reading process. Of course writing his code properly will also avoid all the other kinds of mistakes you can make by not serializing your data as well. – David Schwartz Nov 10 '17 at 11:11
  • @DavidSchwartz: Yes it is in fact true, but imho the root cause is a misunderstanding of how memory allocations work (or more precisely a missing understanding why that data in question actually is NOT serial). – Pixelchemist Nov 10 '17 at 11:14
  • @Pixelchemist I guess it depends on what you mean by "root cause". True his code would have happened to work if not for that, but it still wouldn't be writing the data in any particular file format and would still be a disaster waiting to happen because of endianness, word size, and so on. The root cause in the design is the failure to serialize. – David Schwartz Nov 10 '17 at 11:15
  • @DavidSchwartz: On the same machine using code from identical compilation neither endianness nor word size etc. matter. It might be called lucky that he fell over the allocation instead of thinking it worked and deploying a program that went berserk at runtime at some point in the future, though. – Pixelchemist Nov 10 '17 at 11:26
  • @Pixelchemist That might be true of some machines sure, but there is no guarantee. Imagine, for example, a platform that's transitioning between two different encoding schemes (say newer CPUs are faster with the new scheme) and which scheme is used depends on whether all the libraries support the new scheme or not. And suppose different libraries are preloaded. He would still be writing code that just happened to work because there didn't happen to be any of the many ways it could fail getting triggered, but tomorrow it might happen not to work. – David Schwartz Nov 10 '17 at 11:34
  • @Pixelchemist We shouldn't teach people who are obviously beginners how to write code that works only because any of the dozens of ways it can fail (which they don't understand) just don't happen to be happening to them. Instead, we should teach them that files are streams of bytes and if you want to write to a file, you should define a format at the byte level and write code that is guaranteed to write the correct bytes. – David Schwartz Nov 10 '17 at 11:36
  • @DavidSchwartz: Right you are - you had that +1 anyway ;-) – Pixelchemist Nov 10 '17 at 11:58

Apart from many other defects pointed out by others, please note that:

  • a2 points to a contiguous block of 10*sizeof(int*) bytes, which is not necessarily equal to 10*2*sizeof(int) bytes.
  • each element in the block pointed to by a2 points to a separate memory block of size 2*sizeof(int)

Consequently, your read (and write) procedure will cause Undefined Behavior.

pptaszni
  1. Serialization (a.k.a. how to safely generate serial binary data from internal program representations) is a hard nut to crack. Refer to the answer of @DavidSchwartz for general advice.

  2. Please: If you do not understand how pointers and manual memory allocation (via new/delete) actually work: Do not do it in C++! Use std::vector and the like. (Although using vectors wouldn't have saved you from the problem you're facing here. Take it as off-topic advice anyway.)

Your program leaks memory and certainly does not do what you think it does. Your array is not contiguously stored in memory.

Every allocation with new results in a -well- new memory location which is unrelated to any other call to new. See 1D or 2D array, what's faster? for further explanation.

Note that at the location of a/a1/a2 there are only 10 int* and not your data (1s and 2s).

I'll try to explain why the combination of both codes works (which is actually just because you do not properly delete stuff).

So what you (in your combined write/read code) basically do is:

In your write code:

  • You allocate 10x int*
  • You allocate 10 times 2x int
  • Of your 10x int* each points to a single 2x int patch
  • You write those 10 int* to your binary file (where int* likely happens to be twice the size of int so it works by accident).
  • You delete the 10x int*
  • Attention: You do NOT delete the 10 times 2x int - they persist!

In your read code:

  • You allocate 10x int* to a2
  • You allocate 10 times 2x int
  • You read the binary int* pointers from the file saving them in a2 (again works by the likely coincidence that int* is twice as big as int).
  • Attention: At this point the pointers in a2 are replaced. They point to the 10 times 2x int patches from the write part now!
  • Attention: Your newly allocated 10 times 2x int memory patches from the read part are lost here as you have overwritten the only pointers to them.
  • You access/output the int storage pointed to by these pointers (which is the storage you allocated, filled and didn't delete in the write part).
  • You delete the 10 times 2x int patches you allocated in the write part (as your a2 pointers have been replaced)
  • You delete the 10x int* patch
  • Attention: The program terminates and you leak the 10 times 2x int patches from the read part, as you neither delete them nor still have any pointer telling you where they are in the first place.
Pixelchemist