0

Note: This question ran a little longer than I had planned. So, in summary:

  1. This a 'please help me debug this code' question.
  2. The first part of my question is why checking if (ptr) before deallocation doesn't stop a segfault
  3. The second part is how do I actually stop the segfault

I first present the stack trace from the segfault. Then I show the various methods and type definitions as I traced them back. Hopefully, this question and my thought process will be easy to follow (and answer). Now, for the actual question .....

I am currently running unittests on deeplab-public-version2, which is a deep learning system built on top of Caffe. When I run one of the unittests, I get the following segfault:

$ ./.build_release/test/test_layer_factory.testbin 
Cuda number of devices: 1
Current device id: 0
Current device name: GeForce GTX 1080 with Max-Q Design
[==========] Running 4 tests from 4 test cases.
[----------] Global test environment set-up.
[----------] 1 test from LayerFactoryTest/0, where TypeParam = caffe::CPUDevice<float>
[ RUN      ] LayerFactoryTest/0.TestCreateLayer
*** Aborted at 1580356126 (unix time) try "date -d @1580356126" if you are using GNU date ***
PC: @     0x7f74c84d298d cfree
*** SIGSEGV (@0xe28) received by PID 17549 (TID 0x7f74cacda680) from PID 3624; stack trace: ***
    @     0x7f74c883e890 (unknown)
    @     0x7f74c84d298d cfree
    @     0x7f74c8e00ff1 deallocate()
    @     0x7f74c8f3367c caffe::DenseCRFLayer<>::DeAllocateAllData()
    @     0x7f74c8f36738 caffe::DenseCRFLayer<>::~DenseCRFLayer()
    @     0x7f74c8f36c09 caffe::DenseCRFLayer<>::~DenseCRFLayer()
    @     0x55f92259abc7 caffe::LayerFactoryTest_TestCreateLayer_Test<>::TestBody()
    @     0x55f9225b892a testing::internal::HandleExceptionsInMethodIfSupported<>()
    @     0x55f9225b10ca testing::Test::Run()
    @     0x55f9225b11ac testing::TestInfo::Run()
    @     0x55f9225b12e5 testing::TestCase::Run()
    @     0x55f9225b17a0 testing::internal::UnitTestImpl::RunAllTests()
    @     0x55f9225b18e7 testing::UnitTest::Run()
    @     0x55f9225967c1 main
    @     0x7f74c845cb97 __libc_start_main
    @     0x55f922596bba _start
Segmentation fault (core dumped)

I also get this fault when running the complete suite of unittests.

Looking at the stack trace, I believe the issue comes when deallocating a DenseCRFLayer object. The deallocate() method is as follows:

void deallocate(float*& ptr) {
  if (ptr)
#ifdef SSE_DENSE_CRF
    _mm_free( ptr );
#else
  delete[] ptr;
#endif
  ptr = NULL;
}

So, I would have thought that the if (ptr) statement would ensure that the code would only attempt to deallocate an existing pointer. How could this not be the case?

If it's of any help, the DeAllocateAllData method looks like this:

template <typename Dtype>
void DenseCRFLayer<Dtype>::DeAllocateAllData() {
  deallocate(unary_);
  deallocate(current_);
  deallocate(next_);
  deallocate(tmp_);
}

where, these 4 variable are created as follows:

template <typename Dtype>
void DenseCRFLayer<Dtype>::AllocateAllData() {
  unary_   = allocate(unary_element_);
  current_ = allocate(unary_element_);
  next_    = allocate(unary_element_);
  tmp_     = allocate(unary_element_);  
}

which are allocated as follows:

float* allocate(size_t N) {
  float * r = NULL;
  if (N>0) {
#ifdef SSE_DENSE_CRF
    r = (float*)_mm_malloc( N*sizeof(float)+16, 16 );
#else
    r = new float[N];
#endif
  }

  memset( r, 0, sizeof(float)*N);
  return r;
}

Would it make sense to add an extra check in this method to see if each of the pointers are eligible for deallocation?

The only other puzzling thing here, which might be a contributing factor is that the object destructor seems to be called twice (!?). So, if it's of any relevance here are the destructor method

template <typename Dtype>
DenseCRFLayer<Dtype>::~DenseCRFLayer() {
  ClearPairwiseFunctions();
  DeAllocateAllData();
}

template <typename Dtype>
void DenseCRFLayer<Dtype>::ClearPairwiseFunctions() {
  for (size_t i = 0; i < pairwise_.size(); ++i) {
    delete pairwise_[i];
  }
  pairwise_.clear();
}

where pairwise has the following type:

std::vector<PairwisePotential*> pairwise_;

and PairwisePotential is the following virtual class:

class PairwisePotential {
 public:
  virtual ~PairwisePotential();
  virtual void apply(float * out_values, const float * in_values, float * tmp, int value_size) const = 0;
};

Finally, the unit test method is this:

namespace caffe {

template <typename TypeParam>
class LayerFactoryTest : public MultiDeviceTest<TypeParam> {};

TYPED_TEST_CASE(LayerFactoryTest, TestDtypesAndDevices);

TYPED_TEST(LayerFactoryTest, TestCreateLayer) {
  typedef typename TypeParam::Dtype Dtype;
  typename LayerRegistry<Dtype>::CreatorRegistry& registry =
      LayerRegistry<Dtype>::Registry();
  shared_ptr<Layer<Dtype> > layer;
  for (typename LayerRegistry<Dtype>::CreatorRegistry::iterator iter =
       registry.begin(); iter != registry.end(); ++iter) {
    // Special case: PythonLayer is checked by pytest
    if (iter->first == "Python") { continue; }
    LayerParameter layer_param;
    // Data layers expect a DB
    if (iter->first == "Data") {
#ifdef USE_LEVELDB
      string tmp;
      MakeTempDir(&tmp);
      boost::scoped_ptr<db::DB> db(db::GetDB(DataParameter_DB_LEVELDB));
      db->Open(tmp, db::NEW);
      db->Close();
      layer_param.mutable_data_param()->set_source(tmp);
#else
      continue;
#endif  // USE_LEVELDB
    }
    layer_param.set_type(iter->first);
    layer = LayerRegistry<Dtype>::CreateLayer(layer_param);
    EXPECT_EQ(iter->first, layer->type());
  }
}

}  // namespace caffe

Note: I don't quite understand how this object is constructed: When I look in the hpp file, I see this:

/**
 * @brief The DenseCRF layer performs mean-field inference under a
 *  fully-connected CRF model with Gaussian potentials.
 *
 */
template <typename Dtype>
class DenseCRFLayer : public Layer<Dtype> {
 public:
  explicit DenseCRFLayer(const LayerParameter& param)
      : Layer<Dtype>(param) {}
  virtual ~DenseCRFLayer();

So, I do not think a default constructor is being used. Yet, when I do the following grep in the parent directory, I see this output, which does not show evidence of a non-default constructor:

$ grep -r 'DenseCRFLayer(' .
./include/caffe/layers/densecrf_layer.hpp:  explicit DenseCRFLayer(const LayerParameter& param)
./include/caffe/layers/densecrf_layer.hpp:  virtual ~DenseCRFLayer();
./src/caffe/layers/densecrf_layer.cpp:DenseCRFLayer<Dtype>::~DenseCRFLayer() {
$

I see where the constructor and destructor are declared, but I only see the destructor defined....

user1245262
  • 6,968
  • 8
  • 50
  • 77
  • 5
    Does `DenseCRFLayer` have a custom copy constructor (on you defined) or is it just using the compiler provided default? – NathanOliver Jan 30 '20 at 14:14
  • 1
    I'll just note that `if (ptr)` will *never* prevent a seg fault or undefined behavior, because it is actually quite legal and safe to `delete`, `delete[]`, or `free()` a null pointer. There's no way to test whether a non-null pointer is safe to deallocate. – Fred Larson Jan 30 '20 at 14:21
  • 1
    Checking for null also won't stop you from deleting the same pointer twice. Setting it null in the deallocate is helpful, but maybe not if it was copied to another variable. You can detect that particular problem by logging every pointer value that is allocated and deallocated, and parsing that log. – Kenny Ostrom Jan 30 '20 at 14:31
  • @NathanOliver - It's a little embarassing, but I am not certain. At the bottom of my question, I've added new text that shows how I found a declaration of a constructor, but no definition. Is there something obvious that I forgot to do, or should have realized? – user1245262 Jan 30 '20 at 14:32
  • 1
    @user1245262 Whenever you do dynamic memory management, you need to follow the rule of three. That is provide a copy constructor, a copy assignment operator and a destructor. This is important because by default the ones provided by the compiler just do a shallow copy. That means if you make a copy, both objects will point to the same memory. When one of the copies gets destroyed, it destroys the memory for all the other copies which leads to problems like this. I've closed this as a dupe that goes over this in great detail. – NathanOliver Jan 30 '20 at 14:36
  • If that winds up not being the issue. Let me know and we can reopen. – NathanOliver Jan 30 '20 at 14:36
  • @NathanOliver - Thanks. I'll read up and try to figure things out. As you can probably tell, this code is something of an `optimal` stretch for my C++ abilities... – user1245262 Jan 30 '20 at 14:40
  • @user1245262 -- You need to decide whether those objects are meant to be copyable or not. If they're meant to be copyable, then as stated, you need to provide the necessary glue (copy constructor, assignment op, desctructor) so that the class has correct copy semantics. If the class is not meant to be copied, you should turn off the copying by either declaring the copy constructor and assignment op as `delete`, or declare these functions, but not implement them. A third case is if the objects can be moveable but not copyable, then you create a move constructor and move assignment op. – PaulMcKenzie Jan 30 '20 at 14:49

0 Answers0