
All I know is that the error occurs when this branch gets executed and the weights it produces get passed down to tf.data.experimental.sample_from_datasets:

# ...
elif pretrain_cfg.schedule == PretrainSchedule.CONVERGE_LINEARLY:
    logger.info('[%s] - Pretrain: Using CONVERGE_LINEARLY schedule' % self.name)
    a = tf.minimum(tf.constant(1.0, dtype=tf.float64, shape=(1,)), global_step / max_pretrain_steps)
    b = tf.maximum(tf.constant(0.0, dtype=tf.float64, shape=(1,)), 1 - global_step / max_pretrain_steps)
    weights = a * const_task_weights + b * pretrain_task_weights

return tf.data.experimental.sample_from_datasets(datasets, weights=weights)
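
The intent of that branch is a linear blend: at step 0 the sampling weights equal pretrain_task_weights, and from max_pretrain_steps onwards they equal const_task_weights. The same arithmetic in plain NumPy (the weight values below are just made-up examples, not my real task weights):

import numpy as np

# Made-up example values, just to illustrate the schedule.
const_task_weights = np.array([0.5, 0.5])
pretrain_task_weights = np.array([1.0, 0.0])
max_pretrain_steps = 100

for global_step in (0, 50, 100, 200):
    a = min(1.0, global_step / max_pretrain_steps)
    b = max(0.0, 1 - global_step / max_pretrain_steps)
    print(global_step, a * const_task_weights + b * pretrain_task_weights)
# global_step=0 -> [1. 0.], 50 -> [0.75 0.25], 100 and beyond -> [0.5 0.5]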

The following works:

weights = tf.cond(
    tf.greater(global_step, max_pretrain_steps),
    true_fn=lambda: const_task_weights,
    false_fn=lambda: pretrain_task_weights
)

but for some reason the following causes the SIGSEGV:

a = tf.minimum(tf.constant(1.0, dtype=tf.float64, shape=(1,)), global_step / max_pretrain_steps)
b = tf.maximum(tf.constant(0.0, dtype=tf.float64, shape=(1,)), 1 - global_step / max_pretrain_steps)
weights = a * const_task_weights + b * pretrain_task_weights

I don't really see what the problem is, but it definitely comes from this line:

weights = a * const_task_weights + b * pretrain_task_weights

The question is why. It might not be valid to have a dependency on the global_step in this context, i.e. in the tensor that ends up as the weights parameter of sample_from_datasets.

However, I don't see anything suspicious in sample_from_datasets either, since the first thing that happens inside it is

weights = ops.convert_to_tensor(weights, name="weights")

So passing a tensor to it should be fine.
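
For what it's worth, here is the pattern reduced to a minimal, self-contained form (TF 1.x; the datasets, step values and weight values below are made up and not my actual pipeline; in the real code global_step is the training global step):

import tensorflow as tf

# Two dummy datasets standing in for my task datasets.
ds_a = tf.data.Dataset.from_tensors(0).repeat()
ds_b = tf.data.Dataset.from_tensors(1).repeat()

# Made-up values; in the real code global_step comes from tf.train.get_global_step().
global_step = tf.cast(tf.constant(50, dtype=tf.int64), tf.float64)
max_pretrain_steps = tf.constant(100.0, dtype=tf.float64)

const_task_weights = tf.constant([0.5, 0.5], dtype=tf.float64)
pretrain_task_weights = tf.constant([1.0, 0.0], dtype=tf.float64)

a = tf.minimum(tf.constant(1.0, dtype=tf.float64), global_step / max_pretrain_steps)
b = tf.maximum(tf.constant(0.0, dtype=tf.float64), 1 - global_step / max_pretrain_steps)
weights = a * const_task_weights + b * pretrain_task_weights

mixed = tf.data.experimental.sample_from_datasets([ds_a, ds_b], weights=weights)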

Any ideas?


Error output:

INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into /data/translation/multi-problem/hi2en/model/512-3-1-1024/de2en.hi2en/c19cfad259cad911/model.ckpt.
bash: line 1:  4153 Segmentation fault      (core dumped) env "CUDA_VISIBLE_DEVICES"="0" "LIBRARY_ROOTS"="/Users/username/Library/Caches/PyCharm2018.2/remote_sou...

Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)
Stefan Falk
  • A segfault when trying to save a checkpoint is IMO a reason to open a bug report issue. Even assuming you're doing something wrong, there should be checks in place to prevent this kind of situation. Can't say much more, I really have no clue of what the source of the error might be, as I don't know exactly what happens when TF saves a checkpoint and how this plays with the way you handle weights. – GPhilo Sep 13 '19 at 09:31
  • @GPhilo Sounds reasonable. I am going to open an issue on github :) – Stefan Falk Sep 13 '19 at 09:32
  • @GPhilo It was actually my own fault (see answer). :D – Stefan Falk Sep 13 '19 at 12:36

1 Answer


Okay, it turns out the problem is not with tensorflow directly - in fact, not with tensorflow at all.

The problem was that const_task_weights and pretrain_task_weights did not have the same shape: I did not validate the input and had a bug somewhere else that produced the mismatch.

Just be aware that you might get this kind of error if the shapes do not match.

I guess this cannot be checked or determined by tensorflow itself, so it is something the user has to take care of (citation needed).
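
In my case an explicit check before blending the weights would have caught it early. A minimal sketch, assuming both weights are tensors with fully defined static shapes (the constants below are just stand-ins for my real task weight tensors):

import tensorflow as tf

# Stand-ins for the real task weight tensors from the question.
const_task_weights = tf.constant([0.5, 0.5], dtype=tf.float64)
pretrain_task_weights = tf.constant([1.0, 0.0], dtype=tf.float64)

const_shape = const_task_weights.shape.as_list()
pretrain_shape = pretrain_task_weights.shape.as_list()
if const_shape != pretrain_shape:
    # Fail fast with a readable error instead of a crash later on.
    raise ValueError('Task weight shapes do not match: %s vs %s'
                     % (const_shape, pretrain_shape))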

Stefan Falk