
I am currently going through the steps to train DeepLab with the xception_65 backbone on the Cityscapes dataset, but unfortunately I run into a segmentation fault. I cannot pin down the cause: training on the PASCAL dataset, for example, works fine. I checked the paths and tried several versions and combinations of TensorFlow, drivers, etc. Even if I run the train.py script without GPU support I get the same segmentation fault. I performed the same steps on another PC and there it worked. Does anyone know what the problem is?

My setup:

  • Ubuntu 18.04
  • NVIDIA RTX 2080 with driver version 430.65 (installed with .run file)
  • CUDA 10.0 (installed with .run file)
  • cuDNN 7.6.5
  • Python 3.6
  • TensorFlow 1.15

By running:

python3 "${WORK_DIR}"/train.py \
  --logtostderr \
  --training_number_of_steps=${NUM_ITERATIONS} \
  --train_split="train_fine" \
  --model_variant="xception_65" \
  --atrous_rates=6 \
  --atrous_rates=12 \
  --atrous_rates=18 \
  --output_stride=16 \
  --decoder_output_stride=4 \
  --train_crop_size="769,769" \
  --train_batch_size=1 \
  --fine_tune_batch_norm=False \
  --dataset="cityscapes" \
  --tf_initial_checkpoint="${INIT_FOLDER}/deeplabv3_cityscapes_train/model.ckpt" \
  --train_logdir="${TRAIN_LOGDIR}" \
  --dataset_dir="${CITYSCAPES_DATASET}" 

I get the following output:

I1119 16:52:49.856512 139832269989696 learning.py:768] Starting Queues.
Fatal Python error: Segmentation fault

Thread 0x00007f2cd086b700 (most recent call first):
  File "/home/kuschnig/anaconda3/envs/conda-tf/lib/python3.7/threading.py", line 296 in wait
  File "/home/kuschnig/anaconda3/envs/conda-tf/lib/python3.7/queue.py", line 170 in get
  File "/home/kuschnig/anaconda3/envs/conda-tf/lib/python3.7/site-packages/tensorflow_core/python/summary/writer/event_file_writer.py", line 159 in run
  File "/home/kuschnig/anaconda3/envs/conda-tf/lib/python3.7/threading.py", line 926 in _bootstrap_inner
  File "/home/kuschnig/anaconda3/envs/conda-tf/lib/python3.7/threading.py", line 890 in _bootstrap

Thread 0x00007f2d3cc7e740 (most recent call first):
  File "/home/kuschnig/anaconda3/envs/conda-tf/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1443 in _call_tf_sessionrun
  File "/home/kuschnig/anaconda3/envs/conda-tf/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1350 in _run_fn
  File "/home/kuschnig/anaconda3/envs/conda-tf/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1365 in _do_call
  File "/home/kuschnig/anaconda3/envs/conda-tf/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1359 in _do_run
  File "/home/kuschnig/anaconda3/envs/conda-tf/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1180 in _run
  File "/home/kuschnig/anaconda3/envs/conda-tf/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 956 in run
  File "/home/kuschnig/anaconda3/envs/conda-tf/lib/python3.7/site-packages/tensorflow_core/contrib/slim/python/slim/learning.py", line 490 in train_step
  File "/home/kuschnig/anaconda3/envs/conda-tf/lib/python3.7/site-packages/tensorflow_core/contrib/slim/python/slim/learning.py", line 775 in train
  File "/home/kuschnig/tensorflow/models/research/deeplab/train.py", line 466 in main
  File "/home/kuschnig/anaconda3/envs/conda-tf/lib/python3.7/site-packages/absl/app.py", line 250 in _run_main
  File "/home/kuschnig/anaconda3/envs/conda-tf/lib/python3.7/site-packages/absl/app.py", line 299 in run
  File "/home/kuschnig/anaconda3/envs/conda-tf/lib/python3.7/site-packages/tensorflow_core/python/platform/app.py", line 40 in run
  File "/home/kuschnig/tensorflow/models/research/deeplab/train.py", line 472 in <module>
Segmentation fault (core dumped)

The backtrace with gdb shows: GDB Output

talonmies
kuschi

3 Answers


I had the same problem as described. I managed to solve it by doing two things:

  1. Make sure the names of your tfrecords (for me they are named train-00000-of-00010.tfrecord) match --train_split="train".
  2. In data_generator.py, around line 72, change splits_to_sizes={'train_fine': 2975 to splits_to_sizes={'train': 2975.

What does the trick is to use the same name (for me it is train) in the .sh script that launches the training, in data_generator.py, and in your tfrecord folder.

Achille
  • 31
  • 3
1

My problem looks like yours, and I realized that --dataset_dir is supposed to point to the directory containing the tfrecord data for Cityscapes, not to the Cityscapes directory itself.

This is the code that retrieves the data in data_generator.py:

def _get_all_files(self):
    """Gets all the files to read data from.

    Returns:
      A list of input files.
    """
    file_pattern = _FILE_PATTERN
    file_pattern = os.path.join(self.dataset_dir,
                                file_pattern % self.split_name)
    return tf.gfile.Glob(file_pattern)
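To see why this matters, here is a small self-contained sketch of the same lookup using plain glob instead of tf.gfile.Glob (assuming _FILE_PATTERN is '%s-*', as in the stock data_generator.py): the pattern only matches when dataset_dir is the tfrecord folder itself, and silently yields an empty file list otherwise.

```python
import glob
import os
import tempfile

# In deeplab's data_generator.py, _FILE_PATTERN is '%s-*', so the split
# name must be the filename prefix of your tfrecords.
_FILE_PATTERN = '%s-*'

def matching_files(dataset_dir, split_name):
    """Same logic as _get_all_files, with glob.glob for tf.gfile.Glob."""
    pattern = os.path.join(dataset_dir, _FILE_PATTERN % split_name)
    return glob.glob(pattern)

# Demo with a throwaway layout: <root>/tfrecord/train-00000-of-00010.tfrecord
root = tempfile.mkdtemp()
tfrecord_dir = os.path.join(root, 'tfrecord')
os.makedirs(tfrecord_dir)
open(os.path.join(tfrecord_dir, 'train-00000-of-00010.tfrecord'), 'w').close()

print(len(matching_files(tfrecord_dir, 'train')))  # tfrecord folder: 1 match
print(len(matching_files(root, 'train')))          # dataset root: 0 matches
```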
Van Teo Le

I still do not know what causes the segmentation fault, but the solution for me was to specify a new dataset for Cityscapes in data_generator.py.

kuschi