
I am using JupyterLab on AWS SageMaker. Kernel: conda_pytorch_latest_p36.

The traceback suggests a problem with a list object, possibly related to the batch size, and most likely coming from my datasets. With the default batch size the run fails with RuntimeError: each element in list of batch should be of equal size (full traceback below).

When batch_size = 1, it instead fails with: RuntimeError: zero-dimensional tensor (at position 0) cannot be concatenated.
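
I don't know which call in the pipeline raises that batch_size = 1 variant, but for reference it is the message torch.cat produces when handed zero-dimensional (scalar) tensors, e.g. unstacked scalar labels. A minimal illustration, not my actual code:

import torch

a = torch.tensor(1.0)   # zero-dimensional (scalar) tensors
b = torch.tensor(2.0)

torch.stack([a, b])     # works: tensor([1., 2.])
torch.cat([a, b])       # RuntimeError: zero-dimensional tensor (at position 0) cannot be concatenated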


Code that triggers the error:

# re-train cell
args = """
      --max_epochs 20
      --progress_bar_refresh_rate 2
      --gradient_clip_val 0.5
      --log_gpu_memory True
      --gpus 1
    """.split()
run_training(args)

run_training():

import importlib
import os
import shutil
import tarfile
from argparse import ArgumentParser

import mlflow
import pytorch_lightning as pl

def run_training(input=None):
    args = parse_args(input)
    pl.seed_everything(args.seed)
    module = importlib.import_module('pytorch_lightning.loggers')
    logger = getattr(module, args.logging)(save_dir='logs')
    csv_logger = pl.loggers.CSVLogger(save_dir=f'{args.modeldir}/csv_logs')
    loggers = [logger, csv_logger]
    dm = OntologyTaggerDataModule.from_argparse_args(args)
    if args.model_uri:
        local_model_uri = os.environ.get('SM_CHANNEL_MODEL', '.')
        tar_path = os.path.join(local_model_uri, 'model.tar.gz')
        tar = tarfile.open(tar_path, "r:gz")
        tar.extractall(local_model_uri)
        tar.close()
        model_path = os.path.join(local_model_uri, args.checkpointfile)
        model = OntologyTaggerModel.load_from_checkpoint(model_path)
    elif args.checkpointfile:
        file_path = os.path.join(args.traindir, args.checkpointfile)
        model = OntologyTaggerModel.load_from_checkpoint(file_path)
    else:
        model = OntologyTaggerModel(
            **vars(args), num_classes=dm.num_classes, class_map=dm.class_map
        )
    checkpoint_callback = pl.callbacks.ModelCheckpoint(
        args.checkpointdir, save_last=True, save_weights_only=True
    )

    checkpoint_dir = os.environ.get('SM_CHANNEL_TRAINING', './')
    if checkpoint_dir != './':
        labels_file_orig = os.path.join(checkpoint_dir, args.labels)
        labels_file_cp = os.path.join(args.modeldir, os.path.basename(args.labels))
        shutil.copyfile(labels_file_orig, labels_file_cp)
    trainer = pl.Trainer.from_argparse_args(
        args, callbacks=[checkpoint_callback], logger=loggers
    )
    print('model', model)
    print('dm', dm)
    trainer.fit(model, dm) # CRASH !
    model_file = os.path.join(args.modeldir, 'last.ckpt')
    trainer.save_checkpoint(model_file, weights_only=True)

Traceback:

Global seed set to 42
Loading pretrained model bert-base-cased
Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForMulticlassSequenceClassification: ['cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertForMulticlassSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMulticlassSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForMulticlassSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifiers.1.bias', 'classifiers.0.weight', 'classifiers.1.weight', 'classifiers.0.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
/home/ec2-user/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/pytorch_lightning/utilities/distributed.py:52: UserWarning: ModelCheckpoint(save_last=True, monitor=None) is a redundant configuration. You can save the last checkpoint with ModelCheckpoint(save_top_k=None, monitor=None).
  warnings.warn(*args, **kwargs)
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
## dm ## <__main__.OntologyTaggerDataModule object at 0x7f80883dcc18>

  | Name            | Type                                    | Params
----------------------------------------------------------------------------
0 | model           | BertForMulticlassSequenceClassification | 108 M 
1 | valid_acc       | Accuracy                                | 0     
2 | valid_f1        | F1                                      | 0     
3 | valid_acc_multi | ModuleList                              | 0     
----------------------------------------------------------------------------
108 M     Trainable params
0         Non-trainable params
108 M     Total params
433.410   Total estimated model params size (MB)
/home/ec2-user/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/pytorch_lightning/utilities/distributed.py:52: UserWarning: The dataloader, val dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 8 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  warnings.warn(*args, **kwargs)
Validation sanity check: 0it [00:00, ?it/s]
###score: val_score### 0.11666667461395264
/home/ec2-user/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torchmetrics/utilities/prints.py:36: UserWarning: The ``compute`` method of metric Accuracy was called before the ``update`` method which may lead to errors, as metric states have not yet been updated.
  warnings.warn(*args, **kwargs)
/home/ec2-user/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torchmetrics/utilities/prints.py:36: UserWarning: The ``compute`` method of metric F1 was called before the ``update`` method which may lead to errors, as metric states have not yet been updated.
  warnings.warn(*args, **kwargs)
/home/ec2-user/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torchmetrics/utilities/prints.py:36: UserWarning: The ``compute`` method of metric CompositionalMetric was called before the ``update`` method which may lead to errors, as metric states have not yet been updated.
  warnings.warn(*args, **kwargs)
/home/ec2-user/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/pytorch_lightning/utilities/distributed.py:52: UserWarning: The dataloader, train dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 8 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  warnings.warn(*args, **kwargs)
Training: 0it [00:00, ?it/s]
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-8-89b1728bd5a6> in <module>
      7       --gpus 1
      8     """.split()
----> 9 run_training(args)

<ipython-input-5-06912a89c08b> in run_training(input)
     86     )
     87     print("## dm ##", dm)
---> 88     trainer.fit(model, dm)
     89     model_file = os.path.join(args.modeldir, 'last.ckpt')
     90     trainer.save_checkpoint(model_file, weights_only=True)

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py in fit(self, model, train_dataloader, val_dataloaders, datamodule)
    497 
    498         # dispath `start_training` or `start_testing` or `start_predicting`
--> 499         self.dispatch()
    500 
    501         # plugin will finalized fitting (e.g. ddp_spawn will load trained model)

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py in dispatch(self)
    544 
    545         else:
--> 546             self.accelerator.start_training(self)
    547 
    548     def train_or_test_or_predict(self):

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py in start_training(self, trainer)
     71 
     72     def start_training(self, trainer):
---> 73         self.training_type_plugin.start_training(trainer)
     74 
     75     def start_testing(self, trainer):

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py in start_training(self, trainer)
    112     def start_training(self, trainer: 'Trainer') -> None:
    113         # double dispatch to initiate the training loop
--> 114         self._results = trainer.run_train()
    115 
    116     def start_testing(self, trainer: 'Trainer') -> None:

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py in run_train(self)
    635                 with self.profiler.profile("run_training_epoch"):
    636                     # run train epoch
--> 637                     self.train_loop.run_training_epoch()
    638 
    639                 if self.max_steps and self.max_steps <= self.global_step:

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py in run_training_epoch(self)
    482         val_loop_called = False
    483 
--> 484         for batch_idx, (batch, is_last_batch) in train_dataloader:
    485 
    486             self.trainer.batch_idx = batch_idx

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/pytorch_lightning/profiler/profilers.py in profile_iterable(self, iterable, action_name)
     80             try:
     81                 self.start(action_name)
---> 82                 value = next(iterator)
     83                 self.stop(action_name)
     84                 yield value

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/pytorch_lightning/trainer/connectors/data_connector.py in _with_is_last(self, iterable)
     45         See `https://stackoverflow.com/a/1630350 <https://stackoverflow.com/a/1630350>`_"""
     46         it = iter(iterable)
---> 47         last = next(it)
     48         for val in it:
     49             # yield last and has next

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/pytorch_lightning/trainer/supporters.py in __next__(self)
    468 
    469         """
--> 470         return self.request_next_batch(self.loader_iters)
    471 
    472     @staticmethod

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/pytorch_lightning/trainer/supporters.py in request_next_batch(loader_iters)
    482 
    483         """
--> 484         return apply_to_collection(loader_iters, Iterator, next)
    485 
    486     @staticmethod

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/pytorch_lightning/utilities/apply_func.py in apply_to_collection(data, dtype, function, wrong_dtype, *args, **kwargs)
     82     # Breaking condition
     83     if isinstance(data, dtype) and (wrong_dtype is None or not isinstance(data, wrong_dtype)):
---> 84         return function(data, *args, **kwargs)
     85 
     86     # Recursively apply to collection items

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torch/utils/data/dataloader.py in __next__(self)
    433         if self._sampler_iter is None:
    434             self._reset()
--> 435         data = self._next_data()
    436         self._num_yielded += 1
    437         if self._dataset_kind == _DatasetKind.Iterable and \

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torch/utils/data/dataloader.py in _next_data(self)
    473     def _next_data(self):
    474         index = self._next_index()  # may raise StopIteration
--> 475         data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
    476         if self._pin_memory:
    477             data = _utils.pin_memory.pin_memory(data)

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py in fetch(self, possibly_batched_index)
     45         else:
     46             data = self.dataset[possibly_batched_index]
---> 47         return self.collate_fn(data)

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torch/utils/data/_utils/collate.py in default_collate(batch)
     81             raise RuntimeError('each element in list of batch should be of equal size')
     82         transposed = zip(*batch)
---> 83         return [default_collate(samples) for samples in transposed]
     84 
     85     raise TypeError(default_collate_err_msg_format.format(elem_type))

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torch/utils/data/_utils/collate.py in <listcomp>(.0)
     81             raise RuntimeError('each element in list of batch should be of equal size')
     82         transposed = zip(*batch)
---> 83         return [default_collate(samples) for samples in transposed]
     84 
     85     raise TypeError(default_collate_err_msg_format.format(elem_type))

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torch/utils/data/_utils/collate.py in default_collate(batch)
     79         elem_size = len(next(it))
     80         if not all(len(elem) == elem_size for elem in it):
---> 81             raise RuntimeError('each element in list of batch should be of equal size')
     82         transposed = zip(*batch)
     83         return [default_collate(samples) for samples in transposed]

RuntimeError: each element in list of batch should be of equal size
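
The last frames are PyTorch's default_collate. A minimal sketch that reproduces the same message, assuming (as I suspect) that __getitem__ returns sequences of varying length; ToyDataset below is made up purely for illustration:

from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):
    """Each item is (label, tokens); the token lists have different lengths."""
    def __init__(self):
        self.rows = [(0, [1, 2, 3]),
                     (1, [4, 5])]   # second row is shorter than the first

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        return self.rows[idx]

loader = DataLoader(ToyDataset(), batch_size=2)
next(iter(loader))
# RuntimeError: each element in list of batch should be of equal size

With batch_size = 2 the default collate_fn tries to zip the two token lists together and fails because their lengths differ; with batch_size = 1 that size check passes trivially, which might explain why I see a different error in that case.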

Datasets

All are tab-delimited .csv files. A quick size-consistency check over the resulting dataset is sketched after the samples below.

classes.csv:

Activity    Event
Actor   Person
Agent   Person
Album   Product
Animal  Object
ArchitecturalStructure  Location
Artist  Person
Athlete Person
AutomobileEngine    Product
Award   Object
Biomolecule Object
Bird    Object
BodyOfWater Location
Building    Location

Universities.csv:

University  Country
A.T. Still University   US
Aalborg Universitet DK
Aalto-yliopisto FI

train.csv:

Writer  Person  Petar Hektorovi\u0107 Petar Kanaveli\u0107 Petar Ko\u010Di\u0107 Petar \u0160egedin (writer) Petar Zorani\u0107 Pete Johnson (author) Pete Prown Peter Abrahams (American author) Peter-Adrian Cohen Peter Aleshkovsky

train_textcorrupted.csv (deliberate misspellings):

Person  First Name   mogu creek new souhth wales moguchinsky distrect mogud mogulo subregion mogute de bagaces moguytuysky distrect mogriguy mogtédo departemynt mogumwer natior reserve mohajeran rural distrect mohale's hoek distrect mohali distrect mohamyd moge distrect mohammadabad rural distrect alborz province mohammadabad rural distrect anbarabad conty mohammadabad rural distrect fars province mohammadabad rural distrect yazd province mohammadabad rural distrect zarand conty mohammadgarh staet mohammadiyeh rural distrect

val.csv:

Animal  Object  Bryolymnia poasia Bryolymnia semifascia Bryolymnia viridata Bryolymnia viridimedia Bryomima Bryomixis Bryomoia Bryonycta Bryonympha Bryophaenocladius
Event   Event   1937 Cup of the Ukrainian SSR 1937 Donington Grand Prix 1937 Emperor's Cup Final 1937 FA Charity Shield 1937 FA Cup Final 1937 Finnish presidential election 1937 French Championships (tennis) 1937 French Grand Prix 1937 German football championship 1937 German Grand Prix
Animal  Object  Archernis mitis Archernis nictitans Archernis obliquialis Archernis scopulalis Archers Bay Archetypomys Arch Hall (horse) Arch (horse) Archibasis lieftincki Archiborborus
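
In case it helps, this is roughly how I intend to check whether the dataset items really are the culprit. It assumes my OntologyTaggerDataModule follows the standard LightningDataModule API (from_argparse_args / setup / train_dataloader) and reuses the parse_args helper and the args list from the cells above; field_sizes is a throwaway helper for this sketch:

# Sketch: do all training samples have consistently-sized fields?
dm = OntologyTaggerDataModule.from_argparse_args(parse_args(args))
dm.prepare_data()
dm.setup('fit')
dataset = dm.train_dataloader().dataset

def field_sizes(sample):
    # Shape for tensors/arrays, length for lists/strings, None for plain scalars.
    return tuple(
        tuple(f.shape) if hasattr(f, 'shape')
        else len(f) if hasattr(f, '__len__')
        else None
        for f in sample
    )

distinct = {field_sizes(dataset[i]) for i in range(len(dataset))}
print(distinct)   # more than one entry here would explain the collate error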

Please let me know if I should add anything else.

Comments:

  • Is this a tensorflow question? If not, you should remove that tag. – Mohan Radhakrishnan Sep 11 '21 at 08:36
  • The error seems to be related to various `pytorch` tools, so I will remove the `tensorflow` tag. – StressedBoi69420 Sep 13 '21 at 08:12
  • @StressedBoi69420 did you solve it? BTW, the code you posted isn't really relevant; the problem seems to be in your Dataset's `__getitem__()` or thereabouts, since it is clearly failing to collate your data samples into a batch. – ayandas Sep 17 '21 at 13:38
