I am using JupyterLab on AWS SageMaker, with the conda_pytorch_latest_p36 kernel.
The traceback suggests a problem with a list object, possibly related to batch size, and most likely coming from my datasets.
When batch_size = 1, I instead get: RuntimeError: zero-dimensional tensor (at position 0) cannot be concatenated.
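As far as I can tell, both messages can be produced by PyTorch's default batch collation when the samples it receives don't line up. A minimal sketch of what I suspect is happening (toy values, not my actual samples; default_collate is the stock collate function DataLoader uses):

import torch
from torch.utils.data.dataloader import default_collate

# Samples containing Python lists of unequal length hit default_collate's
# sequence branch and raise the batch_size > 1 error:
default_collate([([1, 2, 3], 0), ([1, 2], 1)])
# RuntimeError: each element in list of batch should be of equal size

# Concatenating zero-dimensional (scalar) tensors raises the batch_size = 1
# message:
torch.cat([torch.tensor(1), torch.tensor(2)])
# RuntimeError: zero-dimensional tensor (at position 0) cannot be concatenated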
Called Code:
# re-train cell
args = """
--max_epochs 20
--progress_bar_refresh_rate 2
--gradient_clip_val 0.5
--log_gpu_memory True
--gpus 1
""".split()
run_training(args)
run_training():
import importlib
import os
import shutil
import tarfile
from argparse import ArgumentParser

import mlflow
import pytorch_lightning as pl


def run_training(input=None):
    args = parse_args(input)
    pl.seed_everything(args.seed)

    # Build the logger named on the command line, plus a CSV logger.
    module = importlib.import_module('pytorch_lightning.loggers')
    logger = getattr(module, args.logging)(save_dir='logs')
    csv_logger = pl.loggers.CSVLogger(save_dir=f'{args.modeldir}/csv_logs')
    loggers = [logger, csv_logger]

    dm = OntologyTaggerDataModule.from_argparse_args(args)

    # Load the model from a tarball, from a checkpoint file, or from scratch.
    if args.model_uri:
        local_model_uri = os.environ.get('SM_CHANNEL_MODEL', '.')
        tar_path = os.path.join(local_model_uri, 'model.tar.gz')
        tar = tarfile.open(tar_path, "r:gz")
        tar.extractall(local_model_uri)
        tar.close()
        model_path = os.path.join(local_model_uri, args.checkpointfile)
        model = OntologyTaggerModel.load_from_checkpoint(model_path)
    elif args.checkpointfile:
        file_path = os.path.join(args.traindir, args.checkpointfile)
        model = OntologyTaggerModel.load_from_checkpoint(file_path)
    else:
        model = OntologyTaggerModel(
            **vars(args), num_classes=dm.num_classes, class_map=dm.class_map
        )

    checkpoint_callback = pl.callbacks.ModelCheckpoint(
        args.checkpointdir, save_last=True, save_weights_only=True
    )

    # On SageMaker, copy the labels file next to the model artifacts.
    checkpoint_dir = os.environ.get('SM_CHANNEL_TRAINING', './')
    if checkpoint_dir != './':
        labels_file_orig = os.path.join(checkpoint_dir, args.labels)
        labels_file_cp = os.path.join(args.modeldir, os.path.basename(args.labels))
        shutil.copyfile(labels_file_orig, labels_file_cp)

    trainer = pl.Trainer.from_argparse_args(
        args, callbacks=[checkpoint_callback], logger=loggers
    )
    print('model', model)
    print('dm', dm)
    trainer.fit(model, dm)  # CRASH!

    model_file = os.path.join(args.modeldir, 'last.ckpt')
    trainer.save_checkpoint(model_file, weights_only=True)
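Judging by the default_collate frames at the bottom of the traceback, the DataLoaders that OntologyTaggerDataModule builds are using PyTorch's stock collation; I don't set a custom collate_fn anywhere.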
Traceback:
Global seed set to 42
Loading pretrained model bert-base-cased
Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForMulticlassSequenceClassification: ['cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertForMulticlassSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMulticlassSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForMulticlassSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifiers.1.bias', 'classifiers.0.weight', 'classifiers.1.weight', 'classifiers.0.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
/home/ec2-user/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/pytorch_lightning/utilities/distributed.py:52: UserWarning: ModelCheckpoint(save_last=True, monitor=None) is a redundant configuration. You can save the last checkpoint with ModelCheckpoint(save_top_k=None, monitor=None).
warnings.warn(*args, **kwargs)
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
## dm ## <__main__.OntologyTaggerDataModule object at 0x7f80883dcc18>
| Name | Type | Params
----------------------------------------------------------------------------
0 | model | BertForMulticlassSequenceClassification | 108 M
1 | valid_acc | Accuracy | 0
2 | valid_f1 | F1 | 0
3 | valid_acc_multi | ModuleList | 0
----------------------------------------------------------------------------
108 M Trainable params
0 Non-trainable params
108 M Total params
433.410 Total estimated model params size (MB)
/home/ec2-user/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/pytorch_lightning/utilities/distributed.py:52: UserWarning: The dataloader, val dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 8 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
warnings.warn(*args, **kwargs)
Validation sanity check: 0it [00:00, ?it/s]
###score: val_score### 0.11666667461395264
/home/ec2-user/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torchmetrics/utilities/prints.py:36: UserWarning: The ``compute`` method of metric Accuracy was called before the ``update`` method which may lead to errors, as metric states have not yet been updated.
warnings.warn(*args, **kwargs)
/home/ec2-user/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torchmetrics/utilities/prints.py:36: UserWarning: The ``compute`` method of metric F1 was called before the ``update`` method which may lead to errors, as metric states have not yet been updated.
warnings.warn(*args, **kwargs)
/home/ec2-user/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torchmetrics/utilities/prints.py:36: UserWarning: The ``compute`` method of metric CompositionalMetric was called before the ``update`` method which may lead to errors, as metric states have not yet been updated.
warnings.warn(*args, **kwargs)
/home/ec2-user/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/pytorch_lightning/utilities/distributed.py:52: UserWarning: The dataloader, train dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 8 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
warnings.warn(*args, **kwargs)
Training: 0it [00:00, ?it/s]
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-8-89b1728bd5a6> in <module>
7 --gpus 1
8 """.split()
----> 9 run_training(args)
<ipython-input-5-06912a89c08b> in run_training(input)
86 )
87 print("## dm ##", dm)
---> 88 trainer.fit(model, dm)
89 model_file = os.path.join(args.modeldir, 'last.ckpt')
90 trainer.save_checkpoint(model_file, weights_only=True)
~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py in fit(self, model, train_dataloader, val_dataloaders, datamodule)
497
498 # dispath `start_training` or `start_testing` or `start_predicting`
--> 499 self.dispatch()
500
501 # plugin will finalized fitting (e.g. ddp_spawn will load trained model)
~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py in dispatch(self)
544
545 else:
--> 546 self.accelerator.start_training(self)
547
548 def train_or_test_or_predict(self):
~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py in start_training(self, trainer)
71
72 def start_training(self, trainer):
---> 73 self.training_type_plugin.start_training(trainer)
74
75 def start_testing(self, trainer):
~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py in start_training(self, trainer)
112 def start_training(self, trainer: 'Trainer') -> None:
113 # double dispatch to initiate the training loop
--> 114 self._results = trainer.run_train()
115
116 def start_testing(self, trainer: 'Trainer') -> None:
~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py in run_train(self)
635 with self.profiler.profile("run_training_epoch"):
636 # run train epoch
--> 637 self.train_loop.run_training_epoch()
638
639 if self.max_steps and self.max_steps <= self.global_step:
~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py in run_training_epoch(self)
482 val_loop_called = False
483
--> 484 for batch_idx, (batch, is_last_batch) in train_dataloader:
485
486 self.trainer.batch_idx = batch_idx
~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/pytorch_lightning/profiler/profilers.py in profile_iterable(self, iterable, action_name)
80 try:
81 self.start(action_name)
---> 82 value = next(iterator)
83 self.stop(action_name)
84 yield value
~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/pytorch_lightning/trainer/connectors/data_connector.py in _with_is_last(self, iterable)
45 See `https://stackoverflow.com/a/1630350 <https://stackoverflow.com/a/1630350>`_"""
46 it = iter(iterable)
---> 47 last = next(it)
48 for val in it:
49 # yield last and has next
~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/pytorch_lightning/trainer/supporters.py in __next__(self)
468
469 """
--> 470 return self.request_next_batch(self.loader_iters)
471
472 @staticmethod
~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/pytorch_lightning/trainer/supporters.py in request_next_batch(loader_iters)
482
483 """
--> 484 return apply_to_collection(loader_iters, Iterator, next)
485
486 @staticmethod
~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/pytorch_lightning/utilities/apply_func.py in apply_to_collection(data, dtype, function, wrong_dtype, *args, **kwargs)
82 # Breaking condition
83 if isinstance(data, dtype) and (wrong_dtype is None or not isinstance(data, wrong_dtype)):
---> 84 return function(data, *args, **kwargs)
85
86 # Recursively apply to collection items
~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torch/utils/data/dataloader.py in __next__(self)
433 if self._sampler_iter is None:
434 self._reset()
--> 435 data = self._next_data()
436 self._num_yielded += 1
437 if self._dataset_kind == _DatasetKind.Iterable and \
~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torch/utils/data/dataloader.py in _next_data(self)
473 def _next_data(self):
474 index = self._next_index() # may raise StopIteration
--> 475 data = self._dataset_fetcher.fetch(index) # may raise StopIteration
476 if self._pin_memory:
477 data = _utils.pin_memory.pin_memory(data)
~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py in fetch(self, possibly_batched_index)
45 else:
46 data = self.dataset[possibly_batched_index]
---> 47 return self.collate_fn(data)
~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torch/utils/data/_utils/collate.py in default_collate(batch)
81 raise RuntimeError('each element in list of batch should be of equal size')
82 transposed = zip(*batch)
---> 83 return [default_collate(samples) for samples in transposed]
84
85 raise TypeError(default_collate_err_msg_format.format(elem_type))
~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torch/utils/data/_utils/collate.py in <listcomp>(.0)
81 raise RuntimeError('each element in list of batch should be of equal size')
82 transposed = zip(*batch)
---> 83 return [default_collate(samples) for samples in transposed]
84
85 raise TypeError(default_collate_err_msg_format.format(elem_type))
~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torch/utils/data/_utils/collate.py in default_collate(batch)
79 elem_size = len(next(it))
80 if not all(len(elem) == elem_size for elem in it):
---> 81 raise RuntimeError('each element in list of batch should be of equal size')
82 transposed = zip(*batch)
83 return [default_collate(samples) for samples in transposed]
RuntimeError: each element in list of batch should be of equal size
Datasets

All are tab-delimited .csv files.

classes.csv:
Activity Event
Actor Person
Agent Person
Album Product
Animal Object
ArchitecturalStructure Location
Artist Person
Athlete Person
AutomobileEngine Product
Award Object
Biomolecule Object
Bird Object
BodyOfWater Location
Building Location
Universities.csv:
University Country
A.T. Still University US
Aalborg Universitet DK
Aalto-yliopisto FI
train.csv:
Writer Person Petar Hektorović Petar Kanavelić Petar Kočić Petar Šegedin (writer) Petar Zoranić Pete Johnson (author) Pete Prown Peter Abrahams (American author) Peter-Adrian Cohen Peter Aleshkovsky
train_textcorrupted.csv (deliberate misspellings):
Person First Name mogu creek new souhth wales moguchinsky distrect mogud mogulo subregion mogute de bagaces moguytuysky distrect mogriguy mogtédo departemynt mogumwer natior reserve mohajeran rural distrect mohale's hoek distrect mohali distrect mohamyd moge distrect mohammadabad rural distrect alborz province mohammadabad rural distrect anbarabad conty mohammadabad rural distrect fars province mohammadabad rural distrect yazd province mohammadabad rural distrect zarand conty mohammadgarh staet mohammadiyeh rural distrect
val.csv:
Animal Object Bryolymnia poasia Bryolymnia semifascia Bryolymnia viridata Bryolymnia viridimedia Bryomima Bryomixis Bryomoia Bryonycta Bryonympha Bryophaenocladius
Event Event 1937 Cup of the Ukrainian SSR 1937 Donington Grand Prix 1937 Emperor's Cup Final 1937 FA Charity Shield 1937 FA Cup Final 1937 Finnish presidential election 1937 French Championships (tennis) 1937 French Grand Prix 1937 German football championship 1937 German Grand Prix
Animal Object Archernis mitis Archernis nictitans Archernis obliquialis Archernis scopulalis Archers Bay Archetypomys Arch Hall (horse) Arch (horse) Archibasis lieftincki Archiborborus
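Since the text fields vary so much in length, my current guess is that tokenization produces token-id sequences of unequal length, and the default collation can't stack them. If so, I assume the fix is a collate function that pads each batch before stacking, something like the sketch below (hypothetical: the (token_ids, label) sample layout is an assumption on my part, not something I've confirmed from OntologyTaggerDataModule):

import torch
from torch.nn.utils.rnn import pad_sequence

def pad_collate(batch):
    # Assumed sample layout: (token_ids, label). Padding with id 0 is also
    # an assumption (it matches bert-base-cased's [PAD] token).
    ids = [torch.as_tensor(sample[0]) for sample in batch]
    labels = torch.tensor([sample[1] for sample in batch])
    padded = pad_sequence(ids, batch_first=True, padding_value=0)
    return padded, labels

# The DataModule's loaders would then be built with, e.g.:
# DataLoader(dataset, batch_size=16, collate_fn=pad_collate)

Is a custom collate_fn like this the right direction, or is the problem elsewhere in how I prepare the datasets?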
Please let me know if I should add anything else.