I was trying to reproduce this Hugging Face tutorial on T5-like span masked language modeling. I have the following code in tokenizing_and_configing.py:
import datasets
from t5_tokenizer_model import SentencePieceUnigramTokenizer
from transformers import T5Config

vocab_size = 32_000
input_sentence_size = None

# Calculate the total number of samples in the dataset
total_samples = datasets.load_dataset(
    "nthngdy/oscar-mini", name="unshuffled_deduplicated_no", split="train"
).num_rows

# Calculate one thirtieth of the total samples
subset_samples = total_samples // 30

# Load one thirtieth of the dataset
dataset = datasets.load_dataset(
    "nthngdy/oscar-mini",
    name="unshuffled_deduplicated_no",
    split=f"train[:{subset_samples}]",
)

tokenizer = SentencePieceUnigramTokenizer(
    unk_token="<unk>", eos_token="</s>", pad_token="<pad>"
)

# Build an iterator over this dataset
def batch_iterator(input_sentence_size=None):
    if input_sentence_size is None:
        input_sentence_size = len(dataset)
    batch_length = 100
    for i in range(0, input_sentence_size, batch_length):
        yield dataset[i : i + batch_length]["text"]

print("Train Tokenizer")
# Train tokenizer
tokenizer.train_from_iterator(
    iterator=batch_iterator(input_sentence_size=input_sentence_size),
    vocab_size=vocab_size,
    show_progress=True,
)

# Save the tokenizer to disk
tokenizer.save("./models/norwegian-t5-base/tokenizer.json")
print("DONE TOKENIZING")

# Build the model config from the t5-v1_1-small config, with the new vocab size
config = T5Config.from_pretrained(
    "google/t5-v1_1-small", vocab_size=tokenizer.get_vocab_size()
    # "google/t5-v1_1-base", vocab_size=tokenizer.get_vocab_size()
)
config.save_pretrained("./models/norwegian-t5-base")
print("DONE SAVING CONFIG")
The t5_tokenizer_model dependency used above can be found here:
After tokenizing_and_configing.py completes, I run this command:
python run_t5_mlm_flax.py \
--output_dir="./models/norwegian-t5-base" \
--model_type="t5" \
--config_name="./models/norwegian-t5-base" \
--tokenizer_name="./models/norwegian-t5-base" \
--dataset_name="nthngdy/oscar-mini" \
--dataset_config_name="unshuffled_deduplicated_no" \
--max_seq_length="512" \
--per_device_train_batch_size="32" \
--per_device_eval_batch_size="32" \
--adafactor \
--learning_rate="0.005" \
--weight_decay="0.001" \
--warmup_steps="2000" \
--overwrite_output_dir \
--logging_steps="500" \
--save_steps="10000" \
--eval_steps="2500" \
--do_train \
--do_eval
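Since checkpoints are only written every --save_steps=10000 optimizer steps, I also made a rough estimate of how many steps one pass over the data gives with these settings (a back-of-envelope sketch; it ignores the script's text-grouping step and the validation split, and the single-device assumption matches the p3.2xlarge's one V100):

import datasets

# Ballpark number of optimizer steps per epoch, to compare against --save_steps=10000
num_rows = datasets.load_dataset(
    "nthngdy/oscar-mini", name="unshuffled_deduplicated_no", split="train"
).num_rows

per_device_batch = 32
num_devices = 1  # assumption: p3.2xlarge has a single V100 GPU

steps_per_epoch = num_rows // (per_device_batch * num_devices)
print(f"{num_rows} rows -> roughly {steps_per_epoch} steps per epoch")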
The full code for run_t5_mlm_flax.py can be found here.
But after run_t5_mlm_flax.py completes, I can only find these files in ./models/norwegian-t5-base:
.
└── norwegian-t5-base
    ├── config.json
    ├── events.out.tfevents.1680920382.ip-172-31-30-81.71782.0.v2
    ├── tokenizer.json
    └── eval_results.json
What's wrong with my process? I expected it to produce more files, like these (how I plan to load the result is sketched after the list):
- flax_model.msgpack: This file contains the weights of the trained Flax model.
- tokenizer_config.json: This file contains the tokenizer configuration, such as the special tokens and maximum sequence length.
- training_args.bin: This file contains the training arguments used during training, such as learning rate and batch size.
- merges.txt: This file is part of the tokenizer and contains the subword merges.
- vocab.json: This file is part of the tokenizer and contains the vocabulary mappings.
- train.log: Logs from the training process, including loss, learning rate, and other metrics.
- Checkpoint files: If you have enabled checkpoints during training, you will find checkpoint files containing the model weights at specific training steps.
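For reference, this is how I plan to load the trained model from the output directory once flax_model.msgpack exists (a minimal sketch; the Norwegian input sentence and the generation length are just placeholders):

from transformers import FlaxT5ForConditionalGeneration, PreTrainedTokenizerFast

model_dir = "./models/norwegian-t5-base"

# Loading the weights requires flax_model.msgpack next to config.json
model = FlaxT5ForConditionalGeneration.from_pretrained(model_dir)

tokenizer = PreTrainedTokenizerFast(
    tokenizer_file=f"{model_dir}/tokenizer.json",
    unk_token="<unk>",
    eos_token="</s>",
    pad_token="<pad>",
)

inputs = tokenizer("Dette er en test.", return_tensors="np")
generated = model.generate(inputs.input_ids, max_length=20)
print(tokenizer.decode(generated.sequences[0], skip_special_tokens=True))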
Additional note: I don't see any error messages at all; everything completes smoothly without interruption. I'm running on an Amazon AWS p3.2xlarge instance with CUDA 11.2 (cuda_11.2.r11.2/compiler.29618528_0).
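To rule out a silent CPU fallback on that instance, this is how I check which backend JAX actually uses (a quick sketch using standard JAX calls):

import jax

# Confirm that JAX sees the GPU rather than silently falling back to CPU
print("backend:", jax.default_backend())   # expected: "gpu"
print("devices:", jax.devices())
print("local device count:", jax.local_device_count())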