
I was trying to reproduce this Hugging Face tutorial on T5-like span masked-language-modeling.

I have the following code in tokenizing_and_configing.py:

import datasets

from t5_tokenizer_model import SentencePieceUnigramTokenizer
from transformers import T5Config


vocab_size = 32_000
input_sentence_size = None

# Calculate the total number of samples in the dataset
total_samples = datasets.load_dataset(
    "nthngdy/oscar-mini", name="unshuffled_deduplicated_no", split="train"
).num_rows

# Calculate one thirtieth of the total samples
subset_samples = total_samples // 30

# Load one thirtieth of the dataset
dataset = datasets.load_dataset(
    "nthngdy/oscar-mini",
    name="unshuffled_deduplicated_no",
    split=f"train[:{subset_samples}]",
)

tokenizer = SentencePieceUnigramTokenizer(
    unk_token="<unk>", eos_token="</s>", pad_token="<pad>"
)


# Build an iterator over this dataset
def batch_iterator(input_sentence_size=None):
    if input_sentence_size is None:
        input_sentence_size = len(dataset)
    batch_length = 100
    for i in range(0, input_sentence_size, batch_length):
        yield dataset[i : i + batch_length]["text"]


print("Train Tokenizer")
# Train tokenizer
tokenizer.train_from_iterator(
    iterator=batch_iterator(input_sentence_size=input_sentence_size),
    vocab_size=vocab_size,
    show_progress=True,
)

# Save files to disk
tokenizer.save("./models/norwegian-t5-base/tokenizer.json")

print("DONE TOKENIZING ")

# CONFIG
config = T5Config.from_pretrained(
    "google/t5-v1_1-small",
    vocab_size=tokenizer.get_vocab_size()
    # "google/t5-v1_1-base", vocab_size=tokenizer.get_vocab_size()
)
config.save_pretrained("./models/norwegian-t5-base")

print("DONE SAVING TOKENIZER ")

The dependency can be found here:
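
(For context, my understanding is that t5_tokenizer_model.py is a thin wrapper around the tokenizers library that builds a SentencePiece-style Unigram tokenizer. It looks roughly like the sketch below; the actual file in the transformers Flax examples is more complete.)

from tokenizers import Tokenizer, normalizers, pre_tokenizers, trainers
from tokenizers.models import Unigram

# Rough approximation (not the actual t5_tokenizer_model.py) of what the
# SentencePieceUnigramTokenizer dependency sets up.
tok = Tokenizer(Unigram())
tok.normalizer = normalizers.NFKC()
tok.pre_tokenizer = pre_tokenizers.Metaspace()  # SentencePiece-style whitespace handling

trainer = trainers.UnigramTrainer(
    vocab_size=32_000,
    special_tokens=["<pad>", "</s>", "<unk>"],
    unk_token="<unk>",
)
# Training and saving would then mirror the script above, e.g.:
# tok.train_from_iterator(batch_iterator(), trainer=trainer)
# tok.save("./models/norwegian-t5-base/tokenizer.json")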

After tokenizing_and_configing.py completes, I run this command:

python run_t5_mlm_flax.py \
    --output_dir="./models/norwegian-t5-base" \
    --model_type="t5" \
    --config_name="./models/norwegian-t5-base" \
    --tokenizer_name="./models/norwegian-t5-base" \
    --dataset_name="nthngdy/oscar-mini" \
    --dataset_config_name="unshuffled_deduplicated_no" \
    --max_seq_length="512" \
    --per_device_train_batch_size="32" \
    --per_device_eval_batch_size="32" \
    --adafactor \
    --learning_rate="0.005" \
    --weight_decay="0.001" \
    --warmup_steps="2000" \
    --overwrite_output_dir \
    --logging_steps="500" \
    --save_steps="10000" \
    --eval_steps="2500" \
    --do_train \
    --do_eval
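
For context on how the step-based flags interact, this is the back-of-the-envelope calculation I have in mind; num_examples and num_epochs are made-up placeholders, only the batch size and save_steps come from the command above:

# Back-of-the-envelope sketch with placeholder numbers (num_examples and
# num_epochs are assumptions, not the real values).
num_examples = 100_000            # placeholder, not the real OSCAR subset size
per_device_batch_size = 32        # from --per_device_train_batch_size
num_devices = 1                   # a p3.2xlarge has a single V100 GPU
num_epochs = 3                    # assumed; depends on --num_train_epochs
save_steps = 10_000               # from --save_steps

steps_per_epoch = num_examples // (per_device_batch_size * num_devices)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)   # 3125 9375 with these placeholder numbers
print(total_steps >= save_steps)      # False: no step-based checkpoint would fire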

The full code for run_t5_mlm_flax.py can be found here.

But after run_t5_mlm_flax.py completes, I can only find these files in ./models/norwegian-t5-base:

.
└── norwegian-t5-base
    ├── config.json
    ├── events.out.tfevents.1680920382.ip-172-31-30-81.71782.0.v2
    ├── tokenizer.json
    └── eval_results.json

What's wrong with my process? I expect it to produce more files, like these:

  1. flax_model.msgpack: This file contains the weights of the trained Flax model.
  2. tokenizer_config.json: This file contains the tokenizer configuration, such as the vocabulary size and special tokens.
  3. training_args.bin: This file contains the training arguments used during training, such as the learning rate and batch size.
  4. merges.txt: This file is part of the tokenizer and contains the subword merges.
  5. vocab.json: This file is part of the tokenizer and contains the vocabulary mappings.
  6. train.log: Logs from the training process, including loss, learning rate, and other metrics.
  7. Checkpoint files: If you have enabled checkpoints during training, you will find checkpoint files containing the model weights at specific training steps.
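
(If I understand correctly, flax_model.msgpack and config.json come from a model.save_pretrained call, while merges.txt and vocab.json are BPE-tokenizer artifacts, so a Unigram tokenizer.json setup like this one would probably not produce them anyway. A sketch of the save call I mean, using the output directory from my config step:)

from transformers import FlaxT5ForConditionalGeneration, T5Config

# Sketch (my assumption about where the files come from, not taken from the
# training script): instantiating a Flax T5 model from the saved config and
# calling save_pretrained writes flax_model.msgpack alongside config.json.
output_dir = "./models/norwegian-t5-base"
config = T5Config.from_pretrained(output_dir)
model = FlaxT5ForConditionalGeneration(config)  # randomly initialized weights
model.save_pretrained(output_dir)               # -> flax_model.msgpack + config.json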

Additional note: I don't get any error messages at all; everything completes smoothly without interruption. I'm running on an Amazon AWS p3.2xlarge instance with CUDA 11.2 (cuda_11.2.r11.2/compiler.29618528_0).

littleworth
  • The script you are using only saves the model when the current step is divisible by your specified save_steps (`if cur_step % training_args.save_steps == 0 and cur_step > 0`) -> reduce the `save_steps` parameter or add another `model.save_pretrained` call to your script (maybe after each epoch?). – cronoik Apr 10 '23 at 20:41
  • @cronoik Many thanks. Reducing the `save_steps` parameter does the job. – littleworth Apr 11 '23 at 00:25
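
Update, based on cronoik's comment above: my paraphrase of the save condition, plus the kind of unconditional save one could add after the training loop. This is a sketch, not verbatim code from run_t5_mlm_flax.py; cur_step, state, model, tokenizer and training_args are the script's own training-loop variables.

import jax

# Paraphrased: the script only writes a checkpoint when cur_step is a positive
# multiple of training_args.save_steps.
if cur_step % training_args.save_steps == 0 and cur_step > 0:
    # Params are replicated across devices; take the copy from device 0.
    params = jax.device_get(jax.tree_util.tree_map(lambda x: x[0], state.params))
    model.save_pretrained(training_args.output_dir, params=params)

# Possible workaround (my addition): save unconditionally after the training
# loop so flax_model.msgpack is always written, even for short runs.
params = jax.device_get(jax.tree_util.tree_map(lambda x: x[0], state.params))
model.save_pretrained(training_args.output_dir, params=params)
tokenizer.save_pretrained(training_args.output_dir)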
