I am executing the following notebook:
The notebook is associated with the following YouTube video:
Updated | Near-Automated Voice Cloning | Whisper STT + Coqui TTS | Fine Tune a VITS Model on Colab
I am able to progress through the notebook successfully until the `trainer.fit()` step. At that point I get the following error:
```
OutOfMemoryError: CUDA out of memory. Tried to allocate 12.00 MiB (GPU 0; 14.75 GiB total capacity; 13.41 GiB already allocated; 2.81 MiB free; 13.69 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```
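Reading the numbers in that message closely shows why such a small (12 MiB) allocation fails. This short sketch just restates the figures from the error above:

```python
# Figures taken directly from the error message above.
GIB = 1024 ** 3
MIB = 1024 ** 2

total     = 14.75 * GIB   # GPU 0 total capacity
allocated = 13.41 * GIB   # memory backing live tensors
reserved  = 13.69 * GIB   # reserved by PyTorch's caching allocator
free      = 2.81 * MIB    # free on the device
request   = 12.00 * MIB   # size of the failed allocation

# Memory held by the caching allocator but not backing live tensors:
cached_but_unused = reserved - allocated
print(f"cached but unused: {cached_but_unused / MIB:.0f} MiB")  # 287 MiB
```

The request is small, but free device memory is even smaller, and the ~287 MiB of cached-but-unused memory is apparently fragmented into blocks too small to satisfy it, which is why the message suggests `max_split_size_mb`.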
Following the advice in "How to avoid 'CUDA out of memory' in PyTorch", I added the following code to the notebook:

```python
import torch

# Release unused cached blocks back to the GPU driver.
torch.cuda.empty_cache()
```
I also reduced my batch size all the way down to 1.
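For reference, in the Coqui TTS recipes the batch size usually lives on the training config object rather than on the DataLoader directly. A minimal sketch using a stand-in config (the field names `batch_size`/`eval_batch_size` follow Coqui's `VitsConfig`, but verify them against the notebook's actual config object):

```python
from types import SimpleNamespace

# Stand-in for the notebook's Coqui TTS training config (hypothetical;
# the real notebook builds a VitsConfig with many more fields).
config = SimpleNamespace(batch_size=26, eval_batch_size=16)

config.batch_size = 1       # smallest possible training batch
config.eval_batch_size = 1  # evaluation batches consume memory too
```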
I have also set

```
%env PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512
```

(note: without quotes, which otherwise end up as part of the variable name and value) before calling `trainer.fit()`.
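The same setting can also be made from plain Python rather than the `%env` magic. One caveat: `PYTORCH_CUDA_ALLOC_CONF` is read when the CUDA caching allocator initializes, so it must be set before the first CUDA allocation (safest: before `import torch`):

```python
import os

# Equivalent to the %env magic above; must run before the CUDA
# allocator is initialized for the setting to take effect.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"
```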
Finally, it should be noted that I am using a Colab Pro account.
Could anybody try out the above notebook, share their experience, and suggest how to complete the fine-tuning and obtain the resulting model?
PS - The following is a summary of my dataset:

```
> DataLoader initialization
| > Tokenizer:
| > add_blank: True
| > use_eos_bos: False
| > use_phonemes: True
| > phonemizer:
| > phoneme language: en-us
| > phoneme backend: espeak
| > Number of instances : 131
| > Preprocessing samples
| > Max text length: 280
| > Min text length: 30
| > Avg text length: 142.8320610687023
|
| > Max audio length: 389781.0
| > Min audio length: 103209.0
| > Avg audio length: 225843.31297709924
| > Num. instances discarded samples: 0
| > Batch group size: 256.
```
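The audio lengths in that summary are in samples. Assuming the dataset uses a 22050 Hz sample rate (typical for Coqui VITS recipes, but verify against the notebook's config), they convert to seconds as follows; long clips like the ~18 s maximum tend to dominate peak VRAM during training:

```python
SAMPLE_RATE = 22050  # assumption; check the notebook's audio config

# Lengths taken from the DataLoader summary above (in samples).
for label, samples in [("max", 389781.0), ("min", 103209.0), ("avg", 225843.31297709924)]:
    print(f"{label}: {samples / SAMPLE_RATE:.1f} s")
```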