I am running the AudioLM implementation from GitHub and am getting an error with the following code:
audiolm = AudioLM(
wav2vec = wav2vec,
codec = soundstream,
semantic_transformer = semantic_transformer,
coarse_transformer = coarse_transformer,
fine_transformer = fine_transformer
)
text = "The sound of a violin playing a sad melody"
generated_wav = audiolm(text=text, batch_size=1)
I have tried changing the dimensions in the transformers, but the issue persists. Here is how the transformers are defined:
fine_transformer = FineTransformer(
    num_coarse_quantizers = 3,
    num_fine_quantizers = 5,
    codebook_size = 1024,
    dim = 1024,
    depth = 6,
    audio_text_condition = True  # must be True (same for SemanticTransformer and CoarseTransformer)
)

coarse_transformer = CoarseTransformer(
    num_semantic_tokens = wav2vec.codebook_size,
    codebook_size = 1024,
    num_coarse_quantizers = 3,
    dim = 1024,
    depth = 6,
    audio_text_condition = True  # must be True (same for SemanticTransformer and FineTransformer)
)

semantic_transformer = SemanticTransformer(
    num_semantic_tokens = wav2vec.codebook_size,
    dim = 1024,
    depth = 6,
    audio_text_condition = True  # must be True (same for CoarseTransformer and FineTransformer)
).cuda()
but I still get the following error:
AssertionError: you had specified a conditioning dimension of 1024, yet what was received by the transformer has dimension of 768
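If it helps, my current understanding (which may be wrong) is that the 768 comes from the text encoder's embedding size: a T5-base-sized text encoder produces 768-dimensional embeddings, while I configured the transformers with dim = 1024, so the conditioning check fails. A toy sketch of what I think the mismatch looks like (plain NumPy, not the actual library internals; the projection step is hypothetical, just to illustrate one way the two sizes could be reconciled):

```python
import numpy as np

# Illustration only, not audiolm-pytorch internals:
# the transformer expects 1024-dim conditioning vectors,
# but the text encoder emits 768-dim embeddings.
cond_dim_expected = 1024
text_embed = np.random.randn(1, 768)  # shape a T5-base-sized encoder would return

# this shape mismatch is what the AssertionError is complaining about
assert text_embed.shape[-1] != cond_dim_expected

# one generic way to reconcile the sizes is a learned linear projection
projection = np.random.randn(768, cond_dim_expected)  # hypothetical, for illustration
projected = text_embed @ projection
print(projected.shape)  # (1, 1024)
```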