festival 2.4: why do some voices not work with singing mode?

Question

voice_kal_diphone and voice_ral_diphone work correctly in singing mode (there's vocal output and the pitches are correct for the specified notes).

voice_cmu_us_ahw_cg and the other CMU voices do not work correctly--there's vocal output but the pitch is not changed according to the specified notes.

Is it possible to get correct output with the higher quality CMU voices?

The command line for working (pitch-affected) output is:

text2wave -mode singing -eval "(voice_kal_diphone)" -o song.wav song.xml

The command line for non-working (pitch-unaffected) output is:

text2wave -mode singing -eval "(voice_cmu_us_ahw_cg)" -o song.wav song.xml

Here's song.xml:

<?xml version="1.0"?>
<!DOCTYPE SINGING PUBLIC "-//SINGING//DTD SINGING mark up//EN" "Singing.v0_1.dtd" []>
<SINGING BPM="60">
 <PITCH NOTE="A4,C4,C4"><DURATION BEATS="0.3,0.3,0.3">nationwide</DURATION></PITCH>
 <PITCH NOTE="C4"><DURATION BEATS="0.3">is</DURATION></PITCH>
 <PITCH NOTE="D4"><DURATION BEATS="0.3">on</DURATION></PITCH>
 <PITCH NOTE="F4"><DURATION BEATS="0.3">your</DURATION></PITCH>
 <PITCH NOTE="F4"><DURATION BEATS="0.3">side</DURATION></PITCH>
</SINGING>

You may also need this patch to singing-mode.scm:

@@ -339,7 +339,9 @@
 (defvar singing-max-short-vowel-length 0.11)

 (define (singing_do_initial utt token)
-  (if (equal? (item.name token) "")
+  (if (and
+        (not (equal? nil token))
+        (equal? (item.name token) ""))
       (let ((restlen (car (item.feat token 'rest))))
         (if singing-debug
             (format t "restlen %l\n" restlen))

To set up my environment I used the festvox fest_build script. You can also download voice_cmu_us_ahw_cg separately.

Have you built your own voice? "voice_cmu_us_ahw_cg" is not available on current voice list. If it is a community voice, then it may be still in beta stage. — Kiran Shakya, Dec 11 '15 at 10:54
@kiran: I used the "fest_build" script from festvox.org: http://festvox.org/fest_build and the specific voice is available here: http://festvox.org/packed/festival/2.4/voices/festvox_cmu_us_ahw_cg.tar.gz — Beau, Dec 11 '15 at 17:16
"This software doesn't do what I want" isn't really a programming question, but I can't flag it as off-topic because of the bounty. Unless you are trying to write code to fix it, this seems like a general software question. — TessellatingHeckler, Dec 11 '15 at 17:23
@TessellatingHeckler: I think there is a code-related reason for why it doesn't work but I don't have the expertise and hoped there was someone more familiar with festival who could point out where/what needed to change. — Beau, Dec 11 '15 at 17:44

avtomaton · Accepted Answer · 2015-12-11T22:17:47.783

The problem seems to be in how each voice generates its phones.

voice_kal_diphone uses the UniSyn synthesis model, while voice_cmu_us_ahw_cg uses the ClusterGen model. ClusterGen has its own state-based intonation and duration model instead of phone-level intonation and duration. You may have noticed that the durations in the generated 'song' did not change either.

singing-mode.scm tries to take each syllable and modify its frequency. With a ClusterGen voice, the waveform generator simply ignores the syllable frequencies and durations set in Target, because it models intonation and duration differently.

As a result, the ClusterGen voice sounds better (it is based on a statistical model), but its frequency cannot be changed directly.

A very good description of the generation pipeline can be found here.
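
To make the per-syllable targets concrete, here is a rough Python sketch (not Festival code) that converts the NOTE and BEATS values from the question's song.xml into frequencies and durations. It assumes standard equal temperament with A4 = 440 Hz and that one beat lasts 60/BPM seconds; singing-mode.scm may compute these differently. The function names and the "#"/"b" accidental syntax are this sketch's own choices.

```python
import re

# Semitone offsets of the natural notes above C within one octave.
SEMITONES = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}
# Accidental syntax is an assumption of this sketch, not taken from Festival.
ACCIDENTALS = {"": 0, "#": 1, "b": -1}


def note_to_hz(note, a4_hz=440.0):
    """Convert a note name such as "C4" to Hz (equal temperament, assumed)."""
    match = re.fullmatch(r"([A-G])([#b]?)(-?\d+)", note)
    if match is None:
        raise ValueError(f"unrecognised note name: {note!r}")
    letter, accidental, octave = match.groups()
    # MIDI-style numbering: C4 is 60 and A4 is 69.
    number = 12 * (int(octave) + 1) + SEMITONES[letter] + ACCIDENTALS[accidental]
    return a4_hz * 2.0 ** ((number - 69) / 12.0)


def beats_to_seconds(beats, bpm):
    """Length of a note in seconds, assuming one beat lasts 60/bpm seconds."""
    return beats * 60.0 / bpm


# (NOTE, BEATS) pairs copied from song.xml in the question; BPM="60".
SONG = [("A4", 0.3), ("C4", 0.3), ("C4", 0.3), ("C4", 0.3),
        ("D4", 0.3), ("F4", 0.3), ("F4", 0.3)]
BPM = 60


def targets(song, bpm):
    """Return (note, frequency in Hz, duration in seconds) for each note."""
    return [(name, note_to_hz(name), beats_to_seconds(beats, bpm))
            for name, beats in song]


if __name__ == "__main__":
    for name, hz, seconds in targets(SONG, BPM):
        print(f"{name}: {hz:.2f} Hz for {seconds:.2f} s")
```

With a diphone voice the synthesized pitch should roughly track targets like these. With a ClusterGen voice, per the explanation above, they are set but never reach the waveform generator.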
