I want to convert a speech file to a text file on Linux. While researching, I found the pocketsphinx
speech recognition tool. Following this post, I ran the command below:
pocketsphinx_continuous -infile file.wav
file.wav is 16-bit, 16 kHz, mono.
The output is:
INFO: pocketsphinx.c(152): Parsed model-specific feature parameters from /usr/local/share/pocketsphinx/model/en-us/en-us/feat.params
Current configuration:
[NAME] [DEFLT] [VALUE]
-agc none none
-agcthresh 2.0 2.000000e+00
-allphone
-allphone_ci no no
-alpha 0.97 9.700000e-01
-ascale 20.0 2.000000e+01
-aw 1 1
-backtrace no no
-beam 1e-48 1.000000e-48
-bestpath yes yes
-bestpathlw 9.5 9.500000e+00
-ceplen 13 13
-cmn current current
-cmninit 8.0 40,3,-1
-compallsen no no
-debug 0
-dict /usr/local/share/pocketsphinx/model/en-us/cmudict-en-us.dict
-dictcase no no
-dither no no
-doublebw no no
-ds 1 1
-fdict
-feat 1s_c_d_dd 1s_c_d_dd
-featparams
-fillprob 1e-8 1.000000e-08
-frate 100 100
-fsg
-fsgusealtpron yes yes
-fsgusefiller yes yes
-fwdflat yes yes
-fwdflatbeam 1e-64 1.000000e-64
-fwdflatefwid 4 4
-fwdflatlw 8.5 8.500000e+00
-fwdflatsfwin 25 25
-fwdflatwbeam 7e-29 7.000000e-29
-fwdtree yes yes
-hmm /usr/local/share/pocketsphinx/model/en-us/en-us
-input_endian little little
-jsgf
-keyphrase
-kws
-kws_delay 10 10
-kws_plp 1e-1 1.000000e-01
-kws_threshold 1 1.000000e+00
-latsize 5000 5000
-lda
-ldadim 0 0
-lifter 0 22
-lm /usr/local/share/pocketsphinx/model/en-us/en-us.lm.bin
-lmctl
-lmname
-logbase 1.0001 1.000100e+00
-logfn
-logspec no no
-lowerf 133.33334 1.300000e+02
-lpbeam 1e-40 1.000000e-40
-lponlybeam 7e-29 7.000000e-29
-lw 6.5 6.500000e+00
-maxhmmpf 30000 30000
-maxwpf -1 -1
-mdef
-mean
-mfclogdir
-min_endfr 0 0
-mixw
-mixwfloor 0.0000001 1.000000e-07
-mllr
-mmap yes yes
-ncep 13 13
-nfft 512 512
-nfilt 40 25
-nwpen 1.0 1.000000e+00
-pbeam 1e-48 1.000000e-48
-pip 1.0 1.000000e+00
-pl_beam 1e-10 1.000000e-10
-pl_pbeam 1e-10 1.000000e-10
-pl_pip 1.0 1.000000e+00
-pl_weight 3.0 3.000000e+00
-pl_window 5 5
-rawlogdir
-remove_dc no no
-remove_noise yes yes
-remove_silence yes yes
-round_filters yes yes
-samprate 16000 1.600000e+04
-seed -1 -1
-sendump
-senlogdir
-senmgau
-silprob 0.005 5.000000e-03
-smoothspec no no
-svspec 0-12/13-25/26-38
-tmat
-tmatfloor 0.0001 1.000000e-04
-topn 4 4
-topn_beam 0 0
-toprule
-transform legacy dct
-unit_area yes yes
-upperf 6855.4976 6.800000e+03
-uw 1.0 1.000000e+00
-vad_postspeech 50 50
-vad_prespeech 20 20
-vad_startspeech 10 10
-vad_threshold 2.0 2.000000e+00
-var
-varfloor 0.0001 1.000000e-04
-varnorm no no
-verbose no no
-warp_params
-warp_type inverse_linear inverse_linear
-wbeam 7e-29 7.000000e-29
-wip 0.65 6.500000e-01
-wlen 0.025625 2.562500e-02
INFO: feat.c(715): Initializing feature stream to type: '1s_c_d_dd', ceplen=13, CMN='current', VARNORM='no', AGC='none'
INFO: cmn.c(143): mean[0]= 12.00, mean[1..12]= 0.0
INFO: acmod.c(164): Using subvector specification 0-12/13-25/26-38
INFO: mdef.c(518): Reading model definition: /usr/local/share/pocketsphinx/model/en-us/en-us/mdef
INFO: mdef.c(531): Found byte-order mark BMDF, assuming this is a binary mdef file
INFO: bin_mdef.c(336): Reading binary model definition: /usr/local/share/pocketsphinx/model/en-us/en-us/mdef
INFO: bin_mdef.c(516): 42 CI-phone, 137053 CD-phone, 3 emitstate/phone, 126 CI-sen, 5126 Sen, 29324 Sen-Seq
INFO: tmat.c(206): Reading HMM transition probability matrices: /usr/local/share/pocketsphinx/model/en-us/en-us/transition_matrices
INFO: acmod.c(117): Attempting to use PTM computation module
INFO: ms_gauden.c(198): Reading mixture gaussian parameter: /usr/local/share/pocketsphinx/model/en-us/en-us/means
INFO: ms_gauden.c(292): 42 codebook, 3 feature, size:
INFO: ms_gauden.c(294): 128x13
INFO: ms_gauden.c(294): 128x13
INFO: ms_gauden.c(294): 128x13
INFO: ms_gauden.c(198): Reading mixture gaussian parameter: /usr/local/share/pocketsphinx/model/en-us/en-us/variances
INFO: ms_gauden.c(292): 42 codebook, 3 feature, size:
INFO: ms_gauden.c(294): 128x13
INFO: ms_gauden.c(294): 128x13
INFO: ms_gauden.c(294): 128x13
INFO: ms_gauden.c(354): 222 variance values floored
INFO: ptm_mgau.c(476): Loading senones from dump file /usr/local/share/pocketsphinx/model/en-us/en-us/sendump
INFO: ptm_mgau.c(500): BEGIN FILE FORMAT DESCRIPTION
INFO: ptm_mgau.c(563): Rows: 128, Columns: 5126
INFO: ptm_mgau.c(595): Using memory-mapped I/O for senones
INFO: ptm_mgau.c(835): Maximum top-N: 4
INFO: phone_loop_search.c(114): State beam -225 Phone exit beam -225 Insertion penalty 0
INFO: dict.c(320): Allocating 138623 * 20 bytes (2707 KiB) for word entries
INFO: dict.c(333): Reading main dictionary: /usr/local/share/pocketsphinx/model/en-us/cmudict-en-us.dict
INFO: dict.c(213): Allocated 1014 KiB for strings, 1677 KiB for phones
INFO: dict.c(336): 134522 words read
INFO: dict.c(358): Reading filler dictionary: /usr/local/share/pocketsphinx/model/en-us/en-us/noisedict
INFO: dict.c(213): Allocated 0 KiB for strings, 0 KiB for phones
INFO: dict.c(361): 5 words read
INFO: dict2pid.c(396): Building PID tables for dictionary
INFO: dict2pid.c(406): Allocating 42^3 * 2 bytes (144 KiB) for word-initial triphones
INFO: dict2pid.c(132): Allocated 21336 bytes (20 KiB) for word-final triphones
INFO: dict2pid.c(196): Allocated 21336 bytes (20 KiB) for single-phone word triphones
INFO: ngram_model_trie.c(347): Trying to read LM in trie binary format
INFO: ngram_search_fwdtree.c(99): 790 unique initial diphones
INFO: ngram_search_fwdtree.c(148): 0 root, 0 non-root channels, 57 single-phone words
INFO: ngram_search_fwdtree.c(186): Creating search tree
INFO: ngram_search_fwdtree.c(192): before: 0 root, 0 non-root channels, 57 single-phone words
INFO: ngram_search_fwdtree.c(326): after: max nonroot chan increased to 152144
INFO: ngram_search_fwdtree.c(339): after: 722 root, 152016 non-root channels, 53 single-phone words
INFO: ngram_search_fwdflat.c(157): fwdflat: min_ef_width = 4, max_sf_win = 25
INFO: continuous.c(307): pocketsphinx_continuous COMPILED ON: Mar 30 2016, AT: 12:15:51
INFO: ngram_search.c(467): Resized score stack to 200000 entries
INFO: ngram_search.c(459): Resized backpointer table to 10000 entries
INFO: cmn_prior.c(99): cmn_prior_update: from < 0.00 0.00 -nan 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 >
INFO: cmn_prior.c(116): cmn_prior_update: to < 47.48 0.98 -3.82 4.09 3.14 2.74 2.80 -12.69 -12.23 -5.07 6.26 -9.97 -3.79 >
INFO: ngram_search.c(467): Resized score stack to 400000 entries
INFO: ngram_search.c(459): Resized backpointer table to 20000 entries
INFO: cmn_prior.c(131): cmn_prior_update: from < 47.48 0.98 -3.82 4.09 3.14 2.74 2.80 -12.69 -12.23 -5.07 6.26 -9.97 -3.79 >
INFO: cmn_prior.c(149): cmn_prior_update: to < 46.64 -1.32 -1.28 2.40 -0.03 2.35 3.75 -12.45 -10.69 -4.19 5.41 -9.31 -3.67 >
INFO: ngram_search_fwdtree.c(1553): 10113 words recognized (10/fr)
INFO: ngram_search_fwdtree.c(1555): 826921 senones evaluated (857/fr)
INFO: ngram_search_fwdtree.c(1559): 2018242 channels searched (2091/fr), 83185 1st, 454283 last
INFO: ngram_search_fwdtree.c(1562): 19973 words for which last channels evaluated (20/fr)
INFO: ngram_search_fwdtree.c(1564): 157044 candidate words for entering last phone (162/fr)
INFO: ngram_search_fwdtree.c(1567): fwdtree 1.83 CPU 0.190 xRT
INFO: ngram_search_fwdtree.c(1570): fwdtree 1.84 wall 0.190 xRT
INFO: ngram_search_fwdflat.c(302): Utterance vocabulary contains 146 words
INFO: ngram_search_fwdflat.c(948): 6130 words recognized (6/fr)
INFO: ngram_search_fwdflat.c(950): 550472 senones evaluated (570/fr)
INFO: ngram_search_fwdflat.c(952): 702021 channels searched (727/fr)
INFO: ngram_search_fwdflat.c(954): 34182 words searched (35/fr)
INFO: ngram_search_fwdflat.c(957): 5373 word transitions (5/fr)
INFO: ngram_search_fwdflat.c(960): fwdflat 0.69 CPU 0.072 xRT
INFO: ngram_search_fwdflat.c(963): fwdflat 0.69 wall 0.072 xRT
INFO: ngram_search.c(1200): </s> not found in last frame, using watch.963 instead
INFO: ngram_search.c(1253): lattice start node <s>.0 end node watch.3
INFO: ngram_search.c(1279): Eliminated 498 nodes before end node
INFO: ngram_search.c(1384): Lattice has 634 nodes, 1 links
INFO: ps_lattice.c(1380): Bestpath score: -917
INFO: ps_lattice.c(1384): Normalizer P(O) = alpha(watch:3:963) = -1854347
INFO: ps_lattice.c(1441): Joint P(O,S) = -1854347 P(S|O) = 0
INFO: ngram_search.c(875): bestpath 0.00 CPU 0.000 xRT
INFO: ngram_search.c(878): bestpath 0.00 wall 0.000 xRT
watch
INFO: cmn_prior.c(131): cmn_prior_update: from < 46.64 -1.32 -1.28 2.40 -0.03 2.35 3.75 -12.45 -10.69 -4.19 5.41 -9.31 -3.67 >
INFO: cmn_prior.c(149): cmn_prior_update: to < 46.64 -1.32 -1.28 2.40 -0.03 2.35 3.75 -12.45 -10.69 -4.19 5.41 -9.31 -3.67 >
INFO: ngram_search_fwdflat.c(302): Utterance vocabulary contains 0 words
INFO: ngram_search_fwdtree.c(432): TOTAL fwdtree 1.83 CPU 0.190 xRT
INFO: ngram_search_fwdtree.c(435): TOTAL fwdtree 1.84 wall 0.190 xRT
INFO: ngram_search_fwdflat.c(176): TOTAL fwdflat 0.69 CPU 0.072 xRT
INFO: ngram_search_fwdflat.c(179): TOTAL fwdflat 0.69 wall 0.072 xRT
INFO: ngram_search.c(303): TOTAL bestpath 0.00 CPU 0.000 xRT
INFO: ngram_search.c(306): TOTAL bestpath 0.00 wall 0.000 xRT
From this output, I think the tool is reading file.wav, but the text is not recognized because the words are missing from the dict file. I am not sure whether this reasoning is correct.
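One way to test the missing-words idea is to look the expected words up in the dictionary named in the log. The sketch below is only a rough check. It assumes the .dict file is plain text with one entry per line (the word first, then its phones) and that alternate pronunciations are written as word(2), word(3), and so on. The function names and the word list are hypothetical placeholders, not part of pocketsphinx.

```python
from pathlib import Path


def load_dict_words(dict_path):
    """Return the set of base words found in a pocketsphinx-style dictionary.

    Assumption: each non-empty line is "<word> <phone> <phone> ...".
    """
    words = set()
    with open(dict_path, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            fields = line.split()
            if not fields:
                continue  # skip blank lines
            # Assumption: alternates look like word(2); keep only the base word.
            words.add(fields[0].split("(")[0].lower())
    return words


def missing_words(dict_path, expected_words):
    """Return the expected words that have no entry in the dictionary."""
    known = load_dict_words(dict_path)
    return [w for w in expected_words if w.lower() not in known]


if __name__ == "__main__":
    # Path copied from the -dict line of the log above.
    dict_path = "/usr/local/share/pocketsphinx/model/en-us/cmudict-en-us.dict"
    # Placeholder words; replace with words actually spoken in file.wav.
    expected = ["watch", "brain"]
    if Path(dict_path).exists():
        print("Not in dictionary:", missing_words(dict_path, expected))
```

If the list comes back empty, the words are in the dictionary and the cause of the poor recognition is probably elsewhere.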
I also tried a different acoustic model and language model:
pocketsphinx_continuous -infile brain_mono_8000.wav -hmm en_US/hub4wsj_sc_8k -lm en_US/hub4.5000.DMP
I got this error:
ERROR: "pocketsphinx.c", line 223: Failed to find mdef file inside the model folder specified with -hmm `en_US/hub4wsj_sc_8k'
I am stuck on this error. Why does it occur? And how can I modify the .dict file to suit my requirements? Any suggestions are welcome.
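As a first diagnostic for the mdef error, the sketch below checks whether the directory passed to -hmm exists relative to the current working directory and whether it contains the files the working en-us model has. The file list is taken from the paths in the log above. It is an assumption that another model would ship the same set, and sendump is left out because I am not sure every model includes it. The function name is hypothetical.

```python
import os

# File names seen under .../model/en-us/en-us/ in the log above.
# Assumption: other acoustic models use the same names.
EXPECTED_FILES = [
    "mdef",
    "feat.params",
    "means",
    "variances",
    "transition_matrices",
    "noisedict",
]


def missing_model_files(model_dir, expected=EXPECTED_FILES):
    """Return the expected files absent from model_dir, or None if model_dir
    is not a directory (for example, a relative path resolved from the wrong
    working directory)."""
    if not os.path.isdir(model_dir):
        return None
    present = set(os.listdir(model_dir))
    return [name for name in expected if name not in present]


if __name__ == "__main__":
    model_dir = "en_US/hub4wsj_sc_8k"  # value passed to -hmm in the failing command
    result = missing_model_files(model_dir)
    if result is None:
        print(f"{model_dir!r} is not a directory relative to {os.getcwd()}")
    elif result:
        print("Missing from model directory:", ", ".join(result))
    else:
        print("All expected model files are present.")
```

Because en_US/hub4wsj_sc_8k is a relative path, the result depends on the directory the command is run from. Passing an absolute path to -hmm would remove that ambiguity.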