Sphinx: Installing pocketsphinx and sphinxtrain on the OrangePi 4B

This post is the second in a series about generating subtitles for videos on the OrangePi 4B running Armbian. We’re not yet sure what we’ll use to generate those subtitles, so for the time being we’ll use Sphinx; in the following months, we’ll try to set up something that takes advantage of the NPU.

Today, we’ll install PocketSphinx and SphinxTrain, then adapt an existing model before testing it.



git clone https://github.com/cmusphinx/pocketsphinx.git
cd pocketsphinx/
config.status: executing depfiles commands
config.status: executing libtool commands
Now type `make' to compile the package.

Bingo! Time for a good old

make -j6
make check
Testsuite summary for pocketsphinx 5prealpha
# TOTAL: 5
# PASS:  5
# SKIP:  0
# XFAIL: 0
# FAIL:  0
# XPASS: 0
# ERROR: 0
make[4]: Leaving directory '/home/poddingue/pocketsphinx/test/regression'

All went fine. Let’s install it then:

sudo -E make install

Done. Next step: install SphinxTrain.


git clone https://github.com/cmusphinx/sphinxtrain.git
cd sphinxtrain/
make -j6
make check
sudo -E make install
mkdir en-us
poddingue@orangepi4-armbian:~$ cd !$

wget https://iweb.dl.sourceforge.net/project/cmusphinx/Acoustic%20and%20Language%20Models/US%20English/cmusphinx-en-us-5.2.tar.gz

tar -xvzf cmusphinx-en-us-5.2.tar.gz cmusphinx-en-us-5.2/
poddingue@orangepi4-armbian:~/en-us$ file cmusphinx-en-us-5.2/mdef
cmusphinx-en-us-5.2/mdef: ASCII text

Now that SphinxTrain is installed and the model downloaded, let’s tackle the samples. Not ours. Not yet. Samples that do work.

Creating an adaptation corpus

As stated in this tutorial, you will find files needed to create an adaptation corpus there: http://festvox.org/cmu_arctic/packed/.

Getting the audio and transcription files to adapt the acoustic model

wget http://festvox.org/cmu_arctic/packed/cmu_us_bdl_arctic.tar.bz2
wget http://festvox.org/cmu_arctic/packed/cmu_us_rms_arctic.tar.bz2
wget http://festvox.org/cmu_arctic/packed/cmu_us_clb_arctic.tar.bz2
bunzip2 cmu_us_bdl_arctic.tar.bz2
bunzip2 cmu_us_clb_arctic.tar.bz2
bunzip2 cmu_us_rms_arctic.tar.bz2
tar -xvf cmu_us_bdl_arctic.tar
tar -xvf cmu_us_clb_arctic.tar
tar -xvf cmu_us_rms_arctic.tar

As Sphinx does not want any punctuation or capital letters in the transcriptions, we have to clean them up. Why not with this short Python script?

import os
import sys

# The only argument is the path to a festvox txt.done.data file.
if len(sys.argv) < 2:
    print("Usage: python3 config-generation.py filename")
    sys.exit(1)

num = ""

with open(sys.argv[1], "r") as fichier, \
        open("training.transcription", "w") as newf, \
        open("training.fileids", "w") as newt:
    for line in fichier:
        # Input lines look like: ( arctic_a0001 "Author of the danger trail, Philip Steels, etc." )
        num = line[2:14]
        # Strip the closing parenthesis, the quotes and the punctuation.
        nline = line[15:].replace(" )","").replace("\"","").replace(",","").replace(".","").replace(";","").replace("--","").replace("\n","")
        newf.write("<s> {line} </s> ({nom})\n".format(line=nline.lower(),nom=num))
        # One utterance name per line, in the same order as the transcription.
        newt.write(num + "\n")

# Both files are renamed after the last utterance read (arctic_b0539 here).
os.rename("training.transcription", num+".transcription")
os.rename("training.fileids", num+".fileids")

Save it as config-generation.py and run it on the bdl transcriptions:

python3 config-generation.py cmu_us_bdl_arctic/etc/txt.done.data

So I now have (trainingTitle.fileids is a leftover from an earlier version of the script):

-rw-r--r--  1 poddingue poddingue     76958 Jun  8 13:10  txt.done.data
-rw-r--r--  1 poddingue poddingue     14703 Jun  8 13:15  trainingTitle.fileids
-rw-r--r--  1 poddingue poddingue       688 Jun  8 13:18  config-generation.py
drwxr-xr-x 13 poddingue poddingue      4096 Jun  8 13:18  ..
-rw-r--r--  1 poddingue poddingue     81004 Jun  8 13:18  arctic_b0539.transcription
-rw-r--r--  1 poddingue poddingue     14703 Jun  8 13:18  arctic_b0539.fileids

In the .fileids file, we’ll find the names of the audio files, without their extension:

head arctic_b0539.fileids

And in the .transcription, we’ll find the text corresponding to the pronounced sentence:

head arctic_b0539.transcription
<s> author of the danger trail philip steels etc </s> (arctic_a0001)
<s> not at this particular case tom apologized whittemore </s> (arctic_a0002)
<s> for the twentieth time that evening the two men shook hands </s> (arctic_a0003)
<s> lord but i'm glad to see you again phil </s> (arctic_a0004)
<s> will we ever forget it </s> (arctic_a0005)
<s> god bless 'em i hope i'll go on seeing them forever </s> (arctic_a0006)
<s> and you always want to see it in the superlative degree </s> (arctic_a0007)
<s> gad your letter came just in time </s> (arctic_a0008)
<s> he turned sharply and faced gregson across the table </s> (arctic_a0009)
<s> i'm playing a single hand in what looks like a losing game </s> (arctic_a0010)
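Since the adaptation tools read the .fileids and .transcription files side by side, it may be worth checking that they agree before going further. The following is only a minimal sketch: the function name check_corpus and its wav_dir parameter are made up for this post, and the line format is assumed to be exactly the one shown above.

```python
import re
from pathlib import Path

# Matches lines such as: <s> will we ever forget it </s> (arctic_a0005)
LINE_RE = re.compile(r"^<s> (.*) </s> \((\S+)\)$")


def check_corpus(fileids_path, transcription_path, wav_dir=None):
    """Return a list of human-readable problems (empty if none were found)."""
    ids = Path(fileids_path).read_text().split()
    lines = Path(transcription_path).read_text().splitlines()
    problems = []
    if len(ids) != len(lines):
        problems.append(f"{len(ids)} ids but {len(lines)} transcription lines")
    for file_id, line in zip(ids, lines):
        match = LINE_RE.match(line)
        if match is None:
            problems.append(f"malformed line for {file_id}: {line!r}")
        elif match.group(2) != file_id:
            problems.append(f"order mismatch: {file_id} vs {match.group(2)}")
        # The audio check is optional: it only runs when wav_dir is given.
        if wav_dir is not None and not (Path(wav_dir) / f"{file_id}.wav").is_file():
            problems.append(f"missing audio: {file_id}.wav")
    return problems


if __name__ == "__main__":
    if Path("arctic_b0539.fileids").is_file():
        for problem in check_corpus("arctic_b0539.fileids",
                                    "arctic_b0539.transcription"):
            print(problem)
```

Once the wav files sit in the working directory, passing wav_dir="." also reports any utterance without audio.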

Get the sample acoustic model

Let’s copy the dictionary and the language model for testing:

cp -a /usr/local/share/pocketsphinx/model/en-us/cmudict-en-us.dict .
cp -a /usr/local/share/pocketsphinx/model/en-us/en-us.lm.bin .

Let’s copy the wav files in the current directory:

cp -vraxu cmu_us_bdl_arctic/wav/arctic_* .

Now let’s add the SphinxTrain binaries directory to the PATH:

export PATH=/usr/local/libexec/sphinxtrain:$PATH
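A quick way to confirm that the tools used in the rest of this post are reachable is to ask Python where it finds them. This is just a sketch: missing_tools is a hypothetical helper, and it has to be run from the same shell as the export above, since it reads that shell’s PATH.

```python
import shutil


def missing_tools(names):
    """Return the tools from names that cannot be found on the PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    # The three commands used later in this post.
    missing = missing_tools(["sphinx_fe", "bw", "pocketsphinx_continuous"])
    print("Missing tools:", ", ".join(missing) if missing else "none")
```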

We now have everything we need to generate files used in the training.

Generating acoustic feature files

To run the adaptation tools, we have to generate a set of acoustic feature files from the wav recordings.

sphinx_fe -argfile feat.params -samprate 16000 -c arctic_b0539.fileids -di . -do . -ei wav -eo mfc -mswav yes

Start the adaptation

To start the adaptation, launch this command, paying attention to the generated file names given to the last two options (-ctlfn and -lsnfn):

bw  -hmmdir .  -moddeffn mdef  -ts2cbfn .ptm.  -feat 1s_c_d_dd  -svspec 0-12/13-25/26-38  -cmn current  -agc none  -dictfn cmudict-en-us.dict  -ctlfn arctic_b0539.fileids  -lsnfn arctic_b0539.transcription 

Generate the new adapted model

First of all, we have to back up the model with the command cp -a en-us en-us-adapt, and then launch this command:

bw  -hmmdir en-us  -moddeffn en-us/mdef  -ts2cbfn .cont. -feat 1s_c_d_dd  -cmn current  -agc none  -dictfn en-us/cmudict-en-us.dict  -ctlfn arctic_b0539.fileids  -lsnfn en-us/arctic_b0539.transcription

Test the new model

To test the new model, run the command below with an audio file as a parameter; the output goes into a text file. The directory must also contain en-us.lm.bin, which is the language model, and cmudict-en-us.dict, which is the dictionary.

pocketsphinx_continuous -hmm en-us-adapt -lm en-us.lm.bin -dict cmudict-en-us.dict -infile test.wav > test.txt
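To put a number on the result, one common measure is the word error rate: the word-level edit distance between what was actually said and what Sphinx wrote, divided by the number of words actually said. The sketch below assumes you have a reference transcript of test.wav at hand (this post does not provide one), and word_error_rate is a name chosen for the example, not a Sphinx tool.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference is empty")
    # previous[j] is the edit distance between the reference words seen
    # so far (minus the current one) and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(previous[j] + 1,          # word dropped by Sphinx
                               current[j - 1] + 1,       # word added by Sphinx
                               previous[j - 1] + cost))  # word kept or replaced
        previous = current
    return previous[-1] / len(ref)


if __name__ == "__main__":
    # One replaced word and one dropped word out of five.
    print(word_error_rate("will we ever forget it", "will we never forget"))
```

Comparing the score obtained with -hmm en-us against the one obtained with -hmm en-us-adapt on the same file gives a rough idea of whether the adaptation helped.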

Get the samples in the right format

To know if Sphinx will be able to process our files in the future, we’ll feed it audio extracted from our previous videos. We did it in two steps, but it could have been done in just one. The first step is to ask ffmpeg to extract the audio from the videos. If you know your audio track is in wav format, then issue this command:

ffmpeg -i input-video.avi -vn -acodec copy output-audio.wav

If you know all of your files contain mp3, just change wav into mp3 and do it on all the videos:

for i in *; do ffmpeg -i "$i" "${i%.*}.mp3"; done

If you don’t know which format your audio is in, it’s time to issue an ffprobe command:

ffprobe -show_streams -select_streams a yourvideofile.extension
Stream #0:1(fra): Audio: aac (LC), 44100 Hz, stereo, fltp (default)

Here, for example, we have a stereo AAC stream at 44.1 kHz. We could then extract it as is with:

ffmpeg -i input.mp4 -vn -c:a copy output.m4a
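With many videos, reading the ffprobe output by eye gets tedious, so the check above can be scripted by asking ffprobe for JSON. This is a sketch under assumptions: the EXTENSIONS table is a small mapping picked for this example (codec name to a container that can hold it unchanged), and probe_audio and suggest_extension are hypothetical helper names.

```python
import json
import subprocess

# Assumed mapping for this example; extend it to match your own files.
EXTENSIONS = {"aac": "m4a", "mp3": "mp3", "pcm_s16le": "wav", "vorbis": "ogg"}


def probe_audio(path):
    """Run ffprobe on path and return its JSON report on the audio streams."""
    result = subprocess.run(
        ["ffprobe", "-v", "error", "-select_streams", "a",
         "-show_entries", "stream=codec_name,sample_rate,channels",
         "-of", "json", path],
        capture_output=True, text=True, check=True)
    return result.stdout


def suggest_extension(ffprobe_json):
    """Return a file extension for the first audio stream, or None."""
    streams = json.loads(ffprobe_json).get("streams", [])
    if not streams:
        return None  # no audio stream at all
    return EXTENSIONS.get(streams[0]["codec_name"])


if __name__ == "__main__":
    # The same kind of stream as the AAC example above.
    sample = '{"streams": [{"codec_name": "aac", "sample_rate": "44100", "channels": 2}]}'
    print(suggest_extension(sample))
```

On a real file, suggest_extension(probe_audio("input.mp4")) gives the extension to use with -c:a copy, or None when the codec is not in the table.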

Got it? Great. Now that you have extracted all the audio from your videos, it’s time to convert it to the format Sphinx wants to use (16 kHz, mono, 16-bit PCM):

for f in *.mp3; do ffmpeg -i "$f" -acodec pcm_s16le -ac 1 -ar 16000 "${f%.mp3}.wav"; done
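Before pushing the files anywhere, it can be reassuring to verify that the conversion produced what the ffmpeg options asked for. Here is a small sketch using Python’s standard wave module; is_sphinx_ready is a name invented for this example, and the expected values simply mirror the -ac 1, -ar 16000 and pcm_s16le options above.

```python
import glob
import wave


def is_sphinx_ready(path, rate=16000):
    """Return True if the wav file is mono, 16-bit and sampled at rate Hz."""
    with wave.open(path, "rb") as wav:
        return (wav.getnchannels() == 1        # mono (-ac 1)
                and wav.getsampwidth() == 2    # 2 bytes per sample (pcm_s16le)
                and wav.getframerate() == rate)  # 16 kHz (-ar 16000)


if __name__ == "__main__":
    for name in sorted(glob.glob("*.wav")):
        print(name, "ok" if is_sphinx_ready(name) else "needs conversion")
```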

Push them on the OrangePi 4B with:

scp *wav poddingue@

