ref: b5aad6a28299bd92939588f23d3ba3dafd5804f2
parent: 55513e81d8f606bd75d0ff773d2144e5f2a732f5
author: Jean-Marc Valin <jmvalin@jmvalin.ca>
date: Tue Mar 18 10:01:52 EDT 2025
Better model building instructions
--- a/dnn/datasets.txt
+++ b/dnn/datasets.txt
@@ -1,8 +1,6 @@
-The following datasets can be used to train a language-independent LPCNet model.
-A good choice is to include all the data from these datasets, except for
-hi_fi_tts for which only a small subset is recommended (since it's very large
-but has few speakers). Note that this data typically needs to be resampled
-before it can be used.
+The following datasets can be used to train a language-independent FARGAN model
+and a Deep REDundancy (DRED) model. Note that this data typically needs to be
+resampled before it can be used.
https://www.openslr.org/resources/30/si_lk.tar.gz
https://www.openslr.org/resources/32/af_za.tar.gz
@@ -61,7 +59,6 @@
https://www.openslr.org/resources/83/welsh_english_male.zip
https://www.openslr.org/resources/86/yo_ng_female.zip
https://www.openslr.org/resources/86/yo_ng_male.zip
-https://www.openslr.org/resources/109/hi_fi_tts_v0.tar.gz
The corresponding citations for all these datasets are:
@@ -164,10 +161,3 @@
doi = {10.21437/Interspeech.2020-1096},
url = {http://dx.doi.org/10.21437/Interspeech.2020-1096},
}
-
-@article{bakhturina2021hi,
- title={{Hi-Fi Multi-Speaker English TTS Dataset}},
- author={Bakhturina, Evelina and Lavrukhin, Vitaly and Ginsburg, Boris and Zhang, Yang},
- journal={arXiv preprint arXiv:2104.01497},
- year={2021}
-}
--- a/dnn/torch/rdovae/README.md
+++ b/dnn/torch/rdovae/README.md
@@ -8,24 +8,43 @@
## Data preparation
+First, fetch all the datasets listed in the datasets.txt file using:
+```
+./download_datasets.sh
+```
+
+Then resample and concatenate the data into a single 16-kHz file (this requires sox and GNU parallel):
+```
+./process_speech.sh
+```
+The script will produce an all_speech.pcm speech file in raw 16-bit PCM format (16 kHz, mono).
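+
+If you want to sanity-check the resulting file, one option is to convert a short
+excerpt back to WAV with sox (the check.wav name is just an example):
+```
+sox -t sw -r 16000 -c 1 all_speech.pcm check.wav trim 0 10
+```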
+
+
For data preparation you need to build Opus as detailed in the top-level README.
You will need to use the --enable-dred configure option.
The build will produce an executable named "dump_data".
To prepare the training data, run:
```
-./dump_data -train in_speech.pcm out_features.f32 out_speech.pcm
+./dump_data -train all_speech.pcm all_features.f32 /dev/null
```
-Where the in_speech.pcm speech file is a raw 16-bit PCM file sampled at 16 kHz.
-The speech data used for training the model can be found at:
-https://media.xiph.org/lpcnet/speech/tts_speech_negative_16k.sw
-The out_speech.pcm file isn't needed for DRED, but it is needed to train
-the FARGAN vocoder (see dnn/torch/fargan/ for details).
## Training
To perform training, run the following command:
```
-python ./train_rdovae.py --cuda-visible-devices 0 --sequence-length 400 --split-mode random_split --state-dim 80 --batch-size 512 --epochs 400 --lambda-max 0.04 --lr 0.003 --lr-decay-factor 0.0001 out_features.f32 output_dir
+python ./train_rdovae.py --sequence-length 400 --split-mode random_split --state-dim 80 --batch-size 512 --epochs 400 --lambda-max 0.04 --lr 0.003 --lr-decay-factor 0.0001 all_features.f32 output_dir
```
-The final model will be in output_dir/checkpoints/chechpoint_400.pth.
+The final model will be in output_dir/checkpoints/checkpoint_400.pth.
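+
+If training runs on a GPU, you can pin it to a specific device with, for example,
+the standard CUDA_VISIBLE_DEVICES environment variable:
+```
+CUDA_VISIBLE_DEVICES=0 python ./train_rdovae.py <same arguments as above>
+```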
--- /dev/null
+++ b/dnn/torch/rdovae/download_datasets.sh
@@ -1,0 +1,9 @@
+#!/bin/sh
+
+# Download every dataset listed in dnn/datasets.txt into a local datasets/ directory.
+mkdir -p datasets
+cd datasets || exit 1
+for i in $(grep '^https' ../../../datasets.txt)
+do
+    wget "$i"
+done
--- /dev/null
+++ b/dnn/torch/rdovae/process_speech.sh
@@ -1,0 +1,11 @@
+#!/bin/sh
+
+cd datasets || exit 1
+
+# Uncomment to extract the downloaded archives first (only needed once;
+# requires GNU parallel, unzip and tar).
+#parallel -j +2 'unzip -n {}' ::: *.zip
+#parallel -j +2 'tar xzf {}' ::: *.tar.gz
+
+# Resample everything to 16 kHz mono and concatenate into a single raw 16-bit PCM file.
+find . -name "*.wav" | parallel -k -j 20 'sox --no-dither {} -t sw -r 16000 -c 1 -' > ../all_speech.pcm
--