ref: b5aad6a28299bd92939588f23d3ba3dafd5804f2
parent: 55513e81d8f606bd75d0ff773d2144e5f2a732f5
author: Jean-Marc Valin <jmvalin@jmvalin.ca>
date: Tue Mar 18 10:01:52 EDT 2025
Better model building instructions
--- a/dnn/datasets.txt
+++ b/dnn/datasets.txt
@@ -1,8 +1,6 @@
-The following datasets can be used to train a language-independent LPCNet model.
-A good choice is to include all the data from these datasets, except for
-hi_fi_tts for which only a small subset is recommended (since it's very large
-but has few speakers). Note that this data typically needs to be resampled
-before it can be used.
+The following datasets can be used to train a language-independent FARGAN model
+and a Deep REDundancy (DRED) model. Note that this data typically needs to be
+resampled before it can be used.
https://www.openslr.org/resources/30/si_lk.tar.gz
https://www.openslr.org/resources/32/af_za.tar.gz
@@ -61,7 +59,6 @@
https://www.openslr.org/resources/83/welsh_english_male.zip
https://www.openslr.org/resources/86/yo_ng_female.zip
https://www.openslr.org/resources/86/yo_ng_male.zip
-https://www.openslr.org/resources/109/hi_fi_tts_v0.tar.gz
The corresponding citations for all these datasets are:
@@ -164,10 +161,3 @@
doi = {10.21437/Interspeech.2020-1096},
url = {http://dx.doi.org/10.21437/Interspeech.2020-1096},
}
-
-@article{bakhturina2021hi,
- title={{Hi-Fi Multi-Speaker English TTS Dataset}},
- author={Bakhturina, Evelina and Lavrukhin, Vitaly and Ginsburg, Boris and Zhang, Yang},
- journal={arXiv preprint arXiv:2104.01497},
- year={2021}
-}
--- a/dnn/torch/rdovae/README.md
+++ b/dnn/torch/rdovae/README.md
@@ -8,24 +8,43 @@
## Data preparation
+First, fetch all the datasets listed in the datasets.txt file using:
+```
+./download_datasets.sh
+```
+
+Then resample and concatenate the data into a single 16-kHz file (this requires sox and GNU parallel):
+```
+./process_speech.sh
+```
+The script will produce an all_speech.pcm speech file in raw 16-bit PCM format (16 kHz, mono).
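+
+If you want to sanity-check the resulting file, one option is to convert a short
+excerpt back to WAV with sox (the check.wav name is just an example):
+```
+sox -t sw -r 16000 -c 1 all_speech.pcm check.wav trim 0 10
+```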
+
+
For data preparation you need to build Opus as detailed in the top-level README.
You will need to use the --enable-dred configure option.
The build will produce an executable named "dump_data".
To prepare the training data, run:
```
-./dump_data -train in_speech.pcm out_features.f32 out_speech.pcm
+./dump_data -train all_speech.pcm all_features.f32 /dev/null
```
-Where the in_speech.pcm speech file is a raw 16-bit PCM file sampled at 16 kHz.
-The speech data used for training the model can be found at:
-https://media.xiph.org/lpcnet/speech/tts_speech_negative_16k.sw
-The out_speech.pcm file isn't needed for DRED, but it is needed to train
-the FARGAN vocoder (see dnn/torch/fargan/ for details).
## Training
To perform training, run the following command:
```
-python ./train_rdovae.py --cuda-visible-devices 0 --sequence-length 400 --split-mode random_split --state-dim 80 --batch-size 512 --epochs 400 --lambda-max 0.04 --lr 0.003 --lr-decay-factor 0.0001 out_features.f32 output_dir
+python ./train_rdovae.py --sequence-length 400 --split-mode random_split --state-dim 80 --batch-size 512 --epochs 400 --lambda-max 0.04 --lr 0.003 --lr-decay-factor 0.0001 all_features.f32 output_dir
```
-The final model will be in output_dir/checkpoints/chechpoint_400.pth.
+The final model will be in output_dir/checkpoints/checkpoint_400.pth.
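+
+If training runs on a GPU, you can pin it to a specific device with, for example,
+the standard CUDA_VISIBLE_DEVICES environment variable:
+```
+CUDA_VISIBLE_DEVICES=0 python ./train_rdovae.py <same arguments as above>
+```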
--- /dev/null
+++ b/dnn/torch/rdovae/download_datasets.sh
@@ -1,0 +1,9 @@
+#!/bin/sh
+
+# Download every dataset listed in dnn/datasets.txt into a local datasets/ directory.
+mkdir -p datasets
+cd datasets || exit 1
+for i in $(grep '^https' ../../../datasets.txt)
+do
+    wget "$i"
+done
--- /dev/null
+++ b/dnn/torch/rdovae/process_speech.sh
@@ -1,0 +1,11 @@
+#!/bin/sh
+
+cd datasets || exit 1
+
+# Uncomment to extract the downloaded archives first (only needed once;
+# requires GNU parallel, unzip and tar).
+#parallel -j +2 'unzip -n {}' ::: *.zip
+#parallel -j +2 'tar xzf {}' ::: *.tar.gz
+
+# Resample everything to 16 kHz mono and concatenate into a single raw 16-bit PCM file.
+find . -name "*.wav" | parallel -k -j 20 'sox --no-dither {} -t sw -r 16000 -c 1 -' > ../all_speech.pcm
--