Kaldi recipe for Mozilla's Common Voice corpus
Note: This post describes a recipe for the original, English-only Common Voice corpus released in 2018. Later releases, which add data for other languages, are not compatible with this recipe.
Recently, Mozilla published a first version of their Common Voice corpus. It consists of speech prompts from an unknown number of speakers (no speaker IDs are provided due to privacy concerns, see this forum thread). About 254 hours have been validated by multiple listeners. The data has been split into pre-defined training, development and test sets.
I created an initial Kaldi recipe for the corpus as a research exercise. It has …