Cantonese Audio Dataset

Cantonese is the traditional prestige variety and standard form of Yue Chinese. As a spoken language, it differs substantially from Mandarin, and the two are not mutually intelligible. As a result, even though we can find various packages for written Chinese / Mandarin, there are not many resources catering to Cantonese.

Today, I want to introduce a public Cantonese audio dataset that was collected in 1997 and 1998. It contains 93 audio recordings and around 230k words in total. More importantly, POS-tagged transcripts are also available for download.

You may find the essay here:

If you are interested in audio preprocessing, you can find more details at the following link: