About Our Spoken Corpus

The Spoken Corpus of Hong Kong Learners of Mandarin aims to provide high-quality recordings of Hong Kong college students speaking Mandarin. The corpus contains 10-12 hours of recorded data with phonological annotations that focus on two areas of segmental features (vowels and consonants), two areas of suprasegmental features (tones and retroflex finals), and mispronunciation.

The current corpus has the following characteristics:

  1. It provides high-quality recordings that are ideally suited for phonetic and acoustic analysis by researchers around the world.
  2. It offers recordings and phonological annotations that are easily accessible and immediately available to all learners, teachers and researchers, both inside and outside The EdUHK.
  3. It provides a platform for learners to discover the linguistic features on their own and to enhance their active engagement in their own learning.
  4. It describes the distinctive linguistic features of Mandarin as produced by Hong Kong university students.

With the corpus at its core, a Corpus-based Mandarin Pronunciation Learning System was developed. It consists of three interrelated components: a) online pronunciation practice designed specifically for Hong Kong learners; b) useful resources for Mandarin pronunciation learning and teaching, including an introduction to Chinese phonology, recommended learning websites, online videos, and online dictionaries; and c) a Praat beginners’ manual that lets learners practice Mandarin pronunciation with hands-on acoustic analysis techniques. For more information, please visit https://corpus.eduhk.hk/mandarin_pronunciation/.

Recording Materials

There are two kinds of data in the current corpus: reading tasks and free speech. The three reading tasks last around 5 minutes, and most free-speech recordings last about 2.5 minutes. Both the reading tasks and the free-speech task were designed by a Mandarin language expert.

Reading tasks & Free speech

Recording Conditions

To minimize background noise and thereby facilitate acoustic and phonetic measurement, all recordings were made directly onto a computer using the recording software Audacity in the language lab of the Hong Kong Institute of Education. The laboratory is quiet but not soundproofed. To ensure high-quality recordings, we set the sampling rate to 44.1 kHz, the standard rate for Compact Disc (CD) audio, which is recognized by the Audio Engineering Society and is also the rate most commonly used with MPEG-1 audio (VCD, SVCD, MP3). The recordings were saved in the standard .WAV format to preserve quality, but they were converted to MP3 before being uploaded to the website for easier storage and transfer. Those who are interested in downloading the recordings for their own research can use format conversion software to convert the audio to the format that meets their specific needs.
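For readers who convert a downloaded recording back to WAV, the following is a minimal sketch of how the sampling rate of the resulting file could be checked against the 44.1 kHz figure given above. It uses only Python's standard `wave` module, which reads uncompressed PCM WAV files; the function names and any file path passed to them are illustrative assumptions and are not part of the corpus tooling.

```python
import wave

# Sampling rate stated in the recording conditions above.
EXPECTED_RATE_HZ = 44_100


def describe_wav(path):
    """Return (sample_rate_hz, channels, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        duration = wav.getnframes() / rate
    return rate, channels, duration


def has_expected_rate(path, expected=EXPECTED_RATE_HZ):
    """Return True if the file's sampling rate equals the expected rate."""
    return describe_wav(path)[0] == expected
```

Note that converting an MP3 back to WAV does not restore the audio detail discarded by MP3 compression, so a matching sampling rate does not by itself indicate the quality of the original recording.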

Fellow researchers are welcome to use these recordings for research purposes with suitable acknowledgement.

Subjects

Forty Hong Kong tertiary students were recruited, all of whom are native speakers of Cantonese. All the subjects in this corpus are college students aged 18 to 28 at the Hong Kong Institute of Education. They come from a range of majors, including English majors, such as English education and English literature, and non-English majors, such as information technology, visual arts, and Chinese education. The majority of the subjects are studying for their bachelor’s degrees.

Recording Analysis

For the reading tasks, each recording is tagged with features according to a feature marking scheme that covers both suprasegmental and segmental features.

Two experienced Mandarin speakers were invited to annotate the recordings by listening to them very carefully and referring to the feature marking scheme. One of the annotators is a native speaker of Cantonese who achieved level 2A in the National Putonghua Exam. The other annotator is a native speaker of Mandarin who achieved level 1B in the National Putonghua Exam. Whenever the two annotators disagreed, a third annotator, who is a native speaker of Mandarin, decided the final annotation.
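The adjudication rule described above can be sketched roughly as follows. This is only an illustration of the logic, not the project's actual workflow or data format: the function names, the item identifiers, and the example labels are all hypothetical.

```python
def resolve_label(first, second, third=None):
    """Return the final label for one annotated item.

    If the first two annotators agree, their shared label stands.
    Otherwise the third annotator's label is taken as final.
    """
    if first == second:
        return first
    if third is None:
        raise ValueError("annotators disagree and no third annotation was given")
    return third


def resolve_all(first_pass, second_pass, third_pass):
    """Resolve labels item by item; each argument maps item id -> label."""
    return {
        item: resolve_label(first_pass[item], second_pass[item], third_pass.get(item))
        for item in first_pass
    }
```

For example, with hypothetical labels, `resolve_all({"s1": "tone"}, {"s1": "vowel"}, {"s1": "tone"})` yields `{"s1": "tone"}`, because the third annotation settles the disagreement.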

The annotators underwent two months of rigorous training, from July 2015 to September 2015, through a trial analysis of 20 sets of similar data.