Other Mandarin Corpora

  • Sinica Corpus / 中央研究院漢語平衡語料庫 (Taiwan)
    http://asbc.iis.sinica.edu.tw/

    The Sinica Balanced Corpus 4.0 is the first balanced Mandarin Chinese corpus with part-of-speech tagging. It contains 10 million words, and each text is categorized and tagged according to five criteria: genre, mode, topic, style, and source.

  • Sinorama (Taiwan)
    http://edba.ncl.edu.tw/sinorama/index.htm

    The Sinorama Chinese-English Parallel Text Corpus comprises 2,373 parallel Chinese and English texts published between 1976 and 2000, totaling 103,252 sentence pairs.

  • LIVAC Synchronous Corpus (Hong Kong)
    http://www.livac.org/

    LIVAC is a synchronous Chinese corpus that collects and analyzes language data from printed Chinese-language media in Hong Kong, Taiwan, Beijing, Shanghai, Singapore, and Macau in order to track linguistic and other developments. It includes more than 550 million characters of news media text.

Useful websites on corpus linguistics