
Corpora For Written Chinese:

Essay  •  May 22, 2011  •  2,104 Words (9 Pages)


Corpora for Written Chinese:

An Investigation into Their Availability

Abstract

This report investigates the availability of corpora for the Chinese language. The first part is a brief introduction to the history and development of Chinese corpora. The second part describes the current situation of corpora for Mandarin Chinese, including a list of such corpora that exist today. The third part is a deeper investigation into three chosen corpora, covering their purposes, contents, makers, availability, formatting, annotation, etc. The fourth part illustrates what kinds of corpora are in the making, and the conclusion considers what kinds of corpora still need to be built in the future.

Key words: corpus, Mandarin Chinese, availability, People's Daily corpus, HSK corpus, LIVAC corpus

1. Introduction: Chinese Corpora and Their History

When talking about corpora, one readily thinks of the Brown corpus, the LOB corpus, the London-Lund corpus, etc., most of which are English corpora. If we turn to corpora of "other languages", we may know the Swedish SUC corpus and the RWC Japanese corpus, but what about Chinese corpora? It seems that little investigation has been made into corpora of this language in western countries. How has corpus linguistics developed in China? Is it well developed? Are there many Chinese corpora available? Which is the largest and most famous one? And what are the details of these Chinese corpora? This assignment will try to answer these questions.

The first Chinese corpus might be the one called "Applied Glossary of Modern Chinese", which was created in the 1920s and is not machine-readable. Chen Heqin, the maker of this corpus, collected about five hundred thousand Chinese words in his work and aimed to use it in designing Chinese-language textbooks for primary schools (Feng Zhiwei, 2002, p.3-4). Work on computer-readable corpora in China began in 1979, and it has since become a significant research field in linguistic studies (Feng Zhiwei, 2002, p.5). Corpus linguistics in China is not only widely applied in lexicography, language teaching and machine translation, but is also a major subject of study in colleges and various academic institutions (Journal of Chinese Language and Computing, 11(2) 125).

One can divide the existing corpora in China into three types: Chinese corpora, English corpora, and parallel corpora. This report will focus on corpora for Mandarin Chinese.

2. Corpora for Mandarin Chinese: Current Situation

There are at least 12 relatively large Chinese corpora in existence today; their basic information can be found in the following table. Among them, the "National Balanced Corpus" is the largest tagged corpus. In Part 3, I will make a deeper investigation into the "People's Daily Corpus" (a corpus for common use), the "LIVAC Corpus" (a corpus for comparative studies), and the "HSK Open Corpus" (a corpus for educational use).

Table 1: Existing Corpora for Mandarin Chinese

Name | Maker | Year | Size (million)
Corpus for Contemporary Literature | Wuhan University | 1979 | 5.27
Corpus for Modern Chinese | Beihang University | 1983 | 20
Chinese Corpus for Middle-School Textbook | Beijing Normal University | 1983 | 1.068
Word Frequency Counting Corpus | Beijing Language & Culture University | 1983 | 1.82
HSK Open Corpus | Beijing Language & Culture University | Open corpus | 10
National Balanced Corpus | China Language Committee | unfinished | 70
Corpus of People's Daily | Peking University | 1998 | 27
China News Corpus | Shanxi University | 1988 | 2.5
Untagged Corpus | Shanghai Normal University | Open corpus | 30
Corpus of Writer's Digest | Shanghai Normal University | Open corpus | 1
LIVAC (Linguistic Variety in Chinese Communities) | City University of Hong Kong | 2005 | 15
Sinica Corpus & Sinica Treebank | Taiwan Central Research Institute | -- | 5

Translated from http://blog.csdn.net/yujun00/archive/2005/12/01/541633.aspx

3. Investigation: Three Chosen Corpora

3.1. Corpus of "People's Daily"

People's Daily is an official newspaper of the Central Committee of the Communist Party of China. With a circulation of 3 to 4 million, it is one of the most important newspapers in Mainland China (http://en.wikipedia.org/wiki/People's_Daily). The "People's Daily" corpus contains 2.7 million words of Mandarin Chinese and was released by the Institute of Computational Linguistics, Peking University. The corpus contains one year's data from People's Daily in 1998 and is divided into four categories, shown below:

Cited and translated from http://icl.pku.edu.cn/icl_groups/corpus/shengming.htm

The purpose of building such a corpus is mainly technical and experimental. It aims to study and improve the technology of tokenization, grammatical tagging, and the method of deciding the pronunciation of polyphonic Chinese characters, by combining the expertise of technicians from Peking University with the up-to-date technology of Fujitsu Corporation (http://icl.pku.edu.cn/icl_groups/corpus/shengming.htm, May 19, 2007).
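To make the terms tokenization and grammatical tagging more concrete, the following is a minimal Python sketch of how a segmented and tagged line of text might be read. It assumes a common plain-text convention in which tokens are separated by spaces and each token is written as word/tag. The function name parse_tagged_line, the sample sentence and the tag labels are hypothetical and chosen only for illustration; they are not taken from the corpus or its documentation.

```python
def parse_tagged_line(line):
    """Split a 'word/tag word/tag ...' line into (word, tag) pairs.

    The word/tag layout is an assumption made for this illustration.
    """
    pairs = []
    for token in line.split():
        # rpartition splits on the last slash, so a word that itself
        # contains '/' keeps its earlier slashes.
        word, sep, tag = token.rpartition("/")
        if not sep:
            # No slash found: keep the token as the word, with no tag.
            word, tag = token, None
        pairs.append((word, tag))
    return pairs


# Made-up sample line; the tags (r, v, ns) are illustrative labels only.
sample = "我/r 爱/v 北京/ns 天安门/ns"
pairs = parse_tagged_line(sample)
words = [word for word, _ in pairs]

print(pairs)           # the (word, tag) pairs
print("".join(words))  # the sentence with segmentation and tags removed
```

Since written Chinese does not separate words with spaces, the spaces in such a line carry the tokenization, and the label after each slash carries the grammatical tag.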

The corpus is distributed as text files, and the disc was released in 2003. The price is 10,100 RMB (about 10,100 SEK) for single-computer entry and 59,200 RMB (about 59,200 SEK) for multiple entry. This is of course not cheap for personal users, but fortunately most linguistic research institutes and universities in China have bought the People's Daily Corpus, so researchers and teachers can access it freely at their respective institutes. A small part of this corpus is also available at http://icl.pku.edu.cn/default_en.asp, which contains

...

...
