Title: Identifying Language Origin of Named Entity with Multiple Information Sources
Author: Jiali You (尤佳莉)
Year of publication: 2008
Volume: 16
Issue: 6
Pages: 1077-1086
Journal: IEEE Transactions on Audio, Speech, and Language Processing, 2008
Abstract
To identify the language origin of a named entity, morphological information associated with its letter spelling, such as letter N-grams, is commonly employed. However, with this information alone, named entities with similar spellings but different language origins are difficult to differentiate. In this paper, a measure of "popularity," in terms of the frequency or page count of the named entity in language-specific web search, is proposed for identifying its language origin. Morphological information, including letter or letter-chunk N-grams, is used in conjunction with web-based page counts to enhance the performance of language identification. Six languages are tested: English, German, French, Portuguese, Chinese, and Japanese (Chinese and Japanese named entities are shown in their corresponding phonetic alphabets, i.e., Pinyin and Romaji). Experiments show that when classifying the four languages written in the Latin alphabet (English, German, French, and Portuguese), features from different information sources yield substantial improvements in classification accuracy over a letter 4-gram-based baseline system. The accuracy increases from 75.0% to 86.3%, a 45.2% relative error reduction.
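
The relative error reduction quoted in the abstract follows from the two accuracy figures. The short Python sketch below reproduces that arithmetic; the function name `relative_error_reduction` is a hypothetical helper chosen for illustration and does not come from the paper.

```python
def relative_error_reduction(baseline_acc: float, new_acc: float) -> float:
    """Return the relative error reduction, in percent.

    Both arguments are accuracies in percent. The error rate is taken
    as 100 minus the accuracy (hypothetical helper, not from the paper).
    """
    baseline_err = 100.0 - baseline_acc  # error rate of the baseline system
    new_err = 100.0 - new_acc            # error rate of the improved system
    return (baseline_err - new_err) / baseline_err * 100.0


# Accuracy figures taken from the abstract: 75.0% baseline, 86.3% combined.
reduction = relative_error_reduction(75.0, 86.3)
print(f"{reduction:.1f}% relative error reduction")
```

The error rate falls from 25.0% to 13.7%, and the 11.3-point drop divided by the 25.0% baseline error gives the 45.2% stated above.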