Dictionary - What is the dictionary that comes with this software?

The dictionary displayed in the popup window is CC-CEDICT from MDBG, as stated on the page http://www.nlptool.com/manual.html of our website. CC-CEDICT is a collaborative project on the Internet, with users contributing entries and corrections. Its advantage is that it is continually updated as new words and expressions enter the Chinese language. Its disadvantage is that it is less professionally edited than printed Chinese-English dictionaries such as:

1) Chinese-English Dictionary (by Foreign Language Teaching and Research Press) 《汉英词典》
2) ABC Chinese-English Comprehensive Dictionary (by University of Hawai'i Press) 《ABC 汉语大词典》
3) Oxford Concise English-Chinese, Chinese-English Dictionary (by Oxford University Press) 《牛津精选英汉/汉英词典》
4) Tuttle Learner's Chinese-English Dictionary (by Periplus Editions (HK) Ltd) 《塔托初学者汉英词典》

Unlike simple dictionary apps, SmartCR offers more features, and supporting them requires linguistic resources beyond a Chinese-English dictionary.

The example sentence search uses a Chinese-English bilingual corpus of 5 million sentence pairs. When you double-click a word in a text, example sentences for this word are retrieved from the corpus instantly. That is why our software is large (250 MB). Each example is a Chinese-English sentence pair in which the Chinese sentence is the original and the English sentence is its translation by a human translator. Because the sentences were originally written in Chinese, the Chinese word usage is authentic. This wealth of example sentences goes a long way toward making up for the weaknesses of CC-CEDICT.
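As a rough illustration of how a word can be looked up in a large sentence-pair corpus without scanning every sentence, here is a minimal Python sketch of an inverted index. It is not SmartCR's actual implementation: the toy corpus and the names PAIRS, build_index and examples_for are assumptions made for this example only.

```python
from collections import defaultdict

# Toy corpus (hypothetical data): each entry is a pre-segmented Chinese
# sentence plus its English translation.
PAIRS = [
    (["我", "喜欢", "学习", "汉语"], "I like studying Chinese."),
    (["他", "在", "学习", "英语"], "He is studying English."),
    (["我", "喜欢", "茶"], "I like tea."),
]

def build_index(pairs):
    """Map each Chinese word to the positions of the sentence pairs containing it."""
    index = defaultdict(list)
    for i, (words, _english) in enumerate(pairs):
        for word in set(words):  # set(): record each sentence once per word
            index[word].append(i)
    return index

def examples_for(word, index, pairs, limit=10):
    """Return up to `limit` (Chinese, English) example pairs for `word`."""
    hits = index.get(word, [])[:limit]
    return [("".join(pairs[i][0]), pairs[i][1]) for i in hits]

INDEX = build_index(PAIRS)
for chinese, english in examples_for("学习", INDEX, PAIRS):
    print(chinese, "|", english)
```

The index is built once, so each lookup only touches the sentences that contain the word. That is one common way to keep retrieval fast on a corpus of millions of pairs.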

This bilingual corpus is also used to train our translation model (the core of the phrase-based statistical machine translation technology). In other words, we computed the statistics of Chinese-English phrase mapping from the 5 million Chinese-English sentence pairs. Admittedly, although a corpus of 5 million sentence pairs is huge, it is smaller than what Google Translate has. So our translation is not as good as Google's, and the resulting English sentences usually contain grammar mistakes. But combined with the Chinese-English phrase mapping feature, the translations are sufficient for understanding the meanings of Chinese words and even phrases. Our translation also works offline, because the translation model is stored on your system.
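To give an idea of what "statistics of phrase mapping" can mean, the following Python sketch estimates the probability of an English phrase given a Chinese phrase by relative frequency, which is a standard estimate in phrase-based statistical machine translation. It is an illustration only: the phrase pairs are invented, the step that extracts them from aligned sentences is skipped, and the names PHRASE_PAIRS and phrase_table are hypothetical.

```python
from collections import Counter

# Hypothetical phrase pairs, as if already extracted from aligned
# Chinese-English sentence pairs (the extraction step is not shown).
PHRASE_PAIRS = [
    ("学习", "study"),
    ("学习", "study"),
    ("学习", "learn"),
    ("汉语", "Chinese"),
]

def phrase_table(phrase_pairs):
    """Estimate p(english | chinese) as count(chinese, english) / count(chinese)."""
    pair_counts = Counter(phrase_pairs)
    source_counts = Counter(zh for zh, _en in phrase_pairs)
    return {
        (zh, en): n / source_counts[zh]
        for (zh, en), n in pair_counts.items()
    }

TABLE = phrase_table(PHRASE_PAIRS)
for (zh, en), p in sorted(TABLE.items()):
    print(zh, "->", en, round(p, 3))
```

With more sentence pairs, the counts become more reliable, which is why the size of the corpus affects translation quality.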

As for Chinese word segmentation, its accuracy depends on the precision of statistics on two levels: first, character-character combinations in Chinese word construction, and second, word-word combinations in Chinese sentence construction. We computed these statistics from one year of articles from 人民日报 (People's Daily, the most authoritative newspaper in China), which had been split into words and given part-of-speech (POS) tags by human editors. These data cover a large share of the possible Chinese character-character and word-word combinations. Combined with the advanced segmentation algorithm, they lead to high accuracy in segmenting Chinese words.
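The following Python sketch shows, in a much simplified form, how corpus statistics can drive segmentation. It scores each candidate word by its frequency alone (a unigram model) and uses dynamic programming to pick the most probable split. It is not the algorithm used in SmartCR, which the text above describes as also relying on character-character and word-word combinations. The counts in WORD_COUNTS are invented, and the function names are hypothetical.

```python
import math

# Hypothetical word counts, as if tallied from a hand-segmented corpus.
WORD_COUNTS = {"研究": 20, "研究生": 5, "生命": 15, "命": 2, "生": 3, "的": 50, "起源": 8}
TOTAL = sum(WORD_COUNTS.values())
MAX_WORD_LEN = max(len(w) for w in WORD_COUNTS)

def log_prob(word):
    """Log-probability of a word; unseen single characters get a small fallback score."""
    count = WORD_COUNTS.get(word)
    if count is None:
        # Fallback so that any input can be segmented character by character.
        return math.log(0.5 / TOTAL) if len(word) == 1 else float("-inf")
    return math.log(count / TOTAL)

def segment(text):
    """Return the split of `text` with the highest total log-probability."""
    # best[i] = (score of the best split of text[:i], start index of its last word)
    best = [(0.0, 0)] + [(float("-inf"), 0)] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - MAX_WORD_LEN), end):
            score = best[start][0] + log_prob(text[start:end])
            if score > best[end][0]:
                best[end] = (score, start)
    # Walk back through the stored start indices to recover the words.
    words, end = [], len(text)
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]

print(segment("研究生命的起源"))
```

The input is ambiguous: it could begin with 研究 ("research") or 研究生 ("graduate student"). The statistics decide between the two readings, which is why their precision matters for segmentation accuracy.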

Our part-of-speech (POS) tagging feature also makes use of the People's Daily corpus.