
Chinese word segmentation tool & POS tagging

A development that improves the efficiency of Chinese text mining applications

 
 Definition

This tool is aimed at developers of Chinese data mining and language processing applications, such as search engine providers. It makes part-of-speech (POS) tagging of Chinese text easier by first segmenting each sentence into words.


 
 Highlights

Main features
Chinese word segmentation and part-of-speech tagging, including Chinese named entity recognition.
 
Results
It segments a Chinese sentence into individual words and assigns each word a single part-of-speech tag.

Input: 我是一个学生。

Output: 我/r 是/v 一/m 个/q 学生/n 。
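
For developers consuming this output, the following is a minimal sketch of how the tagged line could be read back into (word, tag) pairs. It assumes only what the sample shows: tokens separated by spaces, each written as word/tag, with the final punctuation mark carrying no tag. The function name parse_tagged is hypothetical and is not part of the tool.

```python
from typing import List, Optional, Tuple


def parse_tagged(line: str) -> List[Tuple[str, Optional[str]]]:
    """Split whitespace-separated 'word/tag' tokens into (word, tag) pairs.

    Hypothetical helper: the format is inferred from the sample output only.
    """
    pairs: List[Tuple[str, Optional[str]]] = []
    for token in line.split():
        # Split on the last '/', so a word that itself contains '/' stays whole.
        word, sep, tag = token.rpartition("/")
        if sep:
            pairs.append((word, tag))
        else:
            # Token with no tag, such as the final '。' in the sample.
            pairs.append((token, None))
    return pairs


sample = "我/r 是/v 一/m 个/q 学生/n 。"
pairs = parse_tagged(sample)
for word, tag in pairs:
    print(word, tag)
```

Joining the words of the parsed pairs gives back the original unsegmented input sentence.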
 
Environment
Windows/Linux, Intel(R) Pentium(R) M Processor 1500MHz / Intel(R) Xeon(TM) CPU 3.00GHz

 
 Benefits to users

Unlike a generic text mining tool, it is specifically adapted to the Chinese language, where the same word can have different meanings and where the meaning of a sentence can change completely depending on how it is segmented.
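
To illustrate why segmentation matters, here is a small sketch that lists every way a string can be split into words from a toy lexicon. It is not the tool's algorithm; the function all_segmentations, the lexicon and the example string are assumptions chosen for illustration. The string 研究生命起源 can be read as 研究 / 生命 / 起源 (roughly "research the origin of life") or as 研究生 / 命 / 起源 (beginning with "graduate student"), and a segmenter has to choose between them.

```python
from typing import List, Set


def all_segmentations(text: str, lexicon: Set[str]) -> List[List[str]]:
    """Return every split of text into words that all appear in lexicon.

    Brute-force illustration only; real segmenters score candidates
    instead of enumerating them.
    """
    if not text:
        return [[]]  # one way to segment the empty string: no words
    results: List[List[str]] = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in lexicon:
            # Extend each segmentation of the remainder with this first word.
            for rest in all_segmentations(text[end:], lexicon):
                results.append([head] + rest)
    return results


# Toy lexicon, assumed for this example.
toy_lexicon = {"研究", "研究生", "生命", "命", "起源"}
segmentations = all_segmentations("研究生命起源", toy_lexicon)
for seg in segmentations:
    print(" / ".join(seg))
```

With this lexicon the sketch finds exactly the two readings described above.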

It can be used in any Chinese language processing software, such as online question answering or online machine translation, and can also be integrated into Internet search engines.


 
 Advantages

The system combines word segmentation, part-of-speech tagging and named entity identification in a unified way.
 
It ranked first among 10 word segmentation systems (best recall rate, best precision rate) at the SIGHAN 2006 Conference in Sydney, Australia.
© France Telecom - Orange 2012