I found ‘Jieba’ word segmentation on GitHub. Jieba is a powerful and widely used Chinese text segmentation library in Python, designed to perform efficient and accurate word segmentation of Chinese text. The name “Jieba” (结巴) means “to stutter,” humorously hinting at the process of breaking a continuous run of Chinese characters into individual words or phrases.
https://github.com/fxsjy/jieba
Unlike English, where words are separated by spaces and a single word usually conveys its meaning clearly, Chinese is written as a continuous string of characters, and a precise meaning often requires combining several characters into one word.
Key Features of Jieba:
• Full Mode: This mode scans the entire sentence and identifies every character combination that can form a dictionary word. It’s extremely fast but doesn’t resolve ambiguity between overlapping terms, so it may return more words than the sentence actually contains.
• Accurate Mode: In this default mode, Jieba attempts to cut the sentence accurately into words while handling ambiguous phrases based on context. This is slower than full mode but provides more accurate results.
• Search Engine Mode: An enhanced mode that further divides long words into smaller components, making it useful for information retrieval or search engines.
In full mode, if you input a sentence like:
import jieba
list(jieba.cut("我爱北京天安门", cut_all=True))
Jieba will output something like:
['我', '爱', '北京', '天安门', '安', '门']
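For comparison, here is a minimal sketch of the other two modes on the same sentence. The results in the comments are indicative; the exact segmentation depends on the dictionary version that ships with Jieba.

import jieba

# Accurate mode (the default): ambiguity is resolved, so each character
# belongs to exactly one word, typically ['我', '爱', '北京', '天安门']
print(list(jieba.cut("我爱北京天安门")))

# Search engine mode: like accurate mode, but long words are additionally
# split into shorter dictionary words, which is useful for building an index
print(list(jieba.cut_for_search("我爱北京天安门")))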
So I installed Jieba in my Python environment.
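Jieba is published on PyPI, so installing it is a single command:

pip install jieba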
The text I chose is the Chinese translation of Leo Tolstoy’s War and Peace.
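As a rough sketch of the next step, assuming the novel is saved locally as a UTF-8 text file (the filename war_and_peace_zh.txt below is just a placeholder), segmenting the whole book in accurate mode looks like this:

import jieba

# Placeholder path: the actual filename depends on where the text was saved
with open("war_and_peace_zh.txt", encoding="utf-8") as f:
    text = f.read()

# Accurate mode over the whole novel; jieba.cut returns a lazy generator,
# and materializing it as a list makes counting the tokens easy
words = list(jieba.cut(text))
print(f"Total tokens: {len(words)}")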