Making a list of traditional Chinese characters from a string

Question

I am trying to estimate the number of times each character is used in a large sample of traditional Chinese text. I am interested in characters, not words. The file also includes punctuation and Western characters.

I am reading in an example file that contains a large sample of traditional Chinese characters. Here is a small subset:

首映鼓掌10分鐘評語指不及《花樣年華》該片在柏林首映，完場後獲全場鼓掌10分鐘。王家衛特別為該片剪輯「柏林版本增減20處趙本山香港戲分被刪在柏林影展放映的《一代宗師》版本教李小龍武功葉問決戰散打王

另一增加的戲分是開場時葉問（梁朝偉飾）

My strategy is to read each line, split each line into a list of characters, and check whether each character already exists in a list or dictionary of characters. If the character does not yet exist there, I will add it; if it does exist, I will increase the counter for that specific character. I will probably use two lists: a list of characters and a parallel list containing the counts. This will be more processing, but should also be much easier to code.
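The counting strategy described above could be sketched with a single dictionary rather than two parallel lists. This is only a minimal sketch, assuming Python 3 (where `str` is already Unicode, unlike the Python 2 code used elsewhere in this thread) and text that has already been decoded. The function name `count_chars` and the restriction to the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) are illustrative assumptions, not part of the original question.

```python
def count_chars(text):
    """Return a dict mapping each CJK ideograph in `text` to its count."""
    counts = {}
    for ch in text:
        # Assumption: keep only the basic CJK Unified Ideographs block,
        # which skips punctuation, digits and Western letters.
        if '\u4e00' <= ch <= '\u9fff':
            counts[ch] = counts.get(ch, 0) + 1
    return counts


# A fragment of the sample text from the question.
sample = "該片在柏林首映，完場後獲全場鼓掌10分鐘。"
counts = count_chars(sample)
```

A dictionary lookup avoids scanning a list for every character, so it should be both simpler and faster than parallel lists. Rarer characters outside the basic block would need a wider range check.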

I have not gotten anywhere near this point yet.

I am able to read in the example file successfully. Then I am able to make a list for each line of my file. I am able to print out those individual lines into my output file and sort of reconstitute the original file, and the traditional Chinese comes out intact.

However, I run into trouble when I try to make a list of each character on a particular line.

I've read through the following question. I understood many of the comments, but unfortunately not enough to solve my problem: How to do a Python split() on languages (like Chinese) that don't use whitespace as word separator?

My code looks like the following

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import codecs

wordfile = open('Chinese_example.txt', 'r')

output = open('Chinese_output_python.txt', 'w')

LINES = wordfile.readlines()

Through various tests I am sure the following line is not splitting the string LINES[0] into its component Chinese characters.

A_LINE = list(LINES[0])

output.write(A_LINE[0])

If you just need a list of Chinese characters, then follow the recommended answer's advice in the question you linked to :). If you need to split by Chinese WORDS, good luck, only incredibly smart programs can do it. Just so you know: you don't need to split by words to then split by characters. You can just split by characters right away, nothing stops you from doing that :) – Patashu, Feb 10 '13 at 23:26
I need to split by *characters* only. I know Chinese words can be multiple characters long, but I do not need that. However, one of the solutions listed in the article does not work with my knowledge and my situation: list(u"这是一个句子") That code successfully places each of the characters into an element of a list. However, since I am dealing with a variable called LINES[0], I am not able to use that code successfully. I tried list(u"LINES[0]") but this isn't interpreted as the string of Chinese characters that LINES[0] represents. – grumpy_user_number_35, Feb 10 '13 at 23:32
Then just copy the code from the accepted answer at http://stackoverflow.com/a/3798790/497106 and you are done :) – Patashu, Feb 10 '13 at 23:33
Not sure how to do that. What can I do to get that very simple code, list(u"这是一个句子"), to work in my situation, where instead of the string 这是一个句子 I have LINES[0]? – grumpy_user_number_35, Feb 10 '13 at 23:41

score 0 · Answer 1 · edited May 23 '17 at 12:12


I mean you want to use this, from the answer by 'flow' at How to do a Python split() on languages (like Chinese) that don't use whitespace as word separator?:

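from re import compile as _Re

# Match either a UTF-16 surrogate pair (one character outside the BMP) or any
# single character; (?s) lets '.' match newlines too.
_unicode_chr_splitter = _Re( '(?s)((?:[\ud800-\udbff][\udc00-\udfff])|.)' ).split

def split_unicode_chrs( text ):
  # split() with a capturing group returns the matched characters interleaved
  # with empty strings, so the empty ones are dropped.
  return [ chr for chr in _unicode_chr_splitter( text ) if chr ]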
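As a rough illustration of how such a function might be called, here is a minimal sketch assuming Python 3, where string literals are already Unicode. The variable `line` is a hypothetical stand-in for one decoded line of the file (such as `LINES[0]` after decoding), and the splitter is redefined so the example runs on its own.

```python
import re

# Same pattern as in the answer: a surrogate pair, or any single character.
_splitter = re.compile('(?s)((?:[\ud800-\udbff][\udc00-\udfff])|.)').split


def split_unicode_chrs(text):
    # re.split with a capturing group also yields empty strings between
    # matches, so those are filtered out.
    return [ch for ch in _splitter(text) if ch]


line = "葉問決戰散打王"  # stands in for one decoded line of the file
chars = split_unicode_chrs(line)
```

On Python 3, `list(line)` gives the same result for this input, because iterating a `str` already yields one character at a time.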


answered Feb 10 '13 at 23:43

Patashu


How might this be applied in my context, with my string being LINES[0], an element of a larger list? I've copied that text into my code. – grumpy_user_number_35 Feb 10 '13 at 23:46
I really do not have any idea how to implement this solution. After I've added this function, how would I call it and have it split my LINES[0]? – grumpy_user_number_35 Feb 11 '13 at 00:01
If you use split_unicode_chrs on a string it will split it into a list of unicode characters. – Patashu Feb 11 '13 at 00:05
Hmm. OK, so in my case would that look like the following: mynewlist = split_unicode_chrs(LINES[0])? – grumpy_user_number_35 Feb 11 '13 at 00:45

score 0 · Accepted Answer · answered Feb 12 '13 at 20:44


To successfully split a line of traditional Chinese characters, I just had to know the proper syntax for decoding the UTF-8 bytes read from the file into a Unicode string. Pretty basic.

my_new_list = list(LINES[0].decode('utf8'))
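The reason decoding matters is that a file read as raw bytes holds several UTF-8 bytes per Chinese character, so splitting before decoding does not give characters. The following is a minimal sketch of the whole read-decode-split step, assuming Python 3 and a UTF-8 encoded file; the temporary file and its one-line content are made up so the example is self-contained.

```python
import io
import os
import tempfile

# Write a small UTF-8 sample file so the sketch can run on its own.
path = os.path.join(tempfile.mkdtemp(), 'Chinese_example.txt')
with io.open(path, 'w', encoding='utf-8') as f:
    f.write("另一增加的戲分\n")

# Binary mode returns the raw UTF-8 bytes, several per Chinese character.
with io.open(path, 'rb') as f:
    raw_line = f.readline()

# Decoding turns the bytes into text; list() then splits it by character.
chars = list(raw_line.decode('utf-8').rstrip('\r\n'))

# Alternatively, let io.open do the decoding while reading.
with io.open(path, 'r', encoding='utf-8') as f:
    chars_direct = list(f.readline().rstrip('\r\n'))
```

Passing `encoding='utf-8'` when opening the file is usually the less error-prone choice, since every line then arrives already decoded.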


grumpy_user_number_35
