
I am currently trying to estimate the number of times each character is used in a large sample of traditional Chinese characters. I am interested in characters not words. The file also includes punctuation and western characters.

I am reading in an example file that contains a large sample of traditional Chinese characters. Here is a small subset:

首映鼓掌10分鐘 評語指不及《花樣年華》 該片在柏林首映,完場後獲全場鼓掌10分鐘。王家衛特別為該片剪輯「柏林版本 增減20處 趙本山香港戲分被刪 在柏林影展放映的《一代宗師》版本 教李小龍武功 葉問決戰散打王

另一增加的戲分是開場時葉問(梁朝偉飾)

My strategy is to read each line, split it into a list of characters, and check whether each character already exists in a list or dictionary of characters. If the character does not exist there yet, I will add it; if it does, I will increase the counter for that character. I will probably use two lists: a list of characters and a parallel list containing the counts. This will mean more processing, but should be much easier to code.

I have not gotten anywhere near this point yet.

I am able to read in the example file successfully. Then I am able to make a list for each line of my file. I am able to print out those individual lines into my output file and sort of reconstitute the original file, and the traditional Chinese comes out intact.

However, I run into trouble when I try to make a list of each character on a particular line.

I've read through the following question and understood many of the comments, but not enough to solve my problem: "How to do a Python split() on languages (like Chinese) that don't use whitespace as word separator?"

My code looks like the following:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import codecs

wordfile = open('Chinese_example.txt', 'r')

output = open('Chinese_output_python.txt', 'w')

LINES = wordfile.readlines()

Through various tests I am sure the following line is not splitting the string LINES[0] into its component Chinese characters.

A_LINE = list(LINES[0])

output.write(A_LINE[0])
  • If you just need a list of chinese characters, then follow the recommended answer's advice in the question you linked to :). If you need to split by chinese WORDS, good luck, only incredibly smart programs can do it. Just so you know - you don't need to split by words to then split by characters. You can just split by characters right away, nothing stops you from doing that :) – Patashu Feb 10 '13 at 23:26
  • I need to split by *characters* only. I know Chinese words can be multiple characters long, but I do not need that. However, one of the solutions listed in the article does not work with my knowledge and my situation: list(u"这是一个句子") That code successfully places each of the characters into an element of a list. However, since I am dealing with a variable called LINES[0].... I am not able to use that code successfully. I tried list(u"LINES[0]") but this isn't interpreted as the string of Chinese characters that LINES[0] represents. – grumpy_user_number_35 Feb 10 '13 at 23:32
  • Then just copy the code from the accepted answer at http://stackoverflow.com/a/3798790/497106 and you are done :) – Patashu Feb 10 '13 at 23:33
  • Not sure how to do that. What can I do to get that very simple code, list(u"这是一个句子"), to work in my situation, where instead of the string 这是一个句子 I have LINES[0]? – grumpy_user_number_35 Feb 10 '13 at 23:41

2 Answers


I mean that you want to use this, from the answer by 'flow' to "How to do a Python split() on languages (like Chinese) that don't use whitespace as word separator?":

from re import compile as _Re

_unicode_chr_splitter = _Re( '(?s)((?:[\ud800-\udbff][\udc00-\udfff])|.)' ).split

def split_unicode_chrs( text ):
  return [ chr for chr in _unicode_chr_splitter( text ) if chr ]
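
The surrogate-pair branch of that regex only matters on narrow builds, where characters outside the Basic Multilingual Plane are stored as two code units. As a minimal sketch, assuming Python 3.3 or later (not the Python 2 used in the question), a decoded string can be split with a plain list() call. The sample string below is made up for illustration:

```python
# Hypothetical sample: two BMP characters followed by U+20000,
# a CJK character that lies outside the Basic Multilingual Plane.
text = "葉問\U00020000"

# On Python 3.3+ a str is a sequence of code points,
# so list() yields one element per character.
chars = list(text)
code_points = [ord(ch) for ch in chars]
```

Here chars has three elements, and the last one is the single code point U+20000 rather than a pair of surrogates.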
Patashu

To successfully split a line of traditional Chinese characters, I just had to know the proper syntax for handling encoded characters. Pretty basic.

# Python 2: decode the UTF-8 bytes into a unicode string, then split it into characters.
my_new_list = list(LINES[0].decode('utf8'))
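
For the character counting described in the question, here is a rough sketch assuming Python 3, where open() can decode the file directly. It replaces the two parallel lists with collections.Counter, which is a swapped-in technique rather than the approach from the question. The function name count_characters and the throwaway demo file are hypothetical, used only for illustration:

```python
import os
import tempfile
from collections import Counter

def count_characters(path, encoding="utf-8"):
    """Return a Counter mapping each character in the file to its frequency."""
    counts = Counter()
    with open(path, "r", encoding=encoding) as handle:
        for line in handle:
            # Iterating a decoded str yields one character at a time;
            # the trailing newline is stripped so it is not counted.
            counts.update(line.rstrip("\n"))
    return counts

# Demo on a temporary file holding a fragment of the sample text (assumed UTF-8).
sample = "該片在柏林首映,完場後獲全場鼓掌10分鐘。\n在柏林影展放映\n"
with tempfile.NamedTemporaryFile("w", encoding="utf-8", suffix=".txt", delete=False) as tmp:
    tmp.write(sample)
    demo_path = tmp.name

demo_counts = count_characters(demo_path)
os.remove(demo_path)
```

Punctuation and western characters are counted like any other character here. If only Chinese characters are wanted, they would need to be filtered out separately.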