
I tried the following Perl code to count the Chinese words in a file. It seems to run, but it doesn't produce the right result. Any help is greatly appreciated.

The error message is:

Use of uninitialized value $valid in concatenation (.) or string at word_counting.pl line 21, <FILE> line 21.
Total things  = 125, valid words = 

which suggests to me that the problem is the file format. The "total things" value is 125, which is the number of lines in the file (125 lines), not the number of words. The strangest part is that my console displays all the individual Chinese words correctly without any problem. The utf8 pragma is enabled.
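
As a first check, here is a small diagnostic sketch (not part of the original script below): if the file is UTF-8 but is read without a decoding layer, the UTF8 flag stays off and `length` reports bytes rather than characters.

# Diagnostic sketch (not from the question): does a line arrive decoded?
use strict;
use warnings;

open my $fh, '<', 'sample_file.txt' or die "Can't open: $!";
my $line = <$fh>;
printf "UTF8 flag: %s, length: %d\n",
    (utf8::is_utf8($line) ? 'on' : 'off'),   # 'off' means perl sees raw bytes
    length $line;                            # byte count when undecoded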

#!/usr/bin/perl -w
use strict;
use utf8;
use Encode qw(encode);
use Encode::HanExtra;

my $input_file = "sample_file.txt";
my ($total, $valid);
my %count;

open (FILE, "< $input_file") or die "Can't open $input_file: $!";

while (<FILE>) {
    foreach (split) {    # break $_ into words, assign each to $_ in turn
        $total++;
        next if /\W|^\d+/;    # skip "strange" words (rest of the loop body)
        $valid++;
        $count{$_}++;    # count each separate word, stored in a hash
        ## next comes here ##
    }
}

print "Total things  = $total, valid words = $valid\n";
foreach my $word (sort keys %count) {
    print "$word \t was seen \t $count{$word} \t times.\n";
}

##---Data----
sample_file.txt

那天约二更时,只见封肃方回来,欢天喜地.众人忙问端的.他乃说道:"原来本府新升的太爷姓贾名化,本贯胡州人氏,曾与女婿旧日相交.方才在咱门前过去,因见娇杏那丫头买线, 所以他只当女婿移住于此.我一一将原故回明,那太爷倒伤感叹息了一回,又问外孙女儿,我说看灯丢了.太爷说:`不妨,我自使番役务必探访回来.'说了一回话, 临走倒送了我二两银子."甄家娘子听了,不免心中伤感.一宿无话.至次日, 早有雨村遣人送了两封银子,四匹锦缎,答谢甄家娘子,又寄一封密书与封肃,转托问甄家娘子要那娇杏作二房. 封肃喜的屁滚尿流,巴不得去奉承,便在女儿前一力撺掇成了,乘夜只用一乘小轿,便把娇杏送进去了.雨村欢喜,自不必说,乃封百金赠封肃, 外谢甄家娘子许多物事,令其好生养赡,以待寻访女儿下落.封肃回家无话.
Ivan
  • `use utf8` means the *source code* is in utf8. It does nothing to file IO. – J-16 SDiZ Jan 06 '11 at 03:51
  • 2
    Note also that `use utf8;` tells perl that your script is in utf8; it says nothing about the data your program might manipulate. You need to use the ":utf8" layer on files to do that; see Hugmeir's answer below for how to do that for an already-open filehandle (binmode) or one you're just now open'ing. – jade Jan 06 '11 at 03:59
  • @J-16 SDiZ, jon: You realize that it is actually possible to live in a part of the world where "8-bit ASCII" is not universal, right? And that your computer's default locale settings might be something other than "ASCII/UTF8", and you keep your perl source in UTF8 so you can round-trip it easily to those nutty "There's another language besides English?" flat-earthers... and perl might, just might, have something like a `PERL_UNICODE` environment variable for helping users who work in non-English-speaking parts of the world. And none of this has anything to do with counting words. – johne Jan 06 '11 at 04:57
  • 2
    @johne, sure -- and I live in that part of the world. This question deals with *unicode file i/o*, not the *unicode source code*. Do you know `use encoding 'utf8';` and `use utf8;` do different things? If not, get a modern (> 5.8) perl book and read. – J-16 SDiZ Jan 06 '11 at 05:05
  • @J-16 SDiZ, yes, and you actually checked what `PERL_UNICODE` does, right? And regardless, the OP's question is _not_ about file encodings, but `count the Chinese _words_`, along with a clue that `125 is the number of lines`, which allows for the possibility that the user has the encoding issue under control, but can't count the _words_. If you happen to be on a Mac, it defaults to saving "non-ASCII'ish" files as UTF-16, which means none of this "use UTF8 to read the file" advice is going to get you anywhere, and might actually explain why "use utf8;" is in the source file. – johne Jan 06 '11 at 05:43
  • Unicode in `perl` is ***very*** complicated. It tries to be compatible with its old way of dealing with things (everything is 8-bit bytes) while bolting "Unicode-aware heuristics" on top of that. In fact, many times `perl` will `automagically` convert raw bytes to Unicode on the fly, in particular when dealing with and manipulating strings. It is entirely possible that the way the OP is performing the `split` operation (which is in pseudo code) is causing perl to re-interpret the raw bytes as UTF8, and the resulting internal string is already in UTF8 Unicode. – johne Jan 06 '11 at 05:53
  • _Word breaking_ is an extremely non-trivial problem. If your browser has the normal "double click to select an individual word" behavior, try double-clicking on parts of `ฉันกินข้าว`. This is Thai, but it goes to show that word breaking _Asian_ languages is non-trivial. Just because "one Unicode code point equals one word" happens to work in practice for Chinese does not mean it holds in general. If you want to "word break" for the general case, you need the much more powerful techniques that your browser / GUI uses... which can't easily be duplicated with a regex. – johne Jan 06 '11 at 06:02
  • @johne, I am quite certain that when the OP said "word", s/he meant "a single character", since Chinese speakers do not usually count "words" in a sentence the way you suggested. When a Chinese speaker wants to mean "word" in the same sense as in English, they would probably use the word "phrase" (a combination of several Chinese characters which has a special and different meaning). In other words, `word` (字) in Chinese is "equivalent to" `character` in English; `phrase` (詞語) in Chinese is "equivalent to" `word` in English. – cychoi Feb 13 '15 at 11:21
  • [Here is another example of incorrect use of "word" where the OP should have said "character"](http://stackoverflow.com/questions/20396456/how-to-do-word-counts-for-a-mixture-of-english-and-chinese-in-javascript). _Disclaimer: I am a native Chinese speaker_ – cychoi Feb 13 '15 at 11:21
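
Tying the `use utf8` comments above together, here is a minimal sketch of the distinction (it assumes the data file really is UTF-8 encoded, which the answers below call into question):

# Sketch of the distinction discussed in the comments above.
# Assumption: sample_file.txt really is UTF-8 (see the answers below).
use strict;
use warnings;
use utf8;   # declares only that THIS SOURCE FILE is UTF-8; no effect on file I/O

# File handles need their own encoding layer:
open my $in, '<:encoding(UTF-8)', 'sample_file.txt' or die "Can't open: $!";

# Output handles need a layer too, or wide characters trigger warnings:
binmode STDOUT, ':encoding(UTF-8)';

print scalar <$in>;   # first line, decoded on input, re-encoded on output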

2 Answers


We set STDOUT to the :utf8 I/O layer so that say won't output malformed data, then open the file with the same layer so that the diamond operator won't read malformed data. Then, inside the while loop, rather than splitting on whitespace, we use a regex with the "East_Asian_Width: Wide" Unicode property.

The `use utf8` is just for my personal sanity checking, and can be removed (Y).

use strict;
use warnings;
use 5.010;
use utf8;
use autodie;

# Encode STDOUT as UTF-8 so wide characters print cleanly:
binmode(STDOUT, ':utf8');

# Decode the file from UTF-8 while reading:
open my $fh, '<:utf8', 'sample_file.txt';

my ($total, $valid);
my %count;

while (<$fh>) {
    $total += length;          # characters on this line, not split() tokens
    for (/(\p{Ea=W})/g) {      # every East_Asian_Width=Wide character
        $valid++;
        $count{$_}++;
    }
}

say "Total things  = $total, valid words = $valid";
for my $word (sort keys %count) {
    say "$word \t was seen \t $count{$word} \t times.";
}

EDIT: J-16 SDiZ and daxim pointed out that the chances of sample_file.txt actually being in UTF-8 are slim. Read their comments, then take a look at the Encode module in perldoc, specifically the 'Encoding via PerlIO' section.
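
To make that concrete, here is a sketch of reading the file through an explicit encoding layer. GB18030 is only daxim's guess about the real encoding (see the comments below), not a confirmed fact; substitute whatever the file actually uses. Conveniently, the question already loads Encode::HanExtra, which is what provides GB18030.

# Sketch: read through an explicit encoding layer instead of :utf8.
# GB18030 is a guess (daxim's bet below); adjust to the file's real encoding.
use Encode::HanExtra;   # supplies the GB18030 encoding

open my $fh, '<:encoding(GB18030)', 'sample_file.txt' or die "Can't open: $!";

# Or, for an already-open handle:
# binmode $fh, ':encoding(GB18030)';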

Hugmeir
  • +1. Also, if you want to get rid of the warnings on empty lines, use: `my $total = 0;` – jade Jan 06 '11 at 03:53
  • Since the OP clearly states that he is attempting to count the number of words, how exactly does `\p{Ea=W}` accomplish this? I don't read or write Chinese, so this would seem to imply to me that Chinese words cannot span more than one Unicode code point (i.e., character), which seems unlikely. – johne Jan 06 '11 at 04:43
  • 1
    Because we aren't counting 'words.' That doesn't translate well from English to Chinese - A 'word' could be a single character, like 一, or could be a combination thereof, like (from a random dictionary pull) 誓死不降. Moreover, the first two characters from that second example - 誓死 - are also considered a word. You can't count these things, short of pulling a natural language parser or checking with a dictionary for every match. So this does exactly what the OP was doing: Counting characters. – Hugmeir Jan 06 '11 at 04:59
  • Actually, the OP very clearly states they are attempting to count "words", and not "characters". I have a lot of experience dealing with the low-level details of Unicode, but very little with Chinese. Can "normalization" significantly alter the number and type of code points `\p{Ea=W}` matches? A quick check with my usual Unicode tools shows that many characters, for example `只`, have various variants: `止`, `衹隻` (the way the info is displayed in the app, it is unclear if these two should be considered separate or together). – johne Jan 06 '11 at 06:34
  • Also, you mention `short of pulling a natural language parser...`, which is precisely what the word-breaking engine in ICU is designed to do. Though another commenter has mentioned that the current behavior for Chinese is roughly "break on Unicode code points", it definitely doesn't work this way for Thai and Japanese (i.e., it is a full-on natural language parser, using heuristics and dictionaries to do the work). This enhanced word-breaking functionality is fully available in ICU regexes: `split(/(?w)\b/,$string)`, for example. – johne Jan 06 '11 at 06:38
  • Also, as you point out, `word doesn't translate well from English to Chinese`. I happen to be a native English speaker, but I have had to do what the OP is asking about "for Asian languages". The OP may be the same as me and not appreciate the difficulties of "word breaking Asian languages with a regex". If this is the case, I'm just raising the possibility that they may be trying to do (or being asked to do) "natural language word breaking" of Chinese text using perl's regexes, and what might be "obvious" to you may not be "obvious" to them, in particular the reasons why. :) – johne Jan 06 '11 at 06:58
  • How do you know that `sample_file.txt` is encoded in UTF-8? I bet you a beer that it is really in GB18030. – daxim Jan 06 '11 at 10:02
  • @daxim: I don't. Like you and J-16 SDiZ have pointed out, that's one rather glaring oversight - I'll edit in a comment to point the reader to the comments. – Hugmeir Jan 06 '11 at 15:10
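
If counting characters really is the goal, a small variant of the answer's loop (a sketch, not from the answer itself) can match on the Han script property instead of East Asian width; unlike `\p{Ea=W}`, `\p{Han}` does not count wide punctuation such as 、 and 。:

# Variant sketch: count Han-script characters only, skipping wide punctuation.
use strict;
use warnings;
use 5.010;

binmode STDOUT, ':utf8';
open my $fh, '<:utf8', 'sample_file.txt' or die "Can't open: $!";

my $valid = 0;
my %count;
while (<$fh>) {
    for (/(\p{Han})/g) {   # each character in the Han script
        $valid++;
        $count{$_}++;
    }
}
say "valid characters = $valid";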

I may be able to offer some insight, but it's hard to tell if my answer will be "helpful". First, I only speak and read English, so I obviously do not speak or read Chinese. I do happen to be the author of RegexKitLite, which is an Objective-C wrapper around the ICU regex engine. This is obviously not perl. :)

Despite this, the ICU regex engine happens to have a feature that sounds remarkably like what you're trying to do. Specifically, the ICU regex engine has the UREGEX_UWORD modifier option, which can be turned on dynamically via the normal (?w:...) syntax. This modifier performs the following action:

Controls the behavior of \b in a pattern. If set, word boundaries are found according to the definitions of word found in Unicode UAX 29, Text Boundaries. By default, word boundaries are identified by means of a simple classification of characters as either “word” or “non-word”, which approximates traditional regular expression behavior. The results obtained with the two options can be quite different in runs of spaces and other non-word characters.

You can use this in a regex like (?w:\b(.*?)\b) to "extract" words from a string. The ICU regex engine has a fairly powerful word-breaking engine that is specifically designed to find word breaks in written languages that do not have an explicit space character, unlike English. Again, since I don't read or write those languages, my understanding is that "itisroughlysomethinglikethis". The ICU word-breaking engine uses heuristics, and occasionally dictionaries, to find the word breaks. It is my understanding that Thai happens to be a particularly difficult case. In fact, I happen to use ฉันกินข้าว (Thai for "I eat rice", or so I was told) with a regex of (?w)\b\s* to perform a split operation on the string to extract the words. Without (?w) you cannot split on word breaks. With (?w) it results in the words ฉัน, กิน, and ข้าว.
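
As an aside that post-dates this answer: Perl eventually grew something comparable. Since 5.22, `\b{wb}` implements the default UAX #29 word-boundary rules in Perl regexes. Unlike ICU's engine it has no dictionary support, so Thai and Chinese text still won't segment into dictionary words, but for space-delimited text it behaves much like (?w)\b. A sketch:

# Sketch (requires perl 5.22+, newer than this answer): \b{wb} gives the
# default UAX #29 word boundaries.  No dictionaries, so Thai/Chinese text
# still won't split into real words, unlike ICU's (?w).
use v5.22;
use warnings;

my $text = "He said: can't stop, won't stop.";
# Split at every word boundary, then drop the pure-whitespace pieces:
my @words = grep { /\S/ } split /\b{wb}/, $text;
print join('|', @words), "\n";   # He|said|:|can't|stop|,|won't|stop|.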

Provided the above sounds like the problem you're having, then this could be the reason. If that is the case, I am not aware of any way to accomplish this in perl, but I wouldn't consider this opinion an authoritative answer, since I use the ICU regex engine more often than the perl one and am clearly not properly motivated to find a working perl solution when I've already got one :). Hope this helps.

johne
  • (0) The ICU Chinese word-breaking algorithm breaks on characters; (1) for most Chinese people, a Chinese "word" means the glyph, the character; (2) for search engines, in practice most index engines just group every two characters and treat that as a "word"; (3) when real "word" segmentation is needed, the dictionary approach (the one used for Thai) never works well. – J-16 SDiZ Jan 06 '11 at 04:51
  • For those of you who are downvoting my answer, please keep in mind that native English speakers may literally not understand that you cannot "word break Asian languages like you can ASCII English." In fact, for non-native English speakers, the thought of "word breaking Asian text with a regex" is so absurd that it may not occur to you that someone may actually want to do this, and that to them it should be roughly as trivial as `split(/\s*/, $string)`. – johne Jan 06 '11 at 07:12
  • I think Ivan made a mistake; he meant to say "character" but wrote "word"; that's how I read it from the sample code. Despite giving an answer that doesn't fit the intent of the question, +1 for you, because I know it will come in handy for future users arriving via search. – I agree that ICU sucks; luckily there are many other segmentation engines available on CPAN. – daxim Jan 06 '11 at 09:58
  • I do not agree with [two claims that J-16 SDiZ made above](http://stackoverflow.com/q/4611425#comment-5069748). ⑴ My observation is that most Chinese people have acquired the little bit of grammar education necessary to distinguish between "character" 字 and "word" 词. ⑶ A dictionary approach simplistically using the longest match works in fact very well; the average error rate is only about 0.5%. – daxim Jan 06 '11 at 10:02
  • @daxim, (1.0) Google translates "word" as 字 as the first result; (1.2) computer scientists call them "words" (because they are made up of multiple characters), linguists call them "morphemes" 語素/詞素, because each represents a single meaning (but this may mean "part of speech" in some parts of China). – J-16 SDiZ Jan 06 '11 at 11:17
  • @daxim, try this for your amusement (and to see how naive computer word segmentation fails): http://itre.cis.upenn.edu/~myl/languagelog/archives/004808.html – J-16 SDiZ Jan 06 '11 at 11:18