
I have written a simple program to count the number of different characters in a text file. This is the code below:

#include <iostream>
#include <fstream>
#include <map>
using namespace std;
const char* filename="text.txt";
int main()
{
    map<char,int> dict;
    fstream f(filename);
    char ch;
    while (f.get(ch))
    {
        if(!f.eof())
            cout<<ch;
        if (!dict[ch])
            dict[ch]=0;
        dict[ch]++;
    }
    f.close();
    cout<<endl;
    for (auto it=dict.begin();it!=dict.end();it++)
    {
        cout<<(*it).first<<":\t"<<(*it).second<<endl;
    }
    system("pause");
}

The program does well counting ASCII characters, but it does not work with Unicode characters such as Chinese characters. How can I solve the problem if I want it to work with Unicode characters?

罗泽轩
    First of all you are going to need to settle on an encoding. Do you know which encoding you intend to use? And then you need to work out what exactly you mean by "character". – David Heffernan May 20 '13 at 16:23
  • There is no such thing as 'unicode character'. You may refer to utf8everywhere.org for differences between different concepts of characters in unicode, or to the "how twitter counts characters" article for justification of different approaches. In either case, there is little sense in counting code points. – Pavel Radzivilovsky May 21 '13 at 18:29

4 Answers


First off, what do you want to count? Unicode code points or grapheme clusters, i.e., characters in the encoding sense or characters as perceived by the reader? Also keep in mind that "wide characters" (16-bit characters) are not Unicode characters (UTF-16 is variable length, just like UTF-8!).

In any case, get a library such as ICU to do the actual code point/cluster iteration. For counting, you need to replace the char key type in your map with an appropriate type: either a 32-bit unsigned integer (or ICU's UChar32) for code points, or normalized strings for grapheme clusters (normalization should, again, be taken care of by a library). A sketch of the code-point variant follows the links below.

ICU: http://icu-project.org

Grapheme clusters: http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries

Normalization: http://unicode.org/reports/tr15/
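
For illustration, counting code points with ICU might look roughly like this. This is a minimal sketch, assuming the input file is UTF-8 encoded and the program is built against ICU's common library (e.g. linked with -licuuc); the filename text.txt is taken from the question:

#include <fstream>
#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <unicode/unistr.h>
#include <unicode/utf16.h>

int main()
{
    std::ifstream f("text.txt", std::ios::binary);
    std::stringstream buffer;
    buffer << f.rdbuf();                      // slurp the raw UTF-8 bytes

    icu::UnicodeString text = icu::UnicodeString::fromUTF8(buffer.str());

    std::map<UChar32, int> dict;              // key: a full 32-bit code point
    for (int32_t i = 0; i < text.length(); ) {
        UChar32 cp = text.char32At(i);        // whole code point, not a UTF-16 unit
        ++dict[cp];
        i += U16_LENGTH(cp);                  // advance by 1 or 2 UTF-16 code units
    }

    for (auto const& entry : dict) {
        std::string utf8;
        icu::UnicodeString(entry.first).toUTF8String(utf8);
        std::cout << utf8 << ":\t" << entry.second << '\n';
    }
}

For grapheme clusters you would instead create an icu::BreakIterator with BreakIterator::createCharacterInstance and key the map on the text of each cluster.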

Joe
    Yes. If you want to go beyond codepoints, and treat what the reader would consider a single character, it's a lot more work. You might also consider that most readers would consider `'A'` and `'a'` the same character, or that `'a'` and `'ä'` are the same character in French, but different characters in Swedish. – James Kanze May 20 '13 at 16:41
  • You have that situation in English as well. Although the use of the diaeresis has become less popular, it is still sometimes used in words such as coöperation or naïve. – Joe May 20 '13 at 16:47
  • In German you could even reason that ö should be counted as an o and an e, as it is technically a contraction of those two letters (instead of being a letter in its own right, as in Swedish). – Joe May 20 '13 at 16:50
  • I use it myself in "naïve" and "Noël"; I wasn't aware that any other spellings were acceptable. But Swedish doesn't have an "ë", and I'm not sure how other languages treat it. (Then of course there's German, where "Ä" and "Ae" are the same letter; the second is what you use on a Swiss German keyboard.) – James Kanze May 20 '13 at 16:51
  • "Naive" is a perfectly acceptable spelling, see for example [link](http://www.merriam-webster.com/dictionary/naive). I myself prefer the version with an "ï" as well. – Joe May 20 '13 at 16:54
  • Yes. And of course, the "ue" in "Muell" counts as one letter, but in "Duell" as two. – James Kanze May 20 '13 at 16:55

You need a Unicode library to handle Unicode characters. Decoding, say, UTF-8 yourself would be a harsh task, and would be reinventing the wheel.

In this Q/A from SO a good one is mentioned, and you'll find advice in the other answers as well.
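
As noted in the comments below, getting the input into a wide internal format requires the right locale. For a quick experiment without an external library, here is a minimal sketch of that route, assuming C++11's std::codecvt_utf8 facet and a UTF-8 encoded text.txt (note that wchar_t is only 16 bits on Windows, so characters outside the BMP are not handled there, and whether the wide output displays correctly depends on the console locale):

#include <codecvt>   // std::codecvt_utf8 (C++11; deprecated in C++17 but still available)
#include <fstream>
#include <iostream>
#include <locale>
#include <map>

int main()
{
    std::wifstream f;
    // Imbue before opening, so the facet converts the UTF-8 bytes on input.
    f.imbue(std::locale(f.getloc(), new std::codecvt_utf8<wchar_t>));
    f.open("text.txt");

    std::map<wchar_t, int> dict;
    wchar_t ch;
    while (f.get(ch))
        ++dict[ch];

    for (auto const& entry : dict)
        std::wcout << entry.first << L":\t" << entry.second << std::endl;
}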

Déjà vu
    In addition to ring0's reference; there is a good explanation in http://stackoverflow.com/questions/402283/stdwstring-vs-stdstring as well – Ahmed Masud May 20 '13 at 16:21
  • For simple things like this, interpreting UTF-8 yourself is fairly simple and straightforward, and you can avoid having to go through all of the conversion work. – James Kanze May 20 '13 at 16:44
  • @cubuspl42 It supports Unicode literal strings, and has data types for the various Unicode formats. But getting your input into the desired internal format requires having the right locale. – James Kanze May 20 '13 at 16:45

There are wide-char versions of everything, so if you want to do something very similar to what you have now and are working with a 16-bit encoding of Unicode:

map<unsigned short,int> dict;
fstream f(filename, ios::in | ios::binary);
char lo, hi;
unsigned short val;
while (f.get(lo) && f.get(hi))   // stop at end of file (or at an odd trailing byte)
{
    // Beware endian issues here - should work either way for char counting though.
    // Go through unsigned char so that high bytes are not sign-extended.
    val = static_cast<unsigned char>(lo);
    val |= static_cast<unsigned char>(hi) << 8;

    cout<<val;
    dict[val]++;
}
f.close();
cout<<endl;
for (auto it=dict.begin();it!=dict.end();it++)
{
    cout<<(*it).first<<":\t"<<(*it).second<<endl;
}

The above code makes lots of assumptions (every character is 16 bits, the file has an even number of bytes, the data is little-endian, etc.), but it should do what you want, or at least give you a quick idea of how it could work with wide chars.

Michael Dorgan
  • Unluckily, there are some chars which are not 16-bit. And the code just prints numbers to the screen (though I have used static_cast to change the type). I don't know how to map the numbers back to real characters. – 罗泽轩 May 21 '13 at 11:45

If you can compromise and just count code points, it's fairly simple to do directly in UTF-8. Your dictionary, however, will have to be a std::map<std::string, int>. Once you've read the first byte of a UTF-8 sequence, its value tells you how many bytes the code point occupies:

while ( f.get( ch ) ) {
    static size_t const charLen[] = 
    {
          1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
          1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
          1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
          1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
          1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
          1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
          1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
          1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
          0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
          0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
          0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
          0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
          2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,
          2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,
          3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,
          4,  4,  4,  4,  4,  4,  4,  4,  5,  5,  5,  5,  6,  6,  0,  0,
    } ;
    int chLen = charLen[ static_cast<unsigned char>( ch ) ];
    if ( chLen <= 0 ) {
        //  error: impossible first character for UTF-8
    }
    std::string codepoint( 1, ch );
    while ( --chLen > 0 ) {         //  read the remaining bytes of the sequence
        if ( !f.get( ch ) ) {
            //  error: file ends in middle of a UTF-8 code point.
            break;
        } else if ( (ch & 0xC0) != 0x80 ) {
            //  error: illegal following character in UTF-8
        } else {
            codepoint += ch;
        }
    }
    ++ dict[codepoint];
}

You'll note that most of the code is involved in error handling.
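
Printing the counts afterwards works just like the loop in your original code, and since each key is already a UTF-8 byte sequence it will display correctly on a UTF-8 terminal (a sketch, assuming dict is the std::map<std::string, int> mentioned above):

for ( auto const& entry : dict ) {
    std::cout << entry.first << ":\t" << entry.second << '\n';
}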

James Kanze