
I'm trying to write a program that counts all the characters in a string in the Turkish language. I can't see why this does not work. I added the locale library and `setlocale(LC_ALL,"turkish")`, but it still doesn't work. Thank you. Here is my code (my file's character encoding is UTF-8):

#include <stdio.h>
#include <locale.h>

int main(){

    setlocale(LC_ALL,"turkish");
    char string[9000];
    int c = 0, count[30] = {0};
    int bahar = 0;

    ...
        if ( string[c] >= 'a' && string[c] <= 'z' ){
            count[string[c]-'a']++;
            bahar++;
        }
    ...
}

My output:

a 0.085217 b 0.015272 c 0.022602 d 0.035736 e 0.110263 f 0.029933 g 0.015272 h 0.053146 i 0.071167 k 0.010996 l 0.047954 m 0.025046 n 0.095907 o 0.069334 p 0.013745 q 0.002443 r 0.053451 s 0.073916 t 0.095296 u 0.036958 v 0.004582 w 0.019243 x 0.001527 y 0.010996

This is the English alphabet, but I need these characters counted too: "ğ, ü, ç, ı, ö".

jekyll
  • You need a Unicode library that handles UTF-8/UTF-16/UTF-32. – Stargateur Dec 28 '16 at 00:26
  • 1
    ... And check that you can open the file – Ed Heal Dec 28 '16 at 00:26
  • How can I fix it then? @stargateur – jekyll Dec 28 '16 at 00:27
  • My file includes the "ğ,ü,ç,ı,ö" characters @EdHeal, but it doesn't count them. – jekyll Dec 28 '16 at 00:28
  • @jekyll Do some searching yourself, http://stackoverflow.com/questions/313555/light-c-unicode-library. – Stargateur Dec 28 '16 at 00:30
  • You would need `wchar_t` for reading – Ed Heal Dec 28 '16 at 00:30
  • @EdHeal thank you, but I don't know how to use this function. I'll search, thanks again. Will I have to change the whole code? :( – jekyll Dec 28 '16 at 00:33
  • It is a data type – Ed Heal Dec 28 '16 at 00:34
  • I added this @EdHeal `#include <wchar.h>` `wchar_t string[9000];` and I got an error on this line: `if ( fgets(string, 9000, plain) != NULL) { puts(string);` – jekyll Dec 28 '16 at 00:39
  • Perhaps this would help http://www.cplusplus.com/reference/cwchar/fgetws/ or https://linux.die.net/man/3/fgetws – Ed Heal Dec 28 '16 at 00:41
  • @EdHeal I fixed it, thanks, but it still doesn't count my characters :( `if ( fgetws(string, 9000, plain) != NULL) { fputws(string, plain);` – jekyll Dec 28 '16 at 00:46
  • 2
    Did you check and print the return value from `setlocale()`? Is `"turkish"` a valid locale string? (I use `en_US.UTF-8` by default: I'd expect you to be using a code such as `tr_TR.UTF-8` or `tr-TR.ISO8859-9` or something vaguely similar — both those locales exist on macOS Sierra, at least on my machine.) – Jonathan Leffler Dec 28 '16 at 00:59
  • @JonathanLeffler thanks for the reply; I changed this line to `setlocale(LC_ALL,"en_US.UTF-8");` and tried your suggestions, but it still counts just English characters :((( "ğ" isn't counted. – jekyll Dec 28 '16 at 01:02
  • I know Windows in the past used locale names like `Turkish_Turkey.1254` instead of something like `tr_TR.ISO8859-9`, though more recent editions allow you to use `tr-TR.1254`. Please edit your question to include both the system you're executing the code on and the character encoding of your file, so we can provide more accurate answers. If you're uncertain of the character encoding, you can upload the file to a [character encoding detector](https://nlp.fi.muni.cz/projects/chared/) to obtain this information. –  Dec 28 '16 at 01:19
  • Maybe the problem is in this line: `( string[c] >= 'a' && string[c] <= 'z' )`, because the program only sees English characters here. Any ideas? @EdHeal @JonathanLeffler – jekyll Dec 28 '16 at 01:19
  • @ChronoKitsune thanks for the reply; my file's character encoding as detected by Chared: utf_8 – jekyll Dec 28 '16 at 01:21
  • 2
    The `en_US.UTF-8` string is for US English; `en_GB.UTF-8` for British English; both use the UTF-8 code set. Note that your test using `if (string[c] >= 'a' && string[c] <= 'z')` only detects unaccented letters in the basic ASCII (lower-case) range. You'd need to use `isalpha()` from `` to detect alphabetic characters outside the basic ASCII range. You then have to map those to appropriate indexes to count them properly. This is some of what Chrono Kitsune does with their answer. It is hard work dealing with such characters. Knowing the code set (which you found is UTF-8) is crucial. – Jonathan Leffler Dec 28 '16 at 04:19

3 Answers

2
setlocale(LC_ALL,"turkish");

First: "turkish" isn't a locale.

The proper name of a locale will typically look like xx_YY.CHARSET, where xx is the ISO 639-1 code for the language, YY is the ISO 3166-1 Alpha-2 code for the country, and CHARSET is an optional character set name (usually ISO8859-1, ISO8859-15, or UTF-8). Note that not all combinations are valid; the computer must have locale files generated for that specific combination of language code, country code, and character set.

What you probably want here is setlocale(LC_ALL, "tr_TR.UTF-8").


if ( string[c] >= 'a' && string[c] <= 'z' ){

Second: Comparison operators like >= and <= are not locale-sensitive. This comparison will always be performed on bytes, and will not include characters outside the ASCII a-z range.

To perform a locale-sensitive comparison, you must use a function like strcoll(). However, note additionally that some letters (including the ones you're trying to include here!) are composed of multi-byte sequences in UTF-8, so looping over bytes won't work either. You will need to use a function like mblen() or mbtowc() to separate these sequences.

  • Thanks for the reply, I tried `setlocale(LC_ALL, "tr_TR.UTF-8")` but it still counts just English characters. I can't see why this does not work. – jekyll Dec 28 '16 at 01:06
  • Thanks, but will this change the whole code? I don't know how strcoll() and the other functions work; I'll search for them. Thanks. – jekyll Dec 28 '16 at 01:11
  • Yes, you will have to change your program rather substantially to support UTF-8 text. –  Dec 28 '16 at 01:45

Since you are apparently working with a UTF-8 file, the answer will depend upon your execution platform:

  1. If you're on Linux, setlocale(LC_CTYPE, "en_US.UTF-8") or something similar should work, but the important part is the UTF-8 at the end! The language shouldn't matter. You can verify it worked by using

    if (setlocale(LC_CTYPE, "en_US.UTF-8") == NULL) {
        abort();
    }
    

    If the locale can't be set, the call returns NULL and the program aborts immediately; reaching any code after that point means the locale was set correctly.

  2. If you're on Windows, you can instead open the file using fopen("myfile.txt", "rt, ccs=UTF-8"). This isn't entirely portable to other platforms, but it's a lot cleaner than the alternatives, which is likely more important in this particular case.

  3. If you're using FreeBSD or another system that doesn't allow you to use either approach (e.g. there are no UTF-8 locales), you'd need to parse the bytes manually or use a library to convert them for you. If your implementation has an iconv() function, you might be able to use it to convert from UTF-8 to ISO-8859-9 to use your special characters as single bytes.

Once you're ready to read the file, you can use fgetws with a wchar_t array.

Another problem is checking if one of your non-ASCII characters was detected. You could do something like this:

// lower = "abcdefghijklmnopqrstuvwxyzçöüğı"
// upper = "ABCDEFGHİJKLMNOPQRSTUVWXYZÇÖÜĞI"
const wchar_t lower[] = L"abcdefghijklmnopqrstuvwxyz\u00E7\u00F6\u00FC\u011F\u0131";
const wchar_t upper[] = L"ABCDEFGH\u0130JKLMNOPQRSTUVWXYZ\u00C7\u00D6\u00DC\u011EI";

const wchar_t *lchptr = wcschr(lower, string[c]);
const wchar_t *uchptr = wcschr(upper, string[c]);
if (lchptr) {
    count[(size_t)(lchptr-lower)]++;
    bahar++;
} else if (uchptr) {
    count[(size_t)(uchptr-upper)]++;
    bahar++;
}

That code assumes you're counting characters without regard for case. That is, ı (\u0131) and I are considered the same character (count[30]++), just like İ (\u0130) and i are considered the same (count[8]++). I won't claim to know much about the Turkish language, but I used what little I understand about Turkish casing rules when I created the uppercase and lowercase strings.

Edit

As @JonathanLeffler mentioned in the question's comments, a better solution would be to use something like isalpha (or in this case, iswalpha) on each character in string instead of the lower and upper strings of valid characters I used. That, however, only tells you that a character is alphabetic; it doesn't tell you which index of your count array to use. There is no universal way to derive one, because some languages use only a few characters with diacritic marks rather than a contiguous group that would let you write string[c] >= L'à' && string[c] <= L'ç'. In other words, even once you have read the data, you still need to convert it to fit your solution, and that requires knowing what you're working with in order to build a mapping from characters to integer values. My code does this by using strings of valid characters, with each character's index in the string serving as its index in the count array (i.e. lower[29] means count[29]++ is executed, and upper[18] means count[18]++ is executed).

  • Thank you, I'm on Linux (Mac OS X) and using Xcode. First I added this line `wchar_t string[9000];` and `if ( fgetws(string, 9000, plain) != NULL) { fputws(string, plain);}`, no problem here. I added your code above, but with this line all lines get errors: `if (setlocale(LC_CTYPE, "en_US.UTF-8")) != NULL) { abort();}`; the error is: expected expression. If I delete this then I get these errors: `count[chptr-1]++;` **Array subscript is not an integer** and on this line too: `chptr -= (lower-1);` incompatible integer to pointer conversion const wchar_t* (aka const int*) from long – jekyll Dec 28 '16 at 09:36
  • @jekyll I amended my answer. The `setlocale` line had an extra closing parenthesis, and I've hopefully fixed everything else you mentioned. –  Dec 28 '16 at 15:06
  • Thank you, actually I found another way, but this time my output is octal :D `\347 0.006665` :D :))) `\347` is ç. – jekyll Dec 28 '16 at 17:03
  • Thank you, I got it. I wonder how to count two letters? Bigrams, I mean; bigram counts count the frequency of pairs of characters. – jekyll Dec 28 '16 at 19:20
  • @jekyll That should be a new question, but try searching for an answer before asking. Someone probably already asked a similar question. –  Dec 28 '16 at 20:11

The solution depends on the character encoding of your files.

If the file is in ISO 8859-9 (Latin-5), then each special character is still encoded in a single byte, and you can modify your code easily: you already have a distinction between upper case and lower case; just add more branches for the special characters.

If the file is in UTF-8, or some other Unicode encoding, you need a multi-byte capable string library.

Ludwig Schulze
  • So I added `#include <wchar.h>` `wchar_t string[9000];` and I got an error on this line: `if ( fgets(string, 9000, plain) != NULL) { puts(string);` and it still doesn't work. – jekyll Dec 28 '16 at 00:40
  • And what is the character encoding of your file? – Ludwig Schulze Dec 28 '16 at 00:41
  • `setlocale(LC_ALL,"turkish");` @LudwigSchulze – jekyll Dec 28 '16 at 00:42
  • 2
    Absolutely not! What is the output of "file plaintext.txt" – Ludwig Schulze Dec 28 '16 at 00:44
  • 1
    No way! I'm not asking for the file's contents, but for its character encoding. Read https://en.wikipedia.org/wiki/ISO/IEC_8859-9, https://en.wikipedia.org/wiki/UTF-8, and learn your problem domain! – Ludwig Schulze Dec 28 '16 at 00:52
  • my file character encoding: utf_8 @LudwigSchulze – jekyll Dec 28 '16 at 01:26
  • Is this a guess, or do you know for sure? How did you find out it's UTF-8? Because, for accurate counting, you have to be 100% sure what the character encoding of the file is. Note that you can have files with very different character encodings on the same system. – Ludwig Schulze Dec 28 '16 at 01:35
  • Not a guess, I'm 100% sure because I checked it here: [link](https://nlp.fi.muni.cz/projects/chared/) – jekyll Dec 28 '16 at 01:49
  • @jekyll The encoding of a file is chosen by the writer. Detection is guessing with rules and probabilities. _The writer has the responsibility of ensuring the reader knows the encoding._ (If someone is providing you the files, you should just ask.) So, you might know that it is UTF-8, and saying that a detection tool agrees supports that claim, but if you are only using the tool it is still just a guess, and in many cases one sample of a file is consistent with many encodings, but perhaps not with a future version of the file. – Tom Blodget Dec 28 '16 at 23:35