Since you are apparently working with a UTF-8 file, the answer will depend upon your execution platform:
If you're on Linux, setlocale(LC_CTYPE, "en_US.UTF-8") or something similar should work, but the important part is the UTF-8 at the end! The language shouldn't matter. You can verify that it worked by checking the return value:
    /* needs <locale.h> for setlocale and <stdlib.h> for abort */
    if (setlocale(LC_CTYPE, "en_US.UTF-8") == NULL) {
        abort();
    }
If the locale can't be set, setlocale returns NULL and abort() stops the program. Any code that runs after that check can therefore assume the locale was set correctly.
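Since only the UTF-8 part matters, one option is to try a few locale names in turn instead of aborting on the first failure. This is only a sketch: set_utf8_ctype is a name I made up, and the candidate list is an assumption, because which locales are installed varies from system to system.

```c
#include <locale.h>
#include <stddef.h>

/* Try a few UTF-8 locale names in turn for LC_CTYPE.
   Returns the name that worked, or NULL if none of them is available.
   The candidate names are assumptions; adjust them for your system. */
const char *set_utf8_ctype(void)
{
    static const char *const candidates[] = { "en_US.UTF-8", "C.UTF-8" };
    size_t n = sizeof candidates / sizeof candidates[0];

    for (size_t i = 0; i < n; i++) {
        if (setlocale(LC_CTYPE, candidates[i]) != NULL) {
            return candidates[i];
        }
    }
    return NULL; /* no UTF-8 locale could be set */
}
```

You would call it once at startup and treat a NULL result the same way as the abort() case.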
If you're on Windows, you can instead open the file using fopen("myfile.txt", "rt, ccs=UTF-8"). The ccs flag is an extension of Microsoft's C runtime, so this isn't portable to other platforms. It's a lot cleaner than the alternatives, though, which is likely more important in this particular case.
If you're using a system that doesn't allow either approach (e.g. a FreeBSD installation or another platform with no UTF-8 locales available), you'd need to parse the bytes manually or use a library to convert them for you. If your implementation has an iconv() function, you might be able to use it to convert from UTF-8 to ISO-8859-9, so that each of your special characters becomes a single byte.
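If you do end up parsing the bytes manually, decoding a single UTF-8 sequence could look roughly like this. It's a minimal sketch rather than a hardened decoder, and utf8_decode is a hypothetical helper name, not a standard function.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence from s (n bytes available) into *out.
   Returns the number of bytes consumed, or 0 if the input is
   truncated or not valid UTF-8. */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *out)
{
    if (n == 0) return 0;

    unsigned char b0 = s[0];
    size_t len;
    uint32_t cp, min;

    if (b0 < 0x80)                { *out = b0; return 1; }  /* plain ASCII */
    else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min = 0x10000; }
    else return 0;  /* stray continuation byte or invalid lead byte */

    if (n < len) return 0;  /* sequence is cut off */
    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;  /* not a continuation byte */
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    /* reject overlong encodings, surrogates and out-of-range values */
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return 0;

    *out = cp;
    return len;
}
```

For example, the two bytes 0xC4 0x9F decode to U+011F (ğ), which you can then compare against the code points you care about.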
Once you're ready to read the file, you can use fgetws with a wchar_t array.
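A small wrapper around fgetws might look like the following. It's a sketch under a few assumptions: read_wide_line is a name I invented, the caller has already set a UTF-8 locale and opened the file, and a line fits into the buffer (a longer line is simply returned in pieces).

```c
#include <stdio.h>
#include <wchar.h>

/* Read one line from fp into buf (room for cap wide characters),
   strip the trailing newline, and return the line length.
   Returns (size_t)-1 at end of file or on a read/conversion error. */
size_t read_wide_line(FILE *fp, wchar_t *buf, int cap)
{
    if (cap <= 0 || fgetws(buf, cap, fp) == NULL) {
        return (size_t)-1;
    }
    size_t len = wcslen(buf);
    if (len > 0 && buf[len - 1] == L'\n') {
        buf[--len] = L'\0';  /* drop the newline fgetws keeps */
    }
    return len;
}
```

Typical use would be to call it in a loop on a FILE * from fopen("myfile.txt", "r") with something like wchar_t line[256], stopping when it returns (size_t)-1.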
Another problem is checking whether the character you've read is one of your non-ASCII characters. You could do something like this:
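    // lower = "abcdefghijklmnopqrstuvwxyzçöüğı"
    // upper = "ABCDEFGHİJKLMNOPQRSTUVWXYZÇÖÜĞI"
    const wchar_t lower[] = L"abcdefghijklmnopqrstuvwxyz\u00E7\u00F6\u00FC\u011F\u0131";
    const wchar_t upper[] = L"ABCDEFGH\u0130JKLMNOPQRSTUVWXYZ\u00C7\u00D6\u00DC\u011EI";
    // wcschr (from <wchar.h>) returns a pointer to the match, or NULL if there is none.
    // It also matches the terminating L'\0', so don't pass it the end of string.
    const wchar_t *lchptr = wcschr(lower, string[c]);
    const wchar_t *uchptr = wcschr(upper, string[c]);
    if (lchptr) {
        count[(size_t)(lchptr-lower)]++;   // offset into lower is the count index
        bahar++;
    } else if (uchptr) {
        count[(size_t)(uchptr-upper)]++;   // same index as the lowercase partner
        bahar++;
    }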
That code assumes you're counting characters without regard for case (case insensitive). That is, ı (\u0131) and I are considered the same character (count[30]++), just like İ (\u0130) and i are considered the same (count[8]++). I won't claim to know much about the Turkish language, but I used what little I understand about Turkish casing rules when I created the uppercase and lowercase strings.
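To make the mapping easier to check, here is the same lookup wrapped in self-contained functions. Treat it as a sketch: letter_index, count_letters and LETTER_COUNT are names I made up, and the count array is assumed to have 31 elements, one per letter in the tables.

```c
#include <stddef.h>
#include <wchar.h>

#define LETTER_COUNT 31  /* 26 ASCII letters + 5 extra letters */

/* Return the count[] index for ch (0..30), or -1 if ch is in neither table. */
int letter_index(wchar_t ch)
{
    static const wchar_t lower[] = L"abcdefghijklmnopqrstuvwxyz\u00E7\u00F6\u00FC\u011F\u0131";
    static const wchar_t upper[] = L"ABCDEFGH\u0130JKLMNOPQRSTUVWXYZ\u00C7\u00D6\u00DC\u011EI";
    const wchar_t *p;

    if (ch == L'\0') return -1;  /* wcschr would otherwise match the terminator */
    if ((p = wcschr(lower, ch)) != NULL) return (int)(p - lower);
    if ((p = wcschr(upper, ch)) != NULL) return (int)(p - upper);
    return -1;
}

/* Tally every recognized letter of s into count[]; return how many were tallied. */
size_t count_letters(const wchar_t *s, size_t count[LETTER_COUNT])
{
    size_t total = 0;
    for (; *s != L'\0'; s++) {
        int idx = letter_index(*s);
        if (idx >= 0) {
            count[idx]++;
            total++;
        }
    }
    return total;
}
```

With these tables, i and İ both land on index 8, while ı and I both land on index 30.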
Edit
As @JonathanLeffler mentioned in the question's comments, a better solution would be to use something like isalpha (or in this case, iswalpha) on each character in string instead of the lower and upper strings of valid characters I used. This, however, would only tell you that the character is alphabetic; it wouldn't tell you which index of your count array to use. The truth is that there is no universal way to get that index, because some languages use only a few characters with diacritic marks rather than an entire contiguous group where you can just do string[c] >= L'à' && string[c] <= L'ç'. In other words, even once you have read the data, you still need to convert it to fit your solution, and that requires knowing what you're working with so that you can create a mapping from characters to integer values. My code does this by using strings of valid characters and the index of each character in its string as the index into the count array (i.e. a match at lower[29] means count[29]++ is executed, and a match at upper[18] means count[18]++ is executed).
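For comparison, a check based on iswalpha could be sketched like this (count_alpha is a hypothetical name). Note that it only yields a yes/no answer per character, and which non-ASCII characters count as alphabetic depends on the LC_CTYPE locale in effect when it runs.

```c
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Count the characters of s that iswalpha() classifies as alphabetic.
   The classification depends on the current LC_CTYPE locale. */
size_t count_alpha(const wchar_t *s)
{
    size_t n = 0;
    for (; *s != L'\0'; s++) {
        if (iswalpha((wint_t)*s)) {
            n++;
        }
    }
    return n;
}
```

This is enough for a total letter count, but you would still need a table like lower and upper to decide which count element each letter belongs to.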