
I have made a program in C which can either replace or remove all vowels in a string. I would also like it to work for the characters 'æ', 'ø' and 'å'.

I have tried to use strstr(), but I didn't manage to implement it without replacing all characters on the line containing 'æ', 'ø' or 'å'. I have also read about wchar, but that only seems to complicate everything.

The program is working with this array of chars:

char vowels[6] = {'a', 'e', 'i', 'o', 'u', 'y'};

I tried with this array:

char vowels[9] = {'a', 'e', 'i', 'o', 'u', 'y', 'æ', 'ø', 'å'};

but it gives these warnings:

warning: multi-character character constant [-Wmultichar]

warning: overflow in implicit constant conversion [-Woverflow]

and if I want to replace each vowel with 'a', it replaces 'å' with "�a".

I have also tried with the UTF-8 hex values of 'æ', 'ø' and 'å':

char extended[3] = {"\xc3\xa6", "\xc3\xb8", "\xc3\xa5"};

but it gives this error:

excess elements in char array initializer

Is there a way to make this work without making it too complicated?

  • Please state the standard version you are using, whether you tried C11, and which source/target character encoding your compiler uses. Note that e.g. `UTF-8` (default for gcc) has variable-length characters, so `char` will not be sufficient to hold anything other than ASCII in a single `char` variable. – too honest for this site Sep 21 '15 at 12:19
  • How can I find out which version I'm using? I haven't tried with C11, and I don't know how I would go about doing that. I use this line to compile: > gcc -Wall -g -o filename filename.c – Martin Johansen Sep 21 '15 at 12:24
  • You have to specify yourself. Check the documentation which standard your gcc-version uses by default. (hint: this changed recently). Anyway, you have to use wide chars, but I cannot help you with that - sorry. – too honest for this site Sep 21 '15 at 12:27
  • Instead of `char extended[3] = {"\xc3\xa6", "\xc3\xb8", "\xc3\xa5"};` you should use `char *extended[3] = {"\xc3\xa6", "\xc3\xb8", "\xc3\xa5"};` – Fernando Silveira Sep 21 '15 at 12:28
  • I'm using gcc version 4.8.4 (Ubuntu 4.8.4-2ubuntu1~14.04) I will try that Fernando. – Martin Johansen Sep 21 '15 at 12:36
  • You can select the standard version with `-std=c11`. How well that works with version 4.8.4, I don't know. – Bo Persson Sep 21 '15 at 12:38
  • Try `char extended[3][3] = {"\xc3\xa6", "\xc3\xb8", "\xc3\xa5"};` – MikeCAT Sep 21 '15 at 12:38
  • @MartinJohansen I would really change the program to work with UTF8, because of the reasons stated in [UTF-8 Everywhere](http://utf8everywhere.org/), (the link that Basile Starynkevitch already posted in his answer). – alain Sep 21 '15 at 13:24
  • Those characters can't fit in a `char`. You must use `wchar_t`, `char16_t` or `char32_t`. Read more: [Joel on Software's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)](http://www.joelonsoftware.com/articles/Unicode.html) – phuclv Sep 21 '15 at 13:55
  • @LuuVinhPhuc: No, you don't have to use `wchar_t` (whose width varies from one implementation or OS to another), but you should use UTF-8 multibyte `char` like I did in my answer. – Basile Starynkevitch Sep 21 '15 at 14:05

2 Answers


There are two approaches to making those characters usable. The first is code pages, which would let you use extended ASCII characters (values 128-255), but code pages are system- and locale-dependent, so they are a bad idea in general.

The better alternative is to use Unicode. The typical approach with Unicode is to use wide character literals, as in this post:

wchar_t str[] = L"αγρω";

The key problem with your code is that you're mixing ASCII `char` data with UTF-8 data. The solution is simple: convert all your literals, as well as your strings, to their wide-character equivalents. You need to work with a single common encoding rather than mixing encodings, unless you have conversion functions to help out.

Cloud
  • I made this work by doing these replacements in my code: char -> wchar_t, strcpy() -> wcscpy(), strlen() -> wcslen(), printf("%s", str) -> printf("%ls", str). I'm only missing a replacement for getline(). – Martin Johansen Sep 21 '15 at 13:07
  • There are no "extended ASCII characters". "Code pages" are specific to one family of operating systems. There is absolutely no problem whatsoever comparing ASCII with UTF8, as UTF8 is specifically designed to be ASCII-compatible. – n. m. could be an AI Sep 21 '15 at 13:12
  • @n.m. I beg to differ. https://en.wikipedia.org/wiki/Extended_ASCII *Extended ASCII (or high ASCII) is eight-bit or larger character encodings that include the standard seven-bit ASCII characters as well as others. The use of the term is sometimes criticized,[1][2][3] because it can be mistakenly interpreted that the ASCII standard has been updated to include more than 128 characters or that the term unambiguously identifies a single encoding, both of which are untrue.* – Cloud Sep 21 '15 at 13:25
  • I believe that on Linux using UTF-8 `char` is much better than `wchar_t` – Basile Starynkevitch Sep 21 '15 at 13:26
  • Basile, how would you make that work with letters like 'æ', 'ø' or 'å'? – Martin Johansen Sep 21 '15 at 13:31
  • @Dogbert Please note how Wikipedia says *The use of the term is sometimes criticized* (and lists the reasons why). Now you have encountered someone who criticizes the use of the term (myself). Where's a contradiction? – n. m. could be an AI Sep 21 '15 at 13:57
  • @BasileStarynkevitch depends on what you are doing. For character-level work, like scanning words for vowels, wchar_t is much easier. – n. m. could be an AI Sep 21 '15 at 14:08
  • Not much easier, and for scanning words that you have got in UTF-8, converting all the input to `wchar_t` is inefficient and error prone... – Basile Starynkevitch Sep 21 '15 at 14:09
  • @n.m. Because they **do** indeed exist. Just because Windows uses code pages and Linux uses locales, doesn't mean extended ASCII chars don't exist. The term is criticized because it seems to indicate ASCII supports `char` values above 127, or that the values on the range `[128,255]` are the same from system to system. The term itself is criticized for incorrect assumptions it causes readers to infer, not the validity of the existence of the term. As a counterexample to your point, `æ` maps to 145 in extended ASCII, but 230 in UTF8. Extended ASCII doesn't map to unicode equivalents. – Cloud Sep 21 '15 at 14:09
  • @BasileStarynkevitch Yes. What I'm getting at is that extended ASCII exists, just as a logical mapping to a specific set of 128 characters that change from platform to platform depending on locale settings, and that it doesn't map to UTF8 as `nm` noted before editing his/her comment. – Cloud Sep 21 '15 at 14:15
  • @Dogbert æ maps to 145 in a specific encoding called ISO8859-1. There are many encodings and charsets that can equally be called "extended ASCII" and æ is not in.most of them. **Which is exactly the reason why the term should never be used**. – n. m. could be an AI Sep 21 '15 at 14:17
  • I believe that using Unicode/UCS4 `wchar_t` is worse than [UTF8everywhere](http://utf8everywhere.org/) `char`-s so I downvoted that answer – Basile Starynkevitch Sep 21 '15 at 14:31
  • @BasileStarynkevitch "inefficient and error prone" can't really see how either of this is true. – n. m. could be an AI Sep 21 '15 at 14:57
  • @n.m. Extended ASCII's real meaning is simply the use of the upper/eighth bit to index into additional character maps, nothing more. The fact that a specific character exists in multiple character sets is irrelevant, and has no bearing on the actual definition of "extended ASCII". – Cloud Sep 21 '15 at 15:51
  • @Dogbert The term doesn't imply which map is to be used, only that there's an unspecified map from an unspecified set of characters to numbers 128-255. Why ever use such a vague term if you can just call the map by its name like ISO8859-15? – n. m. could be an AI Sep 21 '15 at 15:56
  • @n.m. Because I'm trying to draw attention to the general concept with respect to TC's post, and make a distinction between extended ASCII and code pages, rather than treat the two synonymously, as they are distinct concepts, despite them working side-by-side typically. – Cloud Sep 21 '15 at 16:25

Learn about UTF-8 (including its relationship to Unicode) and use some UTF-8 library: libunistring, utfcpp, Glib from GTK, ICU ....

You need to understand which character encoding you are using.

I strongly recommend UTF-8 in all cases (it is the default on most Linux systems, and on nearly all of the Internet and web servers; read locale(7) & utf8(7)). Read utf8everywhere....

I don't recommend wchar_t, whose width, range and signedness are implementation-specific (you can't be sure that all of Unicode fits in a wchar_t; on Windows it does not, since wchar_t is only 16 bits there). Also, converting UTF-8 input to Unicode/UCS-4 can be more time-consuming than handling the UTF-8 directly...

Do understand that in UTF-8 a character can be encoded in several bytes. For example, ê (French lower-case e with circumflex accent) is encoded as the two bytes 0xc3, 0xaa, and ы (Russian lower-case yery) as the two bytes 0xd1, 0x8b. Both are considered vowels, yet neither fits in one char (which is an 8-bit byte on your machine and mine).

The notion of a vowel is complicated (e.g. what are the vowels in Russian, Arabic, Japanese, Hebrew, Cherokee, Hindi, ...?), so there might be no simple solution to your problem (UTF-8 also has combining characters).

Are you sure that æ and œ are letters or vowels? (FWIW, å & œ & æ are classified as letters and lowercase in Unicode.) I was taught in French elementary school that they are ligatures (and French dictionaries don't list them as letters, so œuf appears in a dictionary at the place of oeuf, which means egg). But I am not an expert on this. See strcoll(3).

On Linux, since UTF-8 is the default encoding (and it is increasingly hard to get any other one on recent distributions), I don't recommend using wchar_t; use UTF-8 chars instead (i.e. functions handling multi-byte UTF-8), for example (using GLib's UTF-8 & Unicode functions):

 #include <assert.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <glib.h>

 unsigned count_norwegian_lowercase_vowels(const char *s) {
   assert(s != NULL);
   // s should be a not-too-big string
   // (its `strlen` should be less than UINT_MAX)
   // s is assumed to be UTF-8 encoded, and should be valid UTF-8:
   if (!g_utf8_validate(s, -1, NULL)) {
     fprintf(stderr, "invalid UTF-8 string %s\n", s);
     exit(EXIT_FAILURE);
   }
   unsigned count = 0;
   for (const char *pc = s; *pc != '\0'; pc = g_utf8_next_char(pc)) {
     gunichar u = g_utf8_get_char(pc);
     // comments from the OP make me believe these are the only Norwegian vowels.
     if (u == 'a' || u == 'e' || u == 'i' || u == 'o' || u == 'u' || u == 'y'
         || u == (gunichar)0xe6   // æ U+00E6 LATIN SMALL LETTER AE
         || u == (gunichar)0xf8   // ø U+00F8 LATIN SMALL LETTER O WITH STROKE
         || u == (gunichar)0xe5)  // å U+00E5 LATIN SMALL LETTER A WITH RING ABOVE
       /* notice that for me ы & ê are also vowels but œ is a ligature ... */
       count++;
   }
   return count;
 }

I'm not sure the name of my function is correct, but you told me in the comments that Norwegian (which I don't know) has no more vowel characters than those my function counts.

It is on purpose that I did not put UTF-8 in literal strings or wide char literals (only in comments). There are other obsolete character encodings (read about EBCDIC or KOI8) and you might want to cross-compile the code.

Basile Starynkevitch
  • I understand that UTF-8 can be several bytes, and I think that's the reason why 'å' was replaced with "�a". 'æ', 'ø', and 'å' are vowels in the norwegian and danish language. 'æ' is the sound a sheep make (baa) w.o. the 'b', 'ø' sounds like "uhh" and 'å' sounds like "oh". But the program doesn't have to work for every language, only norwegian :) – Martin Johansen Sep 21 '15 at 13:28
  • It says in the title. – Martin Johansen Sep 21 '15 at 13:39
  • Norvegian is not mentioned in the title or in the question. Languages have much more vowels than you think. ы & ê are obviously vowels, but you wrongly believe they are not. And I won't dare speaking about vowels in Hebrew or Arabic or Japanese or Cherokee, but I do know it is a tricky subject. – Basile Starynkevitch Sep 21 '15 at 13:48
  • how-to-do-operations-with-æ-ø-and-å-in-c. Maybe the title is bad. – Martin Johansen Sep 21 '15 at 13:51
  • @BasileStarynkevitch It's quite simple, really. None of these letters are vowels. Vowels are *sounds*. Letters relate to sounds in complex ways, there is often no 1:1 mapping. – n. m. could be an AI Sep 21 '15 at 14:11
  • @n.m. you should also convince the OP, [Martin Johansen](http://stackoverflow.com/users/5358860/martin-johansen). However, in elementary school, I (and my children and grandchildren) was taught that a e i o u y are all the vowels in French. – Basile Starynkevitch Sep 21 '15 at 14:16
  • @BasileStarynkevitch yeah, in the elementary school they tend to teach that. Not universally though. It's a simplified approach that works relatively well for some languages, not so well for others. – n. m. could be an AI Sep 21 '15 at 15:04
  • My point (to the OP, not to you `n.m`) is that the notion of vowels (in Unicode) is probably very complex. – Basile Starynkevitch Sep 21 '15 at 17:42