Comparing and replacing accented/special characters in C

Question

So I have this example string in C which contains the following:

char text_string[100] = "A panqueca americana é provavelmente o caféç da manhã mais famoso dos Estados Unidos.";

I need to find special characters such as "ç" and replace them with their unaccented counterparts (for instance, "ç" would become "c").

I'm really struggling with this. I searched around but couldn't find anything that helps. I tried using strchr to compare the individual chars of the text to the special characters, as shown below, but it didn't work.

void transform_text(char *text_string){
    for(int i=0; i<100; i++){
        if(strchr("ç", text_string[i]) != NULL )
            text_string[i]='c';
    }
}

Any suggestions? Thank you in advance.

Depends on encoding. What is `printf("%d\n", (int) sizeof("ç"));`? — chux - Reinstate Monica, Apr 24 '21 at 15:51
`text_string` not used in the function and `list_string` not defined. Post compile-able code. — chux - Reinstate Monica, Apr 24 '21 at 15:55
Just edited the answer, little typo when copying what i had into my post, it was text_string instead of list_string but that's corrected in my function. The printf you asked for prints the number 3. — Dingo, Apr 24 '21 at 16:18
Change `if(strchr("ç", text_string[i]) != NULL )` to `if (text_string[i] == 'ç')` — Nicholas Hunter, Apr 24 '21 at 16:32
I've tried that but I get an error of "Character too large for enclosing character literal type" — Dingo, Apr 24 '21 at 16:36
You will want to use the library called ICU or libICU. See http://site.icu-project.org/ — Zan Lynx, Apr 24 '21 at 18:57
Here is a little thing that does what you want in Python. Should give you ideas of how to do it in C. Basically normalize to NFKD and re-encode into ASCII. https://gist.github.com/tantale/a824fa0948d986d824e6a9965b488d5f — Zan Lynx, Apr 24 '21 at 19:06
https://stackoverflow.com/questions/177113/utf-8-to-ascii-using-icu-library/1533156#1533156 — Zan Lynx, Apr 27 '21 at 15:27

score 2 · Accepted Answer · answered Apr 24 '21 at 18:08

On OP's system, sizeof("ç") is 3 (e.g. the bytes 0xc3, 0xa7, 0x00), so "ç" is not encoded as a single char.

A common encoding is UTF-8, in which U+00E7 (ç, LATIN SMALL LETTER C WITH CEDILLA) is encoded as the two bytes c3 a7.
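To make the multi-byte point concrete, here is a minimal sketch that reads the length of a UTF-8 sequence off its first byte. The helper name `utf8_seq_len` is hypothetical (not a standard function), and it assumes the text really is UTF-8. It only inspects the lead-byte bit pattern and does not fully validate the sequence.

```c
#include <stddef.h>

/* Hypothetical helper: number of bytes in the UTF-8 sequence that starts
 * with `lead`, or 0 if `lead` is a continuation byte or not a lead byte.
 * Only the bit pattern is checked; overlong or out-of-range encodings
 * are not rejected. */
size_t utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80)
        return 1;                 /* 0xxxxxxx: plain ASCII */
    if ((lead & 0xE0) == 0xC0)
        return 2;                 /* 110xxxxx: two-byte sequence */
    if ((lead & 0xF0) == 0xE0)
        return 3;                 /* 1110xxxx: three-byte sequence */
    if ((lead & 0xF8) == 0xF0)
        return 4;                 /* 11110xxx: four-byte sequence */
    return 0;                     /* 10xxxxxx continuation or invalid */
}
```

For the lead byte 0xc3 of "ç" this reports a two-byte sequence, which is why comparing one char at a time against "ç" cannot work.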

Instead look for the string "ç" inside text_string and substitute with the shorter string "c".
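One way this substring replacement could look, as a sketch rather than a definitive implementation: the function name `replace_shorter` is made up for illustration, and it assumes the replacement is never longer than the text it replaces, so the edit can be done in place without growing the buffer.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: replace every occurrence of `from` in `s` with `to`,
 * in place. Assumes strlen(to) <= strlen(from), so the string can only
 * shrink. Returns the number of replacements made. */
size_t replace_shorter(char *s, const char *from, const char *to)
{
    size_t from_len = strlen(from);
    size_t to_len = strlen(to);
    size_t count = 0;

    /* Refuse inputs that would loop forever or overflow the buffer. */
    if (from_len == 0 || to_len > from_len)
        return 0;

    char *p = s;
    while ((p = strstr(p, from)) != NULL) {
        /* Overwrite the start of the match with the replacement. */
        memcpy(p, to, to_len);
        /* Close the gap: shift the tail, including the terminating NUL. */
        memmove(p + to_len, p + from_len, strlen(p + from_len) + 1);
        p += to_len;
        count++;
    }
    return count;
}
```

With the question's buffer, a call such as `replace_shorter(text_string, "ç", "c")` would do the substitution, provided the source file and the string data use the same encoding. Other accented letters would each need their own call, or a lookup table of pairs.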
