I have this example string in C:

char text_string[100] = "A panqueca americana é provavelmente o caféç da manhã mais famoso dos Estados Unidos.";

I need to find special characters such as "ç" and replace them with their unaccented counterparts (for instance, "ç" would become "c").

I'm really struggling with this. I searched around but couldn't find anything that helps. I tried using strchr to compare the individual chars of the text against the special characters, as shown below, but it didn't work.

char transform_text(char *text_string){
    for(int i=0; i<100; i++){
        if(strchr("ç", text_string[i]) != NULL )
            text_string[i]='c';
    }
}

Any suggestions? Thank you in advance.

Dingo
  • Depends on encoding. What is `printf("%d\n", (int) sizeof("ç"));`? – chux - Reinstate Monica Apr 24 '21 at 15:51
  • `text_string` not used in the function and `list_string` not defined. Post compile-able code. – chux - Reinstate Monica Apr 24 '21 at 15:55
  • Just edited the question. There was a small typo when copying my code into the post: it should be text_string instead of list_string, and that is now corrected in my function. The printf you asked for prints the number 3. – Dingo Apr 24 '21 at 16:18
  • Change `if(strchr("ç", text_string[i]) != NULL )` to `if (text_string[i] == 'ç')` – Nicholas Hunter Apr 24 '21 at 16:32
  • I've tried that but I get an error of "Character too large for enclosing character literal type" – Dingo Apr 24 '21 at 16:36
  • You will want to use the library called ICU or libICU. See http://site.icu-project.org/ – Zan Lynx Apr 24 '21 at 18:57
  • Here is a little thing that does what you want in Python. Should give you ideas of how to do it in C. Basically normalize to NFKD and re-encode into ASCII. https://gist.github.com/tantale/a824fa0948d986d824e6a9965b488d5f – Zan Lynx Apr 24 '21 at 19:06
  • https://stackoverflow.com/questions/177113/utf-8-to-ascii-using-icu-library/1533156#1533156 – Zan Lynx Apr 27 '21 at 15:27

1 Answer

On OP's system, sizeof("ç") is 3: the two bytes 0xc3 and 0xa7 plus the terminating 0x00. So "ç" is not encoded as a single char, and comparing it against one element of text_string at a time can never match.

A common encoding is UTF-8, in which ç (U+00E7, LATIN SMALL LETTER C WITH CEDILLA) is encoded as the two bytes c3 a7.
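To illustrate why a byte-by-byte comparison cannot match ç, here is a minimal sketch that reports how many bytes a UTF-8 sequence occupies, judging by its first byte. The helper name utf8_seq_len is hypothetical (it is not part of the C library), and the sketch assumes the text really is UTF-8; it does not validate continuation bytes or reject overlong forms.

```c
#include <stddef.h>

/* Hypothetical helper: length in bytes of the UTF-8 sequence that starts
 * with `lead`, or 0 if `lead` cannot start a sequence. */
size_t utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80) return 1;            /* 0xxxxxxx: plain ASCII */
    if ((lead & 0xE0) == 0xC0) return 2;  /* 110xxxxx: e.g. 0xc3 of "ç" */
    if ((lead & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 0;                             /* continuation or invalid byte */
}
```

For 'c' this gives 1, while for 0xc3 (the first byte of ç) it gives 2, so a single text_string[i] only ever holds half of the character.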

Instead, search text_string for the string "ç" (for example with strstr()) and replace each match with the shorter string "c".
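One possible way to do that substitution in place is sketched below. The function name replace_all is hypothetical, and the sketch assumes the replacement is never longer than the text it replaces, so the buffer cannot overflow. It also assumes the bytes of "ç" are c3 a7 (UTF-8); writing the pattern as "\xc3\xa7" avoids depending on how the source file itself is saved.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from the original post): replaces every
 * occurrence of `from` in `text` with `to`, in place.
 * Assumes strlen(to) <= strlen(from), so the text never grows.
 * Returns the number of replacements made. */
size_t replace_all(char *text, const char *from, const char *to)
{
    size_t from_len = strlen(from);
    size_t to_len = strlen(to);
    size_t count = 0;

    if (from_len == 0 || to_len > from_len)
        return 0;               /* cannot be done safely in place */

    char *p = text;
    while ((p = strstr(p, from)) != NULL) {
        memcpy(p, to, to_len);  /* overwrite the start of the match */
        /* close the gap; the +1 moves the terminating '\0' as well */
        memmove(p + to_len, p + from_len, strlen(p + from_len) + 1);
        p += to_len;            /* continue after the replacement */
        count++;
    }
    return count;
}
```

With this, a call such as replace_all(text_string, "\xc3\xa7", "c") would handle ç; other accented letters (é, ã, and so on) would each need their own call or a lookup table, which is where a library like ICU becomes attractive.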

chux - Reinstate Monica