
Several years ago, I wrote a program in C++ to process text files. I now need it to work with Unicode encoding. I'm not a professional programmer, so I would appreciate precise instructions. Here is the relevant part of the code:

```cpp
register char  c, d, e, f, *p, *q1=NULL, *q2=NULL;
char bufferhead[10000], tmp[100];
register char *buffer, *tmpbuffer;

c=fgetc(infile);
if(c==EOF){...}
if(c=='\n'){...}
fputc(c, outfile);

if(c==*q) return(1);

while(c!=' ' && c!=EOF && c!='\n' && j<90){
    tmp[j]=c;
    j++;
    c=fgetc(infile);
}
tmp[j]=0;
fputs(tmp, outfile);

*buffer=c;
buffer++;

switch(*buffer){
case '}':
    fputc('{', outfile);
    etc.
```
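
For readers who want to compile the fragment in isolation, here is a minimal self-contained sketch of the word-copying loop above. The buffer size, the 90-character limit and the space/newline delimiters come from the fragment; the file names in `main()` are hypothetical, and the loop stores the result of `fgetc()` in an `int` rather than a `char`, a correction discussed in the comments below.

```cpp
#include <cstdio>

// Sketch of the word-copying loop from the question, with the fgetc() result
// kept in an int so that EOF remains distinguishable from data bytes.
int copy_word(FILE *infile, FILE *outfile)
{
    char tmp[100];
    int  j = 0;
    int  c = fgetc(infile);

    while (c != ' ' && c != EOF && c != '\n' && j < 90) {
        tmp[j++] = (char)c;          // collect the word, byte by byte
        c = fgetc(infile);
    }
    tmp[j] = '\0';                   // terminate the collected word
    fputs(tmp, outfile);
    return c;                        // the delimiter that ended the word, or EOF
}

int main(void)
{
    FILE *in  = fopen("infile.txt", "rb");   // hypothetical file names
    FILE *out = fopen("outfile.txt", "wb");
    if (!in || !out) return 1;
    while (copy_word(in, out) != EOF)
        fputc(' ', out);                     // re-emit a separator (demo only)
    fclose(in);
    fclose(out);
    return 0;
}
```
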
asked by MikeTeX; edited by Jonathan Leffler
  • If you provide Unicode as UTF-8 you're done. One of the advantages of UTF-8 (compared to other encodings for Unicode) is that it usually can be processed by code which was developed for ASCII encoding. (Actually, it was designed with this feature in mind.) – Scheff's Cat Jun 28 '20 at 11:07
  • Sorry, I don't understand. Do you mean that I have nothing to do if I use UTF-8 and everything will work as is? – MikeTeX Jun 28 '20 at 11:18
  • `register char c .... if(c==EOF)` This cannot work. This could have never worked. If you need to adapt this program to any encoding you can stop now, because it will continue not working with any encoding exactly as before. [See this](https://stackoverflow.com/questions/43171841/file-handling-in-c-programming-what-is-the-difference-between-below-two-codes/43171863#43171863). – n. m. could be an AI Jun 28 '20 at 11:20
  • Strange, my code does work, but maybe this line is useless after all, and this is why I never noticed this bug. – MikeTeX Jun 28 '20 at 11:23
  • @n.'pronouns'm. `char` isn't necessarily unsigned, so comparing it to a negative value can work. Both GCC and Clang treat it as a signed 8-bit integer (e.g. `char(255) == -1` will evaluate to true). – IlCapitano Jun 28 '20 at 11:26
  • Yes, I've just tested it. `char(255) == -1` evaluates to true with my compiler. – MikeTeX Jun 28 '20 at 11:55
  • Unicode is the standard (enumerating all internal characters in a unique fashion). The characters are named code points. To use Unicode, you have to support one of the Unicode encodings UTF-8, UTF-16 (big-endian or little-endian), UTF-32, etc. UTF-8 encoding is very popular because it's based on 8-bit characters (it can be stored in arrays of `char`). Additionally, the first 128 code points of UTF-8 are encoded identically to ASCII. All other code points are encoded with bit 7 set in every byte. Hence, UTF-8 can usually be processed with tools developed for ASCII [a byte-level sketch of this appears after the comments]. – Scheff's Cat Jun 29 '20 at 05:37
  • FYI: [Unicode](https://en.wikipedia.org/wiki/Unicode) and [The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)](https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/). – Scheff's Cat Jun 29 '20 at 05:40
  • If you want to process the "ASCII part" of Unicode the way you did before (and just pass all the other code points through), then you would be done with UTF-8. If you want to process e.g. all letters of Unicode, then this was the wrong advice, of course. While ASCII has 26 capital letters and 26 lower-case ones, Unicode has thousands for the diverse languages all over the world. In this case, I would recommend a suitable library, e.g. [ICU](http://site.icu-project.org/) [see the ICU sketch after these comments]. FYI: [SO: Small open source Unicode library for C/C++](https://stackoverflow.com/q/745536/7478597) – Scheff's Cat Jun 29 '20 at 05:44
  • _Unicode is the standard (enumerating all internal characters in a unique fashion)._ Ouch! It should have been "Unicode is the standard (enumerating all _international_ characters in a unique fashion)." What a stupid typo... :-) – Scheff's Cat Jun 29 '20 at 05:48
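
A minimal sketch of the byte-level argument made in the comments above, assuming the input file is UTF-8 and that only ASCII characters (space, newline, braces, ...) are ever tested for: every byte of a multi-byte UTF-8 character has bit 7 set, so such tests can never match in the middle of a non-ASCII character, and the bytes can simply be copied through unchanged. The file names are hypothetical.

```cpp
#include <cstdio>

int main(void)
{
    FILE *in  = fopen("input-utf8.txt",  "rb");
    FILE *out = fopen("output-utf8.txt", "wb");
    if (!in || !out) return 1;

    long words   = 0;
    int  in_word = 0;
    int  c;                          // int, so EOF stays representable

    while ((c = fgetc(in)) != EOF) {
        unsigned char b = (unsigned char)c;

        if (b == ' ' || b == '\n') {
            // Real ASCII delimiter. Bytes belonging to multi-byte UTF-8
            // characters are all >= 0x80, so they never reach this branch.
            in_word = 0;
        } else if (!in_word) {
            in_word = 1;
            ++words;                 // one count per word, ASCII or not
        }
        fputc(b, out);               // copy every byte through unchanged
    }
    fprintf(stderr, "%ld words copied\n", words);

    fclose(in);
    fclose(out);
    return 0;
}
```

This is enough as long as the program only ever compares against ASCII values, as the posted fragment does; anything that has to understand the non-ASCII characters themselves (case conversion, "is this a letter?") needs real Unicode support, as sketched next.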

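Where the program does need to look inside non-ASCII characters, a Unicode library is the usual route, as the last comment suggests. The sketch below uses ICU to count the letters in a UTF-8 string; it assumes the ICU headers are installed and the program is linked against ICU's common library (typically `-licuuc`), and the sample text is only an illustration.

```cpp
#include <unicode/unistr.h>   // icu::UnicodeString
#include <unicode/uchar.h>    // u_isalpha()
#include <cstdio>
#include <string>

int main()
{
    // "Grüße" written as explicit UTF-8 bytes, so the source file's own
    // encoding does not matter.
    std::string utf8 = "Gr\xC3\xBC\xC3\x9F" "e";

    // Decode the UTF-8 bytes into ICU's internal UTF-16 representation.
    icu::UnicodeString text = icu::UnicodeString::fromUTF8(utf8);

    int letters = 0;
    for (int32_t i = 0; i < text.length(); i = text.moveIndex32(i, 1)) {
        UChar32 cp = text.char32At(i);   // a full code point, not a UTF-16 unit
        if (u_isalpha(cp))               // Unicode-aware "is this a letter?"
            ++letters;
    }
    std::printf("%d letters\n", letters);  // 5 for this sample
    return 0;
}
```
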
0 Answers