Apologies for the newbie nature of the question, but I've been chasing my tail for days.

I need to create a function that verifies whether a received buffer is actually encoded as UTF-8, and then applies a basic regex to exclude unwanted control characters.
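
To make the goal concrete, here is a rough sketch of the kind of check I have in mind. The names `is_valid_utf8` and `has_unwanted_control` are placeholders I made up, the control-character test is a plain loop rather than a regex, and the set of "unwanted" characters (C0 controls other than tab, LF and CR, plus DEL) is only an assumption:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: true if `buf` is well-formed UTF-8.
// Rejects stray or missing continuation bytes, overlong forms,
// surrogates, and code points above U+10FFFF.
bool is_valid_utf8(const std::string& buf) {
    const std::size_t n = buf.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(buf[i]);
        std::size_t len = 0;
        std::uint32_t cp = 0;
        if (b0 < 0x80)                { len = 1; cp = b0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
        else return false;                    // invalid lead byte
        if (n - i < len) return false;        // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(buf[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        // Smallest code point allowed for each sequence length.
        static const std::uint32_t min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
        if (cp < min_cp[len]) return false;               // overlong encoding
        if (cp > 0x10FFFF) return false;                  // beyond Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;   // UTF-16 surrogate
        i += len;
    }
    return true;
}

// Hypothetical helper: true if `buf` contains a C0 control character
// other than tab, LF or CR, or the DEL character (assumed policy).
bool has_unwanted_control(const std::string& buf) {
    for (unsigned char b : buf) {
        if ((b < 0x20 && b != '\t' && b != '\n' && b != '\r') || b == 0x7F) {
            return true;
        }
    }
    return false;
}
```

Every byte is converted to `unsigned char` before it is inspected, so the bit masks see values in the range 0 to 255 regardless of whether plain `char` is signed on the platform.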

I started by dumping the following byte sequences:

     0x62
     0xCDBC
     0xE0AC89
     0xF09F8489

into a test file.

That worked fine: I copied the file around, and text editors on Windows, Linux and Mac can read it and display the correct characters.

But when I try to read it back into my test function, it dies. I added this loop to see what is being read:

     char c = fs->get();
     while(fs->good())
     {
          int len = sizeof(c);
          printf("0x%X        ---    %i\n",c,len);
          c = fs->get();
     }

Yes, I know the code sucks.

But what I don't understand is why I'm getting this output:

Hex               ---    sizeof()
0x26              ---    1     
0xFFFFFFCD        ---    1     
0xFFFFFFBC        ---    1    
0xFFFFFFE0        ---    1     
0xFFFFFFAC        ---    1     
0xFFFFFF89        ---    1     
0xFFFFFFF0        ---    1     
0xFFFFFF9F        ---    1     
0xFFFFFF84        ---    1     
0xFFFFFFB9        ---    1     

The 0x62 becomes 0x26, whilst all the other bytes are correct but padded out to a 32-bit pattern (0xFFFFFFxx)...?

The locale is EN_en.utf8.

I'm lost. Any ideas out there?

Thanks in advance, Bob.

BlackBeard
Bob Bit