3

I am new to C++. I need to determine the encoding of a file supplied by the user, but I don't know how to check a file's encoding. What I need is to print whether the file is Unicode, ANSI, Unicode big endian, or UTF-8. I have searched a lot but could not find a solution. So far, all I have done is open the file:

#include "stdafx.h"
#include <iostream>
#include <stdio.h>
#include<conio.h>
#include <fstream>
using namespace std;



int _tmain(int argc, _TCHAR* argv[])
{
    fstream f;
    f.open("c:\\abc.txt", fstream::in | fstream::out); /* Read-write. The backslash must be escaped. */


    getch();
    return 0;
}

So please, can anyone show me code that solves this?

What if I am reading a file saved by Notepad?

Thanks in advance.

Deanie
Anonymous

6 Answers

5

You cannot.

The best you can do is guess, or store the encoding as part of your file structure (if you can).

oleksii
  • Why does Notepad++ always know how to show the txt file with the right format? – michaeltang Feb 18 '14 at 11:27
  • 4
    It doesn't! It makes pretty good guesses with English text, but I have seen it fail many times with non-English sources. Russian, for example, can be Windows-1251 or KOI8-R (among several others), so I had to go to Encoding -> Character Set -> Cyrillic and try a couple before I could read the text. – oleksii Feb 18 '14 at 11:33
2

As discussed here, the only thing you can do is guess, running the checks in the order most likely to rule out invalid matches early.

You should check, in this order:

  • Is there a UTF-16 BOM at the beginning? Then it's probably UTF-16. Use the BOM as indicator whether it's big endian or little endian, then check the rest of the file whether it conforms.
  • Is there a UTF-8 BOM at the beginning? Then it's probably UTF-8. Check the rest of the file.
  • If the above didn't result in a positive match, check if the entire file is valid UTF-8. If it is, it's probably UTF-8.
  • If the above didn't result in a positive match, it's probably ANSI.
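The checks listed above could be sketched in C++ roughly as follows. This is a minimal sketch, not a definitive detector: the names `read_file`, `is_valid_utf8` and `guess_encoding` are made up for illustration, the UTF-16 branch trusts the BOM without checking that the rest of the file conforms, and "ANSI" simply means "nothing else matched".

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical helper: read the whole file as raw bytes (binary mode).
std::vector<unsigned char> read_file(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    return std::vector<unsigned char>(std::istreambuf_iterator<char>(in),
                                      std::istreambuf_iterator<char>());
}

// Hypothetical helper: true if data[start..] is well-formed UTF-8.
bool is_valid_utf8(const std::vector<unsigned char>& data, std::size_t start = 0)
{
    std::size_t i = start;
    while (i < data.size()) {
        const unsigned char b = data[i];
        std::size_t extra;                            // continuation bytes expected
        if (b <= 0x7F)                   extra = 0;
        else if (b >= 0xC2 && b <= 0xDF) extra = 1;
        else if (b >= 0xE0 && b <= 0xEF) extra = 2;
        else if (b >= 0xF0 && b <= 0xF4) extra = 3;
        else return false;                            // stray continuation or invalid lead byte
        if (i + extra >= data.size()) return false;   // sequence cut off at end of file
        if (extra >= 2) {
            // Reject overlong forms, UTF-16 surrogates and values above U+10FFFF.
            const unsigned char s = data[i + 1];
            if ((b == 0xE0 && s < 0xA0) || (b == 0xED && s > 0x9F) ||
                (b == 0xF0 && s < 0x90) || (b == 0xF4 && s > 0x8F))
                return false;
        }
        for (std::size_t k = 1; k <= extra; ++k)
            if ((data[i + k] & 0xC0) != 0x80) return false;
        i += extra + 1;
    }
    return true;
}

// Apply the checks in the order given above and return a best guess.
std::string guess_encoding(const std::vector<unsigned char>& d)
{
    // 1. UTF-16 BOM (the rest of the file is NOT validated in this sketch).
    if (d.size() >= 2 && d[0] == 0xFE && d[1] == 0xFF) return "UTF-16 BE";
    if (d.size() >= 2 && d[0] == 0xFF && d[1] == 0xFE) return "UTF-16 LE";
    // 2. UTF-8 BOM, followed by valid UTF-8.
    if (d.size() >= 3 && d[0] == 0xEF && d[1] == 0xBB && d[2] == 0xBF &&
        is_valid_utf8(d, 3))
        return "UTF-8";
    // 3. No BOM, but the whole file is valid UTF-8.
    if (is_valid_utf8(d)) return "UTF-8";
    // 4. Fallback.
    return "ANSI";
}
```

It could be called as `guess_encoding(read_file("c:\\abc.txt"))`. Note that a pure ASCII file (or an empty one) is also valid UTF-8, so this sketch reports it as "UTF-8".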
Community
herohuyongtao
  • 1
    There's also a UTF-32 BOM which should be considered. Beyond that, it is mostly guessing, and the most likely guesses depend on the locale. Where I am (or have been), if the entire file would be legal UTF-8, that's probably what it is; otherwise, either ISO 8859-1 or ISO 8859-15 (but by that point, you really are guessing). – James Kanze Feb 18 '14 at 11:33
  • 1
    Also, if every other byte is 0, or most of them are, then it's probably UTF-16, big endian or little depending on which byte is 0. Same thing for three bytes out of four 0, and UTF-32. – James Kanze Feb 18 '14 at 11:34
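The two comments above (a UTF-32 BOM check, and the zero-byte pattern for files without a BOM) might look something like this in C++. It is only a sketch under assumptions: the function names are hypothetical, and the 0.9 threshold for "most of them are 0" is an arbitrary choice, not something the comments specify.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Returns "UTF-32 BE" or "UTF-32 LE" if a UTF-32 BOM is present, else "".
std::string utf32_bom(const std::vector<unsigned char>& d)
{
    if (d.size() >= 4) {
        if (d[0] == 0x00 && d[1] == 0x00 && d[2] == 0xFE && d[3] == 0xFF) return "UTF-32 BE";
        if (d[0] == 0xFF && d[1] == 0xFE && d[2] == 0x00 && d[3] == 0x00) return "UTF-32 LE";
    }
    return "";
}

// Heuristic for BOM-less data: look at which byte positions (modulo 4)
// are almost always zero. Returns "" when no pattern is recognized.
// `threshold` is an assumed cut-off for "most bytes at this position are 0".
std::string guess_by_zero_bytes(const std::vector<unsigned char>& d,
                                double threshold = 0.9)
{
    if (d.size() < 4) return "";
    std::size_t zeros[4] = {0, 0, 0, 0};   // zero bytes seen at each position mod 4
    std::size_t total[4] = {0, 0, 0, 0};   // all bytes seen at each position mod 4
    for (std::size_t i = 0; i < d.size(); ++i) {
        ++total[i % 4];
        if (d[i] == 0) ++zeros[i % 4];
    }
    bool z[4];
    for (int k = 0; k < 4; ++k)
        z[k] = static_cast<double>(zeros[k]) / static_cast<double>(total[k]) >= threshold;

    if ( z[0] &&  z[1] &&  z[2] && !z[3]) return "UTF-32 BE";  // three of four bytes are 0
    if (!z[0] &&  z[1] &&  z[2] &&  z[3]) return "UTF-32 LE";
    if ( z[0] && !z[1] &&  z[2] && !z[3]) return "UTF-16 BE";  // every other byte is 0
    if (!z[0] &&  z[1] && !z[2] &&  z[3]) return "UTF-16 LE";
    return "";
}
```

One detail worth keeping in mind: the UTF-32 LE BOM (FF FE 00 00) begins with the same two bytes as the UTF-16 LE BOM (FF FE), so a UTF-32 BOM check has to run before any UTF-16 BOM check. The zero-byte heuristic only works for text that is mostly in the ASCII/Latin range.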
2

I found a way to detect whether a Notepad file is Unicode, Unicode big endian, UTF-8, or plain ANSI:

I found that when I save a file in Notepad, it stores a byte order mark (BOM) at the start of the file by default, so I decided to use it, as suggested earlier in this question.

First, I read the first byte of my file. I already knew that:

  1. If the file is a Unicode (UTF-16) file, its first two bytes are FF FE (255 254 in decimal) for little endian, or FE FF (254 255 in decimal) for big endian.
  2. If the file is a UTF-8 file, its first byte is EF (239 in decimal); the full BOM is EF BB BF.

Here is the code:

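#include<stdio.h>
int main()
{
        FILE *fp=NULL;
        int c;
        /* Open in binary mode so the BOM bytes are read unmodified. */
        fp=fopen("c:\\abc.txt","rb");

        if (fp != NULL)
        {
            /* Only the first byte is inspected, once. This is enough to tell
               Notepad's BOMs apart, but it is not a full BOM check. */
            c = fgetc(fp);
            printf("%d\n",c);
            if(c==254)          /* FE: start of the UTF-16 big endian BOM (FE FF) */
            {
                printf("Unicode Big Endian File");
            }
            else if(c==255)     /* FF: start of the UTF-16 little endian BOM (FF FE) */
            {
                printf("Unicode Little Endian File");
            }
            else if(c==239)     /* EF: start of the UTF-8 BOM (EF BB BF) */
            {
                printf("UTF8  file");
            }
            else                /* no BOM found (or empty file): assume ANSI */
            {
                printf("ANSI File");
            }
            fclose(fp);
        }

        getchar();

    return 0;
}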

This worked fine for me. I hope it works for others too.

Alex Guteniev
Anonymous
  • There is no such thing as "Unicode Big Endian". Unicode is a very large character set, which assigns numbers to code points. It does not deal with encoding these numbers into a byte stream. That's instead the job of Unicode _encodings_, such as UTF-8, UTF-16LE, UTF-16BE, UTF-32LE, UTF-32BE. Please read about the [Byte order mark](https://en.wikipedia.org/wiki/Byte_order_mark) in order to write better code. – Roland Illig Jul 04 '21 at 12:10
1

You cannot know for certain which encoding a text file uses. One way to try is to look for a BOM at the beginning of the file, which would tell you that the text is in a Unicode encoding. However, the BOM is not mandatory, so you cannot rely on it to distinguish Unicode from other encodings.

A very common way to present this problem is that there is no such thing as plain text.

I'm Spanish, and here you can easily find text files in 7-bit ASCII, extended ASCII, ISO-8859-1 (aka Latin 1, which includes many common extra characters needed for Western Europe), and also UTF in its various flavours.

Hope this somehow helps.

Baltasarq
1

Files generally indicate their encoding with a file header.
And as others suggested you can never be sure what encoding a file is really using.

Follow these links to get a general idea :
Using Byte Order Marks
FILE SIGNATURES TABLE

Aseem Goyal
-1

Open your file with Notepad++ and go to the Encoding menu at the top to see the encoding of the file. See here.

Shah_MRI
  • 1
    This is a programming site, solutions must involve programming. OP is asking how to find the encoding *in C++*. – Eric Aya Jul 02 '21 at 13:00