7

I'm trying to read a text file and insert each word into a node of a binary search tree. However, the first word is always read as "ï»¿" + first word. For example, if my first word is "This", then the first word inserted into my tree is "ï»¿This". I've searched the forum for a solution; there was one post asking about the same problem in Java, but no one has addressed it in C++. Can anyone help me fix it? Thank you.

I found a simple solution. I opened the file in Notepad and saved it as ANSI. After that, the file is read correctly and its words are passed into the binary search tree.

Hoang Minh
  • Where is your code that is producing the problem? – Austin T French Dec 26 '13 at 04:00
  • Don't you think it would help if you showed us some code? – OldProgrammer Dec 26 '13 at 04:00
  • What's the origin of the file? It's possible the file format might be documented. – Captain Obvlious Dec 26 '13 at 04:04
  • @CaptainObvlious: It's in a school project, and the text file is given by our professor. I just googled and found out the solution. It's the way that the text file is saved causing the problem. I'm using notepad editor, and the file was saved as Unicode. So, in order to fix the problem, all you do is just save the file as ANSI coding, and the problem just goes away :)) – Hoang Minh Dec 26 '13 at 04:11
  • 1
    @LưuVĩnhPhúc: I did not know that there was a way to mark an answer. Thanks. – Hoang Minh Jun 24 '14 at 23:58
  • 1
    you can edit your already asked questions to [make it better](http://stackoverflow.com/help/how-to-ask) to reduce/remove the downvotes and get more people's attention – phuclv Jun 25 '14 at 01:52

3 Answers

20

That's UTF-8's BOM

You need to read the file as UTF-8. If you don't need Unicode and only use the 128 ASCII code points (0 to 127), then save the file as ASCII or as UTF-8 without a BOM.

phuclv
  • It is strictly the UTF-8 encoding of U+FEFF, the BOM (also a zero-width no-breaking space, ZWNBSP), presented using the code set ISO 8859-1. UTF-8 does not need a BOM, of course. – Jonathan Leffler Dec 26 '13 at 04:19
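To make the comment above concrete, here is a small sketch of how U+FEFF becomes the three bytes EF BB BF under UTF-8. `encode_utf8` is a hypothetical name, and the function assumes a valid code point (it does not reject surrogates or values above U+10FFFF).

```cpp
#include <string>

// Hypothetical helper: encodes one Unicode code point as UTF-8.
// Assumes cp is a valid scalar value; no validation is performed.
std::string encode_utf8(char32_t cp)
{
    std::string out;
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For U+FEFF the three-byte branch yields 0xEF, 0xBB, 0xBF. A program that treats those bytes as ISO 8859-1 shows them as the three characters ï, » and ¿.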
4

This is the byte order mark (BOM). What you see is the UTF-8 BOM with its bytes interpreted as ISO-8859-1. You have to tell your editor not to write a BOM, or use a different editor to strip it out.

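In C++, you can use the following function to convert a UTF-8 file with a BOM to ANSI. It relies on a helper, change_encoding_from_UTF8_to_ANSI, whose definition is not shown here.

#include <cstring>
#include <fstream>
#include <string>
using namespace std;

// Helper used by this answer; its definition is not included in the post.
char* change_encoding_from_UTF8_to_ANSI(char* text);

void change_encoding_from_UTF8BOM_to_ANSI(const char* filename)
{
    string strLine;
    string strResult;
    ifstream infile(filename);
    if (!infile)
        return; // could not open the file; do not truncate it below

    // The first 3 bytes (EF BB BF) are the UTF-8 BOM; every other byte is
    // assumed to be single-byte ASCII. Drop the BOM if it is present.
    if (getline(infile, strLine))
    {
        if (strLine.compare(0, 3, "\xEF\xBB\xBF") == 0)
            strLine.erase(0, 3);
        strResult += strLine + "\n";
    }
    // Loop on getline itself: testing eof() before reading appends an extra empty line.
    while (getline(infile, strLine))
    {
        strResult += strLine + "\n";
    }
    infile.close();

    // The copy needs length() + 1 bytes because of the terminating '\0'.
    char* changeTemp = new char[strResult.length() + 1];
    strcpy(changeTemp, strResult.c_str());
    char* changeResult = change_encoding_from_UTF8_to_ANSI(changeTemp);
    strResult = changeResult;
    delete[] changeTemp; // assumes the helper does not take ownership of its argument

    ofstream outfile(filename);
    outfile.write(strResult.c_str(), strResult.length());
    outfile.flush();
    outfile.close();
}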
herohuyongtao
1

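In debug mode, find out the code point of the special character and then remove it. The call below is not C++ (it looks like Java's String.replaceAll); it returns a new string, so the result has to be assigned back:

content = content.replaceAll("\uFEFF", "");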
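Since the question is about C++, a rough C++ counterpart of the replaceAll call might look like the sketch below. It assumes the text is held as UTF-8 bytes in a `std::string`, so U+FEFF appears as the byte sequence EF BB BF; `remove_all_boms` is a hypothetical name.

```cpp
#include <string>

// Hypothetical helper: erases every occurrence of the UTF-8 encoded BOM
// (bytes EF BB BF) from a string, in place.
void remove_all_boms(std::string& text)
{
    const std::string bom = "\xEF\xBB\xBF";
    std::string::size_type pos = 0;
    // After each erase, keep searching from the same position
    while ((pos = text.find(bom, pos)) != std::string::npos)
        text.erase(pos, bom.size());
}
```

Unlike stripping only the start of the file, this also removes BOM sequences that ended up in the middle of the text, for example after concatenating several files.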
shubham kumar