
I have a .txt file called "1.txt" that I want to read in. The file starts with 8 BOM characters, and the following code fails to read it:

ifstream fin("1.txt");

string temp = "";

char c = fin.get();

while (!fin.eof())
{
    if (c >= ' ' && c <= 'z')
    {
        temp += c;
    }

    c = fin.get();
}

cout << temp;

This will print nothing, because of something the BOM is doing.

So, I decided to use the fin.ignore() function, in order to ignore the beginning BOM characters of the file. However, still nothing is being printed. Here is my complete program:

#include <iostream>
#include <fstream>
#include <string>
#include <istream>

using namespace std;

int main()
{
    ifstream fin("1.txt");

    if (fin.fail())
    {
        cout << "Fail\n";
    }
    else
    {
        string temp = ""; // Will hold 1.txt's contents.

        fin.ignore(10, ' ');
        // Ignore first 10 chars of the file or stop at the first space char,
        // since the BOM at the beginning is causing problems for fin to read the file.
        // BOM is 8 chars, I wrote 10 to just be safe.

        char c = fin.get();

        while (!fin.eof())
        {
            if (c >= ' ' && c <= 'z') // checks if c stores a standard char.
            {
                temp += c;
            }

            c = fin.get();
        }

        cout << temp;

        // PROBLEM:  No text is printed to the screen from the above command.

        cout << temp.size(); // prints 0
    }
}

I hypothesize that by the time the ifstream fin("1.txt"); line has run, it is already too late, since the BOM has probably already affected fin. So I need to somehow tell fin to ignore the BOM characters before it opens the file, but I can't call fin.ignore() before the fin object exists.

Also, I know I can manually delete the BOM from my .txt file, but I'm looking for a solution that only involves writing a C++ program. If I have thousands or millions of .txt files, deleting manually is not an option. I'm also not looking to download new software, like Notepad++.

Here is all I have in the file "1.txt":

ÐÏࡱá Hello!

This site's formatting doesn't let me show it, but in the actual file there are about 15 spaces between the BOM and Hello!

  • At least I finally figured out what that is. It's the start of a .doc file (which has nothing to do with BOMs). – chris Jan 01 '18 at 08:37
  • @chris Do you know of any way I can get rid of it in C++? – Inertial Ignorance Jan 01 '18 at 08:39
  • I don't see a problem with the previous suggestion of `.ignore`, but you'd better be sure it suffices to look only 10 bytes in. In an empty .doc file I created with Word 2016, it's 4106 bytes before the first space and there's still other stuff after it. `std::ifstream` doesn't guess at encodings or anything. I'd suggest exploring what happens line-by-line with a debugger to narrow down the problem. – chris Jan 01 '18 at 08:50
  • @chris Sorry, which previous suggestion of .ignore are you referencing? – Inertial Ignorance Jan 01 '18 at 08:52
  • The one from the [last question](https://stackoverflow.com/questions/48047404/how-can-i-use-c-to-eliminate-the-bom-in-a-notepad-txt-file). – chris Jan 01 '18 at 08:53
  • @Chris I tried Galik's suggestion in my code above though, but I'm still having the same problem. – Inertial Ignorance Jan 01 '18 at 08:56
  • It looks like you are reading a *.doc binary file in text mode, and for some reason you think there is BOM in there. Open the file in binary mode `ifstream fin("1.txt", ios::binary);` and print the bytes `char buf[1]; while(fin.read(buf, 1)) cout << (int)buf[0] << " ";` to examine the values. – Barmak Shemirani Jan 01 '18 at 13:05
  • This probably isn't the problem, but change `if (c >= ' ' && c <= 'z')` to `if (!std::iscntrl(c))` or, if you only want printable characters, `if (std::isprint(c))`. **That** "will check if c stores a standard character". As written it assumes a particular character encoding, and will give incorrect results if that assumption is wrong. – Pete Becker Jan 01 '18 at 14:35
  • @Barmak Shemirani If I read the file in binary mode, could I then just skip the first 10 bytes or so? Would that ignore the first part of the .txt file? – Inertial Ignorance Jan 01 '18 at 18:20
  • @Pete Becker I tried both of your suggestions, but I'm still getting nothing printed to the screen. – Inertial Ignorance Jan 01 '18 at 18:21
  • My answer to this question was found by dumb luck, to be quite honest. I already recommended using a debugger to figure out what exactly is happening. It turns out that a debugger would trivialize the process of finding the problem here. You'd immediately notice that `.eof()` returns true after the `ignore` call, prompting you to take out the `ignore` call and read byte by byte to see which one causes it, quickly seeing that the \x1a byte is the cause. This leads to a much more specific question of why \x1a is causing EOF or a much easier google search. – chris Jan 01 '18 at 18:56
  • You can use `ifstream::seekg` to go to a specific location in the file. But it won't help much if you have a *.doc binary file. The conversion of DOC format to text is not trivial. – Barmak Shemirani Jan 01 '18 at 19:16

1 Answer


According to cppreference, the character with value \x1a terminates input on Windows in text mode. You presumably have such a character right near the beginning, so in text mode the stream hits end-of-file there, before it reaches any of the text. My empty .doc file has one as the 7th byte.

You should read the file in binary mode:

std::ifstream fin("1.txt", std::ios::binary);

You can still use ignore to skip a prefix. However, ignoring up to a specific delimiter character is fragile, because the binary prefix could itself contain that character. If the prefixes are always the same length, ignoring a fixed number of bytes suffices. Also, you can't rely on Notepad to count the bytes, since quite a few of them are invisible there. Look at a hex view of the file instead. Many good text editors can do this, or you can use PowerShell's Format-Hex -Path <path> command. For example, here are the first few lines of mine:

00000000   D0 CF 11 E0 A1 B1 1A E1 00 00 00 00 00 00 00 00  ÐÏ.à¡±.á........
00000010   00 00 00 00 00 00 00 00 3E 00 03 00 FE FF 09 00  ........>...þ...
00000020   06 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00  ................
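If installing nothing is a requirement, a dump in roughly this layout can also be produced from C++. The sketch below is an illustration only: hex_dump and its max_bytes parameter are hypothetical names, and the output format is an approximation of Format-Hex, not a copy of it.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <iomanip>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: formats up to max_bytes from `in` as rows of 16
// hex values, followed by a text column where non-printable bytes are
// shown as '.'. The stream should be opened with std::ios::binary.
std::string hex_dump(std::istream& in, std::size_t max_bytes = 64)
{
    std::ostringstream out;
    out << std::hex << std::uppercase << std::setfill('0');

    char row[16];
    std::size_t offset = 0;

    while (offset < max_bytes)
    {
        const std::size_t want = std::min(sizeof row, max_bytes - offset);
        in.read(row, static_cast<std::streamsize>(want));
        const std::streamsize got = in.gcount();
        if (got <= 0)
        {
            break; // nothing left to read
        }

        out << std::setw(8) << offset << "  ";

        std::string text;
        for (std::streamsize i = 0; i < got; ++i)
        {
            const unsigned char byte = static_cast<unsigned char>(row[i]);
            out << ' ' << std::setw(2) << static_cast<unsigned>(byte);
            text += std::isprint(byte) ? static_cast<char>(byte) : '.';
        }

        // Pad a short final row so the text column stays aligned.
        for (std::streamsize i = got; i < 16; ++i)
        {
            out << "   ";
        }

        out << "  " << text << '\n';
        offset += static_cast<std::size_t>(got);
    }

    return out.str();
}
```

For the file in the question, `std::ifstream fin("1.txt", std::ios::binary); std::cout << hex_dump(fin);` would show where the binary prefix ends and where any \x1a byte sits.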

It's unclear what the best way to remove the prefixes is without more information.

chris
  • I used ios::binary to read in the file and it worked perfectly. Now I have a big string with a bunch of Word text at the beginnings and ends of each file's contents, but trimming them shouldn't be a problem. There are a few signs I can use; for example, if a program finds there aren't 10 printable characters in a row, that area is almost definitely in a Word prefix or suffix. I could then delete all the characters until reaching a big break of space chars, since they separate the prefix, main text, and suffix of each file's contents. – Inertial Ignorance Jan 03 '18 at 09:24
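The heuristic described in this comment, keeping only stretches of at least 10 printable characters in a row, might be sketched as follows. printable_runs and min_len are illustrative names, and the threshold of 10 is the commenter's guess rather than a property of the .doc format.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: splits `data` at every non-printable byte and
// returns the pieces that are at least min_len characters long.
std::vector<std::string> printable_runs(const std::string& data,
                                        std::size_t min_len = 10)
{
    std::vector<std::string> runs;
    std::string current;

    for (char c : data)
    {
        if (std::isprint(static_cast<unsigned char>(c)))
        {
            current += c;
        }
        else
        {
            // A non-printable byte ends the current run.
            if (current.size() >= min_len)
            {
                runs.push_back(current);
            }
            current.clear();
        }
    }

    // Keep a run that extends to the end of the data.
    if (current.size() >= min_len)
    {
        runs.push_back(current);
    }

    return runs;
}
```

Short text such as "Hello!" falls below the default threshold, so min_len would need tuning against real files.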