getline() text with UNIX formatting characters

Question

I am writing a C++ program which reads lines of text from a .txt file. Unfortunately the text file is generated by a twenty-something year old UNIX program and it contains a lot of bizarre formatting characters.

The first few lines of the file are plain, English text and these are read with no problems. However, whenever a line contains one or more of these strange characters mixed in with the text, that entire line is read as characters and the data is lost.

The really confusing part is that if I manually delete the first couple of lines so that the very first character in the file is one of these unusual characters, then everything in the file is read perfectly. The unusual characters obviously just display as little ASCII squiggles (arrows, smiley faces, etc.), which is fine. It seems as though a decision is being made automatically, without my knowledge or consent, based on the first line read.

Based on some googling, I suspected that the issue might be with the locale, but according to the Visual Studio debugger, the locale property of the ifstream object is "C" in both scenarios.

The code which reads the data is as follows:

//Function to open file at location specified by inFilePath, load and process data
int OpenFile(const char* inFilePath)
{
    string line;
    ifstream codeFile;

    //open text file
    codeFile.open(inFilePath,ios::in);

    //read file line by line
    while ( codeFile.good() )
    {
       getline(codeFile,line);

       //check non-zero length
       if (line != "")
            ProcessLine(&line[0]);
    }

    //close file
    codeFile.close();

    return 1;
}

If anyone has any suggestions as to what might be going on or how to fix it, they would be very welcome.

also, small tip for posting: you'll want to use spaces rather than tabs for posting code to get the indentation you are expecting. — MartyE, Aug 21 '12 at 16:33
Can you give a little more detail about the "bizarre formatting characters"? In particular, what are the hex values that the file contains? I have a guess, but I'm not willing to post it unless it's actually appropriate. — Pete Becker, Aug 21 '12 at 16:33
A much clearer and correcter way to write your loop: `std::ifstream codeFile(inFilePath); for (std::string line; std::getline(codeFile, line); ) { /*...*/ }` — Kerrek SB, Aug 21 '12 at 16:34
I'm guessing "bizarre formatting characters" such as "Smiley Faces" are just non-ascii byte values. Keep in mind you _may_ need to account for unicode cases where its simply multi-byte characters (not Unix specific) — MartyE, Aug 21 '12 at 16:34
Could you provide the first several bytes in the file e.g., using `od -c`. Do you know [the character encoding of the text](http://www.joelonsoftware.com/articles/Unicode.html)? — jfs, Aug 21 '12 at 16:54
It sounds like you're reading unicode text or something non ascii. I would suggest using the std library functions not the old C style function calls. — Justin, Aug 21 '12 at 19:50
Hi, thanks for the replies. I tried to paste the first few lines of text here but the unusual characters in question just disappear. Also it is mostly confidential medical information so I can't really share it. From what I have been able to manually identify, the most frequent, problem causing character is 0x1B / ascii 27. There are also a few ascii 10s and 12s. — Smoggie Tom, Aug 22 '12 at 08:31
Escape, line-feed, and form-feed. What you have was meant to be sent to a printer. But this should not disturb reading the file. Maybe your processing stumbles over this stuff. — rtlgrmpf, Sep 12 '12 at 14:25
Unless I'm mistaken I believe that ifstream will set EOF to true, if he encounters the SUB char (0x1A). — João Augusto, Oct 26 '12 at 16:13

score 0 · Answer 1 · edited May 23 '17 at 12:21

From your description, it sounds like the file contains binary data, which can cause getline() to drop content or skip over a line entirely.

You have a couple of choices:

1. If you simply need the lines from the data file, you can first sanitise them by removing all non-printable characters (that is the "official" name for those weird ASCII characters). On UNIX a tool such as strings would help you with that process.

   You can of course also do this programmatically in your code by reading in some amount of data, storing it in a string, and then removing the characters that fall outside the printable ASCII range. This will most likely cause you to lose any Unicode that may be stored in the file.

2. Change your program to understand the format: write a parser that lets you process the document in a more sane way.
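
The programmatic clean-up described in the first choice could look roughly like the sketch below. StripNonPrintable is a hypothetical helper name, not something from the original program, and the decision to keep tabs is an assumption. Note that escape sequences meant for a printer lose only their control bytes this way; the printable characters that follow an ESC stay in the text.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper (not part of the original program): returns a copy of
// `raw` without the bytes that are not printable ASCII. Tabs are kept, which
// is an assumption about what the caller wants.
std::string StripNonPrintable(const std::string& raw)
{
    std::string clean;
    clean.reserve(raw.size());
    for (unsigned char ch : raw)
    {
        // std::isprint expects an unsigned char value; plain char may be
        // negative for bytes above 0x7F, so iterate as unsigned char.
        if (ch == '\t' || (ch < 0x80 && std::isprint(ch)))
            clean.push_back(static_cast<char>(ch));
    }
    return clean;
}
```

A caller would pass each line (or each chunk of raw data) through this function before handing it to the existing processing code.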

If you can, I would suggest trying solution number 1 first, simply to see if the results are sane and can still be used. You mention that this is medical data; do you by any chance know what file format this is? If you are trying to find out and have access to a UNIX/Linux machine, you can use the file utility and maybe it can give you a clue (worst case it will tell you it is simply data).

If possible, try getting a "clean" file that you can post a hex dump of, so that we can try to provide better help than we are currently providing. By clean I mean that there is no personally identifying information in the file.

For number 2, open the file in binary mode. You mentioned using Windows, where std::fstream objects handle text-mode and binary-mode files differently, whereas on UNIX systems this is not the case (on most systems; I'm sure I'll get a comment regarding the one system that doesn't match this description).

codeFile.open(inFilePath,ios::in);

would become

codeFile.open(inFilePath, ios::in | ios::binary);

Instead of getline() you will want to become intimately familiar with .read(), which allows unformatted input operations on the ifstream.

Reading will be like this:

// This code has not been tested!
char input[1024];
codeFile.read(input, sizeof input);

// gcount() returns std::streamsize: the number of bytes the last read() stored.
std::streamsize actual_read = codeFile.gcount();

// Here you can process input, up to a maximum of actual_read characters.

//ProcessLine() // We didn't necessarily read a line!
ProcessData(input, actual_read);
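
A single read() only fetches one chunk, so in practice the call would sit in a loop. The following is a minimal sketch under that assumption; ReadAllBytes is a hypothetical name, and collecting everything into one std::string is just one possible design (a large file might be better processed chunk by chunk).

```cpp
#include <cstddef>
#include <fstream>
#include <string>

// Hypothetical helper: reads a whole file in binary mode, 1024 bytes at a
// time, and returns the raw bytes. Returns an empty string if the file
// cannot be opened.
std::string ReadAllBytes(const char* inFilePath)
{
    std::ifstream codeFile(inFilePath, std::ios::in | std::ios::binary);
    std::string contents;
    char input[1024];

    // read() reports failure on the final, short chunk, so gcount() is
    // checked as well to avoid dropping those last bytes.
    while (codeFile.read(input, sizeof input) || codeFile.gcount() > 0)
    {
        contents.append(input, static_cast<std::size_t>(codeFile.gcount()));
    }
    return contents;
}
```

Because the stream is opened with ios::binary, the bytes arrive exactly as stored on disk, including any carriage returns and control characters.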

The other thing, as mentioned, is that you can change the locale for the current stream and change the separator it considers a new line; maybe this will fix your issue without having to use the unformatted operations:

Imbue the stream with a new locale that only knows about the newline. This may or may not let getline() work without issues.
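
As a hedged sketch of this idea: the example below does not build a custom locale. It imbues the classic "C" locale and passes the delimiter to std::getline explicitly, since for std::getline the line separator is an argument rather than something taken from the locale. It also opens the file in binary mode, which is an assumption carried over from the previous suggestion. ReadLines is a hypothetical name, and whether this resolves the original problem is untested.

```cpp
#include <fstream>
#include <locale>
#include <string>
#include <vector>

// Hypothetical helper: returns the non-empty lines of a file, splitting only
// on '\n' and leaving every other control character in place.
std::vector<std::string> ReadLines(const char* inFilePath)
{
    std::ifstream codeFile;
    codeFile.imbue(std::locale::classic());  // imbue before opening the file
    codeFile.open(inFilePath, std::ios::in | std::ios::binary);

    std::vector<std::string> lines;
    for (std::string line; std::getline(codeFile, line, '\n'); )
    {
        // In binary mode a Windows line ending leaves a trailing '\r'.
        if (!line.empty() && line.back() == '\r')
            line.pop_back();
        if (!line.empty())
            lines.push_back(line);
    }
    return lines;
}
```

The loop form follows the one suggested in the comments on the question: the stream state is tested after each getline() call, so a failed read never produces a line.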
