I stumbled upon strange behavior of string::substr. Normally I code on Windows 7 in Eclipse+MinGW, but when I was working on my laptop, using Eclipse on Linux (Ubuntu 12.04), I noticed a difference in the result.

I was working with a vector< string > filled with lines of text. One of the steps was to remove the last character from each line.

On Windows 7 in Eclipse I did:

for( int i = 0; i < (int)vectorOfLines.size(); i++ )
{
    vectorOfTrimmedLines.push_back( ((string)vectorOfLines.at(i)).substr(0, ((string)vectorOfLines.at(i)).size()-1) );
}

and it works as intended (removing the last character from each line).

But on Linux this code does not trim. Instead I needed to do it like this:

//  -2 instead of -1 character
vectorOfTrimmedLines.push_back( ((string)vectorOfLines.at(i)).substr(0, ((string)vectorOfLines.at(i)).size()-2) );

or using another method:

vectorOfTrimmedLines.push_back( ((string)vectorOfLines.at(i)).replace( (((string)vectorOfLines.at(i)).size()-2),1,"",0 ));

Of course, the Linux methods work the wrong way on Windows (trimming the last 2 characters, or replacing the one before last).

The problem seems to be that myString.size() returns the number of characters on Windows, but on Linux it returns the number of characters + 1. Could it be that the newline character is counted on Linux?

As a newbie in C++ and programming in general, I wonder why it is like that, and how this can be done in a platform-independent way.

Another thing I wonder: which method is preferable (faster), substr or replace?

Edit: The method used to fill the strings is this function I wrote:

vector< string > ReadFile( string pathToFile )
{
    //  opening file
    ifstream myFile;
    myFile.open( pathToFile.c_str() );

    //  vector of strings that is returned by this function, contains file line by line
    vector< string > vectorOfLines;

    //  check if the file is open and then read file line by line to string element of vector
    if( myFile.is_open() )
    {
        string line;    //  this will contain the data read from the current line of the file

        while( getline( myFile, line ) )    //  until last line in file
        {
            vectorOfLines.push_back( line );    //  add current line to new string element in vector
        }

        myFile.close(); //  close the file
    }

    //  if file does not exist
    else
    {
        cerr << "Unable to open file." << endl; //  if the file is not open output
        //throw;
    }

    return vectorOfLines;   //  return vector of lines from file
}
RegEx
  • Save yourself some stress, use [Boost](http://stackoverflow.com/questions/216823/whats-the-best-way-to-trim-stdstring) – Perception Oct 06 '12 at 14:21
  • Show the method that was used to fill the strings in the first place. – Benjamin Lindley Oct 06 '12 at 14:22
  • Why the typecasting? Why the use of `at` instead of the `[]` operator? Why not use iterators in the loop? And finally, are you sure the strings are actually the same in Linux as in Windows? – Some programmer dude Oct 06 '12 at 14:24
  • At first glance, I would suspect an embedded '\0' at the end of the strings. Do you know how to print a string out as hex? – Jeffery Thomas Oct 06 '12 at 14:25
  • @JoachimPileborg - I'm guessing that the strings are **exactly** the same in Linux as in Windows, but that they're coming from a text file that was written under Windows, so have two characters representing the newline. – Pete Becker Oct 06 '12 at 14:26
  • *"Could it be that new line character is counted on Linux?"* - It is always counted if it is in the string, since it is a normal character. I guess the problem is rather that your strings have different values in the first place. – Christian Rau Oct 06 '12 at 14:26
  • I suspect that you're running Linux in a virtual box in Windows and accessing a Windows text file with each line terminated by CRLF. I.e. the code works as it should for strings of sufficient size, but the assumptions about those strings are wrong. Also please remove your casts. – Cheers and hth. - Alf Oct 06 '12 at 14:27
  • @BenjaminLindley - added method in edit – RegEx Oct 06 '12 at 14:27
  • Please note that the **unsigned arithmetic** will blow up in your face for sufficiently small string sizes. – Cheers and hth. - Alf Oct 06 '12 at 14:28
  • As for the other comments: guys, I'm a newbie, I'm learning all of it myself, so surely I lack some understanding of methods, but at the moment I use the methods I know about; more will come (hopefully) along the road... so sorry if I do not do things the best way... – RegEx Oct 06 '12 at 14:28
  • @Cheersandhth.-Alf - no, I run Linux on a separate machine as the full OS – RegEx Oct 06 '12 at 14:30
  • @RegEx I wonder though what serious learning resource (and nobody learns a programming language like C++ without a serious learning resource) teaches `std::string::at` before `std::string::operator[]`. – Christian Rau Oct 06 '12 at 14:30
  • Yes, the strings are exactly the same on Windows and Linux, hell, everything is exactly the same, that's why I was puzzled that it produced different results and came up with a solution for Linux. Everything apart from these pieces of code is exactly the same – RegEx Oct 06 '12 at 14:34
  • @PeteBecker - missed your comment. Yes, the text file is a log file created in Windows... – RegEx Oct 06 '12 at 14:37
  • @ChristianRau: `std::string::at()` provides automatic bounds checking, it's valid for a teaching resource to consider `std::string::at()` to be a safer alternative for newbies rather than `std::string::operator[]`. In fact, IMO, if you had to teach only one of them I'd say you should teach `std::string::at()` rather than `std::string::operator[]`; it's rather rare that you really want to be able to access memory outside the area allocated for the string. – Lie Ryan Oct 06 '12 at 15:04
  • @LieRyan Yes, and because of this you rarely ever try to access the string out of bounds. If you do, your program is broken anyway and getting an exception doesn't help with that. But I agree that there are some rare cases where accessing out of bounds may indeed happen and an exception might be appropriate. Because of this `std::string::at` is rather special case function. – Christian Rau Oct 06 '12 at 15:08
  • @ChristianRau: it's very easy for beginners to make out of bounds error; usually off-by-one calculation. Once you've been programming for a while, you instinctively avoid those kinds of errors, but for beginners getting an exception, with line number at debugger, is much more helpful than silently corrupting the memory. Of course, once you've got your code correct, you should be able to convert it to `std::string::operator[]` and have everything work the same, but except on a really tight loop, leaving it as is does no harm; OOB exception is not meant to be caught, it's a debugging aid. – Lie Ryan Oct 06 '12 at 15:29
  • @LieRyan But then again a proper standard library gives you a proper assertion failure or exception in debug mode anyway. But if it doesn't, you got a point. Let's just hope the beginner that learned to always use `std::string::at` knows about its implications and doesn't just blindly follow this path later on. – Christian Rau Oct 06 '12 at 15:51
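
To make the unsigned-arithmetic comment above concrete, here is a minimal sketch. `DropLastChar` is a hypothetical helper name, not something from the question. Since `size()` returns an unsigned type, `size() - 1` on an empty string wraps around to `std::string::npos` rather than becoming -1. Likewise, with `size() - 2` on a one-character string, `substr` would silently return the whole string and `replace` would throw `std::out_of_range`.

```cpp
#include <string>

// Hypothetical helper (not from the question): returns a copy of `line`
// without its last character, leaving an empty string unchanged.
std::string DropLastChar(const std::string& line)
{
    // size() is unsigned, so on an empty string size() - 1 would wrap
    // around to a huge value (std::string::npos) instead of becoming -1.
    if (line.empty())
        return line;
    return line.substr(0, line.size() - 1);
}
```

The guard keeps the subtraction from ever running on an empty string, so the helper behaves the same on every platform regardless of what the input file contained.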

3 Answers

Text files are not identical on different operating systems. Windows uses a two-byte sequence to mark the end of a line: 0x0D, 0x0A (CR, LF). Linux uses one byte, 0x0A (LF). getline (and most other input functions), reading from a stream opened in text mode, follows the convention of the OS that the program was compiled for; when the character(s) that the OS uses to represent the end of a line are read, they are replaced with '\n'. So if you write a text file under Windows, lines end with 0x0D, 0x0A; if you read that text file under Linux, getline sees 0x0D and treats it as a normal character, then it sees 0x0A and treats it as the end of the line. That leftover 0x0D ('\r') at the end of each string is the extra character that size() counts on Linux.

So the moral is that you must convert text files to the native representation when you move them from one system to another. ftp knows how to do this. If you're running in a virtual box, you have to do the conversion manually when you switch systems. It's simple enough with tr from a Unix command line.

Pete Becker

This is because on Windows a newline is represented by two characters, CR+LF, while on Linux it's only LF, and on Mac (prior to OS X) it's only CR.

As long as you only use files generated on Linux on Linux systems or files generated on Windows on Windows systems, you would have nothing to worry about. But as soon as you need to use a file generated on Linux on Windows or vice versa, you need to handle newline correctly.

As a first step, you need to open the file in binary mode, std::ifstream infile( "filename", std::ios_base::binary );, then you have three options:

  1. You need to decide on a single newline convention for all platforms and use it consistently,
  2. You need to be able to detect the newline convention used in the current file (usually implemented by checking the newline used on the first line), save that in a variable, and pass it around to string functions that need to deal with newline,
  3. Tell the user to convert the file to the right newline convention, e.g. using dos2unix and unix2dos, or if the file transfer involves FTP, use ASCII mode.

Or, as has been said, use Boost.
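
As a rough sketch of the binary-mode route, the code below reads the whole file without any newline translation and accepts all three conventions at once instead of detecting one per file. `SplitLines` and `ReadFileAnyNewline` are hypothetical names, and treating a lone CR as a line terminator is an assumption that may not suit every file.

```cpp
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: splits text on "\r\n", "\n", or a lone "\r".
std::vector<std::string> SplitLines(const std::string& text)
{
    std::vector<std::string> lines;
    std::string current;
    for (std::string::size_type i = 0; i < text.size(); ++i)
    {
        const char c = text[i];
        if (c == '\r' || c == '\n')
        {
            // Treat the pair "\r\n" as a single terminator.
            if (c == '\r' && i + 1 < text.size() && text[i + 1] == '\n')
                ++i;
            lines.push_back(current);
            current.clear();
        }
        else
        {
            current += c;
        }
    }
    // Keep a final line that has no terminator.
    if (!current.empty())
        lines.push_back(current);
    return lines;
}

// Hypothetical helper: reads the file in binary mode, so the bytes arrive
// exactly as stored, then splits them into lines.
std::vector<std::string> ReadFileAnyNewline(const std::string& pathToFile)
{
    std::ifstream file(pathToFile.c_str(), std::ios_base::binary);
    std::ostringstream buffer;
    buffer << file.rdbuf();
    return SplitLines(buffer.str());
}
```

If the file cannot be opened, this sketch simply returns an empty vector; real code would probably want to report that case separately.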

Lie Ryan
  • Only seconds behind after me accepting other response, but thank you for your input that is correct! – RegEx Oct 06 '12 at 14:42

The line endings are not the same in Windows and Linux/Unix: Windows uses two bytes and Linux uses one. Google how to use tr on the *nix command line and you will see how to convert them.

Good luck!

Kenzo