I'm working on a C school assignment that is intended to be done on Windows, but I'm programming it on OS X. The other students, who work on Windows, have no problems reading a file, but I do.

The code provided by the tutors splits the contents of a file on \n using this code:

/* Read ADFGX information */
adfgx = read_from_file("adfgx.txt");

/* Define the alphabet */
alphabet = strtok(adfgx, "\n");

/* Define the code symbols */
symbols = strtok(NULL, "\n");

However, the file adfgx.txt (which is provided for the assignment) has Windows-style newlines (\r\n); I checked it with a hex editor. Compiled with the Microsoft C compiler from Visual Studio and run on Windows, the program splits the file correctly on the \r\n newlines, which I find weird because I cannot find any documentation on this behavior. When I compile it on OS X using gcc and run it, the \r is still included in the tokenized string, because strtok only splits on \n. If I change the delimiters in the strtok call to "\r\n", it works for me.

Is it normal that this behaves differently on Windows and Unix? How should I handle this in real-life situations (assuming I'm trying to write portable C code for Windows and Unix that should handle file input that uses \r\n)?

Martijn Courteaux
  • how do you open the file? – Karoly Horvath Mar 08 '15 at 16:38
  • Best solution: when transferring the file from Windows to Unix get rid of the `'\r'` ... see http://stackoverflow.com/questions/2613800/how-to-convert-dos-windows-newline-crlf-to-unix-newline-n-in-bash-script – pmg Mar 08 '15 at 16:44
  • You can use multiple delimiters such as `strtok(adfgx, " \t\r\n");` any combination of which (in any sequence) is treated as a single delimiter. If you want the code to be portable it won't hurt checking for `\r` as well as `\n`. – Weather Vane Mar 08 '15 at 16:46
  • With Windows every weird behavior is "normal". Create preprocessor code to check for the platform and define a constant accordingly. Alternatively you can use a `dos2unix` text file converter to get the files in shape to work for you. – Tarik Mar 08 '15 at 16:49
  • Windows has two different file-opening modes unlike some other systems: binary and text. When in text mode, `\r\n` is automatically translated to `\n` during read/write operations. This differs from other systems where the newline sequence is simply `\n`. Mac systems before OS X used `\r` for files rather than `\n` to add to the confusion. If you want to do it to match behavior across all platforms, open the file in binary mode and break on `\r\n`. It should suffice for class, though I often worry about those files with `\r` and no `\n` to follow when doing this. Hopefully it's fine. –  Mar 08 '15 at 16:49

1 Answer

If you open the file with fopen("adfgx.txt", "r") on Windows, the file is opened in "text mode" and each \r\n pair is implicitly translated to \n by subsequent fread calls, so the \r never reaches your code. If you open it on Windows with fopen("adfgx.txt", "rb") instead, the file is opened in "binary mode" and the \r remains. The "rb" mode and the other mode strings are described in the documentation for fopen on Windows. And as you might imagine, fwrite on Windows automatically inserts a \r into the stream in front of each \n (as long as the file was not opened in binary mode).

Unix and OS X treat \r as an ordinary character. Hence, strtok(NULL, "\n") won't strip off the '\r' char, because you are not splitting on it.
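If you would rather keep "\n" as the only delimiter, one option is to trim the leftover '\r' from each token yourself. The following is only a sketch: strip_trailing_cr is a hypothetical helper name, not something from the assignment code.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: removes a single trailing '\r' from a line,
   if there is one. Returns the same pointer for convenience. */
char *strip_trailing_cr(char *line)
{
    size_t len;

    assert(line != NULL);
    len = strlen(line);
    if (len > 0 && line[len - 1] == '\r') {
        line[len - 1] = '\0';   /* overwrite the CR with the terminator */
    }
    return line;
}
```

It could be used as alphabet = strip_trailing_cr(strtok(adfgx, "\n")), provided strtok is known not to return NULL at that point. Unlike adding '\r' to the delimiter set, this only touches a '\r' at the very end of a line.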

The easy cross-platform fix would be to invoke strtok as follows on all platforms:

/* Define the alphabet */
alphabet = strtok(adfgx, "\r\n");

Passing "\r\n" as the delimiter string should clear up most issues with reading Windows text files on Unix and vice versa. strtok treats any run of delimiter characters as a single separator, so it never returns an empty string; the flip side is that blank lines in the file are silently skipped.
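As a rough illustration of that approach, here is a small sketch that collects every line of a buffer using "\r\n" as the delimiter set. The function name split_lines and its parameters are made up for this example, and the sample text in the note below is not the real contents of adfgx.txt.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: splits `text` in place on any mix of '\r' and '\n'
   and stores up to `max_lines` token pointers in `lines`.
   Returns the number of tokens stored. Blank lines are skipped, because
   strtok collapses consecutive delimiter characters. */
size_t split_lines(char *text, char **lines, size_t max_lines)
{
    size_t count = 0;
    char *token;

    assert(text != NULL && lines != NULL);
    token = strtok(text, "\r\n");
    while (token != NULL && count < max_lines) {
        lines[count++] = token;
        token = strtok(NULL, "\r\n");
    }
    return count;
}
```

Given a writable buffer holding "ABCDE\r\n\r\nADFGX\n", this stores two tokens, "ABCDE" and "ADFGX", regardless of whether the C runtime already translated the newlines. If blank lines matter in your input, strtok is probably the wrong tool.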

selbie
  • That explains a whole lot! Thank you very much. As I mentioned in my question: using `"\r\n"` as delimiter string fixed it for me, so this means that `strtok` indeed doesn't return an empty string. – Martijn Courteaux Mar 08 '15 at 18:12
  • I ran into another inconvenience: using a method to seek to the end of a file descriptor that is opened in "r" mode, it returns the actual length of the file, instead of the stripped length where "\r\n" is replaced by "\n". So using this technique to determine the size of a buffer causes the constructed buffer to contain garbage at the end, because the read method strips the \r out, but our buffer has allocated space to include the \r. – Martijn Courteaux Mar 09 '15 at 17:06
  • There's plenty of workarounds to this issue. You could just "zero-out" your buffer or fill it completely with `\n` chars before reading the file into it. – selbie Mar 09 '15 at 18:32
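Another workaround for the buffer-size problem from the comments is to trust the count that fread returns instead of the size obtained by seeking. The sketch below assumes a helper named read_text_file, which is a hypothetical stand-in (the real read_from_file from the assignment is not shown here). It also assumes, as the comment above describes, that seeking to the end of a text-mode stream reports a size at least as large as what fread will deliver.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for the assignment's read_from_file.
   Reads a whole file in text mode into a NUL-terminated buffer.
   Returns NULL on failure; the caller frees the result. */
char *read_text_file(const char *path)
{
    FILE *fp;
    long size;
    size_t got;
    char *buf;

    assert(path != NULL);
    fp = fopen(path, "r");
    if (fp == NULL) {
        return NULL;
    }

    /* Used only as an upper bound: in Windows text mode this still
       counts the '\r' characters that fread will drop. */
    if (fseek(fp, 0L, SEEK_END) != 0 || (size = ftell(fp)) < 0) {
        fclose(fp);
        return NULL;
    }
    rewind(fp);

    buf = malloc((size_t)size + 1);
    if (buf == NULL) {
        fclose(fp);
        return NULL;
    }

    /* Terminate at the number of bytes actually read, so no garbage
       is left after the translated text. */
    got = fread(buf, 1, (size_t)size, fp);
    buf[got] = '\0';

    fclose(fp);
    return buf;
}
```

Because the terminator is placed at the position fread reports, there is no need to zero out or pre-fill the buffer.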