
I took over a project that uses the following function to read files:

char *fetchFile(char *filename) {
    char *buffer = NULL;
    long len;                            /* ftell() returns long, not int */
    FILE *f = fopen(filename, "rb");
    if (f) {
        if (verbose) {
            fprintf(stdout, "Opened file %s successfully\n", filename);
        }
        fseek(f, 0, SEEK_END);
        len = ftell(f);
        fseek(f, 0, SEEK_SET);
        if (verbose) {
            fprintf(stdout, "Allocating memory for buffer for %s\n", filename);
        }
        buffer = malloc(len + 1);
        if (buffer) {
            fread(buffer, 1, len, f);
            buffer[len] = '\0';          /* only terminate if malloc succeeded */
        }
        fclose(f);
    } else {
        fprintf(stderr, "Error reading file %s\n", filename);
        exit(1);
    }
    return buffer;
}

The rb mode is used because the file can sometimes be a spreadsheet, and in that case I still want its contents exactly as stored on disk, just as I would for a text file.

The program runs on a Linux machine, but the files to read come from both Linux and Windows.

I am not sure which approach is better to keep Windows line endings from messing with my code.

I was thinking of running dos2unix at the start of this function. I also thought of opening the file in r mode, but I believe that could mess things up when opening non-text files.

I would like to understand better the differences between using:

  1. dos2unix,
  2. r vs rb mode,
  3. or any other solution which would fit better the problem.

Note: I believe I understand r vs rb modes, but could you explain why each is a good or bad solution for this specific situation? (I think r wouldn't be good because the function sometimes opens spreadsheets, but I am not sure of that.)

user1527152
    Will this run on a non-Windows system, where you sometime read Windows-formatted text files (with the Windows `"\r\n"` line ending)? If you run this *on* a Windows system with a file containing Windows line endings, then using binary mode will not translate the `"\r\n"` into plain `"\n"`. If on the other hand you use this on a non-Windows system to read files *with* Windows line endings, then I suggest you handle them in the code that uses the text instead. – Some programmer dude Jul 05 '17 at 14:02
  • I updated my question to address your questions. Let me know if it is enough or if I should add more details. – user1527152 Jul 05 '17 at 14:13
  • What kind of files is your program reading? Text files or binary (non text) files? – Jabberwocky Jul 05 '17 at 14:14
  • Both, sometimes there is some spreadsheets. – user1527152 Jul 05 '17 at 14:15
  • And you do not know in advance? Or could you guess by file ending? What, if you read in a true binary file that by accident contains the byte sequence '\r', '\n'. If I imagine you dropped this from some .exe file... – Aconcagua Jul 05 '17 at 14:23
  • Yeah no I don't. That's my problem, I realized that sometimes I have the windows line ending remaining and I am trying to figure out what is the best way to get rid of it. Now I wonder if my question is not clear enough – user1527152 Jul 05 '17 at 14:26
  • @user1527152 what are you doping with the files once they have been read into memory? – Jabberwocky Jul 05 '17 at 14:27
  • Read line by line to process the information. It is supposed to be text (or csv) files, but sometimes it is formatted as a spreadsheet. – user1527152 Jul 05 '17 at 14:31
  • Related: [How to read / parse input in C? The FAQ](https://stackoverflow.com/a/35178521/2402272). – John Bollinger Jul 05 '17 at 14:37

2 Answers


If my understanding is correct the rb mode is used because sometimes the file can be a spreadsheet and therefore the programs just want the information as in a text file.

You seem uncertain, and though perhaps you do understand correctly, your explanation does not give me any confidence in that.

C knows about two distinct kinds of streams: binary streams and text streams. A binary stream is simply an ordered sequence of bytes, written and / or read as-is without any kind of transformation. On the other hand,

A text stream is an ordered sequence of characters composed into lines, each line consisting of zero or more characters plus a terminating new-line character. Whether the last line requires a terminating new-line character is implementation-defined. Characters may have to be added, altered, or deleted on input and output to conform to differing conventions for representing text in the host environment. Thus, there need not be a one-to-one correspondence between the characters in a stream and those in the external representation. [...]

(C2011 7.21.2/2)

For some implementations, such as POSIX-compliant ones, this is a distinction without a difference. For other implementations, such as those targeting Windows, the difference matters. In particular, on Windows, text streams convert on the fly between carriage-return / line-feed pairs in the external representation and newlines (only) in the internal representation.

The b in your fopen() mode specifies that the file should be opened as a binary stream -- that is, no translation will be performed on the bytes read from the file. Whether this is the right thing to do depends on your environment and the application's requirements. This is moot on Linux or another Unix, however, as there is no observable difference between text and binary streams on such systems.
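On a POSIX system this is easy to verify: a file containing Windows line endings reads back byte-for-byte the same in either mode. A minimal sketch (the helper name is mine, not from the question):

```c
#include <stdio.h>
#include <string.h>

/* Read up to n bytes from path using the given fopen() mode.
   Returns the number of bytes actually read, or 0 on failure. */
static size_t read_with_mode(const char *path, const char *mode,
                             char *out, size_t n)
{
    FILE *f = fopen(path, mode);
    if (!f)
        return 0;
    size_t got = fread(out, 1, n, f);
    fclose(f);
    return got;
}
```

On Linux, reading the same file with `"r"` and `"rb"` yields identical byte counts and contents; on Windows the `"r"` read would come back shorter, with CRLF pairs collapsed.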

dos2unix converts carriage-return / line-feed pairs in the input file to single line-feed (newline) characters. This will convert a Windows-style text file or one with mixed Windows / Unix line terminators to Unix text file convention. It is irreversible if there are both Windows-style and Unix-style line terminators in the file, and it is furthermore likely to corrupt your file if it is not a text file in the first place.

If your inputs are sometimes binary files then opening in binary mode is appropriate, and conversion via dos2unix probably is not. If that's the case and you also need translation for text-file line terminators, then you first and foremost need a way to distinguish which case applies for any particular file -- for example, by command-line argument or by pre-analyzing the file via libmagic. You then must provide different handling for text files; your main options are

  1. Perform the line terminator conversion in your own code.
  2. Provide separate versions of the fetchFile() function for text and binary files.
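Option 1 might look like the following in-place pass. This is a sketch of my own, not code from the question; it collapses only CRLF pairs, leaving lone carriage returns untouched, and assumes the buffer has room for len + 1 bytes, as fetchFile() allocates:

```c
#include <stddef.h>
#include <string.h>

/* Collapse CRLF pairs to LF in place; returns the new length.
   Assumes buf has capacity for at least len + 1 bytes. */
size_t strip_crlf(char *buf, size_t len)
{
    size_t r = 0, w = 0;
    while (r < len) {
        if (buf[r] == '\r' && r + 1 < len && buf[r + 1] == '\n')
            r++;                /* skip the CR of a CRLF pair */
        buf[w++] = buf[r++];
    }
    buf[w] = '\0';              /* keep the buffer NUL-terminated */
    return w;
}
```

Because the output is never longer than the input, the conversion can safely reuse the same buffer.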
John Bollinger
  • Ok I think I will go with doing the line terminator conversion while parsing. It seem to be the best way to do that. – user1527152 Jul 05 '17 at 17:00

The code just copies the contents of a file into an allocated buffer. The UNIX way (YMMV) is to memory-map the file instead of reading it. Much faster.

// untested code
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

void *mapfile(const char *name)
{
    int fd;
    struct stat st;

    if ((fd = open(name, O_RDONLY)) == -1)
        return NULL;
    if (fstat(fd, &st)) {
        close(fd);
        return NULL;
    }

    /* note the argument order: the fd comes before the offset */
    void *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);
    if (p == MAP_FAILED)
        p = NULL;

    return p;
} 

Something along these lines will work. Adjust settings if you want to write to the file as well.
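A self-contained variant (the name and signature are mine) that also reports the file's length, since a mapping carries no NUL terminator and the caller must track the size separately:

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a whole file read-only; on success stores its size in *len.
   Returns NULL on any failure (including an empty file, which mmap rejects). */
void *map_whole_file(const char *name, size_t *len)
{
    int fd = open(name, O_RDONLY);
    if (fd == -1)
        return NULL;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size == 0) {
        close(fd);
        return NULL;
    }

    void *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);
    if (p == MAP_FAILED)
        return NULL;

    *len = (size_t)st.st_size;
    return p;
}
```

The caller releases the mapping with munmap(p, len) when done, rather than free().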

Bjorn A.
    That doesn't really answer the question. – Jabberwocky Jul 05 '17 at 14:27
  • Maybe, maybe not. The code does the same and eliminates the \r\n issue. So now the OP has one more option, which is good. – Bjorn A. Jul 05 '17 at 14:29
  • Meh, there's one difference, the zero terminator. So my alternative must return the length of the file too. – Bjorn A. Jul 05 '17 at 14:43
    @BjornA. : How does it eliminate the `\r\n` issue? The code will still need to cope with files containing either CR or CR+LF. (which is not that hard a problem in any case). – Clifford Jul 05 '17 at 14:45
  • We don't have the full program available, so I'm just guessing here. I assume that the buffer is read/parsed right after the file contents were copied into the buffer. The parser can handle \r\n, can't it? The parser can just ignore the \r and 'detect' newline when it finds a \n. I'm not sure, but I think sscanf() can be used to skip \r (as it is white space anyway) and 'eat' the \n if \n is present in the format string. – Bjorn A. Jul 05 '17 at 14:53
  • @BjornA. : That's kind of my point - this is a comment on the method used to read the file, not an answer to the question of how to handle different text files. I think the answer is just to use text mode; the Windows files will be stripped of CR. If the files are written back and used on Windows, there are few Windows applications that cannot cope with Linux text files. (Notepad being an exception!). – Clifford Jul 05 '17 at 15:00
  • As I said, we're guessing here. :) OP's snippet needed some love and I'm pretty sure the rest of the program needs love too. Think of my reply as an outside-the-box approach with better, overall code as the end goal. Probably won't work. The thing is, the b in "rb" is ignored on Linux and no autoconversion will occur. If OP wants to read files with different formats and line endings, OP's program must handle it itself. – Bjorn A. Jul 05 '17 at 15:07