
We have code that reads files line by line and stores each line as a string. The problem is that Linux ends each line with '\n', Windows with '\r\n', and classic Mac OS with '\r'. We want to replace all of these line-ending characters with '\0' so that we get just our string.

Is there any other character added by any other OS? We need to handle that.

In response to the comments below, I tried opening the file in text mode:

file = fopen(filename, "rt");
while (fgets(buf, sizeof buf, file) != NULL)   /* read from the stream, not from buf */
{
    (*properties)->url = (char *) malloc(strlen(buf) + 1);
    strcpy((*properties)->url, buf);
    /* ... more code ... */
}

//line x-->
strncpy(url,properties->url,strlen(properties->url));

At line x, gdb prints "https://example.com/file\r\n\0" for properties->url, whereas my URL is "https://example.com/file".

Sateesh
  • Maybe [this will help you](https://stackoverflow.com/questions/2693776/removing-trailing-newline-character-from-fgets-input). Check the bounty-awarded answer. – machine_1 May 09 '18 at 09:57
  • Unfortunately, there is little way to cover all the unknown exotic OSes (those designed by some obscure raving maniac). You should probably limit your scope to well-established OSes first. – dvhh May 09 '18 at 09:57
  • *Is there any other character added by any other OS? We need to handle that.* Why? The C standard has a clearly-defined "text" mode for streams that translates all newlines into `\n` characters, – Andrew Henle May 09 '18 at 10:02
  • Your problem is that you are reading it as a binary file. If you read or write it as a text file, then you don't care how the OS interprets the '\n'. – малин чекуров May 09 '18 at 10:04
  • And good luck replacing the OS-supplied record (line) boundaries with NULs on a [`RECFM=VB` dataset on IBM's MVS/OS](https://www.ibm.com/support/knowledgecenter/zosbasics/com.ibm.zos.zconcepts/zconcepts_159.htm). – Andrew Henle May 09 '18 at 10:05
  • Hi, I updated the code to read in text mode, but the problem does not go away. – Sateesh May 09 '18 at 11:36
  • The so-called "universal newline" set (see e.g. the Wikipedia article on [newlines](https://en.wikipedia.org/wiki/Newline)) is pretty much `\r\n`, `\n\r`, `\r`, and `\n`. The others use character sets that are not compatible with ASCII/UTF-8, and will have to be converted anyway. – Nominal Animal May 09 '18 at 11:38

1 Answer

There are three relatively common newline conventions: \r\n, \n, and \r, and a fourth one that can occur when an editor gets confused about the newline convention, \n\r. If an approach supports universal newlines, it supports all four simultaneously, even if mixed.
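If the lines are already sitting in a buffer filled by fgets(), as in the question, one way to cope with all four conventions is to cut the string at the first carriage return or line feed. The following is a minimal sketch of that idea; the helper name strip_newline is made up for illustration and does not appear in the question.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: truncate buf at the first '\r' or '\n'
   and return the length of the remaining string. */
size_t strip_newline(char *buf)
{
    assert(buf != NULL);

    /* strcspn() returns the number of leading characters
       that are neither '\r' nor '\n'. */
    size_t len = strcspn(buf, "\r\n");

    buf[len] = '\0';
    return len;
}
```

Calling strip_newline(buf) right after a successful fgets() would turn "https://example.com/file\r\n" into "https://example.com/file". Note that fgets() itself only splits on '\n', so this only cleans up the end of each line; a file that uses bare '\r' separators would still arrive as one long line.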

Reading files line by line with universal newline support is easy. The only problem is that interactive input from line-buffered sources can look as if it is read one line late. To avoid that, one can read each line into a dynamic buffer up to, but not including, the newline, and consume that newline when reading the next line. For example:

#include <sys/types.h>  /* ssize_t (POSIX) */
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <stdio.h>

ssize_t  getline_universal(char **dataptr, size_t *sizeptr, FILE *in)
{
    char   *data = NULL;
    size_t  size = 0;
    size_t  used = 0;
    int     c;

    if (!dataptr || !sizeptr || !in) {
        errno = EINVAL;
        return -1;
    }

    if (*sizeptr) {
        data = *dataptr;
        size = *sizeptr;
    } else {
        *dataptr = data;
        *sizeptr = size;
    }

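    /* Ensure there are at least 2 chars available.
       realloc() is used so that a caller-supplied 1-char buffer is not leaked;
       realloc(NULL, size) behaves like malloc(size). */
    if (size < 2) {
        size = 2;
        data = realloc(data, size);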
        if (!data) {
            errno = ENOMEM;
            return -1;
        }
        *dataptr = data;
        *sizeptr = size;
    }

    /* Consume leading newline. */
    c = fgetc(in);
    if (c == '\n') {
        c = fgetc(in);
        if (c == '\r')
            c = fgetc(in);
    } else
    if (c == '\r') {
        c = fgetc(in);
        if (c == '\n')
            c = fgetc(in);
    }

    /* No more data? */
    if (c == EOF) {
        data[used] = '\0';
        errno = 0;
        return -1;
    }

    while (c != '\n' && c != '\r' && c != EOF) {

        if (used + 1 >= size) {
            if (used < 7)
                size = 8;
            else
            if (used < 1048576)
                size = (3 * used) / 2;
            else
                size = (used | 1048575) + 1048577;

            data = realloc(data, size);
            if (!data) {
                errno = ENOMEM;
                return -1;
            }

            *dataptr = data;
            *sizeptr = size;
        }

        data[used++] = c;
        c = fgetc(in);
    }

    /* Terminate line. We know used < size. */
    data[used] = '\0';

    /* Do not consume the newline separator. */
    if (c != EOF)
        ungetc(c, in);

    /* Done. */
    errno = 0;
    return used;
}

The above function works much like POSIX.1-2008 getline(), except that it supports all four newline conventions (even mixed), and that it omits the newline from the line read. (That is, the newline is not included in either the return value or the dynamically allocated buffer. The newline is left in the stream, and consumed by the next getline_universal() operation.)
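To make the pairing rule easier to see in isolation, here is a small sketch of just the "consume leading newline" step. The helper name consume_newline is hypothetical and is not part of the function above; it assumes, as the function does, that "\r\n" and "\n\r" each count as a single newline, while "\n\n" and "\r\r" count as two.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: consume at most one newline sequence
   ("\n", "\r", "\r\n", or "\n\r") from the stream.
   Returns 1 if a newline was consumed, 0 otherwise. */
int consume_newline(FILE *in)
{
    assert(in != NULL);

    int c = fgetc(in);
    if (c != '\n' && c != '\r') {
        /* Not a newline: put the character back, if there was one. */
        if (c != EOF)
            ungetc(c, in);
        return 0;
    }

    /* A two-character newline consists of the *other* character
       of the CR/LF couple; anything else is pushed back. */
    int other = (c == '\n') ? '\r' : '\n';
    int d = fgetc(in);
    if (d != other && d != EOF)
        ungetc(d, in);

    return 1;
}
```

Because only the opposite character is accepted as the second half of a pair, two consecutive identical characters are treated as two separate newlines, which is what preserves empty lines.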

Unlike standard functions, getline_universal() always sets errno: to zero if successful, and nonzero otherwise. If you don't like the behaviour, feel free to change that.

As a usage example:

int main(void)
{
    unsigned long  linenum = 0u;
    char          *line_buf = NULL;
    size_t         line_max = 0;
    ssize_t        line_len;

    while (1) {
        line_len = getline_universal(&line_buf, &line_max, stdin);
        if (line_len < 0)
            break;

        linenum++;

        printf("%lu: \"%s\" (%zd chars)\n", linenum, line_buf, line_len);
        fflush(stdout);
    }

    if (errno) {
        fprintf(stderr, "Error reading from standard input: %s.\n", strerror(errno));
        return EXIT_FAILURE;
    }

    /* Not necessary before exiting, but here's how to
       safely discard the line buffer: */
    free(line_buf);
    line_buf = NULL;
    line_max = 0;
    line_len = 0;

    return EXIT_SUCCESS;
}

Note that because free(NULL) is safe, you can discard the buffer (using free(line_buf); line_buf = NULL; line_max = 0;) before any call to getline_universal(&line_buf, &line_max, stream).

Nominal Animal
  • Any reason why you can't simply do `bool new_line = false; if(isspace(ch)) { while(ch == '\n' || ch == '\r') { new_line = true; ch = fgetc(in); } }` ? – Lundin May 09 '18 at 11:52
  • @Lundin: That would skip (consume) empty lines, too. – Nominal Animal May 09 '18 at 12:36
  • Not if you check the provided bool variable. – Lundin May 09 '18 at 12:43
  • @Lundin: The loop `while (ch=='\n' || ch=='\r') { new_line=true; ch=fgetc(in); }` most definitely consumes empty lines. That is why I didn't use it. As written, the code in my answer works fine even with line-buffered interactive input, and does not skip empty lines. Its only downside is that it consumes the newlines without telling the caller exactly what kind of newline there was. – Nominal Animal May 09 '18 at 14:35
  • "function works much like ... getline()" has additional differences: It can set `errno = 0` - most unexpected. It leaves the end-of-line character in the stream. It returns -1 if the last line is an empty line. – chux - Reinstate Monica Jun 27 '18 at 16:25
  • @chux: I added explicit mentions of `errno` and newline character behaviour. However, it does not return -1 if the last line is empty, unless the entire stream was empty. (When you read a line, the newline is left in the stream. The next call consumes it. If there is no further data, the previous line was the last line in the file. If there is data but no newline, this call will return the data, and next call -1. If there is data and a newline, this call will return the data, next call will consume the newline and return -1.) – Nominal Animal Jun 27 '18 at 16:49
  • The weird newline behaviour is necessary for handling universal newlines while still treating line-buffered interactive input sanely. – Nominal Animal Jun 27 '18 at 16:50