I'm reading stdin, and the input sometimes has Unix-style and sometimes Windows-style newlines.
How do I consume either type of newline?
Assuming you know there will be a newline, the solution is to consume one character, and then decide:
10 - LF ... Unix-style newline
13 - CR ... first character of a Windows-style newline (CR LF)
If it's 13, you have to consume one more character (the LF, 10).
int x = fgetc(stdin); // Consume LF or CR (int, not char, so EOF can be detected)
if (x == 13) fgetc(stdin); // CR: consume the LF that follows
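The two lines above assume that a CR is always followed by an LF. A slightly more defensive variant, sketched here under the assumption that a bare CR should also count as a newline, peeks at the next character and pushes it back if it is not part of the newline. The name consume_newline is made up for this illustration.

```c
#include <stdio.h>

/* Hypothetical helper, not part of the answer above: consume one newline
 * sequence (LF, CR, or CR LF) from the stream.
 * Returns 1 if a newline was consumed, 0 otherwise. */
int consume_newline(FILE *in)
{
    int c = getc(in);           /* int, so EOF is distinguishable */

    if (c == '\n')
        return 1;               /* LF alone */

    if (c == '\r') {
        int next = getc(in);    /* may be the LF of a CR LF pair */
        if (next != '\n' && next != EOF)
            ungetc(next, in);   /* not part of the newline; put it back */
        return 1;
    }

    if (c != EOF)
        ungetc(c, in);          /* not a newline at all; leave it unread */
    return 0;
}
```

Note that reading the character after a CR can block on interactive input when the CR is the whole newline, so this sketch is best suited to files and pipes.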
There are a few more newline conventions than that. In particular, all four involving CR (\r) and LF (\n), namely \n, \r, \r\n, and \n\r, are actually encountered in the wild.
For reading text input, possibly interactively, and supporting all of those four newline encodings at the same time, I recommend using a helper function something like the following:
#include <stdlib.h>
#include <string.h>
#include <locale.h>
#include <ctype.h>
#include <stdio.h>
#include <errno.h>

size_t get_line(char **const lineptr, size_t *const sizeptr, char *const lastptr, FILE *const in)
{
    char *line;
    size_t size, have;
    int c;

    if (!lineptr || !sizeptr || !in) {
        errno = EINVAL; /* Invalid parameters! */
        return 0;
    }

    if (*lineptr) {
        line = *lineptr;
        size = *sizeptr;
    } else {
        line = NULL;
        size = 0;
    }

    have = 0;

    if (lastptr) {
        if (*lastptr == '\n') {
            c = getc(in);
            if (c != '\r' && c != EOF)
                ungetc(c, in);
        } else
        if (*lastptr == '\r') {
            c = getc(in);
            if (c != '\n' && c != EOF)
                ungetc(c, in);
        }
        *lastptr = '\0';
    }

    while (1) {

        if (have + 2 >= size) {
            /* Reallocation policy; my personal quirk here.
             * You can replace this with e.g. have + 128,
             * or (have + 2)*3/2 or whatever you prefer. */
            size = (have | 127) + 129;
            line = realloc(line, size);
            if (!line) {
                errno = ENOMEM; /* Out of memory */
                return 0;
            }
            *lineptr = line;
            *sizeptr = size;
        }

        c = getc(in);
        if (c == EOF) {
            if (lastptr)
                *lastptr = '\0';
            break;
        } else
        if (c == '\n') {
            if (lastptr)
                *lastptr = c;
            else {
                c = getc(in);
                if (c != EOF && c != '\r')
                    ungetc(c, in);
            }
            break;
        } else
        if (c == '\r') {
            if (lastptr)
                *lastptr = c;
            else {
                c = getc(in);
                if (c != EOF && c != '\n')
                    ungetc(c, in);
            }
            break;
        }

        if (iscntrl(c) && !isspace(c))
            continue;

        line[have++] = c;
    }

    if (ferror(in)) {
        errno = EIO; /* I/O error */
        return 0;
    }

    line[have] = '\0';
    errno = 0; /* No errors, even if have were 0 */
    return have;
}

int main(void)
{
    char *data = NULL;
    size_t size = 0;
    size_t len;
    char last = '\0';

    setlocale(LC_ALL, "");

    while (1) {
        len = get_line(&data, &size, &last, stdin);
        if (errno) {
            fprintf(stderr, "Error reading standard input: %s.\n", strerror(errno));
            return EXIT_FAILURE;
        }
        if (!len && feof(stdin))
            break;

        printf("Read %lu characters: '%s'\n", (unsigned long)len, data);
    }

    free(data);
    data = NULL;
    size = 0;

    return EXIT_SUCCESS;
}
Except for the errno constants I used (EINVAL, ENOMEM, and EIO), the above code is C89, and should be portable.
The get_line() function dynamically reallocates the line buffer to be long enough when necessary. For interactive inputs, you must accept a newline at the first newline-ish character you encounter (as trying to read the second character would block, if the first character happens to be the only newline character). If specified, the one-character state at lastptr is used to detect and handle correctly any two-character newlines at the start of the next line read. If not specified, the function will attempt to consume the entire newline as part of the current line (which is okay for non-interactive inputs, especially files).
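To show the one-character state idea in isolation, here is a minimal sketch that counts newlines in a stream while treating \n, \r, \r\n, and \n\r each as a single newline, without ever reading ahead. The function name count_newlines and the variable last are made up for this illustration; last plays the same role as the character at lastptr in get_line().

```c
#include <stdio.h>

/* Hypothetical illustration: count newlines in `in`, treating
 * \n, \r, \r\n and \n\r each as one newline. No lookahead is used;
 * the decision about a second newline character is deferred until
 * that character actually arrives. */
size_t count_newlines(FILE *in)
{
    size_t count = 0;
    int last = 0;   /* first half of a possible two-character newline */
    int c;

    while ((c = getc(in)) != EOF) {
        if (c == '\n' || c == '\r') {
            if (last && c != last) {
                last = 0;       /* second half of \r\n or \n\r */
                continue;       /* already counted with the first half */
            }
            count++;            /* new newline; remember its first half */
            last = c;
        } else {
            last = 0;           /* ordinary character resets the state */
        }
    }
    return count;
}
```

Two identical characters in a row (\n\n or \r\r) count as two newlines, since none of the four conventions repeats the same character.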
The newline is not stored or counted in the line length. For added ease of use, the function also skips non-whitespace control characters. Embedded nul characters (\0) in particular often cause headaches, so having the function skip those altogether is often a robust approach.
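The same filter can be applied to a buffer that is already in memory. The following is only a sketch with a made-up name, strip_controls, and it assumes the caller provides a destination with room for len + 1 bytes; it uses the same iscntrl() and isspace() test as get_line().

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: copy `len` bytes from src to dst, dropping
 * control characters that are not whitespace (this includes embedded
 * '\0' bytes). dst must have room for len + 1 bytes.
 * Returns the number of bytes kept; dst is nul-terminated. */
size_t strip_controls(char *dst, const char *src, size_t len)
{
    size_t i, kept = 0;

    for (i = 0; i < len; i++) {
        /* Cast to unsigned char: the <ctype.h> functions require it. */
        unsigned char c = (unsigned char)src[i];

        if (iscntrl(c) && !isspace(c))
            continue;           /* skip e.g. '\0' or '\a', keep '\t' */

        dst[kept++] = (char)c;
    }
    dst[kept] = '\0';
    return kept;
}
```

Which bytes count as control or whitespace characters depends on the current locale, just as in get_line().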
As a final touch, the function always sets errno (to zero if no error occurred, to a nonzero error code otherwise), including in ferror() cases, so detecting error conditions is trivial.
The above code snippet includes a main(), which reads and displays input lines, using the current locale for the meaning of "non-whitespace control character" (!isspace(c) && iscntrl(c)).
Although this is definitely not the fastest mechanism to read input, it is not that slow, and it is a very robust one.
Questions?