Reading CSV (comma-separated values) files is hard in the general case, where fields can be enclosed in double quotes (and can then contain commas, and doubled-up double quotes to embed a double quote), and where a single field can extend over multiple lines.
In your data, you don't have to worry about those special cases. Instead, you've imposed an inconsistency because you split the name field into two based on the space separating them. As long as you don't have "Alice Betty Clarke" as a name in the data, you can still do it.
You attempt to use:
read = fscanf(file, "%s,%s,%f,%f,%f\n",
              students[records].Name.first,
              students[records].Name.last,
              &students[records].grades[0],
              &students[records].grades[1],
              &students[records].grades[2]);
This alone has multiple problems:
- You attempt to read the names separated by a comma, but they're separated by a space.
- You put a newline (white space) at the end of the format string.
- The second %s will read up to white space, which means it will gobble up the comma and the numbers.
- You don't prevent buffer overflows from overlong names.
The solutions to these problems are:
- Space between the names: this is easily fixed. Replace the first comma in the format string with a blank (or omit it altogether: "%s%s" reads two words separated by white space).
- Trailing newline: see "What is the effect of trailing white space in a scanf() format string?". When you are reading from a file, as in your code, it isn't quite as serious as when you are reading what the user types at the terminal, where trailing white space in a format string is a catastrophic UI/UX blunder. The fix is trivial: omit the \n from the format string. The next call will skip leading white space, including newlines left over from the prior call.
- Greedy second %s: use a negated scan set, %[^,]. You could use a scan set for the first field too, for simplicity and consistency.
- Buffer overflows: limit the length of the inputs: "%19[^, ] %19[^, ],%f,%f,%f". Note that there are three conversion specifiers that do not skip leading white space: %c, %[…] (scan sets) and %n. When using the scan sets, it is necessary to include the white space between the conversion specifications. With fscanf(), that also means a blank before the first scan set (" %19[^, ] %19[^, ],%f,%f,%f"), since otherwise the newline left over from the previous line is read as part of the first name.
You have experimented with various values for your:
if (read == 4)
    records++;
Since you are attempting to read 5 values, you should test for 5; if you don't get 5, there is either EOF (return value EOF), some sort of encoding error (unlikely, but the return value would also be EOF), or a data format error (the return value is in the range 0..4). You should exit the loop on receiving EOF. With a data format error, if you want to continue, you should probably read and ignore data up to the next newline:
int c;
/* Read from the same stream that fscanf() was reading, not from stdin */
while ((c = getc(file)) != EOF && c != '\n')
    ;
It may be more sensible to abandon ship immediately. Alternatively, count the number of erroneous records, read the rest of the file so further erroneous records can be reported, and probably abandon further processing after EOF is finally detected.
You should ensure that you don't try to read more records than will fit in the array.
You can improve the error reporting by reading whole lines using fgets() or POSIX getline() and then passing the line to sscanf(). Note that if you do this, you might want to check for garbage after the third number, probably using the %n conversion specification to identify where the conversions stopped and ensuring that there are no non-blank characters after the number. The scanf() family of functions do not count the %n conversions in the return value.
Note that error messages should be written to stderr, not to stdout. Also, you should not call a function that opens a file (such as fopen() or open()) with a string literal for the file name. You must check that the open succeeded, and if not, report the error (on standard error - stderr) and you should include the file name in the error message. To avoid repetition, you should pass a variable that points to the file name to the open function, and can then use that variable when formatting the error message too. You can use perror() to report the problem if you don't have a better mechanism. For example:
const char *filename = "data.csv";
FILE *fp = fopen(filename, "r");
if (fp == NULL)
{
    perror(filename);
    exit(EXIT_FAILURE);
}
Putting all these changes and refinements together, you might end up with code like this:
#include <ctype.h>
#include <stdio.h>
#include <string.h>

struct Name
{
    char first[20];
    char last[20];
};

struct Student
{
    struct Name name;
    float grades[3];
    float average;
};

/* Return 1 if buffer holds only white space up to the terminating null */
static int trailing_white_space_only(const char *buffer)
{
    const unsigned char *data = (const unsigned char *)buffer;
    while (*data != '\0' && isspace(*data))
        data++;
    return *data == '\0';
}

int main(void)
{
    const char *filename = "data.csv";
    FILE *fp = fopen(filename, "r");
    if (fp == NULL)
    {
        fprintf(stderr, "Error opening file '%s' for reading\n", filename);
        return 1;
    }

    enum { MAX_STUDENTS = 5 };
    struct Student students[MAX_STUDENTS];
    int n_fields = 0;
    int records = 0;
    int lineno = 0;
    int fail = 0;
    char buffer[2048];

    while (records < MAX_STUDENTS && fgets(buffer, sizeof(buffer), fp) != NULL)
    {
        buffer[strcspn(buffer, "\n")] = '\0';   /* Remove the newline, if any */
        lineno++;
        int offset = 0;
        /* %n records where the conversions stopped; it is not counted */
        n_fields = sscanf(buffer, "%19[^, ] %19[^, ],%f,%f,%f%n",
                          students[records].name.first,
                          students[records].name.last,
                          &students[records].grades[0],
                          &students[records].grades[1],
                          &students[records].grades[2],
                          &offset);
        if (n_fields == 5)
        {
            if (trailing_white_space_only(&buffer[offset]))
                records++;
            else
            {
                fprintf(stderr, "Trailing junk on line %d\n  [%s]\n",
                        lineno, buffer);
                fail++;
            }
        }
        else
        {
            fail++;
            fprintf(stderr, "Format error on line %d (field %d)\n  [%s]\n",
                    lineno, n_fields + 1, buffer);
        }
    }
    fclose(fp);

    if (fail == 0)
        printf("\n%d records read successfully.\n\n", records);
    else
        printf("\n%d records read successfully (and %d invalid records "
               "were discarded).\n\n", records, fail);

    for (int i = 0; i < records; i++)
    {
        /* 19 + 1 + 19 + 1 bytes: exactly sizeof(struct Name) */
        char name[sizeof(struct Name)];
        snprintf(name, sizeof(name), "%.19s %.19s",
                 students[i].name.first, students[i].name.last);
        printf("%-39s %6.2f %6.2f %6.2f\n", name,
               students[i].grades[0],
               students[i].grades[1],
               students[i].grades[2]);
    }
    printf("\n");
    return 0;
}
With the data file data.csv from the question, the output is:
4 records read successfully.
Iskandar Kholmatov 100.00 100.00 100.00
George Washington 90.00 50.00 100.00
Dennis Ritchie 90.00 0.00 10.00
Bill Gates 60.00 50.00 77.00
Now consider this variant data file, which has bad data on lines 3, 5 and 6:
Iskandar Kholmatov,100,100,100
George Washington,90,50,100
Garbage Disposal,read,me,a,riddle
Dennis Ritchie,90,0,10
Steve Jobs,60,70,80,
Betty Alice Clarke,94,95,97
Bill Gates,60,50,77
The output is:
Format error on line 3 (field 3)
[Garbage Disposal,read,me,a,riddle]
Trailing junk on line 5
[Steve Jobs,60,70,80,]
Format error on line 6 (field 3)
[Betty Alice Clarke,94,95,97]
4 records read successfully (and 3 invalid records were discarded).
Iskandar Kholmatov 100.00 100.00 100.00
George Washington 90.00 50.00 100.00
Dennis Ritchie 90.00 0.00 10.00
Bill Gates 60.00 50.00 77.00
There are still many ways you might improve the program. For example, if there are more records in the file than fit in the array, you could read and diagnose the excess records (reporting errors too). Or you could revise the code to dynamically allocate the array of students and grow the array when necessary.