
I want to write a program that reads a very large CSV file. The program should read the columns by name and then print an entire column. However, it only prints one of the columns in the data: out of everything in the file, only the Unix Timestamp column is printed. I want the code to be able to print the other columns as well: Unix Timestamp,Date,Symbol,Open,High,Low,Close,Volume BTC,Volume USD

csv file:

Unix Timestamp,Date,Symbol,Open,High,Low,Close,Volume BTC,Volume USD
1605139200.0,2020-11-12,BTCUSD,15710.87,15731.73,15705.58,15710.01,1.655,26014.29
1605052800.0,2020-11-11,BTCUSD,15318,16000,15293.42,15710.87,1727.17,27111049.25
1604966400.0,2020-11-10,BTCUSD,15348.2,15479.49,15100,15318,1600.04,24521694.72
1604880000.0,2020-11-09,BTCUSD,15484.55,15850,14818,15348.2,2440.85,37356362.78
1604793600.0,2020-11-08,BTCUSD,14845.5,15672.1,14715.98,15484.55,987.72,15035324.13

Current code:

#include<stdio.h>
#include<stdlib.h>
void main()
{
    char buffer[1001]; //get line
    float timestampfile;
    FILE *fp;
    int i=1; //line
    fp = fopen("filename.csv", "r"); //used to read csv
    if(!fp)
    {
        printf("file not found"); //file not found
        exit(0);
    }
    fgets(buffer,1000, fp); //read line
    printf("Expected output print the first column:\n");
    while(feof(fp) == 0)
    {
        sscanf(buffer,"%f",&timestampfile); //read data line
        printf("%d: %f\n",i,timestampfile); //used to print data
        i++;
        fgets(buffer, 1000, fp);
    }
    printf("end of the column");
    fclose(fp);
}

Current output:

1: 1605139200.000000
2: 1605052800.000000
3: 1604966400.000000
4: 1604880000.000000
5: 1604793600.000000
end of the column
MDXZ
fire fireeyyy
  • Do you want to store the whole line in a single char array, or do you prefer to store each column in a separate variable? – Sam______ Dec 30 '20 at 06:45
  • With `sscanf(buffer,"%f",&timestampfile)` you are only parsing the timestamp of a row. Try `"%s\n"` as the format string (maybe without `\n`) – Ackdari Dec 30 '20 at 07:03
  • If you want to get all columns, you must tokenize the line, that is, split it at the separating commas. (If you don't have empty fields, `strtok` will work.) That will give you an array of strings. If you want to do further operations on the data, you will have to convert these strings to the appropriate data type. (`strtol` converts to long int, `strtod` converts to double, `strptime` converts to date/time data.) Alternatively, `sscanf` a more complex format that suits your data. In each case, make sure to handle badly formatted data. – M Oehm Dec 30 '20 at 07:07
  • There is no need for `-1` in the character count with `fgets()`, e.g. `fgets(buffer,sizeof buffer, fp);` is fine. Hint: keeping allocations 8-byte aligned can help your compiler optimize stack space. For your array size, `1024` is a good power-of-two choice that is 8-byte aligned. (Note the compiler is free to reserve some minimum block size regardless of what you request.) You will want to look at [**Why is while ( !feof (file) ) always wrong?**](https://stackoverflow.com/questions/5431941/why-is-while-feoffile-always-wrong) – David C. Rankin Dec 30 '20 at 17:03
  • Better to control your read-loop with the return of your read function, e.g. `while (fgets(buffer, sizeof buffer, fp)) { ... }` – David C. Rankin Dec 30 '20 at 17:06
  • Regarding `void main()`: there are only two valid signatures for `main()` (regardless of what some non-compliant compilers might allow). They are `int main( void )` and `int main( int argc, char *argv[] )` – user3629249 Dec 31 '20 at 08:06
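
The suggestions in the comments above (a richer `sscanf` format, and a read loop controlled by the return of `fgets()`) can be sketched roughly as follows. This is only an illustration that assumes every data row has exactly the nine columns shown in the question; `struct row`, `parse_row()` and `print_close()` are hypothetical names, not part of the original code.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical record type for one row of the question's CSV layout. */
struct row {
    double timestamp;
    char   date[11];        /* "YYYY-MM-DD" plus the nul terminator */
    char   symbol[16];
    double open, high, low, close, volume_btc, volume_usd;
};

/* Parse one data line into *r.  Returns 1 on success, 0 if the line does
 * not match the expected nine-column layout.  Note %lf for double targets
 * and the field widths that keep the two strings inside their arrays. */
int parse_row (const char *line, struct row *r)
{
    assert (line && r);
    int n = sscanf (line, "%lf,%10[^,],%15[^,],%lf,%lf,%lf,%lf,%lf,%lf",
                    &r->timestamp, r->date, r->symbol,
                    &r->open, &r->high, &r->low, &r->close,
                    &r->volume_btc, &r->volume_usd);
    return n == 9;          /* all nine conversions succeeded */
}

/* Print the Close column of every row that parses, controlling the loop
 * with the return of fgets() instead of feof().  Returns rows printed. */
size_t print_close (FILE *fp)
{
    char buf[1024];
    struct row r;
    size_t rows = 0;

    while (fgets (buf, sizeof buf, fp))
        if (parse_row (buf, &r))        /* the heading row fails and is skipped */
            printf ("%zu: %.2f\n", ++rows, r.close);

    return rows;
}
```

Once a row is parsed, any other column can be printed by picking a different member of the struct. The trade-off of this approach is that the column layout is fixed in the format string, so it does not select columns by name.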

2 Answers


You have started out in the right direction, but you have stumbled a bit in separating the comma-separated values. The standard C library provides all you need to split each line into its values.

Simple Implementation Using strtok()

The easiest implementation takes the filename to read and the index of the column to extract as the first two arguments to your program. You can then simply discard the heading row and output the requested value for the column index. That can be done with a simple loop that keeps track of the token number while calling strtok(). Recall that on the first call to strtok() the string to tokenize is passed as the first parameter; every successive call passes NULL as the first argument until no more tokens are found.

A short example would be:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAXC 1024       /* if you need a constant, #define one (or more) */
#define DELIM ",\n"

int main (int argc, char **argv) {
    
    if (argc < 3) { /* validate filename and column given as arguments */
        fprintf (stderr, "usage: %s filename column\n", argv[0]);
        return 1;
    }
    
    char buf[MAXC];                                 /* buffer to hold line */
    size_t ndx = strtoul (argv[2], NULL, 0);        /* column index to retrieve */
    FILE *fp = fopen (argv[1], "r");                /* file pointer */
    
    if (!fp) {  /* validate file open for reading */
        perror ("file open failed");
        return 1;
    }
    
    if (!fgets (buf, MAXC, fp)) {                   /* read / discard headings row */
        fputs ("error: empty file.\n", stderr);
        return 1;
    }
    
    while (fgets (buf, MAXC, fp)) {                 /* read / validate each line */
        char *p = buf;
        size_t i = 0;
        /* loop until the ndx token found */
        for (p = strtok(p, DELIM); p && i < ndx; p = strtok (NULL, DELIM))
            i++;
        if (i == ndx && p)  /* validate token found */
            puts (p);
        else {              /* handle error */
            fputs ("error: invalid index\n", stderr);
            break;
        }
    }
    
    fclose (fp);    /* close the file when done */
}

(note: strtok() treats a run of consecutive delimiters as a single delimiter, so it cannot be used when empty fields are a possibility, such as field1,field2,,field4,.... strsep() was suggested as a replacement for strtok() and it does handle empty fields, but it has shortcomings of its own.)

Example Use/Output

first column (index 0):

$ ./bin/readcsvbycol_strtok dat/largecsv.csv 0
1605139200.0
1605052800.0
1604966400.0
1604880000.0
1604793600.0

second column (index 1):

$ ./bin/readcsvbycol_strtok dat/largecsv.csv 1
2020-11-12
2020-11-11
2020-11-10
2020-11-09
2020-11-08

third column (index 2):

$ ./bin/readcsvbycol_strtok dat/largecsv.csv 2
BTCUSD
BTCUSD
BTCUSD
BTCUSD
BTCUSD

fourth column (index 3):

$ ./bin/readcsvbycol_strtok dat/largecsv.csv 3
15710.87
15318
15348.2
15484.55
14845.5

request out of range:

$ ./bin/readcsvbycol_strtok dat/largecsv.csv 9
error: invalid index
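
Regarding the note above about strtok() and empty fields: one possible way to keep empty fields is to split the line by hand with strcspn(). The following is a minimal sketch, not part of the program above; `split_fields()` is a hypothetical helper name, and it assumes fields never contain quoted commas.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Split line in place at each ',' and store up to max field pointers in
 * fields[].  Unlike strtok(), two consecutive commas produce an empty
 * string "" as a field.  Returns the number of fields stored.
 * Assumption: no quoted fields with embedded commas. */
size_t split_fields (char *line, char **fields, size_t max)
{
    assert (line && fields);
    size_t n = 0;

    line[strcspn (line, "\r\n")] = '\0';    /* trim the line ending */

    while (n < max) {
        size_t len = strcspn (line, ",");   /* length of the current field */
        fields[n++] = line;                 /* field starts here */
        if (line[len] == '\0')              /* no ',' left: last field */
            break;
        line[len] = '\0';                   /* terminate field at the ',' */
        line += len + 1;                    /* step past the ',' */
    }
    return n;
}
```

Like strtok(), this modifies the buffer, so make a copy of the line first if the original text is still needed.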

More Involved Example Displaying Headings as Menu

If you want to provide a short interface for the user to choose which column to output, you can count the columns available: the number of commas in the heading row, plus one, is the number of columns. You can then save the headings by allocating one pointer per column, then allocating storage for each heading and copying the heading to that storage. Finally, display the headings as a menu for the user to select from.

After determining which column to print, you simply read each line into your buffer, and then tokenize the line with either strtok() or strcspn() (the downside to strtok() is that it modifies the buffer, so if you need to preserve it, make a copy). strcspn() returns the length of the token, so it provides the advantage of not modifying the original and providing the number of characters in the token. Then you can output the column value and repeat until you run out of lines.

An example would be:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAXC 1024       /* if you need a constant, #define one (or more) */

int main (int argc, char **argv) {
    
    char buf[MAXC], *p = buf, **headings = NULL;
    size_t cols = 1, ndx = 0, nchr;
    /* use filename provided as 1st argument (stdin by default) */
    FILE *fp = argc > 1 ? fopen (argv[1], "r") : stdin;
    
    if (!fp) {  /* validate file open for reading */
        perror ("file open failed");
        return 1;
    }
    
    if (!fgets (buf, MAXC, fp)) {                       /* read / validate headings row */
        fputs ("error: empty file.\n", stderr);
        return 1;
    }
    
    while (*p && (p = strchr (p, ','))) {               /* loop counting ',' */
        cols++;
        p++;
    }
    p = buf;    /* reset p to start of buf */
    
    /* allocate cols pointers for headings */
    if (!(headings = malloc (cols * sizeof *headings))) {
        perror ("malloc-heading pointers");
        return 1;
    }
    
    /* loop separating headings, allocate/assign storage for each, copy to storage */
    while (*p && *p != '\n' && (nchr = strcspn (p, ",\n"))) {
        if (!(headings[ndx] = malloc (nchr + 1))) {     /* allocate/validate */
            perror ("malloc headings[ndx]");
            return 1;
        }
        memcpy (headings[ndx], p, nchr);                /* copy to storage */
        headings[ndx++][nchr] = 0;                      /* nul-terminate */
        p += nchr+1;                                    /* advance past ',' */
    }
    
    if (ndx != cols) {  /* validate ndx equals cols */
        fputs ("error: mismatched cols & ndx\n", stderr);
        return 1;
    }
    
    puts ("\nAvailable Columns:");                      /* display available columns */
    for (size_t i = 0; i < cols; i++)
        printf (" %2zu) %s\n", i, headings[i]);
    while (ndx >= cols) {                               /* get / validate selection */
        fputs ("\nSelection: ", stdout);
        if (!fgets (buf, MAXC, stdin)) {                /* read input (same buffer) */
            puts ("(user canceled input)");
            return 0;
        }
        if (sscanf (buf, "%zu", &ndx) != 1 || ndx >= cols)  /* convert/validate */
            fputs ("  error: invalid index.\n", stderr);
    }
    
    printf ("\n%s values:\n", headings[ndx]);           /* display column name */
    
    while (fgets (buf, MAXC, fp)) {                     /* loop displaying column */
        char column[MAXC];
        p = buf;
        /* skip forward ndx ',' */
        for (size_t col = 0; col < ndx && (p = strchr (p, ',')); col++, p++) {}
        /* read column value into column (p is NULL if the line has too few columns) */
        if (p && (nchr = strcspn (p, ",\n"))) {
            memcpy (column, p, nchr);                   /* copy */
            column[nchr] = 0;                           /* nul-terminate */
            puts (column);                              /* output */
        }
    }
    
    if (fp != stdin)   /* close file if not stdin */
        fclose (fp);
    
    for (size_t i = 0; i < cols; i++)   /* free all allocated memory */
        free (headings[i]);
    free (headings);
}

Example Use/Output

$ ./bin/readcsvbycol dat/largecsv.csv

Available Columns:
  0) Unix Timestamp
  1) Date
  2) Symbol
  3) Open
  4) High
  5) Low
  6) Close
  7) Volume BTC
  8) Volume USD

Selection: 1

Date values:
2020-11-12
2020-11-11
2020-11-10
2020-11-09
2020-11-08

Or the open values:

$ ./bin/readcsvbycol dat/largecsv.csv

Available Columns:
  0) Unix Timestamp
  1) Date
  2) Symbol
  3) Open
  4) High
  5) Low
  6) Close
  7) Volume BTC
  8) Volume USD

Selection: 3

Open values:
15710.87
15318
15348.2
15484.55
14845.5

Column out of range, then canceling input with Ctrl + d (Ctrl + z on Windows):

$ ./bin/readcsvbycol dat/largecsv.csv

Available Columns:
  0) Unix Timestamp
  1) Date
  2) Symbol
  3) Open
  4) High
  5) Low
  6) Close
  7) Volume BTC
  8) Volume USD

Selection: 9
  error: invalid index.

Selection: (user canceled input)

Both approaches accomplish the same thing; which one to use depends on what your program needs. Look things over and let me know if you have further questions.

David C. Rankin
  • A pretty thorough answer. You might add that the `strtok` approach is inappropriate if any of the field values are empty because `strtok()` considers any sequence of separators as a single separator. This is OK for white space, but leads to potentially incorrect results for other cases such as `,` or TAB separated values. – chqrlie Dec 30 '20 at 21:13
  • Good point. That is another reason the `strtok()`, `strsep()` struggle goes on .. and on .. and you know the rest. Will update (done and noted). – David C. Rankin Dec 30 '20 at 21:46
  • This looks like a very fine piece of code that could resolve this problem, but I am having issues with placing the filename to be read within the code. Could you explain where you have placed the `largecsv.csv`? – fire fireeyyy Dec 31 '20 at 00:31
  • Sure, if you look at `FILE *fp = argc > 1 ? fopen (argv[1], "r") : stdin;` the file being opened is the file provided as the first argument to the program. So if you look at my examples, I run `./bin/readcsvbycol dat/largecsv.csv` which is just `progname filename`. You never hardcode filenames (though setting a default is okay). You either get your filename as an argument to the program or by asking the user to input the filename. It is convenient to use the `argv` arguments to `main()` to pass such information (that's what it is for), otherwise prompt and have the user input the filename. – David C. Rankin Dec 31 '20 at 00:36
  • This is for the 2nd version that does not require the column number as an argument. Here `FILE *fp = argc > 1 ? fopen (argv[1], "r") : stdin;` will read from the filename given (if one is given), but will read from `stdin` by default if no argument is given (like all Linux utilities do). That allows you to pipe information to the program. In the first example, `FILE *fp = fopen (argv[1], "r");` is used because the `if()` statement above already checks that there are sufficient arguments. – David C. Rankin Dec 31 '20 at 00:38
  • @DavidC.Rankin I am trying to run the file from a different directory and it doesn't seem to work. I am quite new to setting up files. – fire fireeyyy Dec 31 '20 at 21:53
  • So long as you give the correct relative (or absolute) path as the first argument (and in the first example also the column index to read), the file will be opened no matter where it is on your system. If you are on Windows, make sure the file you are trying to read is an ASCII file, not some UTF-16 encoded file like the VS Code or VS Editor produces by default. I suspect that may be an issue. Use an editor that will allow you to set the character set and line endings (simple Linux `'\n'` instead of DOS `"\r\n"`). Geany and Notepad++ are both good. – David C. Rankin Jan 01 '21 at 00:46
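
On the line-ending point in the last comment: if a file with DOS `"\r\n"` endings is read on a system that does not translate them, the `'\r'` would stay attached to the last column when only `",\n"` is used as the delimiter set. One common idiom for trimming either ending after fgets() is sketched below; `trim_eol()` is a hypothetical helper name, not something the answer's code defines.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Remove a trailing "\n" or "\r\n" from a line read with fgets().
 * strcspn() returns the index of the first '\r' or '\n' (or the index of
 * the nul terminator if neither is present), so writing a nul there is
 * always safe.  Returns the new length of the line. */
size_t trim_eol (char *line)
{
    assert (line);
    size_t len = strcspn (line, "\r\n");
    line[len] = '\0';
    return len;
}
```

Calling it right after each successful fgets() leaves a line with no ending, whichever convention the file used.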

In order to extract more than one field by name, you must get the names of the fields to extract, for example as command line arguments, determine the corresponding columns, and for each line of the CSV file, output the requested columns.

Below is a simple program that extracts columns from a CSV file and produces another CSV file. It uses neither strtok() nor strchr(); instead it analyses the line one character at a time to find the starting and ending offsets of the columns and acts accordingly. The source file is passed as redirected input and the output can be redirected to a different CSV file.

Here is the code:

#include <stdio.h>
#include <string.h>

int find_header(const char *line, const char *name) {
    int len = strlen(name);
    int i, n, s;
    for (i = n = s = 0;; i++) {
        if (line[i] == ',' || line[i] == '\n' || line[i] == '\0') {
            if (len == i - s && !memcmp(line + s, name, len))
                return n;
            if (line[i] != ',')
                return -1;
            s = i + 1;
            n++;
        }
    }
}

int main(int argc, char *argv[]) {
    char buffer[1002];
    int field[argc];
    char *name[argc];
    int i, n;

    if (argc < 2) {
        printf("usage: csvcut FIELD1 [FIELD2 ...] < CSVFILE\n");
        return 2;
    }

    // read the input header line
    if (!fgets(buffer, sizeof buffer, stdin)) {
        fprintf(stderr, "missing header line\n");
        return 1;
    }
    // determine which columns to extract
    for (n = 0, i = 1; i < argc; i++) {
        int f = find_header(buffer, argv[i]);
        if (f < 0) {
            fprintf(stderr, "field not found: %s\n", argv[i]);
        } else {
            name[n] = argv[i];
            field[n] = f;
            n++;
        }
    }
    // output new header line
    for (i = 0; i < n; i++) {
        if (i > 0)
            putchar(',');
        printf("%s", name[i]);
    }
    putchar('\n');
    // parse the records, output the selected fields
    while (fgets(buffer, sizeof buffer, stdin)) {
        for (i = 0; i < n; i++) {
            int j, s, f, start, length;
            if (i > 0)
                putchar(',');
            // find field boundaries
            for (j = s = f = start = length = 0;; j++) {
                if (buffer[j] == ',' || buffer[j] == '\n' || buffer[j] == '\0') {
                    if (f == field[i]) {
                        start = s;
                        length = j - s;
                        break;
                    }
                    if (buffer[j] != ',')
                        break;
                    s = j + 1;
                    f++;
                }
            }
            printf("%.*s", length, buffer + start);
        }
        putchar('\n');
    }
    return 0;
}

Sample run:

./csvcut Date Close < sample.csv
Date,Close
2020-11-12,15710.01
2020-11-11,15710.87
2020-11-10,15318
2020-11-09,15348.2
2020-11-08,15484.55

Note that fields cannot contain embedded commas. The program could be extended to handle quoted contents to support these.
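
As a rough idea of what such an extension might look like, the boundary search can track whether it is inside double quotes and ignore commas while it is. This is only a sketch under stated assumptions: `find_field()` is a hypothetical helper, not part of the program above; the quotes are kept as part of the field; and quoted fields that span several lines are not supported.

```c
#include <assert.h>
#include <stddef.h>

/* Locate field number index (0-based) in a CSV line, treating commas
 * inside double quotes as part of the field.  On success stores the
 * field's offset and length (quotes included) and returns 1; returns 0
 * if the line has fewer fields.  An escaped quote ("") toggles the state
 * twice, so it leaves the quoted state unchanged. */
int find_field (const char *line, int index, size_t *start, size_t *length)
{
    assert (line && start && length);
    size_t i, s = 0;
    int f = 0, quoted = 0;

    for (i = 0;; i++) {
        char c = line[i];
        if (c == '"')
            quoted = !quoted;               /* enter or leave quotes */
        if (c == '\0' || c == '\n' || (c == ',' && !quoted)) {
            if (f == index) {               /* requested field ends here */
                *start = s;
                *length = i - s;
                return 1;
            }
            if (c != ',')                   /* end of line: field not found */
                return 0;
            s = i + 1;                      /* next field starts after ',' */
            f++;
        }
    }
}
```

The result can be printed the same way as in the program above, e.g. `printf("%.*s", (int)length, line + start);`.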

chqrlie