
I have over 100,000 CSV files in the format below:

1,1,5,1,1,1,0,0,6,6,1,1,1,0,1,0,13,4,7,8,18,20,,,,,,,,,,,,,,,,,,,,,,
1,1,5,1,1,1,0,1,6,5,1,1,1,0,1,0,4,7,8,18,20,,,,,,,,,,,,,,,,,,,,,,,
1,1,5,1,1,1,0,2,6,5,1,1,1,0,1,0,4,7,8,18,20,,,,,,,,,,,,,,,,,,,,,,,
1,1,5,1,1,1,0,3,6,5,1,1,1,0,1,0,13,4,7,8,20,,,,,,,,,,,,,,,,,,,,,,,
1,1,5,1,1,1,0,4,6,5,1,1,1,0,1,0,13,4,7,8,20,,,,,,,,,,,,,,,,,,,,,,,
1,1,5,1,1,1,0,5,6,4,1,0,1,0,1,0,4,8,18,20,,,,,,,,,,,,,,,,,,,,,,,,
1,1,5,1,1,1,0,6,6,5,1,1,1,0,1,0,4,7,8,18,20,,,,,,,,,,,,,,,,,,,,,,,
1,1,5,1,1,1,0,7,6,5,1,1,1,0,1,0,13,4,7,8,20,,,,,,,,,,,,,,,,,,,,,,,
1,1,5,1,1,1,0,8,6,5,1,1,1,0,1,0,13,4,7,8,20,,,,,,,,,,,,,,,,,,,,,,,
1,1,5,1,1,2,0,0,12,12,1,2,4,1,1,0,13,4,7,8,18,20,21,25,27,29,31,32,,,,,,,,,,,,,,,,

All I need is field 10 and fields 17 onward. Field 10 is a counter indicating how many integers are stored starting from field 17, i.e. what I need is:

6,13,4,7,8,18,20
5,4,7,8,18,20
5,4,7,8,18,20
5,13,4,7,8,20
5,13,4,7,8,20
4,4,8,18,20
5,4,7,8,18,20
5,13,4,7,8,20
5,13,4,7,8,20
12,13,4,7,8,18,20,21,25,27,29,31,32

The maximum number of integers to read is 28. I can easily achieve this with getline in C++; however, from my previous experience, since I need to handle over 100,000 such files and each file may have 300,000~400,000 such lines, reading in the data with getline and building a vector<vector<int>> may have serious performance issues for me. I tried to use fscanf to achieve this:

while (!feof(stream)){
 fscanf(stream,"%*d,%*d,%*d,%*d,%*d,%*d,%*d,%*d,%*d,%d",&MyCounter);
 fscanf(stream,"%*d,%*d,%*d,%*d,%*d,%*d"); // skip to column 17
 for (int i=0;i<MyCounter;i++){
  fscanf(stream,"%d",&MyIntArr[i]);
 }
 fscanf(stream,"%*s"); // to finish the line
}

However, this calls fscanf multiple times per line and may also create a performance issue. Is there any way to read a variable number of integers in one call with fscanf? Or should I read each line into a string and then strsep/stoi it? Compared to fscanf, which is better from a performance point of view?

Chi-fung LAM
  • "Therefore using Getline to read in the data and build a vector> may have serious performance issue for me. I tried to use fscanf to achieve this" -- the assumption that `fscanf()` can do something that `std::getline` prevents seems flawed to me. Why do you think so? Also, your final questions are invitations for opinions, which is generally considered off-topic. – Ulrich Eckhardt Apr 25 '18 at 19:29
  • `fscanf(fstream,"%*d",&MyIntArr[i]);` is wrong. Remove `*` – Jean-François Fabre Apr 25 '18 at 19:31
  • I would read a line with `fgets` then skip n commas (so no integer scan), then scan or tokenize from there. – Jean-François Fabre Apr 25 '18 at 19:31
  • https://stackoverflow.com/q/5431941/905902 `while (!feof(stream)){...}` – wildplasser Apr 25 '18 at 19:39
  • Your code doesn't skip to column 17. – jxh Apr 25 '18 at 19:42
  • What environment is this in? Are the files fixed, or do you need to do this more than once? I would consider Perl or mmap(2) for this. In any case, I would probably walk the lines manually and use strtol(3) in the right places to do the conversion. – user464502 Apr 25 '18 at 21:32
  • Thanks a lot, code corrected. Getline can do what I want, it's just the speed... (see my previous question: https://stackoverflow.com/questions/50002563/read-in-large-csv-file-performance-issue-in-c ) – Chi-fung LAM Apr 26 '18 at 03:34
  • Can you show how you're measuring the speed? (Also, which optimizer options you're using) Function call overhead is unlikely to be *that* significant, and you should be able to skip columns with `%*[^,],`, which may be more efficient than matching integers. – Toby Speight Apr 26 '18 at 07:46
  • Do the CSV files have only integral numbers in them? – chqrlie Apr 26 '18 at 08:24
  • Don't waste your time on wild guesses. Profile. – n. m. could be an AI Apr 26 '18 at 08:57
  • As a data point, I wondered how necessary it was here to resort to mmap and custom code. I used two off-the-shelf tools to perform the desired task; it probably took me about a minute to perfect the recipe. I find that processing a 350,000 line file takes about 4 seconds. At that rate, 100,000 of those files would take almost 5 days. So in this case, I have to concede that the custom code was probably worth it. – Steve Summit Apr 27 '18 at 00:07
  • You are right, this is exactly my case. I have over 200,000 files with 300,000~400,000 lines each to process (and more files are coming). – Chi-fung LAM Apr 27 '18 at 04:48

4 Answers


So, there are at most 43 numbers per line. Even at 64 bits, each number needs at most 21 digits, so 1024 bytes is plenty for the longest possible line: 43 × 21 digits + 42 commas + a newline = 946 bytes (so long as there is no other whitespace).

char line[1024];

while (fgets(line, sizeof(line), stdin) != NULL) {
    //...
}

A helper function to skip to the desired column.

const char *find_nth_comma(const char *s, int n) {
    const char *p = s;
    if (p && n) while (*p) {
        if (*p == ',') {
            if (--n == 0) break;
        }
        ++p;
    }
    return p;
}

So, inside your loop, skip to column 10 to find the first number of interest, and then skip to column 17 to start reading in the rest of the numbers. The completed loop looks like:

while (fgets(line, sizeof(line), stdin) != NULL) {
    const char *p = find_nth_comma(line, 9);
    char *end;
    assert(p && *p);
    MyCounter = strtol(p+1, &end, 10);
    assert(*end == ',');
    p = find_nth_comma(end+1, 6);
    assert(p && *p);
    for (int i = 0; i < MyCounter; ++i, p = end) {
        MyIntArray[i] = strtol(p+1, &end, 10);
        assert((*end == ',') ||
               ((i == MyCounter-1) &&
                (*end == '\0' || isspace(*end & 0xFF))));
    }
}

This approach will work with a mmap solution as well. The fgets would be replaced with a function that points to the next line to be processed in the file. find_nth_comma would need a modification to detect end of line/end of file rather than relying on a NUL-terminated string. strtol would be replaced with a custom function that again detects end of line or end of file. (The purpose of such changes is to remove any code that would require copying the data, which is the motivation for a mmap approach.)
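For illustration, a rough sketch of what those range-based replacements could look like (the names find_nth_comma_n and parse_int_n are made up here, and the exact bounds handling is an assumption rather than part of this answer):

/* Like find_nth_comma, but bounded by [s, end) instead of a NUL terminator. */
const char *find_nth_comma_n(const char *s, const char *end, int n) {
    while (s < end && n) {
        if (*s == ',' && --n == 0)
            break;
        ++s;
    }
    return s;
}

/* Like strtol, but stops at `end` as well as at the first non-digit. */
long parse_int_n(const char *s, const char *end, const char **resume) {
    long v = 0;
    int neg = (s < end && *s == '-');
    if (neg)
        ++s;
    while (s < end && *s >= '0' && *s <= '9')
        v = v * 10 + (*s++ - '0');
    *resume = s;  /* caller continues scanning from here */
    return neg ? -v : v;
}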

With parallel processing, it is possible to parse multiple parts of the file simultaneously. But, it is probably sufficient to have different threads process different files, and then collate the results after all files have been processed.
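For instance, a minimal sketch of per-file threading (parse_file here is a hypothetical stand-in for a parsing routine like the loop above, and the work-claiming scheme is just one reasonable choice):

#include <algorithm>
#include <atomic>
#include <string>
#include <thread>
#include <vector>

// Hypothetical per-file parser; a real one would run the loop shown above.
void parse_file(const std::string &path) { (void)path; }

void parse_all(const std::vector<std::string> &paths) {
    std::atomic<size_t> next{0};
    auto worker = [&] {
        // each thread repeatedly claims the next unprocessed file
        for (size_t i; (i = next.fetch_add(1)) < paths.size(); )
            parse_file(paths[i]);
    };
    std::vector<std::thread> pool;
    unsigned n = std::max(1u, std::thread::hardware_concurrency());
    for (unsigned t = 0; t < n; ++t)
        pool.emplace_back(worker);
    for (auto &t : pool)
        t.join();   // collate per-file results after joining, if needed
}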

jxh
  • For correctness, you should cast the `char` argument to `isspace()` to avoid undefined behavior: `isspace((unsigned char)*p)`. Also another potential issue: `find_nth_comma()` may return a pointer to the null terminator: `MyCounter = strtol(p+1, &p, 10);` would then have undefined behavior. Same problem with the next call. Consider changing the semantics to `skip_n_fields()` and return a pointer past the n-th comma. – chqrlie Apr 26 '18 at 08:19
  • I solved this issue by using a memory map (see my post below). – Chi-fung LAM Apr 26 '18 at 14:58

Eventually I used a memory-mapped file to solve my problem (this solution is a side product of my previous problem, a performance issue when reading a big CSV file): read in large CSV file performance issue in C++

Since I work on MS Windows, I use Stephan Brumme's "Portable Memory Mapping C++ Class" (http://create.stephan-brumme.com/portable-memory-mapping/). Since I don't need to deal with file(s) > 2 GB, my implementation is simpler. For files over 2 GB, visit his page to see how to handle them.

Here is my piece of code:

// may try RandomAccess/SequentialScan
MemoryMapped MemFile(FilterBase.BaseFileName, MemoryMapped::WholeFile, MemoryMapped::RandomAccess);

// point to start of memory-mapped file
char* start = (char*)MemFile.getData();
// dummy in my case
char* tmpBuffer = start;

// looping counter
uint64_t i = 0;

// pre-allocate result vector
MyVector.resize(300000);

// line counter
int LnCnt = 0;

// no. of fields
const int NumOfField = 43;
// delimiter count, num of fields + 1 since the leading and trailing delimiters are virtual
const int DelimCnt = NumOfField + 1;
// Delimiter positions. May use new to allocate at run time,
// or even use a vector of integers.
// This is to store the delimiter positions in each line;
// since the positions are relative to the start of the file, if the file is extremely
// large, may need to change from int to unsigned, long or even unsigned long long
static int DelimPos[DelimCnt];

// Max number of fields to read, usually equal to NumOfField; can be smaller, e.g. in my case
// I only need 4 fields from the first 15 fields, so 15 can be assigned to MaxFieldNeed
int MaxFieldNeed = NumOfField;
// keep track of how many delimiters have been read on each line
int DelimCounter = 0;
// define field and line separators
char FieldDelim = ',';
char LineSep = '\n';

// 1st field, "virtual delimiter" position
DelimPos[DelimCounter] = -1;
DelimCounter++;

// loop through the whole memory-mapped file, once and only once
for (i = 0; i < MemFile.size(); i++)
{
  // grab the position of every delimiter in each line
  if ((MemFile[i] == FieldDelim) && (DelimCounter <= MaxFieldNeed)) {
    DelimPos[DelimCounter] = i;
    DelimCounter++;
  }

  // grab all values when end of line is hit
  if (MemFile[i] == LineSep) {
    // no need for if (DelimCounter==NumOfField), just assign anyway; wastes a little
    // memory in the integer array but gains performance
    DelimPos[DelimCounter] = i;
    // I know exactly what the format is and what field(s) I want.
    // A more general approach (as a CSV reader) may put all fields
    // into a vector of vectors of strings.
    // With *EFFORT* one may modify this piece of code so that it can parse
    // different formats at run time, eg similar to:
    // fscanf(fstream,"%d,%f....
    // Also, this piece of code cannot handle complex CSV, e.g.
    // Peter,28,157CM
    // John,26,167CM
    // "Mary,Brown",25,150CM
    MyVector[LnCnt].StrField = string(start + DelimPos[0] + 1, start + DelimPos[1]);
    MyVector[LnCnt].IntField = strtol(start + DelimPos[3] + 1, &tmpBuffer, 10);
    MyVector[LnCnt].IntField2 = strtol(start + DelimPos[8] + 1, &tmpBuffer, 10);
    MyVector[LnCnt].FloatField = strtof(start + DelimPos[14] + 1, &tmpBuffer);
    // reset delimiter counter for each line
    DelimCounter = 0;
    // the previous line separator is treated as the first delimiter of the next line
    DelimPos[DelimCounter] = i;
    DelimCounter++;
    LnCnt++;
  }
}
MyVector.resize(LnCnt);
MyVector.shrink_to_fit();
MemFile.close();

I can code whatever I want inside:

  if (MemFile[i] == LineSep) {
  }

e.g. handle empty fields, perform calculations, etc. With this piece of code, I handle 2,100 files (6.3 GB) in 57 seconds!!! (I hard-coded the CSV format in it and only grab 4 values in my previous case.) I will later change this code to handle the issue in this question. Thanks to all who helped me with this issue.
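For example, a minimal sketch of empty-field detection using the delimiter positions collected above (the adjacent-delimiter test is my illustration, not something the code above already does):

// Inside the (MemFile[i] == LineSep) block: field k lies between
// DelimPos[k-1] and DelimPos[k]; adjacent delimiters mean field k is empty.
for (int k = 1; k <= MaxFieldNeed; k++) {
  if (DelimPos[k] - DelimPos[k - 1] == 1) {
    // field k is empty on this line; substitute a default value here
  }
}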

Chi-fung LAM

In order to maximize performance, you should map the files into memory with mmap or equivalent and parse the file with ad hoc code, typically scanning one character at a time with a pointer, checking for '\n' and/or '\r' for end of record and converting the numbers on the fly for storage into your arrays (a sketch follows the list below). The tricky parts are:

  • how do you allocate or otherwise handle the destination arrays.
  • are the fields all numeric? integral?
  • is the last record terminated by a newline? You can easily check this condition after the mmap call. The advantage is that you then only need to check for end of file when you encounter a newline sequence.
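A bare-bones POSIX sketch of this approach (my assumptions: all fields are unsigned integers, lines end with a bare '\n', and the file name data.csv is a placeholder; error handling and the real per-field storage are trimmed, so sum is just a stand-in):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>

int main() {
    int fd = open("data.csv", O_RDONLY);
    struct stat st;
    if (fd < 0 || fstat(fd, &st) != 0) return 1;
    void *mem = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (mem == MAP_FAILED) return 1;
    const char *base = static_cast<const char *>(mem);

    long sum = 0;  // stand-in for storing selected fields into real arrays
    for (const char *p = base, *end = base + st.st_size; p < end; ) {
        long v = 0;
        while (p < end && *p >= '0' && *p <= '9')  // convert on the fly
            v = v * 10 + (*p++ - '0');
        sum += v;                                  // select/store the field here
        if (p < end && *p == '\n') {
            // end of record: finalize the current row here
        }
        if (p < end) ++p;                          // step over ',' or '\n'
    }
    std::printf("%ld\n", sum);
    munmap(mem, st.st_size);
    close(fd);
    return 0;
}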
chqrlie

Probably the easiest way to read a run-time determined number of integers is to point into the right part of a longer format string. In other words, we can have a format string with 28 `%d,` specifiers, but point to the nth one before the end of the string and pass that pointer as the format string for scanf().

As a simple example, consider accepting 3 integers from a maximum of 6:

"%d,%d,%d,%d,%d,%d,"
          ^

The arrow shows the string pointer to use as the pattern argument.
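As a minimal sketch of that pointer arithmetic (each `%d,` specifier is 3 characters long, so reading n of the 6 fields means starting (6-n)*3 characters into the format string; the input literal is just for demonstration):

#include <cstdio>

int main() {
    static const char fmt[] = "%d,%d,%d,%d,%d,%d,";
    int v[3];
    int n = 3;  // number of integers actually present
    // skip (6 - n) specifiers of 3 characters each, then scan the remaining n
    int m = std::sscanf("7,8,9,", fmt + (6 - n) * 3, &v[0], &v[1], &v[2]);
    std::printf("read %d values: %d %d %d\n", m, v[0], v[1], v[2]);
    return 0;
}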


Here's a full worked example; its runtime is about 8 seconds for 1 million iterations (10 million lines) when built with gcc -O3. It's slightly complicated by the mechanics of updating the input string pointer, which is obviously not necessary when reading from a file stream. I've skipped the check that nfields <= 28, but that's easily added.

char const *const input =
    "1,1,5,1,1,1,0,0,6,6,1,1,1,0,1,0,13,4,7,8,18,20,,,,,,,,,,,,,,,,,,,,,,\n"
    "1,1,5,1,1,1,0,1,6,5,1,1,1,0,1,0,4,7,8,18,20,,,,,,,,,,,,,,,,,,,,,,,\n"
    "1,1,5,1,1,1,0,2,6,5,1,1,1,0,1,0,4,7,8,18,20,,,,,,,,,,,,,,,,,,,,,,,\n"
    "1,1,5,1,1,1,0,3,6,5,1,1,1,0,1,0,13,4,7,8,20,,,,,,,,,,,,,,,,,,,,,,,\n"
    "1,1,5,1,1,1,0,4,6,5,1,1,1,0,1,0,13,4,7,8,20,,,,,,,,,,,,,,,,,,,,,,,\n"
    "1,1,5,1,1,1,0,5,6,4,1,0,1,0,1,0,4,8,18,20,,,,,,,,,,,,,,,,,,,,,,,,\n"
    "1,1,5,1,1,1,0,6,6,5,1,1,1,0,1,0,4,7,8,18,20,,,,,,,,,,,,,,,,,,,,,,,\n"
    "1,1,5,1,1,1,0,7,6,5,1,1,1,0,1,0,13,4,7,8,20,,,,,,,,,,,,,,,,,,,,,,,\n"
    "1,1,5,1,1,1,0,8,6,5,1,1,1,0,1,0,13,4,7,8,20,,,,,,,,,,,,,,,,,,,,,,,\n"
    "1,1,5,1,1,2,0,0,12,12,1,2,4,1,1,0,13,4,7,8,18,20,21,25,27,29,31,32,,,,,,,,,,,,,,,,\n";

#include <stdio.h>

#define SKIP_FIELD "%*[^,],"
#define DECIMAL_FIELD "%d,"

int read()
{
    int n;             /* bytes read - not needed for file or stdin */
    int sum = 0;       /* just to make sure results are used */

    for (char const *s = input;  *s;  ) {
        int nfields;
        int array[28];
        int m = sscanf(s,
                       /* field 0 is missing */
                       SKIP_FIELD SKIP_FIELD SKIP_FIELD
                       SKIP_FIELD SKIP_FIELD SKIP_FIELD
                       SKIP_FIELD SKIP_FIELD SKIP_FIELD
                       DECIMAL_FIELD /* field 10 */
                       SKIP_FIELD SKIP_FIELD SKIP_FIELD
                       SKIP_FIELD SKIP_FIELD SKIP_FIELD
                       "%n",
                       &nfields,
                       &n);
        if (m != 1) {
            return -1;
        }

        s += n;

        static const char fieldchars[] = DECIMAL_FIELD;
        static const size_t fieldsize = sizeof fieldchars - 1; /* ignore terminating null */

        static const char *const parse_entries =
            DECIMAL_FIELD DECIMAL_FIELD DECIMAL_FIELD DECIMAL_FIELD
            DECIMAL_FIELD DECIMAL_FIELD DECIMAL_FIELD DECIMAL_FIELD
            DECIMAL_FIELD DECIMAL_FIELD DECIMAL_FIELD DECIMAL_FIELD
            DECIMAL_FIELD DECIMAL_FIELD DECIMAL_FIELD DECIMAL_FIELD
            DECIMAL_FIELD DECIMAL_FIELD DECIMAL_FIELD DECIMAL_FIELD
            DECIMAL_FIELD DECIMAL_FIELD DECIMAL_FIELD DECIMAL_FIELD
            DECIMAL_FIELD DECIMAL_FIELD DECIMAL_FIELD DECIMAL_FIELD
            "[^\n] ";
        const char *const line_parse = parse_entries + (28-nfields) * fieldsize;

        /* now read nfields (max 28) */
        m = sscanf(s,
                   line_parse,
                   &array[0], &array[1], &array[2], &array[3],
                   &array[4], &array[5], &array[6], &array[7],
                   &array[8], &array[9], &array[10], &array[11],
                   &array[12], &array[13], &array[14], &array[15],
                   &array[16], &array[17], &array[18], &array[19],
                   &array[20], &array[21], &array[22], &array[23],
                   &array[24], &array[25], &array[26], &array[27]);
        if (m != nfields) {
            return -1;
        }

        /* advance stream position */
        sscanf(s, "%*[^\n] %n", &n);  s += n;

        /* use the results */
        for (int i = 0;  i < nfields;  ++i) {
            sum += array[i];
        }
    }
    return sum;
}

#undef SKIP_FIELD
#undef DECIMAL_FIELD

int main()
{
    int sum = 0;
    for (int i = 0;  i < 1000000;  ++i) {
        sum += read() * (i&1 ? 1 : - 1); /* alternate add and subtract */
    }
    return sum != 0;
}
Toby Speight
  • Thanks a lot, eventually I solved this issue with a memory-mapped file (see my post below). – Chi-fung LAM Apr 26 '18 at 14:54
  • Yes, you could use this approach on a memory-mapped file (but like my example, that means having to use `%n` to advance your position). You should **accept** the answer that solved your problem, and **up-vote** all the ones you found useful. (BTW "above" and "below" are pretty meaningless, as you can't predict the order in which others see the answers). – Toby Speight Apr 26 '18 at 15:12