I need to read in many big CSV files to process in C++ (ranging from a few MB to hundreds of MB). At first I open each file with fstream, use getline to read each line, and use the following function to split each row:
template <class ContainerT>
void split(ContainerT& tokens, const std::string& str,
           const std::string& delimiters = " ", bool trimEmpty = false)
{
    using value_type = typename ContainerT::value_type;
    using size_type  = typename ContainerT::size_type;
    std::string::size_type pos, lastPos = 0, length = str.length();
    while (lastPos < length + 1)
    {
        pos = str.find_first_of(delimiters, lastPos);
        if (pos == std::string::npos)
            pos = length;
        if (pos != lastPos || !trimEmpty)
            tokens.push_back(value_type(str.data() + lastPos,
                                        (size_type)pos - lastPos));
        lastPos = pos + 1;
    }
}
I tried boost::split, boost::tokenizer and boost::spirit, and found that the above gives the best performance so far. After that, I considered reading the whole file into memory to process, rather than keeping the file open, and I use the following function to read in the whole file:
void ReadinFile(string const& filename, stringstream& result)
{
    ifstream ifs(filename, ios::binary | ios::ate);
    ifstream::pos_type pos = ifs.tellg();  // file size
    char* buf = new char[pos];
    ifs.seekg(0, ios::beg);
    ifs.read(buf, pos);
    result.write(buf, pos);
    delete[] buf;
}
Both functions were copied from somewhere on the net. However, I find that there is not much difference in performance between keeping the file open and reading in the whole file. The performance captured is as follows:
Process 2100 files with boost::split (without read in whole file) 832 sec
Process 2100 files with custom split (without read in whole file) 311 sec
Process 2100 files with custom split (read in whole file) 342 sec
Below please find the sample content of one type of file. I have 6 types to handle, but all are similar.
a1,1,1,3.5,5,1,1,1,0,0,6,0,155,21,142,22,49,1,9,1,0,0,0,0,0,0,0
a1,10,2,5,5,1,1,2,0,0,12,0,50,18,106,33,100,29,45,9,8,0,1,1,0,0,0
a1,19,3,5,5,1,1,3,0,0,18,0,12,12,52,40,82,49,63,41,23,16,8,2,0,0,0
a1,28,4,5.5,5,1,1,4,0,0,24,0,2,3,17,16,53,53,63,62,43,44,18,22,4,0,4
a1,37,5,3,5,1,1,5,0,0,6,0,157,22,129,18,57,11,6,0,0,0,0,0,0,0,0
a1,46,6,4.5,5,1,1,6,0,0,12,0,41,19,121,31,90,34,37,15,6,4,0,2,0,0,0
a1,55,7,5.5,5,1,1,7,0,0,18,0,10,9,52,36,86,43,67,38,31,15,5,7,1,0,1
a1,64,8,5.5,5,1,1,8,0,0,24,0,0,3,18,23,44,55,72,57,55,43,8,19,1,2,3
a1,73,9,3.5,5,1,1,9,1,0,6,0,149,17,145,21,51,8,8,1,0,0,0,0,0,0,0
a1,82,10,4.5,5,1,1,10,1,0,12,0,47,17,115,35,96,36,32,10,8,3,1,0,0,0,0
My questions are:
1 Why does reading in the whole file perform worse than not reading in the whole file?
2 Is there any other, better string split function?
3 The ReadinFile function needs to read into a buffer and then write to a stringstream for processing; is there any method to avoid this, i.e. read directly into the stringstream?
4 I need to use getline to parse each line (with '\n') and split to tokenize each row. Is there any function similar to getline for a string, e.g. getline_str, so that I can read into a string directly?
5 How about reading the whole file into a string, splitting the whole string into a vector on '\n', and then splitting each string in the vector on ',' to process? Will this perform better? And what is the limit (max size) of a string?
6 Or should I define a struct like this (based on the format)
struct MyStruct {
    string Item1;
    int    It2_3[2];
    float  It4;
    int    ItRemain[23];
};
and read directly into a vector? How can I do this?
Thanks a lot.
Regards
LAM Chi-fung