I have a 1000-line file of about 400 MB containing numeric data stored as strings. I want to transpose the data so that there are only 1000 strings per line (so that I can open it and plot it fast with pandas).
I imported the whole file into a vector of vectors of strings, which I want to transpose (and eventually write back to a file).
I use two nested loops to walk the 2D structure and write it into a std::ofstream, but it is very slow. I then tried to focus on just the transposition and wrote the following code:
// Read the 400 MB file (1K lines, 90K strings per line) and store it in:
std::vector<std::vector<std::string>> mData;
// ...
// IO the file and populate mData with the raw data
// ...

// All rows have the same number of strings
size_t nbRows = mData.size();
size_t nbCols = mData[0].size();

std::vector<std::vector<std::string>> transposedData(nbCols);
for (size_t i = 0; i < nbCols; ++i)
{
    transposedData[i].resize(nbRows);
    for (size_t j = 0; j < nbRows; ++j)
    {
        transposedData[i][j] = mData[j][i];
    }
}
I thought a few seconds would be enough, but it takes several minutes. I also tried different input dimensions (only 3 lines and many more strings per line, for the same 400 MB file size), and that case is much faster.
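As an experiment on memory-access order (a sketch I have not measured, so it may or may not help), the loop nest can be swapped so that mData is read row by row, matching its layout in memory, and the strided accesses fall on the writes instead:

// Loop-order variant: the inner loop walks one source row contiguously,
// so reads of mData are sequential in memory; the scattered accesses move
// to the writes into transposedData.
std::vector<std::vector<std::string>> transposedData(
    nbCols, std::vector<std::string>(nbRows));
for (size_t j = 0; j < nbRows; ++j)
{
    for (size_t i = 0; i < nbCols; ++i)
    {
        transposedData[i][j] = mData[j][i];
    }
}

The string copies themselves cost the same either way; only the traversal order changes.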
EDIT 1
Following people's advice, I profiled the code with callgrind. I got this message during the run: ... brk segment overflow in thread #1 : can't grow to ...
I analysed the result and summarize it here:
25% is spent in operator= of basic_string
21% is spent in the construction of basic_string (with only 3% of the time in new)
14% is spent in operator[] of the outer vector
11% is spent in operator[] of the inner vector
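Since string assignment and construction dominate the profile, a minimal variant worth sketching, assuming mData is not needed after the transposition (I only write transposedData back to file), is to move the strings instead of copying them:

#include <utility> // for std::move

// Same loops as before, but each assignment steals the string's heap buffer
// instead of copying the characters. The strings left behind in mData are in
// a valid but unspecified (typically empty) state, so mData must not be read
// afterwards.
for (size_t i = 0; i < nbCols; ++i)
{
    transposedData[i].resize(nbRows);
    for (size_t j = 0; j < nbRows; ++j)
    {
        transposedData[i][j] = std::move(mData[j][i]);
    }
}

How much this helps depends on the length of the individual numbers: strings short enough for the small-string optimization are copied anyway, so the gain comes from the longer ones.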
Thank you for your suggestions.