
I have a 1000-line file of about 400 MB containing numeric data stored as strings. I want to transpose the data so that there are only 1000 strings per line (so that I can open it and plot it quickly with pandas).

I imported the whole file into a vector of vectors of strings, which I want to transpose (and eventually write back to a file).

I use two nested loops to go through the 2D structure and write it into a std::ofstream. It is very slow. I then tried to focus on the transposition alone and wrote the following code:

//Read 400MB file, 90K strings per line and 1K lines, and store it into
std::vector<std::vector<std::string>> mData;

// ... 
// IO the file and populate mData with raw data 
// ...

//All rows have same number of string
size_t nbRows = mData.size();
size_t nbCols = mData[0].size();

std::vector<std::vector<std::string> > transposedData(nbCols);
for(size_t i = 0 ; i < nbCols  ; ++i)
{
    transposedData[i].resize(nbRows);
    for(size_t j = 0 ; j < nbRows ; ++j)
    {
        transposedData[i][j] = mData[j][i];
    }
}

I thought a few seconds would suffice, but it takes several minutes. Also, when I try different input dimensions (only 3 lines and many more strings per line, for the same 400 MB file size), it is much faster.

EDIT 1

On people's advice, I profiled the code using callgrind. I got this message during the process: ... brk segment overflow in thread #1 : can't grow to ...

I analysed the results and summarize them here:
  • 25 % is spent in operator= of basic_string
  • 21 % is spent in construction of basic_string (with only 3 % of the time in new)
  • 14 % is spent in operator[] on the outer vector
  • 11 % is spent in operator[] on the inner vector

Thank you for your suggestions.

Yvon
  • Did you profile the code? I'm pretty sure the bottleneck are input and output. –  Oct 03 '19 at 08:50
  • Are the strings long or short? If they are long and you are not going to reuse mData, you could move the strings instead of copying them. – Marc Glisse Oct 03 '19 at 08:53
  • On Linux, you can use callgrind to profile the code – Clonk Oct 03 '19 at 08:54
  • @dyukha Hi, in the code version I wrote here, there is no IO anymore. – Yvon Oct 03 '19 at 08:55
  • @MarcGlisse Unfortunately the strings are very short, usually 3 char only. – Yvon Oct 03 '19 at 08:57
  • Well, there is IO in your actual code, right? Did you verify already that it's not IO which uses most of the time? –  Oct 03 '19 at 08:58
  • @dyukha Yes I did, this is actually why I wanted to decouple the transposition task and the IO task. I admit I still need to import the data, (but I have checked this is not the bottleneck), but I am not writing anymore for the moment. – Yvon Oct 03 '19 at 09:00
  • Why are you resizing in the loop? You could easily get the vectors of right size, in both dimensions, by constructing `transposedData` sensibly. You could also speed things up by using reference variables, rather than using `operator[]()` repeatedly with the same argument. – Peter Oct 03 '19 at 09:03
  • Note: Have a look at this question: [1D or 2D array, what's faster?](https://stackoverflow.com/questions/17259877/1d-or-2d-array-whats-faster) about vector of vector constructs and why you should use a one-dimensional approach (with pre-allocation). – Pixelchemist Oct 03 '19 at 09:17
  • Because this is too much for a single CPU core. It seems like it is time to add `#pragma omp parallel for` just before `for(size_t i = 0 ; i < nbCols ; ++i)` and `-fopenmp` to the compiler and linker, and then check how it is going after this change. – Victor Gubin Oct 03 '19 at 09:22
  • If you want to get the job done, you should not have been writing the program at all. Load the file into pandas and transpose the dataframe. If you want to learn C++ stuff that's another story. – n. m. could be an AI Oct 03 '19 at 12:18
  • Do you really need to copy all the strings? If you don't need the original `mData` afterward, you could move-construct the `transposedData` values. Alternatively, if you do need them, but they outlive `transposedData`, you could change the result to contain `std::string_view` objects instead. – Toby Speight Oct 03 '19 at 12:25
  • You ought to create a [mcve]. That should be easy enough: add a `main()` and the required `#include` directives, and replace the "IO" comment with something to populate `mData` with test values. – Toby Speight Oct 03 '19 at 12:28
  • I strongly suspect you are running a non-optimised build. – n. m. could be an AI Oct 03 '19 at 12:36
  • @n.m. I tried with pandas read_csv but it did not work because there are too many columns. – Yvon Oct 03 '19 at 13:54
  • @Toby Speight Yes, then we would have realized immediately that there is no problem with this code and that the problem comes from my environment (here, memory). – Yvon Oct 03 '19 at 13:56

4 Answers


First of all, before making any claim about why a piece of code is slow, you should really measure its performance on your machine and only then, with data in hand, deduce why.

That said, I am quite confident that in this case the problem lies in the fact that you are allocating 90k vectors of strings, each of size 1k. As you know, memory allocation is costly, which would explain the performance penalty.

The following is how you could implement your code using only a 1D array allocated upfront.

size_t nbRows = mData.size();
size_t nbCols = mData[0].size();

auto get_idx = [](const size_t row, const size_t rowLength, const size_t col)
{
    return row*rowLength + col;
};

std::vector<std::string> transposedData(nbCols*nbRows);  
for(size_t i = 0 ; i < nbCols  ; ++i)
{
    for(size_t j = 0 ; j < nbRows ; ++j)
    {
        const size_t idx = get_idx(j, nbCols, i);
        transposedData[idx] = std::move(mData[j][i]);
    }
}

for(size_t i = 0 ; i < nbCols  ; ++i)
{
    for(size_t j = 0 ; j < nbRows ; ++j)
    {
        const size_t idx = get_idx(j, nbCols, i);
        std::cout << transposedData[idx] << " ";
    }
    std::cout << std::endl;
}

I would like to stress it again: profile your code. Try tools like valgrind --tool=callgrind or gprof, which let you profile and visualize performance data about your application.

Davide Spataro
  • Thank you, actually it appears to be too much memory... "terminate called after throwing an instance of std::bad_alloc". – Yvon Oct 03 '19 at 09:50

The program has redundancies on several levels.

The obvious one is that you don't need to transpose the vector in order to transpose the file.

vector<vector<string>> originalData;
// read the file to originalData

for(size_t i = 0 ; i < nbCols  ; ++i)
{
    for(size_t j = 0 ; j < nbRows ; ++j)
    {
        cout << originalData[j][i] << " ";
    }
    cout<<endl;
}
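Since the end goal in the question is writing the transposed file rather than printing it, the same loop can stream straight to a std::ofstream. This is only a sketch (the function name and file path are mine, not from the question); note the use of '\n' instead of endl to avoid forcing a flush on every line:

```cpp
#include <fstream>
#include <string>
#include <vector>

// Write the transpose of originalData directly to a file,
// without ever building a transposed vector in memory.
void writeTransposed(const std::vector<std::vector<std::string>>& originalData,
                     const std::string& path)
{
    std::ofstream out(path);
    const size_t nbRows = originalData.size();
    const size_t nbCols = originalData[0].size();

    for (size_t i = 0; i < nbCols; ++i)
    {
        for (size_t j = 0; j < nbRows; ++j)
            out << originalData[j][i] << ' ';
        out << '\n';   // not std::endl: avoid flushing on every line
    }
}
```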

Assuming you do need to produce a transposed vector for some reason, one way to write the transpose loop would be

vector<vector<string>> transposedData (nbCols);
for (size_t j = 0; j < nbCols; ++j)
{
    transposedData[j].reserve(nbRows);
    for (size_t i = 0; i < nbRows; ++i) 
    {
        transposedData[j].emplace_back(originalData[i][j]);
        // if keeping the original vector is not needed ...
        // transposedData[j].emplace_back(std::move(originalData[i][j]));
    }
}

On my (fairly beefy) machine it takes about 6-7 seconds to transpose a 1000x90000 matrix of 3-character strings. This is not particularly impressive, but if you don't need to transpose multi-million-element matrices 24 hours a day, it does what you need without too much overhead.

n. m. could be an AI

The penalty may come from the fact that you're calling resize inside your for loop.

As per the reference:

Complexity

Linear in the difference between the current size and count. Additional complexity possible due to reallocation if capacity is less than count

Memory allocation costs a lot, so you may want to avoid overdoing it.

As pointed out by others, upfront allocation would be an interesting approach to avoid recreating (resizing) your vector every time.

luizinho

There was not enough available memory on my machine to perform this task (see below). After splitting my data into three parts, I solved the task in a few seconds. Here is the output of code scrutinizing the memory:

free ram 2.5GB  
IO populating mData with raw data  
free ram 0.2GB  
Empty string capacity : 15 bytes  
Intending to allocate 1.4 GB  
terminate called after throwing an instance of 'std::bad_alloc'  
  what() : std::bad_alloc  
Aborted
Yvon
  • Please use the edit link on your question to add additional information. The Post Answer button should be used only for complete answers to the question. - [From Review](/review/low-quality-posts/24213996) – Valentino Oct 03 '19 at 12:38
  • I am satisfied with this solution so I edited my answer so that it looks like a complete answer. – Yvon Oct 03 '19 at 13:46