
I have a function in my code that decodes a file compressed with the LZ77 algorithm, but on a 15 MB input file decompression takes about 3 minutes (too slow). What is the reason for the poor performance? On every step of the loop I read two or three bytes and get the length, offset and next character. If the offset is not zero, I also have to move "offset" bytes back in the output stream and read "length" bytes. Then I insert them at the end of the same stream before writing the next character there.

void uncompressData(long block_size, unsigned char* data, fstream &file_out)
{
    // buf_length is defined elsewhere; length fits in 4 bits, so 15 bytes suffice
    unsigned char* append = new unsigned char[buf_length];
    link myLink;
    long cur_position = 0;
    file_out.seekg(0, ios::beg);
    cout << file_out.tellg() << endl;
    int i=0;
    myLink.length=-1;
    while(i<(block_size-1))
    {
        if(myLink.length!=-1) file_out << myLink.next;   // write the literal deferred from the previous token
        myLink.length = (short)(data[i] >> 4);           // high 4 bits: match length
        //cout << myLink.length << endl;
        if(myLink.length!=0)
        {
            myLink.offset = (short)(data[i] & 0xF);           // low 4 bits...
            myLink.offset = myLink.offset << 8;
            myLink.offset = myLink.offset | (short)data[i+1]; // ...plus the next byte: a 12-bit offset
            myLink.next = (unsigned char)data[i+2];           // literal that follows the match
            cur_position=file_out.tellg();
            file_out.seekg(-myLink.offset,ios_base::cur);
            if(myLink.length<=myLink.offset)
            {
                file_out.read((char*)append, myLink.length);
            }
            else
            {
                // overlapping match: read the "offset" bytes that exist,
                // then repeat them until "length" bytes are filled
                file_out.read((char*)append, myLink.offset);
                int k=myLink.offset,j=0;
                while(k<myLink.length)
                {
                    append[k]=append[j];
                    j++;
                    if(j==myLink.offset) j=0;
                    k++;
                }
            }
            file_out.seekg(cur_position);
            file_out.write((char*)append, myLink.length);
            i++;
        }
        else {
            myLink.offset = 0;
            myLink.next = (unsigned char)data[i+1];
        }
        i=i+2;
    }
    // trailing flag byte: 0x0 means the deferred final literal is real output
    unsigned char hasOddSymbol = data[block_size-1];
    if(hasOddSymbol==0x0) { file_out << myLink.next; }
    delete[] append;
}
  • Each file operation, especially on MS-Windows, is slow. The whole reason iostream reads large chunks of the file into a buffer before giving it to you, one character at a time, is to amortize that overhead. Furthermore, seeking, writing, and seeking again defeats this buffering entirely. The shown code is not salvageable and should be rewritten from scratch, by reading the entire file into memory and then decompressing it entirely in memory (see the sketch after these comments). – Sam Varshavchik Aug 17 '20 at 11:03
  • But what should I do if I don't know the exact size of the output file in advance? – Maxim Voloshin Aug 17 '20 at 11:08
  • @MaximVoloshin just read into a `std::vector` until you reach `eof()`. Then you can randomly access any byte you like and you can get the size from the vector. See [Loading a file into a vector](https://stackoverflow.com/questions/7241871/loading-a-file-into-a-vectorchar) – Jan Schultke Aug 17 '20 at 11:16
  • Then you devise an algorithm that does not assume any specific output file size. – Sam Varshavchik Aug 17 '20 at 11:35
  • @Sam Varshavchik P.S. The compressed input file is already loaded into memory (it's in "data"). Most of the operations are performed on the output file. – Maxim Voloshin Aug 17 '20 at 11:41
  • Great. Now, all that needs to be done is to decompress it in memory. – Sam Varshavchik Aug 17 '20 at 11:45
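
Following the comments above, here is a minimal sketch (mine, not from the thread) of the fully in-memory approach: load the compressed file into a `std::vector` as Jan Schultke suggests, decode into another growable vector so the output size need not be known in advance, and write the result to disk once at the end. The token format is taken from the question's code: the high 4 bits of the first byte are the length; if non-zero, the low 4 bits plus the second byte form a 12-bit offset and the third byte is the next literal. The function names are placeholders.

#include <fstream>
#include <iterator>
#include <vector>
using namespace std;

// Slurp a whole file into memory; the vector also tracks its size.
vector<unsigned char> loadFile(const char* path)
{
    ifstream in(path, ios::binary);
    return vector<unsigned char>(istreambuf_iterator<char>(in),
                                 istreambuf_iterator<char>());
}

// Decode entirely in memory, then write once at the end.
void uncompressDataInMemory(long block_size, const unsigned char* data, fstream& file_out)
{
    vector<unsigned char> out;
    long i = 0;
    while (i < block_size - 1)
    {
        short length = (short)(data[i] >> 4);            // high 4 bits: match length
        if (length != 0)
        {
            short offset = (short)(((data[i] & 0xF) << 8) | data[i + 1]);
            size_t start = out.size() - offset;
            // byte-by-byte copy handles overlapping matches (length > offset)
            // exactly like the original replication loop
            for (short k = 0; k < length; ++k)
                out.push_back(out[start + k]);
            out.push_back(data[i + 2]);                  // literal after the match
            i += 3;
        }
        else
        {
            out.push_back(data[i + 1]);                  // no match: just a literal
            i += 2;
        }
    }
    // mirror the question's trailing flag: non-zero means the last
    // literal was padding and must be dropped
    if (data[block_size - 1] != 0x0 && !out.empty())
        out.pop_back();
    file_out.write((const char*)out.data(), out.size());
}

Since every write appends to the vector, there is no seeking at all; back-references become plain array indexing, which is where most of the speedup comes from.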

1 Answer


You could try doing it on a std::stringstream in memory instead:

#include <sstream>

void uncompressData(long block_size, unsigned char* data, fstream& out)
{
    std::stringstream file_out;                 // first line in the function

    // the rest of your function goes here

    out << file_out.rdbuf();                   // last line in the function
}
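
One caveat worth checking when making this swap (my observation, not from the answer): a file stream has a single stream position shared by reads and writes, while a `std::stringstream` keeps independent get and put positions. The `cur_position = file_out.tellg(); file_out.seekg(-myLink.offset, ios_base::cur);` bookkeeping in the loop body therefore no longer starts from the end of the written data. A sketch of the adjusted back-reference step, anchored on the put position, which always sits at the end because the code only ever appends:

// position the get pointer "offset" bytes before the end of the output;
// tellp() tracks that end because writes only ever append
file_out.seekg((long)file_out.tellp() - myLink.offset);
file_out.read((char*)append, myLink.length);
// no need to save and restore cur_position: the put position was never
// moved, so the following write still appends at the end
file_out.write((char*)append, myLink.length);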
Ted Lyngmo
  • Are you sure that works? Stream buffers are limited in size so `.str()` only gives you a portion of the decoded file. Not sure if they can grow indefinitely for stringstreams or if this is specified by the standard at all. Anyhow, why don't you just `out << file_out.str()`? – Jan Schultke Aug 17 '20 at 11:19
  • @J.Schultke If the file is too big, it'll certainly not work and doing the work on disk will probably be needed, like OP already does - or using some kind of memory mapped file. "_Anyhow, why don't you just `out << file_out.str()`_" - Because that makes a copy of the underlying string and this move-constructs it. – Ted Lyngmo Aug 17 '20 at 11:21
  • ok, then why don't you just `out << file_out.rdbuf()`? See [How to copy from one stream to another?](https://stackoverflow.com/questions/3738233/how-to-copy-binary-data-from-one-stream-to-another). – Jan Schultke Aug 17 '20 at 11:57
  • @J.Schultke Guess I'm too tired right now :-) I updated it. Thanks. – Ted Lyngmo Aug 17 '20 at 12:07
  • Why does this work faster? Because it's in memory and there's no buffering? – Nick Aug 20 '20 at 11:57
  • @Nick Yes, doing it all in memory is usually a lot faster. Even if all disk operations are done in a disk buffer in memory, your program will not know that, so it can't be optimized to take advantage of it. With everything in a `stringstream` the compiler can do a lot more optimizations. – Ted Lyngmo Aug 20 '20 at 12:03