What is the right LZ77 compression input and output? (binary)

Question

I'm implementing the LZ77 compression algorithm. Here are the program requirements:

The program should compress any uncompressed file (.txt, .bmp, and so on).
Consequently, the program has to work with binary data.

This is where things get a little unclear for me, so I need some advice. Here are my thoughts on the problem:

The input file should be opened with std::ios::binary. What comes next? For convenience, I think it is best to treat the binary data as bytes, and I use chars to represent them (because 1 char is 1 byte, if I'm not mistaken). So I read the file data as bytes (chars) into a string, and then I can encode this byte string using the algorithm. The output of the encoding is a vector of triplets (offset, length, nextchar). Now I need to turn this output into bytes to save it to the compressed file, right? Because offset and length are ints, I need to split them into bytes (chars) to write them to the output binary stream (ofstream::write takes a char* and the number of characters (bytes) to insert). nextchar can be written as is. For decoding I reverse the steps: read the triplet bytes from the file and then decode them using the algorithm.
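To make the "encode this byte string" step concrete, here is a minimal sketch of a brute-force encoder that turns a byte string into such triplets. The names Triplet and lz77Encode and the default window size of 4096 are assumptions made for illustration, not part of my actual program, and the match length is not capped by a lookahead buffer as it would be in a tuned implementation.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical triplet type for this sketch.
struct Triplet
{
    int offset;  // distance back from the current position to the match start
    int length;  // number of matched bytes (0 if there is no match)
    char next;   // literal byte that follows the match
};

// Brute-force LZ77 encoder: at each position, search the previous
// `window` bytes for the longest match, then emit (offset, length, next).
std::vector<Triplet> lz77Encode(const std::string& data, std::size_t window = 4096)
{
    std::vector<Triplet> out;
    std::size_t pos = 0;
    while (pos < data.size())
    {
        std::size_t bestLen = 0, bestOff = 0;
        const std::size_t start = pos > window ? pos - window : 0;
        for (std::size_t cand = start; cand < pos; ++cand)
        {
            std::size_t len = 0;
            // Always leave one byte after the match to serve as `next`.
            while (pos + len + 1 < data.size() && data[cand + len] == data[pos + len])
                ++len;
            if (len > bestLen)
            {
                bestLen = len;
                bestOff = pos - cand;
            }
        }
        out.push_back({static_cast<int>(bestOff), static_cast<int>(bestLen), data[pos + bestLen]});
        pos += bestLen + 1;
    }
    return out;
}
```

For the input "aaaa" this yields two triplets, (0, 0, 'a') and (1, 2, 'a'). The second match overlaps the bytes it is about to produce, which LZ77 allows.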

Am I missing something crucial? Is it right to work with bytes at all in this case? Any faulty logic? Below is some code:

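#include <cstring>   // std::memcpy
#include <fstream>   // std::ifstream, std::ofstream
#include <iterator>  // std::istreambuf_iterator
#include <string>
#include <vector>

// Construct int from bytes (uses the host's native byte order)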
int intFromBytes(std::istream& is)
{
    char bytes[4];
    for (int i = 0; i < 4; ++i)
        is.get(bytes[i]);

    int integer;
    std::memcpy(&integer, &bytes, 4);
    return integer;
}

// Get bytes from int
void intToBytes(std::ostream& os, int value)
{
    char bytes[4];
    std::memcpy(&bytes, &value, 4);
    os.write(bytes, 4);
}

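struct Node
{
    int offset, length;
    char next;
};

// State shared by the functions below (assumed here to be globals or class members)
std::string Buffer;         // raw bytes of the uncompressed file
std::vector<Node> encoded;  // triplets produced by the encoder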

///// Packing /////

// Open and read binary data as a byte string
void readFileUnpacked(std::string& path) 
{
    std::ifstream file(path, std::ios::in | std::ios::binary);

    if (file.is_open())
    {

        Buffer = std::string(std::istreambuf_iterator<char>(file), {});
        file.close();
    }
}

///// Here should be the encoding logic /////

// Writing encoded triplets as bytes
void createFilePacked(std::string& path) 
{
    std::ofstream out(path, std::ios::out | std::ios::binary);

    if (out.is_open())
    {
        for (auto node : encoded)
        {
            intToBytes(out, node.offset);
            out << node.next;
            intToBytes(out, node.length);
        }
        out.close();
    }
}

///// Unpacking /////

// Reading packed file as binary
void readFilePacked(std::string& path)
{
    std::ifstream file(path, std::ios::in | std::ios::binary);

    if (file.is_open())
    {
        Node element;

        while (file.peek() != std::ifstream::traits_type::eof())
        {
            element.offset = intFromBytes(file);
            file.get(element.next);
            element.length = intFromBytes(file);
            encoded.push_back(element);
        }
        file.close();
    }
}

///// Here should be the decoding logic /////

void createFileUnpacked(std::string& newpath)
{
    std::ofstream out(newpath, std::ios::out | std::ios::binary);
    out << Buffer;
    out.close();
}

Don’t forget to define the endian-ness of the int-values in your file format (as big-endian or little-endian) and include code to byte-swap to/from that format when saving or loading the values; otherwise your code will break when a file saved on a big-endian computer is loaded on a little-endian computer or vice-versa. — Jeremy Friesner, Nov 29 '19 at 15:53
I am not an expert in LZ77, but from my experience with other compression algorithms, the encoded elements are not bytes. They are probably encoded with some odd number of bits, like 9 or 12. Maybe you could post a section from the algorithm's specification, where it says what the type of the encoded elements is. See also here: https://stackoverflow.com/q/27589460/509868 — anatolyg, Nov 29 '19 at 16:01
@anatolyg The LZ77 algorithm achieves compression by replacing repeated occurrences of data with references to a single copy of that data existing earlier in the uncompressed data stream. A match is encoded by a triplet (offset, length, nextchar). I store these triplets (nodes) in a vector. So, I need to save the triplets to an output file. AFAIK, to do this I need to write every member of a triplet (2 ints and 1 char) as bytes. So I split every int into an array of chars and then write it to an output stream. — asymmetriq, Nov 29 '19 at 16:08
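To illustrate how such a reference resolves back into data, here is a minimal decoder sketch. The names Triplet and lz77Decode are hypothetical, and the code assumes a well-formed stream in which every offset points inside the output produced so far.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical triplet type for this sketch.
struct Triplet
{
    int offset;  // distance back from the end of the output to the match start
    int length;  // number of bytes to copy from there
    char next;   // literal byte appended after the copy
};

// Expand triplets back into the original byte string.
// Assumes 0 <= offset <= current output size, and offset > 0 when length > 0.
std::string lz77Decode(const std::vector<Triplet>& triplets)
{
    std::string out;
    for (const Triplet& t : triplets)
    {
        const std::size_t start = out.size() - static_cast<std::size_t>(t.offset);
        // Copy one byte at a time so that a match may overlap the bytes
        // it is producing (length greater than offset).
        for (int i = 0; i < t.length; ++i)
            out.push_back(out[start + i]);
        out.push_back(t.next);
    }
    return out;
}
```

For example, the triplets (0, 0, 'a'), (0, 0, 'b'), (2, 2, 'c') expand to "ababc": two literals, then a copy of the two preceding bytes followed by the literal 'c'.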
As far as I understand your question, you are asking whether to write as bytes. The answer is no, but I don't have more details, that's why I don't write an answer. — anatolyg, Nov 29 '19 at 16:12
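One possible reading of the "not bytes" remark is that the fields get narrower bit widths instead of full ints. The sketch below assumes a 12-bit offset, a 4-bit length and an 8-bit literal, which packs a triplet into 3 bytes instead of the 9 bytes (4 + 1 + 4) written by the code in the question. These widths and the names PackedFields, packTriplet and unpackTriplet are assumptions for illustration only and are not taken from any LZ77 specification.

```cpp
#include <array>
#include <cstdint>

// Hypothetical field widths: 12-bit offset (0..4095),
// 4-bit length (0..15), 8-bit literal byte.
struct PackedFields
{
    std::uint16_t offset;
    std::uint8_t length;
    std::uint8_t next;
};

// Pack offset and length into 16 bits (offset in the high 12 bits),
// followed by the literal byte. Out-of-range values are masked off.
std::array<std::uint8_t, 3> packTriplet(const PackedFields& t)
{
    const std::uint16_t head =
        static_cast<std::uint16_t>(((t.offset & 0x0FFF) << 4) | (t.length & 0x0F));
    return {static_cast<std::uint8_t>(head >> 8),
            static_cast<std::uint8_t>(head & 0xFF),
            t.next};
}

// Reverse of packTriplet.
PackedFields unpackTriplet(const std::array<std::uint8_t, 3>& b)
{
    const std::uint16_t head = static_cast<std::uint16_t>((b[0] << 8) | b[1]);
    return {static_cast<std::uint16_t>(head >> 4),
            static_cast<std::uint8_t>(head & 0x0F),
            b[2]};
}
```

With this layout the encoder would have to limit its search window to 4095 bytes and its match length to 15, since larger values do not fit in the fields.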
