Which algorithm is most suitable for large text compression?

Question

Currently, I am looking for a lossless compression algorithm that is suitable for a large amount of text, which will then be encrypted with AES and used as the payload in steganography.

EDIT:

Based on "A Comparative Study Of Text Compression Algorithms", it seems that arithmetic coding is preferable among statistical compression techniques, while LZB is recommended among dictionary compression techniques.

So now I am wondering whether statistical compression or dictionary compression is more suitable for large English text in terms of compression ratio and ease of implementation.

I have searched around but still barely have an idea of which algorithm is suitable. Thank you very much for your time in answering. Have a nice day. :)

See [what is the current state of text-only compression algorithms?](https://stackoverflow.com/questions/236456/what-is-the-current-state-of-text-only-compression-algorithms). You should also make clear in your question whether you're looking for a comparison among the algorithms you suggested, or generally the *most suitable* one for the job. In the latter case you have to describe your criteria of "most suitable", e.g., compression ratio, memory, speed, compatibility, easy-to-implement, etc. — Reti43, May 07 '18 at 17:02
@Reti43 Thank you for the reminder. I have added some info to the question. — User233100, May 08 '18 at 05:17

score 5 · Answer 1 · answered May 08 '18 at 00:46

A lot of the algorithms that you are describing in this question are called entropy coders (Shannon-Fano, Huffman, arithmetic, etc.). Entropy coders are used to compress sequences of symbols (often bytes), where some symbols are much more frequent than others. Simple entropy coding of symbols (letters) for compressing natural language will only yield about a 2:1 compression.
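To see roughly where a figure like 2:1 comes from, you can measure the order-0 (per-symbol) Shannon entropy of a text, which is the lower bound for any entropy coder that treats bytes independently. This is only a sketch: the function name `order0_entropy_bits` and the sample string are made up for illustration, and a short repeated sentence will not give the same number as real English prose.

```python
import math
from collections import Counter


def order0_entropy_bits(data: bytes) -> float:
    """Shannon entropy in bits per byte, treating every byte as independent."""
    if not data:
        return 0.0
    counts = Counter(data)
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Placeholder sample; substitute your own text to get a meaningful estimate.
sample = b"the quick brown fox jumps over the lazy dog " * 50
bits = order0_entropy_bits(sample)
print(f"{bits:.2f} bits/byte, best order-0 ratio about {8 / bits:.2f}:1")
```

An ideal order-0 coder cannot go below this many bits per byte, so the ratio printed here is a ceiling for plain Huffman or arithmetic coding without a better model in front of it.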

Instead, popular modern lossless compression techniques for text include methods like LZ77, LZW, and BWT. Loosely speaking, the LZ family builds up a dictionary of recurring short symbol sequences (we'll call them "words") and then uses pointers to reference those words. Some members of the family, like LZ77 and LZW, can be fairly simple to code up but probably do not yield the highest compression ratios. See for example this video: https://www.youtube.com/watch?v=j2HSd3HCpDs. On the other end of the spectrum, LZMA2 is a more complicated variant with a higher compression ratio.
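To make the dictionary idea concrete, here is a minimal LZW sketch in Python. It is a textbook-style illustration, not a production codec: the function names are my own, the dictionary grows without bound, and the output is a list of integer codes rather than a packed bit stream (a real implementation would also need to serialize those codes).

```python
from typing import List


def lzw_compress(data: bytes) -> List[int]:
    """Encode bytes as a list of dictionary codes (codes 0-255 are single bytes)."""
    table = {bytes([i]): i for i in range(256)}
    current = b""
    codes = []
    for value in data:
        candidate = current + bytes([value])
        if candidate in table:
            current = candidate          # keep extending the current "word"
        else:
            codes.append(table[current])  # emit the longest known word
            table[candidate] = len(table)  # learn the new, longer word
            current = bytes([value])
    if current:
        codes.append(table[current])
    return codes


def lzw_decompress(codes: List[int]) -> bytes:
    """Rebuild the original bytes by growing the same dictionary as the encoder."""
    if not codes:
        return b""
    table = {i: bytes([i]) for i in range(256)}
    previous = table[codes[0]]
    out = [previous]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == len(table):
            # The code refers to the word the encoder had only just created.
            entry = previous + previous[:1]
        else:
            raise ValueError("invalid LZW code")
        out.append(entry)
        table[len(table)] = previous + entry[:1]
        previous = entry
    return b"".join(out)


text = b"TOBEORNOTTOBEORTOBEORNOT"
codes = lzw_compress(text)
print(len(text), "bytes ->", len(codes), "codes")
```

Note that the decoder never receives the dictionary; it reconstructs it from the code stream, which is what makes LZW attractive for a from-scratch implementation.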

The Burrows-Wheeler transform (BWT) provides a clever alternative to the dictionary methods. I'll refer you to the Wikipedia article, https://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transform

In a nutshell, though, it produces an (invertible) permutation of the original sequence of bytes that can often be compressed very effectively by run-length encoding followed by an entropy coder.
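As a rough illustration of the permutation and its inverse, here is a naive BWT sketch. It sorts all rotations explicitly, so it is far too slow and memory-hungry for large inputs (real implementations use suffix arrays), and the function names and the choice to return a primary index instead of using an end-of-string marker are my own assumptions.

```python
from typing import Tuple


def bwt(data: bytes) -> Tuple[bytes, int]:
    """Return the last column of the sorted rotations and the row of the original."""
    n = len(data)
    if n == 0:
        return b"", 0
    # Naive approach: sort rotation start positions by the rotated content.
    rotations = sorted(range(n), key=lambda i: data[i:] + data[:i])
    last_column = bytes(data[(i - 1) % n] for i in rotations)
    return last_column, rotations.index(0)


def inverse_bwt(last_column: bytes, primary_index: int) -> bytes:
    """Invert the transform using a stable sort of the last column."""
    n = len(last_column)
    if n == 0:
        return b""
    # order[r] is the last-column position holding the byte that starts row r.
    order = sorted(range(n), key=lambda i: last_column[i])
    out = bytearray()
    row = primary_index
    for _ in range(n):
        row = order[row]
        out.append(last_column[row])
    return bytes(out)


transformed, index = bwt(b"banana")
print(transformed, index)  # b'nnbaaa' 3
print(inverse_bwt(transformed, index))  # b'banana'
```

Even on this tiny input you can see equal bytes clustering together (`nn`, `aaa`), which is what the later run-length and entropy coding stages exploit.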

If I had to code a compression technique from scratch, for simplicity, I'd probably go with LZW or LZ77.

Brilliant answer, just the right mix of dumbing down and technical details. — Hashim Aziz, Dec 05 '18 at 01:11

score 2 · Answer 2 · answered May 08 '18 at 05:38

Shannon-Fano coding, Huffman coding, Arithmetic coding, Range coding, and Asymmetric Numeral System coding are all zero-order entropy coders applied after you have first modeled your data, taking advantage of the inherent redundancy.

For text, that redundancy is repeated strings and higher-order correlations in the data. There are several ways to model text. The most common are Lempel-Ziv 77, which looks for matching strings, the Burrows-Wheeler Transform (look it up for a description), and prediction by partial matching.

Look to the Large Text Compression Benchmark to see comparisons in compression, compression speed, memory used, and decompression speed.
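If you want a quick feel for how these families compare on your own data before reading the benchmark, Python's standard library ships one codec from each: `zlib` (DEFLATE, i.e. LZ77 plus Huffman coding), `bz2` (BWT-based), and `lzma` (LZMA2 in an `.xz` container by default). The sketch below is only a rough comparison; the helper name `compare_codecs` and the synthetic sample are placeholders, and a repeated sentence compresses far better than real prose, so run it on your actual text.

```python
import bz2
import lzma
import zlib
from typing import Dict


def compare_codecs(data: bytes) -> Dict[str, int]:
    """Return the compressed size in bytes for each standard-library codec."""
    compressed = {
        "zlib (LZ77 + Huffman)": zlib.compress(data, 9),
        "bz2 (BWT-based)": bz2.compress(data, 9),
        "lzma (LZMA2, .xz container)": lzma.compress(data),
    }
    return {name: len(blob) for name, blob in compressed.items()}


# Placeholder input; replace with the bytes of the text you actually care about.
sample = b"It was the best of times, it was the worst of times. " * 200
sizes = compare_codecs(sample)
for name, size in sizes.items():
    print(f"{name}: {len(sample)} -> {size} bytes ({len(sample) / size:.1f}:1)")
```

Since the compressed output is going to be encrypted afterwards, compress first and encrypt second; well-encrypted data looks random and will not compress.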
