

A typical Linux* OS offers many options for reducing the storage space of data. The typical list of compression options includes things like zlib, xz, and bzip2 as well as lz4 and Snappy. Several of these compression algorithms provide a tunable, called "level", a number from 0 to 9 that changes the behavior of the algorithm. The result is many choices, and this blog post tries to show the differences between them.

About data compression

First, let us have a quick word about data compression algorithms. Data compression falls in the realm of information theory and has been thoroughly studied since Claude Shannon started research in this area in 1948. Data compression algorithms take advantage of the redundancy in the files they are compressing to reduce the size of the files and thus the storage/transmission requirements. I am going to spare you the hard sciency bits for the rest of the article, but a quick high-level overview will be useful.

Generally, compression algorithms work in two steps:

1) The algorithm uses information from earlier in the file to find repeating patterns.
2) The patterns are turned into an efficient stream of bits and bytes - the so-called entropy coding - on the disk.
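
To make these two steps more concrete, here is a deliberately naive Python sketch of step one: it scans a sliding window of earlier data for the longest repeat and emits either literal bytes or (distance, length) back-references. This is only an illustration of the idea, not zlib's actual code; the function name, window size, and minimum match length are arbitrary choices, and step two - turning these tokens into a compact bit stream - is what Deflate does with Huffman coding.

    def lz77_tokens(data: bytes, window: int = 4096, min_len: int = 3):
        """Yield ("literal", byte) or ("match", distance, length) tokens."""
        i = 0
        while i < len(data):
            best_len, best_dist = 0, 0
            # Step one: look backwards, at most `window` bytes, for the longest repeat.
            for j in range(max(0, i - window), i):
                length = 0
                while (i + length < len(data)
                       and data[j + length] == data[i + length]
                       and length < 258):              # Deflate caps matches at 258 bytes
                    length += 1
                if length > best_len:
                    best_len, best_dist = length, i - j
            if best_len >= min_len:
                yield ("match", best_dist, best_len)   # "copy length bytes from distance back"
                i += best_len
            else:
                yield ("literal", data[i])
                i += 1

    print(list(lz77_tokens(b"abcabcabcabcxyz")))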

Various scientists have proven that an optimal compression algorithm exists. For example, the Context Tree Weighting algorithm was the first proven to be optimal for very large data streams with a fixed-size startup overhead. I've had the pleasure of being in a classroom watching Professor Ziv prove that the LZ77 algorithm is likewise theoretically optimal. The LZ77 family of algorithms forms the basis of most data compression implementations used in this blog post.

The catch is: this optimality only holds for very large data sets and, generally, without regard for the computation time and memory consumption such an optimal algorithm would take to implement. So if multiple algorithms are proven to be theoretically optimal, why does this blog post exist? The key is the first step described above: finding the repeating patterns. The various implementations I will compare in this blog post differ primarily in how far back they search and how efficient their search algorithms are.
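
To get a feel for how much the "how far back they search" part matters, zlib itself lets you shrink the search window through the wbits argument of compressobj() in Python's bundled zlib module. The snippet below is a sketch with made-up input: a 2 KiB block of noise repeated many times, so the repeats sit too far back for a 512-byte window but are trivial for a 32 KiB one. The exact byte counts will vary with the input.

    import random
    import zlib

    random.seed(0)
    block = bytes(random.randrange(256) for _ in range(2048))  # 2 KiB of noise, no internal repeats
    data = block * 64                                          # ...but the block repeats every 2 KiB

    for wbits in (9, 15):  # search window of 2**9 = 512 bytes vs 2**15 = 32 KiB
        comp = zlib.compressobj(6, zlib.DEFLATED, wbits)
        out = comp.compress(data) + comp.flush()
        print(f"window 2**{wbits}: {len(data)} -> {len(out)} bytes")

The higher levels, and the implementations compared in this post, spend their extra effort on exactly this kind of search.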

No blog post about data compression implementations can exist without talking about the zlib algorithm, called Deflate, and its implementation. The zlib algorithm and file format are standardized in internet standards (RFC 1950, 1951) and have been extremely widely used since 1995, partly because they avoided many legal pitfalls which hit other algorithms in the litigious era of the 1990s.

Zlib provides 9 levels of compression and a "level 0" which just does a 1:1 uncompressed store of the data. The amount of compute time spent on searching for repeating patterns varies from level to level. The format on the disk is the same for all levels, shielding a decompressor from the level differences.
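
As a quick, hedged illustration of the level knob, the sketch below uses Python's bundled zlib module (a binding to the same library) to compress one made-up, repetitive buffer at a few levels and then decompress every result with the identical call - possible precisely because the stored format does not depend on the level. The sizes and timings are only indicative and depend entirely on the input.

    import time
    import zlib

    # Made-up, highly repetitive sample input; substitute any real file.
    data = b"A typical Linux OS offers many options for reducing the storage space of data. " * 2000

    for level in (0, 1, 6, 9):  # 0 = 1:1 store, 1 = fastest, 9 = most thorough search
        t0 = time.perf_counter()
        blob = zlib.compress(data, level)
        elapsed = time.perf_counter() - t0
        assert zlib.decompress(blob) == data  # one decompressor handles every level
        print(f"level {level}: {len(data)} -> {len(blob)} bytes in {1000 * elapsed:.1f} ms")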
