This is an in-depth account of the guts of gzip. For a simpler description, see:
- wiki:BuiltinTools#tools.gzip for CherryPy 3, or
- wiki:StdFilterGzip for CherryPy 2.
Gzip
The Gzip feature compresses the output of a page on the fly. It follows the guidelines defined by RFC 2616 (the HTTP/1.1 spec) in the following sections:
- Section 3.5: Content-coding. http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.5
- Section 14.3: Accept-Encoding. http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.3
- Section 14.11: Content-Encoding. http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.11
Compression format
The implementation uses the zlib module -- a low-level compression library that is internally used by the gzip module to compress & decompress gzip-compatible files.
The filter is activated whenever the request's Accept-Encoding header names gzip. The response is then modified to include the Content-Encoding: gzip header, and the response body is compressed and sent with the correct gzip header & CRC trailer as per RFC 1952.
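The activation rule can be sketched in a few lines. This is a minimal illustration, not the CherryPy code: the function names and the plain-dict headers are assumptions, and quality values (such as gzip;q=0) are deliberately ignored here.

```python
def wants_gzip(request_headers):
    """Return True if the Accept-Encoding request header names gzip."""
    accept = request_headers.get("Accept-Encoding", "")
    # Drop any ";q=..." parameter and compare the bare coding names.
    codings = [item.split(";")[0].strip().lower() for item in accept.split(",")]
    return "gzip" in codings or "x-gzip" in codings


def mark_gzipped(response_headers):
    """Announce that the response body is gzip-encoded."""
    response_headers["Content-Encoding"] = "gzip"
    return response_headers
```

For example, wants_gzip({"Accept-Encoding": "gzip, deflate"}) is true, while a request without the header leaves the response untouched.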
Stream format
The gzip format is specified in RFC 1952 (http://www.faqs.org/rfcs/rfc1952.html). A gzip stream consists of a header, the compressed blocks, and a trailer, laid out as follows (heavily edited from the original source):
+---+---+---+---+---+---+---+---+---+---+
|ID1|ID2|CM |FLG|     MTIME     |XFL|OS |
+---+---+---+---+---+---+---+---+---+---+
+=======================+
|...compressed blocks...| (more-->)
+=======================+
  0   1   2   3   4   5   6   7
+---+---+---+---+---+---+---+---+
|     CRC32     |     ISIZE     |
+---+---+---+---+---+---+---+---+
As none of the flags are supposed to be set, some optional members that could potentially follow the header are omitted from this presentation. The necessary fields are:
- ID1 (IDentification 1) & ID2 (IDentification 2). These have the fixed values ID1 = 31 (0x1f, \037), ID2 = 139 (0x8b, \213), to identify the file as being in gzip format.
- CM (Compression Method). CM = 8 is the standard "deflate" compression method.
- FLG (FLaGs). Zero, as no flags are set for this application.
- MTIME (Modification TIME). The time in Unix format, but can be set to zero safely for stream compressing.
- XFL (eXtra FLags). The "deflate" method (CM = 8) sets these flags as follows:
  - XFL = 2 - compressor used maximum compression, slowest algorithm
  - XFL = 4 - compressor used fastest algorithm
- OS (Operating System). The actual value should not matter here, because all files are being treated as binary.
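- XLEN (eXtra LENgth). Present only when the FEXTRA bit of FLG is set. Since FLG is zero here, this field is omitted, which is why it does not appear in the diagram above.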
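The fixed 10-byte header described above can be built with the struct module. This is a sketch under stated assumptions: the function name and its defaults are hypothetical, MTIME defaults to zero, XFL to 2 (maximum compression), and OS to 255, which RFC 1952 defines as "unknown".

```python
import struct


def gzip_header(mtime=0, xfl=2, os_code=255):
    """Build the 10-byte gzip header with no optional fields (FLG = 0)."""
    return struct.pack(
        "<BBBBLBB",          # little-endian: four bytes, one 32-bit int, two bytes
        0x1F, 0x8B,          # ID1, ID2: gzip magic numbers
        8,                   # CM: deflate
        0,                   # FLG: no flags set
        mtime & 0xFFFFFFFF,  # MTIME: Unix time, or 0 when unavailable
        xfl,                 # XFL: 2 = maximum compression, 4 = fastest
        os_code,             # OS: 255 = unknown
    )
```

All multi-byte fields in the gzip format are little-endian, hence the "<" prefix in the format string.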
In the trailer part of the stream, the following data should be provided:
- CRC32 (CRC-32). The zlib library provides a suitable method to calculate it.
- ISIZE (Input SIZE). This contains the size of the original (uncompressed) input data modulo 2**32, so the size has to be accumulated as the compression proceeds.
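Both trailer fields can be accumulated chunk by chunk. The following sketch assumes the uncompressed data arrives as an iterable of byte strings; the function name is hypothetical. zlib.crc32 accepts a running value as its second argument, which is what makes incremental calculation possible.

```python
import struct
import zlib


def gzip_trailer(chunks):
    """Return the 8-byte gzip trailer for the given uncompressed chunks."""
    crc = 0
    size = 0
    for chunk in chunks:
        crc = zlib.crc32(chunk, crc)  # continue the CRC from the previous chunk
        size += len(chunk)
    # Both fields are unsigned 32-bit little-endian; ISIZE wraps modulo 2**32.
    return struct.pack("<LL", crc & 0xFFFFFFFF, size & 0xFFFFFFFF)
```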
Implementation Notes
What compression should it use?
(or: more than you ever needed to know about gzip vs. deflate)
This filter was originally implemented by Remi Delon, using the gzip module. During the modifications to make all filters generator-based, it was rewritten by Carlos Ribeiro using the zlib module, because that allows the compression to be performed chunk by chunk, using less memory and avoiding the need to compress the whole file at once before sending. The main problem was that zlib provides only the underlying functions, and the actual format of the generated stream is up to the programmer. RFC 2616 isn't entirely clear about the exact format of the stream to be used, so some research was needed.
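The chunk-by-chunk approach can be sketched as a generator. This is an illustration of the technique, not the actual filter code: the function name is hypothetical, MTIME is fixed at zero, and OS is set to 255 (unknown). A negative wbits value tells zlib to emit raw deflate data, so the gzip header and trailer can be written by hand.

```python
import gzip
import struct
import zlib


def gzip_stream(chunks, level=9):
    """Yield a gzip-compatible stream for an iterable of byte strings."""
    # Header: magic, CM = 8, FLG = 0, MTIME = 0, XFL = 2, OS = 255.
    yield b"\x1f\x8b\x08\x00" + struct.pack("<L", 0) + b"\x02\xff"
    # Negative wbits: raw deflate output, no zlib header or checksum.
    compressor = zlib.compressobj(level, zlib.DEFLATED, -zlib.MAX_WBITS)
    crc = 0
    size = 0
    for chunk in chunks:
        crc = zlib.crc32(chunk, crc)
        size += len(chunk)
        data = compressor.compress(chunk)
        if data:  # the compressor may buffer input and return nothing yet
            yield data
    yield compressor.flush()
    # Trailer: CRC32 and ISIZE, both little-endian unsigned 32-bit.
    yield struct.pack("<LL", crc & 0xFFFFFFFF, size & 0xFFFFFFFF)


# Round trip through the standard gzip module as a sanity check.
body = b"".join(gzip_stream([b"hello ", b"world"]))
restored = gzip.decompress(body)
```

Only one chunk and the compressor's internal state are held in memory at a time, which is the point of the rewrite described above.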
It turns out that there are a few options, the following being of interest to us:
- gzip (in HTTP/1.1) and x-gzip (in HTTP/1.0) are equivalent to a single gzip-compatible file.
- deflate (in HTTP/1.1) is equivalent to a zlib-compressed binary stream, with a slightly simpler and more efficient format than the one used by gzip.
According to the zlib FAQ, the zlib-based method (called deflate in the HTTP/1.1 spec) was designed exactly for this application: streaming over a communications channel. However, due to bad implementations, things turned out a little differently (from http://www.gzip.org/zlib/zlib_faq.html):
36. What's the difference between the "gzip" and "deflate" HTTP 1.1 encodings?
"gzip" is the gzip format, and "deflate" is the zlib format. They should
probably have called the second one "zlib" instead to avoid confusion
with the raw deflate compressed data format. While the HTTP 1.1 RFC 2616
correctly points to the zlib specification in RFC 1950 for the "deflate"
transfer encoding, there have been reports of servers and browsers that
incorrectly produce or expect raw deflate data per the deflate
specification in RFC 1951, most notably Microsoft. So even though the
"deflate" transfer encoding using the zlib format would be the more
efficient approach (and in fact exactly what the zlib format was designed
for), using the "gzip" transfer encoding is probably more reliable due to
an unfortunate choice of name on the part of the HTTP 1.1 authors.
Bottom line: use the gzip format for HTTP 1.1 encoding.
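The three formats mentioned in the FAQ excerpt can be compared directly, since Python's zlib module selects the wrapper through the wbits argument. This is a small sketch with hypothetical names: a negative value gives raw deflate (RFC 1951), a plain value gives the zlib format (RFC 1950), and adding 16 gives the gzip format (RFC 1952).

```python
import zlib


def compress_as(data, wbits):
    """Compress data in one shot with the wrapper selected by wbits."""
    compressor = zlib.compressobj(9, zlib.DEFLATED, wbits)
    return compressor.compress(data) + compressor.flush()


payload = b"hello world" * 10
raw = compress_as(payload, -zlib.MAX_WBITS)             # raw deflate, RFC 1951
zlib_bytes = compress_as(payload, zlib.MAX_WBITS)       # HTTP "deflate", RFC 1950
gzip_bytes = compress_as(payload, zlib.MAX_WBITS | 16)  # HTTP "gzip", RFC 1952

# The wrappers differ only in framing around the same deflate data:
# zlib adds a 2-byte header and a 4-byte Adler-32 checksum,
# gzip adds a 10-byte header and an 8-byte CRC32 + ISIZE trailer.
overhead_zlib = len(zlib_bytes) - len(raw)
overhead_gzip = len(gzip_bytes) - len(raw)
```

A client that expects raw deflate data will fail on the zlib header bytes, which is the interoperability problem the FAQ describes.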
Note on the Accept-Encoding header option
RFC 2616 section 14.3 (http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.3) defines a fairly elaborate procedure for the Accept-Encoding header option. Multiple codings can be sent, each with an optional quality value (q) stating how acceptable it is, and a coding can be ruled out entirely with q=0. It's not clear to what extent full support for the spec is needed for a working server.
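To give a sense of what fuller support would involve, here is a sketch of a parser that honours quality values and the "*" wildcard. The function names are hypothetical, it covers only part of section 14.3, and it conservatively treats an empty header as "do not compress".

```python
def parse_accept_encoding(header):
    """Return {coding: qvalue} for an Accept-Encoding header value."""
    prefs = {}
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        parts = item.split(";")
        coding = parts[0].strip().lower()
        q = 1.0  # a coding without a q parameter is fully acceptable
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value.strip())
                except ValueError:
                    q = 0.0  # malformed q-value: treat as not acceptable
        prefs[coding] = q
    return prefs


def gzip_acceptable(header):
    """Decide whether gzip may be used for this Accept-Encoding value."""
    prefs = parse_accept_encoding(header)
    for name in ("gzip", "x-gzip"):
        if name in prefs:
            return prefs[name] > 0
    # "*" matches any coding that is not listed explicitly.
    return prefs.get("*", 0.0) > 0
```

For instance, gzip_acceptable("gzip;q=0, *") is false because gzip is explicitly refused, while gzip_acceptable("*") is true.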

