ZIP2 file format specification
subtitle
by John M. Dlugosz
john@dlugosz.com
Last edited 15 Aug 2000 11:43 PM CDT
ZIP2 file format specification
*1 Why
*2 Design Goals
*3 Overview
*3.1 Why ZIP2 files are smaller
*3.2 Why ZIP2 files are easier to manipulate
*3.3 Multi-File Spanning Archives
*3.4 Extensibility
*3.5 Scenarios
*3.5.1 Extracting a File
*3.5.2 Adding a File
*3.5.3 Replacing a File
*3.5.4 Splitting an Archive
*3.5.5 Acting as a Filter
*4 Structure
*4.1 Non-Chunk Header
*4.2 Fill Bytes
*4.3 Data Type Definitions
*4.4 Chunks
*4.4.1 Structure Definition
*4.4.2 Chunk Type (name)
*4.4.3 Instances
*5 Chunk Types
*5.1 [ME]
*5.2 CMNT, cMNT
*5.3 COMP, cOMP
*5.4 Comp
*5.5 CRYP, cRYP
*5.6 Cryp
*5.7 DATA, dATA
*5.8 Data
*5.9 DICT, dICT
*5.10 Dict
*5.11 DICt
*5.12 ENdX
*5.13 FaTR, faTR
*5.14 ICON
*5.15 IDEN
*5.16 INDX, iNDX
*5.17 Indx
*5.18 MaCF, maCF
*5.19 Macf
*5.20 ROOT
*5.21 RTFM
*5.22 SECU, sECU
*5.23 SHAC,sHAC
*5.24 ShAI, shAI
*5.25 Shai
*5.26 SIGN, sIGN
*5.26.1 Security Level Indicator
*5.27 SIgn
*5.28 TiME, tiME
*5.29 TOcN, TOcn
*5.30 TREE, tREE
*5.31 Tree
*5.32 TREe
*5.33 UnIX, unIX
*5.34 XXXX
*5.35 User Defined
*6 Built-In Chunk Definitions
*6.1 COMP
*6.2 CRYP
*6.3 DICT
*6.4 TREE
*7 Notes on Archive Entries
*7.1 The Index Structure
*7.2 Entry Names
*7.3 Subdirectories
*8 Compression Methods
*8.1 Deflation (method 1)
*8.2 UTR-6 (method 2)
*8.3 Run-Length, Repeat, Delta (method 3)
*8.4 Huffman Encoding (method 4)
*8.5 Composition (method 5)
*8.6 Burrows-Wheeler Transform (method 6)
*9 Encryption
*10 Details of Algorithms
*10.1 CRC-32
*10.2 UTF-8
*10.3 D
ługosz’ Variable-Length Integer Encoding *10.3.1 Introduction
*10.3.2 Overview
*10.3.3 Benefits and Features
*10.3.4 Criticisms and Possible Improvements
*10.3.5 Details
*11 Conformance
*11.1 Baseline 2000
*11.2 Robust 2000
*
Implementations of ZIP managers often leave out all support for multi-file spanning archives. Those that do have it provide very little support. Often all you can do with a split archive is extract, not add or otherwise manipulate. Also, most tools assume that spanning archives will be on removable media.
The lack of support stems from the poor design of spanning in the old ZIP specification.
ZIP2 is designed to make it easy to handle multi-file spanning archives.
ZIP files have a bit for "encrypted", but the standard ZIP encryption method is easily broken by cryptoanalysis. In fact, many programs exist to break password-protected ZIP files, and it takes a fraction of a second to do so.
Good algorithms are available. It is a given that these algorithms must continue to be revised to keep up with advances in cryptoanalysis and increases in CPU speed and storage.
Yet it is problematic to use a different algorithm in a classic ZIP file. If you flag it as encrypted, it won’t decode correctly on a standard decompressor. You just can’t do it without violating the standard format.
In ZIP2, all data can be encrypted. This applies to the file data (DATA chunks) as well as other information such as the list of file names and file comments. The encryption method is specified, so new methods can be added without breaking the file format.
Programs can adopt the ZIP2 file format and customize it to hold additional information, while still maintaining compatibility with standard tools.
The standard tools will be backward compatible with older versions, as the specification is updated.
The standard tools can grow to accommodate new things without the need for new code—various data is supplied as otherwise ordinary ZIP2 files, rather than being built-into the program. These files can be updated, and used to upgrade any decoder regardless of platform or machine instruction set.
See §3.4, Extensibility.
See §3.2, Why ZIP2 files are easier to manipulate.
See §3.1, Why ZIP2 files are smaller.
Classic ZIP files use 8-bit characters for file names, with no specification as to what character set is used. Presumably, it uses the native character set of whoever created the archive, which might not match the native character set of who tries to extract files from it.
Furthermore, file systems (including NTFS and UDF used for DVD’s) can use more than 256 characters at a time.
All text in ZIP2 uses Unicode.
A 32-bit length value is quickly becoming a problem. As I write this, inexpensive hard disks are in the 20–30 GB range, and removable media (DVD-RAM) is 2.2 GB. So it’s not out of the question that a user could have a file larger than 4GB in size.
Why ZIP2 files are smaller
ZIP2 files are smaller than the equivalent ZIP files even though both use the same compression algorithm (deflation). Here’s why:
The chunk system makes it easy to add previously unknown information to newer versions of the ZIP2 specification. Well-behaved user-defined chunks allow application-specific information to be associated with stored files, using the identical system as the standard items.
Scenarios
The user asks the program to extract
foo.txt from testfile.zip2.The program quickly skims the ZIP2 file to see which chunks are present. One of the chunks is the list of files contained in the archive. Reading that chunk (
INDX #0) and consulting the information in it, it determines that foo.txt is number 42. There are no file positions or offsets stored. Rather, everything refers to chunk instances, which can be anywhere in the file.The chunk
DATA #42 contains the compressed and encrypted data for the file. This chunk is read and decoded, and the file foo.txt written.
The user asks the program to add
bar.txt to the existing archive testfile.zip2.The program sees that the directory (
INDX #0) happens to be the last in the file. This is especially convenient, and the program proceeds by reading all the information from this chunk, and deleting it, thus shortening the file.A new
DATA chunk is appended to the file containing the content from bar.txt. It is assigned a number previously unused, DATA #99.A record for bar.txt is added to the directory information, and an updated INDX chunk written, once again at the end of the file.
The user asks the program to add
bar.txt to the existing archive testfile.zip2.The directory is not at the end of the file, so deleting it leaves a hole in the file. Rather than having to rewrite the entire archive, this deleted area can be re-used.
The new DATA #99 chunk is split into two parts. Part 1 is sized to exactly fit into the hole left by deleting the
INDX #0 chunk. Part 2, the remainder, is appended to the file.Finally, the updated INDX #0 chunk is written.
The user asks the program to update
bar.txt in the archive testfile.zip2.The
INDX chunk indicates that bar.txt is file number 99. DATA chunk #99 is overwritten in-place, but the new one is shorter than the old one. The unused space is claimed with a XXXX chunk or fill bytes.The INDX chunk is updated with the new file length and date. The result may be a different size than the original, due to the use of variable-length number fields. Also, if the INDX chunk is compressed, changing a couple bytes could very well change the compressed size.
Deleting the INDX chunk leaves a hole, or may be at the end of the file, as explained in other cases. Writing the new INDX chunk is handled in exactly the same way as the DATA chunk in that it can use an XXXX chunk if shorter, and split it if longer.
The user asks the program to update
bar.txt in the archive testfile.zip2.The
INDX chunk indicates that bar.txt is file number 99. DATA chunk #99 is overwritten in-place, but the new one is longer than the old one.The
DATA #99 chunk is split into two parts. Part 1 is sized to fit in the old location, and Part 2, the remainder, is appended to the file.The INDX chunk is updated with the new file size and timestamp, as explained elsewhere.
Suppose a server had a monster-sized archive for people to download. Rather than downloading the whole thing at once, a CGI program was present to allow the user to pick any size desired, and the archive would be divided into multiple files of at most that size.
The program needs to split the archive on the fly, with an arbitrary file size.
Start by positioning the input at the first chunk, just after the non-chunk header.
Open a new output file, and start by writing a non-chunk header. Subtract the length of the non-chunk header from the threshold length.
Copy one chunk at a time from input to output. When you reach a chunk that won’t fit in the output without going over the threshold size, either leave it out (to be the first thing in the next file) or split the chunk into two parts. In the former case, the output file will be smaller than the threshold size. In the latter case, write the first part to the output, then start a new output file. The second part of the split chunk is the first chunk written to the new file.
Repeat until all chunks are copied. The program could also generate chunks that specify which part contains which (regular) chunk, and add them to the last output file. Even if it doesn’t do that for every interesting chunk, it ought to do so for the
INDX #0 at least.The program can also add
IDEN chunks. This is highly recommended, as it will identify the files even if the browser munged the file names when downloading.The program will choose not to split a chunk if it’s one that can’t be split, or if the first part of the split chunk would be very small.
There are two issues of interest to some that ZIP2 does handle. One is when the input is of unknown length, the other is when the output is not a file.
An input of unknown length is handled easily because the length of the file is only present in the
INDX entry, which may be emitted last.Of more concern is writing the ZIP2 archive to a transmission channel as it works, rather than buffering the entire job before it can write the DATA chunk. After all, the
DATA chunk starts with a length.The solution is to write multiple DATA chunks. The program can be compressing incoming data and buffering the compressed data for a short time. When the consumer is ready for more output, the compressed data at hand can be wrapped in a dATA chunk, specifying the length thus far. Meanwhile, compression of input continues. When the input is finally ended, and compression completes, a final DATA chunk is emitted. If necessary (that is, you didn’t see the end coming) the final DATA chunk can be zero-length.
StructureA ZIP2 file has the following structure: The bulk of the file is a collection of chunks concatenated together, possibly with fill bytes. Before this is a short non-chunk header, so called because it’s the only stuff that’s not formatted as chunks.
Non-Chunk HeaderThis short segment allows a file to be easily identified as a ZIP2 file, and provides for embedding a ZIP2 data stream in a larger file.
Signature:
4 bytes containing "ZIP2"Usage text:
at most 128 bytes (and as little as 0 bytes) of text, intended to explain what to do with this file type, not be a comment about the contents of this specific archive. For example, "See http://www.dlugosz.com/ZIP2/ for information on the ZIP2 file format.".Post-signature:
11 or 15 bytes containing "\x1a\x04\xfe" followed a uint32 or uint64 length followed by "ZIP2" again. The first part terminates the text on most systems (^Z or ^D), so the first signature and usage text can be typed (or cat’ed) as a text file. The hex FE makes sure the high bits have not been lost, and is also a byte that is never present in a UTF-8 string. This means that the end of the usage text can be located unambiguously by scanning for a hex FE byte. Then the signature is repeated just to make it less likely a random text file will accidentally look like a ZIP2 file.The 4-byte length may be hex FFFFFFFF which means "unspecified". Otherwise, it is taken to be the length of the body (chunked part) of the data, and anything after that is ignored. This provides for embedding a ZIP2 data stream inside a larger file.
The length is not a variable-length field to facilitate in-place updates as the file changes.
To facilitate deleting or resizing chunks in-place, fill bytes may be used between chunks. A fill byte has the value of 0.
For regions larger than 12 bytes, use an XXXX chunk to mark the unused area in the file. The fill bytes are designed for gaps smaller than the minimum size of a regular chunk (which therefore can’t hold a XXXX chunk).
Since the chunk begins with a length, and a single zero byte is read as a length of zero, this elegantly fits the definition of a degenerate chunk. Code that walks the chunks in a file will easily handle the presence of fill bytes (see Listing 1 on page
*).lpstring:
All text fields are encoded as Unicode in UTF-8 format, length-prefixed with a uintV specifying the number of bytes (not characters) that follow.For example, "Hello" would be encoded as hex 05
48 6C 6C 6F, and “Długosz” as hex 08 44 C9 82 75 67 6F 73 7A, and the complete text of Moby Dick as hex D2 54 F1 43 41 6C 6C 20 6D 65 20 49 73 68 6D 61 65 6C … 1,201,374 bytes omitted for brevity … 61 67 6F 2E.sintV:
A signed variation of uintV.uint16:
A 16-bit unsigned value. It is stored as big-endian, because this is already established as the "network standard", and it’s consistent with the byte ordering of the variable-length values and UTF-8 encoding.uint32:
A 32-bit unsigned value. It is stored as big-endian, because this is already established as the "network standard", and it’s consistent with the byte ordering of the variable-length values and UTF-8 encoding.uint64:
A 64-bit unsigned value. It is stored as big-endian, because this is already established as the "network standard", and it’s consistent with the byte ordering of the variable-length values and UTF-8 encoding.uintV:
A variable-length unsigned value. It may represent an integer of any length, and require from 1 to an arbitrary number of bytes. The encoding is detailed in section 10.3 on page *.text:
When text is found as the body of a block whose length is already known, it is not necessary to use a length prefix. This is specifically true when the content of a chunk’s Payload is text, as with file comments.The text is in UTF-8 format. Line-break characters are code hex 0A, which is the "\n" in C and Perl. This encodes as one byte of UTF-8, while U+2028 or U+2029 would require three.
time64:
8 byte, big-endian, time as used for FILETIME in Windows. It has a 100 nanosecond granularity, the finest of any common system. Times are always stored in Universal Coordinated Time (UTC).The bulk of the file is made up of chunks. The chunks are simply concatenated together, and each chunk has the same layout so they may be traversed without actually decoding them all.
Structure DefinitionLength:
uintV. This is the total length, in bytes, of the chunk not counting the length field itself.A simple program can traverse the chunks by looking at the length field.
sub skim_file ($)
{
my $file= shift; #the input file, queued past the non-chunk header.
while (!eof($file)) {
my $length= read_uintV ($file);
if ($length >= 4) { #otherwise can’t have 4 byte type tag
my $type;
read ($file, $type, 4);
print "chunk type: $type, length: $length\n";
}
seek ($file, $length-4, SEEK_CUR); #skip to next chunk
}
}
Note that the logic of reading the Length, then seeking that many bytes to get to the next chunk will naturally handle the fill bytes between chunks. In general, such code that skims a ZIP2 file looking for something in particular should check the length before assuming the chunk contains subsequent fields. If the length is less than 4 bytes, it’s improper to read a Type field within that chunk’s extent! With this logic, fill bytes are simply degenerate chunks.
Type:
4 byte name field. This indicates the content of the chunk, and is described in detail in §4.4.2 below.Instance_number_range:
2 uintV. Starting number and count of how many instances are in this chunk. The chunk instance number is unique to the type, but different types may have the same number without conflict. Chunks are referenced by number, not by position in the file (or even which file they are in, in spanning archives), so this number is very important. The number 0 may be repeated, as it indicates "not applicable" or "don’t care".For example, hex 07 01 is simply instance seven. Hex 07 10 means that this chunk holds instances seven through twenty-two, inclusive.
Part:
unitV. A chunk’s content may be arbitrarily chopped up into smaller chunks. When this happens, each is given the same number and consecutive parts, starting with 0. For example, DATA chunk #3 may contain compressed content of a file, and it is chopped up into dATA #3 pt. 0 and DATA #3 pt. 1 (and each may be in a different file, for spanning archives).Payload_specification:
This is data that is not present in every chunk type, so is considered part of the chunk-specific data (content). However, it applies to all instances, so it appears before the Payload and is not repeated in each data packet. It consists of 2 uintV’s, specifying the Compression and the Encryption, respectively.Payload:
The actual content. Each packet of data (each instance number is associated with one packet), except the last, is prefixed with its length (as a uintV) and the whole thing concatenated together. Fully-structured chunk types omit the length, indicated by the second name flag.Checksum:
uint32. A CRC-32 (see §10.1 on page *) check of the entire chunk, including the Length, through the end of the Data. The algorithm is exactly as with classic ZIP files, but the 1’s complement is stored.The chunk is named using a system similar to PNG/MNG, in that 4 bytes are chosen so they happen to be mnemonic as ASCII characters, and the 0x20 bit positions are flags, so that flags are visible as lower-case letters.
The first flag indicates that the chunk is "continued". That is, it is not the last or only part. The last part (which may be the only part) has this flag cleared.
The second flag indicates that the chunk data is fully structured, and that multiple instances can be concatenated together without including the length of each one. You can always tell where one ends and the next begins.
The third flag indicates that the chunk is redundant. That is, the information can be calculated or found, but is included for efficiency. For example, a directory of what can be found in one part of a multi-part archive allows the part to stand alone too, but the information is redundant with what’s in the master directory on the last piece of media.
The fourth flag indicates that this chunk is a pointer to the chunk that might have been present here. It can specify which file (in a spanning set) the chunk can be expected to reside in, point to another file, or indicate a standard definition.
Chunk instances is a central concept of this file format. Everything refers to some chunk instance, rather than an actual file position.
Each chunk header indicates not a single instance but a range of instances. Having multiple instances under one header saves overhead for very small chunks, but more importantly allows the entire group to be compressed as one data stream.
In a range, all the instances’ data are concatenated together. They may be put together along with lengths, or the lengths may not be needed, as the case may be. The resulting combined data is the Payload of this chunk.
If there is a Payload Specification, this data is further processed before being placed in the chunk’s payload area. First it is compressed, then it is encrypted.
For small items like file comments, timestamps, and other attributes, being able to compress the whole set as one data stream is very important. Yet, the concept of associating an instance of (whatever) with each DATA instance still holds.
For
DATA chunks, being able to combine and compress several small files as a group can improve compression. It’s the exact same mechanism—an implementation can treat chunks uniformly.Identifies the creator of the archive, especially for purposes of recognizing the meanings of user-defined chunks. No Payload Specification. Details TBA.
Comments for the corresponding files. Each instance is UTF-8 text. Note that the text (if it is plain text—it may be compressed and/or encrypted) isn’t preceded by a length as with text fields in structured data. However, when multiple instances are in the same chunk, the length prefix exists for all strings except the last, because it’s part of the chunk format.
Instances from hex 40 to hex FFF inclusive shall not be used, as these are reserved for built-in standard definitions.
The Payload data starts with a uintV that indicates which compression algorithm is to be used. It is then followed by parameter and configuration information specific to that algorithm. See §7 starting on page
*.Indicates that the specified instances are in a different file. That is, it can tell the program which file in a multi-file archive contains the instance.
There is no Payload Specification.
The Payload data is a single uintV, indicating the file number of a multi-part archive that is expected to contain the corresponding COMP chunk. Note that there is only one value, and it indicates the location of all specified instances defined by this chunk.
Information about encryption, TBA.
The Encryption value in a Payload Specification refers to an instance of CRYP.
See §8.5 starting on page
*.Every chunk pointing to the same instance of CRYP is encrypted using the same key. If the CRYP instance requires prompting for a pass-phrase, then all uses of this instance refer to the same pass-phrase. Another instance of CRYP that points to the same crypto algorithm and parameters would imply that although the same algorithm is used, the user should be prompted again for a (presumably different) pass-phrase.
If the crypto algorithm and parameters don’t require user input, than identical instances are simply redundant.
Indicates that the specified instances are in a different file. That is, it can tell the program which file in a multi-file archive contains the instance.
There is no Payload Specification.
The Payload data is a single uintV, indicating the file number of a multi-part archive that is expected to contain the corresponding CRYP chunk. Note that there is only one value, and it indicates the location of all specified instances defined by this chunk.
See examples under
Data.Contains data, typically the content of a file in the archive. It may be compressed (usually is), and may be encrypted.
Each compressed file is stored as one DATA instance.
Indicates that the specified instances are in a different file. That is, it can tell the program which file in a multi-file archive contains the instance.
There is no Payload Specification. There is a single uintV, specifying the file which is expected to contain all the specified instances of DATA.
For example, hex 0B 44 61 74 61 01 29 00 03 39 7D A4 50
length: 0B (12 bytes)
"Data"
Instance: 01, count 29 (instances 1 through 42 inclusive)
part: 00
data: 03 (meaning: look at file 3)
checksum: 397DA450
means that DATA instances 1 through 42 inclusive can be expected in file 3 of a multi-file archive. Note that the chunk header specifies 42 instances, but there is only one data value. It applies to all instances referred to by the chunk’s header. You cannot group forwarding chunks that forward to different places.
This is data used to "seed" a compressor. A dictionary-based compression engine may be pre-loaded with data that is representative of the content being compressed. If multiple files refer to the same dictionary, you save space.
There are standard dictionaries that are known to the decompressor (see §6.3 on page
*) and are not included in the file. That really saves space, when compressing source code. However, you can include your own dictionary in an archive, using this chunk.The Instance number may not be between hex 40 and hex FFF, inclusive, as these are reserved for built-in standard dictionaries.
Indicates that the specified instances are in a different file. That is, it can tell the program which file in a multi-file archive contains the instance.
There is no Payload Specification.
The Payload data is a single uintV, indicating the file number of a multi-part archive that is expected to contain the corresponding DICT chunk. Note that there is only one value, and it indicates the location of all specified instances defined by this chunk.
Specifies an "external" dictionary. There is no Payload Specification. The Payload data is a URL of a file that contains a corresponding DICT instance, or another DICt instance that further redirects the search.
The URL shall be "URL encoded" 8-bit data. E.g. a space will be represented as "%20".
No Payload Specification, always 4 or 8 bytes of data.
This chunk, if present, must be last in the file. Upon finding this chunk, no further chunks are read. This allows the parser to stop before the physical end of the file, so the ZIP2 data stream can be embedded in a larger file.
A second use is to find the beginning of the data stream. The data in this chunk is a single uint32 or uint64 value that, when subtracted from the starting byte of this chunk, indicates the starting point of the first byte in the data set (the opening "Z" in the signature). This is intended to support self-extracting archives where the data set is appended to an executable file. The file may be opened, end ENdX chunk (with a correct CRC) found as the last 19 (or 23) bytes in the file. It’s presence is interpreted as a signature, and its payload points to the beginning of the archive.
The instance number, range, and part number must all be zero. Implementations may rely on this chunk being exactly 19 bytes long if the physical file is less than 4 GB, and exactly 19 or 23 bytes otherwise.
The value is a fixed size, not a uintV, to facilitate reading it backwards. It’s simpler to always look 19 bytes from the end than it is to try each location looking for a matching pattern.
"FAT Attributes", actually FAT and NTFS style attribute bits. Goes with corresponding DATA instance.
A small bitmap, to be displayed by graphical environments to represent this archive.
The icon may be one of several standard icons (such as a squeezed file folder or a setup icon), an actual image, a pointer to an icon that’s in the archive already, or an external location.
All methods of representing the icon are found in the ICON chunk, rather than using ICON, Icon, etc. for different means. Likewise, if several ICON instances are present, they must all be grouped into a single chunk. This is to facilitate a shell or other GUI browser quickly and simply finding the ICON data. A program otherwise unaware of the ZIP2 format can step through the chunks using the lengths, and find the ICON chunk, then stop looking.
There is no Payload Specification. ICON data is never compressed (other than the compression implied by the bitmap’s data format) or encrypted.
The Payload is a single byte specifying the type of record, which is followed by data specific to that type.
Type 0: A standard icon. Data is a single uintV specifying which icon to use. See the file standard_icons.zip2 for actual data—each file in it is named n-x.?ng is a bitmap for standard icon number n. There may be multiple images (different values of x) for one value on n, and all are the same image at different resolutions and color depths. Pick one.
Type 1: An actual image. Data is a PNG or MNG file for the bitmap. There may be multiple instances of ICON with type 1 data, and all are assumed to be the same logical image. However, each instance may be a different resolution or color depth or feature set (e.g. animation). A browser can choose the one it likes the best.
Type 2: An internal icon. The data is a file name (no length prefix) of a file within this archive. The browser should use the icon it normally would use for that file, possibly extracting it to a temporary location.
Type 3: An external icon. The data is a URL of a ZIP2 file, whose ICON is to be used instead. Presumably that file will have multiple Type 1 ICON records and no others.
IDENThis identifies this file, especially when part of a multi-file archive.
The Instance number must be zero. The part number corresponds to the file sequence. There is no Payload Specification. The Payload data is contains a revision number (a uint16) and the original file name (text).
The revision number should be incremented when a program updates the file. This can be used to verify that all files in a multi-part archive are in sync (or that you’ve downloaded a matched set). The value is a fixed size to facilitate in-place updates. The value can simply roll over at 64K, if that ever happens.
Each file in a multi-part split archive will have a copy of this chunk, with part number 1 found in the first file of the set, part number 1 found in the second file, etc.
This is the "directory" of files stored in the archive. Instance 0 is the principle directory. Other indexes are used for subdirectories or alternate directories.
An alternate directory can be used to give a different view of what’s in the archive, exposing different files, or the same files with different names or attributes. In a multi-part archive, each file can contain an index showing what’s in that specific file, in addition to the principle directory, containing everything, found in one of the parts.
The Payload of this chunk contains a prelude, followed by a record for each file.
The prelude is:
Type:
byte. 1=main, 2=alternate, 3=subdirectory. A main directory is the principle directory if the instance number is zero, and a per-file portion directory otherwise.FileSystem:
uintV. This specifies the file system of the creator. The decoder can use this information to help interpret the contents. 0=unspecified, 1=FAT, 2=HPFS, 3=NTFS, 4=Macintosh, 5=UNIX (elf type binaries), 6=UNIX (other).Pragmatically, you can watch out for naming differences, know to expect resource forks for Macintosh, or multiple data streams for NTFS.
Count:
uintV. This is how many file records follow.Base_time:
time64. This is the starting value used by the delta times.Each record is:
Type:
byte. 0=file, 1=directory, 2=volume label, 3=alternate index, 4=subdirectory index. High-order bits give common attributes: 0x80=private, 0x40=read-only.A "directory" type indicates a file entry that is the name of a subdirectory rather than a file. It doesn’t have a DATA Instance, but other corresponding Instances (e.g. FaTR and SECU) may be present if applicable to the directory itself.
This contrasts with the "subdirectory index" type, which points to another INDX Instance. That index’s content is implicitly included into this one, with this entry’s name prepended to each entry in the so-included INDX.
See §7.3 for more on subdirectory usage.
Name:
lpstring. This is the name of the file, or the description of the index. Names use a universal (file-system independent) system for indicating subdirectories and other special things. See §7.2, Entry Names, for details.Data:
uintV. Where to find the contents data. For a file, this is the instance number of a DATA chunk. For a subdirectory or volume label, this value is ignored. For an index, this is the instance number of a INDX chunk.Date:
sintV. This is the timestamp, expressed as the number of seconds different from the previous file’s time stamp. Note that the timestamps are also found (optionally) in TiME chunks. This value should match (to its lower precision) the TiME’s value. If there is a discrepancy, this value takes precedence.Size:
uintV. The uncompressed size of the file. IndxIndicates that the specified instances are in a different file. That is, it can tell the program which file in a multi-file archive contains the instance.
There is no Payload Specification. There is always a single uintV of data, specifying the file which is expected to contain all the specified instances of INDX.
Each instance holds a Macintosh Finder Info record (16 bytes) for the corresponding DATA instance.
This is native information on the Macintosh computers, specifying the file’s type and creator. If extracted on other platforms, this information can be used to add file extensions or otherwise indicate the file’s "type" on the target system.
Indicates that the MACF chunks are in another file in a multi-part archive.
There is no Payload Specification. There is always a single uintV of data, specifying the file which is expected to contain all the specified instances of MACF.
An Instance of ROOT corresponds with an Instance of INDX, and supplements it. The value in the Payload of ROOT is an absolute file path that specifies where the files originally came from, or otherwise where they should be extracted to.
When absolute paths are needed (perhaps the archive represents a backup), don’t put absolute paths into the INDX. Instead, the paths in INDX are relative to a starting point, normally specified upon extraction. The value in ROOT is a suggested default for extraction.
The value in ROOT is specified using the actual syntax for the filesystem noted in the corresponding INDX. This allows things that cannot be represented in the abstracted simple tree structure. For example, drive letters and UNC names for DOS-inspired file systems can be used. However, the control codes as used with INDX file names may be used also, so this value could be file-system independent if there is no need for file-system specific features.
In addition, substitutions are made in this string, so it may specified a logical location without knowing its actual name. For example "the temp directory", or "the home directory" or "the system directory".
The substitution codes are specified by one of the words listed below (case sensitive), surrounded by a pair of hex 1E characters (RS). For example (to use C string literal syntax), "\x1eTEMP\x1e\x1cMydir" specifies the Mydir subdirectory in the "temp" directory, whatever that may be.
Here is a list of special places.
home:
The "home" directory.ProgramFiles:
The place under which programs are installed. Corresponds to the "Program Files" directory under Windows, and the /usr/bin directory under Unix.temp:
The directory for temporary files.Windows:
(Windows systems) The Windows directory. This is returned by the GetWindowsDirectory API function.WindowsSystem:
(Windows systems) This is returned by the GetSystemDirectory API function. RTFMA comment that applies to the entire archive. Instance 0 should be displayed to the user whenever the file is manipulated (such as a title bar or status bar), and it should be kept short. Instance 1 and higher are displayed when asked, and may be longer. Instance 2 is for a copyright statement. Instance 3 should contain a URL for contact information or home site, and no other characters.
This may be repeated in each file of a multi-file split archive.
Security information, Win32-style. The ACL information for all the files can become lengthy and redundant. Rather than specify a fancier indexing scheme to central ACL data, this simple representation is simply compressed. Remember, the Payload Specification specifies a compression method, and this chunk can use Deflation with a dictionary with excellent results.
This holds an SHA-1 digest (20-byte binary format) for arbitrary chunks in this archive.
The ShAI chunk type (below) holds information for a decoded file. However, that does not protect other chunks, such as INDX, against corruption or mismatched multi-file sets. This SHAC (SHA-1 for a Chunk) can apply checking to any other part of the archive.
Possible uses include putting in one file a single SHAC checking all other chunks in a multi-part spanning archive. Or, multiple SHAC instances can be used, each checking different things. A SHAC record in one file might check the previous file in the set. It’s completely flexible, and only real use will tell what forms are most useful for what purposes.
An Instance contains a 20-byte SHA-1 digest, followed by a set of records nominating which chunks are part of the data to digest. Each record consists of:
Type:
4 byte name field.Instance_number_range:
2 uintV.Part:
unitV.These values match the corresponding values in the header of some chunk in the archive (see §4.4.1).
The nominated chunk must match the definition of some chunk in the archive. That is, if there is a chunk present that’s type INDX, instances 3 through 7, part 0, then you can’t nominate INDX instance 4 only. It has to match—exactly—the actual chunk block in the archive.
The data input to the SHA-1 algorithm is formed by concatenating the nominated blocks together, in the order in which they are specified. The data includes the entire block header starting with the Length, through the checksum at the end, inclusive.
A SHAC chunk may not nominate any SHAC chunk.
This holds an SHA-1 digest (20-byte binary format) of the corresponding
DATA Instance. See http://www.itl.nist.gov/fipspubs/fip180-1.htm for more information on SHA-1.This was originally called SHAO, with "O" for "One" because SHA1 is not a legal chunk name. However, the O was confusingly similar to a zero, and it mentally reads as "SHA zero" instead of "SHA O(ne)". So the I for the last character looks like a 1, and is a roman-numeral One. Hopefully this is less confusing.
Although each chunk contains a CRC-32 checksum, that does not offer full verification of the restored file since a file may be broken into multiple chunks. Just because each chunk was correct doesn’t mean that the parts in a multi-part archive didn’t get mixed up. Also, the chunk-level checksum applies to the encoded data, so won’t catch errors in the encoding or decoding process.
To address this, and provide a stronger check in the simple cases, this chunk may be used. It holds a 20-byte "digest" of the original file data. Upon restoring a file, this digest may be verified.
SHA-1 was chosen because MD5 has been compromised. However, the future may bring further improvements in this area, and SHA-1 may be dethroned in favor of something else. So one design idea was to have a single "digest" chunk name (such as DGST) and it would contain a byte specifying which digest method was used, for each Instance. However, this would prevent the instances from being a constant 20 bytes each. Instead, a new chunk name will be introduced when a new digest algorithm is adopted.
It is recommended that a multi-part spanning archive always contain ShAI records for any file that spans archive parts, and that shAI (a split ShAI) not be split across files.
SHA-1 is only defined on inputs up to 264 bits, or 256 bytes. Since the ZIP2 file format has no such limitation on its component files, we need to address this. Compute multiple SHA-1 on different parts of the file and then combine them; or hash the whole file down to something smaller and then compute SHA-1 on that? It’s an open issue, but not an urgent one since a file that large would require a million hard disks. With exponential growth, this won’t be a problem for 20 years or so. Large data streams that do exist today (scientific data from measurements) ought to use more error checking than a single 20-byte hash, and much smaller blocks with independent error recovery.
Indicates that the specified instances are in a different file. That is, it can tell the program which file in a multi-file archive contains the instance.
There is no Payload Specification. There is always a single uintV of data, specifying the file which is expected to contain all the specified instances of ShAI.
This holds a "digital signature" for the original file stored in the corresponding DATA Instance.
Important open issue: where do we get the keys and the code? There are numerous competing PKI and Certificate systems. PGP code is free and portable, and has a library that’s very good for programmatic use. Has the W3C endorsed or ratified anything yet?
Details TBA.
Besides the actual signature, meta data should include the manner in which the signature was made, and a "security level" number.
Security Level IndicatorThe "security level" is a number indicating how secure the signer is, as illustrated below.
The number are spaced apart so that additional levels can be added, and keep them in order from least to most safe. However, by defining only a few distinct levels, there will be less confusion about just how to rate something.
A value of 16 indicates that the ZIP program automatically signed the data, without direct user involvement. Specifically, any automatic script or batch file running on the system (possibly without the user’s knowledge) would produce signed files. This level of signing is not proof against a worm-type virus, as once the user is logged in, any background software can produce signed archives.
A value of 32 indicates that the user must interact with the system to produce a signature. That might just be a dialog box that says, "OK to sign using default signature?".
A value of 48 indicates that the passphrase was actually supplied by the user in order to access the private key. A script cannot do that without user interaction.
A value of 64 means that the public key is stored on a removable media, such as a CD-R or a smart card, and is only presented to the machine when it is momentarily accessed.
So, what’s to prevent an email worm virus from changing a 16 to a 48? The meta data must be stored inside the encrypted envelope, so that you cannot change a value without invalidating the signature.
While too heavy for script-type viruses, a Trojan Horse could include a full-blown ZIP2 creator program that’s been suitably modified to report the level wrong. So, the public keys themselves should be known to be secure to a particular level, offering a cross-check if necessary.
So if it can be defeated, why report the level at all? Because Windows 2000 has a "public key infrastructure" that seems to provide software running in a particular user’s account access to encrypt/decrypt ability without further prompting, once that user has logged on. If this is used to sign archives, there will be heavy use of level 16, which is useless against malicious scripts. This value was introduced to distinguish this kind of casual but useless signing from a stronger signature.
Indicates that the specified instances are in a different file. That is, it can tell the program which file in a multi-file archive contains the instance.
There is no Payload Specification. There is always a single uintV of data, specifying the file which is expected to contain all the specified instances of SIGN.
Advanced timestamp information, goes with corresponding DATA instance number.
The Payload contains three time64 values, specifying the last-write time, creation time, and last-access time.
This is a "Table Of Contents", used to make skimming a zip2 file much faster.
Consider a program that skims the entire zip2 file, noting the position and header information of each chunk. This is easy to do, but causes a lot of seeks, going through the entire file. Seeking is slow on some media such as CD’s, and that is also a prime potential usage of zip2 files. Similarly, a remote file can generate many round trips over the network to seek and read a little bit at a time.
To prevent the need for skimming the entire file, the TOcN chunk may be used. This contains the header information (type and instance) plus the file position of each chunk. When skimming, the program can stop skimming when it sees the TOcN, and simply use the information contained within it instead.
The TOcN is always instance 0. It defeats the purpose to break it into multiple parts, so there is no tOcN chunk defined.
The content of the TOcN is a record for each chunk in the file (the TOcN chunk itself may be omitted, as can the TOcn if it precedes the TOcN in the file, but all other chunks shall be listed). Each record contains the Type (4 byte name field), Instance_number_range (2 uintV) and Part (uintV) from the corresponding chunk, and the position in the file (uintV) such that the opening "Z" in the signature is byte zero.
Now, it would only be useful if the TOcN were found at the beginning of the file. But since its own length would affect the position of everything else, and it will change as the file is updated, that is impossible to arrange unless the file is produced and written as a single creation process. In order to allow files to be updated without being entirely rewritten, the TOcN must be allowed to appear wherever there is room. To this end, the TOcn chunk is used.
The TOcn chunk points to the TOcN chunk. This can be placed at the beginning of the file (the first chunk is best), and updated in-place when the TOcN is regenerated and relocated.
The TOcn contains exactly 4 or 8 bytes of data, and is a uint32 or uint64 containing the offset of the TOcN chunk. It is a fixed size to facilitate in-place updates.
This is data used to "seed" a compressor. A Huffman Tree or similar compressor can use this information instead of including it with the compressed data.
There are standard trees that are known to the decompressor and are not included in the file. However, you can include your own tree in an archive, using this chunk. A single TREE can be useful for compressing all the file comments, for example.
The Instance number may not be between hex 40 and hex FFF, inclusive, as these are reserved for built-in standard trees.
There is no Payload Specification. The Payload data is a list of 256 uint16 values. If the data is shorter than 512 bytes, the remaining entries are implicitly hex 8000, a middle value.
This table is a set of weights, indicating the relative commonality of each byte. This may be used to generate a Huffman tree, but is a general enough specification that it is easily used for other algorithms.
Indicates that the specified instances are in a different file. That is, it can tell the program which file in a multi-file archive contains the instance.
There is no Payload Specification. There is always a single uintV of data, specifying the file which is expected to contain all the specified instances of TREE.
Specifies an "external" tree. There is no Payload Specification. The Payload data is a URL of a file that contains a corresponding TREE instance, or another TREe instance that further redirects the search.
The URL shall be "URL encoded" 8-bit data. E.g. a space will be represented as "%20", and the protocol, such as "http:" shall also be present.
Advanced attributes, UNIX style. Gives permission bits for self, group, and all. Goes with corresponding DATA instance number.
This is a filler used to mark unused space within a file.
The Instance number, range, and part shall always be zero. There is no Payload Specification.
The Checksum is unique in this type in that it does not depend on the (ignored) Payload data. Instead, the Checksum is computed as the Length xor’ed with hex 58 58 58 58.
The data may be left as-is when a chunk is deleted—simply overwrite the 7 bytes after the Length (which is unchanged) with hex 58 58 58 58 00 00 00 and update the Checksum. It is not necessary to read the old data to compute the new Checksum, since it depends only on the Length.
Alternatively, the data can be replaced with all zeros. This takes more time when modifying the data set, but is advantageous because it can take advantage of "sparse file" disk allocation, or further compression during transmission.
To fill areas that are too small for a chunk definition (12 bytes), use fill bytes, as described in §4.2 on page
*.A chunk name beginning with "[" (becomes "(" when the "continued" flag is set) indicates data that corresponds to DATA instances. A program that doesn’t know what it is can still keep it associated with the correct file, by understanding this.
A chunk name beginning with "]" (becomes ")" when the "continued" flag is set) implies no such relationship, and programs that don’t know about it can’t manipulate it.
All names beginning with one of these 4 characters and continuing with 3 letters are user-defined, and will never have an official meaning in this specification. Any program generating user-defined chunks is expected to leave a "[ME]" chunk identifying itself, too.
The meaning of the first two letter’s flags must be respected precisely. The meaning of the last two letter’s flags are open to some interpretation, and ought to be used if they seem to make sense. Flag 2 must be off for "[" (associated with DATA) chunks, because implementations won’t know the implicit size of the individual values.
Built-In Chunk Definitions COMPInstance hex 40: Simply stored, no transformation at all.
Instance hex 41: Deflated (method 1), no dictionary.
Instance hex 42: UTR-6 (method 2), no parameters applicable.
Instance hex 43: Huffman encoding (method 4) using the English letter frequency tree (TREE instance hex 40).
Instance hex 44: Burrows-Wheeler Transform (method 6), no parameters applicable.
Not yet designed. At this point, I’m thinking that CRYP chunks will refer to crypto algorithms and parameters, in an analogous manner as COMP chunks do with compression algorithms.
The actual dictionary data is supplied in the
standard_chunks.ZIP2 file.Instance hex 40: No content. This is an empty dictionary that implies no pre-loading takes place.
Instance hex 41: C/C++ source code. All the reserved words and common extensions, in alphabetical order, separated by spaces.
Instance hex 42: Perl source code.
Instance hex 43: Meant for compressing
SECU chunks, it contains common ALC’s such as "Everyone".Instance hex xx: … more to be defined …
The actual tree tables are supplied in the
standard_chunks.ZIP2 file.Instance hex 40: English letter frequency data.
Notes on Archive Entries The Index Structure
Classic ZIP files use a forward slash character (‘/’) for a directory separator, regardless of platform. This is not good, because the slash is a perfectly legal character in Macintosh file names. In general, any choice of "special" character can clash with some system. So, this Name field uses characters in the C0 control code region as special markers, and provides a way to escape out any character, just in case.
The control characters are chosen because they have no graphic representation so are unlikely to be used in names, and they encode as one byte in UTF-8.
All characters in the range hex 00 through hex 1F inclusive (commonly called C0 Control Characters) are reserved. Three of them are defined in this specification, but the rest are reserved for future use. Any control character used in a Name must be escaped out.
The character hex 1B (ESC) is used to escape out special meanings for characters. The following character is taken as a letter, devoid of special meaning. For example, to include an actual ESC character, use ESC ESC.
The character hex 1C (FS) is used as the subdirectory separator character.
The character hex 1D (GS) is used as a stream-name separator. The name of a file may be followed by a GS character and the name of a sub-file. For NTFS file system, this specifies alternate data streams. For Macintosh systems, the stream-name of ‘
⌘’ (character hex 2318) specifies the resource fork.Files in subdirectories may be stored in separate indices, or qualified and stored in the same index.
… details and examples
We still need a "light duty" algorithm for compressing very short text, such as file comments and file names.
Deflation (method 1)The parameter data is a unitV specifying a
DICT instance. Note that DICT #0 is defined as empty, so this is how you use no dictionary.What Zlib does. Note that gzip and zip both require oddities—try and avoid that.
The code for zlib (written by by
Jean-loup Gailly and Mark Adler) is available on a huge number of platforms. It’s well tested, well documented, and actively maintained. It’s thought to be free of patent problems. It’s already standardized as RFC 1950.See
http://www.cdrom.com/pub/infozip/zlib/zlib_docs.html for details.This is a method for representing Unicode data more compactly than UTF-8 when a lot of non-"basic Latin" characters are used. Simply encoding Unicode data as UTF-8 requires a single byte only for ASCII characters, and takes two bytes for the "General Scripts" area, and more bytes otherwise, including CJKV Ideographs. UTR-6 will encode General Scripts as one byte per character through "windowing" the desired character set into a range of 1-byte values.
For details of the algorithm, see Unicode Technical Report #6 Revision 3.1, "A Standard Compression Scheme for Unicode", located at
http://www.unicode.org/unicode/reports/tr6/tr6-3.1.html.Note that UTR-6 may be used in conjunction with other compression algorithms. For example, the result of UTR-6 may then be run through a Huffman tree.
The general-purpose Deflation method fails to give optimum results in certain cases. This simple method can be used preparatory to deflation to eliminate the pathological cases and in less severe cases enhance its performance.
For example, given a megabyte file containing mostly the letter "M", a standard ZIP file held 2201 bytes of file data. However, that ZIP file was itself zipped down to 208 bytes, a reduction of 91% ! Clearly, the first pass did not compress as well as it might have. This simple example illustrates the limitations of a dictionary-based compression system.
If the same file is run-length compressed as described below, the result is 12 bytes. It is hex 02 84 00 08 01 83 FE 4D 00 02 0D 0A to be exact.
More typically, run-length compression will save room in the dictionary, so deflation can, in general, work better.
Preliminary experiments indicate that Chromosome 1 of the Human Genome Project is a megabyte smaller when simply Huffman encoded than when deflated. The result can be deflated by another 5%.
The file in question contains mostly the letters G, A, T, and C. Deflation should be able to seize on the repeating copies of the "conserved sequences", assuming they fit in the sliding window. However, it doesn’t work as well as simply using two bits per character.
Point is, some kinds of data don’t deflate well.
A custom encoder could be devised that compresses these data files very well, identifying repeated sequences that would not fit in the deflation "window", and encode them as type 2 runs. The resulting zip2 file could be read with a standard extractor, even though a custom compressor was used.
This method combines several simple schemes, which may be intermixed within one data stream.
The data stream consists of a set of runs. Each run starts with a byte specifying its type, and is followed by data specific to that type.
Type 0 is "verbatim". It is a 0 byte, a uintV length, and then that many data bytes. The decoded output is simply those data bytes unchanged.
Type 1 is "byte run". It is a 1, a uintV length, and a value byte. The decoded output is length copies of the value byte.
Type 2 is "repeat". It is a 2, a uintV repeat count, a uintV length, and that many data bytes. The data bytes are decoded, recursively, as runs. The result of that is then repeated count times to produce the output.
Type 3 is "delta". It is a 3 byte, a uintV repeat count, and that many data bytes. The first data byte is copied to the output unchanged. Each subsequent byte is added to the previous output byte (modulo 256) to form the next output byte.
Types 4 through 8 are used together for a manual dictionary lookup system. Strings are noted the first time they are encountered, and referenced again some time later. An encoder can perform global analysis of the file, or take advantage of special domain knowledge about the file, to produce repeated strings that are too long or too far apart for Deflation to find. Although a special encoder was used, the standard decoder can be used. For example, a compressor for the Human Genome Data can list the known repeated sequences.
Type 4 is used to note a string the first time it is seen. A 4 byte followed by a uintV index number (must be used in order starting with 1 (zero has special meaning)) that has not been used before, followed by a uintV length, followed by that many data bytes. The data bytes are copied to the output (recursively decoded), and are also remembered for later reference. If a Type 4 record appears nested inside another record such that it gets repeated, subsequent instantiations behave as a Type 5, pulling the string from the memorized table, and the data bytes are skipped and not decoded again.
Subsequently, a 5 byte followed by a uintV that is an existing index number will cause the memorized string to be output.
For fancier use, a 6 byte is a copy-with-changes. A 6 byte is followed by a uintV index number, which refers to a memorized string, followed by a uintV length, followed by that many data bytes. The data bytes are recursively decoded, and the result, not counting the content of nested Type 7 records, must match the length of the memorized string. The content of the decoded data is then treated as a delta to the memorized string: each byte of this record is added (modulo 256) to the corresponding byte of the memorized string, and the sum is output.
Within a Type 6 record, you may use Type 7 records to insert or remove content. A 7 byte followed by a positive sintV length followed by that many bytes of data is treated as an insertion, which is spliced into the memorized string at that point. The data is recursively decoded, and the result is spliced in at the position of the record, not consuming positions in the Type 6 delta string.
If a 7 byte is followed by a negative sintV length, the absolute value of that length is deleted from the memorized sequence. The content of the Type 6 delta string will be that many byte shorter, and should not contain delta bytes for the omitted positions.
When a memorized string will no longer be needed, it can be deleted from the memorized table by using a 8 byte followed by a uintV index number. This is used to save memory in the decoder, if desired.
The parameter data is a uintV naming a TREE instance.
Using data from a TREE chunk, a Huffman tree is created so that the most common letters have the smallest bit sequences. The output data is a set of blocks. Each block is a length (uintV) followed by the result of looking up the bit sequence on each byte of input, with trailing zeros added to make the output a multiple of 8 bits. A single zero byte (looks like a block length of zero) indicates the end of data.
The length is needed because the tailing padding bits could otherwise be confused for more data when decoding. The length is a uintV of the number of bytes in the unencoded input. That is chosen over the number of bits in the output so it may be emitted before knowing how long the output will be. The decoder will need to stop when the length reaches this expected value, even if there are bits remaining in the encoded data stream.
The wrapping blocks are used because the length is needed. By using multiple blocks, you don’t need to know the length of the input in advance (a requirement). You can compress each buffer-full independently, and output a 0 when you reach the end.
The bytes of encoded data are created from each group of 8 bits, the first bit in the stream being the high bit in the byte.
To form the Huffman tree, start with the
TREE data set as leaf nodes. This forms an initial list of 256 nodes.While there are more than one node in the list, find the two nodes with the smallest weights. Remove them from the list and form a new non-leaf node that points to these two nodes, the zero branch being the leaf node that was closer to the beginning of the list, and the one branch being the other one. The weight is the sum of the weights of the combined nodes. Add this new node to the end of the list.
The bit string emitted for a byte is defined as the branch choices to get from the root to that byte’s leaf node.
Composition (method 5)This powerful feature allows multiple compression algorithms to be chained together. This means that some compression methods can be designed as preliminary filters that are meant to be applied before deflation.
The parameter data is a list of steps to compose. Each item in the list is a uintV length of that item, a uintV compression method, and the method-specific parameter data.
For example, to specify Run-length compression followed by a Huffman tree using TREE #6, finally fed into deflation with no dictionary, the data would be:
03 00 04 01 06 01 01 00
What libbzip2 does (see
http://sourceware.cygnus.com/bzip2/index.html). This is a complete compression method centered on the Burrow-Wheeler Transform.The documentation on libbzip2 indicates that some improvements are possible but were omitted to maintain backward compatibility in the bzip file format. However, such improvements could be made for use within the ZIP2 file. This needs to be investigated.
The code for libbzip2 (by Julian Seward) is written in ANSI C and works on a variety of UNIX platforms and Windows. A Macintosh version has been found at
http://persephone.cps.unizar.es/~spd/bzip2/. It’s well tested, well documented, and actively maintained. It’s thought to be free of patent problems.An upcoming version of zlib is supposed to include BWT as a major new feature. Since it’s still in progress, it could be written to learn from libbzip2 and improve upon it. If ready in time, it could be a better candidate for use in this specification.
This algorithm gives better results (around 5%) than deflation for large files. For small files, deflation works better. When a suitable dictionary is available, small files deflate much better.
EncryptionZIP2 features an orthogonal model for encryption, in that any chunk having a Payload Specification can be encrypted.
As with classic ZIP files, one method of encrypting a file is such that anyone with the right pass-phrase can decrypt it. That is, the session key is generated entirely from the pass-phrase.
Another way is that the session key is encrypted using one or more public keys, and is targeted to only be read by the holders of the corresponding private keys.
Although freely-available portable code exists for strong encryption algorithms and it should be fairly simple to just choose a library, it is probably wise to develop this feature in conjunction with the digital signature capability, which is more difficult, in that it needs to work with existing key infrastructures.
This is the same CRC-32 algorithm and polynomial as used in ZIP files.
The code for computing the CRC always gets the polarity of the bits the other way around. That is, you have to ones-complement the word to get the right answer, and undo that on input to the next buffer-full.
So ZIP2 files represent the result using this more natural bit polarity. That is, a ZIP2 checksum will be bit-for-bit the exact opposite of the ZIP checksum. In C syntax,
zip2_CRC= ~ compute_zip_CRC (data);
The UTF-8 Encoding Form of Unicode data is described in the Unicode Standard. The 2.1 book gave a detailed description in an appendix; the 3.0 book gives a hurried description in paragraph D36 in section 3.8.
This is formally described in RFC 2279, available at
ftp://ftp.isi.edu/in-notes/rfc2279.txt or wherever Internet Standards are found.Since the whole idea of the ZIP2 file format is to save room, it is wasteful to store values with a larger number of bytes when a small number is generally enough. Also, since it is designed with the future in mind, what is the upper limit of the value’s size? For example, 32 bits might be a good choice for a length value now, but insufficient in the near future. So it is natural that many integer values will be stored as variable-sized integers. It will be short when possible, keeping the file efficient. However, it may be long when necessary.
My first idea was to simply use UTF-8 encoding. However, this is less efficient for small integers than it could be (two byte forms only goes up to 2K) and is limited to handling numbers up to 31 bits. A design goal was to handle 64 bits, so this would not do.
Since I was looking for something different anyway, it became clear that the design considerations for UTF-8 were simply not applicable for my needs. A simpler format would work quite well, and offer other benefits in exchange.
Here is the encoding, in a nutshell.
A byte with its high bit cleared is a one-byte value, holding a number from 0–127. If the high bit is set, the next highest bit indicates that the value is two bytes long when clear, longer when set.
This generalizes to a scheme where the first
n-1 bits are 1, followed by a 0 bit, followed by value data. There is a total number of n bytes.
Encoded Length |
Bits for the Value |
1 |
7 |
2 |
14 |
3 |
21 |
4 |
28 |
5 |
35 |
6 |
42 |
7 |
49 |
8 |
56 |
9 |
63 |
10 |
70 |
n |
7*n |
So a two-byte value can hold up to 16K, which is sufficient for ordinals most of the time. A 4-byte value can hold up to 256 Meg, which is sufficient for most length values, so lengths are no longer than they would be if variable-length encoding were not used. Full 64-bit values may be represented with 10 bytes, if required. Values may be larger than 64 bits, under this definition.
This is much more efficient than UTF-8 encoding for these kinds of values. Things were two bytes would be used anyway still take up only 2 bytes; things were four bytes would be used anyway still take up only 4 bytes; yet you have the flexibility of using values that are larger than what was supposed to be typical when the format was drafted.
The encoding is not unique. For example, the value 42 could be encoded as two bytes hex 80 2A, or as the single byte 2A. If you always use the shorter form, it follows that 80 2A is wasteful as it could be used for something else.
You could improve the algorithm by not duplicating all shorter values’ meanings. Specifically, a two byte encoding could be biased to encode a value from 128 to 16511, rather than from 0 to 16383, since the latter is redundant as 0–127 is handled just fine by a one byte encoding.
However, the range extension is less than 0.8%, so won’t contribute much to keeping the encoded values in the file short. It complicates the encoding logic, in terms of finding the smallest bit length. More seriously, it makes it harder to explain. Most critically, it makes the code more complex.
Given a byte stream queued up to the first byte in a variable-length integer,
Unsigned values from 0 through 127 are encoded as a single byte of the same value.
Signed values will have the sign bit (to be replicated) immediately following the first 0 bit.
A value up to 16,383 is encoded as binary 10xxxxxx xxxxxxxx.
A value less than 221 is encoded as binary 110xxxxx xxxxxxxx xxxxxxxx.
ConformanceIn order to make sure that any decoder can read any ZIP2 file, here is a set of requirements, options, and assumptions. This goes beyond what is listed in the main part of this specification.
Baseline 2000This details a "correct" file. It is suffixed with the year of definition, so that future revisions can be called "Baseline 2005" or whatever. It is expected that changes will be made to accommodate actual issues found in the field, and that lengths and number sizes may increase with time.
This details the behavior expected from a "robust" (that is, not flaky or buggy) implementation.