apt-sync

Revision 4 as of 2006-06-13 14:28:19

Clear message

Summary

Succinct is based on rsync/zsync, extended to specifically handle .deb files. This makes it possible to update existing Debian packages without downloading redundant data, common to both the original and updated package. An alternative solution would be to use dedicated patch files. An advantage of Succinct over patch files, is that separate patch files are not necessary, since this method automatically identifies and downloads only those portions of a .deb package that are different to the version available. Computing which blocks of .deb files have changed is handled entirely on the client side, so the server will not have to deal with the overhead of creating patch files on a per-client basis.

Further information about Succint can be found at:

  • [:PaulSladen/Succinct]

For discussion on the patch approach, see:

This project is part of Google Summer of Code 2006, and is supervised by Michael Vogt.

Rationale

To save bandwidth when distributing updates.

Use cases

Sandy Slowspeed wants to apply the latest security updates for Dapper Drake, upon realizing that it would take over 12 hours to download the 300 MB required on a 56k modem, Sandy decides not to update after all.

Scope

Design

This project is based on the rsync algorithm as it is implemented by zsync (with all computation done on the client side). It is extended to work effectively on the .deb package format.

The rsync algorithm is used to identify which blocks of data are the same in the Debian file available, and the updated Debian file on a remote server. The identification occurs by calculating checksums over blocks of data within the two files, and identifying common sequences. A detailed explanation of the algorithm used by rsync can be found at:

The implementation of rsync does all calculations and comparisons on the server side, and a dedicated rsync server must run on the server. The high overhead on the server-side can make it unsuitable for widespread distribution of files. zsync is an alternative implementation based on the rsync algorithm, which does all comparisons on the client side. This is achieved by the server doing a one-time calculation of the checksums of all blocks in the file that needs to be distributed, and stores these checksums to another file. This file then contains all necessary data required for a client to calculate what data is required and what data is already available, so the client can do all calculations. Another advantage of this approach is that no dedicated rsync server needs to be run on the server, as HTTP can be used for all data transfer. Further explanation of the zsync program can be found at:

The rsync algorithm can be ineffective when sending compressed data, since compression can radically alter common sequences of code within a file. Debian packages contain two gzipped tar files, one for data and another for control information. Because of the compression, rsync can be ineffective if applied directly to Debian packages. The succinct project is going to identify possibilities for finding common sequences of code between two Debian packages despite the compression, and implement the best solution in a program based on zsync. Loosely speaking, zsync will be extended to "understand" Debian packages.

Further, a patch to APT will be provided so that it is able to use succinct when downloading updates from a remote server.

Sample Data

The table below shows the performance of zsync for transferring the data.tar.gz file from some Debian packages when it is uncompressed, compressed as gzip and compressed as gzip with the --rsyncable option. Note that by default zsync automatically looks inside compressed gzip files, this can be prevented using the -Z option (which will treat the gzipped file as normal binary data). In the table below, the performance on compressed data is shown with -Z and without.

Using the automatic zsync look-inside for standard gzip files repeatedly produces the best results (the least data needs to be fetched), the problem with it is that zsync can not guarantee that the exact same file will be reproduced (such that checksums match).

Implementation

Code

Code will be written in Python where possible/sensible.

Data preservation and migration

Outstanding issues

  • How best to identify common segments of data between two gzipped files. A "look-inside" mechanism could be used or a --zsync option to gzip.
  • APT must be able to validate an updated .deb package. The MD5 sum must therefore match the original package. Another (depreciated) option would be to modify APT to handle different MD5 sums.

BoF agenda and discussion

I think good thing will be to fetch Packages.gz using delta-compression too. It is done in new version of apt-get in debian.

rsync:

zsync:

Discussions on the debian-devel list:

Some thoughts from Martin Pool (of rsync):


CategorySpec