apt-sync
Launchpad Entry: https://launchpad.net/distros/ubuntu/+spec/succinct
Created: 2006-05-26 by FelixFeyertag
Priority: NeedsPriority
People:
Contributors: FelixFeyertag
Interested:
Status: Braindump
Depends:
Dependents: FullSearch()
BoF sessions: none yet
Packages affected:
Summary
Succinct is based on rsync/zsync, extended to specifically handle .deb files. This makes it possible to update existing Debian packages without downloading redundant data, common to both the original and updated package. An alternative solution would be to use dedicated patch files. An advantage of Succinct over patch files, is that separate patch files are not necessary, since this method automatically identifies and downloads only those portions of a .deb package that are different to the version available. Computing which blocks of .deb files have changed is handled entirely on the client side, so the server will not have to deal with the overhead of creating patch files on a per-client basis.
Further information about Succint can be found at:
- [:PaulSladen/Succinct]
For discussion on the patch approach, see:
- [:APTPackageDeltas]
- [:APTPackageDeltas/talk]
This project is part of Google Summer of Code 2006, and is supervised by Michael Vogt.
Rationale
To save bandwidth when distributing updates.
Use cases
Sandy Slowspeed wants to apply the latest security updates for Dapper Drake, upon realizing that it would take over 12 hours to download the 300 MB required on a 56k modem, Sandy decides not to update after all.
Scope
Design
This project is based on the rsync algorithm as it is implemented by zsync (with all computation done on the client side). It is extended to work effectively on the .deb package format.
The rsync algorithm is used to identify which blocks of data are the same in the Debian file available, and the updated Debian file on a remote server. The identification occurs by calculating checksums over blocks of data within the two files, and identifying common sequences. A detailed explanation of the algorithm used by rsync can be found at:
The implementation of rsync does all calculations and comparisons on the server side, and a dedicated rsync server must run on the server. The high overhead on the server-side can make it unsuitable for widespread distribution of files. zsync is an alternative implementation based on the rsync algorithm, which does all comparisons on the client side. This is achieved by the server doing a one-time calculation of the checksums of all blocks in the file that needs to be distributed, and stores these checksums to another file. This file then contains all necessary data required for a client to calculate what data is required and what data is already available, so the client can do all calculations. Another advantage of this approach is that no dedicated rsync server needs to be run on the server, as HTTP can be used for all data transfer. Further explanation of the zsync program can be found at:
The rsync algorithm can be ineffective when sending compressed data, since compression can radically alter common sequences of code within a file. Debian packages contain two gzipped tar files, one for data and another for control information. Because of the compression, rsync can be ineffective if applied directly to Debian packages. The succinct project is going to identify possibilities for finding common sequences of code between two Debian packages despite the compression, and implement the best solution in a program based on zsync. Loosely speaking, zsync will be extended to "understand" Debian packages.
Further, a patch to APT will be provided so that it is able to use succinct when downloading updates from a remote server.
Sample Data
The table below shows the performance of zsync for transferring the data.tar.gz file from some Debian packages when it is uncompressed, compressed as gzip and compressed as gzip with the --rsyncable option. Note that by default zsync automatically looks inside compressed gzip files, this can be prevented using the -Z option (which will treat the gzipped file as normal binary data). In the table below, the performance on compressed data is shown with -Z and without.
Sample Data
Package
Version
Uncompressed
Gzip compressed (with/without zsync -Z option)
Gzip compressed with --zsyncable option (with/without zsync -Z option)
Old
New
Local
Fetched
Local
Fetched
Local
Fetched
firefox
[ftp://ftp.ubuntu.com/ubuntu/pool/main/f/firefox/firefox_1.0.7-0ubuntu20_i386.deb 1.0.7-0ubuntu20_i386]
[ftp://ftp.ubuntu.com/ubuntu/pool/main/f/firefox/firefox_1.0.8-0ubuntu5.10_i386.deb 1.0.8-0ubuntu5.10_i386]
12302336
10944576
61440/12302336
8436056/4214131
3164160/12302336
5432743/4346548
firefox
[ftp://ftp.ubuntu.com/ubuntu/pool/main/f/firefox/firefox_1.0.8-0ubuntu5.10_i386.deb 1.0.8-0ubuntu5.10_i386]
[ftp://ftp.ubuntu.com/ubuntu/pool/main/f/firefox/firefox_1.5.dfsg+1.5.0.4-0ubuntu6.06_i386.deb 1.5.dfsg+1.5.0.4-0ubuntu6.06_i386]
1886208
20280600
0/1886208
7930655/7138161
391168/1886208
7621814/7255178
gimp-data
[ftp://ftp.ubuntu.com/ubuntu/pool/main/g/gimp/gimp-data_2.2.2-1ubuntu5_all.deb 2.2.2-1ubuntu5_all]
[ftp://ftp.ubuntu.com/ubuntu/pool/main/g/gimp/gimp-data_2.2.8-2ubuntu6_all.deb 2.2.8-2ubuntu6_all]
3715072
1772697
92160/3715072
1994886/674701
581632/3715072
1477323/736132
openoffice.org-bin
[ftp://ftp.ubuntu.com/ubuntu/pool/main/o/openoffice.org/openoffice.org-bin_1.1.2-2ubuntu6.1_i386.deb 1.1.2-2ubuntu6.1_i386]
[ftp://ftp.ubuntu.com/ubuntu/pool/main/o/openoffice.org/openoffice.org-bin_1.1.3-8ubuntu2.3_i386.deb 1.1.3-8ubuntu2.3_i386]
27041792
96178842
0/29054976
42036363/32388989
7841792/29054976
34405598/32924707
ubuntu-artwork
[ftp://ftp.ubuntu.com/ubuntu/pool/main/u/ubuntu-artwork/ubuntu-artwork_0.2.24-1_all.deb 0.2.24-1_all]
[ftp://ftp.ubuntu.com/ubuntu/pool/main/u/ubuntu-artwork/ubuntu-artwork_0.2.27-1_all.deb 0.2.27-1_all]
2899968
5324148
251904/2899968
5743360/5036070
593920/2899968
5410650/5061194
Using the automatic zsync look-inside for standard gzip files repeatedly produces the best results (the least data needs to be fetched), the problem with it is that zsync can not guarantee that the exact same file will be reproduced (such that checksums match).
Implementation
Code
Code will be written in Python where possible/sensible.
Data preservation and migration
Outstanding issues
- How best to identify common segments of data between two gzipped files. A "look-inside" mechanism could be used or a --zsync option to gzip.
- APT must be able to validate an updated .deb package. The MD5 sum must therefore match the original package. Another (depreciated) option would be to modify APT to handle different MD5 sums.
BoF agenda and discussion
I think good thing will be to fetch Packages.gz using delta-compression too. It is done in new version of apt-get in debian.
External Links
rsync:
zsync:
Discussions on the debian-devel list:
Some thoughts from Martin Pool (of rsync):