Summary

We need to do a better job of testing Ubuntu on server hardware. To do this we need to:

We will cover official Dapper server certification in this spec. Community server testing has moved to CommunityServerHardwareTesting, as per MattZimmerman's request.

Rationale

We're putting a lot of effort into making Dapper rock on servers. Because Dapper is an enterprise-ready release, we'll be supporting it for five years on servers, but none of this is much good if we can't guarantee it will run properly on modern server hardware.

Use cases

Community testing use cases are addressed in the community testing spec.

Scope

We would like to certify a minimum of 25 servers in the Dapper timeframe.

Implementation

Certification facility

The Harvard Computer Society will run the central Ubuntu certification facility in Cambridge, MA. HCS will:

Following testing, the servers will be tasked to do non-critical functions for Ubuntu and HCS, such as providing an Ubuntu archive mirror, or web serving. These services can be easily shut down when Ubuntu developers need to make use of a server to troubleshoot problems.

IvanKrstic will run the certification facility.

Installation testing

Installation testing does not require developing any new software. Certification facility staff will plop in an Ubuntu-server CD, watch the installation through to completion, and verify after rebooting that the system was installed properly.

We eventually want to have a d-i rescue mode profile for server testing. Booting into it would ask people to answer a few questions (which hardware does the system really have, vs. what the system detected automatically), and deliver the result to us. At first, knowledgeable testing facility staff can perform this by hand, with the help of a few custom-written scripts.

Burn-in testing

We will have a minimal, easily developed burn-in test suite for Dapper.

It will contain:

A certification burn-in run will be structured as follows:

Burn-in and installation test runs are collected in the HCS-developed server hardware testing catalog. Use of the catalog for community testing is explained in CommunityServerHardwareTesting.

Load testing UI

The test suite is wrapped in a shiny ncurses UI that, when started, asks the user whether she wants to perform a full burn-in (7 days) or a micro-burn-in (7 hours). A 7-hour micro-burn-in is assumed to be acceptable for community testing, and runs on the same schedule as a certification burn-in, with days scaled to hours.
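As a minimal sketch of the days-to-hours scaling, the micro-burn-in can reuse the full schedule and only swap the time unit. The three phases below are hypothetical placeholders chosen to add up to seven units; they are not the certification schedule, and the function name is our own.

```python
from datetime import timedelta

# Hypothetical phases (name, length in schedule units). Placeholders only:
# the real burn-in schedule is not reproduced here.
EXAMPLE_SCHEDULE = [
    ("cpu-and-memory", 3),
    ("disk-io", 2),
    ("network", 2),
]


def scale_schedule(schedule, micro=False):
    """Turn unit counts into durations: days for a full run, hours for micro."""
    unit = timedelta(hours=1) if micro else timedelta(days=1)
    return [(name, count * unit) for name, count in schedule]


if __name__ == "__main__":
    for name, duration in scale_schedule(EXAMPLE_SCHEDULE, micro=True):
        print(name, duration)
```

Keeping a single schedule definition means the community micro-burn-in cannot drift out of step with the certification run.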

The UI would run iperf(1) and stress(1) in verbose logging mode, and after a completed burn-in run, would offer to upload the results to the community server hardware testing catalog via HTTP. The official certification facility would cancel this upload, and upload the logs to the certified server hardware catalog manually.

Because a failed burn-in test often freezes or reboots the machine, the application needs a way to keep test checkpoints. It should write out a checkpoint to disk every hour and at the completion of every test (which resets the timer). The checkpoint file will only be appended to, and so will contain a record of any restarted runs; this checkpoint file will also be uploaded to the catalog, which will parse it to see if any tests failed.
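One possible shape for the append-only checkpoint writer is sketched below. The one-JSON-object-per-line format, the event names ("start", "heartbeat", "complete"), and the class and method names are all assumptions made for illustration; the spec does not fix a file format.

```python
import json
import os
import time

CHECKPOINT_INTERVAL = 3600  # seconds: one checkpoint per hour, as described above


class CheckpointLog:
    """Append-only checkpoint file; the record format is an assumption."""

    def __init__(self, path, clock=time.time):
        self.path = path
        self.clock = clock
        self.last_write = clock()

    def _append(self, event, test):
        record = {"time": self.clock(), "event": event, "test": test}
        # Mode "a" never truncates, so earlier (restarted) runs stay on record.
        with open(self.path, "a", encoding="utf-8") as fh:
            fh.write(json.dumps(record) + "\n")
            fh.flush()
            os.fsync(fh.fileno())  # push to disk before the machine can freeze
        self.last_write = record["time"]  # every write resets the hourly timer

    def test_started(self, test):
        self._append("start", test)

    def test_completed(self, test):
        self._append("complete", test)

    def tick(self, test):
        """Call periodically; writes a heartbeat once an hour has elapsed."""
        if self.clock() - self.last_write >= CHECKPOINT_INTERVAL:
            self._append("heartbeat", test)
            return True
        return False
```

The explicit `os.fsync` matters here: a checkpoint that is still sitting in the page cache when the machine locks up is lost.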

The application needs to read the checkpoint file, if it exists, at every start: if it determines a test was interrupted, it should offer to start from the interrupted test instead of starting from scratch. A user-interrupted test will be specifically mentioned in the checkpoint file, so that it can be differentiated from unexpected test interruptions due to machine reboots or freezes.
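The start-up check could look roughly like the sketch below. It assumes a hypothetical checkpoint format of one JSON object per line with "event" and "test" keys, where "event" is one of "start", "complete" or "user_interrupt"; none of these names come from the spec.

```python
import json
import os


def find_interrupted_test(path):
    """Return (test_name, reason) for an unfinished test, or None.

    reason is "user" if the file records a user interruption of that test,
    and "unexpected" otherwise (for example a reboot or freeze).
    """
    if not os.path.exists(path):
        return None
    open_test = None
    reason = None
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # a crash can leave a half-written line behind
            if not isinstance(record, dict):
                continue
            event, test = record.get("event"), record.get("test")
            if event == "start":
                open_test, reason = test, "unexpected"
            elif event == "complete" and test == open_test:
                open_test, reason = None, None
            elif event == "user_interrupt" and test == open_test:
                reason = "user"
    return (open_test, reason) if open_test is not None else None
```

If this returns a test name, the UI can offer to resume from that test; a `None` result means the previous run either finished cleanly or never started.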

Custom install CDs

We may want to produce install CDs tailored to specific certified hardware. Vendors would pay for the creation of these CDs, possibly as part of the certification process, and the CDs would then be available to customers for free. We would base the customized CDs on the hardware list and testing results obtained from our official server certification run.

Timeframe

It would be at least 6-8 weeks before hardware could start shipping from vendors to the HCS certification facility. This means the server test suite needs to be completed by the end of the year.

HCS can start receiving and processing hardware in approximately the same number of weeks. However, actual certification runs can't start before February 1st, 2006. The gap is a one-time setup cost, and will not exist for future releases. This leaves two months for server certification, and since a full certification run (including burn-in testing) takes one week, two months should be more than enough time.
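A quick back-of-the-envelope check of this estimate against the 25-server target from the Scope section is sketched below. It assumes that two months amounts to roughly eight one-week slots and that several servers can be burned in concurrently; the function name is ours.

```python
import math


def min_parallel_runs(servers, window_weeks, weeks_per_run=1):
    """Smallest number of servers that must be tested concurrently."""
    sequential_slots = window_weeks // weeks_per_run
    return math.ceil(servers / sequential_slots)


if __name__ == "__main__":
    # 25 servers, about 8 weeks, 1 week per run
    print(min_parallel_runs(25, 8))  # prints 4
```

Under these assumptions the facility needs to run at least four certification runs side by side to get through 25 servers in the window.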

Outstanding issues


CategorySpec

TestingServerHardware (last edited 2008-08-06 16:30:16 by localhost)