ReliableRaid
Launchpad Entry:
Created: 2009-12-15
Contributors:
Packages affected:
See also: BootDegradedRaid, HotplugRaid
Summary
RAIDs (Redundant arrays of independent disks) allow systems to keep functioning even if some parts fail. You simply run more than one disk side by side. If a disk fails, the mdadm monitor will trigger a buzzer, a notify-send message or an email to signal that a (new spare) disk has to be added to restore redundancy. All the while the system keeps working unaffected.
Release Note
Event-driven raid/crypt startup. (Hotplugging that supports booting more than just the simple "root filesystem directly on md device" case when arrays are degraded.)
Rationale
Unfortunately, Ubuntu's md (software) raid configuration suffers from several gaps.
The assembling of arrays with "mdadm" has been transitioned from the Debian startup scripts to the hotplug system (udev rules). However, some bugs defeat the hotplug mechanism, and functionality that is generally expected (as in "just works" in other distros) is missing in Ubuntu:
- Luks on raid. If luks is installed on a raid mirror, the raid eventually gets destroyed.
Bug #531240 breaking raid: root raid_member opened as luks
- No handling of raid degradation during boot for non-root filesystems (at all). (Boot simply stops at a recovery console)
The Debian init script has been removed but no upstart job has been created to start/run necessary regular (non-rootfs) arrays degraded. 259145 non-root raids fail to run degraded on boot
- Only limited and buggy handling of raid degradation for the rootfs. (Working only for plain md devices without lvm/crypt, and only after applying a fix from the 9.10 release notes.)
- The initramfs boot process is not a state machine capable of assembling the base system from devices appearing in any order and starting the necessary raids degraded if they are still incomplete after some time.
491463 upstart init within initramfs (Could handle most of the following nicely by now.)
251164 boot impossible due to missing initramfs failure hook integration
136252 mdadm, initramfs missing ARRAY lines
247153 encrypted root initialisation races/fails on hotplug devices (does not wait)
488317 installed system fails to boot with degraded raid holding cryptdisk
The proper mdadm --incremental option does not work in initramfs (not creating device nodes) 251663
- No notification of users/admins about raid events is enabled. (The email question is suppressed during install, without any buzzer/notify-send replacement.)
Note that no problem arises in a hotpluggable system if an array is degraded and a drive comes up later. It can simply be (re-)added to the array (and will be synced in the background if any writes have occurred). The admin should, however, get a notification if a drive did not come up in time.
There really isn't any problem that requires a rescue console/repair prior to boot if a disk fails *while the system is powered down*. In all other cases of disk failure, however, the problem is that nobody is notified (the system stays running, and later boots straight up, without any notification anyway).
Possible tasks that do require admin action *after* the raid has done what it is designed to do (keep the system going despite a failure) are:
- Forcibly re-adding a drive marked faulty to the array (e.g. after an occasional bad-block error that modern hard drives will remap automatically). In a hotplug environment this can simply translate to detaching and re-attaching the device.
- Replacing a faulty drive with another one, partitioning it if necessary, and adding it/them to the degraded array(s).
Use Cases
- Angie installs Ubuntu on two raid arrays: one raid0 containing an lvm VG for the root filesystem (/) and swap, and one raid1 (mirror) containing an lvm VG for /home. When one of the raid mirror members fails or is detached while the system is powered down: the system waits 20 seconds (default) for the missing member, then resumes booting with a degraded raid and emits notifications by means of beeping, notify-send, and email (configurable). When the raid mirror member is re-attached later on (hotpluggable interface) it gets automatically synced in the background.
- Bono does the same with his laptop and one external drive, but uses lvm on top of cryptsetup on the raids. After being on the road using the laptop he reconnects his laptop to his external peripherals on his desk (including the disk drive) *prior to powering it up*.
Design
Event-driven degradation for mdadm should be possible with a simple configuration change to the mdadm package that hooks it into upstart, so that a raid is started degraded if it hasn't fully come up after a timeout; a rough sketch follows below. (This would appropriately replace the second mdadm init.d script present in the Debian package, instead of dropping it.)
- cryptsetup is already set up event-driven (with upstart, not yet with udev)
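A minimal sketch of what such an upstart hook could look like. The job name, the trigger event, the MIN_MD_COMPLETION_WAIT source and the /proc/mdstat parsing are assumptions for illustration, not the current mdadm packaging:

    # hypothetical /etc/init/mdadm-degraded.conf (name and trigger are assumptions)
    description "start required raid arrays degraded after a timeout"
    start on starting mountall    # i.e. after udev has had its chance to assemble arrays
    task

    script
        # only wait if an array is still only partially assembled
        # (a real job would consult the boot watchlist and /etc/default/mdadm)
        if grep -q inactive /proc/mdstat; then
            sleep ${MIN_MD_COMPLETION_WAIT:-20}
            for md in $(awk '/inactive/ {print $1}' /proc/mdstat); do
                mdadm --run "/dev/$md" || true
            done
        fi
    end script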
For event-driven behaviour in the initramfs: the initramfs scripts and their failure hooks look like far too much work and overcomplicate things. An event-based boot would have to be reimplemented with the initramfs scripts instead of using upstart to set up (crypt, raid, lvm, ... and) the rootfs from the initramfs. It would be good to adapt the upstart approach taken for 1) to set up the rootfs within the initramfs.
- cryptsetup will need to be converted to the event driven setup in initramfs
Implementation
- Package mdadm needs scripts to supply the initramfs with a MIN_MD_COMPLETION_WAIT value and the "dependency tree of UUIDs of arrays" necessary for the rootfs when mkinitramfs is called; a rough hook sketch follows.
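A minimal sketch of such a hook, assuming initramfs-tools conventions; the file names, the /conf paths and the get_rootfs_raid_deps helper (i.e. the watchlist generator sketched in the pseudo code further below) are illustrative only:

    #!/bin/sh
    # hypothetical /etc/initramfs-tools/hooks/md-watchlist (name is an assumption)
    PREREQ="mdadm"
    prereqs() { echo "$PREREQ"; }
    case "$1" in prereqs) prereqs; exit 0 ;; esac
    . /usr/share/initramfs-tools/hook-functions

    mkdir -p "${DESTDIR}/conf/conf.d"
    # timeout before incomplete arrays are started degraded (20s is just an example default)
    echo "MIN_MD_COMPLETION_WAIT=${MIN_MD_COMPLETION_WAIT:-20}" \
        > "${DESTDIR}/conf/conf.d/md-watchlist"
    # dependency tree of array UUIDs needed for the rootfs, one dependency level per line
    get_rootfs_raid_deps > "${DESTDIR}/conf/md-watchlist"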
- During rootwait, when time_elapsed == MIN_MD_COMPLETION_WAIT (raid_start_degraded event), do:
- If a next level in the dependency tree exists and the remaining root delay timer is lower than MIN_MD_COMPLETION_WAIT, the rootdelay timer is increased by MIN_MD_COMPLETION_WAIT.
- The arrays of the current dependency level that are still incomplete are started degraded. (About an event-driven initramfs see https://bugs.launchpad.net/ubuntu/+source/cryptsetup/+bug/251164/comments/15 ...) A rough sketch of this wait loop follows.
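A rough sketch of that wait loop inside the initramfs, assuming the watchlist file from the hook above; md_level_complete, extend_rootdelay_if_needed and start_level_degraded are hypothetical helpers named only for illustration:

    # read MIN_MD_COMPLETION_WAIT as written by the hook at image build time
    . /conf/conf.d/md-watchlist

    while read level_uuids; do                     # one dependency level per line
        waited=0
        until md_level_complete $level_uuids; do
            [ "$waited" -ge "$MIN_MD_COMPLETION_WAIT" ] && break
            sleep 1
            waited=$((waited + 1))
        done
        # if another level still has to come up, leave it at least
        # MIN_MD_COMPLETION_WAIT of the remaining rootdelay as well
        extend_rootdelay_if_needed
        # raid_start_degraded: run whatever is still incomplete on this level
        md_level_complete $level_uuids || start_level_degraded $level_uuids
    done < /conf/md-watchlist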
Using the legacy method to start degraded raids selectively (mdadm --assemble --uuid) will break later --incremental (re)additions by udev/hotplugging. (The initramfs currently uses "mdadm --assemble --scan --run" and starts all available arrays degraded! The corresponding command "mdadm --incremental --scan --run" to start *all remaining* hotpluggable raids degraded (something still to be executed only manually, if at all!) does not start anything. 244808)
The proper command (i.e. for boot scripts, --incremental --run --uuid) to start *only specific* raids degraded in a hotpluggable manner may not be available yet (i.e. to start only the rootfs degraded after a timeout from the initramfs). 251646 (Possible workaround: remove a member from the incomplete array and re-add it with --incremental --run.)
mdadm still reads/depends on a static /etc/mdadm/mdadm.conf file containing UUIDs (in the initramfs!). It refuses to assemble any hotplugged array not mentioned there and tagged with its own hostname. (It does not default to simply assembling matching superblocks and running arrays (only) if they are complete.) This behaviour actually breaks the autodetection of every array newly created on a system, as well as connecting a (complete) md array from another system. (Bug 252345) For instructions on updating the initramfs refer to: http://ubuntuforums.org/showthread.php?p=8407182
- For hotpluggable systems, it is a reasonable default to automatically re-add members when they are re-attached (udev event) after getting out of sync. If a drive keeps getting marked faulty, even if only due to block errors, it is of course time to add a new/other drive to that array (manually, if you have not prepared a spare disk already). If the mdadm udev rule that fires "mdadm --incremental $device_name" returns "mdadm: failed to open $raid_device: Device or resource busy" ($raid_device is already running degraded), it should issue "mdadm --add $raid_device $device_name" to re-add the re-attached member; a rough helper sketch follows.
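A rough sketch of a helper that the udev rule could call instead of bare "mdadm --incremental"; the script path, the error-string match and the MD_UUID parsing are assumptions about the installed mdadm, for illustration only:

    #!/bin/sh
    # hypothetical /lib/udev/mdadm-incremental-or-readd <device> (path/name are assumptions)
    dev="$1"

    # normal hotplug path first
    out=$(mdadm --incremental "$dev" 2>&1) && exit 0

    case "$out" in
        *"Device or resource busy"*) ;;    # array already running (degraded): fall through
        *) echo "$out" >&2; exit 1 ;;      # any other failure: give up
    esac

    # find the array this member belongs to and re-add the device to it
    uuid=$(mdadm --examine --export "$dev" | sed -n 's/^MD_UUID=//p')
    array=$(mdadm --detail --scan | awk -v u="$uuid" 'index($0, u) {print $2; exit}')
    [ -n "$array" ] && exec mdadm "$array" --add "$dev"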
- The mdadm package needs to supply proper udev rules to clean up member devices of raids when they are detached; a sketch follows.
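A sketch of what such a cleanup rule could look like (the rule file name is an assumption; newer upstream mdadm ships a comparable remove rule based on "mdadm --incremental --fail", but whether that option is available depends on the mdadm version):

    # hypothetical /lib/udev/rules.d/65-mdadm-cleanup.rules (file name is an assumption)
    # On removal of a raid member, tell mdadm to fail/remove it from its array so the
    # array keeps running degraded instead of referencing a vanished device.
    ACTION=="remove", SUBSYSTEM=="block", ENV{ID_FS_TYPE}=="linux_raid_member", \
        RUN+="/sbin/mdadm --incremental --fail $name"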
- Ubuntu should make use of partition capable arrays (/dev/md_dX type).
- The Ubuntu server manual claims that "If the array has become degraded, due to the chance of data corruption, by default Ubuntu Server Edition will boot to initramfs after thirty seconds. Once the initramfs has booted there is a fifteen second prompt giving you the option to go ahead and boot the system, or attempt manual recover." However:
- The kernel will never autostart a raid that does not reproduce a correct checksum, no matter whether it is degraded or not. There is nothing to manually recover about a degraded raid before it can be started. A recovery console is appropriate *after* starting a raid degraded has failed.
- A replacement disk can be added at any time if a spare disk isn't already installed from the beginning. If the disks are not connected over a hotpluggable interface, the system must be powered down for this. (Recovery console is also pointless in this case.)
If a drive fails while the system is powered up, by default nobody is notified and the system will simply reboot degraded afterwards anyway. Reason: 244810 inconsistency with the --no-degraded option.
- The boot process will usually not be stopped (and should not) for something (like adding and syncing a new drive to the raid) that is designed to be done on live systems (quite a good thing to do as default).
- The possibility that large server RAIDs may take minutes until they come up, but regular ones are quick, can be handled nicely:
"NOTICE: /dev/mdX didn't get up within the last 10 seconds. We continue to wait up to a total of xxx seconds complying to the ATA spec before attempting to start the array degraded. (You can lower this timeout by setting the rootdelay= parameter.) Press escape to stop waiting and to enter a rescue shell.
- This functionality is similar to and could most easily be added in (the temporary tool) mountall.
- We've tried to avoid "fallback after a timeout" kind of behaviours in the past.
- However, cryptsetup currently needs to time out in the initramfs, and does (since it's not event-driven). And the raid setup needs to time out waiting for full raid discovery in order to decide about degrading. The classic implementation uses a second startup script later in the boot-up process. (But it has been silently dropped in Ubuntu without a proper replacement.)
- How would you decide what device is needed?
This may be about the only reason for keeping an /etc/mdadm/mdadm.conf-like file around in a hotpluggable system. (A watchlist of the arrays required to boot: wait for them, and run them degraded after a while if they did not come up fully.)
- The determination of raid deps would be a command similar to update-grub. The determined raid watchlist needs to be saved in the root filesystem (/etc/...) for the raid deps of non-root filesystems, and in the initramfs for the root filesystem (copied there during update-initramfs).
- The watchlist with the required md devices may not be explicitly available in the fstab, but has to be determined like this (pseudo code):
- For lvm and crypt devices "dmsetup deps" returns the major/minor of the parent device and "mdadm --query /dev/block/x:y" can tell what kind of md device it is. (get_lvm_deps() from /usr/share/initramfs-tools/hooks/cryptroot uses this to find crypt devices.)
- Since md devices do not use the device mapper but (if separate bitmaps are required) can depend on other md devices themselves, those dependencies need to be looked up in /proc/mdstat or with "mdadm --detail" separately.
    get_raid_deps(child_dev) -> list-of-raid-deps {
        if ['mdadm --query child_dev' == IsRaid]
            push child_dev to list-of-raid-deps
            for all member-devices gotten from 'mdadm --detail child_dev'
                push get_raid_deps(member-device) to list-of-raid-deps
            done
        done
        if ['dmsetup deps child_dev' returns something]
            push get_raid_deps('dmsetup deps child_dev') to list-of-raid-deps
        done
    }

    get_list-of-raids-to-run-if-degraded() -> raid-watchlist {
        blkid -g
        for all bootwait filesystems
            if deviceID contains "="
                dev_name = blkid -l -o device -t deviceID
            else
                dev_name = deviceID
            push get_raid_deps(dev_name) to raid-watchlist
        done
    }
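A rough shell rendering of the pseudo code above, for illustration only; the /proc/mdstat check (used instead of parsing "mdadm --query" output), the fstab/awk parsing and the flat output are assumptions, and a real implementation would also have to preserve the dependency levels rather than flattening the list:

    is_md() {
        # crude check whether a device is an md array (resolves /dev/block/X:Y links too)
        local name; name=$(basename "$(readlink -f "$1")")
        grep -q "^$name :" /proc/mdstat
    }

    get_raid_deps() {
        # print the md devices a given block device (transitively) depends on
        local dev="$1"
        if is_md "$dev"; then
            echo "$dev"
            # members of an md device may themselves be md devices
            for member in $(mdadm --detail "$dev" | awk 'NF > 1 && $NF ~ /^\/dev\// {print $NF}'); do
                get_raid_deps "$member"
            done
        fi
        # lvm/crypt devices: dmsetup deps prints the parents as "(major, minor)" pairs
        dmsetup deps "$dev" 2>/dev/null | grep -o '([0-9]*, [0-9]*)' |
            tr -d '(),' | tr ' ' ':' | while read majmin; do
                get_raid_deps "/dev/block/$majmin"
            done
    }

    get_raid_watchlist() {
        # all md devices that bootwait (non-noauto) fstab entries depend on
        awk 'NF >= 4 && $1 !~ /^#/ && $4 !~ /noauto/ {print $1}' /etc/fstab |
        while read dev_id; do
            case "$dev_id" in
                *=*) dev_name=$(blkid -l -o device -t "$dev_id") ;;   # UUID=... or LABEL=...
                *)   dev_name="$dev_id" ;;
            esac
            [ -b "$dev_name" ] && get_raid_deps "$dev_name"
        done | sort -u
    }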
UI Changes
None necessary.
Code Changes
Code changes should include an overview of what needs to change, and in some cases even the specific details.
Migration
Include:
- data migration, if any
- redirects from old URLs to new ones, if any
- how users will be pointed to the new way of doing things, if necessary.
Test/Demo Plan
Adapt: http://testcases.qa.ubuntu.com/Install/ServerRAID1
It's important that we are able to test new features, and demonstrate them to users. Use this section to describe a short plan that anybody can follow that demonstrates the feature is working. This can then be used during testing, and to show off after release. Please add an entry to http://testcases.qa.ubuntu.com/Coverage/NewFeatures for tracking test coverage.
This need not be added or completed until the specification is nearing beta.
Unresolved issues
This should highlight any issues that should be addressed in further specifications, and not problems with the specification itself; since any specification with problems cannot be approved.
BoF agenda and discussion
Use this section to take notes during the BoF; if you keep it in the approved spec, use it for summarising what was discussed and note any options that were rejected.