MonitoringAndAlerting
Launchpad Entry: cloud-server-n-monitoring-alerting
Created:
Contributors:
Packages affected:
Summary
Release Note
Rationale
User stories
As a UEC admin I'm alerted by UEC when physical nodes go down or services are flaky.
As a UEC admin I can integrate my UEC deployment into my existing nagios system.
Assumptions
Design
Write the code to measure once - make it available to multiple readers.
Reader examples: collectd, munin, snmp, reconnoiter, nagios.
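A minimal sketch of the "measure once, many readers" idea, assuming a single collection step whose result is rendered into the munin plugin output format and into a nagios plugin status line. The metric names, thresholds and the `measure()` placeholder are hypothetical and are not part of this spec.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One measurement produced by the single collection step."""
    name: str    # metric identifier (hypothetical names below)
    value: int
    warn: int    # illustrative thresholds, not taken from the spec
    crit: int


def measure() -> list:
    """Placeholder probe: a real one would query the cloud; this returns demo values."""
    return [
        Sample("running_instances", 12, warn=40, crit=50),
        Sample("memory_used_mb", 2048, warn=6000, crit=7000),
    ]


def to_munin(samples) -> str:
    """Render samples as munin 'fetch' output: one '<field>.value <n>' line each."""
    return "\n".join(f"{s.name}.value {s.value}" for s in samples)


def to_nagios(samples):
    """Render samples as a nagios plugin result: (exit code, status line with perfdata)."""
    code = 0  # 0 = OK, 1 = WARNING, 2 = CRITICAL
    for s in samples:
        if s.value >= s.crit:
            code = max(code, 2)
        elif s.value >= s.warn:
            code = max(code, 1)
    label = ("OK", "WARNING", "CRITICAL")[code]
    perfdata = " ".join(f"{s.name}={s.value};{s.warn};{s.crit}" for s in samples)
    return code, f"UEC {label} | {perfdata}"


if __name__ == "__main__":
    samples = measure()          # measured once
    print(to_munin(samples))     # reader 1: munin
    print(to_nagios(samples)[1]) # reader 2: nagios
```

Each additional reader (collectd, snmp, reconnoiter) would only need another rendering function over the same samples; the measurement code itself would stay untouched.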
Implementation
Test/Demo Plan
Unresolved issues
BoF agenda and discussion
UDS Natty discussion
Follow up of https://blueprints.launchpad.net/ubuntu/+spec/server-maverick-uec-monitoring.
User stories:
As a UEC admin I'm alerted by UEC when physical nodes go down or services are flaky.
As a UEC admin I can integrate my UEC deployment into my existing nagios system.
UEC Monitoring uses a munin plugin based on the ganglia plugin packaged with Eucalyptus. If munin is installed, the plugin largely configures itself.
OpenStack has its own instrumentation that collects information on the running virtual machines and stores it.
What to measure on a UEC deployment?
Now:
* number of vms
* memory
* disk space
* cpu time used by instance
* memory consumed by instance
* disk stats by instance
All of the above are measured by a script in Eucalyptus.
Next:
* Node controller:
* number of instances running
* resources used by each instance: number of cores, disk available, memory
* generic stats: network io, disk io, power consumption
* statistics about each instance: kvm information, cpu load
* ksm
* disk io per instance
* Storage controller:
* disk io
* network io
* instance creation.
* is your cloud full?
capacity of the cloud: can more instances be spawned?
all the resources that a user can create/request.
* power utilization?
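The "is your cloud full?" question above can be sketched as a capacity calculation: given the free resources on each node controller, count how many more instances of a given size could still be spawned. This is only an illustration under assumed inputs; the `Node` and `InstanceType` structures and their fields are hypothetical, and a real probe would have to obtain these figures from the cloud itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Free resources on one node controller (hypothetical structure)."""
    free_cores: int
    free_memory_mb: int
    free_disk_gb: int


@dataclass(frozen=True)
class InstanceType:
    """Resources one instance of a given size requires (hypothetical structure)."""
    cores: int
    memory_mb: int
    disk_gb: int


def slots_on_node(node: Node, itype: InstanceType) -> int:
    """Instances of this type that fit on one node; the scarcest resource decides."""
    slots = min(
        node.free_cores // itype.cores,
        node.free_memory_mb // itype.memory_mb,
        node.free_disk_gb // itype.disk_gb,
    )
    return max(0, slots)  # guard against negative free values


def remaining_capacity(nodes, itype: InstanceType) -> int:
    """Total number of additional instances of this type the cloud could host."""
    return sum(slots_on_node(n, itype) for n in nodes)


def cloud_is_full(nodes, itype: InstanceType) -> bool:
    """True when no node can host even one more instance of this type."""
    return remaining_capacity(nodes, itype) == 0
```

For example, with one node that has 4 free cores, 8192 MB of memory and 100 GB of disk, and another with 1 core, 512 MB and 10 GB, four more instances needing 1 core, 1024 MB and 10 GB each would fit, all of them on the first node. A probe could export `remaining_capacity` as a graphable value and alert when it reaches zero.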
Which framework to write probes for?
* nagios:
- already in main.
- what to alert on?
* collectd:
- MIR status:
splitting the package into two source packages.
- supports multiple outputs.
- has a lot of dependencies. Performs well but is tightly coupled with instrumentation.
- doesn't support graphing.
* munin:
- already in main
- already used in UEC monitoring framework
* snmp:
- already in main
* ganglia:
- in universe
- writes only to rrd files.
* zenoss:
- not packaged
- pull-based system, i.e. agentless.
- upstream willing to help.
How to present the graph to sysadmin?
* munin
* collectd
* snmp
Configuration:
* unicast - point-to-point
Alerting:
* nagios may go away: there is a nagios fork, icinga (mainly changes around the web ui?). www.icinga.org
* shinken: complete rewrite in python.
* flapjack:
Actions:
* move collectd to main.
* should munin go to universe? (probably not yet)
* find a graphing solution (munin, graphite, reconnoiter (OmniTI, not packaged), visage).
ServerTeam/Specs/Natty/MonitoringAndAlerting (last edited 2010-11-05 02:38:30 by dsl-173-206-78-27)