MaverickCloudStorageSpec

Differences between revisions 3 and 4

Launchpad Entry: server-maverick-cloud-datastores
Created: 2010-05-20
Contributors: ClintByrum
Packages affected: couchdb , mongodb

Summary

This spec details the steps we will take in Maverick to improve support for popular distributed and cloud-based storage technologies such as CouchDB, MongoDB, and Cassandra.

Release Note

Maverick includes CouchDB, a distributed data store, in main, and a number of other cloud-friendly data storage packages in universe/multiverse, including MongoDB, Cassandra, and Drizzle.

Rationale

To make Ubuntu Server the platform of choice in cloud environments, we need to support the workloads that users typically run in such environments. This includes cloud-oriented databases and datastores.

User stories

As a web developer building modern, distributed, highly scalable applications, I want to deploy applications using distributed data storage with a minimum of customization.

As a web developer, I want to be able to try out popular cloud/distributed data storage solutions and remove them cleanly if they do not fit my needs.

As an ops engineer supporting distributed applications in and out of the cloud, I want to deploy critical infrastructure pieces such as data storage from known distributions without having to create custom builds of complicated software.

Assumptions

Design

CouchDB

With desktop couch already in main, CouchDB is a logical choice for promotion from universe to main.

MongoDB

MongoDB is already packaged in Debian. Ensure that the merge/sync with Debian brings the most up-to-date version possible into Maverick.

Cassandra

Cassandra needs packaging for universe. Several of its dependencies are a concern, including Hadoop and Thrift.

Hadoop

Hadoop is in Debian testing and should auto-sync soon. It may also be possible to drop it as a dependency, since it is needed only for data analysis and not for normal operation.

Thrift

Thrift will need to be packaged for universe. Digg.com has already done packaging work for Thrift, available at http://about.digg.com/opensource/ops

thrift-cassandra

A source package, thrift-cassandra, will be created that builds binary packages for PHP, Python, Perl, and Java using thrift-gen and Cassandra's definition files.
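As a rough illustration of what the thrift-cassandra build step could run, the sketch below assembles one Thrift compiler invocation per target language. The IDL path `interface/cassandra.thrift`, the output directory `gen`, and the helper name `binding_commands` are assumptions made for this example and are not taken from the spec; only the `thrift` compiler's `--gen` and `-o` options are relied on.

```python
import shlex

# Assumption: location of Cassandra's Thrift definition file in its source tree.
IDL_FILE = "interface/cassandra.thrift"

# Thrift generator names for the four languages named in this spec.
GENERATORS = {"PHP": "php", "Python": "py", "Perl": "perl", "Java": "java"}


def binding_commands(idl_file=IDL_FILE, out_root="gen"):
    """Return one thrift command line (as an argv list) per target language.

    Hypothetical helper: the real package would call the compiler from
    debian/rules, not from Python.
    """
    commands = []
    for generator in GENERATORS.values():
        # -o sets the directory under which thrift writes its gen-* output.
        commands.append(["thrift", "-o", out_root, "--gen", generator, idl_file])
    return commands


if __name__ == "__main__":
    # Print the commands instead of running them, so the sketch works
    # without the thrift compiler installed.
    for argv in binding_commands():
        print(shlex.join(argv))
```

Each generated tree would then be shipped in its own binary package, one per language.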

Implementation

See the whiteboard of the server-maverick-cloud-datastores blueprint.

Test/Demo Plan

Unresolved issues

N/A

UDS session agenda and discussion

Discussion notes:

Top candidates:

CouchDB
MongoDB
Cassandra
Drizzle

Overview

Document store
- CouchDB (in main for karmic+)
- MongoDB (in universe for lucid+)
  - Debian is up to date, Ubuntu is not
Eventually‐consistent key‐value store (Dynamo implementation):
1. Cassandra (not in Ubuntu)
2. Project_Voldemort (LinkedIn) (not in ubuntu)
Tabular
- HBase (built on top of Hadoop, see server-maverick-hadoop-pig): available in maverick
- hypertable
Key/value store on disk
- redis (in universe karmic+)
- tokyo cabinet (in universe hardy+): 1.4.37 (from Debian) vs 1.4.44 (upstream)
  - maybe a bug in the watch file
- tokyotyrant
  - in Debian unstable, up to date (1.1.40)
  - network-enabled tokyocabinet storage
  - native, memcache, and HTTP REST access
  - async replication
- memcachedb (in universe jaunty+): current, but 'stable' for a long time
  - there are better ways to do things now
  - Oracle is interested in it because of Berkeley DB
Key/value store in RAM: see server-m-web20-workloads
Other NoSQL databases (from http://en.wikipedia.org/wiki/Nosql)
- Neo4j (graph db)
- Keyspace (graph db)
- NDB: up to date, part of MySQL Cluster
- RIAK
  - how fast does it move? is it worth packaging?
  - seeing adoption
SQL cloud-oriented databases
- Drizzle
  - in Debian, but moving fast
  - should be synced to maverick, keep following
  - candidate for nightly vcs

Actions

CouchDB: move couchdb server binary pkg into main => YES
MongoDB: merge/sync with Debian => YES
Cassandra: package for universe
- 8+ missing build-deps (ttx to double-check this)
  - avro
    - paranamer
      - .(maven).
  - ConcurrentLinkedHashMap
  - hadoop => support for data analysis ( http://architects.dzone.com/news/cassandra-adds-hadoop )
  - high-scale-lib
    - java.util.concurrent
    - java.util.hashtable
  - jackson
  - json-simple
  - thrift => packaged by digg
    - .(none?).
    - moving away from thrift 'soon'
      http://about.digg.com/opensource
- worst case: only go in multiverse in current form + nightly build
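The build-dependency notes above imply a packaging order, since a package can only be built once its own dependencies are in the archive. A minimal sketch of that ordering follows, assuming the nesting in the list means "depends on". The JDK classes and the parenthesised "(maven)" and "(none?)" annotations are left out on the assumption that they are not separate packages to upload, and the names `CASSANDRA_BUILD_DEPS` and `packaging_order` are hypothetical.

```python
# Build-dependency tree as recorded in the discussion notes; a nested entry is
# treated as a dependency of its parent (an assumption about the notes).
CASSANDRA_BUILD_DEPS = {
    "avro": {"paranamer": {}},
    "ConcurrentLinkedHashMap": {},
    "hadoop": {},
    "high-scale-lib": {},
    "jackson": {},
    "json-simple": {},
    "thrift": {},
}


def packaging_order(tree):
    """Post-order walk: every package is listed after its own dependencies."""
    order = []
    for name, deps in tree.items():
        for dep in packaging_order(deps):
            if dep not in order:
                order.append(dep)
        if name not in order:
            order.append(name)
    return order


if __name__ == "__main__":
    order = packaging_order(CASSANDRA_BUILD_DEPS)
    print(len(order), "packages:", ", ".join(order))
```

With the tree as written, paranamer comes before avro, and the remaining packages can be handled in any order.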

Notes

Other Databases

The following would be interesting to revisit for later cycles:

tokyotyrant
RIAK
Project_Voldemort

CategorySpec

MaverickCloudStorageSpec (last edited 2010-06-15 23:02:21 by 76-216-240-245)

- Revision 3 as of 2010-05-26 19:05:26
  Size: 5442
  Editor: 76-216-240-245
  Comment: Moving old summary to rationale, writing new summary. Moving other databases to notes section. Adding Digg.com link for Thrift
+ Revision 4 as of 2010-05-26 19:16:41
  Size: 5268
  Editor: 76-216-240-245
  Comment:
 Line 52:
-Users will still need to generate their own language bindings using [[http://wiki.apache.org/cassandra/ThriftExamples|the thrift tools and documentation on Cassandra's website]].
+===== thrift-cassandra =====

A source package, thrift-cassandra, will be created that builds binary packages for PHP, Python, Perl, and Java using thrift-gen and cassandra's definition files.
-Line 62:
+Line 64:
-=== Cassandra Thrift Language Bindings ===

Cassandra will still be difficult to use without language specific bindings already generated. It may be useful to build debian packages for these.
+N/A
