kenju's blog

About Programming and Mathematics

論文『Dynamo: Amazon's Highly Available Key-value Store』を読んだ

AWS DynamoDB がサービスとしてローンチする前、Amazon 社内でショッピングカート機能などに利用されていた "Dynamo" の設計について書かれた論文。DynamoDB を普段使う身として、そもそもどういう課題を解決しようとしていたのか、そのためにどういった解決策をとったのか、という設計の骨子を知ることができる。

高い可用性と信頼性を担保する DynamoDB が、実は partitioning や conflict resolution, replication といったそれぞれの分野におけるベストプラクティスを着実に積み上げた、堅実なプロダクトということがわかる。

最初にローンチされて以降、Global Index/Secondary Index や TTL, DynamoDB Streams や Point-In-Time-Recovery, Auto-Scaling に On-Demand Backup など、数々の新機能を搭載し、ますます活躍の幅を広げている DynamoDB。しかし、それがもともとは、RDBMS には複雑すぎるシンプルなビジネス要件を、ただし EC サイトであることから求められる非常に高い可用性と信頼性を、マルチリージョンで担保するために、consistent hashing や Quorum models など定番とも言えるアルゴリズムを、Java で実直に実装したところがスタートだったという。熱い話だ。

Dynamo が実装している一連の技術は、論文に何度も登場する。例えば、Gossip-based membership や Sloppy Quorum については初見だったため、この論文をきっかけに学んだ単語も多い。

特に 4. SYSTEM ARCHITECTURE で紹介されている、使用技術の一覧がわかりやすい。こちらを参照されたい。

Problem Technique Advantage
Partitioning Consistent Hashing Incremental Scalability
High Availability for writes Vector clocks with reconciliation during reads Version size is decoupled from update rates
Handling temporary failures Sloppy Quorum and hinted handoff Provides high availability and durability guarantee when some of the replicas are note available
Recovering from permanent failures Anti-entropy using Merkle trees Synchronizes divergent replicas in the background
Membership and failure detection Gossip-based membership protocol and failure detection Preserves symmetry and avoids having a centralized registry for storing membership and node liveness information

Reading Notes

p205

  • Dynamo provides a simple primary-key only interface to meet the requirements of these applications
  • Dynamo uses a synthesis of well known techniques to achieve scalability and availability
    • data is partitioned and replicated using consistent hashing
    • consistency is facilitated by object versioning
    • the consistency among replicas during updates is maintained by a quorum-like technique and a decentralized replica synchronization protocol
    • employs a gossip based distributed failure detection and membership protocol

p206

  • however, a relational database is a solution that is far from ideal. Most of these services only store and retrieve data by primary key and do not require the complex querying and management functionality offered by an RDBMS

p207

  • An important design consideration is to decide when to perform the process of resolving update conflicts ... The next design choice is who performs the process of conflict resolution

p209

  • First, Dynamo is targeted mainly at applications that need an "always writeable" data store where no updates are rejected due to failures or concurrent writes
  • Dynamo's partitioning scheme relies on consistent hashing to distribute the load across multiple storage hosts

p210

  • Dynamo replicates its data on multiple hosts
  • The list of nodes that is responsible for storing a particular key is called the preference list ... so that every node in the system can determine which nodes should be in this list for any particular key
  • Dynamo provides eventual consistency, which allows for updates to be propagated to all replicas asynchronously
  • In order to provide this kind of guarantee, Dynamo treats the result of each modification as a new and immutable version of the data
  • Dynamo uses vector clocks in order to capture causality between different versions of the same object. A vector clock is effectively a list of (node, counter) pairs

p213

  • In Dynamo, each storage node has three main software components: request coordination, membership and failure detection, and a local persistence engine. All there components are implemented in Java.

p219

  • The production use of Dynamo for the past year demonstrates that decentralized techniques can be combined to provide a single highly-available system