Tuesday, October 30, 2012

Megastore: Scalable and highly available datastore


Megastore is a library on top of BigTable that gives application developers on Google App Engine a convenient, traditional RDBMS flavor for building applications. It has been available on GAE since 2011 and is used by more than 100 production applications within Google, handling an immense volume of reads and writes daily.

The motivation behind Megastore is to satisfy three different groups - users, admins and developers. It should be highly scalable, so it is built on a NoSQL datastore, currently the most scalable approach. Admins want seamless failover in case of planned shutdowns; to achieve this the replicas have to be consistent at all times, so synchronous replication is done using Paxos. Finally, developers are satisfied by letting them code their applications in the simple and well-known RDBMS+SQL style.

It is important to note that all of the above requirements conflict with one another.

Megastore achieves this by:
  • partitioning the data and replicating each partition separately,
  • providing full ACID semantics within each partition, and
  • synchronously replicating each partition across geographically separated data centers for high availability.

Data Model & BigTable layout


The data model is declared in a schema and is strongly typed: Tables -> Entities -> Properties.
An entity's primary key is a sequence of its properties and must be unique within its table.
Tables are further classified into root tables and child tables (a child table has a foreign key referencing its root table). A root entity and its associated child-table entities form an entity group, and a schema can define several classes of entity groups.
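
A rough sketch of how such a schema might look, expressed here as Python dataclasses (the User/Photo names and fields are illustrative assumptions, not the paper's schema syntax): User acts as a root table, Photo as a child table whose foreign key user_id references the root, and a User row together with all Photo rows sharing that user_id forms one entity group.

    from dataclasses import dataclass

    @dataclass
    class User:               # root table
        user_id: int          # primary key: (user_id)
        name: str

    @dataclass
    class Photo:              # child table, belongs to root table User
        user_id: int          # foreign key into User -> entity group key
        photo_id: int         # primary key: (user_id, photo_id)
        tag: str

    def entity_group_key(entity):
        """All entities sharing this key belong to the same entity group."""
        return ("User", entity.user_id)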

Mapping to BigTable

The primary key values are concatenated to form the BigTable row key, and the properties become columns named TableName.PropertyName. There is one BigTable row per entity. Keys should be chosen carefully so that entities that are read together end up clustered together.
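
A minimal sketch of this mapping (the separator and helper names are assumptions for illustration; Megastore's real key encoding is order-preserving):

    def bigtable_row_key(primary_key_values):
        # Concatenate the primary-key property values to form the BigTable row key.
        return "/".join(str(v) for v in primary_key_values)

    def bigtable_columns(table_name, properties):
        # Each property is stored as a column named TableName.PropertyName.
        return {f"{table_name}.{name}": value for name, value in properties.items()}

    # A Photo keyed by (user_id, photo_id) lands in row "101/7", adjacent to its
    # root User row "101", so an entity group clusters into neighbouring rows.
    print(bigtable_row_key([101, 7]))
    print(bigtable_columns("Photo", {"tag": "beach", "time": 1351555200}))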

A local index is created for each entity group and is updated atomically along with the entity group's data. Global indexes, which span entity groups, are also supported but are not always consistent. An index entry is itself a BigTable row whose row key is the indexed property values concatenated with the primary key of the indexed entity.
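
Continuing the same sketch, an index entry's row key might be built like this (again an assumed encoding, for illustration only):

    def index_row_key(indexed_values, entity_primary_key):
        # Indexed property values concatenated with the indexed entity's primary
        # key, so a prefix scan on the indexed values finds all matching entities.
        return "/".join(str(v) for v in list(indexed_values) + list(entity_primary_key))

    # An index on Photo.tag: the entry for tag="beach" of Photo (101, 7).
    print(index_row_key(["beach"], [101, 7]))    # -> "beach/101/7"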

A write-ahead log is maintained for each entity group, and the group is synchronously replicated across data centers by replicating this log.

Architecture


Reads and Writes in Megastore

Read

Each replica has a coordinator server that enables local reads by keeping track of which entity groups are up to date at that replica. A read therefore starts by querying the local coordinator to check whether the local replica is up to date. It then reads the highest applied log position and its timestamp from the local replica, and with that timestamp it reads the data from BigTable.
If the local replica is not up to date, the read finds the highest committed log position from a majority of the other replicas, does a "catchup" to fetch the missing log entries and apply them to the local replica, sends the coordinator a validate message asserting that the replica is now up to date, and then performs the actual read.
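
The current-read path can be summarized in the following sketch (the object and method names are hypothetical, not Megastore's real API, and the majority query is simplified):

    def current_read(entity_group, local_replica, coordinator, other_replicas, bigtable):
        # 1. Ask the local coordinator whether the local replica is up to date.
        if not coordinator.is_up_to_date(entity_group):
            # 2. Find the highest committed log position seen elsewhere (simplified
            #    here to a max over the other replicas), catch up, and re-validate.
            highest = max(r.highest_log_position(entity_group) for r in other_replicas)
            local_replica.catchup(entity_group, up_to=highest)   # fetch and apply missing log entries
            coordinator.validate(entity_group)                   # assert the replica is now current
        # 3. Read the highest applied log position/timestamp locally, then read
        #    the data from BigTable at that timestamp.
        position, timestamp = local_replica.highest_applied(entity_group)
        return bigtable.read(entity_group, at_timestamp=timestamp)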

Megastore provides three levels of read consistency -

  • Current Read: the application reads at the timestamp of the latest committed and applied transaction.
  • Snapshot Read: reads from the last applied transaction; there may still be committed transactions that have not been applied yet.
  • Inconsistent Read: simply reads the latest values regardless of the log state, for applications that cannot tolerate any extra latency.

There is also a set of replicas called read-only replicas. They contain full snapshots of the data that may not be up to date, which is still useful for applications that can tolerate stale data; these replicas serve snapshot reads.

Write


Paxos with leaders - Paxos is traditionally used for locking, master election, and replication of metadata and configuration. Megastore instead uses Paxos for cross-data-center replication of user data to provide high availability. It also introduces the concept of a leader: the leader for a log position is the replica that was successful in the consensus for the preceding log position.

A write starts by obtaining the next unused log position and the timestamp of the last write. The current replica packages the state changes and that timestamp as the consensus value for the next log position, nominates itself as the next leader, and proposes the value to the current leader. If consensus succeeds, the change is applied at all replicas; otherwise the transaction is aborted and retried. Replicas that did not accept the write have an invalidate message sent to their coordinator so they are marked out of date.
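
A simplified sketch of this write path (all names are hypothetical and the Paxos round is collapsed into a single propose/accept step):

    class RetryTransaction(Exception):
        pass

    def write_transaction(entity_group, changes, local_replica, leader, all_replicas):
        # Package the changes and read timestamp as the consensus value for the
        # next unused log position, nominating ourselves as the next leader.
        log_position = local_replica.next_unused_log_position(entity_group)
        timestamp = local_replica.timestamp_of_last_write(entity_group)
        value = {"changes": changes, "timestamp": timestamp, "next_leader": local_replica}

        if not leader.propose(entity_group, log_position, value):   # consensus failed
            raise RetryTransaction("lost the consensus race; abort and retry")

        for replica in all_replicas:
            if replica.accept(entity_group, log_position, value):
                replica.apply(entity_group, log_position, value)
            else:
                # Replica missed the write: mark it stale via its coordinator.
                replica.coordinator.invalidate(entity_group)
        return log_position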

To help reach this quorum during writes there is a set of replicas called witness replicas. They differ from full replicas in that they do not store the actual data and do not have a coordinator; they only keep the log, so they can participate in consensus and help achieve the majority vote needed for a write to succeed.

Conclusion

This paper is highly technical and gives a lot of detail on Megastore's architecture and implementation. Using BigTable as the NoSQL datastore and providing ACID semantics on top of it is a very novel idea, and adapting Paxos to fit the requirements here is another good takeaway from this paper. The results show very good performance in production applications, and five nines of availability shows good promise for others to adopt a similar architecture.

Discussion points

  • Design decision: limiting writes to a few per second per entity group.
  • Latency vs Strong consistency requirements in Megastore. 
  • NoSQL vs. relational databases: how long can developers keep using the traditional style?
  • Will ACID semantics remain forever?

Interesting Readings:

Thursday, October 11, 2012

Availability in Globally Distributed Storage Systems



This paper is a study of cloud storage systems at Google over a period of one year. The authors developed two different models and compared the models' predictions with the statistics they observed. With a large number of production cells, they gathered enough statistics to cover various encoding schemes and other system parameters.

Availability can be broadly classified into:
Component availability: the mean time to failure (MTTF) of individual system components, including machines, disks, and racks.
Data availability: depends on node availability and the encoding scheme used. The authors modeled correlated failures to understand data availability under single-cell vs. multi-cell replication schemes.

Some terminology before going ahead:

Cell: a large collection of nodes coordinated by a higher-level program.
Disk scrubbing: an operation that reads all of the user data and check data blocks in a RAID array and relocates them if defects are found in the media.
Erasure coding: an erasure code transforms a stripe of k units into a stripe of n units so that the data can be reconstructed from a minimum subset of units within the new stripe, e.g. Reed-Solomon encoding (see the sketch after this list).
Mean Time to Failure: MTTF = uptime / number of failures.
Failure burst: a sequence of node failures that occur within a time window of w seconds of each other (w is set to 120).
Failure domain: a set of components that can potentially suffer from a common failure source; their simultaneous failures are said to be correlated.
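
To make the erasure-coding and MTTF definitions concrete, here is a small sketch (my own illustration, not from the paper) that computes stripe availability for a k-of-n code under the naive assumption of independent unit failures, which is exactly the assumption that correlated failures break:

    from math import comb

    def stripe_availability(n, k, unit_availability):
        # Probability that at least k of the n units in a stripe are available,
        # assuming independent unit failures.
        p = unit_availability
        return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

    def mttf(uptime, num_failures):
        # MTTF = uptime / number of failures, as defined above.
        return uptime / num_failures

    # Example: a Reed-Solomon-style stripe with n=13 units where any 9 suffice.
    print(stripe_availability(n=13, k=9, unit_availability=0.99))
    print(mttf(uptime=24 * 365, num_failures=3))      # hours of uptime, 3 failures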

Problem – Failures:

Node failures are classified into three categories: node restarts (software-initiated restarts), planned reboots (e.g. kernel upgrades), and unplanned reboots (crashes). The important observation is that the majority of the unavailability caused by node failures is due to planned reboots, but unplanned reboots have the longest average duration because of the recovery that must be performed after an unexpected system crash.
The authors claim there is not much related work on discovering patterns in correlated failures. In their study they found that only 3% of failure bursts of size greater than 10 had all their nodes in unique racks, which shows that the rack is a potential failure domain; network switches and power domains are other potential failure domains. They define a 'rack affinity' score to identify whether a failure burst is rack-correlated, uncorrelated, or anti-correlated, and they observe that larger failure bursts have higher rack affinity.

Solution - Replication and Recovery

Chunks can become unavailable for many reasons; with erasure-coding schemes they can be recovered as long as the minimum number of chunks required by the encoding is still available within the stripe. The recovery rate, however, is limited by the bandwidth of disks, nodes, and racks, and recovering chunks can in turn make other chunks on the same node unavailable until the recovery completes. A rack-aware placement policy avoids placing the chunks of a stripe on nodes within a single rack; spreading the stripe across multiple racks prevents correlated failures within one rack from making the data completely unavailable. The authors observed a 3x gain in stripe MTTF using this policy.
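
A toy sketch of such a rack-aware placement policy (my own illustration, not Google's actual placement code): place each chunk of a stripe on a node while allowing at most max_per_rack chunks of the same stripe in any one rack, so that a single rack failure cannot take out more of the stripe than the code can tolerate.

    import random
    from collections import defaultdict

    def place_stripe_rack_aware(chunks, free_nodes_by_rack, max_per_rack=1):
        placement = {}
        used_per_rack = defaultdict(int)
        racks = list(free_nodes_by_rack)
        for chunk in chunks:
            random.shuffle(racks)
            for rack in racks:
                if used_per_rack[rack] < max_per_rack and free_nodes_by_rack[rack]:
                    placement[chunk] = (rack, free_nodes_by_rack[rack].pop())
                    used_per_rack[rack] += 1
                    break
            else:
                raise RuntimeError("not enough racks for rack-aware placement")
        return placement

    # Example: spread 5 chunks of one stripe over 6 racks with 2 free nodes each.
    nodes = {f"rack{r}": [f"rack{r}-node{n}" for n in range(2)] for r in range(6)}
    print(place_stripe_rack_aware([f"chunk{i}" for i in range(5)], nodes))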

Models and Simulations:

First, cell simulation: they developed a trace-based simulation method to analyze hypothetical scenarios and to understand the effect of the encoding choice and the chunk recovery rate. They predicted unavailability over time for a given cell with large failure bursts, and the results were close to what was actually measured for that cell.
Second, they formulated a Markov model that captures the interaction between failures better than the cell-simulation model. This model assumes that events occur independently and at a constant rate over time. They build a Markov chain for chunk recovery whose states represent the chunk-availability condition of a stripe, with recoveries prioritized by state, and they extend the model to accommodate multi-cell replication.
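
As a rough illustration of the Markov-model idea (a heavily simplified birth-death version I am assuming here, not the paper's full model): let the state be the number of available chunks in a stripe, let each available chunk fail at rate lam, let missing chunks be recovered one at a time at rate rho, and call the stripe lost once fewer than k chunks remain. First-step analysis then gives the expected time until stripe loss.

    import numpy as np

    def stripe_mttf_markov(n, k, lam, rho):
        # States s = k..n; absorption (data loss) happens at state k-1.
        # First-step analysis: out(s)*E[s] - fail(s)*E[s-1] - rec(s)*E[s+1] = 1.
        size = n - k + 1
        A = np.zeros((size, size))
        b = np.ones(size)
        for idx, s in enumerate(range(k, n + 1)):
            fail = s * lam                    # rate of losing one more chunk
            rec = rho if s < n else 0.0       # rate of recovering one chunk
            A[idx, idx] = fail + rec
            if s - 1 >= k:
                A[idx, idx - 1] = -fail       # drop to s-1 (stripe still alive)
            if s + 1 <= n:
                A[idx, idx + 1] = -rec        # recover to s+1
        E = np.linalg.solve(A, b)
        return E[-1]                          # expected hours, starting fully available

    # Example: n=13, k=9 stripe, chunk MTTF of ~1 year, 1-day chunk recovery.
    print(stripe_mttf_markov(n=13, k=9, lam=1 / 8760.0, rho=1 / 24.0))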

Using this model:

  1. They could successfully capture the effect of failure bursts;
  2. they could distinguish between bursts that span racks and those that do not;
  3. since they have enough production cells, they also validated the model for various encodings and operating conditions;
  4. they found that reducing recovery time is effective only when there are few correlated failures;
  5. they observed that the gain achieved by increased replication diminishes in the presence of correlated failures;
  6. multi-cell replication is the best way to avoid unavailability due to correlated failures, at the cost of extra network bandwidth consumption.


Conclusion:

In this paper the authors focus mostly on the findings from the Markov model. They studied the effects of parameters such as replication, data placement, and other system settings. The results show that most data unavailability is attributable to node failures rather than to disk errors, which are what lead to permanent data loss. Since these findings come from Google's clusters, which are enormous in scale, I believe most of the observations can be used directly to guide policies in future systems.