Distributed Database Systems

Distributed database systems have become increasingly essential for modern applications because they provide high availability and horizontal scalability.

Distributed database systems are more complex to manage than centralized databases, and many fully managed services exist to take on that operational burden. Even when using such services, it is still important to understand how the underlying systems work.

CAP Theorem

The CAP theorem states that a distributed database system can provide at most two of the following three guarantees at the same time (a small illustrative sketch follows the list):

  • Consistency: All nodes in the system see the same data at the same time.

  • Availability: Every request receives a response, even if it does not reflect the most recent write.

  • Partition Tolerance: The system remains operational even if there are network partitions between nodes.
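
Because network partitions cannot be ruled out in practice, the real trade-off during a partition is between consistency and availability. The sketch below is purely illustrative; the class and its fields are hypothetical and not the API of any particular database. It shows the decision a node must make when it cannot reach its peers: reject writes to stay consistent (CP) or keep accepting them and reconcile later (AP).

```python
class PartitionAwareNode:
    """Toy node illustrating the CAP trade-off (hypothetical, for illustration only)."""

    def __init__(self, prefer_consistency: bool):
        self.prefer_consistency = prefer_consistency
        self.data = {}
        self.partitioned = False  # flipped to True when peers become unreachable

    def write(self, key, value):
        if self.partitioned and self.prefer_consistency:
            # CP choice: refuse the write rather than let replicas diverge.
            raise RuntimeError("write rejected during partition (choosing consistency)")
        # AP choice (or a healthy cluster): accept the write locally and
        # reconcile with peers later, tolerating possibly stale reads meanwhile.
        self.data[key] = value
        return "ok"
```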

Clustering

Clustering involves grouping multiple database servers to form a single logical unit. This can be done for various reasons, such as increasing availability, improving performance, or providing redundancy.

Sharding

Sharding distributes data across multiple physical servers or nodes by partitioning the dataset based on specific criteria (e.g., range, hash, or list partitioning).

Sharding enables databases to handle larger datasets and higher traffic by spreading the load across multiple machines, and it reduces the risk of a single point of failure because no one machine holds the entire dataset. However, sharding is more complex than clustering: it requires careful management of data distribution, and cross-shard queries and transactions make it harder to keep data consistent across shards.
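
As a rough sketch (the shard names and split points below are hypothetical, not taken from any specific system), a hash-partitioned router might pick a shard like this, while range partitioning would compare the key against split boundaries instead:

```python
import hashlib

SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"]  # hypothetical shard names

def shard_for(key: str) -> str:
    """Hash partitioning: spread keys evenly across the shards."""
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return SHARDS[digest % len(SHARDS)]

def shard_for_user(user_id: int) -> str:
    """Range partitioning: route by comparing the key against split points."""
    boundaries = [1_000_000, 2_000_000, 3_000_000]  # hypothetical split points
    for shard, upper in zip(SHARDS, boundaries):
        if user_id < upper:
            return shard
    return SHARDS[-1]
```

Note that simple modulo hashing forces most keys to move whenever the number of shards changes, which is one motivation for the consistent hashing technique described next.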

Consistent Hashing

Consistent hashing is a technique used to distribute data across a cluster of servers in a way that minimizes the impact of adding or removing servers. This is particularly useful for distributed systems that may experience changes in server count over time. Here's how it works:

  • Hash Ring: A circular hash ring is created, with each node assigned a unique hash value.

  • Data Distribution: Data items are hashed using the same hash function as the nodes. Each item is assigned to the first node encountered when moving clockwise around the ring from the item's position.

  • Node Addition/Removal: When a node is added or removed, only a small subset of data items needs to be redistributed. The hash ring remains relatively unchanged, allowing most data items to retain their original node assignments.

Consistent hashing reduces data movement, improves scalability, and allows the system to handle changes in the number of servers without significant performance degradation. This also enhances availability since the system can continue providing service even if some nodes fail.
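
A minimal ring might look like the following sketch. The node names and the MD5-based hash are assumptions made for the example, not any particular database's implementation:

```python
import hashlib
from bisect import bisect_right

class ConsistentHashRing:
    """Minimal hash ring: nodes and keys share one hash space; a key belongs
    to the first node found moving clockwise from the key's position."""

    def __init__(self, nodes=()):
        self._ring = []   # sorted node hash positions
        self._nodes = {}  # hash position -> node name
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def add_node(self, node: str) -> None:
        position = self._hash(node)
        self._nodes[position] = node
        self._ring.insert(bisect_right(self._ring, position), position)

    def remove_node(self, node: str) -> None:
        position = self._hash(node)
        self._ring.remove(position)
        del self._nodes[position]

    def node_for(self, key: str) -> str:
        position = self._hash(key)
        index = bisect_right(self._ring, position) % len(self._ring)  # wrap around
        return self._nodes[self._ring[index]]

ring = ConsistentHashRing(["db-a", "db-b", "db-c"])  # hypothetical node names
owner_before = ring.node_for("user:42")
ring.add_node("db-d")                 # only keys between db-d and its predecessor move
owner_after = ring.node_for("user:42")
```

Production implementations typically place each physical node at many positions on the ring ("virtual nodes") so that keys, and the load they represent, spread more evenly.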

Consistency Levels

Consistency levels in databases define how data is synchronized across multiple nodes, determining the trade-offs between data availability and consistency (a quorum-based sketch follows the list):

  • Strong Consistency: All reads return the most recent committed data.

  • Eventual Consistency: Data will eventually become consistent across all nodes, but there may be temporary inconsistencies.

  • Causal Consistency: Updates that are causally related are seen in the same order by all nodes.
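
One common way such levels are exposed in quorum-based stores is through tunable read and write quorums: with N replicas, a write acknowledged by W replicas and a read that contacts R replicas behave like strong consistency whenever R + W > N, because the read set must overlap the replicas holding the latest write; smaller quorums trade toward eventual consistency. The helper below is just a sketch of that overlap rule, not any database's API:

```python
def read_sees_latest_write(n_replicas: int, write_quorum: int, read_quorum: int) -> bool:
    """Quorum overlap rule: if R + W > N, every read set intersects the
    replicas that acknowledged the most recent successful write."""
    return read_quorum + write_quorum > n_replicas

# With N = 3 replicas: W=2, R=2 overlap -> strong reads; W=1, R=1 -> eventual.
assert read_sees_latest_write(3, 2, 2) is True
assert read_sees_latest_write(3, 1, 1) is False
```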