Storage - Distributed storage system (DSS)

Storage - Distributed storage system (DSS). This is an important topic if you work in the storage domain. Why? Because it is a core storage concept, and understanding it can help you ace the interview.

INTERVIEW QUESTION

Luminari

8/14/2024 · 4 min read

In today's data-driven world, the importance of reliable and scalable storage systems cannot be overstated. As the amount of digital data continues to grow exponentially, traditional centralized storage approaches are becoming increasingly inadequate. This is where Distributed Storage Systems (DSS) come into play. In this blog post, we'll delve into the concept of DSS, its benefits, and explore an example implementation.

What is Distributed Storage System?

A Distributed Storage System (DSS) is a type of storage architecture that distributes data across multiple nodes or devices in a network. Unlike traditional centralized storage systems, where all data is stored on a single server, a DSS splits data into smaller chunks and stores them on different machines. This approach provides several advantages, including:

1. Scalability: As the volume of data grows, it's easier to add new nodes to the system rather than upgrading individual servers.

2. Fault tolerance: If one node fails, others can continue operating without interruption.

3. Increased availability: Data is no longer confined to a single point of failure.

4. Cost-effectiveness: By distributing data across multiple nodes, you can utilize lower-cost hardware and reduce storage costs.

Key Components of a Distributed Storage System

A DSS typically consists of the following components (a minimal code sketch follows the list):

1. Storage Nodes: These are individual machines that store data chunks in a distributed manner. Each node can be a server, NAS (Network-Attached Storage), or even a cloud-based storage service.

2. Metadata Server: This component stores metadata about the stored data, such as file names, locations, and checksums.

3. Distributed File System: This is responsible for managing the distribution of data across nodes and ensuring that each node maintains accurate metadata.

4. Communication Protocol: This defines how nodes communicate with each other to manage data operations.
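To make these pieces concrete, here is a minimal Python sketch of how a metadata server might track which storage nodes hold which chunks. The class and method names are purely illustrative, not taken from any real product:

```python
from dataclasses import dataclass, field

@dataclass
class StorageNode:
    """A machine (server, NAS, or cloud endpoint) that stores raw chunks."""
    node_id: str
    address: str
    chunks: dict = field(default_factory=dict)  # chunk_id -> bytes

    def put(self, chunk_id: str, data: bytes) -> None:
        self.chunks[chunk_id] = data

    def get(self, chunk_id: str) -> bytes:
        return self.chunks[chunk_id]

@dataclass
class MetadataServer:
    """Tracks file names, chunk locations, and checksums (no user data)."""
    # file name -> ordered list of (chunk_id, [node_ids holding a replica])
    files: dict = field(default_factory=dict)

    def register(self, filename: str, placement: list) -> None:
        self.files[filename] = placement

    def locate(self, filename: str) -> list:
        return self.files[filename]
```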

Example Implementation: Ceph Distributed Storage System

To illustrate the concept, let's take a look at Ceph, an open-source DSS maintained by a broad community with major backing from Red Hat. Ceph is designed to run on commodity hardware and can scale to meet the needs of large-scale applications.

In our example, we'll set up a 3-node cluster, with each node running its own set of Ceph daemons:

Node 1 (Ceph Metadata Server)

* IP Address: 192.168.1.100

* Role: Ceph Metadata Server and Monitor

* Services:

+ ceph-mon (cluster monitor)

+ ceph-mgr (manager daemon)

+ ceph-mds (metadata server)

Node 2 (Ceph Storage Node)

* IP Address: 192.168.1.101

* Role: Ceph OSD (Object Storage Daemon) and Monitor

* Services:

+ ceph-osd

+ ceph-mon (maintains the cluster map and quorum)

+ rbd-mirror (replicates RBD block images, typically to a second cluster)

Node 3 (Ceph Storage Node)

* IP Address: 192.168.1.102

* Role: Ceph OSD and Monitor

* Services:

+ ceph-osd

+ ceph-mon

+ rbd-mirror

In this setup, the first node acts as both a metadata server and a monitor, while the other two nodes are dedicated storage nodes with their own monitors.
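Assuming the cluster above, here is a hedged Python sketch of how you might represent and sanity-check that topology on the client side. The inventory mirrors the hypothetical nodes in this example; it is not output from Ceph itself:

```python
# Hypothetical inventory mirroring the 3-node example above.
cluster = {
    "node1": {"ip": "192.168.1.100", "daemons": ["ceph-mon", "ceph-mgr", "ceph-mds"]},
    "node2": {"ip": "192.168.1.101", "daemons": ["ceph-mon", "ceph-osd", "rbd-mirror"]},
    "node3": {"ip": "192.168.1.102", "daemons": ["ceph-mon", "ceph-osd", "rbd-mirror"]},
}

def sanity_check(cluster: dict) -> None:
    mons = sum("ceph-mon" in n["daemons"] for n in cluster.values())
    osds = sum("ceph-osd" in n["daemons"] for n in cluster.values())
    # Ceph monitors form a quorum, so an odd count (3, 5, ...) is recommended.
    assert mons >= 3 and mons % 2 == 1, "need an odd number of monitors (>= 3)"
    # With default 3-way replication you generally want at least 3 OSD hosts;
    # this example has only 2, so a pool would need size=2 or another OSD node.
    print(f"{mons} monitors, {osds} OSD hosts")

sanity_check(cluster)
```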

Example Use Cases

Ceph can be used in various scenarios, including:

1. Cloud Storage: Ceph is ideal for building cloud storage systems, where scalability and reliability are paramount.

2. High-Performance Computing (HPC): Ceph's block-level storage capabilities make it suitable for HPC environments that require fast access to data.

3. Big Data Analytics: Ceph can be used as a scalable and fault-tolerant solution for storing and processing large datasets.

How is data stored across the nodes in a distributed storage system?

In a distributed storage system, data is typically split into smaller chunks called "data blocks" or "chunks," which are then stored across multiple nodes. Here's a detailed explanation of how this works:

Data Block Splitting

When a file or dataset is uploaded to the distributed storage system, it's first broken down into smaller data blocks, a process known as "chunking." The chunk size varies by implementation (HDFS, for example, defaults to 128 MB blocks, while Ceph typically stripes data into 4 MB objects), but it commonly falls between a few megabytes and a few hundred megabytes. The goal of chunking is to distribute the data evenly across multiple nodes, reducing the load on any single node.
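As a minimal illustration, here is how chunking might look in Python; the 4 MB chunk size and the content-derived chunk IDs are arbitrary choices for this sketch, not a standard:

```python
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MB; real systems make this configurable

def split_into_chunks(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Yield (chunk_id, chunk_bytes) pairs for a file's contents."""
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        # Content-derived ID; many systems use (file_id, index) instead.
        chunk_id = hashlib.sha256(chunk).hexdigest()
        yield chunk_id, chunk
```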

Chunk Metadata

Each data block (or chunk) has associated metadata, sketched in code after this list, that includes:

1. Chunk ID: A unique identifier for the chunk.

2. Data block size: The size of the chunk in bytes.

3. Checksum: A digital fingerprint of the chunk to verify its integrity.

4. Replication information: Information about how many copies of this chunk exist (more on this later).
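A small Python dataclass capturing this per-chunk metadata (the field names are illustrative):

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class ChunkMetadata:
    chunk_id: str    # unique identifier for the chunk
    size: int        # data block size in bytes
    checksum: str    # digital fingerprint used to verify integrity
    replicas: tuple  # IDs of the nodes holding a copy

def describe(chunk_id: str, data: bytes, replicas: tuple) -> ChunkMetadata:
    """Build the metadata record for one chunk."""
    return ChunkMetadata(chunk_id, len(data), hashlib.sha256(data).hexdigest(), replicas)
```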

Placement of Chunks Across Nodes

The distributed storage system uses a combination of algorithms and techniques to determine where each chunk should be stored across multiple nodes. This placement strategy is crucial for ensuring data availability, fault tolerance, and load balancing.

Here are some common approaches used to place chunks across nodes, with a hash-based sketch after the list:

1. Hash-based chunk placement: The system generates a hash value from the chunk ID (or a combination of chunk ID and other metadata). This hash value determines which node(s) will store the chunk.

2. Distributed hash tables (DHTs): The mapping from chunk IDs to nodes is itself distributed; each node is responsible for a portion of the key space, and a chunk's hashed ID determines which node(s) own its placement record.
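Here is a minimal sketch of the hash-based approach. Real systems use more sophisticated schemes (consistent hashing, or CRUSH in Ceph's case) so that adding or removing a node does not reshuffle every chunk, but the core idea is the same:

```python
import hashlib

NODES = ["A", "B", "C", "D", "E"]  # hypothetical node IDs

def place_chunk(chunk_id: str, nodes=NODES, replicas: int = 3):
    """Pick `replicas` distinct nodes for a chunk based on its hashed ID."""
    start = int(hashlib.sha256(chunk_id.encode()).hexdigest(), 16) % len(nodes)
    # Walk the node list as a ring, starting at the hashed position.
    return [nodes[(start + i) % len(nodes)] for i in range(replicas)]

print(place_chunk("Chunk_1"))  # the exact nodes depend on the hash value
```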

Data Replication

To ensure data durability and availability in the event of node failures or outages, the system replicates each chunk across multiple nodes. The number of replicas can vary depending on the specific configuration:

1. Single replica: Each chunk is stored only once.

2. Multiple replicas: Each chunk is stored several times; three copies is the most common default.

When placing replicas of a chunk, the system ensures that copies of the same data block do not land in the same failure domain (for example, on the same host, rack, or power feed). This mitigates single-point-of-failure risks and helps balance load across the cluster.
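A hedged sketch of failure-domain-aware replica placement; the rack assignments are hypothetical and stand in for whatever failure domains (host, rack, power feed) a real system is configured with:

```python
import hashlib

# Hypothetical node -> rack (failure domain) mapping.
RACKS = {"A": "rack1", "B": "rack1", "C": "rack2", "D": "rack2", "E": "rack3"}

def place_with_failure_domains(chunk_id: str, replicas: int = 3):
    """Choose nodes for a chunk, allowing at most one replica per rack."""
    nodes = sorted(RACKS)  # deterministic order: A..E
    start = int(hashlib.sha256(chunk_id.encode()).hexdigest(), 16) % len(nodes)
    chosen, used_racks = [], set()
    for i in range(len(nodes)):
        node = nodes[(start + i) % len(nodes)]
        if RACKS[node] not in used_racks:
            chosen.append(node)
            used_racks.add(RACKS[node])
        if len(chosen) == replicas:
            break
    return chosen

print(place_with_failure_domains("Chunk_1"))  # one node from each rack
```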

Data Access and Retrieval

To access a specific file or dataset in the distributed storage system (a code sketch follows these steps):

1. Metadata lookup: The system retrieves the metadata associated with the chunk(s) that comprise the requested data.

2. Chunk retrieval: Based on the metadata, the system locates the replicas of each chunk across multiple nodes.

3. Data aggregation: Once all required chunks are retrieved from their respective nodes, they're combined to recreate the original file or dataset.
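A simplified, self-contained version of this read path in Python; plain dictionaries stand in for the metadata server and storage nodes, and error handling and parallel fetches are omitted:

```python
import hashlib

def read_file(filename, metadata, nodes):
    """Reassemble a file from its chunks.

    metadata: filename -> ordered list of (chunk_id, checksum, [node_ids])
    nodes:    node_id -> {chunk_id: chunk_bytes}
    """
    pieces = []
    for chunk_id, checksum, replica_nodes in metadata[filename]:   # 1. metadata lookup
        for node_id in replica_nodes:                              # 2. chunk retrieval
            chunk = nodes[node_id].get(chunk_id)
            if chunk is not None and hashlib.sha256(chunk).hexdigest() == checksum:
                pieces.append(chunk)
                break
        else:
            raise IOError(f"no valid replica found for {chunk_id}")
    return b"".join(pieces)                                        # 3. data aggregation
```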

Example Walkthrough

Suppose we have a distributed storage system consisting of 5 nodes (A-E) and a 100 MB file that's split into 10 data blocks (chunks), each 10 MB in size. The system uses hash-based chunk placement with a 3-replica configuration:

| Chunk ID | Data block size | Replication info |
| --- | --- | --- |
| Chunk_1 | 10 MB | Nodes A, B, C |
| Chunk_2 | 10 MB | Nodes D, E, A |
| ... | ... | ... |

When a user requests access to the original file:

1. The system retrieves metadata for each chunk (Chunk_1 to Chunk_10).

2. It locates replicas of each chunk across nodes:

* Chunk_1 is on Nodes A, B, and C.

* Chunk_2 is on Nodes D, E, and A.

3. Once all chunks are retrieved from their respective nodes, the system combines them to recreate the original 100 MB file.

This example illustrates how data is split into smaller chunks, stored across multiple nodes, and made available for access through a distributed storage system.
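The walkthrough above can be condensed into a short, self-contained simulation. The node names and sizes follow the example; the placement function is the illustrative ring-style one from earlier, not a real system's algorithm:

```python
import hashlib

NODES = ["A", "B", "C", "D", "E"]
CHUNK_SIZE = 10 * 1024 * 1024  # 10 MB chunks, as in the example

def place(chunk_id: str, replicas: int = 3):
    start = int(hashlib.sha256(chunk_id.encode()).hexdigest(), 16) % len(NODES)
    return [NODES[(start + i) % len(NODES)] for i in range(replicas)]

# "Upload": split a 100 MB file into ten 10 MB chunks and place each one.
file_data = b"x" * (100 * 1024 * 1024)
nodes = {n: {} for n in NODES}   # node_id -> {chunk_id: bytes}
manifest = []                    # ordered (chunk_id, replica_nodes)
for offset in range(0, len(file_data), CHUNK_SIZE):
    chunk = file_data[offset:offset + CHUNK_SIZE]
    chunk_id = f"Chunk_{offset // CHUNK_SIZE + 1}"
    replicas = place(chunk_id)
    for node_id in replicas:
        nodes[node_id][chunk_id] = chunk
    manifest.append((chunk_id, replicas))

# "Read": fetch one replica of each chunk and reassemble the original file.
restored = b"".join(nodes[reps[0]][cid] for cid, reps in manifest)
assert restored == file_data
print(f"restored {len(manifest)} chunks, {len(restored) // (1024 * 1024)} MB total")
```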

Preparing for an interview?