~/abin
January 25, 2025

Understanding HDFS: Hadoop Distributed File System


The Hadoop Distributed File System (HDFS) is the storage backbone of the Apache Hadoop ecosystem. It’s designed to store massive datasets reliably across clusters of commodity hardware while providing high-throughput access for data processing.

Why HDFS?

Traditional file systems weren’t built for the scale of modern data. HDFS solves this by:

  • Scaling horizontally - add more machines, get more storage
  • Handling failures gracefully - hardware failures are expected, not exceptional
  • Optimizing for large files - designed for files in GBs and TBs, not small files
  • Write-once, read-many - optimized for batch processing, not random writes

Architecture: Master-Slave Model

HDFS uses a master-slave architecture with three main components:

┌─────────────────────────────────────────────────┐
│                   HDFS Cluster                  │
├─────────────────────────────────────────────────┤
│                                                 │
│  ┌───────────────┐                              │
│  │   NameNode    │  ← Master (metadata)         │
│  │   (Master)    │                              │
│  └───────┬───────┘                              │
│          │                                      │
│          ▼                                      │
│  ┌───────────────┐  ┌───────────────┐           │
│  │   DataNode    │  │   DataNode    │  ...      │
│  │   (Slave 1)   │  │   (Slave 2)   │           │
│  │   [blocks]    │  │   [blocks]    │           │
│  └───────────────┘  └───────────────┘           │
│                                                 │
└─────────────────────────────────────────────────┘

1. NameNode (The Master)

The NameNode is the brain of HDFS. It manages:

  • File system namespace - directory tree, file names, permissions
  • Block mapping - which DataNodes store which blocks of each file
  • Client access - regulating who can read/write files

The NameNode stores metadata in two files:

  • FSImage - complete snapshot of the namespace
  • EditLog - journal of recent changes

Important: The NameNode does NOT store actual data, only metadata.
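To make this concrete, here is a toy Python sketch (illustrative only, not Hadoop code; all paths, block IDs, and node names are made up) of the two kinds of metadata the NameNode tracks:

```python
# Illustrative only -- a toy model of NameNode metadata, not Hadoop code.

# Namespace: directory tree, file names, permissions, and block lists.
namespace = {
    "/logs/app.log": {"permissions": "rw-r--r--", "blocks": ["blk_1", "blk_2"]},
}

# Block map: which DataNodes hold a replica of each block.
block_map = {
    "blk_1": ["dn1", "dn2", "dn3"],
    "blk_2": ["dn2", "dn3", "dn4"],
}

def locate(path):
    """Return, for each block of a file, the DataNodes holding a replica."""
    return [block_map[b] for b in namespace[path]["blocks"]]

locations = locate("/logs/app.log")
```

Note that nothing here contains file contents: answering "where is this file?" requires only these small structures, which is why the NameNode can hold them in memory.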

2. DataNodes (The Workers)

DataNodes are the workhorses that store actual data:

  • Store data as blocks (default 128 MB each)
  • Send heartbeats to NameNode every 3 seconds
  • Report block status periodically
  • Perform read/write operations as instructed

When a file is written, HDFS:

  1. Splits it into blocks
  2. Distributes blocks across DataNodes
  3. Replicates each block (default: 3 copies)
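Step 1 above is simple arithmetic: a rough Python sketch of fixed-size splitting (illustrative, not Hadoop code):

```python
# Split a file into fixed-size blocks -- a sketch of HDFS's step 1.
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the HDFS default

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs for each block of a file."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

# A 300 MB file becomes three blocks: 128 MB, 128 MB, and a 44 MB remainder.
sizes = [length for _, length in split_into_blocks(300 * 1024 * 1024)]
```

Only the final block may be smaller than the block size, so a file's block list is fully determined by its length.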

3. Secondary NameNode

Despite the name, this is NOT a backup NameNode. Its job is:

  • Periodically merge the EditLog into the FSImage
  • Create checkpoints to prevent the EditLog from growing too large
  • Speed up NameNode recovery after failures
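The checkpoint can be pictured as replaying the EditLog over the last FSImage to produce a fresh snapshot. A toy Python sketch (the data structures are illustrative, not Hadoop's on-disk formats):

```python
# Toy checkpoint: replay the EditLog against the last FSImage snapshot.
fsimage = {"/data": "dir", "/data/a.txt": "file"}

edit_log = [
    ("create", "/data/b.txt", "file"),
    ("delete", "/data/a.txt", None),
]

def checkpoint(fsimage, edit_log):
    """Merge logged changes into a new snapshot of the namespace."""
    merged = dict(fsimage)
    for op, path, kind in edit_log:
        if op == "create":
            merged[path] = kind
        elif op == "delete":
            merged.pop(path, None)
    return merged  # new FSImage; the old EditLog can now be truncated

new_image = checkpoint(fsimage, edit_log)
```

After a checkpoint, recovery only has to load the new FSImage plus a short EditLog, instead of replaying a huge journal.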

How Data Flows

Writing a File

Client                  NameNode               DataNodes
  │                        │                       │
  ├──── Request write ────►│                       │
  │◄──── Block locations ──┤                       │
  │                        │                       │
  ├────────── Write data ──────────────────────────►
  │                        │       (to DN1)        │
  │                        │                       │
  │                        │    DN1 ──► DN2 ──► DN3
  │                        │    (replication pipeline)

  1. Client asks NameNode for block locations
  2. NameNode returns list of DataNodes
  3. Client writes to first DataNode
  4. DataNodes replicate in a pipeline
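The pipeline in step 4 can be sketched as each DataNode storing a replica and then forwarding the block to the next node downstream (toy Python, illustrative only):

```python
# Toy model of the replication pipeline -- illustrative, not Hadoop code.
def pipeline_write(block, datanodes, stored=None):
    """Each DataNode stores a replica, then forwards the block downstream."""
    if stored is None:
        stored = {}
    if not datanodes:
        return stored
    head, *rest = datanodes
    stored[head] = block                         # this node stores a replica...
    return pipeline_write(block, rest, stored)   # ...then forwards it on

replicas = pipeline_write(b"block data", ["dn1", "dn2", "dn3"])
```

The client only ships the data once; the DataNodes themselves fan the block out, which keeps client bandwidth flat regardless of the replication factor.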

Reading a File

Client                  NameNode               DataNodes
  │                        │                       │
  ├──── Request read ─────►│                       │
  │◄──── Block locations ──┤                       │
  │                        │                       │
  ├────────── Read data ───────────────────────────►
  │◄─────────────────────────────────────────── data

  1. Client asks NameNode for block locations
  2. NameNode returns DataNodes sorted by proximity
  3. Client reads directly from nearest DataNode
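Step 2's proximity sort can be sketched with a crude distance score (same node < same rack < remote rack); the node names and numbers below are made up for illustration:

```python
# Illustrative replica list with a made-up network "distance" per replica:
# 0 = same node as the client, 2 = same rack, 4 = remote rack.
replicas = [
    {"node": "dn7", "distance": 4},   # remote rack
    {"node": "dn2", "distance": 0},   # same node as the client
    {"node": "dn3", "distance": 2},   # same rack
]

# The client reads from the nearest replica.
nearest = min(replicas, key=lambda r: r["distance"])
```

Because the data flows straight from DataNode to client, the NameNode never becomes a bandwidth bottleneck on reads.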


Fault Tolerance

HDFS assumes hardware WILL fail. It handles this through:

Replication

  • Each block is replicated (default: 3 copies)
  • Replicas placed on different racks for safety
  • NameNode re-replicates if a DataNode fails

Heartbeats

  • DataNodes send heartbeats every 3 seconds
  • No heartbeat for about 10 minutes = DataNode marked dead
  • Its blocks are re-replicated to other nodes
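A minimal sketch of this liveness check (the timestamps are illustrative):

```python
# Toy heartbeat monitor -- a DataNode silent past the timeout is marked dead.
HEARTBEAT_TIMEOUT = 10 * 60  # seconds (~10 minutes)

def dead_datanodes(last_heartbeat, now):
    """Return DataNodes whose last heartbeat is older than the timeout."""
    return sorted(dn for dn, t in last_heartbeat.items()
                  if now - t > HEARTBEAT_TIMEOUT)

last_seen = {"dn1": 1000.0, "dn2": 300.0}     # seconds since some epoch
dead = dead_datanodes(last_seen, now=1030.0)  # dn2 has been silent for 730 s
```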

Checksums

  • Data integrity verified using checksums
  • Corrupted blocks are automatically recovered from replicas
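The idea can be sketched with CRC32 (real HDFS checksums small chunks within each block, but the principle is the same; this simplified version checksums a whole block):

```python
import zlib

# Simplified sketch: checksum a whole block on write, re-check on read.
def store_block(data):
    """Store a block along with its CRC32 checksum."""
    return data, zlib.crc32(data)

def verify_block(data, checksum):
    """True if the block still matches its recorded checksum."""
    return zlib.crc32(data) == checksum

data, cksum = store_block(b"some block bytes")
ok = verify_block(data, cksum)                 # matches: block is intact
bad = verify_block(b"corrupted bytes", cksum)  # mismatch: fetch a replica
```

On a mismatch, the client reads the block from another replica and the corrupt copy is re-replicated from a good one.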

Key Design Decisions

Decision                     Rationale
Large block size (128 MB)    Reduces metadata overhead, optimizes for large files
Write-once semantics         Simplifies consistency, better for batch processing
Data locality                Move computation to data, not data to computation
Rack awareness               Place replicas on different racks for fault tolerance
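A back-of-the-envelope calculation shows why the large block size matters (the ~150 bytes of NameNode memory per block is a commonly cited rough estimate, not an exact figure):

```python
# Metadata entries needed to store 1 TB, at two different block sizes.
TB = 1024 ** 4

blocks_128mb = TB // (128 * 1024 ** 2)  # blocks at the 128 MB default
blocks_1mb = TB // (1 * 1024 ** 2)      # blocks if we used 1 MB instead

# At roughly 150 bytes of NameNode memory per block (rough estimate),
# shrinking the block size multiplies metadata proportionally.
overhead_ratio = blocks_1mb // blocks_128mb
```

The same 1 TB needs 128x more block entries at 1 MB blocks, all of which live in the NameNode's memory; this is also why many small files are a problem.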

When to Use HDFS

Good for:

  • Batch processing of large files
  • Data lakes and warehouses
  • Log aggregation and analytics
  • Machine learning training data

Not ideal for:

  • Low-latency random access
  • Many small files (much smaller than the 128 MB block size)
  • Frequent file modifications
  • Real-time streaming (use Kafka instead)

Summary

HDFS is designed for one thing: storing massive amounts of data reliably and cheaply. It achieves this through:

  1. Master-slave architecture - NameNode manages metadata, DataNodes store data
  2. Block storage - files split into large blocks, distributed across cluster
  3. Replication - every block copied 3x for fault tolerance
  4. Failure handling - automatic recovery when nodes fail

It’s not the right tool for every job, but for large-scale batch processing of big data, HDFS remains a foundational technology.


Further reading: The original Google File System paper that inspired HDFS.