03-01 Computing Models and Hadoop YARN Required

This lecture covers fundamental distributed computing models and introduces the Hadoop ecosystem with a focus on YARN for resource management.

Distributed Computing Models

1. Client-Server Model

Structure: Clients request services, servers provide them.
Examples: Web applications, database servers, file servers.
Pros: Centralized control, easier maintenance.
Cons: Single point of failure (server), scalability bottlenecks.

2. Peer-to-Peer (P2P) Systems

Structure: All nodes are equal (peers) and share resources directly.
Examples: BitTorrent, Blockchain (Bitcoin/Ethereum).
Pros: High scalability, no single point of failure.
Cons: Difficult management, security challenges.

3. Multi-Tier Architecture

Structure: Split efficient processing into layers (tiers).
- Presentation Tier (UI)
- Application Tier (Logic)
- Data Tier (Database)
Examples: Modern web applications (React + Node.js + MongoDB).

4. Thin Clients & Compute Servers

Thin Client: Lightweight device relying on a server (e.g., Chromebooks, VDI).
Compute Server: Powerful backend processing simple requests.

5. Event-Driven Architecture

Structure: Asynchronous communication based on events/messages.
Examples: IoT systems, real-time analytics, serverless functions.

Introduction to Hadoop

Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.

Key Characteristics

Scalable: From single servers to thousands of machines.
Fault-Tolerant: Handles hardware failures automatically.
Shared Nothing Architecture: Each node is independent.
Data Locality: Move computation to data, not data to computation.

Hadoop Core Components

HDFS (Hadoop Distributed File System): Storage layer.
YARN (Yet Another Resource Negotiator): Resource management layer.
MapReduce: Data processing model.
Hadoop Common: Utilities.

HDFS Architecture

HDFS is designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.

NameNode (Master): Manages metadata (file names, permissions, location of blocks).
DataNode (Worker): Stores actual data blocks.
Block Size: Default 128MB. Large blocks minimize seek time.
Replication: Default 3 replicas (for fault tolerance).

Fault Tolerance

Heartbeats: DataNodes send signals to NameNode every 3 seconds.
Re-replication: If a DateNode fails, NameNode schedules replication of its blocks to other nodes.
Rack Awareness: Places replicas on different racks to survive rack failures.

Hadoop YARN

YARN decoupled the resource management and job scheduling capabilities from the original MapReduce.

Architecture

ResourceManager (RM): Global master that arbitrates resources among all applications in the system.
NodeManager (NM): Per-machine agent responsible for containers, monitoring their resource usage (CPU, memory, disk, network) and reporting to the RM.
ApplicationMaster (AM): Per-application library that negotiates resources from the RM and works with the NM to execute and monitor the tasks.
Container: Abstract notion of resources (memory, cpu, disk, network).

How YARN Works

Client submits an application.
ResourceManager allocates a container to start the ApplicationMaster.
ApplicationMaster asks ResourceManager for more containers.
ApplicationMaster contacts NodeManagers to start tasks in allocated containers.

Schedulers

FIFO Scheduler: First In, First Out. Simple but not efficient for shared clusters.
Capacity Scheduler: Designed for multi-tenancy. Queues get a guaranteed capacity.
Fair Scheduler: Assigns resources so applications get an equal share of resources over time.

Summary

Computing Models (Client-Server, P2P) define how distributed components interact.
Hadoop revolutionized big data by combining storage (HDFS) and processing on commodity hardware.
YARN enables Hadoop to support multiple processing engines (MapReduce, Spark, Tez) by separating resource management from execution logic.

Service-Oriented Architecture and Cloud Computing