04-01 Hadoop MapReduce
MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster.
Motivation
The Challenge:
- Processing large amounts of raw data (TB to PB).
- Conceptually straightforward computations (e.g., counting, sorting).
- Must finish in a reasonable time.
The Constraints:
- Processor Speed: Moore’s Law is slowing down (frequency stagnation).
- Parallelism: Limits to local parallelism (heat, complexity).
- Distributed Computing: The only way to scale is horizontal scaling (adding more commodity machines).
The Solution: Restrict the programming model to simple operations that can be automatically parallelized.
What is MapReduce?
Proposed by Google in 2004, MapReduce hides the messy details of distributed systems:
- Parallelization
- Fault-tolerance
- Data distribution
- Load balancing
The Model
A MapReduce job usually splits the input dataset into independent chunks, which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks.
Key Functions:
- Map: (key_1, value_1) -> list(key_2, value_2)
  Takes an input pair and produces a set of intermediate key/value pairs.
- Reduce: (key_2, list(value_2)) -> list(key_3, value_3)
  Takes an intermediate key and the set of values for that key and merges them to form a smaller set of values.
Example: Word Count
Problem: Count the occurrences of each word in a large document collection.
Input
"Apple Banana Apple Cherry Banana Apple"
Map Phase
The mapper processes the input and emits a key-value pair for each word:
("Apple", 1), ("Banana", 1), ("Apple", 1), ("Cherry", 1), ("Banana", 1), ("Apple", 1)
Shuffle and Sort Phase
The framework groups values by key:
"Apple" -> [1, 1, 1], "Banana" -> [1, 1], "Cherry" -> [1]
Reduce Phase
The reducer sums the values for each key:
"Apple" -> 3, "Banana" -> 2, "Cherry" -> 1
Python Pseudocode
# Mapper
def map(key, value):
    # key: document name
    # value: document contents
    for word in value.split():
        emitIntermediate(word, 1)

# Reducer
def reduce(key, values):
    # key: word
    # values: list of counts
    result = 0
    for v in values:
        result += v
    emit(key, result)
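The pseudocode above can be exercised with a minimal single-process simulation of the framework. The `run_job` driver and the in-memory shuffle below are illustrative assumptions for this sketch, not part of Hadoop's API:

```python
from collections import defaultdict

def map_fn(key, value):
    # key: document name, value: document contents
    for word in value.split():
        yield (word, 1)

def reduce_fn(key, values):
    # key: word, values: list of counts
    return (key, sum(values))

def run_job(documents):
    # Map phase: apply map_fn to every input pair.
    intermediate = []
    for name, contents in documents.items():
        intermediate.extend(map_fn(name, contents))
    # Shuffle and sort: group intermediate values by key.
    groups = defaultdict(list)
    for k, v in intermediate:
        groups[k].append(v)
    # Reduce phase: merge each key's values into a final pair.
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))

counts = run_job({"doc1": "Apple Banana Apple Cherry Banana Apple"})
print(counts)  # {'Apple': 3, 'Banana': 2, 'Cherry': 1}
```

In a real cluster the map, shuffle, and reduce steps run on different machines; here they simply run one after another in the same process.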
Advanced Concepts
Combiner
- Goal: Reduce network traffic.
- Function: Runs on the map node (mini-reducer). Aggregates data locally before sending to the reducer.
- Example: In Word Count, a mapper might emit ("Apple", 1) three times. A combiner sums them to ("Apple", 3) before sending over the network.
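The savings from local aggregation can be sketched by combining one mapper's output before the shuffle. The `combine` helper below is an illustrative assumption, not Hadoop's actual Combiner interface:

```python
from collections import Counter

# One mapper's raw output for the word-count example.
mapper_output = [("Apple", 1), ("Banana", 1), ("Apple", 1),
                 ("Cherry", 1), ("Banana", 1), ("Apple", 1)]

def combine(pairs):
    # Mini-reduce on the map node: sum counts per key locally.
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return sorted(totals.items())

combined = combine(mapper_output)
print(combined)  # [('Apple', 3), ('Banana', 2), ('Cherry', 1)]
# Six pairs shrink to three before crossing the network.
```

Because addition is associative and commutative, applying the same summing logic locally and again in the reducer gives the same final counts.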
Data Locality
MapReduce tries to run the map task on the node where the data resides (HDFS block).
- “Moving computation is cheaper than moving data.”
Fault Tolerance
- Worker Failure: If a worker node fails, the master re-schedules the task on another worker. Re-execution is possible because Map and Reduce functions are assumed to be deterministic.
- Master Failure: Rare. Traditionally a single point of failure (in MRv1), but addressed in YARN (MRv2) with High Availability.
Summary
MapReduce provides a simple yet powerful model for large-scale data processing. By forcing the computation into Map and Reduce phases, the framework can handle the complexities of distributed execution, allowing developers to focus on the logic.