04-01 Hadoop MapReduce Required

MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster.

Motivation

The Challenge:

  • Processing large amounts of raw data (TB to PB).
  • Conceptually straightforward computations (e.g., counting, sorting).
  • Must finish in a reasonable time.

The Constraints:

  • Processor Speed: Single-core clock frequencies have stagnated, so waiting for faster CPUs no longer works.
  • Parallelism: Parallelism within a single machine hits limits (heat, power, complexity).
  • Distributed Computing: The practical way to scale is horizontal scaling (adding more commodity machines).

The Solution: Restrict the programming model to simple operations that can be automatically parallelized.

What is MapReduce?

Proposed by Google in 2004, MapReduce hides the messy details of distributed systems:

  • Parallelization
  • Fault-tolerance
  • Data distribution
  • Load balancing

The Model

A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks.

Key Functions:

  1. Map: (key_1, value_1) -> list(key_2, value_2)
    • takes an input pair and produces a set of intermediate key/value pairs.
  2. Reduce: (key_2, list(value_2)) -> list(key_3, value_3)
    • takes an intermediate key and a set of values for that key and merges them to form a smaller set of values.

Example: Word Count

Problem: Count the occurrences of each word in a large document collection.

Input

"Apple Banana Apple Cherry Banana Apple"

Map Phase

The mapper processes the input and emits a key-value pair for each word:

  • ("Apple", 1)
  • ("Banana", 1)
  • ("Apple", 1)
  • ("Cherry", 1)
  • ("Banana", 1)
  • ("Apple", 1)

Shuffle and Sort Phase

The framework groups values by key:

  • "Apple" -> [1, 1, 1]
  • "Banana" -> [1, 1]
  • "Cherry" -> [1]
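The grouping above can be sketched in a single process with a dictionary; `shuffle` here is an illustrative helper, a simplification of what the framework does across the network:

```python
from collections import defaultdict

def shuffle(pairs):
    # Group intermediate (key, value) pairs by key, then sort by key,
    # mimicking the framework's shuffle-and-sort step in one process.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

pairs = [("Apple", 1), ("Banana", 1), ("Apple", 1),
         ("Cherry", 1), ("Banana", 1), ("Apple", 1)]
print(shuffle(pairs))
# [('Apple', [1, 1, 1]), ('Banana', [1, 1]), ('Cherry', [1])]
```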

Reduce Phase

The reducer sums the values for each key:

  • "Apple" -> 3
  • "Banana" -> 2
  • "Cherry" -> 1

Python Pseudocode

# Mapper
def map_fn(key, value):
    # key: document name
    # value: document contents
    for word in value.split():
        yield (word, 1)  # emit an intermediate (word, 1) pair

# Reducer
def reduce_fn(key, values):
    # key: word
    # values: list of counts
    result = 0
    for v in values:
        result += v
    yield (key, result)  # emit the final (word, total) pair
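The three phases of the word-count example can be run end to end in a single process; `run_wordcount` is an illustrative helper, not part of any framework, and the real system distributes each phase across many machines:

```python
from collections import defaultdict

def run_wordcount(text):
    # Map phase: emit an intermediate (word, 1) pair per word.
    intermediate = [(word, 1) for word in text.split()]
    # Shuffle and sort phase: group the intermediate values by key.
    groups = defaultdict(list)
    for word, count in intermediate:
        groups[word].append(count)
    # Reduce phase: sum the grouped counts for each word.
    return [(word, sum(groups[word])) for word in sorted(groups)]

print(run_wordcount("Apple Banana Apple Cherry Banana Apple"))
# [('Apple', 3), ('Banana', 2), ('Cherry', 1)]
```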

Advanced Concepts

Combiner

  • Goal: Reduce network traffic.
  • Function: Runs on the map node (mini-reducer). Aggregates data locally before sending to the reducer.
  • Example: In Word Count, a mapper might emit ("Apple", 1) three times. A combiner sums them to ("Apple", 3) before sending over the network.
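A combiner can be sketched as a local pre-aggregation of one mapper's output; `combine` is an illustrative helper (in Hadoop the combiner uses the same interface as the reducer and may run zero or more times):

```python
from collections import defaultdict

def combine(mapper_output):
    # Locally sum counts per word on the map node before
    # anything is sent over the network to the reducers.
    local = defaultdict(int)
    for word, count in mapper_output:
        local[word] += count
    return sorted(local.items())

mapper_output = [("Apple", 1), ("Banana", 1), ("Apple", 1), ("Apple", 1)]
print(combine(mapper_output))
# [('Apple', 3), ('Banana', 1)]
```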

Data Locality

MapReduce tries to run the map task on the node where the data resides (HDFS block).

  • “Moving computation is cheaper than moving data.”

Fault Tolerance

  • Worker Failure: If a worker node fails, the master re-schedules the task on another worker. Re-execution is possible because Map and Reduce functions are assumed to be deterministic.
  • Master Failure: Rare. Traditionally a single point of failure (in MRv1), but addressed in YARN (MRv2) with High Availability.

Summary

MapReduce provides a simple yet powerful model for large-scale data processing. By forcing the computation into Map and Reduce phases, the framework can handle the complexities of distributed execution, allowing developers to focus on the logic.
