04-01 Hadoop MapReduce
MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster.
Motivation
The Challenge:
- Processing large amounts of raw data (TB to PB).
- Conceptually straightforward computations (e.g., counting, sorting).
- Must finish in a reasonable time.
The Constraints:
- Processor Speed: Moore’s Law is slowing down (frequency stagnation).
- Parallelism: Limits to local parallelism (heat, complexity).
- Distributed Computing: The only way to scale is horizontal scaling (adding more commodity machines).
The Solution: Restrict the programming model to simple operations that can be automatically parallelized.
What is MapReduce?
Proposed by Google in 2004, MapReduce hides the messy details of distributed systems:
- Parallelization
- Fault-tolerance
- Data distribution
- Load balancing
The Model
A MapReduce job usually splits the input dataset into independent chunks, which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks.
Key Functions:
- Map: (key_1, value_1) -> list(key_2, value_2)
  Takes an input pair and produces a set of intermediate key/value pairs.
- Reduce: (key_2, list(value_2)) -> list(key_3, value_3)
  Takes an intermediate key and the set of values for that key and merges them to form a smaller set of values.
Example: Word Count
Problem: Count the occurrences of each word in a large document collection.
Input
"Apple Banana Apple Cherry Banana Apple"
Map Phase
The mapper processes the input and emits a key-value pair for each word:
("Apple", 1), ("Banana", 1), ("Apple", 1), ("Cherry", 1), ("Banana", 1), ("Apple", 1)
Shuffle and Sort Phase
The framework groups values by key:
"Apple" -> [1, 1, 1], "Banana" -> [1, 1], "Cherry" -> [1]
Reduce Phase
The reducer sums the values for each key:
"Apple" -> 3, "Banana" -> 2, "Cherry" -> 1
Python Pseudocode
# Mapper
def map(key, value):
    # key: document name
    # value: document contents
    for word in value.split():
        emitIntermediate(word, 1)

# Reducer
def reduce(key, values):
    # key: word
    # values: list of counts
    result = 0
    for v in values:
        result += v
    emit(key, result)
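The pseudocode above can be exercised with a minimal single-process simulation of the framework. The `run_job` driver and the in-memory shuffle below are illustrative assumptions for this sketch, not part of Hadoop's API:

```python
from collections import defaultdict

def map_fn(key, value):
    # key: document name, value: document contents
    for word in value.split():
        yield (word, 1)

def reduce_fn(key, values):
    # key: word, values: list of counts
    return (key, sum(values))

def run_job(documents):
    # Map phase: apply map_fn to every input pair.
    intermediate = []
    for name, contents in documents.items():
        intermediate.extend(map_fn(name, contents))
    # Shuffle and sort: group intermediate values by key.
    groups = defaultdict(list)
    for k, v in intermediate:
        groups[k].append(v)
    # Reduce phase: merge each key's values into a final pair.
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))

counts = run_job({"doc1": "Apple Banana Apple Cherry Banana Apple"})
print(counts)  # {'Apple': 3, 'Banana': 2, 'Cherry': 1}
```

In a real cluster the map, shuffle, and reduce steps run on different machines; here they simply run one after another in the same process.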
Advanced Concepts
Combiner
- Goal: Reduce network traffic.
- Function: Runs on the map node (mini-reducer). Aggregates data locally before sending to the reducer.
- Example: In Word Count, a mapper might emit ("Apple", 1) three times. A combiner sums them to ("Apple", 3) before sending over the network.
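The savings from local aggregation can be sketched by combining one mapper's output before the shuffle. The `combine` helper below is an illustrative assumption, not Hadoop's actual Combiner interface:

```python
from collections import Counter

# One mapper's raw output for the word-count example.
mapper_output = [("Apple", 1), ("Banana", 1), ("Apple", 1),
                 ("Cherry", 1), ("Banana", 1), ("Apple", 1)]

def combine(pairs):
    # Mini-reduce on the map node: sum counts per key locally.
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return sorted(totals.items())

combined = combine(mapper_output)
print(combined)  # [('Apple', 3), ('Banana', 2), ('Cherry', 1)]
# Six pairs shrink to three before crossing the network.
```

Because addition is associative and commutative, applying the same summing logic locally and again in the reducer gives the same final counts.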
Data Locality
MapReduce tries to run the map task on the node where the data resides (HDFS block).
- “Moving computation is cheaper than moving data.”
Fault Tolerance
- Worker Failure: If a worker node fails, the master re-schedules the task on another worker. Re-execution is possible because Map and Reduce functions are assumed to be deterministic.
- Master Failure: Rare. Traditionally a single point of failure (in MRv1), but addressed in YARN (MRv2) with High Availability.
Summary
MapReduce provides a simple yet powerful model for large-scale data processing. By forcing the computation into Map and Reduce phases, the framework can handle the complexities of distributed execution, allowing developers to focus on the logic.