05-01 Apache Spark Fundamentals
Apache Spark is a unified analytics engine for large-scale data processing. It provides high-level APIs in Java, Scala, Python, and R, and an optimized engine that supports general execution graphs.
Why Spark?
Limitations of MapReduce:
- Slow: Heavy reliance on disk I/O (reads from and writes to HDFS between every stage).
- Inefficient: Ill-suited to iterative algorithms (ML, graph processing), since each iteration re-reads its input from disk.
- Complex: Verbose, boilerplate-heavy code.
Spark Advantages:
- Speed: Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
- Ease of Use: Write applications quickly in Java, Scala, Python, R, and SQL.
- Generality: Combine SQL, streaming, and complex analytics.
Spark Architecture
Components
- Driver: The process that runs the main() function of the application and creates the SparkContext. It schedules tasks.
- Executor: A distributed agent responsible for executing tasks. It runs in a JVM on a worker node.
- Cluster Manager: External service for acquiring resources (Standalone, YARN, Mesos, Kubernetes).
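This architecture is mirrored in code by the entry point. Below is a minimal sketch, assuming a local[2] master URL (an illustrative single-JVM stand-in for a real cluster manager) and a hypothetical application name:

from pyspark import SparkConf, SparkContext

# The driver process begins here. Creating the SparkContext connects to the
# cluster manager named by the master URL; "local[2]" is an illustrative
# stand-in that runs driver and executors in one local JVM with 2 threads.
conf = SparkConf().setMaster("local[2]").setAppName("ArchitectureDemo")
sc = SparkContext(conf=conf)

print(sc.master)    # local[2]
print(sc.appName)   # ArchitectureDemo

sc.stop()  # shuts down the executors and releases cluster resources

On a real cluster, only the master URL changes (e.g., a YARN or Kubernetes endpoint); the driver/executor split stays the same.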
Execution Flow
- Driver converts user code into a DAG of stages, which are split into tasks.
- Driver connects to Cluster Manager to negotiate resources.
- Cluster Manager launches executors on worker nodes.
- Driver sends tasks to executors.
- Executors execute tasks and return results to the driver.
RDD (Resilient Distributed Dataset)
The primary data abstraction in Spark.
- Resilient: Fault-tolerant (recomputes missing partitions using lineage).
- Distributed: Data resides on multiple nodes.
- Dataset: Collection of objects.
Characteristics
- Immutable: Once created, cannot be changed.
- Lazy Evaluation: Transformations are not executed immediately.
- Cacheable: Can be persisted in memory for fast reuse.
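A short sketch of all three characteristics, assuming a local SparkContext (the data and names are illustrative):

from pyspark import SparkContext

sc = SparkContext("local", "RDD Characteristics")

nums = sc.parallelize(range(1, 6))    # source RDD; immutable once created
squares = nums.map(lambda x: x * x)   # transformation returns a NEW RDD, lazily

squares.cache()            # mark for in-memory persistence
print(squares.count())     # first action computes the RDD and fills the cache -> 5
print(squares.collect())   # reuses the cached partitions -> [1, 4, 9, 16, 25]

sc.stop()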
RDD Operations
1. Transformations (Lazy)
Create a new RDD from an existing one.
- map(func)
- filter(func)
- flatMap(func)
- groupByKey()
- reduceByKey(func)
2. Actions (Eager)
Return a value to the driver program after running a computation on the dataset.
- count()
- collect()
- take(n)
- saveAsTextFile(path)
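To make the lazy/eager split concrete, here is a minimal sketch (assuming a local SparkContext and sample data) that chains transformations from the first list and triggers them with actions from the second:

from pyspark import SparkContext

sc = SparkContext("local", "Transformations vs Actions")

rdd = sc.parallelize([1, 2, 3, 4, 5])

evens = rdd.filter(lambda x: x % 2 == 0)   # transformation: nothing runs yet
doubled = evens.map(lambda x: x * 2)       # still nothing has executed

print(doubled.collect())   # action: the whole pipeline runs now -> [4, 8]
print(doubled.count())     # action: re-runs the pipeline (not cached) -> 2

sc.stop()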
Lazy Evaluation
Spark records transformations as a DAG (Directed Acyclic Graph) but does nothing until an action is called.
- Allows Spark to optimize the execution plan (e.g., pipelining maps and filters).
- Reduces unneeded computation and data transfer (e.g., take(5) after a map processes only as many partitions as needed to return five elements).
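The recorded DAG can be inspected before any action runs. A minimal sketch, assuming a local SparkContext (in PySpark, toDebugString() returns the lineage as bytes):

from pyspark import SparkContext

sc = SparkContext("local", "Lineage Demo")

counts = (sc.parallelize(["a b", "b c"])
            .flatMap(lambda line: line.split(" "))
            .map(lambda w: (w, 1))
            .reduceByKey(lambda a, b: a + b))

# No action has run yet; this prints the recorded lineage (the DAG),
# including the shuffle boundary introduced by reduceByKey.
print(counts.toDebugString().decode("utf-8"))

sc.stop()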
Example: Word Count in PySpark
from pyspark import SparkContext
sc = SparkContext("local", "Word Count")
# 1. Load Data (Transformation)
text_file = sc.textFile("input.txt")
# 2. Transformations (Lazy)
counts = text_file.flatMap(lambda line: line.split(" ")) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda a, b: a + b)
# 3. Action (Trigger Execution)
counts.saveAsTextFile("output")
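Note that saveAsTextFile writes a directory of part files (part-00000, ...) rather than a single file, and it fails if the output directory already exists.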
Spark vs. Hadoop MapReduce
| Feature | Hadoop MapReduce | Apache Spark |
|---|---|---|
| Processing | Disk-based (writes to HDFS between stages) | In-memory (cacheable) |
| Speed | Slower | Up to 100x faster |
| Difficulty | High (Verbose Java) | Low (High-level APIs) |
| Use cases | Batch processing | Batch, Streaming, ML, Interactive |
Summary
Apache Spark is the successor to MapReduce for most modern big data workloads. Its in-memory capability and rich ecosystem (SQL, Streaming, MLlib) make it a versatile tool for data engineers and data scientists.