06-03 Spark SQL and DataFrames

Spark SQL is one of the most widely used modules in Apache Spark, bridging the gap between relational data processing and functional programming. It allows you to run SQL queries on distributed data and provides the DataFrame API.

Traditional RDBMS vs. Spark SQL

The Limits of RDBMS

  • Scaling: Traditional RDBMSs (PostgreSQL, MySQL, Oracle) are designed for vertical scaling; distributing them across machines requires complex sharding.
  • Data Types: Optimized for structured data (integers, strings); they struggle with semi-structured (JSON) or unstructured data.
  • Extensibility: The query optimizer is a closed "black box" that is hard to extend or combine with custom procedural code.

The Spark SQL Approach

  • Scaling: Horizontally scalable across thousands of nodes.
  • Flexibility: Natively handles structured data as well as semi-structured formats (JSON) and self-describing columnar formats (Parquet).
  • Integration: Mix SQL queries with complex code (Java/Scala/Python) in the same application.

Core Concepts

1. DataFrames

A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a dataframe in R/Python, but with richer optimizations.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkSQLExample").getOrCreate()

# Create DataFrame from JSON
df = spark.read.json("people.json")

# Show schema
df.printSchema()
# root
#  |-- age: long (nullable = true)
#  |-- name: string (nullable = true)

# DSL Operations
df.select("name").show()
df.filter(df['age'] > 21).show()

2. Running SQL Queries

You can register a DataFrame as a temporary view and run standard SQL queries against it.

df.createOrReplaceTempView("people")

sqlDF = spark.sql("SELECT * FROM people WHERE age BETWEEN 13 AND 19")
sqlDF.show()

3. Interoperability

Spark SQL supports a wide variety of data sources:

  • Parquet/ORC: Optimized columnar storage formats.
  • JSON/CSV: Common text formats.
  • Hive: Access existing Hive warehouses.
  • JDBC/ODBC: Connect to external databases (MySQL, PostgreSQL).

The Catalyst Optimizer

The secret sauce of Spark SQL is the Catalyst Optimizer. It leverages advanced functional programming features (like pattern matching in Scala) to build an extensible query optimizer.

  1. Analysis: Resolves table and column names.
  2. Logical Optimization: Applies standard rules (predicate pushdown, constant folding, projection pruning).
  3. Physical Planning: Generates multiple physical plans and selects the most efficient one (Cost-Based Optimization).
  4. Code Generation: Generates efficient Java bytecode to execute the query (Project Tungsten).

Summary

Spark SQL provides the best of both worlds: the ease of use of SQL and the power/scalability of distributed processing. It is the foundation for modern Data Lakehouse architectures.
