06-04 Scalable Machine Learning with MLlib (Optional)

Apache Spark MLlib is a scalable machine learning library that brings high-performance ML algorithms to distributed computing.

Why Distributed ML?

Traditional ML libraries (like scikit-learn) run on a single machine and require the dataset to fit in that machine's memory. When your data grows to TB or PB scale, you need a distributed solution. MLlib is designed to scale horizontally across a cluster.

Key Capabilities

MLlib covers the standard ML workflow primitives:

1. Classification & Regression

  • Classification: Logistic Regression, Naive Bayes, Decision Trees, Random Forests.
  • Regression: Linear Regression, Generalized Linear Regression (GLM).

2. Clustering

  • K-Means: Partitioning data into K distinct clusters (see the first sketch after this list).
  • LDA (Latent Dirichlet Allocation): Topic modeling.
  • Gaussian Mixture Models.

3. Collaborative Filtering

  • ALS (Alternating Least Squares): Used for recommendation systems (e.g., “Users who bought X also bought Y”); see the ALS sketch after this list.

4. Dimensionality Reduction

  • PCA (Principal Component Analysis).
  • SVD (Singular Value Decomposition).
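
As a minimal sketch of the clustering API mentioned above: the following fits K-Means on a DataFrame of numeric columns. The DataFrame df, the column names "x" and "y", and the choice k=3 are illustrative assumptions, not part of the original text.

from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

# Assemble the numeric columns "x" and "y" (hypothetical names) into a vector
assembler = VectorAssembler(inputCols=["x", "y"], outputCol="features")
features_df = assembler.transform(df)  # df: an existing DataFrame, assumed

# Fit K-Means with k=3 clusters (illustrative choice)
kmeans = KMeans(k=3, seed=42, featuresCol="features", predictionCol="cluster")
model = kmeans.fit(features_df)
clustered = model.transform(features_df)  # adds a "cluster" column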
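And a hedged sketch of ALS for recommendations: here ratings_df and its column names userId, movieId, and rating are assumptions for illustration, as are the hyperparameter values.

from pyspark.ml.recommendation import ALS

# ratings_df (assumed) holds (userId, movieId, rating) rows
als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating",
          rank=10, maxIter=10, regParam=0.1,
          coldStartStrategy="drop")  # drop predictions for unseen users/items
model = als.fit(ratings_df)
recs = model.recommendForAllUsers(5)  # top-5 item recommendations per user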

ML Pipelines

Inspired by scikit-learn, MLlib provides a Pipeline API to create consistent workflows. A pipeline chains together Transformers (which map one DataFrame to another, e.g., a tokenizer) and Estimators (which are fit on a DataFrame to produce a Transformer, e.g., a learning algorithm).

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer

# 0. Assumes an active SparkSession bound to `spark` (as in the pyspark shell)

# 1. Prepare training data: (id, text, label) rows
training = spark.createDataFrame([
    (0, "a b c d e spark", 1.0),
    (1, "b d", 0.0),
    (2, "spark f g h", 1.0),
    (3, "hadoop mapreduce", 0.0)
], ["id", "text", "label"])

# 2. Configure an ML pipeline
# Tokenizer: Split text into words (Transformer)
tokenizer = Tokenizer(inputCol="text", outputCol="words")

# HashingTF: Convert words to feature vectors (Transformer)
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")

# LogisticRegression: The learning algorithm (Estimator)
lr = LogisticRegression(maxIter=10, regParam=0.001)

# Pipeline: Chain stages together
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

# 3. Train model
model = pipeline.fit(training)

# 4. Make predictions on new, unlabeled data
test_data = spark.createDataFrame([
    (4, "spark i j k"),
    (5, "hadoop yarn")
], ["id", "text"])
prediction = model.transform(test_data)

RDD vs DataFrame API

  • RDD API (pyspark.mllib): The original API. Now in maintenance mode.
  • DataFrame API (pyspark.ml): The modern, primary API. It provides a more uniform set of APIs and leverages Spark SQL optimizations. Use this one.
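
For orientation, the two APIs live in packages whose names differ by a single letter; a quick import-level comparison:

# Modern DataFrame-based API (pyspark.ml) -- use this
from pyspark.ml.classification import LogisticRegression

# Legacy RDD-based API (pyspark.mllib) -- maintenance mode only
from pyspark.mllib.classification import LogisticRegressionWithLBFGS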

Best Practices

  1. Data Quality: “Garbage In, Garbage Out”. Use Spark SQL to clean your data before feeding it to MLlib.
  2. Overfitting: Beware of models that learn the “noise” in your training data. Use cross-validation (see the sketch after this list).
  3. Feature Engineering: Transforming raw data into meaningful features is often more important than the choice of algorithm.
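
As a minimal sketch of cross-validation, the following reuses the pipeline, lr, and training objects from the Pipeline example above; the grid values and numFolds=3 are illustrative assumptions.

from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Search a few regularization strengths for the lr stage (illustrative values)
grid = (ParamGridBuilder()
        .addGrid(lr.regParam, [0.001, 0.01, 0.1])
        .build())

cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(),  # areaUnderROC by default
                    numFolds=3)
cv_model = cv.fit(training)  # selects the best regParam by 3-fold cross-validation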

Summary

MLlib democratizes large-scale machine learning, allowing data engineers and data scientists to build complex models on massive datasets using familiar APIs.
