06-04 Scalable Machine Learning with MLlib (Optional)
Apache Spark MLlib is a scalable machine learning library that brings high-performance ML algorithms to distributed computing.
Why Distributed ML?
Traditional ML libraries (like scikit-learn) run on a single machine. When your dataset becomes too large to fit in memory (TB or PB scale), you need a distributed solution. MLlib is designed to scale horizontally.
Key Capabilities
MLlib covers the standard ML workflow primitives:
1. Classification & Regression
- Classification: Logistic Regression, Naive Bayes, Decision Trees, Random Forests.
- Regression: Linear Regression, Generalized Linear Regression (GLM).
2. Clustering
- K-Means: Partitioning data into K distinct clusters.
- LDA (Latent Dirichlet Allocation): Topic modeling.
- Gaussian Mixture Models.
3. Collaborative Filtering
- ALS (Alternating Least Squares): Used for recommendation systems (e.g., “Users who bought X also bought Y”).
4. Dimensionality Reduction
- PCA (Principal Component Analysis).
- SVD (Singular Value Decomposition).
ML Pipelines
Inspired by scikit-learn, MLlib provides a Pipeline API to create consistent workflows. A pipeline chains together Transformers and Estimators.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer
spark = SparkSession.builder.getOrCreate()
# 1. Prepare data
training = spark.createDataFrame([
    (0, "a b c d e spark", 1.0),
    (1, "b d", 0.0),
    (2, "spark f g h", 1.0),
    (3, "hadoop mapreduce", 0.0)
], ["id", "text", "label"])
# 2. Configure an ML pipeline
# Tokenizer: Split text into words (Transformer)
tokenizer = Tokenizer(inputCol="text", outputCol="words")
# HashingTF: Convert words to feature vectors (Transformer)
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
# LogisticRegression: The learning algorithm (Estimator)
lr = LogisticRegression(maxIter=10, regParam=0.001)
# Pipeline: Chain stages together
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
# 3. Train model
model = pipeline.fit(training)
# 4. Make predictions on new, unlabeled documents
test_data = spark.createDataFrame([
    (4, "spark i j k"),
    (5, "hadoop yarn")
], ["id", "text"])
prediction = model.transform(test_data)
RDD vs DataFrame API
- RDD API (pyspark.mllib): The original API. Now in maintenance mode.
- DataFrame API (pyspark.ml): The modern, primary API. It provides a more uniform set of APIs and leverages Spark SQL optimizations. Use this one.
Best Practices
- Data Quality: “Garbage In, Garbage Out”. Use Spark SQL to clean your data before feeding it to MLlib.
- Overfitting: Be careful of models that learn the “noise” in your training data. Use cross-validation.
- Feature Engineering: Transforming raw data into meaningful features is often more important than the choice of algorithm.
Summary
MLlib democratizes large-scale machine learning, allowing data engineers and data scientists to build complex models on massive datasets using familiar APIs.