10-01 Big Data Fundamentals and Platforms
Big Data has become a cornerstone of modern computing, enabling organizations to extract insights from massive datasets that were previously impossible to process. This lecture covers the fundamentals of big data and the platforms that make big data processing possible.
What is Big Data?
Definition
Big Data refers to:
- Extremely large, complex datasets
- Data that requires new techniques to handle
- Individual data items can be small or large:
  - Simple sensor events (bytes)
  - High-quality satellite images (gigabytes)
[!NOTE] It’s not just about size
Big data is characterized by multiple dimensions beyond just volume.
The Four V’s of Big Data
1. Volume
- Very large datasets
- Massive amounts of small data items
- Examples:
- 112M+ rows of NYC taxi trip data
- 60 PB of queryable event data at Lyft
- Petabytes of satellite imagery
2. Variety
- Complex data in many different formats
- Multiple types of data and their relationships
- Examples:
- Structured (databases)
- Semi-structured (JSON, XML)
- Unstructured (text, images, video)
3. Velocity
- Speed at which data is generated
- Speed at which data moves through systems
- Examples:
- 1.4 billion events per day from sensors
- Real-time social media streams
- High-frequency trading data
4. Veracity
- Quality varies significantly
- Timeliness, accuracy, trustworthiness
- Examples:
- Sensor data with noise
- User-generated content
- Incomplete records
┌─────────────────────────────────────────┐
│ Big Data = Volume + Variety + │
│ Velocity + Veracity │
└─────────────────────────────────────────┘
Sources of Big Data
1. Social Media
Data generated by human activities:
- Meta/Facebook, TikTok, Twitter, Instagram
- User posts, interactions, preferences
- Billions of users generating content daily
2. Internet of Things (IoT) / Industry 4.0
Data from monitoring equipment and environments:
- Smart sensors and devices
- Industrial equipment monitoring
- Environmental sensors
- Example: 5M sensors generating 1.4B events/day
3. Advanced Sciences
Data from advanced instruments:
- Earth observation: Sentinel satellites, James Webb telescope
- Healthcare: Personal health data, disease information
- Genomics: DNA sequencing data
4. Business Data
- Customer data: Transactions, behavior, preferences
- Asset management: Cars, homes, equipment
- Software systems: Logs, traces, test results
Operational vs Analytical Data
Operational Data (OLTP)
Purpose: Business/system operations
- Read/Write: Frequent updates
- OLTP: Online Transaction Processing
- Examples:
- E-commerce transactions
- Banking operations
- Inventory management
┌─────────────────────────────────┐
│ Operational Database │
│ - Current state │
│ - Frequent updates │
│ - Transaction focused │
│ - Normalized schema │
└─────────────────────────────────┘
Analytical Data (OLAP)
Purpose: Understanding and optimization
- Historical/Integrated: Write once, read many
- OLAP: Online Analytical Processing
- Examples:
- Customer behavior analysis
- Sales trends
- Predictive models
┌─────────────────────────────────┐
│ Analytical Data Warehouse │
│ - Historical data │
│ - Read-optimized │
│ - Aggregated views │
│ - Denormalized schema │
└─────────────────────────────────┘
Both types matter: Big data platforms must handle both operational and analytical workloads.
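The contrast between the two workloads can be sketched in plain SQL: an OLTP system issues many small writes against a normalized schema, while an OLAP query scans and aggregates history. A minimal sketch using Python's built-in sqlite3 module (the table and column names are illustrative, not from the lecture):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Normalized, transaction-focused table (operational / OLTP side).
cur.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL, day TEXT)"
)

# OLTP workload: many small, frequent writes committed as transactions.
rows = [
    ("alice", 10.0, "2024-01-01"),
    ("bob", 25.0, "2024-01-01"),
    ("alice", 5.0, "2024-01-02"),
]
cur.executemany("INSERT INTO orders (customer, amount, day) VALUES (?, ?, ?)", rows)
conn.commit()

# OLAP workload: write-once/read-many analysis that scans and aggregates history.
cur.execute("SELECT day, SUM(amount) FROM orders GROUP BY day ORDER BY day")
daily_totals = cur.fetchall()
print(daily_totals)  # the kind of aggregated view a warehouse would precompute
```

In a real deployment the two sides live in different systems (an operational database and a data warehouse), connected by a pipeline; here a single in-memory database stands in for both.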
Why Big Data Matters
The Value of Data
Current hot topics: Large Language Models (LLMs) / Generative AI
Top-down perspective: Data economy
More data → More insights → Better decisions → Business success
Bottom-up perspective: Optimization
Understanding → Optimizing → Saving cost / Creating value
The Unreasonable Effectiveness of Data
Key Principle: With more data, the same algorithm performs much better!
Source: Halevy, Norvig, and Pereira (2009)
This principle shows that:
- Data quantity can compensate for algorithm simplicity
- More data often beats better algorithms
- Scale enables new capabilities
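The principle can be illustrated with a toy experiment (entirely illustrative, not from the paper): the same simple 1-nearest-neighbour classifier, on the same synthetic task, tends to score better when it is given more training data.

```python
import random

random.seed(42)

def sample(n):
    """Synthetic 1-D task: class 0 ~ N(0, 1), class 1 ~ N(2, 1), alternating."""
    return [(random.gauss(2 * (i % 2), 1.0), i % 2) for i in range(n)]

def nn_accuracy(train, test):
    """The same algorithm throughout: 1-nearest-neighbour classification."""
    correct = 0
    for x, y in test:
        _, pred = min(train, key=lambda t: abs(t[0] - x))
        correct += (pred == y)
    return correct / len(test)

test_set = sample(1000)
acc_small = nn_accuracy(sample(10), test_set)     # little data
acc_large = nn_accuracy(sample(5000), test_set)   # lots of data
print(f"10 examples: {acc_small:.2f}, 5000 examples: {acc_large:.2f}")
```

With a fixed random seed the run is deterministic; the qualitative point is that scale, not a cleverer algorithm, drives the improvement.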
What are Platforms?
Business Definition
From “Platform Revolution”:
“A platform is a business based on enabling value-creating interactions between external producers and consumers. The platform provides an open, participative infrastructure for these interactions and sets governance conditions for them.”
Big Data Platform Interpretation
Big data platforms are large-scale service platforms that:
- Provide on-demand computing for data-centric products
- Enable on-demand analytics services
- Offer on-demand data management
- Enable interactions between data producers and consumers
- Facilitate exchange of big data and data products
Not just: A database or data marketplace (even if big!)
Data Perspectives
Data as an Asset
Concept: Data is valuable and must be managed and exploited
- Ownership: Clear data ownership
- Governance: Policies and controls
- Protection: Security and compliance
- Exploitation: Maximize value extraction
Data as a Product
Concept: Product thinking for data processing and delivery
- User satisfaction: Data users are customers
- Quality: Data must meet user needs
- Discoverability: Easy to find and understand
- Self-service: Users can access independently
Modern approach: Combine both perspectives for effective data management
Big Data Platform Architecture
Onion Architecture
┌─────────────────────────────────────────┐
│ Big Data Services & Applications │
│ (Analytics, ML, BI, Dashboards) │
├─────────────────────────────────────────┤
│ Middleware Platforms │
│ (Building, deploying, operating │
│ reliable big data services) │
├─────────────────────────────────────────┤
│ Data-Centric Virtualized │
│ Infrastructures │
│ (Compute, Storage, Network) │
├─────────────────────────────────────────┤
│ Consumers & Producers │
│ (Sensors, Things, People, Processes) │
└─────────────────────────────────────────┘
Core Platform Services
┌─────────────────────────────────────────┐
│ Data Sources │
│ (IoT, B2B/B2C, Transport, etc.) │
└────────────┬────────────────────────────┘
│
┌────────────▼────────────────────────────┐
│ Core Services │
│ ┌──────────┐ ┌──────────┐ │
│ │ Data │ │ Data │ │
│ │ Ingest │ │ Store │ │
│ └──────────┘ └──────────┘ │
│ ┌──────────┐ ┌──────────┐ │
│ │ Data │ │ Data │ │
│ │Processing│ │ Query │ │
│ └──────────┘ └──────────┘ │
└────────────┬────────────────────────────┘
│
┌────────────▼────────────────────────────┐
│ Analytics & Applications │
│ - ML Algorithms & Pipelines │
│ - Visualization │
│ - Business Applications │
└─────────────────────────────────────────┘
Big Data Pipelines
Typical Pipeline Stages
1. Data Ingestion
- Extract data from various sources
- Transform data into usable formats
- Load data into storage (ETL)
2. Data Storage
- Raw data storage (data lake)
- Processed data storage
- Analytical data storage (data warehouse)
3. Data Processing
- Batch processing (Hadoop, Spark)
- Stream processing (Kafka, Flink)
- Machine learning (training, inference)
4. Data Analysis
- Querying (SQL, NoSQL)
- Analytics (aggregations, statistics)
- Visualization (dashboards, reports)
Example Pipeline
Data Source
    ↓
Ingestion
    ↓
Raw Data Store
    ↓
Data Processing
    ↓
Analytical Store → Visualization
    ↓
Machine Learning → Model Serving
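The stages above can be sketched as plain functions, each consuming the previous stage's output. The event records and field names are illustrative assumptions; real pipelines replace each stage with a dedicated system (Kafka, a data lake, Spark, a warehouse):

```python
import json
import statistics

# Stage 1 — Ingestion: extract semi-structured source records (here: JSON lines).
raw_events = [
    '{"sensor": "s1", "temp": 21.5}',
    '{"sensor": "s1", "temp": 22.1}',
    '{"sensor": "s2", "temp": 19.8}',
    'not-valid-json',  # real feeds contain noise (veracity)
]

def ingest(lines):
    """Parse what we can; drop malformed records."""
    for line in lines:
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue

# Stage 2 — Storage: land parsed records in a raw store
# (a list stands in for a data lake here).
raw_store = list(ingest(raw_events))

# Stage 3 — Processing: batch-transform into a per-sensor analytical view.
def process(records):
    by_sensor = {}
    for r in records:
        by_sensor.setdefault(r["sensor"], []).append(r["temp"])
    return {s: statistics.mean(v) for s, v in by_sensor.items()}

analytical_store = process(raw_store)

# Stage 4 — Analysis: query the analytical store.
print(analytical_store)
```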
Core Principles for Big Data Platforms
┌─────────────────────────────────────────┐
│ You (Big Data Platform Expert) │
│ │
│ ┌─────────────┐ ┌─────────────┐ │
│ │Programming │ │ Data Mgmt │ │
│ │Models & │ │ Models & │ │
│ │Frameworks │ │ Tools │ │
│ └─────────────┘ └─────────────┘ │
│ │
│ ┌─────────────────────────────┐ │
│ │ Large-Scale Computing │ │
│ │ Platforms (Service-Based) │ │
│ └─────────────────────────────┘ │
│ │
│ ┌─────────────────────────────┐ │
│ │ Provisioning, Automation │ │
│ │ and Analytics Processes │ │
│ └─────────────────────────────┘ │
└─────────────────────────────────────────┘
Key Focus Areas
1. Design/Development vs Operation
- Design: Architecture, data models, algorithms
- Operation: Deployment, monitoring, maintenance
2. Data-Centric vs Service-Centric vs Platform-Centric
- Data-centric: Data models, storage, governance
- Service-centric: APIs, microservices, orchestration
- Platform-centric: Infrastructure, scalability, reliability
3. SQL-Style vs Programmatic Processing
- SQL-style: Declarative queries, BI tools
- Programmatic: MapReduce, Spark, custom workflows
4. Quality and Governance
- Data quality: Accuracy, completeness, timeliness
- Governance: Policies, compliance, security
Learning Goals
As a User
- Able to use and program atop big data platforms
- Understand available services and APIs
As a Provider
- Able to operate big data platforms
- Monitor, scale, and maintain systems
As a Designer/Architect
- Able to design new solutions for big data platforms
- Make architectural decisions
As a Developer
- Able to develop services/applications in big data platforms
- Implement data pipelines and analytics
Data Governance Concerns
Key Challenges
1. Data Quality
- Accuracy: Is the data correct?
- Completeness: Is all data present?
- Timeliness: Is data current?
2. Data Lineage
- Tracking: Where did data come from?
- Transformations: How was it processed?
- Audit trail: Who accessed it?
3. Multi-Tenancy
- Isolation: Separate tenant data
- SLAs: Different service levels
- Security: Access control
4. Compliance
- GDPR: Right to be forgotten
- Privacy: Data protection
- Regulations: Industry-specific rules
Example Concerns
Data     →    Analysis  →    Result
 ↓               ↓             ↓
Where?        Price?        Quality?
Privacy?      Ethics?       Compliance?
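One common way to address the lineage questions above is to attach provenance metadata to every derived dataset. A minimal sketch (the record fields are an illustrative assumption, not a standard such as OpenLineage):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class LineageRecord:
    """Answers the three lineage questions: where from, how, and by whom."""
    dataset: str
    sources: list             # tracking: where did the data come from?
    transformation: str       # transformations: how was it processed?
    accessed_by: list = field(default_factory=list)  # audit trail: who accessed it?

    def log_access(self, user: str):
        self.accessed_by.append((user, datetime.now(timezone.utc).isoformat()))

record = LineageRecord(
    dataset="daily_sales_agg",
    sources=["raw_transactions_2024"],
    transformation="batch ETL: filter refunds, aggregate by day",
)
record.log_access("analyst@example.com")
print(record.dataset, record.sources, len(record.accessed_by))
```

In production these records would live in a metadata catalog so that quality, compliance, and audit questions can be answered without inspecting the data itself.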
Processing Models
Batch Processing
Characteristics:
- Process large volumes of data
- Not real-time (hours to days)
- High throughput
Example:
Daily customer transactions → Batch ETL → Data warehouse
Stream Processing
Characteristics:
- Process data as it arrives
- Low latency (seconds to minutes)
- Continuous processing
Example:
IoT sensor data → Stream processing → Real-time alerts
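The contrast with batch processing is in how the computation runs: instead of scanning a complete dataset, a stream processor maintains state over a window as events arrive. A minimal sliding-window sketch (the threshold and event shape are illustrative assumptions):

```python
from collections import deque

def stream_alerts(events, window=3, threshold=30.0):
    """Emit an alert whenever the mean of the last `window` readings exceeds the threshold."""
    buf = deque(maxlen=window)      # bounded window state, not the full dataset
    for sensor, temp in events:     # events are consumed one at a time, as they arrive
        buf.append(temp)
        if len(buf) == window and sum(buf) / window > threshold:
            yield (sensor, sum(buf) / window)

events = [("s1", 25.0), ("s1", 31.0), ("s1", 36.0), ("s1", 40.0)]
alerts = list(stream_alerts(events))
print(alerts)
```

Frameworks such as Flink or Kafka Streams provide the same windowed-state model, plus the fault tolerance and scaling this sketch omits.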
Real-Time Analytics
Characteristics:
- Immediate insights
- Sub-second latency
- Interactive queries
Example:
User clicks → Real-time recommendation engine
Summary
Big data platforms are essential infrastructure for modern data-driven applications:
- Big data is characterized by Volume, Variety, Velocity, and Veracity
- Platforms enable interactions between data producers and consumers
- Architectures follow layered approaches (onion model)
- Pipelines move data through ingestion, storage, processing, and analysis
- Governance ensures quality, security, and compliance
- Multiple processing models serve different use cases
Understanding big data platforms is fundamental for building scalable, data-intensive applications.