10-01 Big Data Fundamentals and Platforms

Big Data has become a cornerstone of modern computing, enabling organizations to extract insights from massive datasets that were previously impossible to process. This lecture covers the fundamentals of big data and the platforms that make big data processing possible.

What is Big Data?

Definition

Big Data refers to:

  • Extremely large, complex data sets
  • Data that requires new techniques to handle
  • Individual data items can be small or large
    • Simple sensor events (bytes)
    • High-quality satellite images (gigabytes)

Note: It’s not just about size

Big data is characterized by multiple dimensions beyond just volume.

The Four V’s of Big Data

1. Volume

  • Sheer size: large datasets
  • Massive amounts of small data items
  • Examples:
    • 112M+ rows of NYC taxi trip data
    • 60 PB of queryable event data at Lyft
    • Petabytes of satellite imagery

2. Variety

  • Complex, different formats
  • Multiple types of data and their relationships
  • Examples:
    • Structured (databases)
    • Semi-structured (JSON, XML)
    • Unstructured (text, images, video)

3. Velocity

  • Speed of data generation
  • Speed of data movement
  • Examples:
    • 1.4 billion events per day from sensors
    • Real-time social media streams
    • High-frequency trading data

4. Veracity

  • Quality varies significantly
  • Timeliness, accuracy, trustworthiness
  • Examples:
    • Sensor data with noise
    • User-generated content
    • Incomplete records
┌─────────────────────────────────────────┐
│  Big Data = Volume + Variety +          │
│             Velocity + Veracity         │
└─────────────────────────────────────────┘

Sources of Big Data

1. Social Media

Data generated by human activities:

  • Meta/Facebook, TikTok, Twitter, Instagram
  • User posts, interactions, preferences
  • Billions of users generating content daily

2. Internet of Things (IoT) / Industry 4.0

Data from monitoring equipment and environments:

  • Smart sensors and devices
  • Industrial equipment monitoring
  • Environmental sensors
  • Example: 5M sensors generating 1.4B events/day

3. Advanced Sciences

Data from advanced instruments:

  • Earth observation: Sentinel satellites, James Webb telescope
  • Healthcare: Personal health data, disease information
  • Genomics: DNA sequencing data

4. Business Data

  • Customer data: Transactions, behavior, preferences
  • Asset management: Cars, homes, equipment
  • Software systems: Logs, traces, test results

Operational vs Analytical Data

Operational Data (OLTP)

Purpose: Business/system operations

  • Read/Write: Frequent updates
  • OLTP: Online Transaction Processing
  • Examples:
    • E-commerce transactions
    • Banking operations
    • Inventory management
┌─────────────────────────────────┐
│  Operational Database           │
│  - Current state                │
│  - Frequent updates             │
│  - Transaction focused          │
│  - Normalized schema            │
└─────────────────────────────────┘

Analytical Data (OLAP)

Purpose: Understanding and optimization

  • Historical/Integrated: Write once, read many
  • OLAP: Online Analytical Processing
  • Examples:
    • Customer behavior analysis
    • Sales trends
    • Predictive models
┌─────────────────────────────────┐
│  Analytical Data Warehouse      │
│  - Historical data              │
│  - Read-optimized               │
│  - Aggregated views             │
│  - Denormalized schema          │
└─────────────────────────────────┘

Both types matter: Big data platforms must handle both operational and analytical workloads.
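The OLTP/OLAP split above can be made concrete with a minimal sketch using sqlite3 from the Python standard library; the table, columns, and values are invented for illustration, not taken from the lecture.

```python
# OLTP vs OLAP in miniature: frequent small writes vs. a read-heavy
# aggregation over the accumulated history. Schema and data are invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")

# OLTP-style: many small, frequent writes (one row per transaction)
for customer, amount in [("alice", 10.0), ("bob", 25.5), ("alice", 7.25)]:
    conn.execute("INSERT INTO orders (customer, amount) VALUES (?, ?)", (customer, amount))
conn.commit()

# OLAP-style: write once, read many; aggregate over history
rows = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
print(rows)
```

A real warehouse would hold the aggregated, denormalized view in a separate read-optimized store rather than querying the operational table directly.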

Why Big Data Matters

The Value of Data

Current hot topics: Large Language Models (LLMs) / Generative AI

Top-down perspective: Data economy

More data → More insights → Better decisions → Business success

Bottom-up perspective: Optimization

Understanding → Optimizing → Saving cost / Creating value

The Unreasonable Effectiveness of Data

Key Principle: With more data, the same algorithm performs much better!

Source: Halevy, Norvig, and Pereira (2009)

This principle shows that:

  • Data quality can compensate for algorithm simplicity
  • More data often beats better algorithms
  • Scale enables new capabilities
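The principle can be illustrated with a toy experiment (entirely synthetic, not the setup from Halevy et al.): a 1-nearest-neighbour classifier on a 1-D task where the true label is 1 iff x > 0.5. The algorithm never changes; only the training set grows.

```python
# Same algorithm, more data: 1-NN accuracy on a synthetic task improves
# as the training set grows. Data and thresholds are invented.
import random

def make_data(n, rng):
    xs = [rng.random() for _ in range(n)]
    return [(x, int(x > 0.5)) for x in xs]

def knn1_accuracy(train, test):
    correct = 0
    for x, y in test:
        _, pred = min(train, key=lambda t: abs(t[0] - x))  # nearest neighbour
        correct += (pred == y)
    return correct / len(test)

rng = random.Random(0)
test = make_data(2000, rng)
small = knn1_accuracy(make_data(5, rng), test)     # tiny training set
big = knn1_accuracy(make_data(5000, rng), test)    # 1000x more data
print(small, big)
```

With thousands of training points the learned boundary sits almost exactly at 0.5, so accuracy approaches 1.0 with no change to the algorithm.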

What are Platforms?

Business Definition

From “Platform Revolution”:

“A platform is a business based on enabling value-creating interactions between external producers and consumers. The platform provides an open, participative infrastructure for these interactions and sets governance conditions for them.”

Big Data Platform Interpretation

Big data platforms are large-scale service platforms that:

  • Provide on-demand computing for data-centric products
  • Enable on-demand analytics services
  • Offer on-demand data management
  • Enable interactions between data producers and consumers
  • Facilitate exchange of big data and data products

Not just: A database or data marketplace (even if big!)

Data Perspectives

Data as an Asset

Concept: Data is valuable and must be managed and exploited

  • Ownership: Clear data ownership
  • Governance: Policies and controls
  • Protection: Security and compliance
  • Exploitation: Maximize value extraction

Data as a Product

Concept: Product thinking for data processing and delivery

  • User satisfaction: Data users are customers
  • Quality: Data must meet user needs
  • Discoverability: Easy to find and understand
  • Self-service: Users can access independently

Modern approach: Combine both perspectives for effective data management

Big Data Platform Architecture

Onion Architecture

┌─────────────────────────────────────────┐
│  Big Data Services & Applications       │
│  (Analytics, ML, BI, Dashboards)        │
├─────────────────────────────────────────┤
│  Middleware Platforms                   │
│  (Building, deploying, operating        │
│   reliable big data services)           │
├─────────────────────────────────────────┤
│  Data-Centric Virtualized               │
│  Infrastructures                        │
│  (Compute, Storage, Network)            │
├─────────────────────────────────────────┤
│  Consumers & Producers                  │
│  (Sensors, Things, People, Processes)   │
└─────────────────────────────────────────┘

Core Platform Services

┌─────────────────────────────────────────┐
│  Data Sources                           │
│  (IoT, B2B/B2C, Transport, etc.)        │
└────────────┬────────────────────────────┘
             │
┌────────────▼────────────────────────────┐
│  Core Services                          │
│  ┌──────────┐  ┌──────────┐            │
│  │  Data    │  │  Data    │            │
│  │  Ingest  │  │  Store   │            │
│  └──────────┘  └──────────┘            │
│  ┌──────────┐  ┌──────────┐            │
│  │  Data    │  │  Data    │            │
│  │Processing│  │  Query   │            │
│  └──────────┘  └──────────┘            │
└────────────┬────────────────────────────┘
             │
┌────────────▼────────────────────────────┐
│  Analytics & Applications               │
│  - ML Algorithms & Pipelines            │
│  - Visualization                        │
│  - Business Applications                │
└─────────────────────────────────────────┘

Big Data Pipelines

Typical Pipeline Stages

1. Data Ingestion

  • Extract data from various sources
  • Transform data into usable formats
  • Load data into storage (ETL)

2. Data Storage

  • Raw data storage (data lake)
  • Processed data storage
  • Analytical data storage (data warehouse)

3. Data Processing

  • Batch processing (Hadoop, Spark)
  • Stream processing (Kafka, Flink)
  • Machine learning (training, inference)

4. Data Analysis

  • Querying (SQL, NoSQL)
  • Analytics (aggregations, statistics)
  • Visualization (dashboards, reports)

Example Pipeline

Data Source → Ingestion → Raw Data Store
                                ↓
                         Data Processing
                                ↓
                        Analytical Store → Visualization
                                ↓
                        Machine Learning → Model Serving
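The pipeline stages above can be sketched end to end in a few lines of stdlib Python; the sensor records, field names, and stores are invented, and each "store" is just an in-memory structure standing in for a data lake or warehouse.

```python
# Minimal pipeline sketch: ingest -> raw store -> process -> analytical
# store. Records and field names are invented for illustration.
import csv, io

raw_events = io.StringIO(
    "sensor,reading\n"
    "s1,20.5\n"
    "s2,bad\n"     # malformed record, dropped during processing
    "s1,21.5\n"
)

# 1. Ingestion: extract records from the source
raw_store = list(csv.DictReader(raw_events))

# 2. Processing: clean and transform (drop rows that fail parsing)
clean = []
for row in raw_store:
    try:
        clean.append((row["sensor"], float(row["reading"])))
    except ValueError:
        pass

# 3. Analytical store: per-sensor averages, ready for querying/visualization
grouped = {}
for sensor, value in clean:
    grouped.setdefault(sensor, []).append(value)
averages = {s: sum(v) / len(v) for s, v in grouped.items()}
print(averages)
```

In a real platform each arrow would be a distributed system (e.g. Kafka for ingestion, a data lake for raw storage, Spark for processing), but the data flow is the same.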

Core Principles for Big Data Platforms

┌─────────────────────────────────────────┐
│  You (Big Data Platform Expert)         │
│                                         │
│  ┌─────────────┐  ┌─────────────┐      │
│  │Programming  │  │ Data Mgmt   │      │
│  │Models &     │  │ Models &    │      │
│  │Frameworks   │  │ Tools       │      │
│  └─────────────┘  └─────────────┘      │
│                                         │
│  ┌─────────────────────────────┐       │
│  │ Large-Scale Computing       │       │
│  │ Platforms (Service-Based)   │       │
│  └─────────────────────────────┘       │
│                                         │
│  ┌─────────────────────────────┐       │
│  │ Provisioning, Automation    │       │
│  │ and Analytics Processes     │       │
│  └─────────────────────────────┘       │
└─────────────────────────────────────────┘

Key Focus Areas

1. Design/Development vs Operation

  • Design: Architecture, data models, algorithms
  • Operation: Deployment, monitoring, maintenance

2. Data-Centric vs Service-Centric

  • Data-centric: Data models, storage, governance
  • Service-centric: APIs, microservices, orchestration
  • Platform-centric: Infrastructure, scalability, reliability

3. SQL-Style vs Programmatic Processing

  • SQL-style: Declarative queries, BI tools
  • Programmatic: MapReduce, Spark, custom workflows
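The contrast can be shown by computing the same aggregation both ways: declaratively with SQL (via stdlib sqlite3) and programmatically with a MapReduce-flavoured fold. The sales data is invented for illustration.

```python
# SQL-style vs programmatic processing: same per-region total, two styles.
# Data is invented.
import sqlite3
from collections import Counter

sales = [("eu", 10), ("us", 20), ("eu", 5)]

# SQL-style: declare *what* you want, the engine decides how
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", sales)
sql_result = dict(conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region"))

# Programmatic (MapReduce-flavoured): spell out *how* to compute it
mapped = ((region, amount) for region, amount in sales)  # map: emit key/value pairs
reduced = Counter()
for region, amount in mapped:                            # reduce: fold by key
    reduced[region] += amount

print(sql_result, dict(reduced))
```

Both paths produce the same result; the trade-off is expressiveness and optimizer support (SQL) versus flexibility for custom workflows (programmatic).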

4. Quality and Governance

  • Data quality: Accuracy, completeness, timeliness
  • Governance: Policies, compliance, security

Learning Goals

As a User

  • Able to use and program atop big data platforms
  • Understand available services and APIs

As a Provider

  • Able to operate big data platforms
  • Monitor, scale, and maintain systems

As a Designer/Architect

  • Able to design new solutions for big data platforms
  • Make architectural decisions

As a Developer

  • Able to develop services/applications in big data platforms
  • Implement data pipelines and analytics

Data Governance Concerns

Key Challenges

1. Data Quality

  • Accuracy: Is the data correct?
  • Completeness: Is all data present?
  • Timeliness: Is data current?

2. Data Lineage

  • Tracking: Where did data come from?
  • Transformations: How was it processed?
  • Audit trail: Who accessed it?

3. Multi-Tenancy

  • Isolation: Separate tenant data
  • SLAs: Different service levels
  • Security: Access control

4. Compliance

  • GDPR: Right to be forgotten
  • Privacy: Data protection
  • Regulations: Industry-specific rules

Example Concerns

Data → Analysis → Result
  ↓       ↓         ↓
Where?  Price?  Quality?
Privacy? Ethics? Compliance?

Processing Models

Batch Processing

Characteristics:

  • Process large volumes of data
  • Not real-time (hours to days)
  • High throughput

Example:

Daily customer transactions → Batch ETL → Data warehouse

Stream Processing

Characteristics:

  • Process data as it arrives
  • Low latency (seconds to minutes)
  • Continuous processing

Example:

IoT sensor data → Stream processing → Real-time alerts

Real-Time Analytics

Characteristics:

  • Immediate insights
  • Sub-second latency
  • Interactive queries

Example:

User clicks → Real-time recommendation engine
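The batch/stream distinction can be sketched in miniature: a batch job sees the complete dataset at once, while a stream job handles one event at a time and can emit an alert immediately. The readings and the alert threshold are invented for illustration.

```python
# Batch vs stream processing in miniature. Readings and the 30.0
# threshold are invented.
readings = [18.0, 19.5, 42.0, 20.0]

# Batch: process the complete dataset at once (high throughput, not real-time)
batch_avg = sum(readings) / len(readings)

# Stream: process each event as it arrives (low latency, continuous)
alerts = []
count, total = 0, 0.0
for value in readings:        # in practice: an unbounded event source
    count += 1
    total += value
    if value > 30.0:          # real-time alert decided per event
        alerts.append(value)
running_avg = total / count

print(batch_avg, running_avg, alerts)
```

The streaming loop never needs the full dataset, which is what lets systems like Flink or Kafka Streams react within seconds while a batch job would only notice the spike hours later.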

Summary

Big data platforms are essential infrastructure for modern data-driven applications:

  • Big data is characterized by Volume, Variety, Velocity, and Veracity
  • Platforms enable interactions between data producers and consumers
  • Architectures follow layered approaches (onion model)
  • Pipelines move data through ingestion, storage, processing, and analysis
  • Governance ensures quality, security, and compliance
  • Multiple processing models serve different use cases

Understanding big data platforms is fundamental for building scalable, data-intensive applications.
