10. Data Sourcing and Cleaning

Big Data has become a cornerstone of modern computing, enabling organizations to extract insights from massive datasets that were previously impossible to process. This lecture covers the fundamentals of big data and the platforms that make big data processing possible.

What is Big Data?

Definition

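Big Data refers to datasets whose volume, variety, velocity, or veracity demands go beyond what traditional data processing tools can handle.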

Note: It’s not just about size. Big data is characterized by multiple dimensions beyond just volume.

The Four V’s of Big Data

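1. Volume: the sheer amount of data to store and process

2. Variety: the range of data types and formats (structured, semi-structured, unstructured)

3. Velocity: the speed at which data is generated and must be processed

4. Veracity: the trustworthiness and quality of the data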

┌─────────────────────────────────────────┐
│  Big Data = Volume + Variety +          │
│             Velocity + Veracity         │
└─────────────────────────────────────────┘

Sources of Big Data

1. Social Media

Data generated by human activities.

2. Internet of Things (IoT) / Industry 4.0

Data from monitoring equipment and environments.

3. Advanced Sciences

Data from advanced scientific instruments.

4. Business Data

Operational vs Analytical Data

Operational Data (OLTP)

Purpose: Business/system operations

┌─────────────────────────────────┐
│  Operational Database           │
│  - Current state                │
│  - Frequent updates             │
│  - Transaction focused          │
│  - Normalized schema            │
└─────────────────────────────────┘

Analytical Data (OLAP)

Purpose: Understanding and optimization

┌─────────────────────────────────┐
│  Analytical Data Warehouse      │
│  - Historical data              │
│  - Read-optimized               │
│  - Aggregated views             │
│  - Denormalized schema          │
└─────────────────────────────────┘

Both types matter: Big data platforms must handle both operational and analytical workloads.
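The contrast between the two workloads can be sketched with Python's built-in `sqlite3` module. This is only an illustration under stated assumptions: the `customers` and `orders` tables, their columns, and the helper names are hypothetical, and a real platform would normally keep operational and analytical data in separate systems rather than in one in-memory database.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a small, normalized schema (hypothetical)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT NOT NULL);
        CREATE TABLE orders (
            id INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL REFERENCES customers(id),
            amount REAL NOT NULL,
            status TEXT NOT NULL
        );
        """
    )
    conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "EU"), (2, "US")])
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?, ?)",
        [(1, 1, 20.0, "paid"), (2, 1, 30.0, "open"), (3, 2, 50.0, "paid")],
    )
    conn.commit()
    return conn


def mark_order_paid(conn: sqlite3.Connection, order_id: int) -> None:
    """Operational (OLTP) style: a short transaction that updates a single row."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("UPDATE orders SET status = 'paid' WHERE id = ?", (order_id,))


def revenue_by_region(conn: sqlite3.Connection) -> dict:
    """Analytical (OLAP) style: a read-only aggregate that scans many rows."""
    rows = conn.execute(
        """
        SELECT c.region, SUM(o.amount)
        FROM orders AS o JOIN customers AS c ON c.id = o.customer_id
        WHERE o.status = 'paid'
        GROUP BY c.region
        ORDER BY c.region
        """
    ).fetchall()
    return dict(rows)
```

The operational function touches one row and must be fast and safe under concurrent updates, while the analytical function reads across the whole table and benefits from read-optimized, denormalized storage.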

Why Big Data Matters

The Value of Data

Current hot topics: Large Language Models (LLMs) / Generative AI

Top-down perspective: Data economy

More data → More insights → Better decisions → Business success

Bottom-up perspective: Optimization

Understanding → Optimizing → Saving cost / Creating value

The Unreasonable Effectiveness of Data

Key Principle: Given more training data, the same algorithm often performs much better.

Source: Halevy, Norvig, and Pereira (2009)

This principle shows that collecting more data can matter as much as refining the algorithm.

What are Platforms?

Business Definition

From “Platform Revolution”:

“A platform is a business based on enabling value-creating interactions between external producers and consumers. The platform provides an open, participative infrastructure for these interactions and sets governance conditions for them.”

Big Data Platform Interpretation

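Big data platforms are large-scale service platforms that enable value-creating interactions around data between producers and consumers.

A big data platform is not just a database or a data marketplace (even if big!).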

Data Perspectives

Data as an Asset

Concept: Data is valuable and must be managed and exploited

Data as a Product

Concept: Product thinking for data processing and delivery

Modern approach: Combine both perspectives for effective data management

Big Data Platform Architecture

Onion Architecture

┌─────────────────────────────────────────┐
│  Big Data Services & Applications       │
│  (Analytics, ML, BI, Dashboards)        │
├─────────────────────────────────────────┤
│  Middleware Platforms                   │
│  (Building, deploying, operating        │
│   reliable big data services)           │
├─────────────────────────────────────────┤
│  Data-Centric Virtualized               │
│  Infrastructures                        │
│  (Compute, Storage, Network)            │
├─────────────────────────────────────────┤
│  Consumers & Producers                  │
│  (Sensors, Things, People, Processes)   │
└─────────────────────────────────────────┘

Core Platform Services

┌─────────────────────────────────────────┐
│  Data Sources                           │
│  (IoT, B2B/B2C, Transport, etc.)        │
└────────────┬────────────────────────────┘
             │
┌────────────▼────────────────────────────┐
│  Core Services                          │
│  ┌──────────┐  ┌──────────┐             │
│  │  Data    │  │  Data    │             │
│  │  Ingest  │  │  Store   │             │
│  └──────────┘  └──────────┘             │
│  ┌──────────┐  ┌──────────┐             │
│  │  Data    │  │  Data    │             │
│  │Processing│  │  Query   │             │
│  └──────────┘  └──────────┘             │
└────────────┬────────────────────────────┘
             │
┌────────────▼────────────────────────────┐
│  Analytics & Applications               │
│  - ML Algorithms & Pipelines            │
│  - Visualization                        │
│  - Business Applications                │
└─────────────────────────────────────────┘

Big Data Pipelines

Typical Pipeline Stages

1. Data Ingestion

2. Data Storage

3. Data Processing

4. Data Analysis

Example Pipeline

Data Source → Ingestion → Raw Data Store
                ↓
         Data Processing
                ↓
         Analytical Store → Visualization
                ↓
         Machine Learning → Model Serving
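The first stages of the pipeline above can be sketched in plain Python. This is a minimal, in-memory illustration, not a real platform: the JSON-lines input format, the `sensor` and `temp_c` field names, and all function names are assumptions made for the example.

```python
import json
from collections import defaultdict


def ingest(lines):
    """Ingestion: parse raw JSON lines into records, skipping malformed input."""
    raw_store = []
    for line in lines:
        try:
            raw_store.append(json.loads(line))
        except json.JSONDecodeError:
            continue  # a real platform would route these to an error store
    return raw_store


def process(raw_store):
    """Processing: keep only records that carry the fields we need."""
    return [
        {"sensor": r["sensor"], "temp_c": float(r["temp_c"])}
        for r in raw_store
        if isinstance(r, dict)
        and "sensor" in r
        and isinstance(r.get("temp_c"), (int, float))
    ]


def to_analytical_store(cleaned):
    """Analytical store: aggregate to the average temperature per sensor."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for r in cleaned:
        sums[r["sensor"]] += r["temp_c"]
        counts[r["sensor"]] += 1
    return {sensor: sums[sensor] / counts[sensor] for sensor in sums}


def run_pipeline(lines):
    """Chain the stages: source -> ingestion -> processing -> analytical store."""
    return to_analytical_store(process(ingest(lines)))
```

Keeping each stage as a separate function mirrors the diagram: the raw store stays untouched, so processing can be rerun with different rules without ingesting the data again.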

Core Principles for Big Data Platforms

┌─────────────────────────────────────────┐
│  You (Big Data Platform Expert)         │
│                                         │
│  ┌─────────────┐  ┌─────────────┐       │
│  │Programming  │  │ Data Mgmt   │       │
│  │Models &     │  │ Models &    │       │
│  │Frameworks   │  │ Tools       │       │
│  └─────────────┘  └─────────────┘       │
│                                         │
│  ┌─────────────────────────────┐        │
│  │ Large-Scale Computing       │        │
│  │ Platforms (Service-Based)   │        │
│  └─────────────────────────────┘        │
│                                         │
│  ┌─────────────────────────────┐        │
│  │ Provisioning, Automation    │        │
│  │ and Analytics Processes     │        │
│  └─────────────────────────────┘        │
└─────────────────────────────────────────┘

Key Focus Areas

1. Design/Development vs Operation

2. Data-Centric vs Service-Centric

3. SQL-Style vs Programmatic Processing

4. Quality and Governance

Learning Goals

As a User

As a Provider

As a Designer/Architect

As a Developer

Data Governance Concerns

Key Challenges

1. Data Quality

2. Data Lineage

3. Multi-Tenancy

4. Compliance

Example Concerns

Data      →  Analysis  →  Result
 ↓             ↓            ↓
Where?       Price?       Quality?
Privacy?     Ethics?      Compliance?
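Two of the concerns above, data quality and data lineage, can be made concrete with a small check. The sketch below is hypothetical throughout: the `QualityReport` class, the `check_quality` function, and the idea of recording lineage as a list of strings are illustrative choices, not a standard API.

```python
from dataclasses import dataclass, field


@dataclass
class QualityReport:
    """Counts from one quality check plus a simple lineage trail."""
    total: int = 0
    missing: int = 0
    out_of_range: int = 0
    lineage: list = field(default_factory=list)

    @property
    def valid_ratio(self) -> float:
        """Share of records that passed the check (1.0 for an empty input)."""
        if self.total == 0:
            return 1.0
        return (self.total - self.missing - self.out_of_range) / self.total


def check_quality(records, field_name, low, high, source):
    """Count missing and out-of-range values, noting where the data came from."""
    report = QualityReport(
        lineage=[f"source:{source}", f"check:{field_name} in [{low}, {high}]"]
    )
    for rec in records:
        report.total += 1
        value = rec.get(field_name)
        if value is None:
            report.missing += 1
        elif not (low <= value <= high):
            report.out_of_range += 1
    return report
```

Attaching the lineage trail to the report lets a consumer of the result answer "where did this come from, and which checks did it pass?" without going back to the producer.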

Processing Models

Batch Processing

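Characteristics: processes large, bounded datasets on a schedule; latency is high (minutes to hours), but so is throughput.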

Example:

Daily customer transactions → Batch ETL → Data warehouse
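A batch ETL job of this kind might look like the following sketch. It assumes transactions are dictionaries with hypothetical `date`, `customer`, and `amount` keys, and it models the data warehouse as a plain dictionary keyed by `(day, customer)`; a real job would read from and write to external systems.

```python
from collections import defaultdict


def extract(transactions, day):
    """Extract: select the bounded set of transactions for one day."""
    return [t for t in transactions if t["date"] == day]


def transform(batch):
    """Transform: total the amounts per customer."""
    totals = defaultdict(float)
    for t in batch:
        totals[t["customer"]] += t["amount"]
    return dict(totals)


def load(warehouse, day, totals):
    """Load: overwrite the day's rows, so rerunning the job is idempotent."""
    for customer, total in totals.items():
        warehouse[(day, customer)] = total


def run_daily_batch(transactions, warehouse, day):
    """Run the full extract-transform-load job for a single day."""
    load(warehouse, day, transform(extract(transactions, day)))
```

Writing the load step as an overwrite keyed by day is a deliberate choice: if a scheduled run fails halfway or is triggered twice, running it again leaves the warehouse in the same state.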

Stream Processing

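Characteristics: processes unbounded data continuously as events arrive, record by record or in small windows, with low latency.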

Example:

IoT sensor data → Stream processing → Real-time alerts
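The sensor example can be sketched with a Python generator that consumes readings one at a time and emits an alert when a sliding-window average crosses a threshold. The function name, the `(timestamp, value)` input shape, and the default window and threshold are assumptions chosen for illustration.

```python
from collections import deque


def alerts(readings, window=3, threshold=30.0):
    """Yield (timestamp, average) when the sliding-window mean exceeds threshold.

    `readings` may be any iterable of (timestamp, value) pairs, including an
    unbounded one, because only the last `window` values are kept in memory.
    """
    recent = deque(maxlen=window)  # oldest value is dropped automatically
    for timestamp, value in readings:
        recent.append(value)
        if len(recent) == window:
            average = sum(recent) / window
            if average > threshold:
                yield timestamp, average
```

Unlike a batch job, this code never sees the whole dataset: state is bounded by the window size, and each alert is produced as soon as the triggering reading arrives.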

Real-Time Analytics

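Characteristics: answers questions over fresh data within seconds or less, so results can influence the interaction that produced the data.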

Example:

User clicks → Real-time recommendation engine
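One very simple way to picture such an engine is a model that is updated on every click and can be queried immediately afterwards. The sketch below is a toy under stated assumptions: the `ClickRecommender` class and its "items clicked next" heuristic are hypothetical, and production recommenders use far richer models.

```python
from collections import Counter, defaultdict


class ClickRecommender:
    """Recommend items that users most often clicked right after a given item."""

    def __init__(self):
        self._last_item = {}                    # user -> most recent item clicked
        self._followers = defaultdict(Counter)  # item -> counts of following items

    def record_click(self, user, item):
        """Update the counts incrementally; no batch recomputation is needed."""
        previous = self._last_item.get(user)
        if previous is not None and previous != item:
            self._followers[previous][item] += 1
        self._last_item[user] = item

    def recommend(self, item, k=3):
        """Return up to k items most frequently clicked after `item`."""
        counts = self._followers.get(item, Counter())
        return [next_item for next_item, _ in counts.most_common(k)]
```

Because each click only touches a couple of counters, the recommendation for the next page view already reflects the click that just happened.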

Summary

Big data platforms are essential infrastructure for modern data-driven applications.

Understanding big data platforms is fundamental for building scalable, data-intensive applications.
