11-01 Data Sourcing, Cleaning, and Preparation
Data preparation is a critical step in any data analytics or machine learning project. This lecture covers the essential techniques for sourcing data and cleaning it for analysis.
Data Sourcing
Data Licensing
[!WARNING] Can you use any data?
Always check data sources for restrictions before use!
Common licenses:
MIT License
- Very permissive
- Can use commercially
- Must keep license and copyright notice
GPL (v2/v3)
- Copyleft license
- Must make source code available for anything built with it that you distribute
- Must include copyright, license, link to original, and change details
- “Spreads virally” - distributed works that build on it must also be GPL-licensed
Other Considerations
- Commercial use restrictions
- Geographic / export restrictions on foreign use
- Attribution requirements
- Data privacy regulations (GDPR, CCPA)
Types of Data Sources
1. Refined Datasets
Characteristics:
- Academic or governmental datasets
- Released to the public
- Pre-cleaned and formatted
- Table-like format (CSV, databases)
Advantages:
- ✓ Missing values handled
- ✓ Parsing done
- ✓ Easier to download
- ✓ Published results from other papers to compare against
Examples:
- MNIST: Handwritten digits dataset
- ImageNet: Image classification dataset
- UCI Machine Learning Repository
- Kaggle Datasets
- Government open data portals
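Refined datasets can usually be loaded in a couple of lines. A minimal sketch, assuming scikit-learn is installed and using the OpenML copy of MNIST ('mnist_784' is the standard OpenML dataset name for the flattened images):
from sklearn.datasets import fetch_openml

# Download flattened MNIST images from OpenML (cached locally after the first call)
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
X, y = mnist.data, mnist.target
print(X.shape, y.shape)  # (70000, 784) (70000,)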
2. Raw Data
Characteristics:
- Unprocessed data from original sources
- May have missing values, formatting issues
- Requires significant cleaning
Sources:
- Social media (Twitter, Facebook)
- Sensor data (IoT devices)
- Scientific instruments
- Web scraping
- APIs
Challenges:
- ⚠ Missing values
- ⚠ Inconsistent formatting
- ⚠ Non-normalized data
- ⚠ Requires extensive cleaning
Web Scraping
Process
- Use an HTTP client or headless browser to download the page HTML (including JavaScript-rendered content)
- Parse HTML to extract relevant data
- Save in structured format (CSV, JSON, database)
Example with BeautifulSoup
from bs4 import BeautifulSoup
html_content = """
<html><body><ul>
<li class="shoe-item">Air Jordans</li>
<li class="shoe-item">Light up Sketchers</li>
</ul></body></html>
"""
soup = BeautifulSoup(html_content, 'html.parser')
for item in soup.find_all(attrs={'class': 'shoe-item'}):
    print(item.text)
# Output:
# Air Jordans
# Light up Sketchers
Web Scraping Challenges
⚠ Rate limiting: Servers may block excessive requests
⚠ IP blocking: May get banned for aggressive scraping
⚠ Legal issues: May violate terms of service
⚠ Fragile code: Breaks when HTML structure changes
⚠ Dynamic content: JavaScript-rendered content requires special handling
Best Practice: Use official APIs when available instead of scraping
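A minimal sketch of the API route, assuming a hypothetical JSON endpoint (the URL, parameters, and response fields are placeholders, not a real service):
import requests

# Hypothetical endpoint and fields - substitute details from the real API's documentation
response = requests.get(
    'https://api.example.com/v1/shoes',
    params={'page': 1, 'per_page': 100},
    headers={'Accept': 'application/json'},
    timeout=10,
)
response.raise_for_status()            # fail loudly on rate limits or auth errors
for shoe in response.json()['items']:  # 'items' is an assumed field name
    print(shoe['name'])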
Data Cleaning
Why Clean Data?
Data quality issues include:
- Missing values: Incomplete records
- Duplicate values: Repeated entries
- Invalid values: Data outside expected range
- Useless values: Irrelevant information
- Inconsistent formats: Different representations of same data
Goal: Ensure data is as representative and accurate as possible for machine learning algorithms
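Before cleaning, it helps to profile the raw data to see which of these issues are actually present. A minimal sketch with pandas (the file and column contents are illustrative):
import pandas as pd

df = pd.read_csv('raw_data.csv')

print(df.shape)                    # rows x columns
print(df.dtypes)                   # inconsistent formats often show up as 'object' columns
print(df.isnull().sum())           # missing values per column
print(df.duplicated().sum())       # number of duplicate rows
print(df.describe(include='all'))  # value ranges reveal invalid entries (e.g. negative ages)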
Data Normalization
Data should be in a normalized format:
Before:
  Name: "John Smith"
  Age: "25 years"
  City: "NYC"

After:
  first_name: "John"
  last_name: "Smith"
  age: 25
  city: "New York City"
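A minimal sketch of that transformation in pandas (the column names and the city mapping are assumptions for illustration):
# Split the combined name field into components
df[['first_name', 'last_name']] = df['Name'].str.split(' ', n=1, expand=True)

# Strip the units and convert to a numeric type (assumes every row contains a number)
df['age'] = df['Age'].str.extract(r'(\d+)', expand=False).astype(int)

# Map abbreviations to a single canonical spelling
df['city'] = df['City'].replace({'NYC': 'New York City'})

df = df.drop(columns=['Name', 'Age', 'City'])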
Handling Missing Values
Strategies
1. Drop Data
import pandas as pd
# Drop rows with any missing values
df_clean = df.dropna()
# Drop rows where specific column is missing
df_clean = df.dropna(subset=['age'])
# Drop columns with more than 50% missing values (thresh = required non-missing count)
df_clean = df.dropna(axis=1, thresh=int(len(df) * 0.5))
Considerations:
- ⚠ May skew dataset
- ⚠ Loss of information
- ✓ Simple and fast
- ✓ Works when few missing values
2. Imputation - Fill with Mean/Median/Mode
# Fill with mean
df['age'] = df['age'].fillna(df['age'].mean())
# Fill with median (better for skewed data)
df['salary'] = df['salary'].fillna(df['salary'].median())
# Fill with mode (for categorical data)
df['category'] = df['category'].fillna(df['category'].mode()[0])
Considerations:
- ✓ Preserves dataset size
- ⚠ Introduces bias
- ⚠ Reduces variance
3. Forward/Backward Fill
# Forward fill (use previous value)
df['temperature'] = df['temperature'].ffill()
# Backward fill (use next value)
df['temperature'] = df['temperature'].bfill()
Use case: Time series data
4. Interpolation
# Linear interpolation
df['value'] = df['value'].interpolate(method='linear')
# Polynomial interpolation (requires SciPy)
df['value'] = df['value'].interpolate(method='polynomial', order=2)
5. Probabilistic Sampling
import numpy as np
# Sample from distribution
mean = df['age'].mean()
std = df['age'].std()
missing_count = df['age'].isnull().sum()
df.loc[df['age'].isnull(), 'age'] = np.random.normal(mean, std, missing_count)
6. Leave As Is
Some algorithms handle missing values:
- Decision trees
- Random forests
- XGBoost
Advantage: Minimal bias introduction
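A minimal sketch with scikit-learn's histogram gradient boosting, which accepts NaN inputs natively (the feature and target column names are illustrative):
from sklearn.ensemble import HistGradientBoostingClassifier

X = df[['age', 'salary']]   # may contain NaN - no imputation needed
y = df['churned']

# At each split, samples with missing values are routed down a learned branch
model = HistGradientBoostingClassifier()
model.fit(X, y)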
Handling Duplicates
Identifying Duplicates
# Find duplicate rows
duplicates = df.duplicated()
# Find duplicates based on specific columns
duplicates = df.duplicated(subset=['name', 'email'])
# View duplicate rows
df[df.duplicated(keep=False)]
Removing Duplicates
# Remove all duplicates
df_clean = df.drop_duplicates()
# Keep first occurrence
df_clean = df.drop_duplicates(keep='first')
# Keep last occurrence
df_clean = df.drop_duplicates(keep='last')
# Remove based on specific columns
df_clean = df.drop_duplicates(subset=['user_id'])
Considerations
[!IMPORTANT] Duplicates can have meaning!
- Multiple purchases by same customer
- Repeated sensor readings
- Time-series data
Questions to ask:
- What defines a duplicate? (Exact match? Close match?)
- Should we keep any duplicates?
- Is there temporal ordering to consider?
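For example, when records carry a timestamp, a common pattern is to keep only the most recent record per entity rather than an arbitrary one. A minimal sketch (column names are assumptions):
# Keep only the latest record for each user
df_latest = (
    df.sort_values('updated_at')
      .drop_duplicates(subset=['user_id'], keep='last')
)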
Handling Invalid and Useless Values
Invalid Values
Problem domain specifies what is valid:
# Example: Age must be positive
df = df[df['age'] > 0]
# Example: Percentage must be 0-100
df = df[(df['percentage'] >= 0) & (df['percentage'] <= 100)]
# Example: Date must be in past
from datetime import datetime
df = df[df['date'] <= datetime.now()]
Outliers
# Z-score method
import numpy as np
from scipy import stats
z_scores = np.abs(stats.zscore(df['value']))
df_clean = df[z_scores < 3]
# IQR method
Q1 = df['value'].quantile(0.25)
Q3 = df['value'].quantile(0.75)
IQR = Q3 - Q1
df_clean = df[(df['value'] >= Q1 - 1.5*IQR) &
(df['value'] <= Q3 + 1.5*IQR)]
Useless Values
Tip: Even if you drop these from your working dataset, keep the originals around - they may be useful later!
- Columns with single unique value
- Highly correlated features
- Irrelevant to problem domain
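A minimal sketch for flagging (not necessarily deleting) such columns with pandas; the 0.95 correlation threshold is an arbitrary choice for illustration:
# Columns with a single unique value carry no information
constant_cols = [col for col in df.columns if df[col].nunique(dropna=False) <= 1]

# Highly correlated numeric feature pairs (threshold is arbitrary)
corr = df.select_dtypes('number').corr().abs()
redundant_pairs = [
    (a, b)
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if corr.loc[a, b] > 0.95
]

print(constant_cols, redundant_pairs)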
Data Processing Pipelines
Pipeline Stages
┌─────────────┐
│ Data Source │
└──────┬──────┘
│
┌──────▼──────┐
│ Extract │ (Get data from source)
└──────┬──────┘
│
┌──────▼──────┐
│ Transform │ (Clean, normalize, enrich)
└──────┬──────┘
│
┌──────▼──────┐
│ Load │ (Store in destination)
└──────┬──────┘
│
┌──────▼──────┐
│ Validate │ (Check quality)
└─────────────┘
Example Pipeline with Pandas
import pandas as pd
def data_pipeline(input_file, output_file):
    # 1. Extract
    df = pd.read_csv(input_file)

    # 2. Transform
    # Handle missing values
    df['age'] = df['age'].fillna(df['age'].median())
    df['category'] = df['category'].fillna('Unknown')

    # Remove duplicates
    df = df.drop_duplicates(subset=['user_id'])

    # Handle invalid values
    df = df[df['age'] > 0]
    df = df[df['age'] < 120]

    # Normalize formats
    df['email'] = df['email'].str.lower().str.strip()
    df['phone'] = df['phone'].str.replace(r'\D', '', regex=True)

    # 3. Load
    df.to_csv(output_file, index=False)

    # 4. Validate
    print(f"Original rows: {len(pd.read_csv(input_file))}")
    print(f"Cleaned rows: {len(df)}")
    print(f"Missing values:\n{df.isnull().sum()}")
    return df
# Run pipeline
clean_data = data_pipeline('raw_data.csv', 'clean_data.csv')
Loading Data into Processing Frameworks
Hadoop / Spark
# Copy to HDFS
hdfs dfs -copyFromLocal local_file.csv /data/input/
# Or use Spark
spark-submit --master yarn load_data.py
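A minimal sketch of reading the same CSV into a Spark DataFrame, assuming PySpark is installed and the HDFS path from the copy command above:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('load_data').getOrCreate()

# Read from HDFS (or a local path when running in local mode)
df = spark.read.csv('/data/input/local_file.csv', header=True, inferSchema=True)
df.printSchema()
print(df.count())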
SQL Databases
import pandas as pd
from sqlalchemy import create_engine
# Create connection
engine = create_engine('postgresql://user:pass@localhost/db')
# Load data
df.to_sql('table_name', engine, if_exists='replace', index=False)
Cloud Storage
# AWS S3
import boto3
s3 = boto3.client('s3')
s3.upload_file('local_file.csv', 'bucket-name', 'data/file.csv')
# Google Cloud Storage
from google.cloud import storage
client = storage.Client()
bucket = client.bucket('bucket-name')
blob = bucket.blob('data/file.csv')
blob.upload_from_filename('local_file.csv')
Large Data Transfer
For very large datasets (tens of terabytes up to petabytes), physical transfer appliances can be faster than the network:
- AWS Snowball: Physical device for bulk data transfer
- Azure Data Box: Microsoft’s physical transfer solution
- Google Transfer Appliance: Google’s hardware solution
Best Practices
1. Document Everything
# Keep track of transformations
cleaning_log = {
'missing_values_filled': ['age', 'salary'],
'duplicates_removed': 150,
'outliers_removed': 23,
'invalid_values_removed': 5
}
2. Validate at Each Step
def validate_data(df, stage):
    print(f"\n=== Validation at {stage} ===")
    print(f"Shape: {df.shape}")
    print(f"Missing values:\n{df.isnull().sum()}")
    print(f"Duplicates: {df.duplicated().sum()}")
    print(f"Data types:\n{df.dtypes}")
3. Keep Original Data
# Never modify original
df_original = pd.read_csv('data.csv')
df_working = df_original.copy()
# Work on copy
df_working = clean_data(df_working)
4. Automate Repetitive Tasks
def clean_column(series, strategy='mean'):
    """Reusable cleaning function"""
    if strategy == 'mean':
        return series.fillna(series.mean())
    elif strategy == 'median':
        return series.fillna(series.median())
    elif strategy == 'mode':
        return series.fillna(series.mode()[0])
    return series
Summary
Data sourcing and cleaning are critical steps in the data pipeline:
- Sourcing: Obtain data from refined datasets, APIs, or web scraping
- Licensing: Always check data usage rights
- Missing values: Drop, impute, or leave based on context
- Duplicates: Remove carefully, considering domain meaning
- Invalid values: Filter based on business rules
- Pipelines: Automate ETL processes
- Validation: Check quality at each step
Remember: Data scientists spend 60-80% of their time on data preparation - it’s worth doing well!