<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">

 <title>Service-Oriented Architecture and Cloud Computing</title>
 <link href="https://nglelinh.github.io/service-oriented-architecture-and-cloud-computing-iuh/atom.xml" rel="self"/>
 <link href="https://nglelinh.github.io/service-oriented-architecture-and-cloud-computing-iuh/"/>
 <updated>2026-01-11T18:29:25+00:00</updated>
 <id>https://nglelinh.github.io</id>
 <author>
   <name>Nguyen Le Linh</name>
   <email>nglelinh@gmail.com</email>
 </author>

 
 <entry>
   <title>author details</title>
   <link href="https://nglelinh.github.io/home/author-details/"/>
   <updated>2021-05-20T00:00:00+00:00</updated>
   <id>https://nglelinh.github.io/service-oriented-architecture-and-cloud-computing-iuh/home/author-details</id>
   <content type="html">
</content>
 </entry>
 
 <entry>
   <title>reference</title>
   <link href="https://nglelinh.github.io/reference/26_reference/"/>
   <updated>2021-03-28T00:00:00+00:00</updated>
   <id>https://nglelinh.github.io/service-oriented-architecture-and-cloud-computing-iuh/reference/26_reference</id>
   <content type="html">&lt;ol&gt;
  &lt;li&gt;Boyd, S. and Vandenberghe, L. (2004). &lt;em&gt;&lt;a href=&quot;https://web.stanford.edu/~boyd/cvxbook/&quot;&gt;Convex Optimization&lt;/a&gt;&lt;/em&gt;. Cambridge University Press.&lt;/li&gt;
  &lt;li&gt;Boyd, S. and Vandenberghe, L. (2014). &lt;em&gt;&lt;a href=&quot;https://web.stanford.edu/~boyd/cvxbook/bv_cvxslides.pdf&quot;&gt;Convex Optimization Lecture Slides&lt;/a&gt;&lt;/em&gt;.&lt;/li&gt;
  &lt;li&gt;Tibshirani, R. (2016). &lt;em&gt;&lt;a href=&quot;http://www.stat.cmu.edu/~ryantibs/convexopt/&quot;&gt;Convex Optimization: Fall 2016&lt;/a&gt;&lt;/em&gt;.&lt;/li&gt;
  &lt;li&gt;Wikipedia. &lt;em&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Jensen%27s_inequality&quot;&gt;Jensen’s inequality&lt;/a&gt;&lt;/em&gt;.&lt;/li&gt;
  &lt;li&gt;Wikipedia. &lt;em&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Positive-definite_matrix#Further_properties&quot;&gt;Positive-definite matrix - properties&lt;/a&gt;&lt;/em&gt;.&lt;/li&gt;
  &lt;li&gt;Wikipedia. &lt;em&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Matrix_decomposition#Eigendecomposition&quot;&gt;Eigendecomposition&lt;/a&gt;&lt;/em&gt;.&lt;/li&gt;
  &lt;li&gt;Wikipedia. &lt;em&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Eigendecomposition_of_a_matrix#Real_symmetric_matrices&quot;&gt;Eigendecomposition of a matrix - Real symmetric matrices&lt;/a&gt;&lt;/em&gt;.&lt;/li&gt;
  &lt;li&gt;Wikipedia. &lt;em&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Schur_complement&quot;&gt;Schur complement&lt;/a&gt;&lt;/em&gt;.&lt;/li&gt;
  &lt;li&gt;Wikipedia. &lt;em&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Quasiconvex_function&quot;&gt;Quasiconvex function&lt;/a&gt;&lt;/em&gt;.&lt;/li&gt;
  &lt;li&gt;Boyd, S. and Dattorro, J. (2013). &lt;em&gt;&lt;a href=&quot;https://web.stanford.edu/class/ee392o/alt_proj.pdf&quot;&gt;Alternating Projections&lt;/a&gt;&lt;/em&gt;.&lt;/li&gt;
  &lt;li&gt;Wikipedia. &lt;em&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Max-flow_min-cut_theorem&quot;&gt;Max-flow min-cut theorem&lt;/a&gt;&lt;/em&gt;.&lt;/li&gt;
  &lt;li&gt;Wikipedia. &lt;em&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Ford%E2%80%93Fulkerson_algorithm&quot;&gt;Ford-Fulkerson algorithm&lt;/a&gt;&lt;/em&gt;.&lt;/li&gt;
  &lt;li&gt;Tibshirani, R. J. (2013). &lt;em&gt;&lt;a href=&quot;https://projecteuclid.org/download/pdfview_1/euclid.ejs/1369148600&quot;&gt;The lasso problem and uniqueness&lt;/a&gt;&lt;/em&gt;. Electronic Journal of Statistics, Vol. 7, pp. 1456–1490.&lt;/li&gt;
  &lt;li&gt;Nocedal, J. and Wright, S. J. (2006). &lt;em&gt;Numerical Optimization&lt;/em&gt;, 2nd ed. Springer.&lt;/li&gt;
  &lt;li&gt;Wikipedia. &lt;em&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Frank%E2%80%93Wolfe_algorithm&quot;&gt;Frank–Wolfe algorithm&lt;/a&gt;&lt;/em&gt;.&lt;/li&gt;
  &lt;li&gt;Wikipedia. &lt;em&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Coordinate_descent&quot;&gt;Coordinate descent&lt;/a&gt;&lt;/em&gt;.&lt;/li&gt;
&lt;/ol&gt;
</content>
 </entry>
 
 <entry>
   <title>makers</title>
   <link href="https://nglelinh.github.io/home/makers/"/>
   <updated>2021-02-03T00:00:00+00:00</updated>
   <id>https://nglelinh.github.io/service-oriented-architecture-and-cloud-computing-iuh/home/makers</id>
   <content type="html">
</content>
 </entry>
 
 <entry>
   <title>Conventions</title>
   <link href="https://nglelinh.github.io/contribution/conventions/"/>
   <updated>2021-02-03T00:00:00+00:00</updated>
   <id>https://nglelinh.github.io/service-oriented-architecture-and-cloud-computing-iuh/contribution/conventions</id>
   <content type="html">&lt;h2 id=&quot;1-directory-convention&quot;&gt;1. Directory Convention&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;The main content lives in directories whose paths start with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;contents/chapter&lt;/code&gt;. Images used by the content are stored in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;image&lt;/code&gt; directory.&lt;/li&gt;
  &lt;li&gt;The internal layout of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;contents/chapter&lt;/code&gt; is as follows.&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;contents
├── chapter01
│   ├── _posts
│   │   ├── 21-01-07-01_00_Introduction.md
│   │   ├── 21-01-07-01_01_optimization_problems.md
│   │   ├── 21-01-28-01_02_convex_optimization_problem.md
│   │   ├── 21-01-28-01_03_goals_and_topics.md
│   │   └── 21-01-28-01_04_brief_history_of_convex_optimization.md
│   ├── index.html
...
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;ul&gt;
  &lt;li&gt;Jekyll treats Markdown or HTML files inside a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_posts&lt;/code&gt; directory as blog posts. To write a new post, add a new file to the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_posts&lt;/code&gt; directory of the relevant chapter.&lt;/li&gt;
  &lt;li&gt;All Jekyll post files must follow the naming convention below.
    &lt;ul&gt;
      &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;yy-mm-dd-new_posting_name.md&lt;/code&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Everything outside the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;chapter&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;image&lt;/code&gt; directories relates to the blog’s configuration. To keep the site stable, please open an issue rather than editing configuration files directly, and we will handle it (PRs that modify configuration may be difficult to merge).&lt;/li&gt;
&lt;/ul&gt;
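
&lt;p&gt;As a rough illustration of the naming convention above, the following Python sketch checks a file name against the yy-mm-dd-new_posting_name.md pattern. The helper name is_valid_post_name, the set of allowed characters, and the acceptance of an .html extension are assumptions made for this example; they are not part of the repository.&lt;/p&gt;

```python
import re

# Assumed pattern: two-digit year, month, and day, then a name made of
# letters, digits, and underscores, then a .md (or .html) extension.
POST_NAME = re.compile(r"^\d{2}-\d{2}-\d{2}-[A-Za-z0-9_]+\.(md|html)$")


def is_valid_post_name(filename: str) -> bool:
    """Return True if the file name looks like yy-mm-dd-new_posting_name.md."""
    return POST_NAME.match(filename) is not None


if __name__ == "__main__":
    for name in ("21-01-07-01_01_optimization_problems.md",
                 "optimization_problems.md"):
        print(name, is_valid_post_name(name))
```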

&lt;h2 id=&quot;2-posting-convention&quot;&gt;2. Posting Convention&lt;/h2&gt;

&lt;h3 id=&quot;21-header-field&quot;&gt;2.1. Header Field&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;Every post file must have a header like the following example.&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;---
layout: post
title: Quasi-Newton Methods
chapter: &quot;18&quot;
order: 1
owner: &quot;Kyeongmin Woo&quot;
---
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;layout&lt;/strong&gt; must be &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;post&lt;/code&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;title&lt;/strong&gt; can be any string that fits the content.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;chapter&lt;/strong&gt; is the last two digits of the parent category, written as a string. Single-digit chapters must be padded with a leading zero, as in “01”.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;order&lt;/strong&gt; is the sort order of the post within its chapter.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;owner&lt;/strong&gt; is the maintainer of the post.&lt;/li&gt;
&lt;/ul&gt;
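
&lt;p&gt;The header rules above could be checked mechanically. The sketch below is only an illustration: it parses simple key: value lines by hand instead of using a YAML library, and the names parse_front_matter and check_header are hypothetical helpers, not tools that exist in the repository.&lt;/p&gt;

```python
REQUIRED_KEYS = ("layout", "title", "chapter", "order", "owner")


def parse_front_matter(text: str) -> dict:
    """Parse simple 'key: value' lines between the two '---' markers."""
    lines = text.strip().splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("front matter must start with ---")
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields
        key, _, value = line.partition(":")
        # Drop surrounding whitespace and optional double quotes.
        fields[key.strip()] = value.strip().strip('"')
    raise ValueError("front matter is missing the closing ---")


def check_header(fields: dict) -> list:
    """Return a list of problems; an empty list means no rule was violated."""
    problems = [f"missing field: {k}" for k in REQUIRED_KEYS if k not in fields]
    if fields.get("layout") not in (None, "post"):
        problems.append("layout must be 'post'")
    chapter = fields.get("chapter")
    if chapter is not None and not (len(chapter) == 2 and chapter.isdigit()):
        problems.append("chapter must be a two-digit string such as '01'")
    return problems


EXAMPLE = '''---
layout: post
title: Quasi-Newton Methods
chapter: "18"
order: 1
owner: "Kyeongmin Woo"
---
'''

if __name__ == "__main__":
    print(check_header(parse_front_matter(EXAMPLE)))
```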

&lt;h3 id=&quot;22-latex&quot;&gt;2.2. LaTeX&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;Write equations in LaTeX syntax.&lt;/li&gt;
  &lt;li&gt;Wrap each equation in double dollar signs ($$) to mark it as math.&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;$$\theta x_1 + (1-\theta)x_2 \in C$$
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The equation above renders as follows.&lt;/p&gt;

\[\theta x_1 + (1-\theta)x_2 \in C\]

&lt;h3 id=&quot;23-image-convention&quot;&gt;2.3. Image Convention&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;When inserting an image into a post file, follow the convention below.&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&amp;lt;figure class=&quot;image&quot; style=&quot;align: center;&quot;&amp;gt;
&amp;lt;p align=&quot;center&quot;&amp;gt;
  &amp;lt;img src=&quot;{image_path}&quot; alt=&quot;{description of image}&quot; width=&quot;{scale_ratio}%&quot; height=&quot;{scale_ratio}%&quot;&amp;gt;
  &amp;lt;figcaption style=&quot;text-align: center;&quot;&amp;gt;{figcaption}&amp;lt;/figcaption&amp;gt;
&amp;lt;/p&amp;gt;
&amp;lt;/figure&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;ul&gt;
  &lt;li&gt;The figure class must be &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;image&lt;/code&gt;.&lt;/li&gt;
  &lt;li&gt;Replace each {} placeholder with the appropriate value.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;24-hyperlink-convention&quot;&gt;2.4. Hyperlink Convention&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;Hyperlinks to posts within the blog use Jekyll’s multilang_post_url. For example, the hyperlink to the first post, &lt;a href=&quot;#post-not-found&quot;&gt;Optimization problems?&lt;/a&gt;, is written as follows.&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;[Optimization problems?](#post-not-found)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;ul&gt;
  &lt;li&gt;Hyperlinks to other, external URLs can be written as follows.&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;[Convex Optimization wiki](&amp;lt;https://bit.ly/2PXv736&amp;gt;)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;3-github-convnention&quot;&gt;3. GitHub Convention&lt;/h2&gt;

&lt;p&gt;If you have a question about the content or find something that needs correcting, let us know in one of the following two ways.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Leave a comment&lt;/li&gt;
  &lt;li&gt;Create an issue in the repository&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you want to add new content or edit it yourself, create a new branch, make your changes there, and then open a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Pull Request&lt;/code&gt;. Anyone may write new content or revise existing content.&lt;/p&gt;

&lt;h3 id=&quot;31-repository-policy&quot;&gt;3.1. Repository Policy&lt;/h3&gt;

&lt;p&gt;Merging into the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;main&lt;/code&gt; branch requires approval from at least one reviewer. A CODEOWNERS system is in place, so a reviewer is assigned automatically for each chapter.&lt;/p&gt;

&lt;h3 id=&quot;32-branch-naming-convention&quot;&gt;3.2. Branch Naming Convention&lt;/h3&gt;

&lt;p&gt;Name branches according to the following convention.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;[feature|bugfix]/[chapter**|settings]-change-description
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Two prefixes are used: feature and bugfix. Typical uses of each are as follows.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;feature
    &lt;ul&gt;
      &lt;li&gt;Migration work&lt;/li&gt;
      &lt;li&gt;Changes to sentences, equations, images, and the like&lt;/li&gt;
      &lt;li&gt;Adding new content&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;bugfix
    &lt;ul&gt;
      &lt;li&gt;Fixing typos&lt;/li&gt;
      &lt;li&gt;Fixing a broken LaTeX rendering&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Concrete examples are shown below.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;feature/chapter01-migration&lt;/code&gt;: chapter01 migration&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;feature/chapter01-fix-formula&lt;/code&gt;: update an equation in chapter01&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;feature/settings-update-branch-convention&lt;/code&gt;: update the conventions&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;bugfix/chapter01-fix-typo&lt;/code&gt;: fix a typo in chapter01&lt;/li&gt;
&lt;/ul&gt;
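
&lt;p&gt;The branch naming convention can be expressed as a regular expression. The following Python sketch is one possible reading of it: it assumes that chapter** stands for the word chapter followed by two digits and that the description is lowercase words joined by hyphens. The helper is_valid_branch is hypothetical and is not enforced anywhere in the repository.&lt;/p&gt;

```python
import re

# Assumed reading of [feature|bugfix]/[chapter**|settings]-description
BRANCH_NAME = re.compile(
    r"^(feature|bugfix)/(chapter\d{2}|settings)-[a-z0-9]+(-[a-z0-9]+)*$"
)


def is_valid_branch(name: str) -> bool:
    """Return True if the branch name follows the assumed convention."""
    return BRANCH_NAME.match(name) is not None


if __name__ == "__main__":
    for name in ("feature/chapter01-migration",
                 "feature/settings-update-branch-convention",
                 "bugfix/chapter01-fix-typo",
                 "hotfix/chapter01-typo"):
        print(name, is_valid_branch(name))
```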
</content>
 </entry>
 
 <entry>
   <title>contents</title>
   <link href="https://nglelinh.github.io/home/link_to_how_to_contribute/"/>
   <updated>2021-01-27T00:00:00+00:00</updated>
   <id>https://nglelinh.github.io/service-oriented-architecture-and-cloud-computing-iuh/home/link_to_how_to_contribute</id>
   <content type="html">
</content>
 </entry>
 
 <entry>
   <title>Initial Settings</title>
   <link href="https://nglelinh.github.io/contribution/initial_settings/"/>
   <updated>2021-01-27T00:00:00+00:00</updated>
   <id>https://nglelinh.github.io/service-oriented-architecture-and-cloud-computing-iuh/contribution/initial_settings</id>
   <content type="html">&lt;p&gt;The contents of this repository are hosted as a &lt;a href=&quot;https://convex-optimization-for-all.github.io/&quot;&gt;Github Blog&lt;/a&gt; using Jekyll.
Therefore, to edit existing content or create new content, you must follow Jekyll’s directory structure and content writing conventions.
Additionally, you need to verify that changes are properly reflected in your local environment (your current computer) through a web browser.&lt;/p&gt;

&lt;p&gt;We have compiled environment setup instructions for those who are not familiar with GitHub or Jekyll.
If you have difficulties following the guide, please leave an &lt;a href=&quot;https://github.com/convex-optimization-for-all/convex-optimization-for-all.github.io/issues&quot;&gt;issue&lt;/a&gt; in the repository or contact us via the email below for assistance.&lt;/p&gt;

&lt;p&gt;(Kyeongmin Woo, wgm0601@gmail.com)&lt;/p&gt;

&lt;h2 id=&quot;1-git-installation&quot;&gt;1. Git Installation&lt;/h2&gt;

&lt;p&gt;All work management for this blog is performed through Git and GitHub. Please visit the website below to install Git.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://git-scm.com/downloads&quot;&gt;https://git-scm.com/downloads&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;2-downloading-the-repository&quot;&gt;2. Downloading the Repository&lt;/h2&gt;

&lt;p&gt;To modify the blog, enter the following command in the terminal to download the blog’s source code.&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;git clone https://github.com/convex-optimization-for-all/convex-optimization-for-all.github.io.git
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;3-local-hosting&quot;&gt;3. Local Hosting&lt;/h2&gt;

&lt;p&gt;Before publishing new or modified content to the blog, verify through local hosting that it renders as intended.
If content that does not follow Jekyll’s required conventions is merged into the repository, the live blog may not function properly.
You can set up local hosting in one of two ways: in a virtual environment with Docker (Option 1), or by installing the Jekyll environment directly on your machine (Option 2).
Once the local server is running, you can view the blog at the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;127.0.0.1:4000&lt;/code&gt; address in your web browser.&lt;/p&gt;

&lt;h3 id=&quot;3-1-option-1-docker-installation&quot;&gt;3-1. (Option 1) Docker Installation&lt;/h3&gt;

&lt;h3 id=&quot;a-docker-installation&quot;&gt;A. Docker Installation&lt;/h3&gt;

&lt;p&gt;Using Docker enables local hosting without direct environment installation on your local machine.
Please visit the website below to install Docker.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://docs.docker.com/get-docker/&quot;&gt;https://docs.docker.com/get-docker/&lt;/a&gt;&lt;/p&gt;

&lt;h3 id=&quot;b-local-hosting&quot;&gt;B. Local Hosting&lt;/h3&gt;

&lt;p&gt;Enter the following command in the terminal.&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;docker-compose up
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;3-2-option-2-jekyll-environment-installation&quot;&gt;3-2. (Option 2) Jekyll Environment Installation&lt;/h3&gt;

&lt;h3 id=&quot;a-jekyll-and-ruby-package-installation&quot;&gt;A. Jekyll and Ruby Package Installation&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://jekyllrb.com/docs/installation/&quot;&gt;Installing Ruby&lt;/a&gt;: Jekyll is built with Ruby. Therefore, you need to install Ruby to use Jekyll.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://jekyllrb.com/docs/&quot;&gt;Installing Jekyll&lt;/a&gt;: Once Ruby is installed, enter the cloned repository and install Jekyll.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://jekyllrb.com/docs/&quot;&gt;Installing Bundle Gem&lt;/a&gt;: You need to additionally install Ruby packages required for hosting. Run the following command in the repository’s project directory.&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;bundle &lt;span class=&quot;nb&quot;&gt;install&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;b-local-hosting-1&quot;&gt;B. Local Hosting&lt;/h3&gt;

&lt;p&gt;Enter the following command in the terminal.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;$ jekyll serve
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;If hosting doesn’t work, you can also try the following command.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;$ bundle exec jekyll serve
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;If neither command works, the Jekyll environment has most likely not been installed correctly.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>How to Contribute</title>
   <link href="https://nglelinh.github.io/contribution/how_to_contribute/"/>
   <updated>2021-01-27T00:00:00+00:00</updated>
   <id>https://nglelinh.github.io/service-oriented-architecture-and-cloud-computing-iuh/contribution/how_to_contribute</id>
   <content type="html">&lt;hr /&gt;

&lt;h2 id=&quot;1-컨텐츠를-직접-수정하는-방법&quot;&gt;1. How to Edit Content Directly&lt;/h2&gt;

&lt;h3 id=&quot;1-우선-local의-repository-directory로-들어갑니다-local-repository가-없다면-initial-settings를-참고하시기-바랍니다&quot;&gt;(1) First, change into your local repository directory. If you do not have a local repository yet, see &lt;a href=&quot;https://convex-optimization-for-all.github.io/contribution/2021/01/27/initial_settings/&quot;&gt;Initial Settings&lt;/a&gt;.&lt;/h3&gt;

&lt;h3 id=&quot;2--remote-저장소와의-정보를-동기화합니다&quot;&gt;(2) Synchronize with the remote repository.&lt;/h3&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;git checkout main
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;git pull &lt;span class=&quot;nt&quot;&gt;--all&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;3-수정-내용을-담는-새로운-브랜치를-생성합니다-브랜치-명은-prefix챕터명수정하는_이유로-하시면-됩니다branch-naming-convetion-예시는-아래와-같습니다&quot;&gt;(3) Create a new branch for your changes. Name the branch &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;[Prefix]/[chapter]-[reason_for_change]&lt;/code&gt; (see the &lt;a href=&quot;https://convex-optimization-for-all.github.io/contribution/2021/02/03/conventions/&quot;&gt;Branch Naming Convention&lt;/a&gt;). An example is shown below.&lt;/h3&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;git checkout &lt;span class=&quot;nt&quot;&gt;-b&lt;/span&gt; bugfix/chapter01-fix-typo
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;4-파일을-편집합니다-생성-또는-수정하고자-하는-컨텐츠는-convention을-지켜-작성해야-합니다&quot;&gt;(4) Edit the files. Content that you create or modify must follow the &lt;a href=&quot;https://convex-optimization-for-all.github.io/contribution/2021/02/03/conventions/&quot;&gt;Convention&lt;/a&gt;.&lt;/h3&gt;

&lt;h3 id=&quot;5-remote로-push합니다-예시는-아래와-같습니다&quot;&gt;(5) Push to the remote. An example is shown below.&lt;/h3&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;git push origin bugfix/chapter01-fix-typo
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;6-github에서-main-branch로의-pull-request를-생성합니다-pull-request-생성-방법은-아래-github-docs를-참고하시기-바랍니다&quot;&gt;(6) On &lt;a href=&quot;https://github.com/convex-optimization-for-all/convex-optimization-for-all.github.io/pulls&quot;&gt;GitHub&lt;/a&gt;, create a Pull Request targeting the main branch. For how to create a Pull Request, see the GitHub Docs page below.&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://docs.github.com/en/github/collaborating-with-issues-and-pull-requests/creating-a-pull-request&quot;&gt;Creating a pull request&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;2-컨텐츠-수정을-요청하는-방법&quot;&gt;2. How to Request a Content Change&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;You can create an &lt;a href=&quot;https://github.com/convex-optimization-for-all/convex-optimization-for-all.github.io/issues&quot;&gt;Issue&lt;/a&gt; in the GitHub repository.&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;
</content>
 </entry>
 
 <entry>
   <title>introduction</title>
   <link href="https://nglelinh.github.io/home/introduction/"/>
   <updated>2021-01-20T00:00:00+00:00</updated>
   <id>https://nglelinh.github.io/service-oriented-architecture-and-cloud-computing-iuh/home/introduction</id>
   <content type="html">
</content>
 </entry>
 
 <entry>
   <title>contents</title>
   <link href="https://nglelinh.github.io/home/contents/"/>
   <updated>2021-01-20T00:00:00+00:00</updated>
   <id>https://nglelinh.github.io/service-oriented-architecture-and-cloud-computing-iuh/home/contents</id>
   <content type="html">&lt;p&gt;An introduction to convex optimization problems, concepts in convex analysis, convex optimization algorithms, duality theory, optimality conditions, and applications of convex optimization in statistics and machine learning.&lt;/p&gt;

&lt;h1 id=&quot;course-objectives&quot;&gt;Course Objectives&lt;/h1&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;Provide students with fundamental knowledge of convex optimization to support their study and research in data science.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Enable students to understand optimization algorithms and use existing software to solve optimization problems.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Begin to develop skills in analyzing and solving convex optimization problems in real-world applications, and in applying optimization algorithms to these problems.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;main-textbooks&quot;&gt;Main Textbooks&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;Y. Nesterov, Lectures on Convex Optimization, 2018. [FIT_2101685_001]&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;S. Boyd &amp;amp; L. Vandenberghe, Convex Optimization, 2004.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;references&quot;&gt;References&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;C. C. Aggarwal, Linear Algebra and Optimization for Machine Learning: A Textbook, 2020. [FIT_2101685_101]&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;C. Byrne, A First Course in Optimization, CRC Press, 2015.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;S. Sra, S. Nowozin, &amp;amp; S. J. Wright (eds.), Optimization for Machine Learning, MIT Press, 2012.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;
</content>
 </entry>
 
 <entry>
   <title>14 Introduction to Infrastructure as Code</title>
   <link href="https://nglelinh.github.io/contents/en/chapter14/14_Introduction/"/>
   <updated>2021-01-01T00:00:00+00:00</updated>
   <id>https://nglelinh.github.io/service-oriented-architecture-and-cloud-computing-iuh/contents/en/chapter14/14_Introduction</id>
   <content type="html">&lt;p&gt;This chapter introduces Infrastructure as Code (IaC), the practice of managing and provisioning computing infrastructure through machine-readable definition files rather than physical hardware configuration or interactive configuration tools.&lt;/p&gt;

&lt;h2 id=&quot;learning-objectives&quot;&gt;Learning Objectives&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;Define Infrastructure as Code and its benefits (Version Control, Reproducibility)&lt;/li&gt;
  &lt;li&gt;Differentiate between Imperative and Declarative approaches&lt;/li&gt;
  &lt;li&gt;Learn the basics of Terraform (Providers, Resources, State)&lt;/li&gt;
  &lt;li&gt;Understand Ansible for configuration management&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;automating-the-cloud&quot;&gt;Automating the Cloud&lt;/h2&gt;

&lt;p&gt;Manual configuration (“ClickOps”) is error-prone and unscalable. IaC allows developers to treat infrastructure as software—enabling code review, testing, and automated deployment of the entire data center stack.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>14-01 Infrastructure as Code with Terraform</title>
   <link href="https://nglelinh.github.io/contents/en/chapter14/14_01_Infrastructure_as_Code/"/>
   <updated>2021-01-01T00:00:00+00:00</updated>
   <id>https://nglelinh.github.io/service-oriented-architecture-and-cloud-computing-iuh/contents/en/chapter14/14_01_Infrastructure_as_Code</id>
   <content type="html">&lt;p&gt;Infrastructure as Code (IaC) has revolutionized how we manage and deploy cloud resources. Instead of manual configuration, we define our infrastructure in software, bringing the power of version control and automation to operations.&lt;/p&gt;

&lt;h2 id=&quot;the-problem-with-manual-deployment&quot;&gt;The Problem with Manual Deployment&lt;/h2&gt;

&lt;h3 id=&quot;distributed-applications-are-complex&quot;&gt;Distributed Applications are Complex&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Configuration Sensitive&lt;/strong&gt;: Database servers need different settings than web servers.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Updated Continuously&lt;/strong&gt;: Code and patches deployed daily or hourly.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Human Factors&lt;/strong&gt;: Manual steps lead to “operator error”.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Scale&lt;/strong&gt;: Running on tens, hundreds, or thousands of nodes.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;approaches-to-deployment&quot;&gt;Approaches to Deployment&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Manual Setup&lt;/strong&gt;
    &lt;ul&gt;
      &lt;li&gt;Click through the cloud provider’s web portal by hand.&lt;/li&gt;
      &lt;li&gt;&lt;strong&gt;Does this scale?&lt;/strong&gt; Clearly no.&lt;/li&gt;
      &lt;li&gt;Error-prone and not reproducible.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Custom Scripts&lt;/strong&gt;
    &lt;ul&gt;
      &lt;li&gt;Use cloud provider APIs (e.g., AWS Boto3) to create machines.&lt;/li&gt;
      &lt;li&gt;Programmatically SSH to run tasks.&lt;/li&gt;
      &lt;li&gt;&lt;strong&gt;Does this scale?&lt;/strong&gt; Maybe, but hard to maintain. “Why reinvent the wheel?”&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Infrastructure as Code (IaC)&lt;/strong&gt;
    &lt;ul&gt;
      &lt;li&gt;Declare infrastructure in a specific format (code).&lt;/li&gt;
      &lt;li&gt;IaC framework deploys/updates the cloud infrastructure.&lt;/li&gt;
      &lt;li&gt;&lt;strong&gt;Does this scale?&lt;/strong&gt; Yes!&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;iac-concepts&quot;&gt;IaC Concepts&lt;/h2&gt;

&lt;h3 id=&quot;declarative-vs-imperative&quot;&gt;Declarative vs. Imperative&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Declarative&lt;/strong&gt;: Define the &lt;strong&gt;target state&lt;/strong&gt; (what you want).
    &lt;ul&gt;
      &lt;li&gt;&lt;em&gt;Example&lt;/em&gt;: “I want 3 VMs and a Load Balancer.”&lt;/li&gt;
      &lt;li&gt;The system figures out how to get there.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Imperative&lt;/strong&gt;: Define &lt;strong&gt;how&lt;/strong&gt; to change the state (steps to take).
    &lt;ul&gt;
      &lt;li&gt;&lt;em&gt;Example&lt;/em&gt;: “Create VM 1. Create VM 2. Create VM 3. Create Load Balancer. Add VMs to Load Balancer.”&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Intelligent&lt;/strong&gt;: Define relationships and constraints; the system figures out the updates.&lt;/li&gt;
&lt;/ul&gt;
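
&lt;p&gt;The difference between the two styles can be sketched in a few lines of Python. This is a toy model only: the cloud is a plain list of resource names, the functions imperative_setup and declarative_apply are hypothetical, and scaling down is deliberately left out. It is meant to show why a declarative run is safe to repeat while a fixed list of steps is not.&lt;/p&gt;

```python
def imperative_setup(cloud: list) -> None:
    """Imperative: run a fixed list of steps, whatever already exists."""
    for i in (1, 2, 3):
        cloud.append(f"vm-{i}")
    cloud.append("load-balancer-1")


def declarative_apply(cloud: list, desired: dict) -> list:
    """Declarative: compare the target state with the actual state and
    create only what is missing. Returns the actions that were taken."""
    actions = []
    for kind, count in desired.items():
        have = sum(1 for name in cloud if name.startswith(kind))
        for i in range(have + 1, count + 1):
            cloud.append(f"{kind}-{i}")
            actions.append(f"create {kind}-{i}")
    return actions


if __name__ == "__main__":
    desired = {"vm": 3, "load-balancer": 1}  # "I want 3 VMs and a Load Balancer."
    cloud = []
    print(declarative_apply(cloud, desired))  # first run creates four resources
    print(declarative_apply(cloud, desired))  # second run finds nothing to do

    scripted = []
    imperative_setup(scripted)
    imperative_setup(scripted)  # running the steps twice duplicates everything
    print(len(scripted))
```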

&lt;h3 id=&quot;push-vs-pull&quot;&gt;Push vs. Pull&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Push&lt;/strong&gt;: Central server pushes configuration to child servers (e.g., Ansible, Terraform).&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Pull&lt;/strong&gt;: Child servers periodically check central server for configuration (e.g., Puppet, Chef).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;iac-tools-comparison&quot;&gt;IaC Tools Comparison&lt;/h3&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Tool&lt;/th&gt;
      &lt;th&gt;Style&lt;/th&gt;
      &lt;th&gt;Method&lt;/th&gt;
      &lt;th&gt;Mechanism&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Ansible&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Declarative/Imperative&lt;/td&gt;
      &lt;td&gt;Push&lt;/td&gt;
      &lt;td&gt;SSH (Agentless)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Puppet&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Declarative&lt;/td&gt;
      &lt;td&gt;Pull&lt;/td&gt;
      &lt;td&gt;Agent-based&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Chef&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Imperative&lt;/td&gt;
      &lt;td&gt;Pull&lt;/td&gt;
      &lt;td&gt;Agent-based&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Terraform&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Declarative&lt;/td&gt;
      &lt;td&gt;Push&lt;/td&gt;
      &lt;td&gt;API (Agentless)&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;h2 id=&quot;terraform-by-hashicorp&quot;&gt;Terraform by HashiCorp&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Terraform&lt;/strong&gt; is one of the most widely used tools for cloud provisioning.&lt;/p&gt;

&lt;h3 id=&quot;key-features&quot;&gt;Key Features&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Open Source&lt;/strong&gt;: Huge community.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Platform Agnostic&lt;/strong&gt;: Supports AWS, Azure, GCP, Kubernetes, and hundreds more.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Stateful&lt;/strong&gt;: Maintains a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tfstate&lt;/code&gt; file to know what currently exists.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Graph-Based&lt;/strong&gt;: Understands dependencies (e.g., create VPC &lt;em&gt;before&lt;/em&gt; creating Subnet).&lt;/li&gt;
&lt;/ul&gt;
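
&lt;p&gt;The graph-based idea can be illustrated with a topological sort. The sketch below uses Python’s standard graphlib module (Python 3.9 or later) on a made-up set of resources; it shows the general technique of ordering resources by their dependencies and is not a description of Terraform’s internal implementation.&lt;/p&gt;

```python
from graphlib import TopologicalSorter

# Hypothetical resources: each key depends on the resources in its set.
dependencies = {
    "vpc": set(),
    "subnet": {"vpc"},
    "instance": {"subnet"},
    "load_balancer": {"subnet", "instance"},
}

# static_order() yields every resource after all of its dependencies.
creation_order = list(TopologicalSorter(dependencies).static_order())

if __name__ == "__main__":
    print(creation_order)
```

&lt;p&gt;Reversing the same order gives a safe sequence for tearing the resources down again.&lt;/p&gt;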

&lt;h3 id=&quot;workflow&quot;&gt;Workflow&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Write Code&lt;/strong&gt;: Define resources in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.tf&lt;/code&gt; files.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Plan&lt;/strong&gt;: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;terraform plan&lt;/code&gt; compares code to current state and shows what &lt;em&gt;will&lt;/em&gt; happen.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Apply&lt;/strong&gt;: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;terraform apply&lt;/code&gt; executes the API calls to make the changes real.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;basic-syntax-hcl---hashicorp-configuration-language&quot;&gt;Basic Syntax (HCL - HashiCorp Configuration Language)&lt;/h3&gt;

&lt;p&gt;Terraform uses HCL, which is designed to be both machine-readable and human-readable.&lt;/p&gt;

&lt;h4 id=&quot;resource-definition&quot;&gt;Resource Definition&lt;/h4&gt;

&lt;div class=&quot;language-hcl highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;# &amp;lt;Resource Type&amp;gt;     &amp;lt;Local Name&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nx&quot;&gt;resource&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;aws_instance&quot;&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;web_server&quot;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;# Arguments&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;ami&lt;/span&gt;           &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;ami-0c55b159cbfafe1f0&quot;&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;instance_type&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;t2.micro&quot;&lt;/span&gt;
  
  &lt;span class=&quot;nx&quot;&gt;tags&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;nx&quot;&gt;Name&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;MyWebServer&quot;&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Resource Type&lt;/strong&gt;: Defined by the provider (e.g., &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;aws_instance&lt;/code&gt;).&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Local Name&lt;/strong&gt;: Used to reference this resource elsewhere in code (e.g., &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;aws_instance.web_server.public_ip&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;
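
&lt;p&gt;As a small sketch of how the local name is used, the reference from the bullet above can be exposed as an output value. The output name and description are illustrative choices; the block assumes the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;aws_instance.web_server&lt;/code&gt; resource defined above.&lt;/p&gt;

&lt;div class=&quot;language-hcl highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Printed after `terraform apply` and readable via `terraform output`
output &quot;web_server_public_ip&quot; {
  description = &quot;Public IP address of the web_server instance&quot;
  value       = aws_instance.web_server.public_ip
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;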

&lt;h4 id=&quot;variables&quot;&gt;Variables&lt;/h4&gt;

&lt;div class=&quot;language-hcl highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nx&quot;&gt;variable&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;region&quot;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;type&lt;/span&gt;    &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;string&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;default&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;us-east-1&quot;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;nx&quot;&gt;provider&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;aws&quot;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;region&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;var&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;region&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
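
&lt;p&gt;The default above applies only when no other value is supplied. One common way to override it, shown here as a sketch, is a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;terraform.tfvars&lt;/code&gt; file, which Terraform loads automatically from the working directory; the region value is just an example.&lt;/p&gt;

&lt;div class=&quot;language-hcl highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# terraform.tfvars
# Overrides the default of the &quot;region&quot; variable (example value)
region = &quot;eu-west-1&quot;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The same value could instead be passed on the command line with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-var&lt;/code&gt;, which takes precedence over the file.&lt;/p&gt;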

&lt;h4 id=&quot;interpolation&quot;&gt;Interpolation&lt;/h4&gt;

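&lt;p&gt;Referencing values from other resources (since Terraform 0.12, a bare expression is enough; the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;${...}&lt;/code&gt; interpolation syntax is only needed inside strings):&lt;/p&gt;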

&lt;div class=&quot;language-hcl highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nx&quot;&gt;resource&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;aws_eip&quot;&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;lb&quot;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;# Reference the instance ID from the resource above&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;instance&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;aws_instance&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;web_server&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;id&lt;/span&gt;
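  &lt;span class=&quot;nx&quot;&gt;domain&lt;/span&gt;   &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;vpc&quot;&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# replaces the deprecated `vpc = true` (AWS provider v5+)&lt;/span&gt;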
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;module-structure&quot;&gt;Module Structure&lt;/h3&gt;

&lt;p&gt;Standard file organization:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;main.tf&lt;/strong&gt;: Core resource definitions.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;variables.tf&lt;/strong&gt;: Input variable definitions (parameters).&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;outputs.tf&lt;/strong&gt;: Output values (e.g., Load Balancer IP).&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;providers.tf&lt;/strong&gt;: Provider configuration and version constraints (e.g., for the AWS or Azure provider).&lt;/li&gt;
&lt;/ul&gt;
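
&lt;p&gt;A &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;providers.tf&lt;/code&gt; in this layout might look like the sketch below. The version constraints are illustrative assumptions, not recommendations from the lecture; pin whatever versions your project has actually been tested against.&lt;/p&gt;

&lt;div class=&quot;language-hcl highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# providers.tf
terraform {
  # Example constraints only (assumptions)
  required_version = &quot;&amp;gt;= 1.0&quot;

  required_providers {
    aws = {
      source  = &quot;hashicorp/aws&quot;
      version = &quot;~&amp;gt; 5.0&quot; # any 5.x release
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;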

&lt;h3 id=&quot;essential-commands&quot;&gt;Essential Commands&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;terraform init&lt;/code&gt;&lt;/strong&gt;
    &lt;ul&gt;
      &lt;li&gt;Downloads provider plugins (e.g., AWS provider code).&lt;/li&gt;
      &lt;li&gt;Initializes the backend (state storage).&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;terraform plan&lt;/code&gt;&lt;/strong&gt;
    &lt;ul&gt;
      &lt;li&gt;&lt;strong&gt;Dry run&lt;/strong&gt;: Shows execution plan.&lt;/li&gt;
      &lt;li&gt;Does NOT make changes.&lt;/li&gt;
      &lt;li&gt;Vital for safety (“measure twice, cut once”).&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;terraform apply&lt;/code&gt;&lt;/strong&gt;
    &lt;ul&gt;
      &lt;li&gt;Executes the plan.&lt;/li&gt;
      &lt;li&gt;Creates/Updates/Deletes resources.&lt;/li&gt;
      &lt;li&gt;Can be destructive: review the plan output before confirming.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;terraform destroy&lt;/code&gt;&lt;/strong&gt;
    &lt;ul&gt;
      &lt;li&gt;Removes all resources managed by the configuration.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ol&gt;
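
&lt;p&gt;The backend that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;terraform init&lt;/code&gt; initializes is declared in the configuration. The sketch below stores state in an S3 bucket instead of a local file; the bucket name and key are hypothetical, and the bucket is assumed to exist already.&lt;/p&gt;

&lt;div class=&quot;language-hcl highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;terraform {
  backend &quot;s3&quot; {
    bucket = &quot;example-terraform-state&quot; # hypothetical, must already exist
    key    = &quot;prod/terraform.tfstate&quot;  # path of the state object in the bucket
    region = &quot;us-east-1&quot;
  }
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Keeping state in a shared backend lets a team run &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;plan&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;apply&lt;/code&gt; against the same view of the infrastructure.&lt;/p&gt;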

&lt;h2 id=&quot;use-cases-for-terraform&quot;&gt;Use Cases for Terraform&lt;/h2&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Multi-Tier Applications&lt;/strong&gt;:
    &lt;ul&gt;
      &lt;li&gt;Define Web, App, and Database layers together.&lt;/li&gt;
      &lt;li&gt;Pass dependencies (DB Connection String) automatically to App layer.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Disposable Environments&lt;/strong&gt;:
    &lt;ul&gt;
      &lt;li&gt;Spin up a full “Staging” environment that mimics “Production” exactly.&lt;/li&gt;
      &lt;li&gt;Destroy it when done to save money.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Multi-Cloud Deployments&lt;/strong&gt;:
    &lt;ul&gt;
      &lt;li&gt;Manage AWS (compute) and Cloudflare (DNS) in the same workflow.&lt;/li&gt;
      &lt;li&gt;Although code isn’t 100% portable (AWS resources != Azure resources), the &lt;em&gt;workflow&lt;/em&gt; is identical.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ol&gt;
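
&lt;p&gt;The multi-tier case can be sketched as below: the database address is handed to the application instance through a reference, so Terraform creates the database first. All names, sizes, and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;app_init.sh.tpl&lt;/code&gt; template file are hypothetical, and a configured AWS provider is assumed.&lt;/p&gt;

&lt;div class=&quot;language-hcl highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Hypothetical inputs for this sketch
variable &quot;app_ami&quot; {
  type = string
}

variable &quot;db_username&quot; {
  type = string
}

variable &quot;db_password&quot; {
  type      = string
  sensitive = true # keeps the value out of plan output
}

# Database layer
resource &quot;aws_db_instance&quot; &quot;app_db&quot; {
  engine              = &quot;postgres&quot;
  instance_class      = &quot;db.t3.micro&quot;
  allocated_storage   = 20
  username            = var.db_username
  password            = var.db_password
  skip_final_snapshot = true
}

# App layer: the reference to aws_db_instance.app_db.address makes
# Terraform create the database before this instance.
resource &quot;aws_instance&quot; &quot;app&quot; {
  ami           = var.app_ami
  instance_type = &quot;t3.micro&quot;

  # app_init.sh.tpl is an assumed bootstrap template that reads db_host
  user_data = templatefile(&quot;${path.module}/app_init.sh.tpl&quot;, {
    db_host = aws_db_instance.app_db.address
  })
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;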

&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;/h2&gt;

&lt;p&gt;Infrastructure as Code allows us to treat operations like software development:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Version Control&lt;/strong&gt; (Git) your infrastructure.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Peer Review&lt;/strong&gt; changes before applying.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Automate&lt;/strong&gt; testing and deployment.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Terraform&lt;/strong&gt; provides a declarative way to manage large-scale infrastructure across many cloud providers.&lt;/li&gt;
&lt;/ul&gt;
</content>
 </entry>
 
 <entry>
   <title>13 Introduction to Deployment and Security</title>
   <link href="https://nglelinh.github.io/contents/en/chapter13/13_Introduction/"/>
   <updated>2021-01-01T00:00:00+00:00</updated>
   <id>https://nglelinh.github.io/service-oriented-architecture-and-cloud-computing-iuh/contents/en/chapter13/13_Introduction</id>
   <content type="html">&lt;p&gt;This chapter focuses on the operational aspects of cloud computing: how to securely deploy, manage, and monitor applications in a production environment.&lt;/p&gt;

&lt;h2 id=&quot;learning-objectives&quot;&gt;Learning Objectives&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;Understand cloud security fundamentals: Shared Responsibility Model, IAM, Encryption&lt;/li&gt;
  &lt;li&gt;Explore deployment strategies: Blue/Green, Canary, Rolling Updates&lt;/li&gt;
  &lt;li&gt;Configure CI/CD pipelines for automated delivery&lt;/li&gt;
  &lt;li&gt;Implement monitoring and logging for observability&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;operations--security&quot;&gt;Operations &amp;amp; Security&lt;/h2&gt;

&lt;p&gt;Building the application is only half the battle. Running it securely and reliably requires a deep understanding of network security groups, identity management, and automated deployment pipelines to reduce human error.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>13-01 Deployment, Security, and Compliance</title>
   <link href="https://nglelinh.github.io/contents/en/chapter13/13_01_Deployment_and_Security/"/>
   <updated>2021-01-01T00:00:00+00:00</updated>
   <id>https://nglelinh.github.io/service-oriented-architecture-and-cloud-computing-iuh/contents/en/chapter13/13_01_Deployment_and_Security</id>
   <content type="html">&lt;p&gt;Deploying applications to the cloud requires not just getting the code running, but doing so securely and in compliance with regulations. This lecture covers deployment strategies, identity management, and compliance frameworks.&lt;/p&gt;

&lt;h2 id=&quot;deployment-strategies&quot;&gt;Deployment Strategies&lt;/h2&gt;

&lt;h3 id=&quot;the-need-for-automated-deployment&quot;&gt;The Need for Automated Deployment&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Deployment is not just “uploading files”&lt;/strong&gt;. It involves:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Infrastructure provisioning (VMs, databases, networks)&lt;/li&gt;
  &lt;li&gt;Configuration management&lt;/li&gt;
  &lt;li&gt;Application deployment&lt;/li&gt;
  &lt;li&gt;Managing dependencies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why automate?&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Consistency&lt;/strong&gt;: Eliminate “it works on my machine” issues&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Scale&lt;/strong&gt;: Deploy to hundreds of servers as easily as one&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Speed&lt;/strong&gt;: Rapid iteration and recovery&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Reliability&lt;/strong&gt;: Reduce human error&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;infrastructure-as-code-iac-overview&quot;&gt;Infrastructure as Code (IaC) Overview&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Note: We will cover IaC in depth in Chapter 14.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;“Infrastructure as Code” is the practice of defining and deploying all your cloud services from version-controlled code rather than manually via the portal.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Options&lt;/strong&gt;:&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;ARM Templates / Bicep&lt;/strong&gt; (Azure native)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;CloudFormation&lt;/strong&gt; (AWS native)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Terraform&lt;/strong&gt; (Multi-cloud, industry standard)&lt;/li&gt;
&lt;/ol&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Tool&lt;/th&gt;
      &lt;th&gt;Pros&lt;/th&gt;
      &lt;th&gt;Cons&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;ARM Templates&lt;/strong&gt; (JSON)&lt;/td&gt;
      &lt;td&gt;Native to Azure, comprehensive&lt;/td&gt;
      &lt;td&gt;Verbose, difficult to read/write&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Bicep&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Domain Specific Language (DSL), easy to read&lt;/td&gt;
      &lt;td&gt;Azure only, newer ecosystem&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Terraform&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Multi-cloud, massive community&lt;/td&gt;
      &lt;td&gt;State management complexity&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;h3 id=&quot;deployment-pipelines-cicd&quot;&gt;Deployment Pipelines (CI/CD)&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;GitHub Actions&lt;/strong&gt; / &lt;strong&gt;Azure DevOps&lt;/strong&gt; allow you to:&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Build&lt;/strong&gt; your code&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Test&lt;/strong&gt; your application&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Provision&lt;/strong&gt; infrastructure (e.g., using Terraform)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Deploy&lt;/strong&gt; artifacts&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Verify&lt;/strong&gt; health&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;identity-and-access-management-iam&quot;&gt;Identity and Access Management (IAM)&lt;/h2&gt;

&lt;h3 id=&quot;azure-active-directory-azure-ad&quot;&gt;Azure Active Directory (Azure AD)&lt;/h3&gt;

&lt;p&gt;Azure AD (now Microsoft Entra ID) is Microsoft’s cloud identity provider.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Single Sign-On (SSO)&lt;/strong&gt;: One login for all apps&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;App Registrations&lt;/strong&gt;: Identity for applications&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Service Principals&lt;/strong&gt;: Machine identities for automation&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Managed Identities&lt;/strong&gt;: Passwordless authentication for Azure resources&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;role-based-access-control-rbac&quot;&gt;Role-Based Access Control (RBAC)&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Principle&lt;/strong&gt;: Grant access based on roles, not individual users.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Common Roles&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Owner&lt;/strong&gt;: Full access to manage resources, and can assign roles to others&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Contributor&lt;/strong&gt;: Full access to manage resources, but cannot assign roles&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Reader&lt;/strong&gt;: View only, no changes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Scope&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Access can be scoped at different levels:
    &lt;ul&gt;
      &lt;li&gt;&lt;strong&gt;Management Group&lt;/strong&gt; (Organization)&lt;/li&gt;
      &lt;li&gt;&lt;strong&gt;Subscription&lt;/strong&gt; (Billing unit)&lt;/li&gt;
      &lt;li&gt;&lt;strong&gt;Resource Group&lt;/strong&gt; (Logical grouping)&lt;/li&gt;
      &lt;li&gt;&lt;strong&gt;Resource&lt;/strong&gt; (Individual item)&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
  &lt;p&gt;[!TIP]
&lt;strong&gt;Least Privilege Principle&lt;/strong&gt;: Always grant the minimum permission necessary to perform the task. Start with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Reader&lt;/code&gt; and escalate only if needed.&lt;/p&gt;
&lt;/blockquote&gt;
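
&lt;p&gt;As a sketch of least privilege in practice, the following Terraform configuration grants the built-in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Reader&lt;/code&gt; role at Resource Group scope only. The resource group name, location, and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;auditor_object_id&lt;/code&gt; variable are hypothetical; subscription and credentials are assumed to come from the environment.&lt;/p&gt;

&lt;div class=&quot;language-hcl highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;provider &quot;azurerm&quot; {
  features {}
}

# Object ID of the user, group, or service principal (hypothetical input)
variable &quot;auditor_object_id&quot; {
  type = string
}

resource &quot;azurerm_resource_group&quot; &quot;app&quot; {
  name     = &quot;rg-example-app&quot;
  location = &quot;westeurope&quot;
}

# Scope is the Resource Group, not the whole Subscription,
# and the role is Reader, the narrowest of the three common roles.
resource &quot;azurerm_role_assignment&quot; &quot;auditor_reader&quot; {
  scope                = azurerm_resource_group.app.id
  role_definition_name = &quot;Reader&quot;
  principal_id         = var.auditor_object_id
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;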

&lt;h3 id=&quot;access-control-lists-acls&quot;&gt;Access Control Lists (ACLs)&lt;/h3&gt;

&lt;p&gt;While RBAC controls access to the &lt;strong&gt;resource&lt;/strong&gt; (e.g., “Can you see this Storage Account?”), ACLs often control access to the &lt;strong&gt;data&lt;/strong&gt; within (e.g., “Can you read this specific file?”).&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;RBAC&lt;/strong&gt;: Broad access (mostly Control Plane; Azure also defines data-plane roles such as Storage Blob Data Reader)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;ACLs&lt;/strong&gt;: Fine-grained access (Data Plane)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;security-and-compliance&quot;&gt;Security and Compliance&lt;/h2&gt;

&lt;h3 id=&quot;why-compliance-matters&quot;&gt;Why Compliance Matters?&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Compliance&lt;/strong&gt; means conformity in fulfilling official requirements.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Legal requirement&lt;/strong&gt;: Fines for non-compliance can be massive.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Trust&lt;/strong&gt;: Customers require proof that you protect their data.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Standards&lt;/strong&gt;: Provides a framework for security best practices.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;key-compliance-frameworks&quot;&gt;Key Compliance Frameworks&lt;/h3&gt;

&lt;h4 id=&quot;healthcare&quot;&gt;Healthcare&lt;/h4&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;HIPAA&lt;/strong&gt; (Health Insurance Portability and Accountability Act): US law protecting medical info.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;HITECH&lt;/strong&gt; (Health Information Technology for Economic and Clinical Health Act): Extends HIPAA for electronic health records.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;HITRUST&lt;/strong&gt;: Certification framework normalizing multiple standards (HIPAA, ISO, NIST).&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;financial&quot;&gt;Financial&lt;/h4&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;SOX&lt;/strong&gt; (Sarbanes-Oxley Act): US federal law from 2002 mandating financial record keeping and reporting controls. Enacted in response to the Enron and WorldCom scandals.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;PCI DSS&lt;/strong&gt;: Payment Card Industry Data Security Standard (for handling credit cards).&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;general--international&quot;&gt;General / International&lt;/h4&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;ISO 27001&lt;/strong&gt;: International standard for Information Security Management Systems (ISMS).&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;SOC 2&lt;/strong&gt; (System and Organization Controls 2, formerly Service Organization Control): Audit report for service providers that assesses controls for security, availability, processing integrity, confidentiality, and privacy.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;GDPR&lt;/strong&gt; (General Data Protection Regulation): EU law on data protection and privacy.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;azure-blueprints&quot;&gt;Azure Blueprints&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Azure Blueprints&lt;/strong&gt; enables you to define a repeatable set of Azure resources that implements and adheres to an organization’s standards, patterns, and requirements.&lt;/p&gt;

&lt;p&gt;It orchestrates:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Role Assignments&lt;/strong&gt;&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Policy Assignments&lt;/strong&gt;&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;ARM Templates&lt;/strong&gt;&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Resource Groups&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

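&lt;p&gt;This allows you to “stamp out” compliant environments rapidly. Note that Blueprints remained a preview service and Microsoft has announced its deprecation, recommending Template Specs and Deployment Stacks as replacements.&lt;/p&gt;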

&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Deployment&lt;/strong&gt;: Move from manual to automated (CI/CD, IaC).&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Identity&lt;/strong&gt;: Use RBAC and Azure AD for secure access control.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Security&lt;/strong&gt;: Understand the shared responsibility model.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Compliance&lt;/strong&gt;: Adhere to standards (HIPAA, SOX, ISO) relevant to your industry.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the next chapter, we will dive deep into &lt;strong&gt;Infrastructure as Code&lt;/strong&gt; with Terraform.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>12 Introduction to Cloud Providers</title>
   <link href="https://nglelinh.github.io/contents/en/chapter12/12_Introduction/"/>
   <updated>2021-01-01T00:00:00+00:00</updated>
   <id>https://nglelinh.github.io/service-oriented-architecture-and-cloud-computing-iuh/contents/en/chapter12/12_Introduction</id>
   <content type="html">&lt;p&gt;This chapter compares the major public cloud providers—AWS, Azure, and Google Cloud Platform (GCP)—and explores their core service offerings.&lt;/p&gt;

&lt;h2 id=&quot;learning-objectives&quot;&gt;Learning Objectives&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;Compare the “Big 3” cloud providers in terms of global infrastructure and market share&lt;/li&gt;
  &lt;li&gt;Map core services across providers (e.g., EC2 = VM, S3 = Blob Storage)&lt;/li&gt;
  &lt;li&gt;Understand the pricing models and “Free Tier” offerings&lt;/li&gt;
  &lt;li&gt;Analyze the strengths and use cases for each provider&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;the-cloud-landscape&quot;&gt;The Cloud Landscape&lt;/h2&gt;

&lt;p&gt;While the underlying concepts (virtualization, storage, networking) are the same, each provider wraps them in their own terminology, APIs, and management consoles. A multi-cloud architect must understand these nuances to select the right platform for the job.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>12-01 Cloud Providers - AWS, Azure, and GCP</title>
   <link href="https://nglelinh.github.io/contents/en/chapter12/12_01_Cloud_Providers/"/>
   <updated>2021-01-01T00:00:00+00:00</updated>
   <id>https://nglelinh.github.io/service-oriented-architecture-and-cloud-computing-iuh/contents/en/chapter12/12_01_Cloud_Providers</id>
   <content type="html">&lt;p&gt;Cloud providers offer on-demand computing resources and services that power modern applications. This lecture compares the major cloud providers and their offerings.&lt;/p&gt;

&lt;h2 id=&quot;major-cloud-providers&quot;&gt;Major Cloud Providers&lt;/h2&gt;

&lt;h3 id=&quot;the-big-three&quot;&gt;The Big Three&lt;/h3&gt;

&lt;p&gt;The cloud computing market is dominated by three major providers:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Amazon Web Services (AWS)&lt;/strong&gt; - Market leader&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Microsoft Azure&lt;/strong&gt; - Strong enterprise presence&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Google Cloud Platform (GCP)&lt;/strong&gt; - Innovation leader&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;amazon-web-services-aws&quot;&gt;Amazon Web Services (AWS)&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Market Position&lt;/strong&gt;: The largest public cloud by market share&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Usage&lt;/strong&gt;: Powers millions of applications&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Netflix&lt;/strong&gt;: Video streaming infrastructure&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Reddit&lt;/strong&gt;: Social platform backend&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Airbnb&lt;/strong&gt;: Marketplace platform backend&lt;/li&gt;
  &lt;li&gt;Many other applications, from startups to large enterprises&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
  &lt;p&gt;[!WARNING]
&lt;strong&gt;Critical Infrastructure&lt;/strong&gt;&lt;/p&gt;

  &lt;p&gt;When a major AWS region goes down, a large part of the internet goes down with it. Example: the S3 outage in the us-east-1 region on February 28, 2017 affected thousands of websites and services.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4 id=&quot;aws-key-services&quot;&gt;AWS Key Services&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Compute&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;EC2&lt;/strong&gt; (Elastic Compute Cloud): Virtual machines&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Lambda&lt;/strong&gt;: Serverless functions&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;ECS/EKS&lt;/strong&gt;: Container orchestration&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Storage&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;S3&lt;/strong&gt; (Simple Storage Service): Object storage (powers much of the internet)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;EBS&lt;/strong&gt;: Block storage for EC2&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Glacier&lt;/strong&gt;: Archival storage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Database&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;RDS&lt;/strong&gt;: Relational databases (MySQL, PostgreSQL, etc.)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;DynamoDB&lt;/strong&gt;: NoSQL database&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Aurora&lt;/strong&gt;: High-performance relational database&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Big Data&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;EMR&lt;/strong&gt; (Elastic MapReduce): Managed Hadoop/Spark&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Redshift&lt;/strong&gt;: Data warehouse&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Kinesis&lt;/strong&gt;: Real-time data streaming&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;microsoft-azure&quot;&gt;Microsoft Azure&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Market Position&lt;/strong&gt;: Second largest, strong in enterprise&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Strengths&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Integration with Microsoft products (Office 365, Active Directory)&lt;/li&gt;
  &lt;li&gt;Hybrid cloud capabilities&lt;/li&gt;
  &lt;li&gt;Enterprise support&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;azure-key-services&quot;&gt;Azure Key Services&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Compute&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Virtual Machines&lt;/strong&gt;: Similar to AWS EC2&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;App Service&lt;/strong&gt;: PaaS for web apps&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Azure Functions&lt;/strong&gt;: Serverless&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Storage&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Azure Storage&lt;/strong&gt;: Blob, file, queue, table storage&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Azure Data Lake&lt;/strong&gt;: Big data storage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Database&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Azure SQL Database&lt;/strong&gt;: Managed SQL Server&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Cosmos DB&lt;/strong&gt;: Globally distributed NoSQL&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Azure Database for PostgreSQL/MySQL&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Big Data&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;HDInsight&lt;/strong&gt;: Managed Hadoop/Spark/Kafka&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Azure Synapse Analytics&lt;/strong&gt;: Data warehouse&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Azure Databricks&lt;/strong&gt;: Apache Spark platform&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;google-cloud-platform-gcp&quot;&gt;Google Cloud Platform (GCP)&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Market Position&lt;/strong&gt;: Third largest, innovation leader&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Strengths&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Advanced data analytics and ML&lt;/li&gt;
  &lt;li&gt;Kubernetes (originated from Google)&lt;/li&gt;
  &lt;li&gt;Global network infrastructure&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;gcp-key-services&quot;&gt;GCP Key Services&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Compute&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Compute Engine&lt;/strong&gt;: Virtual machines&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Cloud Functions&lt;/strong&gt;: Serverless&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;GKE&lt;/strong&gt; (Google Kubernetes Engine): Managed Kubernetes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Storage&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Cloud Storage&lt;/strong&gt;: Object storage&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Persistent Disk&lt;/strong&gt;: Block storage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Database&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Cloud SQL&lt;/strong&gt;: Managed MySQL/PostgreSQL&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Cloud Spanner&lt;/strong&gt;: Globally distributed database&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Bigtable&lt;/strong&gt;: NoSQL for big data&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;BigQuery&lt;/strong&gt;: Serverless data warehouse&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Big Data&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Dataproc&lt;/strong&gt;: Managed Hadoop/Spark&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Dataflow&lt;/strong&gt;: Stream and batch processing&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Pub/Sub&lt;/strong&gt;: Messaging service&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;feature-parity&quot;&gt;Feature Parity&lt;/h2&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;Key Insight&lt;/strong&gt;: All clouds try to compete on features, so they all end up having extremely similar feature sets.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Service Type&lt;/th&gt;
      &lt;th&gt;AWS&lt;/th&gt;
      &lt;th&gt;Azure&lt;/th&gt;
      &lt;th&gt;GCP&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;VMs&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;EC2&lt;/td&gt;
      &lt;td&gt;Virtual Machines&lt;/td&gt;
      &lt;td&gt;Compute Engine&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Serverless&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Lambda&lt;/td&gt;
      &lt;td&gt;Azure Functions&lt;/td&gt;
      &lt;td&gt;Cloud Functions&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Containers&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;ECS/EKS&lt;/td&gt;
      &lt;td&gt;AKS&lt;/td&gt;
      &lt;td&gt;GKE&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Object Storage&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;S3&lt;/td&gt;
      &lt;td&gt;Blob Storage&lt;/td&gt;
      &lt;td&gt;Cloud Storage&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;NoSQL&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;DynamoDB&lt;/td&gt;
      &lt;td&gt;Cosmos DB&lt;/td&gt;
      &lt;td&gt;Bigtable&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Data Warehouse&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Redshift&lt;/td&gt;
      &lt;td&gt;Synapse&lt;/td&gt;
      &lt;td&gt;BigQuery&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Big Data&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;EMR&lt;/td&gt;
      &lt;td&gt;HDInsight&lt;/td&gt;
      &lt;td&gt;Dataproc&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;h2 id=&quot;virtual-machines-comparison&quot;&gt;Virtual Machines Comparison&lt;/h2&gt;

&lt;h3 id=&quot;aws-ec2&quot;&gt;AWS EC2&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Range&lt;/strong&gt;: From tiny to gigantic&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;t2.nano&lt;/strong&gt;: 1 vCPU, 0.5 GiB RAM&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;x1.32xlarge&lt;/strong&gt;: 128 vCPU, 1,952 GiB RAM&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Features&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;✓ GPUs available (useful for deep learning)&lt;/li&gt;
  &lt;li&gt;✓ Billed per second (with a 60-second minimum)&lt;/li&gt;
  &lt;li&gt;✓ &lt;strong&gt;On-Demand&lt;/strong&gt;: Pay for what you use&lt;/li&gt;
  &lt;li&gt;✓ &lt;strong&gt;Spot Instances&lt;/strong&gt;: Unused capacity sold at a steep discount (originally an auction; prices now adjust gradually with supply and demand)
    &lt;ul&gt;
      &lt;li&gt;Caveat: AWS may reclaim the VM with only a two-minute warning&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;azure-virtual-machines&quot;&gt;Azure Virtual Machines&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Similar to AWS&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;✓ GPUs available&lt;/li&gt;
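  &lt;li&gt;⚠ Max 32 vCPUs (at the time these figures were collected)&lt;/li&gt;
  &lt;li&gt;⚠ Max 800 GB RAM (at the time these figures were collected)&lt;/li&gt;
  &lt;li&gt;Note: Most applications won’t hit these limits, and newer M-series sizes go well beyond them (hundreds of vCPUs, several TB of RAM)&lt;/li&gt;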
&lt;/ul&gt;

&lt;h3 id=&quot;google-compute-engine&quot;&gt;Google Compute Engine&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Features&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;✓ Largest: 96 vCPU, 624 GB RAM&lt;/li&gt;
  &lt;li&gt;✓ &lt;strong&gt;Custom-sized machines&lt;/strong&gt; (unique feature)&lt;/li&gt;
  &lt;li&gt;✓ Per-second billing&lt;/li&gt;
  &lt;li&gt;✓ Sustained use discounts&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;storage-services&quot;&gt;Storage Services&lt;/h2&gt;

&lt;h3 id=&quot;object-storage&quot;&gt;Object Storage&lt;/h3&gt;

&lt;p&gt;All three providers offer massive-scale object storage:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AWS S3&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Most widely used&lt;/li&gt;
  &lt;li&gt;Powers much of the internet (e.g., Imgur)&lt;/li&gt;
  &lt;li&gt;Designed for 11 nines of durability (99.999999999%)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Azure Blob Storage&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Similar to S3&lt;/li&gt;
  &lt;li&gt;Integrated with Azure ecosystem&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Google Cloud Storage&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Similar to S3&lt;/li&gt;
  &lt;li&gt;Multiple storage classes&lt;/li&gt;
&lt;/ul&gt;
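
&lt;p&gt;To put the durability figure quoted for S3 above into perspective, here is a rough arithmetic sketch. It assumes the percentage can be read as a per-object, per-year survival probability; the bucket size of ten million objects is a hypothetical value chosen for illustration.&lt;/p&gt;

```python
# Rough arithmetic for "11 nines" of annual durability.
# Assumption: durability is a per-object, per-year survival probability.
annual_loss_probability = 1e-11      # 1 - 0.99999999999

objects_stored = 10_000_000          # hypothetical bucket size
expected_losses_per_year = objects_stored * annual_loss_probability
years_per_lost_object = 1 / expected_losses_per_year

print(f"Expected losses per year: {expected_losses_per_year:.4f}")
print(f"Roughly one object lost every {years_per_lost_object:,.0f} years")
```

&lt;p&gt;Under these assumptions, losing a single object is expected about once every ten thousand years, which is why durability is rarely the deciding factor between providers.&lt;/p&gt;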

&lt;h3 id=&quot;use-cases&quot;&gt;Use Cases&lt;/h3&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Web Assets → S3/Blob/GCS
Backups → Glacier/Archive/Coldline
Big Data → Data Lake on S3/ADLS/GCS
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;hosted-data-processing&quot;&gt;Hosted Data Processing&lt;/h2&gt;

&lt;p&gt;All providers offer managed big data services:&lt;/p&gt;

&lt;h3 id=&quot;amazon-emr&quot;&gt;Amazon EMR&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;Managed Hadoop, Spark, HBase, Presto, Hive&lt;/li&gt;
  &lt;li&gt;Automatic cluster scaling and provisioning&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;microsoft-hdinsight&quot;&gt;Microsoft HDInsight&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;Managed Hadoop, Spark, Kafka, HBase&lt;/li&gt;
  &lt;li&gt;Integration with Azure services&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;google-dataproc&quot;&gt;Google Dataproc&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;Managed Hadoop and Spark&lt;/li&gt;
  &lt;li&gt;Fast cluster creation (about 90 seconds)&lt;/li&gt;
  &lt;li&gt;Per-second billing&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;database-services&quot;&gt;Database Services&lt;/h2&gt;

&lt;h3 id=&quot;managed-relational-databases&quot;&gt;Managed Relational Databases&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;AWS RDS&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;MySQL, PostgreSQL, MariaDB, Oracle, SQL Server&lt;/li&gt;
  &lt;li&gt;Aurora (AWS proprietary, MySQL/PostgreSQL compatible)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Azure&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Azure SQL Database (SQL Server)&lt;/li&gt;
  &lt;li&gt;Azure Database for MySQL/PostgreSQL&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;GCP&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Cloud SQL (MySQL, PostgreSQL, SQL Server)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;nosql-databases&quot;&gt;NoSQL Databases&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;AWS DynamoDB&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Key-value and document database&lt;/li&gt;
  &lt;li&gt;Serverless, auto-scaling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Azure Cosmos DB&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Multi-model database&lt;/li&gt;
  &lt;li&gt;Global distribution&lt;/li&gt;
  &lt;li&gt;Multiple consistency levels&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;GCP Bigtable&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Wide-column NoSQL&lt;/li&gt;
  &lt;li&gt;Powers Google Search, Maps, Gmail&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;GCP BigQuery&lt;/strong&gt; (a data warehouse rather than a NoSQL store, listed here for comparison):&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Serverless data warehouse&lt;/li&gt;
  &lt;li&gt;Petabyte-scale analytics&lt;/li&gt;
  &lt;li&gt;SQL interface&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;cloud-regions-and-availability&quot;&gt;Cloud Regions and Availability&lt;/h2&gt;

&lt;h3 id=&quot;azure-regions&quot;&gt;Azure Regions&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;60+ Azure Regions&lt;/strong&gt; worldwide&lt;/li&gt;
  &lt;li&gt;Most extensive global presence&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Explore&lt;/strong&gt;: &lt;a href=&quot;https://datacenters.microsoft.com/globe/explore/&quot;&gt;Azure Datacenter Map&lt;/a&gt;&lt;/p&gt;

&lt;h3 id=&quot;reasons-to-select-a-region&quot;&gt;Reasons to Select a Region&lt;/h3&gt;

&lt;h4 id=&quot;1-cost&quot;&gt;1. Cost&lt;/h4&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Egress fees&lt;/strong&gt;: Moving data between regions costs money&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Regional pricing&lt;/strong&gt;: Some services cheaper in certain regions&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Tip&lt;/strong&gt;: Use pricing calculators to estimate costs&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;2-available-resources&quot;&gt;2. Available Resources&lt;/h4&gt;
&lt;ul&gt;
  &lt;li&gt;Not all services available in all regions&lt;/li&gt;
  &lt;li&gt;Check “Products by Region” documentation&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;3-security-and-compliance&quot;&gt;3. Security and Compliance&lt;/h4&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Government clouds&lt;/strong&gt;: Special regions for government data&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Data sovereignty&lt;/strong&gt;: Keep data in specific countries&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Compliance&lt;/strong&gt;: GDPR, HIPAA, etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;4-speed-latency&quot;&gt;4. Speed (Latency)&lt;/h4&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Pick closest region&lt;/strong&gt; to your users&lt;/li&gt;
  &lt;li&gt;Lower latency = better user experience&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;availability-zones&quot;&gt;Availability Zones&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Definition&lt;/strong&gt;: Physically separate locations within each region&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Purpose&lt;/strong&gt;: Tolerance to local failures&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Earthquakes&lt;/li&gt;
  &lt;li&gt;Floods&lt;/li&gt;
  &lt;li&gt;Fires&lt;/li&gt;
  &lt;li&gt;Power outages&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Characteristics&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Redundancy and logical isolation&lt;/li&gt;
  &lt;li&gt;High-performance network between zones&lt;/li&gt;
  &lt;li&gt;Round-trip latency &amp;lt; 2ms (Azure)&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;┌─────────────────────────────────────┐
│  Region: East US                    │
│  ┌───────────┐  ┌───────────┐      │
│  │  Zone 1   │  │  Zone 2   │      │
│  │ (DC 1,2)  │  │ (DC 3,4)  │      │
│  └───────────┘  └───────────┘      │
│  ┌───────────┐                      │
│  │  Zone 3   │                      │
│  │ (DC 5,6)  │                      │
│  └───────────┘                      │
└─────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
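
&lt;p&gt;The benefit of spreading a service across zones can be estimated with simple probability. The sketch below assumes zone failures are independent and that the service stays up as long as one zone is up; the 99.9% per-zone figure is a hypothetical input, not a provider guarantee.&lt;/p&gt;

```python
def combined_availability(zone_availability, zones):
    """Probability that at least one of `zones` independent replicas is up."""
    return 1 - (1 - zone_availability) ** zones

# Hypothetical figure: each zone is up 99.9% of the time.
for zone_count in (1, 2, 3):
    availability = combined_availability(0.999, zone_count)
    print(f"{zone_count} zone(s): {availability:.9f}")
```

&lt;p&gt;Real failures are not fully independent (a regional outage affects every zone), so this is an upper bound on what zone redundancy alone can deliver.&lt;/p&gt;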

&lt;h2 id=&quot;unique-features&quot;&gt;Unique Features&lt;/h2&gt;

&lt;h3 id=&quot;google-cloud-platform&quot;&gt;Google Cloud Platform&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Cloud Spanner&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Planet-scale distributed database&lt;/li&gt;
  &lt;li&gt;Strong consistency (CP system)&lt;/li&gt;
  &lt;li&gt;Horizontal scalability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Tensor Processing Unit (TPU)&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Custom hardware for deep learning&lt;/li&gt;
  &lt;li&gt;Faster than GPUs for specific workloads&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;BigQuery&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Serverless data warehouse&lt;/li&gt;
  &lt;li&gt;Scans terabytes in seconds and petabytes in minutes&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;aws&quot;&gt;AWS&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Absurdly large feature set&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;200+ services&lt;/li&gt;
  &lt;li&gt;Most mature ecosystem&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;FPGAs&lt;/strong&gt; (Field-Programmable Gate Arrays):&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Custom hardware acceleration&lt;/li&gt;
  &lt;li&gt;F1 instances&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Global Infrastructure&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Large global footprint of regions and availability zones&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;azure&quot;&gt;Azure&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Hybrid Cloud&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Azure Stack (on-premises Azure)&lt;/li&gt;
  &lt;li&gt;Azure Arc (manage resources anywhere)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Enterprise Integration&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Active Directory integration&lt;/li&gt;
  &lt;li&gt;Office 365 integration&lt;/li&gt;
  &lt;li&gt;Strong Windows support&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;cloud-security&quot;&gt;Cloud Security&lt;/h2&gt;

&lt;h3 id=&quot;key-security-features&quot;&gt;Key Security Features&lt;/h3&gt;

&lt;h4 id=&quot;data-storage&quot;&gt;Data Storage&lt;/h4&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Encryption at rest&lt;/strong&gt;: All providers encrypt stored data&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Encryption in transit&lt;/strong&gt;: HTTPS/TLS for data transfer&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Regulatory standards&lt;/strong&gt;: Compliance with GDPR, HIPAA, SOC 2, etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;compliance&quot;&gt;Compliance&lt;/h4&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Certifications&lt;/strong&gt;: ISO 27001, PCI DSS, FedRAMP&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Audit logs&lt;/strong&gt;: Track all access and changes&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Data residency&lt;/strong&gt;: Control where data is stored&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;data-migration&quot;&gt;Data Migration&lt;/h4&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Secure transfer&lt;/strong&gt;: Plan how sensitive data will be moved before migrating&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Physical transfer&lt;/strong&gt;: AWS Snowball, Azure Data Box&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Network transfer&lt;/strong&gt;: VPN, AWS Direct Connect, Azure ExpressRoute&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;cloud-permissions&quot;&gt;Cloud Permissions&lt;/h4&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;IAM&lt;/strong&gt; (Identity and Access Management)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Role-based access control&lt;/strong&gt; (RBAC)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Principle of least privilege&lt;/strong&gt;&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Example&lt;/strong&gt;: Students don’t get sudo access!&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;ddos-mitigation&quot;&gt;DDoS Mitigation&lt;/h4&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;AWS Shield&lt;/strong&gt;: DDoS protection&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Azure DDoS Protection&lt;/strong&gt;: DDoS mitigation for resources in virtual networks&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;GCP Cloud Armor&lt;/strong&gt;: DDoS protection and web application firewall&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;high-scalability&quot;&gt;High Scalability&lt;/h4&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Auto-scaling&lt;/strong&gt; with security settings&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Load balancing&lt;/strong&gt; across zones&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Redundancy&lt;/strong&gt; for high availability&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;top-benefits-of-cloud-computing&quot;&gt;Top Benefits of Cloud Computing&lt;/h2&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Cost Efficiency&lt;/strong&gt;
    &lt;ul&gt;
      &lt;li&gt;Pay only for what you use&lt;/li&gt;
      &lt;li&gt;No upfront capital expenditure&lt;/li&gt;
      &lt;li&gt;Economies of scale&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Scalability&lt;/strong&gt;
    &lt;ul&gt;
      &lt;li&gt;Scale up/down on demand&lt;/li&gt;
      &lt;li&gt;Handle traffic spikes&lt;/li&gt;
      &lt;li&gt;Global reach&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Reliability&lt;/strong&gt;
    &lt;ul&gt;
      &lt;li&gt;High availability (99.9%+ uptime)&lt;/li&gt;
      &lt;li&gt;Disaster recovery&lt;/li&gt;
      &lt;li&gt;Automatic backups&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Performance&lt;/strong&gt;
    &lt;ul&gt;
      &lt;li&gt;Latest hardware&lt;/li&gt;
      &lt;li&gt;Global CDN&lt;/li&gt;
      &lt;li&gt;Low latency&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Security&lt;/strong&gt;
    &lt;ul&gt;
      &lt;li&gt;Professional security teams&lt;/li&gt;
      &lt;li&gt;Compliance certifications&lt;/li&gt;
      &lt;li&gt;Regular updates&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Innovation&lt;/strong&gt;
    &lt;ul&gt;
      &lt;li&gt;Access to latest technologies&lt;/li&gt;
      &lt;li&gt;AI/ML services&lt;/li&gt;
      &lt;li&gt;Managed services&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;choosing-a-cloud-provider&quot;&gt;Choosing a Cloud Provider&lt;/h2&gt;

&lt;h3 id=&quot;decision-factors&quot;&gt;Decision Factors&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Existing Infrastructure&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Already using Microsoft? → Azure&lt;/li&gt;
  &lt;li&gt;Already using Google Workspace? → GCP&lt;/li&gt;
  &lt;li&gt;Need broadest service selection? → AWS&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Specific Requirements&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Best data analytics? → GCP (BigQuery)&lt;/li&gt;
  &lt;li&gt;Best enterprise integration? → Azure&lt;/li&gt;
  &lt;li&gt;Most mature ecosystem? → AWS&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cost&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Use pricing calculators&lt;/li&gt;
  &lt;li&gt;Consider egress fees&lt;/li&gt;
  &lt;li&gt;Look for committed use discounts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Skills&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Team expertise&lt;/li&gt;
  &lt;li&gt;Training availability&lt;/li&gt;
  &lt;li&gt;Community support&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;multi-cloud-strategy&quot;&gt;Multi-Cloud Strategy&lt;/h3&gt;

&lt;p&gt;Many organizations use multiple clouds:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Benefits&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;✓ Avoid vendor lock-in&lt;/li&gt;
  &lt;li&gt;✓ Use best service from each provider&lt;/li&gt;
  &lt;li&gt;✓ Geographic coverage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Challenges&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;⚠ Increased complexity&lt;/li&gt;
  &lt;li&gt;⚠ Higher management overhead&lt;/li&gt;
  &lt;li&gt;⚠ Data transfer costs&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;/h2&gt;

&lt;p&gt;Cloud providers offer similar core services with unique strengths:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;AWS&lt;/strong&gt;: Largest, most mature, broadest service selection&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Azure&lt;/strong&gt;: Best for enterprises, hybrid cloud, Microsoft integration&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;GCP&lt;/strong&gt;: Innovation leader, best data analytics, Kubernetes expertise&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key Considerations&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Feature parity across providers&lt;/li&gt;
  &lt;li&gt;Regional availability and compliance&lt;/li&gt;
  &lt;li&gt;Security and compliance features&lt;/li&gt;
  &lt;li&gt;Cost optimization strategies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Choose based on your specific requirements, existing infrastructure, and team expertise.&lt;/p&gt;

&lt;h2 id=&quot;further-reading&quot;&gt;Further Reading&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://aws.amazon.com/products/&quot;&gt;AWS Services Overview&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://azure.microsoft.com/en-us/services/&quot;&gt;Azure Services&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://cloud.google.com/products&quot;&gt;GCP Products&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://cloud.google.com/docs/compare/aws&quot;&gt;Cloud Comparison&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content>
 </entry>
 
 <entry>
   <title>Chapter 11 - Data Sourcing and Cleaning</title>
   <link href="https://nglelinh.github.io/contents/en/chapter11/11_Introduction/"/>
   <updated>2021-01-01T00:00:00+00:00</updated>
   <id>https://nglelinh.github.io/service-oriented-architecture-and-cloud-computing-iuh/contents/en/chapter11/11_Introduction</id>
   <content type="html">&lt;p&gt;Welcome to Chapter 11: &lt;strong&gt;Data Sourcing and Cleaning&lt;/strong&gt;.&lt;/p&gt;

&lt;h2 id=&quot;chapter-overview&quot;&gt;Chapter Overview&lt;/h2&gt;

&lt;p&gt;Data quality is critical for successful analytics and machine learning. This chapter covers the essential processes of acquiring data from various sources and preparing it for analysis through cleaning and transformation.&lt;/p&gt;

&lt;h2 id=&quot;learning-objectives&quot;&gt;Learning Objectives&lt;/h2&gt;

&lt;p&gt;By the end of this chapter, you will be able to:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Understand data sourcing&lt;/strong&gt; strategies and techniques&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Identify data quality issues&lt;/strong&gt; and their impact&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Apply data cleaning techniques&lt;/strong&gt; to improve data quality&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Transform data&lt;/strong&gt; into suitable formats for analysis&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Implement data validation&lt;/strong&gt; and quality checks&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Design data preparation pipelines&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;topics-covered&quot;&gt;Topics Covered&lt;/h2&gt;

&lt;h3 id=&quot;1-data-sourcing&quot;&gt;1. Data Sourcing&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;Data acquisition strategies&lt;/li&gt;
  &lt;li&gt;Internal vs external data sources&lt;/li&gt;
  &lt;li&gt;APIs and web scraping&lt;/li&gt;
  &lt;li&gt;Data integration challenges&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;2-data-quality-issues&quot;&gt;2. Data Quality Issues&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;Missing data&lt;/li&gt;
  &lt;li&gt;Inconsistent formats&lt;/li&gt;
  &lt;li&gt;Duplicate records&lt;/li&gt;
  &lt;li&gt;Outliers and anomalies&lt;/li&gt;
  &lt;li&gt;Data type mismatches&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;3-data-cleaning-techniques&quot;&gt;3. Data Cleaning Techniques&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;Handling missing values&lt;/li&gt;
  &lt;li&gt;Removing duplicates&lt;/li&gt;
  &lt;li&gt;Standardizing formats&lt;/li&gt;
  &lt;li&gt;Correcting errors&lt;/li&gt;
  &lt;li&gt;Dealing with outliers&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;4-data-transformation&quot;&gt;4. Data Transformation&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;Normalization and scaling&lt;/li&gt;
  &lt;li&gt;Encoding categorical variables&lt;/li&gt;
  &lt;li&gt;Feature engineering&lt;/li&gt;
  &lt;li&gt;Data aggregation&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;5-data-validation&quot;&gt;5. Data Validation&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;Schema validation&lt;/li&gt;
  &lt;li&gt;Business rule validation&lt;/li&gt;
  &lt;li&gt;Data profiling&lt;/li&gt;
  &lt;li&gt;Quality metrics&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;why-data-cleaning-matters&quot;&gt;Why Data Cleaning Matters&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;“Garbage in, garbage out”&lt;/strong&gt; - Poor data quality leads to:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Inaccurate insights&lt;/strong&gt;: Wrong decisions based on bad data&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Failed ML models&lt;/strong&gt;: Models trained on dirty data perform poorly&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Wasted resources&lt;/strong&gt;: Time spent fixing issues downstream&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Lost opportunities&lt;/strong&gt;: Missing valuable patterns in noisy data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Commonly cited industry estimate&lt;/strong&gt;: Data scientists spend 60-80% of their time on data preparation!&lt;/p&gt;

&lt;h2 id=&quot;prerequisites&quot;&gt;Prerequisites&lt;/h2&gt;

&lt;p&gt;To get the most out of this chapter, you should understand:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Basic statistics and data analysis&lt;/li&gt;
  &lt;li&gt;Programming (Python/SQL)&lt;/li&gt;
  &lt;li&gt;Database concepts&lt;/li&gt;
  &lt;li&gt;Data pipeline fundamentals from Chapter 10&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;real-world-impact&quot;&gt;Real-World Impact&lt;/h2&gt;

&lt;p&gt;Clean data is essential for:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Healthcare&lt;/strong&gt;: Accurate patient records save lives&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Finance&lt;/strong&gt;: Correct transaction data helps detect fraud&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;E-commerce&lt;/strong&gt;: Clean customer data improves recommendations&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Manufacturing&lt;/strong&gt;: Quality sensor data enables predictive maintenance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s learn how to source and clean data effectively!&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>11-01 Data Sourcing, Cleaning, and Preparation</title>
   <link href="https://nglelinh.github.io/contents/en/chapter11/11_01_Data_Sourcing_Cleaning/"/>
   <updated>2021-01-01T00:00:00+00:00</updated>
   <id>https://nglelinh.github.io/service-oriented-architecture-and-cloud-computing-iuh/contents/en/chapter11/11_01_Data_Sourcing_Cleaning</id>
   <content type="html">&lt;p&gt;Data preparation is a critical step in any data analytics or machine learning project. This lecture covers the essential techniques for sourcing data from various sources and cleaning it for analysis.&lt;/p&gt;

&lt;h2 id=&quot;data-sourcing&quot;&gt;Data Sourcing&lt;/h2&gt;

&lt;h3 id=&quot;data-licensing&quot;&gt;Data Licensing&lt;/h3&gt;

&lt;blockquote&gt;
  &lt;p&gt;[!WARNING]
&lt;strong&gt;Can you use any data?&lt;/strong&gt;&lt;/p&gt;

  &lt;p&gt;Always check data sources for restrictions before use!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Common licenses:&lt;/p&gt;

&lt;h4 id=&quot;mit-license&quot;&gt;MIT License&lt;/h4&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Very permissive&lt;/strong&gt;&lt;/li&gt;
  &lt;li&gt;Can use commercially&lt;/li&gt;
  &lt;li&gt;Must keep license and copyright notice&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;gpl-v2v3&quot;&gt;GPL (v2/v3)&lt;/h4&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Copyleft license&lt;/strong&gt;&lt;/li&gt;
  &lt;li&gt;Distributed derivative works must come with their source code&lt;/li&gt;
  &lt;li&gt;Must include the copyright notice, the license text, and a description of the changes made&lt;/li&gt;
  &lt;li&gt;“Spreads virally”: distributed derivative works must themselves be licensed under the GPL&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;other-considerations&quot;&gt;Other Considerations&lt;/h4&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Commercial use restrictions&lt;/strong&gt;&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Foreign use limitations&lt;/strong&gt;&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Attribution requirements&lt;/strong&gt;&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Data privacy regulations&lt;/strong&gt; (GDPR, CCPA)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;types-of-data-sources&quot;&gt;Types of Data Sources&lt;/h3&gt;

&lt;h4 id=&quot;1-refined-datasets&quot;&gt;1. Refined Datasets&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Characteristics&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Academic or governmental datasets&lt;/li&gt;
  &lt;li&gt;Released to the public&lt;/li&gt;
  &lt;li&gt;Pre-cleaned and formatted&lt;/li&gt;
  &lt;li&gt;Table-like format (CSV, databases)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Advantages&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;✓ Missing values handled&lt;/li&gt;
  &lt;li&gt;✓ Parsing done&lt;/li&gt;
  &lt;li&gt;✓ Easier to download&lt;/li&gt;
  &lt;li&gt;✓ Usage examples available in published papers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Examples&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;MNIST&lt;/strong&gt;: Handwritten digits dataset&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;ImageNet&lt;/strong&gt;: Image classification dataset&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;UCI Machine Learning Repository&lt;/strong&gt;&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Kaggle Datasets&lt;/strong&gt;&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Government open data portals&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;2-raw-data&quot;&gt;2. Raw Data&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Characteristics&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Unprocessed data from original sources&lt;/li&gt;
  &lt;li&gt;May have missing values, formatting issues&lt;/li&gt;
  &lt;li&gt;Requires significant cleaning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Sources&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Social media (Twitter, Facebook)&lt;/li&gt;
  &lt;li&gt;Sensor data (IoT devices)&lt;/li&gt;
  &lt;li&gt;Scientific instruments&lt;/li&gt;
  &lt;li&gt;Web scraping&lt;/li&gt;
  &lt;li&gt;APIs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Challenges&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;⚠ Missing values&lt;/li&gt;
  &lt;li&gt;⚠ Inconsistent formatting&lt;/li&gt;
  &lt;li&gt;⚠ Non-normalized data&lt;/li&gt;
  &lt;li&gt;⚠ Requires extensive cleaning&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;web-scraping&quot;&gt;Web Scraping&lt;/h2&gt;

&lt;h3 id=&quot;process&quot;&gt;Process&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Use an HTTP client or headless browser&lt;/strong&gt; to download the page’s HTML (a headless browser is needed when JavaScript renders the content)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Parse HTML&lt;/strong&gt; to extract relevant data&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Save in structured format&lt;/strong&gt; (CSV, JSON, database)&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;example-with-beautifulsoup&quot;&gt;Example with BeautifulSoup&lt;/h3&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bs4&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;BeautifulSoup&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;html_content&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;
&amp;lt;html&amp;gt;&amp;lt;body&amp;gt;&amp;lt;ul&amp;gt;
  &amp;lt;li class=&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;shoe-item&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;gt;Air Jordans&amp;lt;/li&amp;gt;
  &amp;lt;li class=&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;shoe-item&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;gt;Light up Sketchers&amp;lt;/li&amp;gt;
&amp;lt;/ul&amp;gt;&amp;lt;/body&amp;gt;&amp;lt;/html&amp;gt;
&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;soup&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;BeautifulSoup&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;html_content&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;html.parser&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;item&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;soup&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;find_all&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;attrs&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;class&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;shoe-item&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}):&lt;/span&gt;
    &lt;span class=&quot;nf&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;item&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;text&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Output:
# Air Jordans
# Light up Sketchers
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
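
&lt;p&gt;The example above stops after parsing. Step 3 of the process, saving in a structured format, could look like the following minimal sketch. It uses only the standard library and assumes the item names have already been extracted; the helper name &lt;code&gt;items_to_csv&lt;/code&gt; and the column names are illustrative, not part of the original example.&lt;/p&gt;

```python
import csv
import io

# Item names as they might come out of the parsing step (sample data).
scraped_items = ["Air Jordans", "Light up Sketchers"]

def items_to_csv(items):
    """Serialize scraped item names to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["position", "name"])
    for position, name in enumerate(items, start=1):
        writer.writerow([position, name])
    return buffer.getvalue()

csv_text = items_to_csv(scraped_items)
print(csv_text)
```

&lt;p&gt;To write to disk instead, the same writer can be pointed at a file opened with &lt;code&gt;open(path, &quot;w&quot;, newline=&quot;&quot;)&lt;/code&gt;.&lt;/p&gt;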

&lt;h3 id=&quot;web-scraping-challenges&quot;&gt;Web Scraping Challenges&lt;/h3&gt;

&lt;p&gt;⚠ &lt;strong&gt;Rate limiting&lt;/strong&gt;: Servers may block excessive requests&lt;br /&gt;
⚠ &lt;strong&gt;IP blocking&lt;/strong&gt;: May get banned for aggressive scraping&lt;br /&gt;
⚠ &lt;strong&gt;Legal issues&lt;/strong&gt;: May violate terms of service&lt;br /&gt;
⚠ &lt;strong&gt;Fragile code&lt;/strong&gt;: Breaks when HTML structure changes&lt;br /&gt;
⚠ &lt;strong&gt;Dynamic content&lt;/strong&gt;: JavaScript-rendered content requires special handling&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;Best Practice&lt;/strong&gt;: Use official APIs when available instead of scraping&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2 id=&quot;data-cleaning&quot;&gt;Data Cleaning&lt;/h2&gt;

&lt;h3 id=&quot;why-clean-data&quot;&gt;Why Clean Data?&lt;/h3&gt;

&lt;p&gt;Data quality issues include:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Missing values&lt;/strong&gt;: Incomplete records&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Duplicate values&lt;/strong&gt;: Repeated entries&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Invalid values&lt;/strong&gt;: Data outside expected range&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Useless values&lt;/strong&gt;: Irrelevant information&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Inconsistent formats&lt;/strong&gt;: Different representations of same data&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;Goal&lt;/strong&gt;: Ensure data is as representative and accurate as possible for machine learning algorithms&lt;/p&gt;
&lt;/blockquote&gt;
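
&lt;p&gt;Several of the issues listed above can be counted before any cleaning is attempted. The following is a small sketch with pandas; the sample rows, the column names, and the plausible age range of 0 to 120 are all assumptions made for illustration.&lt;/p&gt;

```python
import pandas as pd

# Hypothetical sample containing one of each problem: a missing age,
# a duplicated row, and an age outside a plausible range.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Bob", "Cy", "Di"],
    "age": [34, 51, 51, None, 430],
})

# Missing values: number of NaN entries per column.
missing_per_column = df.isna().sum()

# Duplicate values: rows identical to an earlier row.
duplicate_rows = int(df.duplicated().sum())

# Invalid values: known ages outside the assumed 0..120 range.
# Missing ages are excluded here so they are not counted twice.
known_ages = df["age"].dropna()
invalid_ages = int((~known_ages.between(0, 120)).sum())

print(missing_per_column)
print("duplicate rows:", duplicate_rows)
print("invalid ages:", invalid_ages)
```

&lt;p&gt;A report like this helps decide which cleaning strategy to apply to each column.&lt;/p&gt;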

&lt;h3 id=&quot;data-normalization&quot;&gt;Data Normalization&lt;/h3&gt;

&lt;p&gt;Data should be in a normalized format:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Before:                    After:
Name: &quot;John Smith&quot;         first_name: &quot;John&quot;
Age: &quot;25 years&quot;            last_name: &quot;Smith&quot;
City: &quot;NYC&quot;                age: 25
                           city: &quot;New York City&quot;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
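
&lt;p&gt;The before/after transformation shown above can be sketched in a few lines of Python. The function name &lt;code&gt;normalize_record&lt;/code&gt; and the &lt;code&gt;CITY_NAMES&lt;/code&gt; lookup table are hypothetical, and splitting the name on the first space is a simplification that will not suit every naming convention.&lt;/p&gt;

```python
import re

# Hypothetical lookup table for expanding city abbreviations.
CITY_NAMES = {"NYC": "New York City"}

def normalize_record(raw):
    """Convert a raw record into the normalized shape shown above."""
    # Naive split: everything after the first space is the last name.
    first_name, _, last_name = raw["Name"].strip().partition(" ")
    # Pull the first run of digits out of strings such as "25 years".
    age_match = re.search(r"\d+", raw["Age"])
    city = raw["City"].strip()
    return {
        "first_name": first_name,
        "last_name": last_name,
        "age": int(age_match.group()) if age_match else None,
        "city": CITY_NAMES.get(city, city),
    }

record = normalize_record({"Name": "John Smith", "Age": "25 years", "City": "NYC"})
print(record)
```

&lt;p&gt;Unknown city abbreviations pass through unchanged, and an age with no digits becomes &lt;code&gt;None&lt;/code&gt; so that it can be handled as a missing value later.&lt;/p&gt;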

&lt;h2 id=&quot;handling-missing-values&quot;&gt;Handling Missing Values&lt;/h2&gt;

&lt;h3 id=&quot;strategies&quot;&gt;Strategies&lt;/h3&gt;

&lt;h4 id=&quot;1-drop-data&quot;&gt;1. Drop Data&lt;/h4&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pandas&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pd&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# df is assumed to be an existing pandas DataFrame
# Drop rows with any missing values
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df_clean&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;dropna&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Drop rows where specific column is missing
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df_clean&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;dropna&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;subset&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;age&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Drop columns with fewer than 50% non-missing values (thresh is a count of non-missing values)
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df_clean&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;dropna&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;axis&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;thresh&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Considerations&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;⚠ May skew the dataset if values are not missing at random&lt;/li&gt;
  &lt;li&gt;⚠ Loses information&lt;/li&gt;
  &lt;li&gt;✓ Simple and fast&lt;/li&gt;
  &lt;li&gt;✓ Works well when few values are missing&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;2-imputation---fill-with-meanmedian&quot;&gt;2. Imputation - Fill with Mean/Median&lt;/h4&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;# Fill with mean (assign back instead of using inplace on a column selection)
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;age&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;age&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;fillna&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;age&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;mean&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;())&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Fill with median (better for skewed data)
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;salary&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;salary&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;fillna&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;salary&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;median&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;())&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Fill with mode (for categorical data)
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;category&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;category&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;fillna&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;category&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;mode&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Considerations&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;✓ Preserves dataset size&lt;/li&gt;
  &lt;li&gt;⚠ Can introduce bias when values are not missing at random&lt;/li&gt;
  &lt;li&gt;⚠ Reduces variance, because every filled value is identical&lt;/li&gt;
&lt;/ul&gt;
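
&lt;p&gt;The effect on variance can be checked on a toy column. The sketch below uses made-up ages (an assumption for illustration only) and compares the standard deviation before and after mean imputation; the mean is preserved while the spread shrinks.&lt;/p&gt;

```python
import pandas as pd

# Made-up ages with two missing entries (illustrative data only).
ages = pd.Series([20.0, 30.0, None, 40.0, None, 50.0])

# Mean imputation: every missing entry receives the same value.
filled = ages.fillna(ages.mean())

# The mean is unchanged by construction.
print(ages.mean(), filled.mean())  # 35.0 35.0

# The standard deviation drops because the filled values sit exactly on the mean.
print(round(ages.std(), 2), round(filled.std(), 2))  # 12.91 10.0
```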

&lt;h4 id=&quot;3-forwardbackward-fill&quot;&gt;3. Forward/Backward Fill&lt;/h4&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;# Forward fill (use previous value); fillna(method=...) is deprecated in recent pandas
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;temperature&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;temperature&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;ffill&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Backward fill (use next value)
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;temperature&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;temperature&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;bfill&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Use case&lt;/strong&gt;: Time series data&lt;/p&gt;

&lt;h4 id=&quot;4-interpolation&quot;&gt;4. Interpolation&lt;/h4&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;# Linear interpolation (assign the result back to the column)
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;value&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;value&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;interpolate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;method&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;linear&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Polynomial interpolation (requires SciPy)
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;value&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;value&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;interpolate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;method&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;polynomial&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;order&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h4 id=&quot;5-probabilistic-sampling&quot;&gt;5. Probabilistic Sampling&lt;/h4&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;numpy&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Sample from distribution
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mean&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;age&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;mean&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;std&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;age&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;missing_count&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;age&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;isnull&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;().&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;sum&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;loc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;age&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;isnull&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;age&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;random&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;normal&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mean&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;missing_count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h4 id=&quot;6-leave-as-is&quot;&gt;6. Leave As Is&lt;/h4&gt;

&lt;p&gt;Some implementations of the following algorithms accept missing values directly (check the documentation of the library you use):&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Decision trees&lt;/li&gt;
  &lt;li&gt;Random forests&lt;/li&gt;
  &lt;li&gt;XGBoost&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Advantage&lt;/strong&gt;: Avoids the bias that dropping or imputing values can introduce&lt;/p&gt;

&lt;h2 id=&quot;handling-duplicates&quot;&gt;Handling Duplicates&lt;/h2&gt;

&lt;h3 id=&quot;identifying-duplicates&quot;&gt;Identifying Duplicates&lt;/h3&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;# Find duplicate rows
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;duplicates&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;duplicated&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Find duplicates based on specific columns
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;duplicates&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;duplicated&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;subset&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;email&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# View duplicate rows
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;duplicated&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;keep&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;False&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;removing-duplicates&quot;&gt;Removing Duplicates&lt;/h3&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;# Remove duplicate rows (the first occurrence is kept by default)
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df_clean&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;drop_duplicates&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Keep the first occurrence explicitly (same as the default)
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df_clean&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;drop_duplicates&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;keep&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;first&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Keep last occurrence
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df_clean&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;drop_duplicates&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;keep&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;last&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Remove based on specific columns
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df_clean&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;drop_duplicates&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;subset&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;user_id&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;considerations&quot;&gt;Considerations&lt;/h3&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;Important&lt;/strong&gt;: Duplicates can have meaning!&lt;/p&gt;

  &lt;ul&gt;
    &lt;li&gt;Multiple purchases by same customer&lt;/li&gt;
    &lt;li&gt;Repeated sensor readings&lt;/li&gt;
    &lt;li&gt;Time-series data&lt;/li&gt;
  &lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Questions to ask&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;What defines a duplicate? (Exact match? Close match?)&lt;/li&gt;
  &lt;li&gt;Should we keep any duplicates?&lt;/li&gt;
  &lt;li&gt;Is there temporal ordering to consider?&lt;/li&gt;
&lt;/ul&gt;
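
&lt;p&gt;The difference between an exact match and a close match can be sketched with pandas. The sample rows below are invented for illustration, and "close" is assumed here to mean "equal after trimming whitespace and lowercasing"; other definitions (fuzzy matching, for example) are possible.&lt;/p&gt;

```python
import pandas as pd

# Invented customer records; rows 0 and 1 differ only in case and spacing.
df = pd.DataFrame({
    "name": ["John Smith", " john smith ", "Ann Lee"],
    "email": ["John@Example.com", "john@example.com", "ann@example.com"],
})

# Build normalized comparison keys without altering the original columns.
keys = pd.DataFrame({
    "name": df["name"].str.strip().str.lower(),
    "email": df["email"].str.strip().str.lower(),
})

# Exact matching sees three distinct rows.
exact = df.duplicated(subset=["name", "email"])
# Matching on the normalized keys flags the second row as a repeat of the first.
close = keys.duplicated()

print(exact.tolist())  # [False, False, False]
print(close.tolist())  # [False, True, False]
```

&lt;p&gt;Because the keys live in a separate frame, the original values stay available for review before anything is dropped.&lt;/p&gt;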

&lt;h2 id=&quot;handling-invalid-and-useless-values&quot;&gt;Handling Invalid and Useless Values&lt;/h2&gt;

&lt;h3 id=&quot;invalid-values&quot;&gt;Invalid Values&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The problem domain determines which values are valid&lt;/strong&gt;:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;# Example: Age must be positive
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;age&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Example: Percentage must be 0-100
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;percentage&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;percentage&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;100&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)]&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Example: Date must be in past
&lt;/span&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;datetime&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;datetime&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;date&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;datetime&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;now&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;outliers&quot;&gt;Outliers&lt;/h3&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;# Z-score method
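&lt;/span&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;numpy&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;scipy&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;stats&lt;/span&gt;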
&lt;span class=&quot;n&quot;&gt;z_scores&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;abs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;stats&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;zscore&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;value&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]))&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;df_clean&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z_scores&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# IQR method
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Q1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;value&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;quantile&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;0.25&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;Q3&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;value&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;quantile&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;0.75&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;IQR&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Q3&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Q1&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;df_clean&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;value&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Q1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;1.5&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;IQR&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; 
              &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;value&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Q3&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;1.5&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;IQR&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;useless-values&quot;&gt;Useless Values&lt;/h3&gt;

&lt;p&gt;Values are typically useless for a model when they come from:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Columns with a single unique value&lt;/li&gt;
  &lt;li&gt;Features that are highly correlated with another feature&lt;/li&gt;
  &lt;li&gt;Fields that are irrelevant to the problem domain&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;Tip&lt;/strong&gt;: Keep them around - they may be useful later!&lt;/p&gt;
&lt;/blockquote&gt;
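
&lt;p&gt;Constant columns and highly correlated features can be flagged automatically. The following is a minimal sketch on an invented dataset; the 0.95 correlation cutoff is an assumption chosen for illustration, not a standard value. The flagged columns are excluded from a separate feature frame, so the original data stays intact.&lt;/p&gt;

```python
import pandas as pd

# Invented dataset: "country" is constant, "height_in" mirrors "height_cm".
df = pd.DataFrame({
    "country": ["VN", "VN", "VN", "VN"],
    "height_cm": [150.0, 160.0, 170.0, 180.0],
    "height_in": [59.1, 63.0, 66.9, 70.9],
    "score": [3.0, 9.0, 1.0, 7.0],
})

# Columns with a single unique value carry no signal for a model.
constant_cols = [col for col in df.columns if df[col].nunique(dropna=False) == 1]

# Absolute pairwise correlation between numeric columns, compared to the cutoff.
corr = df.select_dtypes(include="number").corr().abs()
high = corr.gt(0.95)  # assumed cutoff

# Report each highly correlated pair once (upper triangle only).
cols = list(corr.columns)
correlated_pairs = [
    (a, b)
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
    if high.loc[a, b]
]

print(constant_cols)     # ['country']
print(correlated_pairs)  # [('height_cm', 'height_in')]

# Exclude flagged columns from the feature set; df itself is left unchanged.
redundant = [b for _, b in correlated_pairs]
features = df.drop(columns=constant_cols + redundant)
print(list(features.columns))  # ['height_cm', 'score']
```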

&lt;h2 id=&quot;data-processing-pipelines&quot;&gt;Data Processing Pipelines&lt;/h2&gt;

&lt;h3 id=&quot;pipeline-stages&quot;&gt;Pipeline Stages&lt;/h3&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;┌─────────────┐
│ Data Source │
└──────┬──────┘
       │
┌──────▼──────┐
│  Extract    │ (Get data from source)
└──────┬──────┘
       │
┌──────▼──────┐
│ Transform   │ (Clean, normalize, enrich)
└──────┬──────┘
       │
┌──────▼──────┐
│   Load      │ (Store in destination)
└──────┬──────┘
       │
┌──────▼──────┐
│  Validate   │ (Check quality)
└─────────────┘
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;example-pipeline-with-pandas&quot;&gt;Example Pipeline with Pandas&lt;/h3&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pandas&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pd&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;data_pipeline&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;input_file&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;output_file&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;# 1. Extract
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;read_csv&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;input_file&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    
    &lt;span class=&quot;c1&quot;&gt;# 2. Transform
&lt;/span&gt;    &lt;span class=&quot;c1&quot;&gt;# Handle missing values
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;age&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;age&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;fillna&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;age&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;median&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;())&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;category&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;category&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;fillna&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;Unknown&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    
    &lt;span class=&quot;c1&quot;&gt;# Remove duplicates
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;drop_duplicates&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;subset&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;user_id&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;
    
    &lt;span class=&quot;c1&quot;&gt;# Handle invalid values
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;age&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;age&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;120&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
    
    &lt;span class=&quot;c1&quot;&gt;# Normalize formats
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;email&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;email&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;str&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;lower&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;().&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;str&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;strip&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;phone&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;phone&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;str&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;replace&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sa&quot;&gt;r&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;\D&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&apos;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;regex&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    
    &lt;span class=&quot;c1&quot;&gt;# 3. Load
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;to_csv&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;output_file&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;index&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;False&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    
    &lt;span class=&quot;c1&quot;&gt;# 4. Validate
&lt;/span&gt;    &lt;span class=&quot;nf&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sa&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;Original rows: &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;read_csv&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;input_file&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;nf&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sa&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;Cleaned rows: &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;nf&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sa&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;Missing values:&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\n&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;isnull&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;().&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;sum&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Run pipeline
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;clean_data&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;data_pipeline&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;raw_data.csv&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;clean_data.csv&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;loading-data-into-processing-frameworks&quot;&gt;Loading Data into Processing Frameworks&lt;/h2&gt;

&lt;h3 id=&quot;hadoop--spark&quot;&gt;Hadoop / Spark&lt;/h3&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;# Copy to HDFS&lt;/span&gt;
hdfs dfs &lt;span class=&quot;nt&quot;&gt;-copyFromLocal&lt;/span&gt; local_file.csv /data/input/

&lt;span class=&quot;c&quot;&gt;# Or use Spark&lt;/span&gt;
spark-submit &lt;span class=&quot;nt&quot;&gt;--master&lt;/span&gt; yarn load_data.py
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;sql-databases&quot;&gt;SQL Databases&lt;/h3&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pandas&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pd&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sqlalchemy&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;create_engine&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Create connection
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;engine&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;create_engine&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;postgresql://user:pass@localhost/db&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Load data
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;to_sql&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;table_name&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;engine&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;if_exists&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;replace&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;index&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;False&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
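
&lt;p&gt;Note that &lt;code&gt;if_exists=&apos;replace&apos;&lt;/code&gt; drops and recreates the table on every run, whereas &lt;code&gt;&apos;append&apos;&lt;/code&gt; adds rows to the existing table. The following is a minimal sketch of that difference. It assumes an in-memory SQLite database in place of PostgreSQL so that it runs without a server; the table name &lt;code&gt;users&lt;/code&gt;, the helper &lt;code&gt;row_count&lt;/code&gt;, and the sample rows are made up for illustration.&lt;/p&gt;

```python
import sqlite3

import pandas as pd

# In-memory SQLite stands in for the PostgreSQL engine (assumption for this sketch).
conn = sqlite3.connect(":memory:")

# Made-up sample batches.
batch_1 = pd.DataFrame({"user_id": [1, 2], "age": [34, 27]})
batch_2 = pd.DataFrame({"user_id": [3], "age": [45]})


def row_count(connection, table):
    """Return the number of rows currently stored in the table."""
    query = f"SELECT COUNT(*) AS n FROM {table}"
    return int(pd.read_sql_query(query, connection)["n"].iloc[0])


# 'replace' drops any existing table and writes only the new rows.
batch_1.to_sql("users", conn, if_exists="replace", index=False)
count_after_first_load = row_count(conn, "users")

# 'append' keeps the existing rows and adds the new batch.
batch_2.to_sql("users", conn, if_exists="append", index=False)
count_after_append = row_count(conn, "users")

# A second 'replace' discards everything that was loaded before.
batch_2.to_sql("users", conn, if_exists="replace", index=False)
count_after_replace = row_count(conn, "users")

print(count_after_first_load, count_after_append, count_after_replace)
conn.close()
```

&lt;p&gt;Run as written, the three counts are 2, 3, and 1. For incremental loads, &lt;code&gt;&apos;append&apos;&lt;/code&gt; is usually the safer choice; &lt;code&gt;&apos;replace&apos;&lt;/code&gt; suits full refreshes where losing the previous table contents is intended.&lt;/p&gt;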

&lt;h3 id=&quot;cloud-storage&quot;&gt;Cloud Storage&lt;/h3&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;# AWS S3
&lt;/span&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;boto3&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;s3&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;boto3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;client&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;s3&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;s3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;upload_file&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;local_file.csv&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;bucket-name&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;data/file.csv&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Google Cloud Storage
&lt;/span&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;google.cloud&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;storage&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;client&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;storage&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;Client&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;bucket&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;client&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;bucket&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;bucket-name&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;blob&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bucket&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;blob&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;data/file.csv&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;blob&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;upload_from_filename&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;local_file.csv&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;large-data-transfer&quot;&gt;Large Data Transfer&lt;/h3&gt;

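&lt;p&gt;For very large datasets (&amp;gt; 1 PB), uploading over the network can take weeks or months, so the major cloud providers offer physical devices that are shipped to their data centers:&lt;/p&gt;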

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;AWS Snowball&lt;/strong&gt;: Physical device for bulk data transfer&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Azure Data Box&lt;/strong&gt;: Microsoft’s physical transfer solution&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Google Transfer Appliance&lt;/strong&gt;: Google’s hardware solution&lt;/li&gt;
&lt;/ul&gt;
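
&lt;p&gt;A rough calculation shows why physical transfer becomes attractive at this scale. The sketch below assumes decimal units (1 PB = 10&lt;sup&gt;15&lt;/sup&gt; bytes) and a link that is fully utilized the whole time, which is optimistic; the function name &lt;code&gt;transfer_days&lt;/code&gt; and the link speeds are chosen for illustration.&lt;/p&gt;

```python
# Decimal units: 1 PB = 10**15 bytes, 1 Gbps = 10**9 bits per second.
BYTES_PER_PB = 10**15
SECONDS_PER_DAY = 86_400


def transfer_days(size_pb, link_gbps, utilization=1.0):
    """Days needed to move size_pb petabytes over a link of link_gbps."""
    bits = size_pb * BYTES_PER_PB * 8
    bits_per_second = link_gbps * 10**9 * utilization
    return bits / bits_per_second / SECONDS_PER_DAY


for gbps in (1, 10, 100):
    print(f"1 PB over {gbps} Gbps: {transfer_days(1, gbps):.1f} days")
```

&lt;p&gt;Under these assumptions, 1 PB takes about 93 days at 1 Gbps and about 9 days at 10 Gbps, before accounting for protocol overhead or shared bandwidth.&lt;/p&gt;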

&lt;h2 id=&quot;best-practices&quot;&gt;Best Practices&lt;/h2&gt;

&lt;h3 id=&quot;1-document-everything&quot;&gt;1. Document Everything&lt;/h3&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;# Keep track of transformations
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cleaning_log&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;missing_values_filled&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;age&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;salary&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
    &lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;duplicates_removed&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;150&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;outliers_removed&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;23&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;invalid_values_removed&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
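
&lt;p&gt;The values in the log above are written by hand. One way to keep such a log accurate is to compute each entry as the corresponding step runs. The following is a minimal sketch, assuming the &lt;code&gt;user_id&lt;/code&gt; and &lt;code&gt;age&lt;/code&gt; columns from the pipeline example and a made-up five-row dataset; here &lt;code&gt;missing_values_filled&lt;/code&gt; maps each column to the number of cells filled rather than listing column names.&lt;/p&gt;

```python
import pandas as pd

# Made-up sample data; column names follow the pipeline example.
df = pd.DataFrame({
    "user_id": [1, 2, 2, 3, 4],
    "age": [34.0, None, 27.0, 150.0, 41.0],
})

cleaning_log = {}

# Record how many missing cells were filled in each column.
missing_before = int(df["age"].isna().sum())
df["age"] = df["age"].fillna(df["age"].median())
cleaning_log["missing_values_filled"] = {"age": missing_before}

# Record how many duplicate rows were dropped.
rows_before = len(df)
df = df.drop_duplicates(subset=["user_id"])
cleaning_log["duplicates_removed"] = rows_before - len(df)

# Record how many rows failed the validity rule (age strictly between 0 and 120).
rows_before = len(df)
df = df[df["age"].between(0, 120, inclusive="neither")]
cleaning_log["invalid_values_removed"] = rows_before - len(df)

print(cleaning_log)
```

&lt;p&gt;With this sample, one missing age is filled, one duplicate &lt;code&gt;user_id&lt;/code&gt; is dropped, and one row with an age of 150 is removed. The string form of &lt;code&gt;inclusive&lt;/code&gt; requires pandas 1.3 or later.&lt;/p&gt;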

&lt;h3 id=&quot;2-validate-at-each-step&quot;&gt;2. Validate at Each Step&lt;/h3&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;validate_data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;stage&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;nf&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sa&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\n&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;=== Validation at &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;stage&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s&quot;&gt; ===&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;nf&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sa&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;Shape: &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;shape&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;nf&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sa&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;Missing values:&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\n&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;isnull&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;().&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;sum&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;nf&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sa&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;Duplicates: &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;duplicated&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;().&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;sum&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;nf&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sa&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;Data types:&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\n&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dtypes&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
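
&lt;p&gt;Printed summaries only help if someone reads them. A complementary approach is to stop the pipeline when a check fails. The sketch below is one possible shape for that, not a standard API: the function &lt;code&gt;check_data&lt;/code&gt; and its parameters &lt;code&gt;required_columns&lt;/code&gt; and &lt;code&gt;key_column&lt;/code&gt; are hypothetical names introduced here.&lt;/p&gt;

```python
import pandas as pd


def check_data(df, stage, required_columns, key_column):
    """Raise ValueError if the DataFrame fails basic quality checks."""
    problems = []
    missing_cols = [c for c in required_columns if c not in df.columns]
    if missing_cols:
        problems.append(f"missing columns: {missing_cols}")
    elif df[required_columns].isnull().any().any():
        problems.append("null values in required columns")
    if key_column in df.columns and df[key_column].duplicated().any():
        problems.append(f"duplicate values in {key_column}")
    if problems:
        raise ValueError(f"Validation failed at {stage}: " + "; ".join(problems))
    return df


# A clean frame passes silently and is returned unchanged.
clean = pd.DataFrame({"user_id": [1, 2], "age": [34, 27]})
check_data(clean, "after cleaning", ["user_id", "age"], "user_id")

# A frame with a null age and a repeated key is rejected.
dirty = pd.DataFrame({"user_id": [1, 1], "age": [34, None]})
try:
    check_data(dirty, "after cleaning", ["user_id", "age"], "user_id")
except ValueError as err:
    print(err)
```

&lt;p&gt;Returning the DataFrame lets the check be placed between pipeline steps without changing how the data flows.&lt;/p&gt;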

&lt;h3 id=&quot;3-keep-original-data&quot;&gt;3. Keep Original Data&lt;/h3&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;# Never modify original
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df_original&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;read_csv&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;data.csv&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;df_working&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df_original&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;copy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Work on copy
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df_working&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;clean_data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df_working&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;4-automate-repetitive-tasks&quot;&gt;4. Automate Repetitive Tasks&lt;/h3&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;clean_column&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;series&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;strategy&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;mean&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;Reusable cleaning function&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;strategy&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;mean&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;series&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;fillna&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;series&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;mean&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;())&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;elif&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;strategy&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;median&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;series&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;fillna&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;series&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;median&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;())&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;elif&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;strategy&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;mode&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
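        &lt;span class=&quot;n&quot;&gt;modes&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;series&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;mode&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
        &lt;span class=&quot;c1&quot;&gt;# mode() is empty when every value is missing, so there is nothing to fill with
&lt;/span&gt;        &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;modes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;empty&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
            &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;series&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;series&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;fillna&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;modes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;iloc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;# Unknown strategy: return the column unchanged
&lt;/span&gt;    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;series&lt;/span&gt;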
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
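
&lt;p&gt;The same idea extends from a single column to a whole table. The following is a sketch under assumptions, not part of the original material: the helper &lt;code&gt;fill_missing&lt;/code&gt; and its &lt;code&gt;strategies&lt;/code&gt; mapping are hypothetical names, and the sample data is made up. It works on a copy, so the input DataFrame stays untouched, and it rejects unknown strategy names instead of ignoring them.&lt;/p&gt;

```python
import pandas as pd


def fill_missing(df, strategies):
    """Return a copy of df with each listed column filled by its strategy."""
    result = df.copy()
    for column, strategy in strategies.items():
        series = result[column]
        if strategy == "mean":
            fill_value = series.mean()
        elif strategy == "median":
            fill_value = series.median()
        elif strategy == "mode":
            modes = series.mode()
            if modes.empty:
                continue  # every value is missing, nothing to fill with
            fill_value = modes.iloc[0]
        else:
            raise ValueError(f"Unknown strategy: {strategy}")
        result[column] = series.fillna(fill_value)
    return result


# Made-up sample: one missing number and one missing label.
raw = pd.DataFrame({
    "age": [30.0, None, 50.0],
    "category": ["a", "a", None],
})
filled = fill_missing(raw, {"age": "mean", "category": "mode"})
print(filled)
```

&lt;p&gt;Here the missing age becomes 40.0 (the mean of 30 and 50) and the missing category becomes &lt;code&gt;a&lt;/code&gt;, while &lt;code&gt;raw&lt;/code&gt; still contains its original gaps.&lt;/p&gt;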

&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;/h2&gt;

&lt;p&gt;Data sourcing and cleaning are critical steps in the data pipeline:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Sourcing&lt;/strong&gt;: Obtain data from refined datasets, APIs, or web scraping&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Licensing&lt;/strong&gt;: Always check data usage rights&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Missing values&lt;/strong&gt;: Drop, impute, or leave based on context&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Duplicates&lt;/strong&gt;: Remove carefully, considering domain meaning&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Invalid values&lt;/strong&gt;: Filter based on business rules&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Pipelines&lt;/strong&gt;: Automate ETL processes&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Validation&lt;/strong&gt;: Check quality at each step&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Remember&lt;/strong&gt;: Data preparation is commonly estimated to take 60-80% of a data scientist’s time, so it’s worth doing well!&lt;/p&gt;

&lt;h2 id=&quot;further-reading&quot;&gt;Further Reading&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://pandas.pydata.org/docs/&quot;&gt;Pandas Documentation&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.crummy.com/software/BeautifulSoup/bs4/doc/&quot;&gt;BeautifulSoup Documentation&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://realpython.com/python-data-cleaning-numpy-pandas/&quot;&gt;Data Cleaning with Python&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content>
 </entry>
 
 <entry>
   <title>Chapter 10 - Big Data Platforms and Processing</title>
   <link href="https://nglelinh.github.io/contents/en/chapter10/10_Introduction/"/>
   <updated>2021-01-01T00:00:00+00:00</updated>
   <id>https://nglelinh.github.io/service-oriented-architecture-and-cloud-computing-iuh/contents/en/chapter10/10_Introduction</id>
   <content type="html">&lt;p&gt;Welcome to Chapter 10: &lt;strong&gt;Big Data Platforms and Processing&lt;/strong&gt;.&lt;/p&gt;

&lt;h2 id=&quot;chapter-overview&quot;&gt;Chapter Overview&lt;/h2&gt;

&lt;p&gt;Big Data has transformed how organizations collect, store, process, and analyze information. This chapter explores the fundamentals of big data, the platforms that enable big data processing, and the architectural patterns used to build scalable data systems.&lt;/p&gt;

&lt;h2 id=&quot;learning-objectives&quot;&gt;Learning Objectives&lt;/h2&gt;

&lt;p&gt;By the end of this chapter, you will be able to:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Understand “big data”&lt;/strong&gt; and its characteristics (Volume, Variety, Velocity, Veracity)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Explain big data platforms&lt;/strong&gt; and their role in modern data ecosystems&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Design big data architectures&lt;/strong&gt; including data pipelines and processing workflows&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Differentiate between operational and analytical data&lt;/strong&gt;&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Understand data governance&lt;/strong&gt; and quality concerns at scale&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Work with big data processing models&lt;/strong&gt; (batch, streaming, real-time)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;topics-covered&quot;&gt;Topics Covered&lt;/h2&gt;

&lt;h3 id=&quot;1-what-is-big-data&quot;&gt;1. What is Big Data?&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;Characteristics: Volume, Variety, Velocity, Veracity&lt;/li&gt;
  &lt;li&gt;Sources of big data (IoT, social media, scientific instruments)&lt;/li&gt;
  &lt;li&gt;Operational vs analytical data&lt;/li&gt;
  &lt;li&gt;The value of data and data economy&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;2-big-data-platforms&quot;&gt;2. Big Data Platforms&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;Platform definition and characteristics&lt;/li&gt;
  &lt;li&gt;Data as asset vs data as product&lt;/li&gt;
  &lt;li&gt;Onion architecture for big data platforms&lt;/li&gt;
  &lt;li&gt;Core services: ingestion, storage, processing, querying&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;3-big-data-architectures&quot;&gt;3. Big Data Architectures&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;Data-centric virtualized infrastructures&lt;/li&gt;
  &lt;li&gt;Middleware platforms&lt;/li&gt;
  &lt;li&gt;Big data services and applications&lt;/li&gt;
  &lt;li&gt;Multi-cloud and hybrid deployments&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;4-data-pipelines&quot;&gt;4. Data Pipelines&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;Data ingestion and ETL&lt;/li&gt;
  &lt;li&gt;Data storage and management&lt;/li&gt;
  &lt;li&gt;Data analysis and machine learning&lt;/li&gt;
  &lt;li&gt;Reporting and visualization&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;5-data-governance&quot;&gt;5. Data Governance&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;Data quality and lineage&lt;/li&gt;
  &lt;li&gt;Multi-tenancy and SLAs&lt;/li&gt;
  &lt;li&gt;Privacy and security concerns&lt;/li&gt;
  &lt;li&gt;Compliance (GDPR, regulations)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;6-processing-models&quot;&gt;6. Processing Models&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;Batch processing&lt;/li&gt;
  &lt;li&gt;Stream processing&lt;/li&gt;
  &lt;li&gt;Real-time analytics&lt;/li&gt;
  &lt;li&gt;Lambda and Kappa architectures&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;why-big-data-matters&quot;&gt;Why Big Data Matters&lt;/h2&gt;

&lt;p&gt;Big Data is fundamental to modern applications:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;AI and Machine Learning&lt;/strong&gt;: “No AI Without Data” - data is the fuel for ML models&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Business Intelligence&lt;/strong&gt;: 360-degree customer analytics&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;IoT and Industry 4.0&lt;/strong&gt;: Real-time monitoring and predictive maintenance&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Scientific Research&lt;/strong&gt;: Earth observation, healthcare, genomics&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Smart Cities&lt;/strong&gt;: Sustainability and urban optimization&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;prerequisites&quot;&gt;Prerequisites&lt;/h2&gt;

&lt;p&gt;To get the most out of this chapter, you should understand:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Distributed systems concepts (from previous chapters)&lt;/li&gt;
  &lt;li&gt;Cloud computing fundamentals&lt;/li&gt;
  &lt;li&gt;Database basics&lt;/li&gt;
  &lt;li&gt;Programming concepts (MapReduce, Spark from earlier chapters)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;real-world-examples&quot;&gt;Real-World Examples&lt;/h2&gt;

&lt;p&gt;Big data platforms power critical services:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Lyft&lt;/strong&gt;: 60 PB of queryable event data, 10 PB scanned daily&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Sentinel (ESA)&lt;/strong&gt;: Petabytes of Earth observation data&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;NYC Taxi Data&lt;/strong&gt;: 112M+ rows of trip data&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Network Monitoring&lt;/strong&gt;: 5M sensors generating 1.4B events/day&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s explore the world of big data platforms!&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>10-01 Big Data Fundamentals and Platforms</title>
   <link href="https://nglelinh.github.io/contents/en/chapter10/10_01_Big_Data_Platforms/"/>
   <updated>2021-01-01T00:00:00+00:00</updated>
   <id>https://nglelinh.github.io/service-oriented-architecture-and-cloud-computing-iuh/contents/en/chapter10/10_01_Big_Data_Platforms</id>
   <content type="html">&lt;p&gt;Big Data has become a cornerstone of modern computing, enabling organizations to extract insights from massive datasets that were previously impossible to process. This lecture covers the fundamentals of big data and the platforms that make big data processing possible.&lt;/p&gt;

&lt;h2 id=&quot;what-is-big-data&quot;&gt;What is Big Data?&lt;/h2&gt;

&lt;h3 id=&quot;definition&quot;&gt;Definition&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Big Data&lt;/strong&gt; refers to:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Extremely large, complex data sets&lt;/strong&gt;&lt;/li&gt;
  &lt;li&gt;Data that &lt;strong&gt;requires new techniques&lt;/strong&gt; to handle&lt;/li&gt;
  &lt;li&gt;Individual data items can be &lt;strong&gt;small or large&lt;/strong&gt;
    &lt;ul&gt;
      &lt;li&gt;Simple sensor events (bytes)&lt;/li&gt;
      &lt;li&gt;High-quality satellite images (gigabytes)&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;Note: It’s not just about size&lt;/strong&gt;&lt;/p&gt;

  &lt;p&gt;Big data is characterized by multiple dimensions beyond just volume.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3 id=&quot;the-four-vs-of-big-data&quot;&gt;The Four V’s of Big Data&lt;/h3&gt;

&lt;h4 id=&quot;1-volume&quot;&gt;1. Volume&lt;/h4&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Large size&lt;/strong&gt;: very large datasets&lt;/li&gt;
  &lt;li&gt;Or massive numbers of small data items&lt;/li&gt;
  &lt;li&gt;Examples:
    &lt;ul&gt;
      &lt;li&gt;112M+ rows of NYC taxi trip data&lt;/li&gt;
      &lt;li&gt;60 PB of queryable event data at Lyft&lt;/li&gt;
      &lt;li&gt;Petabytes of satellite imagery&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;2-variety&quot;&gt;2. Variety&lt;/h4&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Complex, different formats&lt;/strong&gt;&lt;/li&gt;
  &lt;li&gt;Multiple types of data and their relationships&lt;/li&gt;
  &lt;li&gt;Examples:
    &lt;ul&gt;
      &lt;li&gt;Structured (databases)&lt;/li&gt;
      &lt;li&gt;Semi-structured (JSON, XML)&lt;/li&gt;
      &lt;li&gt;Unstructured (text, images, video)&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;3-velocity&quot;&gt;3. Velocity&lt;/h4&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Generation speed&lt;/strong&gt;: how fast data is produced&lt;/li&gt;
  &lt;li&gt;Movement speed: how fast data must be transferred and processed&lt;/li&gt;
  &lt;li&gt;Examples:
    &lt;ul&gt;
      &lt;li&gt;1.4 billion events per day from sensors&lt;/li&gt;
      &lt;li&gt;Real-time social media streams&lt;/li&gt;
      &lt;li&gt;High-frequency trading data&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;4-veracity&quot;&gt;4. Veracity&lt;/h4&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Quality varies significantly&lt;/strong&gt;&lt;/li&gt;
  &lt;li&gt;Timeliness, accuracy, trustworthiness&lt;/li&gt;
  &lt;li&gt;Examples:
    &lt;ul&gt;
      &lt;li&gt;Sensor data with noise&lt;/li&gt;
      &lt;li&gt;User-generated content&lt;/li&gt;
      &lt;li&gt;Incomplete records&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;┌─────────────────────────────────────────┐
│  Big Data = Volume + Variety +          │
│             Velocity + Veracity          │
└─────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;sources-of-big-data&quot;&gt;Sources of Big Data&lt;/h2&gt;

&lt;h3 id=&quot;1-social-media&quot;&gt;1. Social Media&lt;/h3&gt;
&lt;p&gt;Data generated by human activities:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Meta/Facebook, TikTok, Twitter, Instagram&lt;/li&gt;
  &lt;li&gt;User posts, interactions, preferences&lt;/li&gt;
  &lt;li&gt;Billions of users generating content daily&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;2-internet-of-things-iot--industry-40&quot;&gt;2. Internet of Things (IoT) / Industry 4.0&lt;/h3&gt;
&lt;p&gt;Data from monitoring equipment and environments:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Smart sensors and devices&lt;/li&gt;
  &lt;li&gt;Industrial equipment monitoring&lt;/li&gt;
  &lt;li&gt;Environmental sensors&lt;/li&gt;
  &lt;li&gt;Example: 5M sensors generating 1.4B events/day&lt;/li&gt;
&lt;/ul&gt;
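&lt;p&gt;To get a feel for what the sensor example above means in practice, the figures can be converted into per-second and per-sensor rates. This is a rough back-of-the-envelope sketch that assumes events are spread evenly over the day (real workloads are usually bursty); the variable names are illustrative.&lt;/p&gt;

```python
# Back-of-the-envelope rates for the sensor example above.
# Assumption: events are spread evenly over a 24-hour day.
SENSORS = 5_000_000
EVENTS_PER_DAY = 1_400_000_000
SECONDS_PER_DAY = 24 * 60 * 60

events_per_second = EVENTS_PER_DAY / SECONDS_PER_DAY
events_per_sensor_per_day = EVENTS_PER_DAY / SENSORS
seconds_between_events_per_sensor = SECONDS_PER_DAY / events_per_sensor_per_day

print(f"Platform-wide ingest rate: {events_per_second:,.0f} events/s")
print(f"Per sensor: {events_per_sensor_per_day:.0f} events/day")
print(f"Per sensor: one event every {seconds_between_events_per_sensor:.0f} s")
```

&lt;p&gt;Each sensor is slow (roughly one event every five minutes), yet the platform as a whole must sustain about 16,000 events per second. Velocity is a property of the aggregate stream, not of any single source.&lt;/p&gt;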

&lt;h3 id=&quot;3-advanced-sciences&quot;&gt;3. Advanced Sciences&lt;/h3&gt;
&lt;p&gt;Data from advanced instruments:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Earth and space observation&lt;/strong&gt;: Sentinel satellites, James Webb Space Telescope&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Healthcare&lt;/strong&gt;: Personal health data, disease information&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Genomics&lt;/strong&gt;: DNA sequencing data&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;4-business-data&quot;&gt;4. Business Data&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Customer data&lt;/strong&gt;: Transactions, behavior, preferences&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Asset management&lt;/strong&gt;: Cars, homes, equipment&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Software systems&lt;/strong&gt;: Logs, traces, test results&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;operational-vs-analytical-data&quot;&gt;Operational vs Analytical Data&lt;/h2&gt;

&lt;h3 id=&quot;operational-data-oltp&quot;&gt;Operational Data (OLTP)&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Purpose&lt;/strong&gt;: Business/system operations&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Read/Write&lt;/strong&gt;: Frequent updates&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;OLTP&lt;/strong&gt;: Online Transaction Processing&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Examples&lt;/strong&gt;:
    &lt;ul&gt;
      &lt;li&gt;E-commerce transactions&lt;/li&gt;
      &lt;li&gt;Banking operations&lt;/li&gt;
      &lt;li&gt;Inventory management&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;┌─────────────────────────────────┐
│  Operational Database           │
│  - Current state                │
│  - Frequent updates             │
│  - Transaction focused          │
│  - Normalized schema            │
└─────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;analytical-data-olap&quot;&gt;Analytical Data (OLAP)&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Purpose&lt;/strong&gt;: Understanding and optimization&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Historical/Integrated&lt;/strong&gt;: Write once, read many&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;OLAP&lt;/strong&gt;: Online Analytical Processing&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Examples&lt;/strong&gt;:
    &lt;ul&gt;
      &lt;li&gt;Customer behavior analysis&lt;/li&gt;
      &lt;li&gt;Sales trends&lt;/li&gt;
      &lt;li&gt;Predictive models&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;┌─────────────────────────────────┐
│  Analytical Data Warehouse      │
│  - Historical data              │
│  - Read-optimized               │
│  - Aggregated views             │
│  - Denormalized schema          │
└─────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;Both types matter&lt;/strong&gt;: Big data platforms must handle both operational and analytical workloads.&lt;/p&gt;
&lt;/blockquote&gt;
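&lt;p&gt;The difference between the two workloads can be sketched with Python’s built-in sqlite3 module. The table and column names are hypothetical, and a single in-memory database stands in for what would normally be two separate systems (an operational database and a warehouse).&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL, status TEXT)"
)

# Operational (OLTP) style: small transactions that insert and update single rows.
with conn:
    conn.executemany(
        "INSERT INTO orders (customer, amount, status) VALUES (?, ?, ?)",
        [("alice", 30.0, "new"), ("bob", 12.5, "new"), ("alice", 7.5, "new")],
    )
with conn:
    conn.execute("UPDATE orders SET status = 'shipped' WHERE id = 1")

# Analytical (OLAP) style: a read-only aggregate over many rows.
revenue_by_customer = conn.execute(
    "SELECT customer, COUNT(*), SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
print(revenue_by_customer)
```

&lt;p&gt;The operational statements touch one row at a time and change the current state; the analytical query only reads, but scans every row to build an aggregated view.&lt;/p&gt;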

&lt;h2 id=&quot;why-big-data-matters&quot;&gt;Why Big Data Matters&lt;/h2&gt;

&lt;h3 id=&quot;the-value-of-data&quot;&gt;The Value of Data&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Current hot topics&lt;/strong&gt;: Large Language Models (LLMs) / Generative AI&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Top-down perspective&lt;/strong&gt;: Data economy&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;More data → More insights → Better decisions → Business success
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Bottom-up perspective&lt;/strong&gt;: Optimization&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Understanding → Optimizing → Saving cost / Creating value
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;the-unreasonable-effectiveness-of-data&quot;&gt;The Unreasonable Effectiveness of Data&lt;/h3&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;Key Principle&lt;/strong&gt;: With more data, the same algorithm often performs much better.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;em&gt;Source: Halevy, Norvig, and Pereira (2009)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This principle shows that:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Data quantity&lt;/strong&gt; can compensate for algorithm simplicity&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;More data&lt;/strong&gt; often beats better algorithms&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Scale&lt;/strong&gt; enables new capabilities&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;what-are-platforms&quot;&gt;What are Platforms?&lt;/h2&gt;

&lt;h3 id=&quot;business-definition&quot;&gt;Business Definition&lt;/h3&gt;

&lt;p&gt;From “Platform Revolution” (Parker, Van Alstyne, and Choudary, 2016):&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;“A platform is a business based on enabling value-creating interactions between external producers and consumers. The platform provides an open, participative infrastructure for these interactions and sets governance conditions for them.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3 id=&quot;big-data-platform-interpretation&quot;&gt;Big Data Platform Interpretation&lt;/h3&gt;

&lt;p&gt;Big data platforms are &lt;strong&gt;large-scale service platforms&lt;/strong&gt; that:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Provide &lt;strong&gt;on-demand computing&lt;/strong&gt; for data-centric products&lt;/li&gt;
  &lt;li&gt;Enable &lt;strong&gt;on-demand analytics&lt;/strong&gt; services&lt;/li&gt;
  &lt;li&gt;Offer &lt;strong&gt;on-demand data management&lt;/strong&gt;&lt;/li&gt;
  &lt;li&gt;Enable &lt;strong&gt;interactions&lt;/strong&gt; between data producers and consumers&lt;/li&gt;
  &lt;li&gt;Facilitate &lt;strong&gt;exchange&lt;/strong&gt; of big data and data products&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: A big data platform is more than a database or a data marketplace, however large.&lt;/p&gt;

&lt;h2 id=&quot;data-perspectives&quot;&gt;Data Perspectives&lt;/h2&gt;

&lt;h3 id=&quot;data-as-an-asset&quot;&gt;Data as an Asset&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Concept&lt;/strong&gt;: Data is valuable and must be managed and exploited&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Ownership&lt;/strong&gt;: Clear data ownership&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Governance&lt;/strong&gt;: Policies and controls&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Protection&lt;/strong&gt;: Security and compliance&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Exploitation&lt;/strong&gt;: Maximize value extraction&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;data-as-a-product&quot;&gt;Data as a Product&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Concept&lt;/strong&gt;: Product thinking for data processing and delivery&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;User satisfaction&lt;/strong&gt;: Data users are customers&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Quality&lt;/strong&gt;: Data must meet user needs&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Discoverability&lt;/strong&gt;: Easy to find and understand&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Self-service&lt;/strong&gt;: Users can access independently&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;Modern approach&lt;/strong&gt;: Combine both perspectives for effective data management&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2 id=&quot;big-data-platform-architecture&quot;&gt;Big Data Platform Architecture&lt;/h2&gt;

&lt;h3 id=&quot;onion-architecture&quot;&gt;Onion Architecture&lt;/h3&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;┌─────────────────────────────────────────┐
│  Big Data Services &amp;amp; Applications       │
│  (Analytics, ML, BI, Dashboards)        │
├─────────────────────────────────────────┤
│  Middleware Platforms                   │
│  (Building, deploying, operating        │
│   reliable big data services)           │
├─────────────────────────────────────────┤
│  Data-Centric Virtualized               │
│  Infrastructures                        │
│  (Compute, Storage, Network)            │
├─────────────────────────────────────────┤
│  Consumers &amp;amp; Producers                  │
│  (Sensors, Things, People, Processes)   │
└─────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;core-platform-services&quot;&gt;Core Platform Services&lt;/h3&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;┌─────────────────────────────────────────┐
│  Data Sources                           │
│  (IoT, B2B/B2C, Transport, etc.)        │
└────────────┬────────────────────────────┘
             │
┌────────────▼────────────────────────────┐
│  Core Services                          │
│  ┌──────────┐  ┌──────────┐            │
│  │  Data    │  │  Data    │            │
│  │  Ingest  │  │  Store   │            │
│  └──────────┘  └──────────┘            │
│  ┌──────────┐  ┌──────────┐            │
│  │  Data    │  │  Data    │            │
│  │Processing│  │  Query   │            │
│  └──────────┘  └──────────┘            │
└────────────┬────────────────────────────┘
             │
┌────────────▼────────────────────────────┐
│  Analytics &amp;amp; Applications               │
│  - ML Algorithms &amp;amp; Pipelines            │
│  - Visualization                        │
│  - Business Applications                │
└─────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;big-data-pipelines&quot;&gt;Big Data Pipelines&lt;/h2&gt;

&lt;h3 id=&quot;typical-pipeline-stages&quot;&gt;Typical Pipeline Stages&lt;/h3&gt;

&lt;h4 id=&quot;1-data-ingestion&quot;&gt;1. Data Ingestion&lt;/h4&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Extract&lt;/strong&gt; data from various sources&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Transform&lt;/strong&gt; data into usable formats&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Load&lt;/strong&gt; data into storage (ETL)&lt;/li&gt;
&lt;/ul&gt;
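&lt;p&gt;The three ETL steps can be sketched as three small functions. This is a minimal illustration using only the Python standard library; the CSV content, the field names, and the in-memory SQLite target are assumptions made for the example, not part of any particular platform.&lt;/p&gt;

```python
import csv
import io
import sqlite3

# Hypothetical raw export from a source system (normally a file or an API response).
RAW_CSV = """sensor_id,temperature_c,timestamp
s1,21.5,2024-01-01T00:00:00
s2,,2024-01-01T00:00:05
s1,22.1,2024-01-01T00:05:00
"""

def extract(text):
    """Extract: parse raw CSV text into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop rows without a reading and convert types."""
    cleaned = []
    for row in rows:
        if not row["temperature_c"]:
            continue  # skip incomplete records
        cleaned.append((row["sensor_id"], float(row["temperature_c"]), row["timestamp"]))
    return cleaned

def load(conn, records):
    """Load: write the cleaned records into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS readings (sensor_id TEXT, temperature_c REAL, ts TEXT)"
    )
    with conn:
        conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", records)

conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_CSV)))
loaded_count = conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0]
print(loaded_count)
```

&lt;p&gt;Keeping the steps as separate functions makes it possible to test each one in isolation and to swap the source or the target without touching the transformation logic.&lt;/p&gt;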

&lt;h4 id=&quot;2-data-storage&quot;&gt;2. Data Storage&lt;/h4&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Raw data&lt;/strong&gt; storage (data lake)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Processed data&lt;/strong&gt; storage&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Analytical data&lt;/strong&gt; storage (data warehouse)&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;3-data-processing&quot;&gt;3. Data Processing&lt;/h4&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Batch processing&lt;/strong&gt; (Hadoop, Spark)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Stream processing&lt;/strong&gt; (Kafka Streams, Flink)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Machine learning&lt;/strong&gt; (training, inference)&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;4-data-analysis&quot;&gt;4. Data Analysis&lt;/h4&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Querying&lt;/strong&gt; (SQL, NoSQL)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Analytics&lt;/strong&gt; (aggregations, statistics)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Visualization&lt;/strong&gt; (dashboards, reports)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;example-pipeline&quot;&gt;Example Pipeline&lt;/h3&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Data Source → Ingestion → Raw Data Store
                ↓
         Data Processing
                ↓
         Analytical Store → Visualization
                ↓
         Machine Learning → Model Serving
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;core-principles-for-big-data-platforms&quot;&gt;Core Principles for Big Data Platforms&lt;/h2&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;┌─────────────────────────────────────────┐
│  You (Big Data Platform Expert)         │
│                                         │
│  ┌─────────────┐  ┌─────────────┐      │
│  │Programming  │  │ Data Mgmt   │      │
│  │Models &amp;amp;     │  │ Models &amp;amp;    │      │
│  │Frameworks   │  │ Tools       │      │
│  └─────────────┘  └─────────────┘      │
│                                         │
│  ┌─────────────────────────────┐       │
│  │ Large-Scale Computing       │       │
│  │ Platforms (Service-Based)   │       │
│  └─────────────────────────────┘       │
│                                         │
│  ┌─────────────────────────────┐       │
│  │ Provisioning, Automation    │       │
│  │ and Analytics Processes     │       │
│  └─────────────────────────────┘       │
└─────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;key-focus-areas&quot;&gt;Key Focus Areas&lt;/h2&gt;

&lt;h3 id=&quot;1-designdevelopment-vs-operation&quot;&gt;1. Design/Development vs Operation&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Design&lt;/strong&gt;: Architecture, data models, algorithms&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Operation&lt;/strong&gt;: Deployment, monitoring, maintenance&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;2-data-centric-vs-service-centric&quot;&gt;2. Data-Centric vs Service-Centric vs Platform-Centric&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Data-centric&lt;/strong&gt;: Data models, storage, governance&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Service-centric&lt;/strong&gt;: APIs, microservices, orchestration&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Platform-centric&lt;/strong&gt;: Infrastructure, scalability, reliability&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;3-sql-style-vs-programmatic-processing&quot;&gt;3. SQL-Style vs Programmatic Processing&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;SQL-style&lt;/strong&gt;: Declarative queries, BI tools&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Programmatic&lt;/strong&gt;: MapReduce, Spark, custom workflows&lt;/li&gt;
&lt;/ul&gt;
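&lt;p&gt;The contrast between the two styles can be illustrated by computing the same aggregate twice. The trip data below is made up, SQLite stands in for a SQL engine, and plain Python map and reduce stand in for a framework such as MapReduce or Spark.&lt;/p&gt;

```python
import sqlite3
from collections import defaultdict
from functools import reduce

# Hypothetical trip records: (pickup zone, fare).
TRIPS = [("midtown", 12.0), ("harlem", 8.5), ("midtown", 20.0), ("soho", 15.0)]

# SQL-style: declare WHAT result is wanted; the engine decides how to compute it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (zone TEXT, fare REAL)")
conn.executemany("INSERT INTO trips VALUES (?, ?)", TRIPS)
sql_totals = dict(conn.execute("SELECT zone, SUM(fare) FROM trips GROUP BY zone"))

# Programmatic: spell out HOW, as explicit map and reduce steps.
def add_pair(totals, pair):
    zone, fare = pair
    totals[zone] += fare
    return totals

mapped = map(lambda trip: (trip[0], trip[1]), TRIPS)  # map: emit (key, value) pairs
prog_totals = dict(reduce(add_pair, mapped, defaultdict(float)))  # reduce: sum per key

print(sql_totals == prog_totals)
```

&lt;p&gt;Both paths produce the same totals. The declarative version is shorter and leaves optimization to the engine; the programmatic version is more verbose but can express custom logic that is awkward in SQL.&lt;/p&gt;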

&lt;h3 id=&quot;4-quality-and-governance&quot;&gt;4. Quality and Governance&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Data quality&lt;/strong&gt;: Accuracy, completeness, timeliness&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Governance&lt;/strong&gt;: Policies, compliance, security&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;learning-goals&quot;&gt;Learning Goals&lt;/h2&gt;

&lt;h3 id=&quot;as-a-user&quot;&gt;As a User&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;Able to &lt;strong&gt;use and program&lt;/strong&gt; atop big data platforms&lt;/li&gt;
  &lt;li&gt;Understand available services and APIs&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;as-a-provider&quot;&gt;As a Provider&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;Able to &lt;strong&gt;operate&lt;/strong&gt; big data platforms&lt;/li&gt;
  &lt;li&gt;Monitor, scale, and maintain systems&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;as-a-designerarchitect&quot;&gt;As a Designer/Architect&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;Able to &lt;strong&gt;design&lt;/strong&gt; new solutions for big data platforms&lt;/li&gt;
  &lt;li&gt;Make architectural decisions&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;as-a-developer&quot;&gt;As a Developer&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;Able to &lt;strong&gt;develop&lt;/strong&gt; services/applications in big data platforms&lt;/li&gt;
  &lt;li&gt;Implement data pipelines and analytics&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;data-governance-concerns&quot;&gt;Data Governance Concerns&lt;/h2&gt;

&lt;h3 id=&quot;key-challenges&quot;&gt;Key Challenges&lt;/h3&gt;

&lt;h4 id=&quot;1-data-quality&quot;&gt;1. Data Quality&lt;/h4&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Accuracy&lt;/strong&gt;: Is the data correct?&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Completeness&lt;/strong&gt;: Is all data present?&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Timeliness&lt;/strong&gt;: Is data current?&lt;/li&gt;
&lt;/ul&gt;
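&lt;p&gt;These three dimensions can be turned into simple automated checks. The sketch below is illustrative only: the record layout, the plausible temperature range, and the ten-minute freshness limit are assumptions chosen for the example.&lt;/p&gt;

```python
from datetime import datetime, timedelta

# Hypothetical sensor records; the field names and thresholds are illustrative.
NOW = datetime(2024, 1, 1, 12, 0, 0)
RECORDS = [
    {"sensor_id": "s1", "temperature_c": 21.5, "ts": datetime(2024, 1, 1, 11, 58)},
    {"sensor_id": "s2", "temperature_c": None, "ts": datetime(2024, 1, 1, 11, 59)},
    {"sensor_id": "s3", "temperature_c": 950.0, "ts": datetime(2024, 1, 1, 11, 59)},
    {"sensor_id": "s4", "temperature_c": 19.0, "ts": datetime(2024, 1, 1, 9, 0)},
]

def quality_issues(record, now=NOW, max_age=timedelta(minutes=10)):
    """Return the list of quality dimensions that a record violates."""
    issues = []
    value = record["temperature_c"]
    if value is None:
        issues.append("completeness")  # a required field is missing
    elif not (60.0 >= value >= -50.0):
        issues.append("accuracy")      # outside the assumed plausible range
    if now - record["ts"] > max_age:
        issues.append("timeliness")    # reading is older than the freshness limit
    return issues

report = {r["sensor_id"]: quality_issues(r) for r in RECORDS}
print(report)
```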

&lt;h4 id=&quot;2-data-lineage&quot;&gt;2. Data Lineage&lt;/h4&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Tracking&lt;/strong&gt;: Where did data come from?&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Transformations&lt;/strong&gt;: How was it processed?&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Audit trail&lt;/strong&gt;: Who accessed it?&lt;/li&gt;
&lt;/ul&gt;
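&lt;p&gt;One way to picture lineage is as metadata that travels with each dataset and answers the three questions above. The class below is a hypothetical sketch, not the data model of any specific lineage tool; the dataset and user names are invented.&lt;/p&gt;

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class LineageRecord:
    """Hypothetical lineage metadata attached to one dataset."""
    dataset: str
    source: str                                           # where the data came from
    transformations: list = field(default_factory=list)  # how it was processed
    accesses: list = field(default_factory=list)         # who accessed it, and when

    def add_transformation(self, step):
        self.transformations.append(step)

    def log_access(self, user):
        self.accesses.append((user, datetime.now(timezone.utc).isoformat()))

lineage = LineageRecord(dataset="daily_sales", source="orders_db.orders")
lineage.add_transformation("filter: status = shipped")
lineage.add_transformation("aggregate: SUM(amount) by day")
lineage.log_access("analyst_1")

print(lineage.source, lineage.transformations, [user for user, _ in lineage.accesses])
```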

&lt;h4 id=&quot;3-multi-tenancy&quot;&gt;3. Multi-Tenancy&lt;/h4&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Isolation&lt;/strong&gt;: Separate tenant data&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;SLAs&lt;/strong&gt;: Different service levels&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Security&lt;/strong&gt;: Access control&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;4-compliance&quot;&gt;4. Compliance&lt;/h4&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;GDPR&lt;/strong&gt;: Right to be forgotten&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Privacy&lt;/strong&gt;: Data protection&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Regulations&lt;/strong&gt;: Industry-specific rules&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;example-concerns&quot;&gt;Example Concerns&lt;/h3&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Data → Analysis → Result
  ↓       ↓         ↓
Where?  Price?  Quality?
Privacy? Ethics? Compliance?
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;processing-models&quot;&gt;Processing Models&lt;/h2&gt;

&lt;h3 id=&quot;batch-processing&quot;&gt;Batch Processing&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Characteristics&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Process large volumes of data&lt;/li&gt;
  &lt;li&gt;Not real-time: results are available hours to days later&lt;/li&gt;
  &lt;li&gt;High throughput&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Daily customer transactions → Batch ETL → Data warehouse
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;stream-processing&quot;&gt;Stream Processing&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Characteristics&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Process data as it arrives&lt;/li&gt;
  &lt;li&gt;Low latency (seconds to minutes)&lt;/li&gt;
  &lt;li&gt;Continuous processing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;IoT sensor data → Stream processing → Real-time alerts
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
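&lt;p&gt;The flow above can be sketched with a Python generator that handles each reading as it arrives and keeps only a small rolling window in memory. The window size, the threshold, and the sample readings are assumptions made for illustration; a production system would read from an unbounded source and use a stream processing framework.&lt;/p&gt;

```python
from collections import deque

def alerts(readings, window_size=3, threshold=30.0):
    """Yield an alert whenever the rolling average exceeds the threshold.

    Each reading is handled as it arrives; only the last `window_size`
    values are kept in memory.
    """
    window = deque(maxlen=window_size)
    for sensor_id, value in readings:
        window.append(value)
        average = sum(window) / len(window)
        if average > threshold:
            yield (sensor_id, round(average, 2))

# Hypothetical stream; in practice this would be an unbounded source such as a message queue.
stream = iter([("s1", 25.0), ("s1", 29.0), ("s1", 35.0), ("s1", 40.0), ("s1", 20.0)])
triggered = list(alerts(stream))
print(triggered)
```

&lt;p&gt;Unlike a batch job, the generator never needs the full dataset: memory use is bounded by the window size, and an alert can be emitted as soon as the triggering reading arrives.&lt;/p&gt;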

&lt;h3 id=&quot;real-time-analytics&quot;&gt;Real-Time Analytics&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Characteristics&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Immediate insights&lt;/li&gt;
  &lt;li&gt;Sub-second latency&lt;/li&gt;
  &lt;li&gt;Interactive queries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;User clicks → Real-time recommendation engine
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;/h2&gt;

&lt;p&gt;Big data platforms are essential infrastructure for modern data-driven applications:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Big data&lt;/strong&gt; is characterized by Volume, Variety, Velocity, and Veracity&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Platforms&lt;/strong&gt; enable interactions between data producers and consumers&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Architectures&lt;/strong&gt; follow layered approaches (onion model)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Pipelines&lt;/strong&gt; move data through ingestion, storage, processing, and analysis&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Governance&lt;/strong&gt; ensures quality, security, and compliance&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Multiple processing models&lt;/strong&gt; serve different use cases&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Understanding big data platforms is fundamental for building scalable, data-intensive applications.&lt;/p&gt;

&lt;h2 id=&quot;further-reading&quot;&gt;Further Reading&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/35179.pdf&quot;&gt;The Unreasonable Effectiveness of Data&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Platform-Revolution-Networked-Markets-Transforming/&quot;&gt;Platform Revolution&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.oreilly.com/library/view/data-mesh/9781492092384/&quot;&gt;Data Mesh by Zhamak Dehghani&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content>
 </entry>
 
 <entry>
   <title>Chapter 09 - Kubernetes and Container Orchestration</title>
   <link href="https://nglelinh.github.io/contents/en/chapter09/09_Introduction/"/>
   <updated>2021-01-01T00:00:00+00:00</updated>
   <id>https://nglelinh.github.io/service-oriented-architecture-and-cloud-computing-iuh/contents/en/chapter09/09_Introduction</id>
   <content type="html">&lt;p&gt;Welcome to Chapter 09: &lt;strong&gt;Kubernetes and Container Orchestration&lt;/strong&gt;.&lt;/p&gt;

&lt;h2 id=&quot;chapter-overview&quot;&gt;Chapter Overview&lt;/h2&gt;

&lt;p&gt;Kubernetes has become the de facto standard for container orchestration, enabling organizations to deploy, scale, and manage containerized applications at massive scale. This chapter explores Kubernetes architecture, core concepts, and practical usage.&lt;/p&gt;

&lt;h2 id=&quot;learning-objectives&quot;&gt;Learning Objectives&lt;/h2&gt;

&lt;p&gt;By the end of this chapter, you will be able to:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Understand container orchestration&lt;/strong&gt; and why it’s necessary&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Explain Kubernetes architecture&lt;/strong&gt; including master and worker nodes&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Work with core Kubernetes objects&lt;/strong&gt;: Pods, Services, Deployments, ReplicaSets&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Configure networking&lt;/strong&gt; in Kubernetes clusters&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Deploy and manage applications&lt;/strong&gt; using Kubernetes&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Implement self-healing&lt;/strong&gt; and auto-scaling patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;topics-covered&quot;&gt;Topics Covered&lt;/h2&gt;

&lt;h3 id=&quot;1-container-orchestration-fundamentals&quot;&gt;1. Container Orchestration Fundamentals&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;What is orchestration and why do we need it?&lt;/li&gt;
  &lt;li&gt;Orchestration tasks: resource allocation, scaling, load balancing&lt;/li&gt;
  &lt;li&gt;Orchestrator options: Kubernetes, Docker Swarm, Apache Mesos&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;2-kubernetes-architecture&quot;&gt;2. Kubernetes Architecture&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;Control plane components&lt;/li&gt;
  &lt;li&gt;Worker node components&lt;/li&gt;
  &lt;li&gt;Self-healing and desired state management&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;3-core-kubernetes-objects&quot;&gt;3. Core Kubernetes Objects&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Namespaces&lt;/strong&gt;: Logical cluster partitioning&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Pods&lt;/strong&gt;: Smallest deployable units&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;ReplicaSets&lt;/strong&gt;: Ensuring pod availability&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Deployments&lt;/strong&gt;: Managing application versions&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Services&lt;/strong&gt;: Stable network endpoints&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;4-kubernetes-networking&quot;&gt;4. Kubernetes Networking&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;Container-to-container communication&lt;/li&gt;
  &lt;li&gt;Pod-to-pod communication&lt;/li&gt;
  &lt;li&gt;Service discovery and DNS&lt;/li&gt;
  &lt;li&gt;Service types: ClusterIP, NodePort, LoadBalancer&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;5-advanced-concepts&quot;&gt;5. Advanced Concepts&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;ConfigMaps and Secrets&lt;/li&gt;
  &lt;li&gt;Labels and selectors&lt;/li&gt;
  &lt;li&gt;Health checks and probes&lt;/li&gt;
  &lt;li&gt;Jobs and DaemonSets&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;why-kubernetes-matters&quot;&gt;Why Kubernetes Matters&lt;/h2&gt;

&lt;p&gt;Kubernetes solves critical challenges in modern application deployment:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Scale&lt;/strong&gt;: Manage thousands of containers across hundreds of nodes&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Resilience&lt;/strong&gt;: Automatic recovery from failures&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Portability&lt;/strong&gt;: Run anywhere (on-premises, cloud, hybrid)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Efficiency&lt;/strong&gt;: Optimal resource utilization&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Velocity&lt;/strong&gt;: Faster deployment and iteration cycles&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;prerequisites&quot;&gt;Prerequisites&lt;/h2&gt;

&lt;p&gt;To get the most out of this chapter, you should understand:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Container concepts (Docker) from Chapter 08&lt;/li&gt;
  &lt;li&gt;Basic networking concepts&lt;/li&gt;
  &lt;li&gt;YAML configuration syntax&lt;/li&gt;
  &lt;li&gt;Command-line operations&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;real-world-applications&quot;&gt;Real-World Applications&lt;/h2&gt;

&lt;p&gt;Kubernetes powers some of the world’s largest applications:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Google&lt;/strong&gt;: Created Kubernetes based on Borg, its internal system that launches billions of containers per week&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Spotify&lt;/strong&gt;: Manages global music streaming infrastructure&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Airbnb&lt;/strong&gt;: Handles dynamic scaling for travel bookings&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Pokémon GO&lt;/strong&gt;: Scaled to handle massive launch traffic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s dive into the world of Kubernetes!&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>09-03 Services, Deployments, and Networking</title>
   <link href="https://nglelinh.github.io/contents/en/chapter09/09_03_Services_Deployments/"/>
   <updated>2021-01-01T00:00:00+00:00</updated>
   <id>https://nglelinh.github.io/service-oriented-architecture-and-cloud-computing-iuh/contents/en/chapter09/09_03_Services_Deployments</id>
   <content type="html">&lt;p&gt;While Pods are the fundamental unit in Kubernetes, they are ephemeral and can be replaced at any time. Services provide stable network endpoints, and Deployments manage pod replicas and updates. This lecture covers these essential Kubernetes objects.&lt;/p&gt;

&lt;h2 id=&quot;the-networking-challenge&quot;&gt;The Networking Challenge&lt;/h2&gt;

&lt;h3 id=&quot;problem-pods-are-ephemeral&quot;&gt;Problem: Pods are Ephemeral&lt;/h3&gt;

&lt;p&gt;Pods have dynamic, changing IP addresses:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Pod nginx-abc (IP: 10.244.1.5) → Crashes
Pod nginx-xyz (IP: 10.244.2.8) → Created as replacement

How do clients find the new pod?
How do we load balance across multiple pods?
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;communication-challenges&quot;&gt;Communication Challenges&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Between pods&lt;/strong&gt;: Using hardcoded IPs fails when pods are rescheduled&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;From outside&lt;/strong&gt;: Clients need to track all pods providing a service and load balance across them&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Service discovery&lt;/strong&gt;: How to find which pods provide which services?&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;kubernetes-services&quot;&gt;Kubernetes Services&lt;/h2&gt;

&lt;h3 id=&quot;what-is-a-service&quot;&gt;What is a Service?&lt;/h3&gt;

&lt;p&gt;A &lt;strong&gt;Service&lt;/strong&gt; is:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;An &lt;strong&gt;abstraction&lt;/strong&gt; defining a logical set of pods&lt;/li&gt;
  &lt;li&gt;A &lt;strong&gt;stable network endpoint&lt;/strong&gt; (IP and DNS name)&lt;/li&gt;
  &lt;li&gt;A &lt;strong&gt;load balancer&lt;/strong&gt; across pod backends&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Durable&lt;/strong&gt; - survives pod restarts and rescheduling&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;┌─────────────────────────────────────────────┐
│  Service: web-service                       │
│  IP: 10.96.28.176 (stable)                  │
│  DNS: web-service.default.svc.cluster.local │
└────────────┬────────────────────────────────┘
             │ Load balances to:
             ├→ Pod 1 (10.244.1.5)
             ├→ Pod 2 (10.244.1.6)
             ├→ Pod 3 (10.244.2.7)
             └→ Pod 4 (10.244.2.8)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;service-characteristics&quot;&gt;Service Characteristics&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Static cluster IP&lt;/strong&gt;: Allocated on creation, doesn’t change&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Static DNS name&lt;/strong&gt;: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;service-name&amp;gt;.&amp;lt;namespace&amp;gt;.svc.cluster.local&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Label selector&lt;/strong&gt;: Selects pods to route traffic to&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Load balancing&lt;/strong&gt;: Distributes traffic across ready pods (those passing their readiness probes)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Service discovery&lt;/strong&gt;: Automatic DNS registration&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;service-specification&quot;&gt;Service Specification&lt;/h3&gt;

&lt;div class=&quot;language-yaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;na&quot;&gt;apiVersion&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;v1&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;kind&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;Service&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;metadata&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;web-service&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;spec&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;selector&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;app&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;nginx&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;env&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;prod&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;ports&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;protocol&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;TCP&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;port&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;80&lt;/span&gt;        &lt;span class=&quot;c1&quot;&gt;# Service port&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;targetPort&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;8080&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# Pod port&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;ClusterIP&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
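
&lt;p&gt;As a minimal sketch, a Pod that this Service would select needs both labels from the selector and a container listening on the target port. The pod name and the image below are hypothetical placeholders, and the readiness probe is an assumed way of marking the pod ready for traffic:&lt;/p&gt;

&lt;div class=&quot;language-yaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: web-1              # hypothetical pod name
  labels:
    app: nginx             # must match spec.selector.app of web-service
    env: prod              # must match spec.selector.env of web-service
spec:
  containers:
  - name: web
    image: example/web:1.0 # hypothetical image that listens on port 8080
    ports:
    - containerPort: 8080  # the Service's targetPort
    readinessProbe:        # only ready pods receive Service traffic
      tcpSocket:
        port: 8080
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;A pod carrying only one of the two labels would not be selected, because every key in the selector must match.&lt;/p&gt;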

&lt;h2 id=&quot;service-types&quot;&gt;Service Types&lt;/h2&gt;

&lt;p&gt;Kubernetes provides four types of services:&lt;/p&gt;

&lt;h3 id=&quot;1-clusterip-default&quot;&gt;1. ClusterIP (Default)&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Purpose&lt;/strong&gt;: Internal cluster communication only&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Accessible&lt;/strong&gt; only within the cluster&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Use case&lt;/strong&gt;: Backend services, databases&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Load balancing&lt;/strong&gt;: Across all selected pods&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;language-yaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;na&quot;&gt;apiVersion&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;v1&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;kind&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;Service&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;metadata&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;backend-service&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;spec&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;ClusterIP&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;selector&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;app&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;backend&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;ports&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;port&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;8080&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;targetPort&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;8080&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;DNS Resolution:&lt;/strong&gt;&lt;/p&gt;
&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;# From a pod in the same namespace as the Service&lt;/span&gt;
curl http://backend-service:8080
&lt;span class=&quot;c&quot;&gt;# From a pod in any namespace (fully qualified name)&lt;/span&gt;
curl http://backend-service.default.svc.cluster.local:8080
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;2-nodeport&quot;&gt;2. NodePort&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Purpose&lt;/strong&gt;: Expose service on each node’s IP at a static port&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Extends&lt;/strong&gt; ClusterIP&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Exposes&lt;/strong&gt; the same port on every node (30000-32767 range by default)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Use case&lt;/strong&gt;: Development, testing, simple external access&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;language-yaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;na&quot;&gt;apiVersion&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;v1&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;kind&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;Service&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;metadata&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;web-nodeport&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;spec&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;NodePort&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;selector&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;app&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;web&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;ports&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;port&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;80&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;targetPort&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;8080&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;nodePort&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;32410&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# Optional, auto-assigned if not specified&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Access:&lt;/strong&gt;&lt;/p&gt;
&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;# From outside cluster&lt;/span&gt;
curl http://&amp;lt;any-node-ip&amp;gt;:32410
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;External Client
      │
      ▼
Node IP:32410 ──┐
Node IP:32410 ──┼→ Service → Pods
Node IP:32410 ──┘
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;3-loadbalancer&quot;&gt;3. LoadBalancer&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Purpose&lt;/strong&gt;: Expose service via cloud provider’s load balancer&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Extends&lt;/strong&gt; NodePort&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Provisions&lt;/strong&gt; external load balancer (AWS ELB, GCP LB, Azure LB)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Use case&lt;/strong&gt;: Production external access&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Requires&lt;/strong&gt;: Cloud provider support (or an equivalent load balancer implementation for on-premises clusters)&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;language-yaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;na&quot;&gt;apiVersion&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;v1&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;kind&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;Service&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;metadata&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;web-loadbalancer&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;spec&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;LoadBalancer&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;selector&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;app&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;web&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;ports&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;port&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;80&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;targetPort&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;8080&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt;&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;External Load Balancer (172.17.18.43)
      │
      ▼
NodePort (32410)
      │
      ▼
Service (10.96.28.176)
      │
      ▼
Pods
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;4-externalname&quot;&gt;4. ExternalName&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Purpose&lt;/strong&gt;: Map service to external DNS name&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;No proxy&lt;/strong&gt; or load balancing&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Returns&lt;/strong&gt; CNAME record&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Use case&lt;/strong&gt;: Access external services with Kubernetes DNS&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;language-yaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;na&quot;&gt;apiVersion&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;v1&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;kind&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;Service&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;metadata&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;external-db&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;spec&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;ExternalName&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;externalName&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;database.example.com&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;service-discovery&quot;&gt;Service Discovery&lt;/h2&gt;

&lt;p&gt;Kubernetes provides automatic service discovery via DNS:&lt;/p&gt;

&lt;h3 id=&quot;dns-names&quot;&gt;DNS Names&lt;/h3&gt;

&lt;p&gt;Every service gets a DNS name:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&amp;lt;service-name&amp;gt;.&amp;lt;namespace&amp;gt;.svc.cluster.local
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Examples:&lt;/strong&gt;&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;web-service.default.svc.cluster.local
database.production.svc.cluster.local
api.staging.svc.cluster.local
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;using-service-discovery&quot;&gt;Using Service Discovery&lt;/h3&gt;

&lt;div class=&quot;language-yaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;# Pod can reference service by name&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;apiVersion&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;v1&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;kind&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;Pod&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;metadata&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;frontend&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;spec&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;containers&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;app&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;image&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;myapp&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;env&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;BACKEND_URL&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;value&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;http://backend-service:8080&quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
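
&lt;p&gt;A short name such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;backend-service&lt;/code&gt; resolves only for pods in the same namespace as the Service. As a sketch, assuming a Service named &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;database&lt;/code&gt; in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;production&lt;/code&gt; namespace (as in the DNS examples) and a client pod placed in a hypothetical &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;staging&lt;/code&gt; namespace, the pod would use the namespace-qualified name:&lt;/p&gt;

&lt;div class=&quot;language-yaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: frontend
  namespace: staging        # assumed: a different namespace than the Service
spec:
  containers:
  - name: app
    image: myapp
    env:
    - name: DATABASE_HOST   # hypothetical variable name read by the app
      # database.production also resolves; the full form is unambiguous
      value: database.production.svc.cluster.local
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;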

&lt;h3 id=&quot;kube-proxy&quot;&gt;kube-proxy&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;kube-proxy&lt;/strong&gt; implements Services:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Runs on &lt;strong&gt;every node&lt;/strong&gt;&lt;/li&gt;
  &lt;li&gt;Maintains &lt;strong&gt;iptables rules&lt;/strong&gt; (or IPVS)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Routes&lt;/strong&gt; traffic to pod backends&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Load balances&lt;/strong&gt; across pods&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Client Request
      │
      ▼
iptables or IPVS rules (programmed by kube-proxy)
      │
      ├→ Pod 1 (~1/3 of connections)
      ├→ Pod 2 (~1/3 of connections)
      └→ Pod 3 (~1/3 of connections)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;fundamental-networking-rules&quot;&gt;Fundamental Networking Rules&lt;/h2&gt;

&lt;p&gt;Kubernetes networking follows these principles:&lt;/p&gt;

&lt;ol&gt;
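  &lt;li&gt;&lt;strong&gt;All containers within a pod&lt;/strong&gt; can communicate with each other unimpeded over &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;localhost&lt;/code&gt;&lt;/li&gt;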
  &lt;li&gt;&lt;strong&gt;All pods can communicate&lt;/strong&gt; with all other pods without NAT&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;All nodes can communicate&lt;/strong&gt; with all pods (and vice versa) without NAT&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;The IP a pod sees itself as&lt;/strong&gt; is the same IP others see it as&lt;/li&gt;
&lt;/ol&gt;
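
&lt;p&gt;The first rule follows from the containers of a pod sharing one network namespace. A minimal sketch, with hypothetical pod and container names; the sidecar image is an assumption, chosen only because it provides &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;curl&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-yaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: web-with-sidecar       # hypothetical name
spec:
  containers:
  - name: web
    image: nginx:1.21          # serves HTTP on port 80
    ports:
    - containerPort: 80
  - name: poller               # hypothetical sidecar container
    image: curlimages/curl     # assumed image that provides curl
    command: [&quot;sh&quot;, &quot;-c&quot;]
    args:
    # Reaches the web container through the shared loopback interface
    - while true; do curl -s -o /dev/null http://localhost:80; sleep 5; done
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Because the two containers share one IP address, they must not bind the same port.&lt;/p&gt;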

&lt;h2 id=&quot;deployments&quot;&gt;Deployments&lt;/h2&gt;

&lt;h3 id=&quot;the-problem-with-bare-pods&quot;&gt;The Problem with Bare Pods&lt;/h3&gt;

&lt;p&gt;Managing pods directly is problematic:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;No automatic replacement&lt;/strong&gt; if the pod is deleted or its node fails (the kubelet only restarts containers in place)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;No scaling&lt;/strong&gt; - must create each pod manually&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;No rolling updates&lt;/strong&gt; - must manually replace pods&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;No rollback&lt;/strong&gt; capability&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;what-is-a-deployment&quot;&gt;What is a Deployment?&lt;/h3&gt;

&lt;p&gt;A &lt;strong&gt;Deployment&lt;/strong&gt; provides:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Declarative updates&lt;/strong&gt; for pods&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Replica management&lt;/strong&gt; - ensures desired number of pods&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Rolling updates&lt;/strong&gt; - zero-downtime deployments&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Rollback&lt;/strong&gt; capability&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Scaling&lt;/strong&gt; - easy scale up/down&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Deployment
    │
    ├→ ReplicaSet (v1)
    │     ├→ Pod 1
    │     ├→ Pod 2
    │     └→ Pod 3
    │
    └→ ReplicaSet (v2) - new version
          ├→ Pod 4
          ├→ Pod 5
          └→ Pod 6
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;deployment-specification&quot;&gt;Deployment Specification&lt;/h3&gt;

&lt;div class=&quot;language-yaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;na&quot;&gt;apiVersion&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;apps/v1&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;kind&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;Deployment&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;metadata&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;nginx-deployment&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;spec&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;replicas&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;3&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;revisionHistoryLimit&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;3&lt;/span&gt;
  
  &lt;span class=&quot;na&quot;&gt;selector&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;matchLabels&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;app&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;nginx&lt;/span&gt;
  
  &lt;span class=&quot;na&quot;&gt;strategy&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;RollingUpdate&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;rollingUpdate&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;maxSurge&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;maxUnavailable&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;0&lt;/span&gt;
  
  &lt;span class=&quot;na&quot;&gt;template&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;metadata&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;labels&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;na&quot;&gt;app&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;nginx&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;spec&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;containers&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;nginx&lt;/span&gt;
        &lt;span class=&quot;na&quot;&gt;image&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;nginx:1.21&lt;/span&gt;
        &lt;span class=&quot;na&quot;&gt;ports&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;containerPort&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;80&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
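
&lt;p&gt;To receive traffic, the pods created by this Deployment need a Service whose selector matches the template labels and whose target port matches the container port. A minimal sketch, where the Service name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nginx-service&lt;/code&gt; is a hypothetical choice:&lt;/p&gt;

&lt;div class=&quot;language-yaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;apiVersion: v1
kind: Service
metadata:
  name: nginx-service   # hypothetical name
spec:
  type: ClusterIP
  selector:
    app: nginx          # matches template.metadata.labels in nginx-deployment
  ports:
  - protocol: TCP
    port: 80            # port exposed by the Service
    targetPort: 80      # matches containerPort in the pod template
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Note that the earlier &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;web-service&lt;/code&gt; example selects &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;app: nginx&lt;/code&gt; together with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;env: prod&lt;/code&gt; and targets port 8080, so it would not match these pods as written.&lt;/p&gt;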

&lt;h3 id=&quot;key-deployment-fields&quot;&gt;Key Deployment Fields&lt;/h3&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Field&lt;/th&gt;
      &lt;th&gt;Purpose&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;replicas&lt;/code&gt;&lt;/td&gt;
      &lt;td&gt;Desired number of pod instances&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;selector&lt;/code&gt;&lt;/td&gt;
      &lt;td&gt;Label selector for pods&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;template&lt;/code&gt;&lt;/td&gt;
      &lt;td&gt;Pod template for creating pods&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;strategy&lt;/code&gt;&lt;/td&gt;
      &lt;td&gt;Update strategy (RollingUpdate or Recreate)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;revisionHistoryLimit&lt;/code&gt;&lt;/td&gt;
      &lt;td&gt;Number of old ReplicaSets to retain&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;h2 id=&quot;update-strategies&quot;&gt;Update Strategies&lt;/h2&gt;

&lt;h3 id=&quot;1-rollingupdate-default&quot;&gt;1. RollingUpdate (Default)&lt;/h3&gt;

&lt;p&gt;Gradually replaces old pods with new ones:&lt;/p&gt;

&lt;div class=&quot;language-yaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;na&quot;&gt;strategy&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;RollingUpdate&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;rollingUpdate&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;maxSurge&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;        &lt;span class=&quot;c1&quot;&gt;# Max pods above desired count&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;maxUnavailable&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;0&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# Max pods below desired count&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Process:&lt;/strong&gt;&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Initial: [v1] [v1] [v1]
Step 1:  [v1] [v1] [v1] [v2]  (maxSurge: 1)
Step 2:  [v1] [v1] [v2]       (remove old)
Step 3:  [v1] [v1] [v2] [v2]
Step 4:  [v1] [v2] [v2]
Step 5:  [v1] [v2] [v2] [v2]
Final:   [v2] [v2] [v2]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
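
&lt;p&gt;Both fields also accept percentages of the desired replica count, and each defaults to 25% when no value is given. As a sketch with an assumed replica count of 10, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;maxSurge&lt;/code&gt; is rounded up and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;maxUnavailable&lt;/code&gt; is rounded down:&lt;/p&gt;

&lt;div class=&quot;language-yaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;spec:
  replicas: 10             # assumed for the arithmetic below
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%        # ceil(10 * 0.25) = 3 extra pods, so at most 13 in total
      maxUnavailable: 25%  # floor(10 * 0.25) = 2, so at least 8 pods stay available
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;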

&lt;h3 id=&quot;2-recreate&quot;&gt;2. Recreate&lt;/h3&gt;

&lt;p&gt;Kills all old pods before creating new ones:&lt;/p&gt;

&lt;div class=&quot;language-yaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;na&quot;&gt;strategy&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;Recreate&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Process:&lt;/strong&gt;&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Initial: [v1] [v1] [v1]
Step 1:  [ ]  [ ]  [ ]   (kill all)
Step 2:  [v2] [v2] [v2]  (create new)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;Downtime&lt;/strong&gt;: The Recreate strategy causes downtime between the old pods terminating and the new pods becoming ready&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2 id=&quot;working-with-deployments&quot;&gt;Working with Deployments&lt;/h2&gt;

&lt;h3 id=&quot;creating-deployments&quot;&gt;Creating Deployments&lt;/h3&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;# From YAML&lt;/span&gt;
kubectl apply &lt;span class=&quot;nt&quot;&gt;-f&lt;/span&gt; deployment.yaml

&lt;span class=&quot;c&quot;&gt;# Imperative&lt;/span&gt;
kubectl create deployment nginx &lt;span class=&quot;nt&quot;&gt;--image&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;nginx &lt;span class=&quot;nt&quot;&gt;--replicas&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;3

&lt;span class=&quot;c&quot;&gt;# kubectl create deployment sets app=nginx automatically and has no --labels flag;&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# add further labels to the Deployment object afterwards&lt;/span&gt;
kubectl label deployment nginx &lt;span class=&quot;nv&quot;&gt;env&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;prod
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;viewing-deployments&quot;&gt;Viewing Deployments&lt;/h3&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;# List deployments&lt;/span&gt;
kubectl get deployments

&lt;span class=&quot;c&quot;&gt;# Detailed information&lt;/span&gt;
kubectl describe deployment nginx-deployment

&lt;span class=&quot;c&quot;&gt;# View rollout status&lt;/span&gt;
kubectl rollout status deployment nginx-deployment

&lt;span class=&quot;c&quot;&gt;# View rollout history&lt;/span&gt;
kubectl rollout &lt;span class=&quot;nb&quot;&gt;history &lt;/span&gt;deployment nginx-deployment
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;updating-deployments&quot;&gt;Updating Deployments&lt;/h3&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;# Update image&lt;/span&gt;
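kubectl &lt;span class=&quot;nb&quot;&gt;set &lt;/span&gt;image deployment/nginx-deployment &lt;span class=&quot;nv&quot;&gt;nginx&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;nginx:1.22

&lt;span class=&quot;c&quot;&gt;# Record the reason for the change (the --record flag is deprecated)&lt;/span&gt;
kubectl annotate deployment/nginx-deployment kubernetes.io/change-cause&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;update nginx to 1.22&quot;&lt;/span&gt;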

&lt;span class=&quot;c&quot;&gt;# Edit deployment&lt;/span&gt;
kubectl edit deployment nginx-deployment

&lt;span class=&quot;c&quot;&gt;# Scale deployment&lt;/span&gt;
kubectl scale deployment nginx-deployment &lt;span class=&quot;nt&quot;&gt;--replicas&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;5

&lt;span class=&quot;c&quot;&gt;# Apply updated YAML&lt;/span&gt;
kubectl apply &lt;span class=&quot;nt&quot;&gt;-f&lt;/span&gt; deployment.yaml
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;rolling-back&quot;&gt;Rolling Back&lt;/h3&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;# Rollback to previous version&lt;/span&gt;
kubectl rollout undo deployment nginx-deployment

&lt;span class=&quot;c&quot;&gt;# Rollback to specific revision&lt;/span&gt;
kubectl rollout undo deployment nginx-deployment &lt;span class=&quot;nt&quot;&gt;--to-revision&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;2

&lt;span class=&quot;c&quot;&gt;# Pause rollout&lt;/span&gt;
kubectl rollout pause deployment nginx-deployment

&lt;span class=&quot;c&quot;&gt;# Resume rollout&lt;/span&gt;
kubectl rollout resume deployment nginx-deployment
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
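
&lt;p&gt;Rollbacks depend on the revision history that the Deployment keeps. The fragment below is a minimal sketch of the two fields that shape that history, assuming the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nginx-deployment&lt;/code&gt; name used above. The annotation text and the limit of 5 are illustrative values, and the selector and pod template are omitted.&lt;/p&gt;

&lt;div class=&quot;language-yaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  annotations:
    # Shown in the CHANGE-CAUSE column of kubectl rollout history
    kubernetes.io/change-cause: &quot;update nginx to 1.22&quot;
spec:
  # Number of old ReplicaSets kept for rollback (defaults to 10)
  revisionHistoryLimit: 5
  # selector and template omitted for brevity
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;A revision that has been pruned by this limit can no longer be targeted with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--to-revision&lt;/code&gt;.&lt;/p&gt;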

&lt;h2 id=&quot;replicasets&quot;&gt;ReplicaSets&lt;/h2&gt;

&lt;h3 id=&quot;what-is-a-replicaset&quot;&gt;What is a ReplicaSet?&lt;/h3&gt;

&lt;p&gt;A &lt;strong&gt;ReplicaSet&lt;/strong&gt; ensures a specified number of pod replicas are running:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Created automatically&lt;/strong&gt; by Deployments&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Maintains&lt;/strong&gt; desired number of pods&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Replaces&lt;/strong&gt; failed pods&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Adopts&lt;/strong&gt; existing pods with matching labels&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;Best Practice&lt;/strong&gt;: Don’t create ReplicaSets directly. Use Deployments instead.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3 id=&quot;replicaset-specification&quot;&gt;ReplicaSet Specification&lt;/h3&gt;

&lt;div class=&quot;language-yaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;na&quot;&gt;apiVersion&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;apps/v1&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;kind&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;ReplicaSet&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;metadata&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;rs-example&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;spec&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;replicas&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;3&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;selector&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;matchLabels&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;app&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;nginx&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;env&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;prod&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;template&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;metadata&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;labels&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;na&quot;&gt;app&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;nginx&lt;/span&gt;
        &lt;span class=&quot;na&quot;&gt;env&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;prod&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;spec&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;containers&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;nginx&lt;/span&gt;
        &lt;span class=&quot;na&quot;&gt;image&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;nginx:stable-alpine&lt;/span&gt;
        &lt;span class=&quot;na&quot;&gt;ports&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;containerPort&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;80&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;replicaset-behavior&quot;&gt;ReplicaSet Behavior&lt;/h3&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;# If you have 3 pods and delete one&lt;/span&gt;
kubectl delete pod nginx-abc

&lt;span class=&quot;c&quot;&gt;# ReplicaSet immediately creates a new pod&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# to maintain replicas: 3&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

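&lt;p&gt;&lt;strong&gt;Quarantine pods&lt;/strong&gt;: Remove a label that the selector matches. The ReplicaSet stops managing the pod and creates a replacement, while the quarantined pod keeps running so you can debug it.&lt;/p&gt;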
&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;kubectl label pod nginx-abc app-
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;labels-and-selectors&quot;&gt;Labels and Selectors&lt;/h2&gt;

&lt;h3 id=&quot;labels&quot;&gt;Labels&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Labels&lt;/strong&gt; are key-value pairs attached to objects:&lt;/p&gt;

&lt;div class=&quot;language-yaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;na&quot;&gt;metadata&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;labels&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;app&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;nginx&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;env&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;prod&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;tier&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;frontend&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;version&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;v1.2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Characteristics:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Not unique&lt;/strong&gt; - multiple objects can have same labels&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Flexible&lt;/strong&gt; - add/remove anytime&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Queryable&lt;/strong&gt; - select objects by labels&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;selectors&quot;&gt;Selectors&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Selectors&lt;/strong&gt; filter objects by labels:&lt;/p&gt;

&lt;h4 id=&quot;equality-based&quot;&gt;Equality-Based&lt;/h4&gt;

&lt;div class=&quot;language-yaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;na&quot;&gt;selector&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;matchLabels&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;app&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;nginx&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;env&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;prod&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;# kubectl with label selector&lt;/span&gt;
kubectl get pods &lt;span class=&quot;nt&quot;&gt;-l&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;app&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;nginx
kubectl get pods &lt;span class=&quot;nt&quot;&gt;-l&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;app&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;nginx,env&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;prod
kubectl get pods &lt;span class=&quot;nt&quot;&gt;-l&lt;/span&gt; app!&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;nginx
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h4 id=&quot;set-based&quot;&gt;Set-Based&lt;/h4&gt;

&lt;div class=&quot;language-yaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;na&quot;&gt;selector&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;matchExpressions&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;key&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;env&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;operator&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;In&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;values&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;pi&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;prod&quot;&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;staging&quot;&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;]&lt;/span&gt;
  &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;key&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;tier&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;operator&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;NotIn&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;values&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;pi&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;cache&quot;&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;]&lt;/span&gt;
  &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;key&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;app&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;operator&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;Exists&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Operators&lt;/strong&gt;: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;In&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NotIn&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Exists&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DoesNotExist&lt;/code&gt;&lt;/p&gt;
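
&lt;p&gt;Both selector styles can appear in the same selector, in which case every requirement must hold. The fragment below is a small sketch of that combination and of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DoesNotExist&lt;/code&gt; operator, which the example above does not use. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;canary&lt;/code&gt; label key is a hypothetical name chosen for illustration.&lt;/p&gt;

&lt;div class=&quot;language-yaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;selector:
  # Equality-based requirement
  matchLabels:
    app: nginx
  # Set-based requirements, ANDed with matchLabels
  matchExpressions:
  - key: env
    operator: In
    values: [&quot;prod&quot;, &quot;staging&quot;]
  # Matches only pods that carry no canary label at all (hypothetical key)
  - key: canary
    operator: DoesNotExist
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Exists&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DoesNotExist&lt;/code&gt; test only the key, so they take no &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;values&lt;/code&gt; list.&lt;/p&gt;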

&lt;h2 id=&quot;complete-example-web-application&quot;&gt;Complete Example: Web Application&lt;/h2&gt;

&lt;div class=&quot;language-yaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;# Deployment&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;apiVersion&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;apps/v1&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;kind&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;Deployment&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;metadata&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;web-app&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;spec&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;replicas&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;3&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;selector&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;matchLabels&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;app&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;web&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;template&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;metadata&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;labels&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;na&quot;&gt;app&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;web&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;spec&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;containers&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;nginx&lt;/span&gt;
        &lt;span class=&quot;na&quot;&gt;image&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;nginx:1.21&lt;/span&gt;
        &lt;span class=&quot;na&quot;&gt;ports&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;containerPort&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;80&lt;/span&gt;
&lt;span class=&quot;nn&quot;&gt;---&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# Service&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;apiVersion&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;v1&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;kind&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;Service&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;metadata&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;web-service&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;spec&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;LoadBalancer&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;selector&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;app&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;web&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;ports&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;port&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;80&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;targetPort&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;80&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Deploy:&lt;/strong&gt;&lt;/p&gt;
&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;kubectl apply &lt;span class=&quot;nt&quot;&gt;-f&lt;/span&gt; web-app.yaml

&lt;span class=&quot;c&quot;&gt;# Check deployment&lt;/span&gt;
kubectl get deployments
kubectl get pods
kubectl get services

&lt;span class=&quot;c&quot;&gt;# Access application&lt;/span&gt;
kubectl get service web-service
&lt;span class=&quot;c&quot;&gt;# Use the EXTERNAL-IP value once the load balancer has assigned one&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;/h2&gt;

&lt;p&gt;Kubernetes Services and Deployments provide production-ready application management:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Services&lt;/strong&gt; provide stable network endpoints and load balancing&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Four service types&lt;/strong&gt;: ClusterIP, NodePort, LoadBalancer, ExternalName&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Deployments&lt;/strong&gt; manage pod replicas and updates&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Rolling updates&lt;/strong&gt; enable zero-downtime deployments&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;ReplicaSets&lt;/strong&gt; ensure desired pod count&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Labels and selectors&lt;/strong&gt; enable flexible object grouping&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Together, these objects form the foundation of Kubernetes application deployment.&lt;/p&gt;

&lt;h2 id=&quot;further-reading&quot;&gt;Further Reading&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://kubernetes.io/docs/concepts/services-networking/service/&quot;&gt;Kubernetes Services&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://kubernetes.io/docs/concepts/workloads/controllers/deployment/&quot;&gt;Kubernetes Deployments&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/&quot;&gt;Labels and Selectors&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content>
 </entry>
 
 <entry>
   <title>09-02 Pods and Workload Management</title>
   <link href="https://nglelinh.github.io/contents/en/chapter09/09_02_Pods/"/>
   <updated>2021-01-01T00:00:00+00:00</updated>
   <id>https://nglelinh.github.io/service-oriented-architecture-and-cloud-computing-iuh/contents/en/chapter09/09_02_Pods</id>
   <content type="html">&lt;p&gt;Pods are the smallest and most fundamental deployable units in Kubernetes. Understanding pods and how to manage them is essential for working with Kubernetes effectively.&lt;/p&gt;

&lt;h2 id=&quot;what-is-a-pod&quot;&gt;What is a Pod?&lt;/h2&gt;

&lt;h3 id=&quot;definition&quot;&gt;Definition&lt;/h3&gt;

&lt;p&gt;A &lt;strong&gt;Pod&lt;/strong&gt; is:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;The &lt;strong&gt;smallest deployable unit&lt;/strong&gt; in Kubernetes&lt;/li&gt;
  &lt;li&gt;A group of &lt;strong&gt;one or more containers&lt;/strong&gt;&lt;/li&gt;
  &lt;li&gt;Containers in a pod &lt;strong&gt;share resources&lt;/strong&gt;&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Scheduled together&lt;/strong&gt; on the same node&lt;/li&gt;
  &lt;li&gt;Represents a &lt;strong&gt;single instance&lt;/strong&gt; of an application&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;Important: Can you deploy a container in Kubernetes?&lt;/strong&gt;&lt;/p&gt;

  &lt;p&gt;&lt;strong&gt;NO&lt;/strong&gt; (not directly). The smallest unit is a Pod, which contains one or more containers.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3 id=&quot;why-pods-not-containers&quot;&gt;Why Pods, Not Containers?&lt;/h3&gt;

&lt;p&gt;Kubernetes uses pods instead of containers directly because:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Colocation&lt;/strong&gt;: Some containers need to be kept together&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Resource sharing&lt;/strong&gt;: Containers in a pod share network and storage&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Atomic unit&lt;/strong&gt;: All containers in a pod are scheduled together&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Lifecycle management&lt;/strong&gt;: Pod represents a single logical application&lt;/li&gt;
&lt;/ol&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;┌─────────────────────────────────────┐
│  Pod                                │
│  ┌───────────┐    ┌───────────┐    │
│  │Container 1│    │Container 2│    │
│  │  (App)    │    │ (Sidecar) │    │
│  └───────────┘    └───────────┘    │
│                                     │
│  Shared:                            │
│  • Network namespace (same IP)      │
│  • IPC namespace                    │
│  • Volumes                          │
│  • Hostname                         │
└─────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;pod-characteristics&quot;&gt;Pod Characteristics&lt;/h2&gt;

&lt;h3 id=&quot;shared-resources&quot;&gt;Shared Resources&lt;/h3&gt;

&lt;p&gt;Containers within the same pod share:&lt;/p&gt;

&lt;h4 id=&quot;1-network-namespace&quot;&gt;1. Network Namespace&lt;/h4&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Same IP address&lt;/strong&gt; and port space&lt;/li&gt;
  &lt;li&gt;Communicate via &lt;strong&gt;localhost&lt;/strong&gt;&lt;/li&gt;
  &lt;li&gt;Share one port space, so two containers in the pod cannot listen on the same port&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Pod IP: 10.244.1.5 (shared by both containers)
  Container 1: localhost:8080
  Container 2: localhost:9090
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
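
&lt;p&gt;To make the shared network namespace concrete, here is a minimal sketch of a two-container pod in which a helper container reaches the web server through &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;localhost&lt;/code&gt;. The pod name, the container names, and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;busybox:1.36&lt;/code&gt; image are assumptions made for illustration.&lt;/p&gt;

&lt;div class=&quot;language-yaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: shared-network-demo   # hypothetical name
spec:
  containers:
  - name: web
    image: nginx:1.21
    ports:
    - containerPort: 80
  - name: poller
    image: busybox:1.36
    # No Service or pod IP needed: localhost:80 is the nginx container,
    # because both containers share one network namespace.
    command: [&quot;sh&quot;, &quot;-c&quot;, &quot;while true; do wget -q -O- http://localhost:80; sleep 10; done&quot;]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;If the second container also tried to listen on port 80, it would fail to bind, since the port is already taken inside the shared namespace.&lt;/p&gt;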

&lt;h4 id=&quot;2-ipc-namespace&quot;&gt;2. IPC Namespace&lt;/h4&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Inter-Process Communication&lt;/strong&gt; channels&lt;/li&gt;
  &lt;li&gt;Can use System V IPC or POSIX message queues&lt;/li&gt;
  &lt;li&gt;Enables fast communication between containers&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;3-hostname&quot;&gt;3. Hostname&lt;/h4&gt;
&lt;ul&gt;
  &lt;li&gt;All containers see the &lt;strong&gt;same hostname&lt;/strong&gt;&lt;/li&gt;
  &lt;li&gt;Hostname defaults to the pod name&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;4-volumes&quot;&gt;4. Volumes&lt;/h4&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Shared storage&lt;/strong&gt; mounted to pod&lt;/li&gt;
  &lt;li&gt;All containers can access same volumes&lt;/li&gt;
  &lt;li&gt;Enables data sharing between containers&lt;/li&gt;
&lt;/ul&gt;
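
&lt;p&gt;A shared volume could look like the sketch below, in which one container writes a file and the other reads it. It assumes an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;emptyDir&lt;/code&gt; volume, which lives only as long as the pod; the pod, container, and volume names and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/data&lt;/code&gt; path are hypothetical.&lt;/p&gt;

&lt;div class=&quot;language-yaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: shared-volume-demo   # hypothetical name
spec:
  volumes:
  # Scratch space created with the pod and deleted with it
  - name: shared-data
    emptyDir: {}
  containers:
  - name: writer
    image: busybox:1.36
    command: [&quot;sh&quot;, &quot;-c&quot;, &quot;while true; do date | tee /data/now.txt; sleep 5; done&quot;]
    volumeMounts:
    - name: shared-data
      mountPath: /data
  - name: reader
    image: busybox:1.36
    # Reads the file that the writer container produces
    command: [&quot;sh&quot;, &quot;-c&quot;, &quot;while true; do cat /data/now.txt; sleep 5; done&quot;]
    volumeMounts:
    - name: shared-data
      mountPath: /data
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Each container chooses its own &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mountPath&lt;/code&gt;; the paths are identical here only to keep the example short.&lt;/p&gt;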

&lt;h3 id=&quot;separate-resources&quot;&gt;Separate Resources&lt;/h3&gt;

&lt;p&gt;Each container has its own:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;cgroup&lt;/strong&gt; (CPU and RAM allocation)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Filesystem&lt;/strong&gt; (unless using shared volumes)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Process namespace&lt;/strong&gt; (by default)&lt;/li&gt;
&lt;/ul&gt;
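
&lt;p&gt;The process namespace is separate only by default. As a hedged sketch, the pod-level &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;shareProcessNamespace&lt;/code&gt; field lets containers in a pod see each other&#39;s processes, which can help when debugging from a sidecar. The pod and container names below are hypothetical.&lt;/p&gt;

&lt;div class=&quot;language-yaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: shared-pid-demo   # hypothetical name
spec:
  # Opt in to a single process namespace for all containers in the pod
  shareProcessNamespace: true
  containers:
  - name: web
    image: nginx:1.21
  - name: debug
    image: busybox:1.36
    # Running ps in this container also lists the nginx processes
    command: [&quot;sh&quot;, &quot;-c&quot;, &quot;sleep 3600&quot;]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The cgroup and filesystem of each container stay separate even with this setting enabled.&lt;/p&gt;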

&lt;h2 id=&quot;pod-specification&quot;&gt;Pod Specification&lt;/h2&gt;

&lt;h3 id=&quot;basic-pod-yaml&quot;&gt;Basic Pod YAML&lt;/h3&gt;

&lt;div class=&quot;language-yaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;na&quot;&gt;apiVersion&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;v1&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;kind&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;Pod&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;metadata&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;nginx-pod&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;labels&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;app&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;web&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;env&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;prod&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;spec&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;containers&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;nginx&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;image&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;nginx:1.21&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;ports&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;containerPort&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;80&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;complete-pod-example&quot;&gt;Complete Pod Example&lt;/h3&gt;

&lt;div class=&quot;language-yaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;na&quot;&gt;apiVersion&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;v1&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;kind&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;Pod&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;metadata&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;instructor-test-01&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;labels&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;app&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;sklearn&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;tier&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;backend&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;spec&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;restartPolicy&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;Never&lt;/span&gt;
  
  &lt;span class=&quot;na&quot;&gt;containers&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;instructor-sklearn&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;image&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;crdsba6190deveastus001.azurecr.io/instructor_sklearn:latest&lt;/span&gt;
    
    &lt;span class=&quot;c1&quot;&gt;# Volume mounts&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;volumeMounts&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;datalake&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;mountPath&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;/mnt/datalake/&quot;&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;readOnly&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kc&quot;&gt;false&lt;/span&gt;
    
    &lt;span class=&quot;c1&quot;&gt;# Command to run&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;command&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;pi&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;/bin/bash&quot;&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;-c&quot;&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;--&quot;&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;]&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;args&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;pi&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;while&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;true;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;do&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;sleep&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;30;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s&quot;&gt;done;&quot;&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;]&lt;/span&gt;
    
    &lt;span class=&quot;c1&quot;&gt;# Resource limits&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;resources&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;limits&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;na&quot;&gt;memory&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;2Gi&quot;&lt;/span&gt;
        &lt;span class=&quot;na&quot;&gt;cpu&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;200m&quot;&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;requests&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;na&quot;&gt;memory&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;1Gi&quot;&lt;/span&gt;
        &lt;span class=&quot;na&quot;&gt;cpu&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;100m&quot;&lt;/span&gt;
  
  &lt;span class=&quot;c1&quot;&gt;# Image pull secrets&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;imagePullSecrets&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;acr-secret&lt;/span&gt;
  
  &lt;span class=&quot;c1&quot;&gt;# Volumes&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;volumes&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;datalake&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;persistentVolumeClaim&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;claimName&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;pvc-datalake-class-blob&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;key-spec-fields&quot;&gt;Key Spec Fields&lt;/h3&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Field&lt;/th&gt;
      &lt;th&gt;Purpose&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;containers&lt;/code&gt;&lt;/td&gt;
      &lt;td&gt;List of containers in the pod&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;volumes&lt;/code&gt;&lt;/td&gt;
      &lt;td&gt;Storage volumes available to containers&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;restartPolicy&lt;/code&gt;&lt;/td&gt;
      &lt;td&gt;When to restart containers (Always, OnFailure, Never)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;imagePullSecrets&lt;/code&gt;&lt;/td&gt;
      &lt;td&gt;Credentials for private registries&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nodeSelector&lt;/code&gt;&lt;/td&gt;
      &lt;td&gt;Select which nodes can run this pod&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tolerations&lt;/code&gt;&lt;/td&gt;
      &lt;td&gt;Allow pod to run on tainted nodes&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;affinity&lt;/code&gt;&lt;/td&gt;
      &lt;td&gt;Advanced scheduling rules&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;h2 id=&quot;when-to-use-multiple-containers-in-a-pod&quot;&gt;When to Use Multiple Containers in a Pod&lt;/h2&gt;

&lt;p&gt;Use multiple containers in a pod when:&lt;/p&gt;

&lt;h3 id=&quot;1-impossible-to-work-on-different-machines&quot;&gt;1. Containers That Cannot Run on Different Machines&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;The containers need to share a local filesystem&lt;/li&gt;
  &lt;li&gt;The containers communicate over IPC&lt;/li&gt;
  &lt;li&gt;The containers are so tightly coupled that they must start, stop, and scale together&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;2-adapter-pattern&quot;&gt;2. Adapter Pattern&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;One container translates the main container’s output or interface into a standard form&lt;/li&gt;
  &lt;li&gt;Example: An exporter that exposes application metrics in the format a monitoring system expects&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;3-sidecar-pattern&quot;&gt;3. Sidecar Pattern&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;One container offers support for the main container&lt;/li&gt;
  &lt;li&gt;Examples: Logging, monitoring, proxying&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;┌─────────────────────────────────────┐
│  Pod: Web Application               │
│  ┌──────────────┐  ┌──────────────┐ │
│  │ Main App     │  │ Log Shipper  │ │
│  │ (nginx)      │  │ (fluentd)    │ │
│  │              │  │              │ │
│  │ Writes logs  │→ │ Ships logs   │ │
│  │ to /var/log  │  │ to central   │ │
│  └──────────────┘  └──────────────┘ │
│         ↓                 ↓         │
│    [Shared Volume: /var/log]        │
└─────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
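
&lt;p&gt;A minimal sketch of the sidecar layout in the diagram above, assuming an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;emptyDir&lt;/code&gt; volume as the shared log directory. The pod, container, and volume names are illustrative, and a busybox container that tails the access log stands in for a real log shipper such as fluentd, whose configuration is omitted here.&lt;/p&gt;

&lt;div class=&quot;language-yaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: web-with-log-sidecar    # hypothetical name
spec:
  containers:
  # Main container: writes its logs into the shared volume
  - name: nginx
    image: nginx:1.21
    volumeMounts:
    - name: logs
      mountPath: /var/log/nginx
  # Sidecar: reads the same files (stand-in for a log shipper)
  - name: log-shipper
    image: busybox
    command: [&quot;sh&quot;, &quot;-c&quot;, &quot;tail -F /var/log/nginx/access.log&quot;]
    volumeMounts:
    - name: logs
      mountPath: /var/log/nginx
      readOnly: true
  volumes:
  # emptyDir lives as long as the pod and is visible to both containers
  - name: logs
    emptyDir: {}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Because both containers mount the same volume, the logs are handed over through the filesystem rather than over the network.&lt;/p&gt;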

&lt;h3 id=&quot;4-ambassador-pattern&quot;&gt;4. Ambassador Pattern&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;One container proxies the main container’s connections to services outside the pod&lt;/li&gt;
  &lt;li&gt;Example: Database proxy&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;Best Practice&lt;/strong&gt;: Most pods should have &lt;strong&gt;one container&lt;/strong&gt;. Only use multiple containers when there’s a strong reason.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2 id=&quot;pod-lifecycle&quot;&gt;Pod Lifecycle&lt;/h2&gt;

&lt;h3 id=&quot;pod-phases&quot;&gt;Pod Phases&lt;/h3&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Phase&lt;/th&gt;
      &lt;th&gt;Description&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Pending&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Pod accepted by the cluster, but one or more containers are not yet running (still waiting to be scheduled or pulling images)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Running&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Pod bound to a node and all containers created; at least one is running, starting, or restarting&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Succeeded&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;All containers exited with status 0 and will not be restarted&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Failed&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;All containers terminated, at least one failed&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Unknown&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Pod state cannot be determined&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;h3 id=&quot;pod-lifecycle-flow&quot;&gt;Pod Lifecycle Flow&lt;/h3&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Create Pod
    ↓
[Pending] → Scheduler assigns to node
    ↓
[Running] → Containers start
    ↓
    ├→ [Succeeded] → All containers exit 0
    ├→ [Failed] → All containers terminated, at least one exited non-zero
    └→ [Unknown] → Communication with the node lost
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
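
&lt;p&gt;To observe the terminal phases, a run-to-completion pod is a convenient sketch. The pod name and command are made up for illustration. With &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;restartPolicy: Never&lt;/code&gt; the container is not restarted, so an exit code of 0 should leave the pod in Succeeded, and changing the command to exit with 1 should leave it in Failed.&lt;/p&gt;

&lt;div class=&quot;language-yaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: phase-demo            # hypothetical name
spec:
  restartPolicy: Never        # do not restart the container after it exits
  containers:
  - name: task
    image: busybox
    # Exits 0 after five seconds; replace &quot;exit 0&quot; with &quot;exit 1&quot; to observe Failed
    command: [&quot;sh&quot;, &quot;-c&quot;, &quot;sleep 5; exit 0&quot;]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The phase can then be read with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kubectl get pod phase-demo -o jsonpath=&quot;{.status.phase}&quot;&lt;/code&gt;.&lt;/p&gt;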

&lt;h2 id=&quot;pod-scheduling&quot;&gt;Pod Scheduling&lt;/h2&gt;

&lt;h3 id=&quot;how-pods-are-scheduled&quot;&gt;How Pods are Scheduled&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;User creates&lt;/strong&gt; pod (via kubectl or API)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;API Server&lt;/strong&gt; validates and stores in etcd&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Scheduler&lt;/strong&gt; watches for unscheduled pods&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Scheduler selects&lt;/strong&gt; a node based on:
    &lt;ul&gt;
      &lt;li&gt;Resource requirements (CPU, memory)&lt;/li&gt;
      &lt;li&gt;Node selectors and affinity rules&lt;/li&gt;
      &lt;li&gt;Taints and tolerations&lt;/li&gt;
      &lt;li&gt;Resources already requested by the pods on each node&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Scheduler binds&lt;/strong&gt; pod to node&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;kubelet&lt;/strong&gt; on node starts containers&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;scheduling-constraints&quot;&gt;Scheduling Constraints&lt;/h3&gt;

&lt;div class=&quot;language-yaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;na&quot;&gt;apiVersion&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;v1&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;kind&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;Pod&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;metadata&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;gpu-pod&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;spec&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;# Node selector - simple key-value matching&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;nodeSelector&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;gpu&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;nvidia&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;disktype&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;ssd&lt;/span&gt;
  
  &lt;span class=&quot;na&quot;&gt;containers&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;cuda-app&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;image&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;nvidia/cuda&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
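
&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nodeSelector&lt;/code&gt; covers exact label matches only. The sketch below shows the two other constraints the scheduler considers, a toleration and a node affinity rule. The taint &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gpu=nvidia:NoSchedule&lt;/code&gt; and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nvme&lt;/code&gt; label value are assumptions made for this example; a real cluster defines its own taints and labels.&lt;/p&gt;

&lt;div class=&quot;language-yaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod-affinity      # hypothetical name
spec:
  # Toleration - lets the pod land on nodes carrying a matching taint
  tolerations:
  - key: &quot;gpu&quot;                # assumed taint key
    operator: &quot;Equal&quot;
    value: &quot;nvidia&quot;
    effect: &quot;NoSchedule&quot;

  # Node affinity - like nodeSelector, but with set-based operators
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: disktype
            operator: In
            values: [&quot;ssd&quot;, &quot;nvme&quot;]

  containers:
  - name: cuda-app
    image: nvidia/cuda        # same image as the example above
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;A toleration only permits scheduling onto a tainted node; it is the affinity rule (or a node selector) that actually steers the pod toward particular nodes.&lt;/p&gt;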

&lt;h3 id=&quot;pod-immutability&quot;&gt;Pod Immutability&lt;/h3&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;Warning: Pods are immutable once scheduled&lt;/strong&gt;&lt;/p&gt;

  &lt;p&gt;Once a pod is scheduled to a node, it &lt;strong&gt;never moves&lt;/strong&gt;. If the node dies, the pod must be &lt;strong&gt;deleted&lt;/strong&gt; and a new pod &lt;strong&gt;recreated&lt;/strong&gt; on another node. Most fields of an existing pod’s spec cannot be changed either.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is why we use higher-level controllers (ReplicaSets, Deployments) instead of managing pods directly: the controller notices that a pod is missing and creates a replacement.&lt;/p&gt;

&lt;h2 id=&quot;pod-health-checks&quot;&gt;Pod Health Checks&lt;/h2&gt;

&lt;p&gt;Kubernetes provides three types of probes to monitor pod health:&lt;/p&gt;

&lt;h3 id=&quot;1-liveness-probe&quot;&gt;1. Liveness Probe&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Purpose&lt;/strong&gt;: Is the application running?&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;If it fails → Kubernetes &lt;strong&gt;restarts&lt;/strong&gt; the container&lt;/li&gt;
  &lt;li&gt;Detects deadlocks and hung processes&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;language-yaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;na&quot;&gt;livenessProbe&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;httpGet&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;path&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;/healthz&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;port&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;8080&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;initialDelaySeconds&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;15&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;periodSeconds&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;10&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;2-readiness-probe&quot;&gt;2. Readiness Probe&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Purpose&lt;/strong&gt;: Is the application ready to serve traffic?&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;If it fails → the pod is &lt;strong&gt;removed from Service&lt;/strong&gt; endpoints&lt;/li&gt;
  &lt;li&gt;The application keeps running; it just stops receiving traffic&lt;/li&gt;
  &lt;li&gt;Useful during startup or when temporarily overloaded&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;language-yaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;na&quot;&gt;readinessProbe&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;httpGet&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;path&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;/ready&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;port&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;8080&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;initialDelaySeconds&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;5&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;periodSeconds&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;5&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;3-startup-probe&quot;&gt;3. Startup Probe&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Purpose&lt;/strong&gt;: Has the application started?&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;For slow-starting applications&lt;/li&gt;
  &lt;li&gt;Disables liveness/readiness checks until startup succeeds&lt;/li&gt;
  &lt;li&gt;If it fails → the container is restarted&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;language-yaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;na&quot;&gt;startupProbe&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;httpGet&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;path&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;/startup&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;port&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;8080&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;failureThreshold&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;30&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;periodSeconds&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;10&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
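
&lt;p&gt;The three probes are configured per container and can be combined. In the sketch below (the container name and image are placeholders), the startup probe gives the application up to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;failureThreshold&lt;/code&gt; × &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;periodSeconds&lt;/code&gt; = 30 × 10 = 300 seconds to start. Only after it succeeds do the liveness and readiness probes begin.&lt;/p&gt;

&lt;div class=&quot;language-yaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;spec:
  containers:
  - name: app                 # placeholder container
    image: myapp
    ports:
    - containerPort: 8080
    # Runs first; up to 30 failures, 10 seconds apart (300 seconds in total)
    startupProbe:
      httpGet:
        path: /startup
        port: 8080
      failureThreshold: 30
      periodSeconds: 10
    # Starts after the startup probe succeeds; failure restarts the container
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
    # Starts after the startup probe succeeds; failure removes the pod from endpoints
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      periodSeconds: 5
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;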

&lt;h3 id=&quot;probe-types&quot;&gt;Probe Types&lt;/h3&gt;

&lt;p&gt;Probes can check health in several ways. The three most common mechanisms are:&lt;/p&gt;

&lt;div class=&quot;language-yaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;# 1. HTTP GET&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;livenessProbe&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;httpGet&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;path&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;/health&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;port&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;8080&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;httpHeaders&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;Custom-Header&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;value&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;Awesome&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# 2. TCP Socket&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;livenessProbe&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;tcpSocket&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;port&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;8080&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# 3. Exec Command&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;livenessProbe&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;exec&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;command&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;cat&lt;/span&gt;
    &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;/tmp/healthy&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
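
&lt;p&gt;Newer Kubernetes releases also offer a fourth mechanism, a gRPC probe, for applications that implement the gRPC Health Checking Protocol. A sketch, with the port number chosen arbitrarily:&lt;/p&gt;

&lt;div class=&quot;language-yaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# 4. gRPC
livenessProbe:
  grpc:
    port: 9090                # assumed port of the gRPC health service
  periodSeconds: 10
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;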

&lt;h2 id=&quot;working-with-pods&quot;&gt;Working with Pods&lt;/h2&gt;

&lt;h3 id=&quot;creating-pods&quot;&gt;Creating Pods&lt;/h3&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;# From YAML file&lt;/span&gt;
kubectl apply &lt;span class=&quot;nt&quot;&gt;-f&lt;/span&gt; pod.yaml

&lt;span class=&quot;c&quot;&gt;# Imperative (quick testing)&lt;/span&gt;
kubectl run nginx &lt;span class=&quot;nt&quot;&gt;--image&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;nginx &lt;span class=&quot;nt&quot;&gt;--port&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;80

&lt;span class=&quot;c&quot;&gt;# With labels&lt;/span&gt;
kubectl run nginx &lt;span class=&quot;nt&quot;&gt;--image&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;nginx &lt;span class=&quot;nt&quot;&gt;--labels&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;app=web,env=prod&quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;viewing-pods&quot;&gt;Viewing Pods&lt;/h3&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;# List all pods&lt;/span&gt;
kubectl get pods

&lt;span class=&quot;c&quot;&gt;# List pods with more details&lt;/span&gt;
kubectl get pods &lt;span class=&quot;nt&quot;&gt;-o&lt;/span&gt; wide

&lt;span class=&quot;c&quot;&gt;# List pods with labels&lt;/span&gt;
kubectl get pods &lt;span class=&quot;nt&quot;&gt;--show-labels&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# Filter by label&lt;/span&gt;
kubectl get pods &lt;span class=&quot;nt&quot;&gt;-l&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;app&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;nginx

&lt;span class=&quot;c&quot;&gt;# Watch pods in real-time&lt;/span&gt;
kubectl get pods &lt;span class=&quot;nt&quot;&gt;--watch&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;inspecting-pods&quot;&gt;Inspecting Pods&lt;/h3&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;# Detailed information&lt;/span&gt;
kubectl describe pod nginx-pod

&lt;span class=&quot;c&quot;&gt;# View logs&lt;/span&gt;
kubectl logs nginx-pod

&lt;span class=&quot;c&quot;&gt;# Logs from specific container (multi-container pod)&lt;/span&gt;
kubectl logs nginx-pod &lt;span class=&quot;nt&quot;&gt;-c&lt;/span&gt; nginx

&lt;span class=&quot;c&quot;&gt;# Follow logs&lt;/span&gt;
kubectl logs &lt;span class=&quot;nt&quot;&gt;-f&lt;/span&gt; nginx-pod

&lt;span class=&quot;c&quot;&gt;# Previous container logs (if restarted)&lt;/span&gt;
kubectl logs nginx-pod &lt;span class=&quot;nt&quot;&gt;--previous&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;interacting-with-pods&quot;&gt;Interacting with Pods&lt;/h3&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;# Execute command in pod&lt;/span&gt;
kubectl &lt;span class=&quot;nb&quot;&gt;exec &lt;/span&gt;nginx-pod &lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;ls&lt;/span&gt; /

&lt;span class=&quot;c&quot;&gt;# Interactive shell&lt;/span&gt;
kubectl &lt;span class=&quot;nb&quot;&gt;exec&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-it&lt;/span&gt; nginx-pod &lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt; /bin/bash

&lt;span class=&quot;c&quot;&gt;# For multi-container pods, specify container&lt;/span&gt;
kubectl &lt;span class=&quot;nb&quot;&gt;exec&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-it&lt;/span&gt; nginx-pod &lt;span class=&quot;nt&quot;&gt;-c&lt;/span&gt; nginx &lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt; /bin/bash

&lt;span class=&quot;c&quot;&gt;# Copy files to/from pod&lt;/span&gt;
kubectl &lt;span class=&quot;nb&quot;&gt;cp &lt;/span&gt;nginx-pod:/var/log/nginx.log ./nginx.log
kubectl &lt;span class=&quot;nb&quot;&gt;cp&lt;/span&gt; ./config.yaml nginx-pod:/etc/config.yaml
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;deleting-pods&quot;&gt;Deleting Pods&lt;/h3&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;# Delete specific pod&lt;/span&gt;
kubectl delete pod nginx-pod

&lt;span class=&quot;c&quot;&gt;# Delete pods by label&lt;/span&gt;
kubectl delete pods &lt;span class=&quot;nt&quot;&gt;-l&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;app&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;nginx

&lt;span class=&quot;c&quot;&gt;# Delete all pods in namespace&lt;/span&gt;
kubectl delete pods &lt;span class=&quot;nt&quot;&gt;--all&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# Force delete (immediate, no graceful shutdown)&lt;/span&gt;
kubectl delete pod nginx-pod &lt;span class=&quot;nt&quot;&gt;--grace-period&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;0 &lt;span class=&quot;nt&quot;&gt;--force&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;pod-templates&quot;&gt;Pod Templates&lt;/h2&gt;

&lt;p&gt;Higher-level controllers (Deployments, ReplicaSets) use &lt;strong&gt;Pod Templates&lt;/strong&gt;:&lt;/p&gt;

&lt;div class=&quot;language-yaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;# Pod Template (used in Deployment)&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;template&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;metadata&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;labels&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;app&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;nginx&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;spec&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;containers&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;nginx&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;image&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;nginx:1.21&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;ports&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;containerPort&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;80&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Pod templates are &lt;strong&gt;pod specs with limited metadata&lt;/strong&gt;. Controllers use these templates to create actual pods.&lt;/p&gt;
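
&lt;p&gt;For context, this is roughly how the template above sits inside a complete Deployment. The Deployment name and replica count are illustrative. The one hard requirement is that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;spec.selector.matchLabels&lt;/code&gt; matches the labels in the template.&lt;/p&gt;

&lt;div class=&quot;language-yaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment      # illustrative name
spec:
  replicas: 3                 # number of pods created from the template
  selector:
    matchLabels:
      app: nginx              # must match template.metadata.labels
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.21
        ports:
        - containerPort: 80
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;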

&lt;h2 id=&quot;configmaps-and-secrets&quot;&gt;ConfigMaps and Secrets&lt;/h2&gt;

&lt;h3 id=&quot;configmaps&quot;&gt;ConfigMaps&lt;/h3&gt;

&lt;p&gt;Store &lt;strong&gt;configuration data&lt;/strong&gt; for pods:&lt;/p&gt;

&lt;div class=&quot;language-yaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;na&quot;&gt;apiVersion&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;v1&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;kind&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;ConfigMap&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;metadata&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;app-config&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;database_url&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;postgresql://db:5432/myapp&quot;&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;log_level&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;info&quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Use in Pod:&lt;/strong&gt;&lt;/p&gt;
&lt;div class=&quot;language-yaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;na&quot;&gt;spec&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;containers&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;app&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;image&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;myapp&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;envFrom&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;configMapRef&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;na&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;app-config&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
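
&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;envFrom&lt;/code&gt; imports every key of the ConfigMap as an environment variable with the same name (here &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;database_url&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;log_level&lt;/code&gt;). When only one key is needed, or the variable should have a different name, a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;configMapKeyRef&lt;/code&gt; can be used instead. The variable name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LOG_LEVEL&lt;/code&gt; is an arbitrary choice for this sketch:&lt;/p&gt;

&lt;div class=&quot;language-yaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;spec:
  containers:
  - name: app
    image: myapp
    env:
    # Expose a single ConfigMap key under a chosen variable name
    - name: LOG_LEVEL
      valueFrom:
        configMapKeyRef:
          name: app-config
          key: log_level
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;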

&lt;h3 id=&quot;secrets&quot;&gt;Secrets&lt;/h3&gt;

&lt;p&gt;Store &lt;strong&gt;sensitive data&lt;/strong&gt; (passwords, tokens):&lt;/p&gt;

&lt;div class=&quot;language-yaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;na&quot;&gt;apiVersion&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;v1&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;kind&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;Secret&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;metadata&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;db-secret&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;Opaque&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;password&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;cGFzc3dvcmQxMjM=&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# base64 encoded&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Use in Pod:&lt;/strong&gt;&lt;/p&gt;
&lt;div class=&quot;language-yaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;na&quot;&gt;spec&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;containers&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;app&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;image&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;myapp&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;env&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;DB_PASSWORD&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;valueFrom&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;na&quot;&gt;secretKeyRef&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
          &lt;span class=&quot;na&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;db-secret&lt;/span&gt;
          &lt;span class=&quot;na&quot;&gt;key&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;password&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;blockquote&gt;
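  &lt;p&gt;&lt;strong&gt;Note: Secret volumes live in RAM, but Secrets are not encrypted by default&lt;/strong&gt;&lt;/p&gt;

  &lt;p&gt;When a Secret is mounted as a volume, the kubelet keeps it in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tmpfs&lt;/code&gt; (memory), so it is not written to the node’s disk. The values in the manifest are only base64-encoded, not encrypted, and the API server stores them unencrypted in etcd unless encryption at rest is configured.&lt;/p&gt;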
&lt;/blockquote&gt;
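
&lt;p&gt;A Secret can also be mounted as files instead of being injected as an environment variable. In this sketch the volume name and the mount path &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/etc/secrets&lt;/code&gt; are assumptions. Each key of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;db-secret&lt;/code&gt; appears as a file, so the decoded password would be readable at &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/etc/secrets/password&lt;/code&gt;.&lt;/p&gt;

&lt;div class=&quot;language-yaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;spec:
  containers:
  - name: app
    image: myapp
    volumeMounts:
    - name: db-credentials    # assumed volume name
      mountPath: /etc/secrets # assumed mount path
      readOnly: true
  volumes:
  - name: db-credentials
    secret:
      secretName: db-secret   # the Secret defined above
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;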

&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;/h2&gt;

&lt;p&gt;Pods are the fundamental building blocks of Kubernetes:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Smallest deployable unit&lt;/strong&gt; containing one or more containers&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Share network, IPC, and volumes&lt;/strong&gt; within the pod&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Scheduled atomically&lt;/strong&gt; to a single node&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Bound to one node&lt;/strong&gt; once scheduled (never moved, only replaced)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Health checks&lt;/strong&gt; ensure application reliability&lt;/li&gt;
  &lt;li&gt;Usually managed by &lt;strong&gt;higher-level controllers&lt;/strong&gt;, not directly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the next lecture, we’ll explore &lt;strong&gt;Services&lt;/strong&gt; - how to provide stable network access to pods, and &lt;strong&gt;Deployments&lt;/strong&gt; - how to manage pod replicas and updates.&lt;/p&gt;

&lt;h2 id=&quot;further-reading&quot;&gt;Further Reading&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://kubernetes.io/docs/concepts/workloads/pods/&quot;&gt;Kubernetes Pods&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/&quot;&gt;Pod Lifecycle&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/&quot;&gt;Configure Liveness, Readiness and Startup Probes&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content>
 </entry>
 
 <entry>
   <title>09-01 Kubernetes Fundamentals and Architecture</title>
   <link href="https://nglelinh.github.io/contents/en/chapter09/09_01_Kubernetes_Fundamentals/"/>
   <updated>2021-01-01T00:00:00+00:00</updated>
   <id>https://nglelinh.github.io/service-oriented-architecture-and-cloud-computing-iuh/contents/en/chapter09/09_01_Kubernetes_Fundamentals</id>
   <content type="html">&lt;p&gt;Kubernetes (K8s) is an open-source container orchestration platform that automates the deployment, scaling, and management of containerized applications. This lecture covers the fundamentals of Kubernetes and its architecture.&lt;/p&gt;

&lt;h2 id=&quot;what-is-container-orchestration&quot;&gt;What is Container Orchestration?&lt;/h2&gt;

&lt;h3 id=&quot;the-challenge&quot;&gt;The Challenge&lt;/h3&gt;

&lt;p&gt;As applications grow, managing containers manually becomes impractical:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Hundreds or thousands&lt;/strong&gt; of containers across multiple hosts&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Dynamic scaling&lt;/strong&gt; based on load&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Service discovery&lt;/strong&gt; - how do containers find each other?&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Load balancing&lt;/strong&gt; across container instances&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Health monitoring&lt;/strong&gt; and automatic recovery&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Rolling updates&lt;/strong&gt; without downtime&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Resource allocation&lt;/strong&gt; - which host should run which container?&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;orchestration-solution&quot;&gt;Orchestration Solution&lt;/h3&gt;

&lt;p&gt;An &lt;strong&gt;orchestrator&lt;/strong&gt; manages and organizes both hosts and containers running on a cluster:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;┌─────────────────────────────────────────────┐
│  Orchestrator (Kubernetes)                  │
│  ├── Resource allocation                    │
│  ├── Container scheduling                   │
│  ├── Health monitoring                      │
│  ├── Auto-scaling                           │
│  ├── Load balancing                         │
│  ├── Service discovery                      │
│  └── Rolling updates                        │
└─────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;key-orchestrator-tasks&quot;&gt;Key Orchestrator Tasks&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Manage networking and access&lt;/strong&gt; between containers&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Track state&lt;/strong&gt; of containers and nodes&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Scale services&lt;/strong&gt; up and down based on demand&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Load balance&lt;/strong&gt; traffic across container instances&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Relocate containers&lt;/strong&gt; when hosts become unresponsive&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Service discovery&lt;/strong&gt; - automatic DNS and routing&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Attach storage and configuration&lt;/strong&gt; to containers (volumes, secrets)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Rolling updates&lt;/strong&gt; and rollbacks&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;orchestrator-options&quot;&gt;Orchestrator Options&lt;/h2&gt;

&lt;h3 id=&quot;kubernetes&quot;&gt;Kubernetes&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Open-source&lt;/strong&gt; project from Google (now CNCF)&lt;/li&gt;
  &lt;li&gt;Most popular and feature-rich&lt;/li&gt;
  &lt;li&gt;Large ecosystem and community&lt;/li&gt;
  &lt;li&gt;Runs on any infrastructure&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;docker-swarm&quot;&gt;Docker Swarm&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Integrated&lt;/strong&gt; into Docker platform&lt;/li&gt;
  &lt;li&gt;Simpler to set up than Kubernetes&lt;/li&gt;
  &lt;li&gt;Good for smaller deployments&lt;/li&gt;
  &lt;li&gt;Fewer features than Kubernetes&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;apache-mesos&quot;&gt;Apache Mesos&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Cluster management&lt;/strong&gt; tool&lt;/li&gt;
  &lt;li&gt;Container orchestration via the Marathon framework&lt;/li&gt;
  &lt;li&gt;Can handle non-container workloads too&lt;/li&gt;
  &lt;li&gt;More complex setup&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;Winner&lt;/strong&gt;: Kubernetes has become the industry standard&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2 id=&quot;what-is-kubernetes&quot;&gt;What is Kubernetes?&lt;/h2&gt;

&lt;h3 id=&quot;origin-and-meaning&quot;&gt;Origin and Meaning&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Name&lt;/strong&gt;: Greek for “pilot” or “helmsman of a ship” (κυβερνήτης)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Abbreviation&lt;/strong&gt;: K8s (K + 8 letters + s)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Created by&lt;/strong&gt;: Google, based on its internal Borg system&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Open-sourced&lt;/strong&gt;: 2014&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Governed by&lt;/strong&gt;: Cloud Native Computing Foundation (CNCF)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;core-philosophy-self-healing&quot;&gt;Core Philosophy: Self-Healing&lt;/h3&gt;

&lt;p&gt;Kubernetes &lt;strong&gt;always tries to steer the cluster to its desired state&lt;/strong&gt;:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;You: &quot;I want 3 healthy instances of redis to always be running.&quot;

Kubernetes: &quot;Okay, I&apos;ll ensure there are always 3 instances up and running.&quot;

[One instance dies]

Kubernetes: &quot;Oh look, one has died. I&apos;m going to spin up a new one.&quot;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This is called the &lt;strong&gt;reconciliation loop&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Desired state&lt;/strong&gt;: What you want (3 redis instances)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Current state&lt;/strong&gt;: What actually exists (2 redis instances)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Action&lt;/strong&gt;: Kubernetes takes action to match desired state (create 1 instance)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Repeat&lt;/strong&gt;: Continuously monitor and adjust&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;kubernetes-architecture&quot;&gt;Kubernetes Architecture&lt;/h2&gt;

&lt;p&gt;Kubernetes uses a &lt;strong&gt;master-worker architecture&lt;/strong&gt;:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────┐
│  Control Plane (Master Node)                            │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  │
│  │ API Server   │  │  Scheduler   │  │  Controller  │  │
│  │              │  │              │  │   Manager    │  │
│  └──────────────┘  └──────────────┘  └──────────────┘  │
│  ┌──────────────────────────────────────────────────┐  │
│  │  etcd (Distributed Key-Value Store)              │  │
│  └──────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────┘
                          │
        ┌─────────────────┼─────────────────┐
        │                 │                 │
┌───────▼────────┐ ┌──────▼───────┐ ┌──────▼───────┐
│  Worker Node 1 │ │ Worker Node 2│ │ Worker Node 3│
│  ┌──────────┐  │ │  ┌──────────┐│ │  ┌──────────┐│
│  │ kubelet  │  │ │  │ kubelet  ││ │  │ kubelet  ││
│  ├──────────┤  │ │  ├──────────┤│ │  ├──────────┤│
│  │kube-proxy│  │ │  │kube-proxy││ │  │kube-proxy││
│  ├──────────┤  │ │  ├──────────┤│ │  ├──────────┤│
│  │Container │  │ │  │Container ││ │  │Container ││
│  │ Runtime  │  │ │  │ Runtime  ││ │  │ Runtime  ││
│  └──────────┘  │ │  └──────────┘│ │  └──────────┘│
│  [Pods...]     │ │  [Pods...]   │ │  [Pods...]   │
└────────────────┘ └──────────────┘ └──────────────┘
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;control-plane-components&quot;&gt;Control Plane Components&lt;/h3&gt;

&lt;h4 id=&quot;1-api-server&quot;&gt;1. API Server&lt;/h4&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Central management&lt;/strong&gt; point for the cluster&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;RESTful API&lt;/strong&gt; for all operations&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Authentication&lt;/strong&gt; and authorization&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Validates&lt;/strong&gt; and processes API requests&lt;/li&gt;
  &lt;li&gt;The only component that talks directly to etcd&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;2-etcd&quot;&gt;2. etcd&lt;/h4&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Distributed key-value store&lt;/strong&gt;&lt;/li&gt;
  &lt;li&gt;Stores &lt;strong&gt;all cluster state&lt;/strong&gt; and configuration&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Highly available&lt;/strong&gt; and consistent&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Source of truth&lt;/strong&gt; for cluster state&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;3-scheduler&quot;&gt;3. Scheduler&lt;/h4&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Watches&lt;/strong&gt; for newly created pods with no assigned node&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Selects&lt;/strong&gt; a node for the pod to run on&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Considers&lt;/strong&gt; resource requirements, constraints, affinity rules&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Does not&lt;/strong&gt; actually start the pod (kubelet does that)&lt;/li&gt;
&lt;/ul&gt;
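
&lt;p&gt;The inputs the scheduler weighs are declared in the pod spec. The following is a small sketch, assuming some nodes in the cluster carry the label &lt;code&gt;disktype=ssd&lt;/code&gt;; the label, the pod name, and the request sizes are illustrative and not taken from the lecture:&lt;/p&gt;

&lt;div class=&quot;language-yaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: scheduling-demo          # hypothetical name
spec:
  nodeSelector:
    disktype: ssd                # constraint: only nodes carrying this label qualify
  containers:
  - name: nginx
    image: nginx
    resources:
      requests:
        cpu: 250m                # scheduler needs a node with this much unreserved CPU
        memory: 128Mi            # ...and this much unreserved memory
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;If no node satisfies both the label and the requests, the pod stays in the Pending phase until one does.&lt;/p&gt;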

&lt;h4 id=&quot;4-controller-manager&quot;&gt;4. Controller Manager&lt;/h4&gt;
&lt;ul&gt;
  &lt;li&gt;Runs &lt;strong&gt;controller processes&lt;/strong&gt;&lt;/li&gt;
  &lt;li&gt;Each controller watches for changes and takes action&lt;/li&gt;
  &lt;li&gt;Examples:
    &lt;ul&gt;
      &lt;li&gt;&lt;strong&gt;Node Controller&lt;/strong&gt;: Monitors node health&lt;/li&gt;
      &lt;li&gt;&lt;strong&gt;Replication Controller&lt;/strong&gt;: Maintains correct number of pods&lt;/li&gt;
      &lt;li&gt;&lt;strong&gt;Endpoints Controller&lt;/strong&gt;: Populates endpoint objects&lt;/li&gt;
      &lt;li&gt;&lt;strong&gt;Service Account Controller&lt;/strong&gt;: Creates default accounts&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;worker-node-components&quot;&gt;Worker Node Components&lt;/h3&gt;

&lt;h4 id=&quot;1-kubelet&quot;&gt;1. kubelet&lt;/h4&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Agent&lt;/strong&gt; running on each node&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Ensures&lt;/strong&gt; containers are running in pods&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Communicates&lt;/strong&gt; with API server&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Reports&lt;/strong&gt; node and pod status&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Executes&lt;/strong&gt; pod specifications (PodSpecs)&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;2-kube-proxy&quot;&gt;2. kube-proxy&lt;/h4&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Network proxy&lt;/strong&gt; running on each node&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Maintains&lt;/strong&gt; network rules&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Enables&lt;/strong&gt; communication to pods&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Implements&lt;/strong&gt; Service abstraction&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Load balances&lt;/strong&gt; across pod backends&lt;/li&gt;
&lt;/ul&gt;
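
&lt;p&gt;The Service abstraction that kube-proxy implements is itself declared as an object. The following is a minimal sketch, assuming the backend pods carry the label &lt;code&gt;app: web&lt;/code&gt; and listen on port 80; the Service name is hypothetical:&lt;/p&gt;

&lt;div class=&quot;language-yaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;apiVersion: v1
kind: Service
metadata:
  name: web                      # hypothetical Service name
spec:
  selector:
    app: web                     # traffic goes to pods carrying this label
  ports:
  - port: 80                     # port exposed on the Service cluster IP
    targetPort: 80               # port the pod containers listen on
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;kube-proxy on each node watches Services like this one and programs the node’s forwarding rules so that connections to the Service’s cluster IP are spread across the matching pods.&lt;/p&gt;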

&lt;h4 id=&quot;3-container-runtime&quot;&gt;3. Container Runtime&lt;/h4&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Software&lt;/strong&gt; responsible for running containers&lt;/li&gt;
  &lt;li&gt;Examples: &lt;strong&gt;containerd&lt;/strong&gt;, &lt;strong&gt;CRI-O&lt;/strong&gt;, and &lt;strong&gt;Docker Engine&lt;/strong&gt; (through the cri-dockerd adapter, since Kubernetes 1.24 removed dockershim)&lt;/li&gt;
  &lt;li&gt;Must implement the &lt;strong&gt;Container Runtime Interface (CRI)&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;core-concepts&quot;&gt;Core Concepts&lt;/h2&gt;

&lt;h3 id=&quot;desired-state-management&quot;&gt;Desired State Management&lt;/h3&gt;

&lt;p&gt;Kubernetes operates on a &lt;strong&gt;declarative model&lt;/strong&gt;:&lt;/p&gt;

&lt;div class=&quot;language-yaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;# You declare what you want&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;apiVersion&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;apps/v1&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;kind&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;Deployment&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;metadata&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;nginx-deployment&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;spec&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;replicas&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;3&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# I want 3 instances&lt;/span&gt;
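  &lt;span class=&quot;na&quot;&gt;selector&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;           &lt;span class=&quot;c1&quot;&gt;# required in apps/v1: which pods this Deployment owns&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;matchLabels&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;app&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;nginx&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;template&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;metadata&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;labels&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;na&quot;&gt;app&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;nginx&lt;/span&gt;     &lt;span class=&quot;c1&quot;&gt;# must match the selector above&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;spec&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;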
      &lt;span class=&quot;na&quot;&gt;containers&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;nginx&lt;/span&gt;
        &lt;span class=&quot;na&quot;&gt;image&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;nginx:1.21&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Kubernetes &lt;strong&gt;continuously works&lt;/strong&gt; to achieve and maintain this state:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;If a pod crashes → Start a new one&lt;/li&gt;
  &lt;li&gt;If a node fails → Reschedule pods to healthy nodes&lt;/li&gt;
  &lt;li&gt;If you update the image → Rolling update to new version&lt;/li&gt;
  &lt;li&gt;If load increases → Scale up (with autoscaling)&lt;/li&gt;
&lt;/ul&gt;
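
&lt;p&gt;Scaling on load is opt-in: it needs a HorizontalPodAutoscaler object and a metrics source such as metrics-server. The following sketch targets the &lt;code&gt;nginx-deployment&lt;/code&gt; from the example above; the autoscaler name, the replica bounds, and the 70% CPU target are assumed values:&lt;/p&gt;

&lt;div class=&quot;language-yaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nginx-hpa                # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx-deployment
  minReplicas: 3
  maxReplicas: 10                # assumed upper bound
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # assumed target, measured against the pods&apos; CPU requests
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;CPU utilization is computed relative to the containers’ CPU requests, so the pod template would need &lt;code&gt;resources.requests.cpu&lt;/code&gt; set for this target to work.&lt;/p&gt;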

&lt;h3 id=&quot;reconciliation-loop&quot;&gt;Reconciliation Loop&lt;/h3&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;┌─────────────────────────────────────────┐
│  1. Observe current state               │
│     (What is actually running?)         │
└────────────┬────────────────────────────┘
             │
             ▼
┌─────────────────────────────────────────┐
│  2. Compare with desired state          │
│     (What should be running?)           │
└────────────┬────────────────────────────┘
             │
             ▼
┌─────────────────────────────────────────┐
│  3. Take action to reconcile            │
│     (Create, update, or delete)         │
└────────────┬────────────────────────────┘
             │
             ▼
        [Repeat continuously]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This creates a &lt;strong&gt;goal-driven, self-healing system&lt;/strong&gt;.&lt;/p&gt;
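
&lt;p&gt;Both sides of the comparison are visible on the object itself: &lt;code&gt;spec&lt;/code&gt; holds the desired state and &lt;code&gt;status&lt;/code&gt; holds what the controllers last observed. The fragment below is an illustrative excerpt of what &lt;code&gt;kubectl get deployment nginx-deployment -o yaml&lt;/code&gt; might show shortly after one pod has died; the numbers are made up for the example:&lt;/p&gt;

&lt;div class=&quot;language-yaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;spec:
  replicas: 3              # desired state, written by you
status:
  replicas: 3              # non-terminated pods targeted by the Deployment
  readyReplicas: 2         # pods currently passing their readiness checks
  unavailableReplicas: 1   # the gap the controller is working to close
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;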

&lt;h2 id=&quot;kubernetes-objects&quot;&gt;Kubernetes Objects&lt;/h2&gt;

&lt;p&gt;Everything in Kubernetes is represented as an &lt;strong&gt;object&lt;/strong&gt;:&lt;/p&gt;

&lt;h3 id=&quot;object-structure&quot;&gt;Object Structure&lt;/h3&gt;

&lt;p&gt;All Kubernetes objects have:&lt;/p&gt;

&lt;div class=&quot;language-yaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;na&quot;&gt;apiVersion&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;v1&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# API version&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;kind&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;Pod&lt;/span&gt;       &lt;span class=&quot;c1&quot;&gt;# Object type&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;metadata&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;       &lt;span class=&quot;c1&quot;&gt;# Object metadata&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;my-pod&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;labels&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;app&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;web&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;spec&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;           &lt;span class=&quot;c1&quot;&gt;# Desired state&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;containers&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;nginx&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;image&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;nginx&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;common-object-types&quot;&gt;Common Object Types&lt;/h3&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Object&lt;/th&gt;
      &lt;th&gt;Purpose&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Pod&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Smallest deployable unit, runs containers&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Service&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Stable network endpoint for pods&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Deployment&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Manages pod replicas and updates&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;ReplicaSet&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Ensures desired number of pod replicas&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;ConfigMap&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Configuration data&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Secret&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Sensitive data (passwords, tokens)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Namespace&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Virtual cluster for isolation&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;PersistentVolume&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Storage resource&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
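
&lt;p&gt;Most of these types follow the same &lt;code&gt;apiVersion&lt;/code&gt; / &lt;code&gt;kind&lt;/code&gt; / &lt;code&gt;metadata&lt;/code&gt; pattern, though not all of them use a &lt;code&gt;spec&lt;/code&gt; field. As a sketch, a Namespace and a ConfigMap placed inside it can be declared in one file separated by &lt;code&gt;---&lt;/code&gt;; note that the ConfigMap keeps its payload under &lt;code&gt;data&lt;/code&gt;. All names and values here are hypothetical:&lt;/p&gt;

&lt;div class=&quot;language-yaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;apiVersion: v1
kind: Namespace
metadata:
  name: staging                  # hypothetical namespace
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config               # hypothetical name
  namespace: staging             # places the ConfigMap inside the namespace above
data:
  LOG_LEVEL: info                # plain key-value pairs, not secret
  DB_HOST: db.staging.svc.cluster.local   # assumes a Service named db in staging
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;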

&lt;h2 id=&quot;working-with-kubernetes&quot;&gt;Working with Kubernetes&lt;/h2&gt;

&lt;h3 id=&quot;kubectl---the-kubernetes-cli&quot;&gt;kubectl - The Kubernetes CLI&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;kubectl&lt;/strong&gt; is the command-line tool for Kubernetes:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;# Get cluster information&lt;/span&gt;
kubectl cluster-info

&lt;span class=&quot;c&quot;&gt;# List nodes&lt;/span&gt;
kubectl get nodes

&lt;span class=&quot;c&quot;&gt;# List all pods&lt;/span&gt;
kubectl get pods &lt;span class=&quot;nt&quot;&gt;--all-namespaces&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# Create resources from YAML&lt;/span&gt;
kubectl apply &lt;span class=&quot;nt&quot;&gt;-f&lt;/span&gt; deployment.yaml

&lt;span class=&quot;c&quot;&gt;# Get detailed information&lt;/span&gt;
kubectl describe pod my-pod

&lt;span class=&quot;c&quot;&gt;# View logs&lt;/span&gt;
kubectl logs my-pod

&lt;span class=&quot;c&quot;&gt;# Execute command in pod&lt;/span&gt;
kubectl &lt;span class=&quot;nb&quot;&gt;exec&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-it&lt;/span&gt; my-pod &lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt; /bin/bash

&lt;span class=&quot;c&quot;&gt;# Delete resources&lt;/span&gt;
kubectl delete pod my-pod
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;imperative-vs-declarative&quot;&gt;Imperative vs Declarative&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Imperative&lt;/strong&gt; (tell Kubernetes what to do):&lt;/p&gt;
&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;kubectl create deployment nginx &lt;span class=&quot;nt&quot;&gt;--image&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;nginx
kubectl scale deployment nginx &lt;span class=&quot;nt&quot;&gt;--replicas&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;3
kubectl expose deployment nginx &lt;span class=&quot;nt&quot;&gt;--port&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;80
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Declarative&lt;/strong&gt; (tell Kubernetes what you want):&lt;/p&gt;
&lt;div class=&quot;language-yaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;# deployment.yaml&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;apiVersion&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;apps/v1&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;kind&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;Deployment&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;metadata&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;nginx&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;spec&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;replicas&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;3&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;# ... rest of spec&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;kubectl apply &lt;span class=&quot;nt&quot;&gt;-f&lt;/span&gt; deployment.yaml
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;Best Practice&lt;/strong&gt;: Use declarative approach for production (version control, reproducibility)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2 id=&quot;kubernetes-distributions&quot;&gt;Kubernetes Distributions&lt;/h2&gt;

&lt;p&gt;Kubernetes can run in many environments:&lt;/p&gt;

&lt;h3 id=&quot;cloud-managed-kubernetes&quot;&gt;Cloud-Managed Kubernetes&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Amazon EKS&lt;/strong&gt; (Elastic Kubernetes Service)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Google GKE&lt;/strong&gt; (Google Kubernetes Engine)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Azure AKS&lt;/strong&gt; (Azure Kubernetes Service)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;DigitalOcean Kubernetes&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Advantages&lt;/strong&gt;: Managed control plane, automatic updates, integrated with cloud services&lt;/p&gt;

&lt;h3 id=&quot;self-managed-kubernetes&quot;&gt;Self-Managed Kubernetes&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;kubeadm&lt;/strong&gt;: Official tool for cluster setup&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;kops&lt;/strong&gt;: Kubernetes Operations (AWS focused)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Kubespray&lt;/strong&gt;: Ansible-based deployment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Advantages&lt;/strong&gt;: Full control, can run anywhere&lt;/p&gt;

&lt;h3 id=&quot;local-development&quot;&gt;Local Development&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;minikube&lt;/strong&gt;: Single-node cluster for local development&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;kind&lt;/strong&gt;: Kubernetes in Docker&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;k3s&lt;/strong&gt;: Lightweight Kubernetes&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Docker Desktop&lt;/strong&gt;: Includes Kubernetes&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;benefits-of-kubernetes&quot;&gt;Benefits of Kubernetes&lt;/h2&gt;

&lt;p&gt;✓ &lt;strong&gt;Portability&lt;/strong&gt;: Run anywhere (cloud, on-prem, hybrid)&lt;br /&gt;
✓ &lt;strong&gt;Scalability&lt;/strong&gt;: Handle massive scale (its design draws on Google’s Borg, which launches billions of containers per week)&lt;br /&gt;
✓ &lt;strong&gt;High Availability&lt;/strong&gt;: Automatic failover and recovery&lt;br /&gt;
✓ &lt;strong&gt;Resource Efficiency&lt;/strong&gt;: Efficient bin-packing of containers onto nodes&lt;br /&gt;
✓ &lt;strong&gt;Declarative Configuration&lt;/strong&gt;: Infrastructure as code&lt;br /&gt;
✓ &lt;strong&gt;Extensibility&lt;/strong&gt;: Plugin architecture, custom resources&lt;br /&gt;
✓ &lt;strong&gt;Ecosystem&lt;/strong&gt;: Huge community, tools, and integrations&lt;/p&gt;

&lt;h2 id=&quot;challenges&quot;&gt;Challenges&lt;/h2&gt;

&lt;p&gt;⚠ &lt;strong&gt;Complexity&lt;/strong&gt;: Steep learning curve&lt;br /&gt;
⚠ &lt;strong&gt;Overhead&lt;/strong&gt;: Requires resources for control plane&lt;br /&gt;
⚠ &lt;strong&gt;Overkill&lt;/strong&gt;: May be too complex for simple applications&lt;br /&gt;
⚠ &lt;strong&gt;Networking&lt;/strong&gt;: Can be complex to configure&lt;br /&gt;
⚠ &lt;strong&gt;Storage&lt;/strong&gt;: Stateful applications require careful planning&lt;/p&gt;

&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;/h2&gt;

&lt;p&gt;Kubernetes is a powerful container orchestration platform that:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Automates&lt;/strong&gt; deployment, scaling, and management of containers&lt;/li&gt;
  &lt;li&gt;Uses a &lt;strong&gt;master-worker architecture&lt;/strong&gt; with control plane and worker nodes&lt;/li&gt;
  &lt;li&gt;Operates on &lt;strong&gt;desired state&lt;/strong&gt; and &lt;strong&gt;reconciliation loops&lt;/strong&gt;&lt;/li&gt;
  &lt;li&gt;Provides &lt;strong&gt;self-healing&lt;/strong&gt; and &lt;strong&gt;auto-scaling&lt;/strong&gt; capabilities&lt;/li&gt;
  &lt;li&gt;Has become the &lt;strong&gt;industry standard&lt;/strong&gt; for container orchestration&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the next lecture, we’ll dive deep into &lt;strong&gt;Pods&lt;/strong&gt; - the fundamental building block of Kubernetes applications.&lt;/p&gt;

&lt;h2 id=&quot;further-reading&quot;&gt;Further Reading&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://kubernetes.io/docs/&quot;&gt;Kubernetes Official Documentation&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://kubernetes.io/docs/concepts/architecture/&quot;&gt;Kubernetes Architecture&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://kubernetes.io/docs/reference/kubectl/cheatsheet/&quot;&gt;kubectl Cheat Sheet&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content>
 </entry>
 
 <entry>
   <title>Chapter 08 - Virtualization and Containerization</title>
   <link href="https://nglelinh.github.io/contents/en/chapter08/08_Introduction/"/>
   <updated>2021-01-01T00:00:00+00:00</updated>
   <id>https://nglelinh.github.io/service-oriented-architecture-and-cloud-computing-iuh/contents/en/chapter08/08_Introduction</id>
   <content type="html">&lt;p&gt;Welcome to Chapter 08, where we explore &lt;strong&gt;Virtualization and Containerization&lt;/strong&gt; - two fundamental technologies that have revolutionized modern computing and cloud infrastructure.&lt;/p&gt;

&lt;h2 id=&quot;chapter-overview&quot;&gt;Chapter Overview&lt;/h2&gt;

&lt;p&gt;This chapter covers the evolution from traditional bare-metal servers to virtual machines and lightweight containers. You’ll learn how these technologies enable efficient resource utilization, application isolation, and scalable cloud deployments.&lt;/p&gt;

&lt;h2 id=&quot;learning-objectives&quot;&gt;Learning Objectives&lt;/h2&gt;

&lt;p&gt;By the end of this chapter, you will be able to:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Understand virtualization concepts&lt;/strong&gt; including hypervisors, virtual machines, and their architecture&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Differentiate between Type 1 and Type 2 hypervisors&lt;/strong&gt; and their use cases&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Explain containerization&lt;/strong&gt; and how it differs from traditional virtualization&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Work with Linux namespaces and cgroups&lt;/strong&gt; - the building blocks of containers&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Use Docker&lt;/strong&gt; to create, manage, and deploy containerized applications&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Understand serverless computing&lt;/strong&gt; and AWS Lambda fundamentals&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;topics-covered&quot;&gt;Topics Covered&lt;/h2&gt;

&lt;h3 id=&quot;1-virtualization-fundamentals&quot;&gt;1. Virtualization Fundamentals&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;History and evolution of virtualization&lt;/li&gt;
  &lt;li&gt;Virtual Machine Monitors (VMM) and Hypervisors&lt;/li&gt;
  &lt;li&gt;Type 1 vs Type 2 virtualization&lt;/li&gt;
  &lt;li&gt;Properties of virtual machines: partitioning, isolation, encapsulation, hardware independence&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;2-virtualization-internals&quot;&gt;2. Virtualization Internals&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;How virtualization works: binary translation and dynamic translation&lt;/li&gt;
  &lt;li&gt;VM components and architecture&lt;/li&gt;
  &lt;li&gt;AWS VM implementations (Xen, Nitro, bare metal)&lt;/li&gt;
  &lt;li&gt;Security considerations in virtualized environments&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;3-containerization&quot;&gt;3. Containerization&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;Motivation for containers and lightweight virtualization&lt;/li&gt;
  &lt;li&gt;Containers vs Virtual Machines&lt;/li&gt;
  &lt;li&gt;Linux namespaces: PID, mount, network, UTS, user, IPC&lt;/li&gt;
  &lt;li&gt;Control groups (cgroups) for resource management&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;4-container-technologies&quot;&gt;4. Container Technologies&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;Docker architecture and concepts&lt;/li&gt;
  &lt;li&gt;Docker images, containers, and volumes&lt;/li&gt;
  &lt;li&gt;Dockerfile and container creation&lt;/li&gt;
  &lt;li&gt;Container orchestration with Kubernetes and Docker Swarm&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;5-serverless-computing&quot;&gt;5. Serverless Computing&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;Introduction to serverless architectures&lt;/li&gt;
  &lt;li&gt;AWS Lambda and function-as-a-service (FaaS)&lt;/li&gt;
  &lt;li&gt;API Gateway integration&lt;/li&gt;
  &lt;li&gt;Serverless application patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;why-this-matters&quot;&gt;Why This Matters&lt;/h2&gt;

&lt;p&gt;Virtualization and containerization are the backbone of modern cloud computing:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Cost Efficiency&lt;/strong&gt;: Share physical resources among multiple workloads&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Agility&lt;/strong&gt;: Deploy and scale applications in seconds instead of weeks&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Portability&lt;/strong&gt;: Run applications consistently across different environments&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Isolation&lt;/strong&gt;: Secure separation between applications and users&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;DevOps&lt;/strong&gt;: Enable continuous integration and deployment pipelines&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;prerequisites&quot;&gt;Prerequisites&lt;/h2&gt;

&lt;p&gt;To get the most out of this chapter, you should be familiar with:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Basic Linux/Unix command line operations&lt;/li&gt;
  &lt;li&gt;Operating system concepts (processes, memory, networking)&lt;/li&gt;
  &lt;li&gt;Cloud computing fundamentals from previous chapters&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;practical-applications&quot;&gt;Practical Applications&lt;/h2&gt;

&lt;p&gt;Throughout this chapter, you’ll see real-world examples including:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Setting up virtual machines on cloud platforms&lt;/li&gt;
  &lt;li&gt;Creating Docker containers for applications&lt;/li&gt;
  &lt;li&gt;Understanding how AWS and other cloud providers use these technologies&lt;/li&gt;
  &lt;li&gt;Building serverless functions with AWS Lambda&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s begin our journey into the world of virtualization and containerization!&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>08-04 Serverless Computing</title>
   <link href="https://nglelinh.github.io/contents/en/chapter08/08_04_Serverless/"/>
   <updated>2021-01-01T00:00:00+00:00</updated>
   <id>https://nglelinh.github.io/service-oriented-architecture-and-cloud-computing-iuh/contents/en/chapter08/08_04_Serverless</id>
   <content type="html">&lt;p&gt;Serverless computing represents the next evolution in cloud abstraction, where developers focus purely on code while the cloud provider manages all infrastructure, including servers, containers, and scaling.&lt;/p&gt;

&lt;h2 id=&quot;what-is-serverless-computing&quot;&gt;What is Serverless Computing?&lt;/h2&gt;

&lt;h3 id=&quot;definition&quot;&gt;Definition&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Serverless computing&lt;/strong&gt; is a cloud execution model where:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;You don’t manage servers or containers&lt;/li&gt;
  &lt;li&gt;Code runs in response to events&lt;/li&gt;
  &lt;li&gt;You pay only for actual execution time&lt;/li&gt;
  &lt;li&gt;Infrastructure scales automatically&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;Note: “Serverless” doesn’t mean no servers&lt;/strong&gt;&lt;/p&gt;

  &lt;p&gt;Servers still exist, but they’re completely abstracted away. You never see, configure, or manage them.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3 id=&quot;key-characteristics&quot;&gt;Key Characteristics&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;No server management&lt;/strong&gt;: No OS patches, no capacity planning&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Event-driven&lt;/strong&gt;: Code executes in response to triggers&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Automatic scaling&lt;/strong&gt;: From zero to thousands of concurrent executions&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Pay-per-use&lt;/strong&gt;: Billed per request and by execution time (in milliseconds) multiplied by configured memory&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Stateless&lt;/strong&gt;: Each function execution is independent&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;serverless-vs-containers-vs-vms&quot;&gt;Serverless vs Containers vs VMs&lt;/h2&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;┌──────────────────────────────────────────────────┐
│  Traditional VMs                                 │
│  You manage: OS, runtime, scaling, patching     │
│  Granularity: Hours                              │
└──────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────┐
│  Containers (Docker, Kubernetes)                 │
│  You manage: Container images, orchestration     │
│  Granularity: Seconds to minutes                 │
└──────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────┐
│  Serverless (AWS Lambda, Azure Functions)        │
│  You manage: Just your code                      │
│  Granularity: Milliseconds                       │
└──────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Aspect&lt;/th&gt;
      &lt;th&gt;VMs&lt;/th&gt;
      &lt;th&gt;Containers&lt;/th&gt;
      &lt;th&gt;Serverless&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Management&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Full control&lt;/td&gt;
      &lt;td&gt;Container orchestration&lt;/td&gt;
      &lt;td&gt;Code and configuration only&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Scaling&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Manual/Auto-scaling groups&lt;/td&gt;
      &lt;td&gt;Kubernetes/Swarm&lt;/td&gt;
      &lt;td&gt;Automatic&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Billing&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Per hour/month&lt;/td&gt;
      &lt;td&gt;Per hour&lt;/td&gt;
      &lt;td&gt;Per request and per millisecond of execution&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Startup&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Minutes&lt;/td&gt;
      &lt;td&gt;Seconds&lt;/td&gt;
      &lt;td&gt;Milliseconds when warm; longer on a cold start&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Idle Cost&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Always paying&lt;/td&gt;
      &lt;td&gt;Paying for running containers&lt;/td&gt;
      &lt;td&gt;Zero&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;h2 id=&quot;examples-of-serverless-platforms&quot;&gt;Examples of Serverless Platforms&lt;/h2&gt;

&lt;h3 id=&quot;aws-fargate&quot;&gt;AWS Fargate&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Serverless container platform&lt;/strong&gt;&lt;/li&gt;
  &lt;li&gt;Run containers without managing servers&lt;/li&gt;
  &lt;li&gt;Works with Amazon ECS and EKS&lt;/li&gt;
  &lt;li&gt;You define container specs, AWS handles infrastructure&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;aws-lambda&quot;&gt;AWS Lambda&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Function-as-a-Service (FaaS)&lt;/strong&gt;&lt;/li&gt;
  &lt;li&gt;Run code without provisioning servers&lt;/li&gt;
  &lt;li&gt;Event-driven execution&lt;/li&gt;
  &lt;li&gt;Supports multiple languages&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;other-platforms&quot;&gt;Other Platforms&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Azure Functions&lt;/strong&gt;: Microsoft’s FaaS offering&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Google Cloud Functions&lt;/strong&gt;: Google’s FaaS&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Cloudflare Workers&lt;/strong&gt;: Edge computing platform&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Vercel/Netlify Functions&lt;/strong&gt;: Frontend-focused serverless&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;aws-lambda-deep-dive&quot;&gt;AWS Lambda Deep Dive&lt;/h2&gt;

&lt;h3 id=&quot;what-is-aws-lambda&quot;&gt;What is AWS Lambda?&lt;/h3&gt;

&lt;p&gt;AWS Lambda hides the container details and lets you deploy individual &lt;strong&gt;functions&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Upload code as a function&lt;/li&gt;
  &lt;li&gt;Configure triggers (events)&lt;/li&gt;
  &lt;li&gt;AWS handles everything else&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;supported-programming-environments&quot;&gt;Supported Programming Environments&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Node.js&lt;/strong&gt; (JavaScript/TypeScript)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Python&lt;/strong&gt;&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Java&lt;/strong&gt;&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;C# (.NET)&lt;/strong&gt;&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Go&lt;/strong&gt;&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Ruby&lt;/strong&gt;&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Custom runtimes&lt;/strong&gt; (via the Lambda Runtime API; layers are one way to package them)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;event-sources&quot;&gt;Event Sources&lt;/h3&gt;

&lt;p&gt;Functions can be triggered by various events:&lt;/p&gt;

&lt;h4 id=&quot;http-requests&quot;&gt;HTTP Requests&lt;/h4&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;API Gateway → Lambda Function
User makes HTTP request → Function executes → Returns response
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
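
&lt;p&gt;As a rough sketch of the HTTP path above, the handler below assumes API Gateway’s Lambda proxy integration (REST API event format), where query parameters arrive under &lt;code&gt;queryStringParameters&lt;/code&gt; and the response body must be a string. The &lt;code&gt;name&lt;/code&gt; parameter and the sample event are illustrative, not part of any real API.&lt;/p&gt;

```python
import json


def lambda_handler(event, context):
    """Handle an API Gateway proxy event (assumed REST API proxy format)."""
    # queryStringParameters is None when the request has no query string.
    params = event.get("queryStringParameters") or {}
    name = params.get("name", "World")

    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        # Proxy integrations expect the body to be a string, not a dict.
        "body": json.dumps({"message": f"Hello, {name}!"}),
    }


if __name__ == "__main__":
    # Hand-written sample event; real events carry many more fields.
    sample = {"httpMethod": "GET", "queryStringParameters": {"name": "Alice"}}
    print(lambda_handler(sample, None))
```

&lt;p&gt;Because the handler is a plain function, it can be called locally with a hand-built event before any deployment.&lt;/p&gt;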

&lt;h4 id=&quot;aws-services&quot;&gt;AWS Services&lt;/h4&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;S3&lt;/strong&gt;: File upload/delete&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;DynamoDB&lt;/strong&gt;: Database changes&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;SNS/SQS&lt;/strong&gt;: Pub/sub notifications and message queues&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;CloudWatch Events (now Amazon EventBridge)&lt;/strong&gt;: Scheduled events (cron)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Kinesis&lt;/strong&gt;: Stream processing&lt;/li&gt;
&lt;/ul&gt;
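
&lt;p&gt;To make the S3 trigger concrete, here is a minimal sketch of a handler that reads the bucket and object key from each record of an S3 notification event. The return shape is an illustrative choice, and the bucket and key names used for local testing are made up.&lt;/p&gt;

```python
from urllib.parse import unquote_plus


def lambda_handler(event, context):
    """Collect (bucket, key) pairs from an S3 notification event."""
    uploaded = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded (for example, spaces become '+').
        key = unquote_plus(record["s3"]["object"]["key"])
        uploaded.append((bucket, key))
    return {"processed": len(uploaded), "objects": uploaded}


if __name__ == "__main__":
    # Trimmed-down sample event; real notifications carry more metadata.
    sample = {
        "Records": [
            {"s3": {"bucket": {"name": "demo-bucket"},
                    "object": {"key": "reports/my+file.txt"}}}
        ]
    }
    print(lambda_handler(sample, None))
```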

&lt;h4 id=&quot;iot-and-mobile&quot;&gt;IoT and Mobile&lt;/h4&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;IoT Core&lt;/strong&gt;: Device messages&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Cognito&lt;/strong&gt;: User authentication events&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Mobile&lt;/strong&gt;: App backend logic&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;lambda-function-structure&quot;&gt;Lambda Function Structure&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Python example:&lt;/strong&gt;&lt;/p&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;lambda_handler&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;event&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;context&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;
    event: Contains data about the triggering event
    context: Provides runtime information
    &lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;# Your code here
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;event&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;get&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;World&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;statusCode&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;200&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;body&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;sa&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;Hello, &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Node.js example:&lt;/strong&gt;&lt;/p&gt;
&lt;div class=&quot;language-javascript highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nx&quot;&gt;exports&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;handler&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;async &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;event&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;kd&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;name&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;event&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;name&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;||&lt;/span&gt; &lt;span class=&quot;dl&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;World&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;na&quot;&gt;statusCode&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;200&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;na&quot;&gt;body&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;JSON&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;stringify&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;({&lt;/span&gt;
            &lt;span class=&quot;na&quot;&gt;message&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;`Hello, &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;!`&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;})&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;creating-a-lambda-function&quot;&gt;Creating a Lambda Function&lt;/h3&gt;

&lt;h4 id=&quot;1-choose-language&quot;&gt;1. Choose Language&lt;/h4&gt;
&lt;p&gt;Select from supported runtimes (Python 3.11, Node.js 18, etc.)&lt;/p&gt;

&lt;h4 id=&quot;2-define-role&quot;&gt;2. Define Role&lt;/h4&gt;
&lt;p&gt;IAM role with permissions for:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;CloudWatch Logs (logging)&lt;/li&gt;
  &lt;li&gt;Other AWS services your function needs&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;3-select-template-optional&quot;&gt;3. Select Template (Optional)&lt;/h4&gt;
&lt;ul&gt;
  &lt;li&gt;Blank function&lt;/li&gt;
  &lt;li&gt;API Gateway proxy&lt;/li&gt;
  &lt;li&gt;S3 object processing&lt;/li&gt;
  &lt;li&gt;DynamoDB stream processing&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;4-writeupload-code&quot;&gt;4. Write/Upload Code&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Option A: Inline editor&lt;/strong&gt;&lt;/p&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;# Edit directly in AWS Console
&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;lambda_handler&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;event&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;context&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;statusCode&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;200&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;body&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;Hello!&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Option B: Upload ZIP&lt;/strong&gt;&lt;/p&gt;
&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;# Package function with dependencies&lt;/span&gt;
pip &lt;span class=&quot;nb&quot;&gt;install &lt;/span&gt;requests &lt;span class=&quot;nt&quot;&gt;-t&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;.&lt;/span&gt;
zip &lt;span class=&quot;nt&quot;&gt;-r&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;function&lt;/span&gt;.zip &lt;span class=&quot;nb&quot;&gt;.&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# Upload via console or CLI&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Option C: Container image&lt;/strong&gt;&lt;/p&gt;
&lt;div class=&quot;language-dockerfile highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt;&lt;span class=&quot;s&quot;&gt; public.ecr.aws/lambda/python:3.11&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;COPY&lt;/span&gt;&lt;span class=&quot;s&quot;&gt; app.py ${LAMBDA_TASK_ROOT}&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;CMD&lt;/span&gt;&lt;span class=&quot;s&quot;&gt; [&quot;app.lambda_handler&quot;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h4 id=&quot;5-configure-settings&quot;&gt;5. Configure Settings&lt;/h4&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Memory&lt;/strong&gt;: 128 MB to 10 GB&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Timeout&lt;/strong&gt;: Up to 15 minutes&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Environment variables&lt;/strong&gt;: Configuration&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;VPC&lt;/strong&gt;: Optional network access&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;6-test&quot;&gt;6. Test&lt;/h4&gt;

&lt;p&gt;Use a test event to invoke the function:&lt;/p&gt;
&lt;div class=&quot;language-json highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;name&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;Alice&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;action&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;greet&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;local-development-with-sam-cli&quot;&gt;Local Development with SAM CLI&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;AWS SAM (Serverless Application Model)&lt;/strong&gt; enables local testing:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;# Install SAM CLI (macOS/Linux, via Homebrew)&lt;/span&gt;
brew &lt;span class=&quot;nb&quot;&gt;install &lt;/span&gt;aws-sam-cli

&lt;span class=&quot;c&quot;&gt;# Initialize project&lt;/span&gt;
sam init

&lt;span class=&quot;c&quot;&gt;# Test locally&lt;/span&gt;
sam &lt;span class=&quot;nb&quot;&gt;local &lt;/span&gt;invoke MyFunction &lt;span class=&quot;nt&quot;&gt;-e&lt;/span&gt; event.json

&lt;span class=&quot;c&quot;&gt;# Start local API&lt;/span&gt;
sam &lt;span class=&quot;nb&quot;&gt;local &lt;/span&gt;start-api

&lt;span class=&quot;c&quot;&gt;# Deploy to AWS&lt;/span&gt;
sam deploy &lt;span class=&quot;nt&quot;&gt;--guided&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;api-gateway-integration&quot;&gt;API Gateway Integration&lt;/h2&gt;

&lt;h3 id=&quot;purpose&quot;&gt;Purpose&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;API Gateway&lt;/strong&gt; provides a simple way to:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Associate URLs with Lambda functions&lt;/li&gt;
  &lt;li&gt;Create RESTful APIs&lt;/li&gt;
  &lt;li&gt;Handle HTTP methods (GET, POST, PUT, DELETE)&lt;/li&gt;
  &lt;li&gt;Enable CORS (Cross-Origin Resource Sharing)&lt;/li&gt;
  &lt;li&gt;Manage API keys and throttling&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;example-setup&quot;&gt;Example Setup&lt;/h3&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;┌─────────────────────────────────────────────┐
│  Client (Browser/Mobile)                    │
└────────────────┬────────────────────────────┘
                 │ HTTPS Request
                 ↓
┌─────────────────────────────────────────────┐
│  API Gateway                                │
│  GET  /users      → Lambda: listUsers       │
│  POST /users      → Lambda: createUser      │
│  GET  /users/{id} → Lambda: getUser         │
└────────────────┬────────────────────────────┘
                 │ Invoke
                 ↓
┌─────────────────────────────────────────────┐
│  Lambda Functions                           │
│  - Parse request                            │
│  - Business logic                           │
│  - Return response                          │
└─────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;api-gateway-features&quot;&gt;API Gateway Features&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Request validation&lt;/strong&gt;: Check parameters before invoking Lambda&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Response transformation&lt;/strong&gt;: Modify responses&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Caching&lt;/strong&gt;: Cache responses for performance&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Throttling&lt;/strong&gt;: Rate limiting&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Authentication&lt;/strong&gt;: API keys, IAM, Cognito, custom authorizers&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;typical-serverless-architecture&quot;&gt;Typical Serverless Architecture&lt;/h2&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;┌──────────────┐
│   CloudFront │  CDN for static content
│   (CDN)      │
└──────┬───────┘
       │
┌──────▼───────┐
│   S3 Bucket  │  Static website hosting
│   (Frontend) │  (HTML, CSS, JS)
└──────┬───────┘
       │ API calls
┌──────▼───────────┐
│  API Gateway     │  RESTful API
└──────┬───────────┘
       │
┌──────▼───────────┐
│  Lambda Functions│  Business logic
│  - Auth          │
│  - CRUD ops      │
│  - Processing    │
└──────┬───────────┘
       │
┌──────▼───────────┐
│  DynamoDB        │  NoSQL database
│  (Database)      │
└──────────────────┘
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;example-serverless-web-application&quot;&gt;Example: Serverless Web Application&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Components:&lt;/strong&gt;&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;S3 + CloudFront&lt;/strong&gt;: Host static frontend&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;API Gateway&lt;/strong&gt;: Expose REST API&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Lambda&lt;/strong&gt;: Handle business logic&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;DynamoDB&lt;/strong&gt;: Store data&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Cognito&lt;/strong&gt;: User authentication&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;CloudWatch&lt;/strong&gt;: Logging and monitoring&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;lambda-pricing&quot;&gt;Lambda Pricing&lt;/h2&gt;

&lt;h3 id=&quot;cost-model&quot;&gt;Cost Model&lt;/h3&gt;

&lt;p&gt;You pay for:&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Number of requests&lt;/strong&gt;: $0.20 per 1M requests&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Duration&lt;/strong&gt;: $0.0000166667 per GB-second (rates vary by Region and processor architecture)&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;example-calculation&quot;&gt;Example Calculation&lt;/h3&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Function specs:
- Memory: 512 MB (0.5 GB)
- Execution time: 200ms (0.2 seconds)
- Requests: 1 million per month

Cost calculation:
Requests: 1M × $0.20/1M = $0.20
Duration: 1M × 0.5 GB × 0.2s × $0.0000166667 = $1.67

Total: $1.87/month
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;free-tier&quot;&gt;Free Tier&lt;/h3&gt;

&lt;p&gt;AWS Lambda free tier includes:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;1M free requests&lt;/strong&gt; per month&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;400,000 GB-seconds&lt;/strong&gt; of compute time per month&lt;/li&gt;
&lt;/ul&gt;
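
&lt;p&gt;The arithmetic above can be sketched as a small estimator. It uses only the rates and free-tier allowances quoted in this section; the function name and its parameters are illustrative, and actual bills depend on Region, architecture, and current pricing.&lt;/p&gt;

```python
PRICE_PER_MILLION_REQUESTS = 0.20   # USD, rate quoted in this section
PRICE_PER_GB_SECOND = 0.0000166667  # USD, rate quoted in this section
FREE_REQUESTS = 1_000_000           # monthly free-tier requests
FREE_GB_SECONDS = 400_000           # monthly free-tier compute


def monthly_cost(requests, memory_gb, seconds_per_request, free_tier=False):
    """Estimate a monthly Lambda bill in USD, rounded to cents."""
    gb_seconds = requests * memory_gb * seconds_per_request
    if free_tier:
        # Subtract the free allowances, never going below zero.
        requests = max(0, requests - FREE_REQUESTS)
        gb_seconds = max(0, gb_seconds - FREE_GB_SECONDS)
    request_cost = requests / 1_000_000 * PRICE_PER_MILLION_REQUESTS
    duration_cost = gb_seconds * PRICE_PER_GB_SECOND
    return round(request_cost + duration_cost, 2)


if __name__ == "__main__":
    print(monthly_cost(1_000_000, 0.5, 0.2))                  # 1.87
    print(monthly_cost(1_000_000, 0.5, 0.2, free_tier=True))  # 0.0
```

&lt;p&gt;With the free tier applied, the example workload (1M requests, 100,000 GB-seconds) falls entirely within the monthly allowance.&lt;/p&gt;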

&lt;h2 id=&quot;benefits-of-serverless&quot;&gt;Benefits of Serverless&lt;/h2&gt;

&lt;p&gt;✓ &lt;strong&gt;No infrastructure management&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;No servers to provision or maintain&lt;/li&gt;
  &lt;li&gt;No OS patches or updates&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;✓ &lt;strong&gt;Automatic scaling&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Scales from zero to thousands of concurrent executions, within account concurrency limits&lt;/li&gt;
  &lt;li&gt;Handles traffic spikes automatically&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;✓ &lt;strong&gt;Cost-effective&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Pay only for actual usage&lt;/li&gt;
  &lt;li&gt;No idle capacity costs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;✓ &lt;strong&gt;Faster time to market&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Focus on code, not infrastructure&lt;/li&gt;
  &lt;li&gt;Rapid deployment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;✓ &lt;strong&gt;Built-in high availability&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Multi-AZ deployment by default&lt;/li&gt;
  &lt;li&gt;Automatic failover&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;challenges-and-considerations&quot;&gt;Challenges and Considerations&lt;/h2&gt;

&lt;h3 id=&quot;cold-starts&quot;&gt;Cold Starts&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Problem&lt;/strong&gt;: First invocation after idle period is slow&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Impact&lt;/strong&gt;: Typically 100ms - 1s of added latency, and sometimes more depending on runtime and package size&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Solutions&lt;/strong&gt;:
    &lt;ul&gt;
      &lt;li&gt;Keep functions warm with scheduled pings&lt;/li&gt;
      &lt;li&gt;Use provisioned concurrency&lt;/li&gt;
      &lt;li&gt;Optimize package size&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;
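
&lt;p&gt;One further way to limit the cost of a cold start is to do expensive setup once at module level, so that warm invocations in the same execution environment reuse it. The sketch below only illustrates that reuse; &lt;code&gt;_CONFIG&lt;/code&gt; is a hypothetical stand-in for costly setup such as creating a client, and the counter is a cache-like convenience, not durable state.&lt;/p&gt;

```python
# Module-level code runs once per execution environment (the cold start).
# Anything created here is reused by later warm invocations.
_CONFIG = {"table": "users"}  # hypothetical stand-in for costly setup
_invocations = 0


def lambda_handler(event, context):
    """Report whether this call was the first in its environment."""
    global _invocations
    _invocations += 1
    return {
        "cold_start": _invocations == 1,
        "invocation": _invocations,
        "table": _CONFIG["table"],
    }
```

&lt;p&gt;Functions should still be written as stateless: an execution environment can be discarded at any time, so anything that must persist belongs in an external store.&lt;/p&gt;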

&lt;h3 id=&quot;execution-limits&quot;&gt;Execution Limits&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Timeout&lt;/strong&gt;: Maximum 15 minutes&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Memory&lt;/strong&gt;: Maximum 10 GB&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Deployment package&lt;/strong&gt;: 50 MB (zipped, direct upload), 250 MB (unzipped); container images up to 10 GB&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Concurrent executions&lt;/strong&gt;: 1,000 per Region (default, can be increased)&lt;/li&gt;
&lt;/ul&gt;
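
&lt;p&gt;Because of the timeout limit, long-running batch work usually needs to stop cleanly before time runs out. The sketch below assumes the Python runtime’s context method &lt;code&gt;get_remaining_time_in_millis()&lt;/code&gt;; the safety margin, the &lt;code&gt;items&lt;/code&gt; event field, and the &lt;code&gt;FakeContext&lt;/code&gt; class used for local testing are all hypothetical.&lt;/p&gt;

```python
SAFETY_MARGIN_MS = 10_000  # hypothetical buffer kept before the timeout


def lambda_handler(event, context):
    """Process items until the remaining time gets too short."""
    items = event.get("items", [])
    done = []
    for item in items:
        # The Lambda context reports the time left before the timeout.
        if SAFETY_MARGIN_MS > context.get_remaining_time_in_millis():
            break
        done.append(item.upper())  # placeholder for real work
    # Unprocessed items are returned so a caller can retry them.
    return {"done": done, "pending": items[len(done):]}


class FakeContext:
    """Local stand-in for the Lambda context object (testing only)."""

    def __init__(self, remaining_ms):
        self.remaining_ms = remaining_ms

    def get_remaining_time_in_millis(self):
        return self.remaining_ms
```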

&lt;h3 id=&quot;vendor-lock-in&quot;&gt;Vendor Lock-in&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;Code often tied to specific cloud provider&lt;/li&gt;
  &lt;li&gt;Migration can be complex&lt;/li&gt;
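  &lt;li&gt;Consider using frameworks like Serverless Framework or SAM to standardize deployment (note that SAM is AWS-specific, and neither removes provider-specific code)&lt;/li&gt;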
&lt;/ul&gt;

&lt;h3 id=&quot;debugging-and-monitoring&quot;&gt;Debugging and Monitoring&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;Harder to debug than traditional apps&lt;/li&gt;
  &lt;li&gt;Rely on logging (CloudWatch Logs)&lt;/li&gt;
  &lt;li&gt;Use distributed tracing (X-Ray)&lt;/li&gt;
  &lt;li&gt;Monitor cold starts and errors&lt;/li&gt;
&lt;/ul&gt;
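
&lt;p&gt;Since debugging relies heavily on logs, emitting one JSON object per log line makes them easier to search and filter. This is a minimal sketch using only the standard library; the &lt;code&gt;log_event&lt;/code&gt; helper and the &lt;code&gt;order_id&lt;/code&gt; field are hypothetical, while &lt;code&gt;aws_request_id&lt;/code&gt; is the request identifier exposed on the Python context object.&lt;/p&gt;

```python
import json
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)


def log_event(message, **fields):
    """Emit one JSON log line and return it."""
    line = json.dumps({"message": message, **fields}, sort_keys=True)
    logger.info(line)
    return line


def lambda_handler(event, context):
    # Fall back to "local" when running outside Lambda (context is None).
    request_id = getattr(context, "aws_request_id", "local")
    log_event("order received", request_id=request_id,
              order_id=event.get("order_id"))
    return {"statusCode": 200}
```

&lt;p&gt;Including the request ID in every line lets the entries for a single invocation be correlated later.&lt;/p&gt;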

&lt;h2 id=&quot;microservices-and-serverless&quot;&gt;Microservices and Serverless&lt;/h2&gt;

&lt;p&gt;Serverless fits naturally with &lt;strong&gt;microservices architecture&lt;/strong&gt;:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;┌─────────────────────────────────────────┐
│  Monolithic Application                 │
│  - One large codebase                   │
│  - Deployed as single unit              │
│  - Scales entire application            │
└─────────────────────────────────────────┘

                  ↓ Decompose

┌──────────┐  ┌──────────┐  ┌──────────┐
│ User     │  │ Product  │  │ Order    │
│ Service  │  │ Service  │  │ Service  │
│ (Lambda) │  │ (Lambda) │  │ (Lambda) │
└──────────┘  └──────────┘  └──────────┘
     ↓             ↓             ↓
┌──────────┐  ┌──────────┐  ┌──────────┐
│ User DB  │  │Product DB│  │ Order DB │
└──────────┘  └──────────┘  └──────────┘
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Benefits:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Independent deployment&lt;/li&gt;
  &lt;li&gt;Technology diversity&lt;/li&gt;
  &lt;li&gt;Fault isolation&lt;/li&gt;
  &lt;li&gt;Easier scaling&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;best-practices&quot;&gt;Best Practices&lt;/h2&gt;

&lt;h3 id=&quot;1-keep-functions-small-and-focused&quot;&gt;1. Keep Functions Small and Focused&lt;/h3&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;# ❌ BAD: One function does everything
&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;lambda_handler&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;event&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;context&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;event&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;action&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;create_user&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;c1&quot;&gt;# 100 lines of code
&lt;/span&gt;        &lt;span class=&quot;k&quot;&gt;pass&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;elif&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;event&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;action&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;delete_user&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;c1&quot;&gt;# 100 lines of code
&lt;/span&gt;        &lt;span class=&quot;k&quot;&gt;pass&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;# ...
&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# ✅ GOOD: Separate functions
&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;create_user&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;event&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;context&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;# Focused logic
&lt;/span&gt;    &lt;span class=&quot;k&quot;&gt;pass&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;delete_user&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;event&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;context&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;# Focused logic
&lt;/span&gt;    &lt;span class=&quot;k&quot;&gt;pass&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;2-minimize-package-size&quot;&gt;2. Minimize Package Size&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;Use layers for shared dependencies&lt;/li&gt;
  &lt;li&gt;Remove unnecessary files&lt;/li&gt;
  &lt;li&gt;Use lightweight libraries&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;3-use-environment-variables&quot;&gt;3. Use Environment Variables&lt;/h3&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;os&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;DB_HOST&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;os&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;environ&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;DB_HOST&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;API_KEY&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;os&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;environ&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;API_KEY&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
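
&lt;p&gt;Indexing &lt;code&gt;os.environ&lt;/code&gt; directly raises a bare &lt;code&gt;KeyError&lt;/code&gt; when a variable is missing. The following is a minimal sketch of one way to fail with a clearer message and to support optional settings. The &lt;code&gt;load_config&lt;/code&gt; helper and the &lt;code&gt;DB_PORT&lt;/code&gt; variable with its default are illustrative assumptions, not part of any AWS API.&lt;/p&gt;

```python
import os


def load_config(env=None):
    """Read required and optional settings from an environment mapping.

    `load_config`, `DB_PORT`, and its default value are hypothetical
    names used only for this sketch.
    """
    env = os.environ if env is None else env
    required = ("DB_HOST", "API_KEY")
    missing = [name for name in required if not env.get(name)]
    if missing:
        # Fail fast with a message that names every missing variable.
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "db_host": env["DB_HOST"],
        "api_key": env["API_KEY"],
        # Optional setting with a default value.
        "db_port": int(env.get("DB_PORT", "5432")),
    }
```

&lt;p&gt;Calling such a helper once at module level, outside the handler, keeps the lookup out of the per-request path when an execution environment is reused.&lt;/p&gt;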

&lt;h3 id=&quot;4-implement-proper-error-handling&quot;&gt;4. Implement Proper Error Handling&lt;/h3&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;lambda_handler&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;event&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;context&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;try&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;c1&quot;&gt;# Your code
&lt;/span&gt;        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;statusCode&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;200&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;body&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;Success&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;except&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;Exception&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;e&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sa&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;Error: &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;e&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;statusCode&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;500&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;body&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;Error&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
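
&lt;p&gt;The handler above maps every failure to a 500 response. A possible refinement, sketched here under assumptions, separates errors the caller can fix from unexpected failures and logs a full traceback for the latter. &lt;code&gt;ValidationError&lt;/code&gt; and &lt;code&gt;process&lt;/code&gt; are hypothetical names introduced for illustration.&lt;/p&gt;

```python
import json
import logging

logger = logging.getLogger(__name__)


class ValidationError(Exception):
    """Hypothetical exception for bad input from the caller."""


def process(event):
    # Placeholder business logic for this sketch.
    if "name" not in event:
        raise ValidationError("field 'name' is required")
    return {"greeting": "Hello, " + event["name"]}


def lambda_handler(event, context):
    try:
        result = process(event)
        return {"statusCode": 200, "body": json.dumps(result)}
    except ValidationError as exc:
        # The caller can fix this, so report it as a client error.
        return {"statusCode": 400, "body": json.dumps({"error": str(exc)})}
    except Exception:
        # Log the full traceback, but do not leak details to the caller.
        logger.exception("Unhandled error")
        return {"statusCode": 500, "body": json.dumps({"error": "Internal error"})}
```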

&lt;h3 id=&quot;5-use-dead-letter-queues&quot;&gt;5. Use Dead Letter Queues&lt;/h3&gt;
&lt;ul&gt;
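  &lt;li&gt;Capture events from asynchronous invocations that still fail after Lambda has exhausted its automatic retries&lt;/li&gt;
  &lt;li&gt;Inspect and replay the captured events once the underlying problem is fixed&lt;/li&gt;
  &lt;li&gt;Alert on failures, for example by monitoring the queue depth&lt;/li&gt;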
&lt;/ul&gt;
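
&lt;p&gt;The retry-then-park flow behind a dead letter queue can be sketched as an in-memory simulation. This is not an AWS API: &lt;code&gt;invoke_with_dlq&lt;/code&gt; and &lt;code&gt;dead_letters&lt;/code&gt; are hypothetical names, and the default of two retries is an assumption chosen to mirror Lambda&apos;s default for asynchronous invocations.&lt;/p&gt;

```python
def invoke_with_dlq(handler, event, dead_letters, max_retries=2):
    """Call `handler`, retrying on failure; park the event if every attempt fails.

    In-memory simulation only. `dead_letters` is a plain list standing in
    for a real queue.
    """
    last_error = None
    for _ in range(1 + max_retries):
        try:
            return handler(event)
        except Exception as exc:
            last_error = exc
    # Every attempt failed: keep the event and the reason for later inspection.
    dead_letters.append({"event": event, "error": repr(last_error)})
    return None
```

&lt;p&gt;Because the failed event is stored together with its error, it can be inspected and replayed later instead of being lost.&lt;/p&gt;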

&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;/h2&gt;

&lt;p&gt;Serverless computing is one of the highest levels of abstraction for running code in the cloud:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;No infrastructure management&lt;/strong&gt;: Focus purely on code&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Event-driven&lt;/strong&gt;: Functions respond to triggers&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Auto-scaling&lt;/strong&gt;: Scales with load automatically, within account concurrency limits&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Cost-effective&lt;/strong&gt;: Pay per request and for execution time, not for idle capacity&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;AWS Lambda&lt;/strong&gt;: A widely used FaaS platform&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Well suited to microservices&lt;/strong&gt;: Independent, scalable functions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Serverless is ideal for:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;APIs and web backends&lt;/li&gt;
  &lt;li&gt;Data processing pipelines&lt;/li&gt;
  &lt;li&gt;Scheduled tasks&lt;/li&gt;
  &lt;li&gt;Event-driven workflows&lt;/li&gt;
  &lt;li&gt;IoT backends&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;further-reading&quot;&gt;Further Reading&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://docs.aws.amazon.com/lambda/&quot;&gt;AWS Lambda Documentation&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://docs.aws.amazon.com/serverless-application-model/&quot;&gt;AWS SAM Documentation&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.serverless.com/&quot;&gt;Serverless Framework&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://medium.com/@amiram_26122/the-hidden-costs-of-serverless-6ced7844780b&quot;&gt;The Hidden Costs of Serverless&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content>
 </entry>
 
 <entry>
   <title>08-03 Docker and Container Orchestration</title>
   <link href="https://nglelinh.github.io/contents/en/chapter08/08_03_Docker/"/>
   <updated>2021-01-01T00:00:00+00:00</updated>
   <id>https://nglelinh.github.io/service-oriented-architecture-and-cloud-computing-iuh/contents/en/chapter08/08_03_Docker</id>
   <content type="html">&lt;p&gt;Docker has revolutionized application deployment by making containers accessible and easy to use. This lecture covers Docker concepts, architecture, and container orchestration frameworks.&lt;/p&gt;

&lt;h2 id=&quot;docker-concepts&quot;&gt;Docker Concepts&lt;/h2&gt;

&lt;h3 id=&quot;core-components&quot;&gt;Core Components&lt;/h3&gt;

&lt;h4 id=&quot;1-image&quot;&gt;1. Image&lt;/h4&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Frozen description&lt;/strong&gt; of an environment&lt;/li&gt;
  &lt;li&gt;Read-only template containing:
    &lt;ul&gt;
      &lt;li&gt;Base operating system files&lt;/li&gt;
      &lt;li&gt;Application code&lt;/li&gt;
      &lt;li&gt;Dependencies and libraries&lt;/li&gt;
      &lt;li&gt;Configuration files&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Stored in layers for efficiency&lt;/li&gt;
  &lt;li&gt;Can be shared via Docker Hub or private registries&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;2-container&quot;&gt;2. Container&lt;/h4&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Running instantiation&lt;/strong&gt; of an image&lt;/li&gt;
  &lt;li&gt;Writable layer on top of image&lt;/li&gt;
  &lt;li&gt;Isolated execution environment&lt;/li&gt;
  &lt;li&gt;Can be started, stopped, moved, and deleted&lt;/li&gt;
  &lt;li&gt;Ephemeral by default (state lost when removed)&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;3-volume&quot;&gt;3. Volume&lt;/h4&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Persistent data storage&lt;/strong&gt;&lt;/li&gt;
  &lt;li&gt;Survives container lifecycle&lt;/li&gt;
  &lt;li&gt;Can be shared between containers&lt;/li&gt;
  &lt;li&gt;Managed by Docker or mounted from host&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;┌─────────────────────────────────────┐
│  Image (Read-Only Template)         │
│  ├── Ubuntu base layer              │
│  ├── Python installation            │
│  ├── Application dependencies       │
│  └── Application code               │
└─────────────────────────────────────┘
           ↓ docker run
┌─────────────────────────────────────┐
│  Container (Running Instance)       │
│  ├── Writable layer                 │
│  └── All image layers (read-only)   │
└─────────────────────────────────────┘
           ↓ uses
┌─────────────────────────────────────┐
│  Volume (Persistent Data)           │
│  └── Database files, logs, etc.     │
└─────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
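
&lt;p&gt;The relationship between read-only image layers and a container&apos;s writable layer can be modelled in a few lines of Python. This is only an analogy for how a layered filesystem behaves, not how Docker is implemented; the file names and contents are made up.&lt;/p&gt;

```python
from collections import ChainMap
from types import MappingProxyType

# Read-only image layers, lowest first. File names and contents are made up.
base_layer = MappingProxyType({"/etc/os-release": "Ubuntu", "/usr/bin/python3": "python-binary"})
app_layer = MappingProxyType({"/app/app.py": "print('v1')"})


def start_container():
    """Return a container view: a fresh writable layer over the image layers."""
    writable = {}
    # ChainMap looks keys up left to right and sends all writes to the first map.
    return ChainMap(writable, app_layer, base_layer)


c1 = start_container()
c2 = start_container()
c1["/app/app.py"] = "print('patched')"  # lands in c1's writable layer only
c1["/tmp/cache"] = "scratch data"
```

&lt;p&gt;Here &lt;code&gt;c2&lt;/code&gt; still sees the original &lt;code&gt;/app/app.py&lt;/code&gt; and the image layers are unchanged, which is why two containers can share one image and why a container&apos;s changes disappear when it is removed.&lt;/p&gt;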

&lt;h2 id=&quot;the-dockerfile&quot;&gt;The Dockerfile&lt;/h2&gt;

&lt;p&gt;A &lt;strong&gt;Dockerfile&lt;/strong&gt; describes everything your container needs:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Dependencies&lt;/li&gt;
  &lt;li&gt;Source code / binaries&lt;/li&gt;
  &lt;li&gt;Configuration&lt;/li&gt;
  &lt;li&gt;Startup command&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;overview-guide-to-running-code-in-docker&quot;&gt;Overview Guide to Running Code in Docker&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Inherit&lt;/strong&gt; from a parent OS/platform image&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Install&lt;/strong&gt; any packages/libraries you need&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Add&lt;/strong&gt; any source code you need&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Attach&lt;/strong&gt; any volumes for data persistence&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Set&lt;/strong&gt; a command to be run at startup&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;essential-dockerfile-commands&quot;&gt;Essential Dockerfile Commands&lt;/h3&gt;

&lt;h4 id=&quot;from---inherit-from-a-parent-container&quot;&gt;FROM - Inherit from a parent image&lt;/h4&gt;
&lt;div class=&quot;language-dockerfile highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt;&lt;span class=&quot;s&quot;&gt; ubuntu:22.04&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# or&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt;&lt;span class=&quot;s&quot;&gt; python:3.11-slim&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# or&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt;&lt;span class=&quot;s&quot;&gt; node:18-alpine&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h4 id=&quot;run---execute-commands-during-build&quot;&gt;RUN - Execute commands during build&lt;/h4&gt;
&lt;div class=&quot;language-dockerfile highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;RUN &lt;/span&gt;apt-get update &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; apt-get &lt;span class=&quot;nb&quot;&gt;install&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-y&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\
&lt;/span&gt;    python3 &lt;span class=&quot;se&quot;&gt;\
&lt;/span&gt;    python3-pip &lt;span class=&quot;se&quot;&gt;\
&lt;/span&gt;    &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;rm&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-rf&lt;/span&gt; /var/lib/apt/lists/&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h4 id=&quot;copy--add---copy-files-into-the-image&quot;&gt;COPY / ADD - Copy files into the image&lt;/h4&gt;
&lt;div class=&quot;language-dockerfile highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;# COPY is preferred for simple file copying&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;COPY&lt;/span&gt;&lt;span class=&quot;s&quot;&gt; app.py /usr/local/app/&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;COPY&lt;/span&gt;&lt;span class=&quot;s&quot;&gt; requirements.txt /usr/local/app/&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# ADD can extract archives and download URLs&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;ADD&lt;/span&gt;&lt;span class=&quot;s&quot;&gt; myapp.tar.gz /usr/local/&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h4 id=&quot;workdir---set-working-directory&quot;&gt;WORKDIR - Set working directory&lt;/h4&gt;
&lt;div class=&quot;language-dockerfile highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;WORKDIR&lt;/span&gt;&lt;span class=&quot;s&quot;&gt; /usr/local/app&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h4 id=&quot;expose---register-ports-the-container-listens-on&quot;&gt;EXPOSE - Register ports the container listens on&lt;/h4&gt;
&lt;div class=&quot;language-dockerfile highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;EXPOSE&lt;/span&gt;&lt;span class=&quot;s&quot;&gt; 80&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;EXPOSE&lt;/span&gt;&lt;span class=&quot;s&quot;&gt; 443&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h4 id=&quot;env---set-environment-variables&quot;&gt;ENV - Set environment variables&lt;/h4&gt;
&lt;div class=&quot;language-dockerfile highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;ENV&lt;/span&gt;&lt;span class=&quot;s&quot;&gt; PYTHONUNBUFFERED=1&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;ENV&lt;/span&gt;&lt;span class=&quot;s&quot;&gt; APP_ENV=production&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h4 id=&quot;cmd---default-command-to-execute-on-startup&quot;&gt;CMD - Default command to execute on startup&lt;/h4&gt;
&lt;div class=&quot;language-dockerfile highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;CMD&lt;/span&gt;&lt;span class=&quot;s&quot;&gt; [&quot;python&quot;, &quot;/usr/local/app/app.py&quot;]&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# or&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;CMD&lt;/span&gt;&lt;span class=&quot;s&quot;&gt; [&quot;npm&quot;, &quot;start&quot;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h4 id=&quot;entrypoint---configure-container-as-executable&quot;&gt;ENTRYPOINT - Configure container as executable&lt;/h4&gt;
&lt;div class=&quot;language-dockerfile highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;ENTRYPOINT&lt;/span&gt;&lt;span class=&quot;s&quot;&gt; [&quot;python&quot;, &quot;app.py&quot;]&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# Arguments can be passed: docker run myimage arg1 arg2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
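
&lt;p&gt;When both instructions are written in exec form, the command a container actually runs is the &lt;code&gt;ENTRYPOINT&lt;/code&gt; followed by either the arguments given to &lt;code&gt;docker run&lt;/code&gt; or, if there are none, the &lt;code&gt;CMD&lt;/code&gt;. A small sketch of that rule, where &lt;code&gt;effective_command&lt;/code&gt; is a hypothetical helper written for illustration:&lt;/p&gt;

```python
def effective_command(entrypoint=None, cmd=None, run_args=None):
    """Combine exec-form ENTRYPOINT and CMD the way `docker run` does.

    Arguments given after the image name on `docker run` replace CMD;
    the result is appended to ENTRYPOINT.
    """
    entrypoint = list(entrypoint or [])
    default_args = list(cmd or [])
    args = list(run_args) if run_args else default_args
    return entrypoint + args
```

&lt;p&gt;For example, with &lt;code&gt;ENTRYPOINT [&quot;python&quot;, &quot;app.py&quot;]&lt;/code&gt; and &lt;code&gt;CMD [&quot;--help&quot;]&lt;/code&gt;, running &lt;code&gt;docker run myimage arg1 arg2&lt;/code&gt; corresponds to &lt;code&gt;effective_command([&quot;python&quot;, &quot;app.py&quot;], [&quot;--help&quot;], [&quot;arg1&quot;, &quot;arg2&quot;])&lt;/code&gt;, which drops &lt;code&gt;--help&lt;/code&gt; in favour of the two run-time arguments.&lt;/p&gt;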

&lt;h3 id=&quot;example-dockerfile&quot;&gt;Example Dockerfile&lt;/h3&gt;

&lt;div class=&quot;language-dockerfile highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;# Use official Python runtime as base&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt;&lt;span class=&quot;s&quot;&gt; python:3.11-slim&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# Set working directory&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;WORKDIR&lt;/span&gt;&lt;span class=&quot;s&quot;&gt; /app&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# Copy requirements file&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;COPY&lt;/span&gt;&lt;span class=&quot;s&quot;&gt; requirements.txt .&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# Install dependencies&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;RUN &lt;/span&gt;pip &lt;span class=&quot;nb&quot;&gt;install&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--no-cache-dir&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-r&lt;/span&gt; requirements.txt

&lt;span class=&quot;c&quot;&gt;# Copy application code&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;COPY&lt;/span&gt;&lt;span class=&quot;s&quot;&gt; . .&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# Expose port&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;EXPOSE&lt;/span&gt;&lt;span class=&quot;s&quot;&gt; 8000&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# Set environment variables&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;ENV&lt;/span&gt;&lt;span class=&quot;s&quot;&gt; PYTHONUNBUFFERED=1&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# Run the application&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;CMD&lt;/span&gt;&lt;span class=&quot;s&quot;&gt; [&quot;python&quot;, &quot;app.py&quot;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;docker-build-process&quot;&gt;Docker Build Process&lt;/h3&gt;

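&lt;p&gt;Each instruction in a Dockerfile is a separate, cacheable build step. Instructions that change the filesystem (&lt;code&gt;RUN&lt;/code&gt;, &lt;code&gt;COPY&lt;/code&gt;, &lt;code&gt;ADD&lt;/code&gt;) add an &lt;strong&gt;image layer&lt;/strong&gt;; others, such as &lt;code&gt;CMD&lt;/code&gt;, only record metadata:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;FROM ubuntu:22.04               → Layer 1 (base image layers)
RUN apt-get update              → Layer 2
RUN apt-get install -y python3  → Layer 3
COPY app.py /app/               → Layer 4
CMD [&quot;python3&quot;, &quot;/app/app.py&quot;]  → Metadata only (no new filesystem layer)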
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Benefits of layering:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Caching&lt;/strong&gt;: Unchanged layers are reused&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Efficiency&lt;/strong&gt;: Only modified layers are rebuilt&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Sharing&lt;/strong&gt;: Common layers shared between images&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;optimizing-dockerfiles-for-caching&quot;&gt;Optimizing Dockerfiles for Caching&lt;/h3&gt;

&lt;p&gt;Structure your Dockerfile to maximize cache hits:&lt;/p&gt;

&lt;div class=&quot;language-dockerfile highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;# ❌ BAD: Any code change invalidates the COPY layer and forces pip install to rerun&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt;&lt;span class=&quot;s&quot;&gt; python:3.11&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;COPY&lt;/span&gt;&lt;span class=&quot;s&quot;&gt; . /app&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;RUN &lt;/span&gt;pip &lt;span class=&quot;nb&quot;&gt;install&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-r&lt;/span&gt; /app/requirements.txt
&lt;span class=&quot;k&quot;&gt;CMD&lt;/span&gt;&lt;span class=&quot;s&quot;&gt; [&quot;python&quot;, &quot;/app/app.py&quot;]&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# ✅ GOOD: Dependencies cached separately&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt;&lt;span class=&quot;s&quot;&gt; python:3.11&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;WORKDIR&lt;/span&gt;&lt;span class=&quot;s&quot;&gt; /app&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# Install dependencies first (changes less frequently)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;COPY&lt;/span&gt;&lt;span class=&quot;s&quot;&gt; requirements.txt .&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;RUN &lt;/span&gt;pip &lt;span class=&quot;nb&quot;&gt;install&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-r&lt;/span&gt; requirements.txt

&lt;span class=&quot;c&quot;&gt;# Copy code last (changes frequently)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;COPY&lt;/span&gt;&lt;span class=&quot;s&quot;&gt; . .&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;CMD&lt;/span&gt;&lt;span class=&quot;s&quot;&gt; [&quot;python&quot;, &quot;app.py&quot;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

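&lt;p&gt;&lt;strong&gt;Principle&lt;/strong&gt;: “Funnel down” from the most general, rarely changing steps to the most specific, frequently changing ones, so that an edit invalidates as few cached layers as possible.&lt;/p&gt;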
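
&lt;p&gt;Why instruction order matters can be illustrated with a simplified model of the build cache. The sketch below is not how Docker computes its cache keys; it only assumes that each step&apos;s key depends on the previous step, the instruction text, and the content of any copied files. The function names and sample files are made up.&lt;/p&gt;

```python
import hashlib


def layer_keys(instructions, file_contents):
    """Return one cache key per instruction.

    Each key hashes the parent key, the instruction text, and (for COPY)
    the contents of the copied files, so a change invalidates every later step.
    `file_contents` maps a COPY source to its content; "." stands for all files.
    """
    keys = []
    parent = ""
    for line in instructions:
        digest = hashlib.sha256()
        digest.update(parent.encode())
        digest.update(line.encode())
        if line.startswith("COPY "):
            source = line.split()[1]
            if source == ".":
                payload = sorted(file_contents.items())
            else:
                payload = [(source, file_contents[source])]
            digest.update(repr(payload).encode())
        parent = digest.hexdigest()
        keys.append(parent)
    return keys


def rebuilt_steps(instructions, before, after):
    """List the instructions whose cache key differs between two builds."""
    old = layer_keys(instructions, before)
    new = layer_keys(instructions, after)
    return [line for line, a, b in zip(instructions, old, new) if a != b]


bad = ["FROM python:3.11", "COPY . /app", "RUN pip install -r /app/requirements.txt"]
good = ["FROM python:3.11", "COPY requirements.txt .", "RUN pip install -r requirements.txt", "COPY . ."]

v1 = {"requirements.txt": "flask", "app.py": "print('v1')"}
v2 = {"requirements.txt": "flask", "app.py": "print('v2')"}  # only the code changed
```

&lt;p&gt;In this model, &lt;code&gt;rebuilt_steps(bad, v1, v2)&lt;/code&gt; includes the &lt;code&gt;pip install&lt;/code&gt; step, while &lt;code&gt;rebuilt_steps(good, v1, v2)&lt;/code&gt; contains only the final &lt;code&gt;COPY . .&lt;/code&gt;.&lt;/p&gt;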

&lt;h2 id=&quot;docker-architecture&quot;&gt;Docker Architecture&lt;/h2&gt;

&lt;p&gt;Docker uses a client-server architecture:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;┌─────────────────────────────────────────────────┐
│  Docker Client (CLI)                            │
│  $ docker run, docker build, docker pull        │
└────────────────┬────────────────────────────────┘
                 │ REST API
                 ↓
┌─────────────────────────────────────────────────┐
│  Docker Daemon (dockerd)                        │
│  ├── Manages images, containers, networks       │
│  ├── Handles build requests                     │
│  └── Communicates with registries               │
└────────────────┬────────────────────────────────┘
                 │
      ┌──────────┴──────────┬──────────────┐
      ↓                     ↓              ↓
┌──────────┐         ┌──────────┐   ┌──────────┐
│ Images   │         │Containers│   │ Networks │
└──────────┘         └──────────┘   └──────────┘
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;components&quot;&gt;Components&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Docker Daemon (dockerd)&lt;/strong&gt;
    &lt;ul&gt;
      &lt;li&gt;Long-running background process&lt;/li&gt;
      &lt;li&gt;Manages containers, images, networks, volumes&lt;/li&gt;
      &lt;li&gt;Listens for API requests&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Docker Client (docker CLI)&lt;/strong&gt;
    &lt;ul&gt;
      &lt;li&gt;Command-line interface&lt;/li&gt;
      &lt;li&gt;Sends commands to daemon via REST API&lt;/li&gt;
      &lt;li&gt;Can connect to remote daemons&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Docker Registry&lt;/strong&gt;
    &lt;ul&gt;
      &lt;li&gt;Stores Docker images&lt;/li&gt;
      &lt;li&gt;Docker Hub (public)&lt;/li&gt;
      &lt;li&gt;Private registries (Amazon ECR, Google Artifact Registry, Azure Container Registry)&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Docker Images&lt;/strong&gt;
    &lt;ul&gt;
      &lt;li&gt;Read-only templates&lt;/li&gt;
      &lt;li&gt;Built from Dockerfiles&lt;/li&gt;
      &lt;li&gt;Stored in layers&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Docker Containers&lt;/strong&gt;
    &lt;ul&gt;
      &lt;li&gt;Running instances of images&lt;/li&gt;
      &lt;li&gt;Isolated processes&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ol&gt;
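
&lt;p&gt;The client-server split means anything that speaks HTTP can act as a Docker client. The sketch below assumes a Linux host where the daemon listens on the Unix socket &lt;code&gt;/var/run/docker.sock&lt;/code&gt; and the current user may access it; the helper names are hypothetical, and the response parsing is deliberately minimal.&lt;/p&gt;

```python
import json
import socket


def build_request(path):
    """Return a minimal HTTP/1.0 GET request for the Engine API."""
    return ("GET " + path + " HTTP/1.0\r\nHost: docker\r\n\r\n").encode("ascii")


def parse_response(raw):
    """Split a raw HTTP response into (status_code, decoded JSON body)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    status_code = int(head.split(b" ", 2)[1])
    return status_code, json.loads(body)


def query_daemon(path, socket_path="/var/run/docker.sock"):
    """Send one GET request to a local Docker daemon over its Unix socket.

    Assumes a daemon is running and the socket path matches the local setup.
    """
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as conn:
        conn.connect(socket_path)
        conn.sendall(build_request(path))
        chunks = []
        while True:
            chunk = conn.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
    return parse_response(b"".join(chunks))
```

&lt;p&gt;With a daemon running, &lt;code&gt;query_daemon(&quot;/version&quot;)&lt;/code&gt; asks for the same information as &lt;code&gt;docker version&lt;/code&gt;, and &lt;code&gt;query_daemon(&quot;/containers/json&quot;)&lt;/code&gt; corresponds to &lt;code&gt;docker ps&lt;/code&gt;.&lt;/p&gt;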

&lt;h2 id=&quot;basic-docker-commands&quot;&gt;Basic Docker Commands&lt;/h2&gt;

&lt;h3 id=&quot;working-with-images&quot;&gt;Working with Images&lt;/h3&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;# Pull an image from registry&lt;/span&gt;
docker pull ubuntu:22.04

&lt;span class=&quot;c&quot;&gt;# List local images&lt;/span&gt;
docker images

&lt;span class=&quot;c&quot;&gt;# Build an image from Dockerfile&lt;/span&gt;
docker build &lt;span class=&quot;nt&quot;&gt;-t&lt;/span&gt; myapp:v1.0 &lt;span class=&quot;nb&quot;&gt;.&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# Tag an image&lt;/span&gt;
docker tag myapp:v1.0 username/myapp:v1.0

&lt;span class=&quot;c&quot;&gt;# Push image to registry&lt;/span&gt;
docker push username/myapp:v1.0

&lt;span class=&quot;c&quot;&gt;# Remove an image&lt;/span&gt;
docker rmi myapp:v1.0

&lt;span class=&quot;c&quot;&gt;# Search Docker Hub for images&lt;/span&gt;
docker search nginx
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;working-with-containers&quot;&gt;Working with Containers&lt;/h3&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;# Run a container&lt;/span&gt;
docker run &lt;span class=&quot;nt&quot;&gt;-d&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-p&lt;/span&gt; 8080:80 &lt;span class=&quot;nt&quot;&gt;--name&lt;/span&gt; webserver nginx

&lt;span class=&quot;c&quot;&gt;# List running containers&lt;/span&gt;
docker ps

&lt;span class=&quot;c&quot;&gt;# List all containers (including stopped)&lt;/span&gt;
docker ps &lt;span class=&quot;nt&quot;&gt;-a&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# Stop a container&lt;/span&gt;
docker stop webserver

&lt;span class=&quot;c&quot;&gt;# Start a stopped container&lt;/span&gt;
docker start webserver

&lt;span class=&quot;c&quot;&gt;# Restart a container&lt;/span&gt;
docker restart webserver

&lt;span class=&quot;c&quot;&gt;# Execute command in running container&lt;/span&gt;
docker &lt;span class=&quot;nb&quot;&gt;exec&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-it&lt;/span&gt; webserver bash

&lt;span class=&quot;c&quot;&gt;# View container logs&lt;/span&gt;
docker logs webserver

&lt;span class=&quot;c&quot;&gt;# Attach to running container&lt;/span&gt;
docker attach webserver

&lt;span class=&quot;c&quot;&gt;# Remove a container&lt;/span&gt;
docker &lt;span class=&quot;nb&quot;&gt;rm &lt;/span&gt;webserver

&lt;span class=&quot;c&quot;&gt;# Remove all stopped containers&lt;/span&gt;
docker container prune
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;advanced-operations&quot;&gt;Advanced Operations&lt;/h3&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;# Run container with a host directory bind-mounted into it&lt;/span&gt;
docker run &lt;span class=&quot;nt&quot;&gt;-v&lt;/span&gt; /host/path:/container/path myapp

&lt;span class=&quot;c&quot;&gt;# Run with environment variables&lt;/span&gt;
docker run &lt;span class=&quot;nt&quot;&gt;-e&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;DB_HOST&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;localhost &lt;span class=&quot;nt&quot;&gt;-e&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;DB_PORT&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;5432 myapp

&lt;span class=&quot;c&quot;&gt;# Run with resource limits&lt;/span&gt;
docker run &lt;span class=&quot;nt&quot;&gt;--memory&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;512m&quot;&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--cpus&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;1.5&quot;&lt;/span&gt; myapp

&lt;span class=&quot;c&quot;&gt;# Export container filesystem&lt;/span&gt;
docker &lt;span class=&quot;nb&quot;&gt;export &lt;/span&gt;webserver &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; webserver.tar

&lt;span class=&quot;c&quot;&gt;# Create image from container changes&lt;/span&gt;
docker commit webserver myapp:v1.1

&lt;span class=&quot;c&quot;&gt;# Inspect container details&lt;/span&gt;
docker inspect webserver

&lt;span class=&quot;c&quot;&gt;# View container resource usage&lt;/span&gt;
docker stats
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;container-orchestration&quot;&gt;Container Orchestration&lt;/h2&gt;

&lt;p&gt;Running many containers, often across multiple hosts, requires orchestration tooling to place, connect, and scale them.&lt;/p&gt;

&lt;h3 id=&quot;docker-swarm&quot;&gt;Docker Swarm&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Native Docker clustering&lt;/strong&gt;&lt;/li&gt;
  &lt;li&gt;Pools multiple Docker engines into a single virtual host&lt;/li&gt;
  &lt;li&gt;Lets multiple machines (physical or virtual) run one application together&lt;/li&gt;
  &lt;li&gt;Built-in load balancing&lt;/li&gt;
  &lt;li&gt;Service discovery&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;# Initialize swarm&lt;/span&gt;
docker swarm init

&lt;span class=&quot;c&quot;&gt;# Deploy a service&lt;/span&gt;
docker service create &lt;span class=&quot;nt&quot;&gt;--name&lt;/span&gt; web &lt;span class=&quot;nt&quot;&gt;--replicas&lt;/span&gt; 3 &lt;span class=&quot;nt&quot;&gt;-p&lt;/span&gt; 80:80 nginx

&lt;span class=&quot;c&quot;&gt;# Scale a service&lt;/span&gt;
docker service scale &lt;span class=&quot;nv&quot;&gt;web&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;5

&lt;span class=&quot;c&quot;&gt;# List services&lt;/span&gt;
docker service &lt;span class=&quot;nb&quot;&gt;ls&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
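
&lt;p&gt;When a service is scaled, the orchestrator has to decide which node runs each replica. The toy model below spreads replicas by always choosing the least-loaded node. It is a sketch of the idea only, not Swarm&apos;s actual scheduler, which also considers constraints and available resources; the node names are made up.&lt;/p&gt;

```python
def place_replicas(nodes, replicas):
    """Assign replica numbers to nodes, always picking the least-loaded node.

    A toy model of a spread-style scheduler; node names are made up.
    """
    placement = {node: [] for node in nodes}
    for replica in range(1, replicas + 1):
        # min() returns the first node with the fewest tasks, so ties
        # are broken by the order of `nodes`.
        target = min(nodes, key=lambda node: len(placement[node]))
        placement[target].append(replica)
    return placement
```

&lt;p&gt;Scaling from three to five replicas on three nodes in this model places the two extra replicas on the first two nodes, so no node carries more than two.&lt;/p&gt;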

&lt;h3 id=&quot;docker-compose&quot;&gt;Docker Compose&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Multi-container application orchestration&lt;/strong&gt;&lt;/li&gt;
  &lt;li&gt;Declarative YAML format&lt;/li&gt;
  &lt;li&gt;Defines services, networks, volumes&lt;/li&gt;
  &lt;li&gt;Easy local development&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;docker-compose.yml example:&lt;/strong&gt;&lt;/p&gt;
&lt;div class=&quot;language-yaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;na&quot;&gt;version&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;3.8&apos;&lt;/span&gt;

&lt;span class=&quot;na&quot;&gt;services&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;web&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;build&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;.&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;ports&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;8000:8000&quot;&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;volumes&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;.:/app&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;environment&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;DATABASE_URL=postgresql://db:5432/myapp&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;depends_on&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;db&lt;/span&gt;
  
  &lt;span class=&quot;na&quot;&gt;db&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;image&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;postgres:15&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;volumes&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;postgres_data:/var/lib/postgresql/data&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;environment&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;POSTGRES_PASSWORD=secret&lt;/span&gt;

&lt;span class=&quot;na&quot;&gt;volumes&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;postgres_data&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;# Start all services&lt;/span&gt;
docker-compose up &lt;span class=&quot;nt&quot;&gt;-d&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# Stop all services&lt;/span&gt;
docker-compose down

&lt;span class=&quot;c&quot;&gt;# View logs&lt;/span&gt;
docker-compose logs &lt;span class=&quot;nt&quot;&gt;-f&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# Scale a service (remove the fixed host port 8000 first, otherwise the extra replicas cannot bind it)&lt;/span&gt;
docker-compose up &lt;span class=&quot;nt&quot;&gt;-d&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--scale&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;web&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;3
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
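
&lt;p&gt;As a rough illustration of what &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;depends_on&lt;/code&gt; implies, the Python sketch below derives a start order from the dependency graph of the example file. It is not how Compose is implemented: the dictionary only mirrors the two services in the example, and the helper name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;start_order&lt;/code&gt; is hypothetical.&lt;/p&gt;

```python
from graphlib import TopologicalSorter

# Hypothetical in-memory mirror of the depends_on entries in the example file.
DEPENDS_ON = {
    "web": ["db"],
    "db": [],
}


def start_order(depends_on):
    """Return service names ordered so that dependencies come first."""
    # TopologicalSorter expects {node: predecessors}, which matches depends_on.
    return list(TopologicalSorter(depends_on).static_order())


if __name__ == "__main__":
    print(start_order(DEPENDS_ON))
```

&lt;p&gt;Note that ordering only controls when containers are started. By default it does not wait until the database is ready to accept connections, so the application should still retry its first connection.&lt;/p&gt;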

&lt;h3 id=&quot;kubernetes&quot;&gt;Kubernetes&lt;/h3&gt;

&lt;p&gt;Kubernetes is the most widely used container orchestration platform:&lt;/p&gt;

&lt;h4 id=&quot;key-concepts&quot;&gt;Key Concepts&lt;/h4&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Nodes&lt;/strong&gt;: Physical or virtual machines&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Pods&lt;/strong&gt;: Group of one or more containers
    &lt;ul&gt;
      &lt;li&gt;Share the same network namespace&lt;/li&gt;
      &lt;li&gt;Share a single IP address and port space&lt;/li&gt;
      &lt;li&gt;Typically represent one instance of a tier in a multi-tier app (frontend, backend, database)&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Services&lt;/strong&gt;: Stable network endpoint for pods&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Deployments&lt;/strong&gt;: Declarative updates for pods&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;kubernetes-features&quot;&gt;Kubernetes Features&lt;/h4&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Auto-scaling&lt;/strong&gt;: Scale pods based on load&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Self-healing&lt;/strong&gt;: Restart failed containers&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Load balancing&lt;/strong&gt;: Distribute traffic across pods&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Rolling updates&lt;/strong&gt;: Zero-downtime deployments&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Service discovery&lt;/strong&gt;: Automatic DNS for services&lt;/li&gt;
&lt;/ul&gt;
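
&lt;p&gt;Self-healing and scaling both follow from one idea: a controller repeatedly compares the desired state with the observed state and corrects the difference. The Python sketch below is a toy model of that loop, not Kubernetes code. The function names and pod names are hypothetical.&lt;/p&gt;

```python
import itertools


def name_factory(prefix, start):
    """Return a function that produces prefix-N names, counting up from start."""
    counter = itertools.count(start)
    return lambda: f"{prefix}-{next(counter)}"


def reconcile(running, desired, new_name):
    """Return a pod list that matches the desired replica count."""
    pods = list(running)[:desired]        # scale down: drop surplus pods
    for _ in range(desired - len(pods)):  # scale up or self-heal: add missing pods
        pods.append(new_name())
    return pods


if __name__ == "__main__":
    make_name = name_factory("web", 4)
    observed = ["web-1", "web-3"]  # web-2 has crashed
    print(reconcile(observed, 3, make_name))
```

&lt;p&gt;The same function covers both cases: a crashed pod leaves the list one short, and a lower replica count leaves it too long.&lt;/p&gt;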

&lt;div class=&quot;language-yaml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;# Example Kubernetes deployment&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;apiVersion&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;apps/v1&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;kind&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;Deployment&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;metadata&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;web-app&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;spec&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;replicas&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;3&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;selector&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;matchLabels&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;app&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;web&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;template&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;metadata&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;labels&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;na&quot;&gt;app&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;web&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;spec&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;na&quot;&gt;containers&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
      &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;web&lt;/span&gt;
        &lt;span class=&quot;na&quot;&gt;image&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;myapp:v1.0&lt;/span&gt;
        &lt;span class=&quot;na&quot;&gt;ports&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;pi&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;containerPort&lt;/span&gt;&lt;span class=&quot;pi&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;8000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
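
&lt;p&gt;In the manifest above, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;selector.matchLabels&lt;/code&gt; and the template labels both carry &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;app: web&lt;/code&gt;, which is how the Deployment recognizes its own pods. The following Python sketch shows the matching rule in isolation. The pod names and the extra &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tier&lt;/code&gt; label are made up for illustration.&lt;/p&gt;

```python
def matches(match_labels, pod_labels):
    """True when every selector key/value pair is present on the pod."""
    return all(pod_labels.get(key) == value for key, value in match_labels.items())


# Hypothetical pods and their labels.
PODS = {
    "web-app-1": {"app": "web"},
    "web-app-2": {"app": "web", "tier": "frontend"},
    "db-1": {"app": "db"},
}


def select(match_labels, pods):
    """Return the sorted names of all pods matched by the selector."""
    return sorted(name for name, labels in pods.items() if matches(match_labels, labels))


if __name__ == "__main__":
    print(select({"app": "web"}, PODS))
```

&lt;p&gt;Extra labels on a pod do not prevent a match. Only the keys named in the selector are compared.&lt;/p&gt;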

&lt;h2 id=&quot;what-containers-can-do&quot;&gt;What Containers CAN Do&lt;/h2&gt;

&lt;p&gt;✓ &lt;strong&gt;Run different Linux distributions&lt;/strong&gt; on the same host&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Example: Ubuntu container on Red Hat host&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;✓ &lt;strong&gt;Run applications with different dependencies&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Example: Python 3.9 in one container, Python 3.11 in another&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;✓ &lt;strong&gt;Use the host’s hardware&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Access network interfaces&lt;/li&gt;
  &lt;li&gt;Access GPUs (NVIDIA GPUs need the host driver plus the NVIDIA Container Toolkit)&lt;/li&gt;
  &lt;li&gt;Access storage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;✓ &lt;strong&gt;Isolate applications&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Each container has its own filesystem&lt;/li&gt;
  &lt;li&gt;Process isolation&lt;/li&gt;
  &lt;li&gt;Network isolation&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;best-practices&quot;&gt;Best Practices&lt;/h2&gt;

&lt;h3 id=&quot;image-best-practices&quot;&gt;Image Best Practices&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Use official base images&lt;/strong&gt; when possible&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Keep images small&lt;/strong&gt;: Use alpine or slim variants&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Use specific tags&lt;/strong&gt;: Avoid &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;latest&lt;/code&gt; in production&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Minimize layers&lt;/strong&gt;: Combine RUN commands&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Use .dockerignore&lt;/strong&gt;: Exclude unnecessary files&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Don’t run as root&lt;/strong&gt;: Use USER directive&lt;/li&gt;
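  &lt;li&gt;&lt;strong&gt;Scan for vulnerabilities&lt;/strong&gt;: Use an image scanner such as Docker Scout (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;docker scout cves&lt;/code&gt;); the older &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;docker scan&lt;/code&gt; command has been retired&lt;/li&gt;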
&lt;/ol&gt;

&lt;h3 id=&quot;container-best-practices&quot;&gt;Container Best Practices&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;One process per container&lt;/strong&gt;: Follow microservices principle&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Use volumes for data&lt;/strong&gt;: Don’t store data in containers&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Log to stdout/stderr&lt;/strong&gt;: Let Docker handle log management&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Use health checks&lt;/strong&gt;: Let the orchestrator detect unhealthy containers and restart or replace them&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Set resource limits&lt;/strong&gt;: Prevent resource exhaustion&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Use environment variables&lt;/strong&gt;: For configuration&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Keep containers stateless&lt;/strong&gt;: Enable easy scaling&lt;/li&gt;
&lt;/ol&gt;
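
&lt;p&gt;Two of these practices, configuration through environment variables and logging to stdout, can be sketched in a few lines of Python. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DATABASE_URL&lt;/code&gt; is taken from the Compose example earlier. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LOG_LEVEL&lt;/code&gt; variable, the default values, and the function names are assumptions made for this illustration.&lt;/p&gt;

```python
import logging
import os
import sys


def load_config(environ=os.environ):
    """Read settings from environment variables, with development defaults."""
    return {
        "database_url": environ.get("DATABASE_URL", "postgresql://localhost:5432/myapp"),
        "log_level": environ.get("LOG_LEVEL", "INFO"),
    }


def configure_logging(level):
    """Send log records to stdout so the container runtime can collect them."""
    logging.basicConfig(stream=sys.stdout, level=level, format="%(levelname)s %(message)s")


if __name__ == "__main__":
    config = load_config()
    configure_logging(config["log_level"])
    logging.info("connecting to %s", config["database_url"])
```

&lt;p&gt;Because nothing is written to files inside the container, the same image can run unchanged in development and production with different environment values.&lt;/p&gt;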

&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;/h2&gt;

&lt;p&gt;Docker simplifies container management:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Images&lt;/strong&gt; are templates, &lt;strong&gt;containers&lt;/strong&gt; are running instances&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Dockerfiles&lt;/strong&gt; define how to build images&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Docker architecture&lt;/strong&gt; uses a client-server model&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Layering&lt;/strong&gt; enables efficient caching and sharing&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Orchestration tools&lt;/strong&gt; (Swarm, Compose, Kubernetes) manage multiple containers&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Best practices&lt;/strong&gt; ensure secure, efficient deployments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the next lecture, we’ll explore &lt;strong&gt;serverless computing&lt;/strong&gt; and how it abstracts away even container management.&lt;/p&gt;

&lt;h2 id=&quot;further-reading&quot;&gt;Further Reading&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://docs.docker.com/&quot;&gt;Docker Documentation&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://hub.docker.com/&quot;&gt;Docker Hub&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://kubernetes.io/docs/&quot;&gt;Kubernetes Documentation&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://docs.docker.com/compose/&quot;&gt;Docker Compose Documentation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content>
 </entry>
 
 <entry>
   <title>08-02 Containerization and Linux Primitives</title>
   <link href="https://nglelinh.github.io/contents/en/chapter08/08_02_Containerization/"/>
   <updated>2021-01-01T00:00:00+00:00</updated>
   <id>https://nglelinh.github.io/service-oriented-architecture-and-cloud-computing-iuh/contents/en/chapter08/08_02_Containerization</id>
   <content type="html">&lt;p&gt;While virtual machines provide complete isolation by virtualizing entire operating systems, &lt;strong&gt;containers&lt;/strong&gt; offer a lightweight alternative that shares the OS kernel while providing isolated execution environments. This lecture explores containerization and the Linux primitives that make it possible.&lt;/p&gt;

&lt;h2 id=&quot;motivation-for-containerization&quot;&gt;Motivation for Containerization&lt;/h2&gt;

&lt;h3 id=&quot;limitations-of-vms&quot;&gt;Limitations of VMs&lt;/h3&gt;

&lt;p&gt;Virtual machines are powerful but have drawbacks:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;High runtime overhead&lt;/strong&gt;: Each VM runs a complete OS&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Slow startup&lt;/strong&gt;: Booting an OS takes time&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Resource intensive&lt;/strong&gt;: Multiple OS copies consume significant memory&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Large image sizes&lt;/strong&gt;: VM images are typically gigabytes in size&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;the-container-solution&quot;&gt;The Container Solution&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What if we could sandbox applications but share the OS kernel?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Containers provide:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;✓ &lt;strong&gt;Faster scaling&lt;/strong&gt;: Start in seconds instead of minutes&lt;/li&gt;
  &lt;li&gt;✓ &lt;strong&gt;Lower overhead&lt;/strong&gt;: Share kernel, only package app dependencies&lt;/li&gt;
  &lt;li&gt;✓ &lt;strong&gt;Smaller footprint&lt;/strong&gt;: Container images are megabytes, not gigabytes&lt;/li&gt;
  &lt;li&gt;✓ &lt;strong&gt;Higher density&lt;/strong&gt;: Run more containers than VMs on same hardware&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This enables:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Different software architectures (microservices)&lt;/li&gt;
  &lt;li&gt;New development practices (DevOps, CI/CD)&lt;/li&gt;
  &lt;li&gt;More efficient resource utilization&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;containers-vs-virtual-machines&quot;&gt;Containers vs Virtual Machines&lt;/h2&gt;

&lt;h3 id=&quot;architecture-comparison&quot;&gt;Architecture Comparison&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Traditional VM-based Infrastructure:&lt;/strong&gt;&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;┌──────────────┐  ┌──────────────┐  ┌──────────────┐
│    App A     │  │    App B     │  │    App C     │
├──────────────┤  ├──────────────┤  ├──────────────┤
│   Bins/Libs  │  │   Bins/Libs  │  │   Bins/Libs  │
├──────────────┤  ├──────────────┤  ├──────────────┤
│   Guest OS   │  │   Guest OS   │  │   Guest OS   │
├──────────────┴──┴──────────────┴──┴──────────────┤
│              Hypervisor                           │
├───────────────────────────────────────────────────┤
│              Host OS (optional)                   │
├───────────────────────────────────────────────────┤
│              Physical Hardware                    │
└───────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Container Infrastructure:&lt;/strong&gt;&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;┌──────────────┐  ┌──────────────┐  ┌──────────────┐
│    App A     │  │    App B     │  │    App C     │
├──────────────┤  ├──────────────┤  ├──────────────┤
│   Bins/Libs  │  │   Bins/Libs  │  │   Bins/Libs  │
├──────────────┴──┴──────────────┴──┴──────────────┤
│           Container Runtime (Docker)              │
├───────────────────────────────────────────────────┤
│              Host OS / Kernel                     │
├───────────────────────────────────────────────────┤
│              Physical Hardware                    │
└───────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;key-differences&quot;&gt;Key Differences&lt;/h3&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Aspect&lt;/th&gt;
      &lt;th&gt;Virtual Machines&lt;/th&gt;
      &lt;th&gt;Containers&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;OS&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Complete OS per VM&lt;/td&gt;
      &lt;td&gt;Shared kernel&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Size&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Gigabytes&lt;/td&gt;
      &lt;td&gt;Megabytes&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Startup&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Minutes&lt;/td&gt;
      &lt;td&gt;Seconds&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Isolation&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Strong (hardware-level)&lt;/td&gt;
      &lt;td&gt;Process-level&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Overhead&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Higher&lt;/td&gt;
      &lt;td&gt;Lower&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Density&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;10s per host&lt;/td&gt;
      &lt;td&gt;100s per host&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Portability&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Less portable&lt;/td&gt;
      &lt;td&gt;Highly portable&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;h3 id=&quot;what-containers-share&quot;&gt;What Containers Share&lt;/h3&gt;

&lt;p&gt;Containers have a &lt;strong&gt;separate view&lt;/strong&gt; of:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Root filesystem&lt;/li&gt;
  &lt;li&gt;Libraries and utilities&lt;/li&gt;
  &lt;li&gt;Process tree&lt;/li&gt;
  &lt;li&gt;Users and permissions&lt;/li&gt;
  &lt;li&gt;Networking stack&lt;/li&gt;
  &lt;li&gt;IPC endpoints&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But they &lt;strong&gt;share&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;The same OS kernel&lt;/li&gt;
  &lt;li&gt;System calls interface&lt;/li&gt;
  &lt;li&gt;Hardware resources (managed by kernel)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;big-idea-less-os-overhead&quot;&gt;Big Idea: Less OS Overhead&lt;/h2&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;┌─────────────────────────────────────────────┐
│  Traditional VMs: 3 VMs on one host         │
│  - 3 complete OS copies                     │
│  - High memory usage                        │
│  - Slower startup                           │
└─────────────────────────────────────────────┘

┌─────────────────────────────────────────────┐
│  Containers: 10+ containers on same host    │
│  - 1 shared kernel                          │
│  - Lower memory usage                       │
│  - Instant startup                          │
└─────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;important-constraint-kernel-compatibility&quot;&gt;Important Constraint: Kernel Compatibility&lt;/h2&gt;

&lt;blockquote&gt;
  &lt;p&gt;[!IMPORTANT]
&lt;strong&gt;Container-Kernel Dependency&lt;/strong&gt;&lt;/p&gt;

  &lt;ul&gt;
    &lt;li&gt;A &lt;strong&gt;Linux container&lt;/strong&gt; needs a &lt;strong&gt;Linux kernel&lt;/strong&gt;&lt;/li&gt;
    &lt;li&gt;A &lt;strong&gt;Windows container&lt;/strong&gt; needs a &lt;strong&gt;Windows kernel&lt;/strong&gt;&lt;/li&gt;
    &lt;li&gt;You &lt;strong&gt;cannot&lt;/strong&gt; run a Windows container on a Linux host natively (or vice versa)&lt;/li&gt;
  &lt;/ul&gt;
&lt;/blockquote&gt;

&lt;h3 id=&quot;solutions&quot;&gt;Solutions&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;On Windows:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Use &lt;strong&gt;Windows Subsystem for Linux (WSL 2)&lt;/strong&gt; to run Linux containers&lt;/li&gt;
  &lt;li&gt;Use &lt;strong&gt;Hyper-V&lt;/strong&gt; to run a Linux VM that hosts containers&lt;/li&gt;
  &lt;li&gt;Use &lt;strong&gt;Windows Server&lt;/strong&gt; to run native Windows containers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;On macOS:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Docker Desktop uses a lightweight Linux VM&lt;/li&gt;
  &lt;li&gt;Containers run inside this VM&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;On Linux:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Native container support&lt;/li&gt;
  &lt;li&gt;Best performance and compatibility&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;linux-primitives-for-containers&quot;&gt;Linux Primitives for Containers&lt;/h2&gt;

&lt;p&gt;Containers are built on two fundamental Linux kernel mechanisms:&lt;/p&gt;

&lt;h3 id=&quot;1-namespaces&quot;&gt;1. Namespaces&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Purpose&lt;/strong&gt;: Provide an isolated view of global resources&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;A group of processes sees only its “slice” of a resource&lt;/li&gt;
  &lt;li&gt;Other processes cannot see or interfere with this slice&lt;/li&gt;
  &lt;li&gt;Creates the &lt;strong&gt;isolation&lt;/strong&gt; aspect of containers&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;2-control-groups-cgroups&quot;&gt;2. Control Groups (Cgroups)&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Purpose&lt;/strong&gt;: Control and limit resource usage&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Set limits on CPU, memory, I/O for process groups&lt;/li&gt;
  &lt;li&gt;Prevent resource exhaustion&lt;/li&gt;
  &lt;li&gt;Enables &lt;strong&gt;resource management&lt;/strong&gt; for containers&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;┌─────────────────────────────────────────┐
│  Namespaces + Cgroups = Container       │
│                                         │
│  Namespaces → Isolation                 │
│  Cgroups    → Resource Limits           │
└─────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;namespaces-in-detail&quot;&gt;Namespaces in Detail&lt;/h2&gt;

&lt;p&gt;Linux provides several types of namespaces to isolate different resources:&lt;/p&gt;

&lt;h3 id=&quot;types-of-namespaces&quot;&gt;Types of Namespaces&lt;/h3&gt;

&lt;h4 id=&quot;1-mount-namespace&quot;&gt;1. Mount Namespace&lt;/h4&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Isolates&lt;/strong&gt;: Filesystem mount points&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Effect&lt;/strong&gt;: Each namespace has its own view of the filesystem hierarchy&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Use&lt;/strong&gt;: Containers have their own root filesystem&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;# Container sees:&lt;/span&gt;
/
├── bin/
├── lib/
├── app/
└── ...

&lt;span class=&quot;c&quot;&gt;# Host sees different filesystem&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h4 id=&quot;2-pid-namespace&quot;&gt;2. PID Namespace&lt;/h4&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Isolates&lt;/strong&gt;: Process ID number space&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Effect&lt;/strong&gt;: First process in namespace gets PID 1&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Use&lt;/strong&gt;: Containers have their own process tree&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Host PID Namespace:
  PID 1234 → Container init process
  
Container PID Namespace:
  PID 1 → Same process (appears as init)
  PID 2 → First child process
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h4 id=&quot;3-network-namespace&quot;&gt;3. Network Namespace&lt;/h4&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Isolates&lt;/strong&gt;: Network resources (IP addresses, routing tables, ports)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Effect&lt;/strong&gt;: Each namespace has its own network stack&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Use&lt;/strong&gt;: Containers can have their own IP addresses&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;# Container can bind to port 80&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# Host can also bind to port 80&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# No conflict because different network namespaces&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h4 id=&quot;4-uts-namespace&quot;&gt;4. UTS Namespace&lt;/h4&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Isolates&lt;/strong&gt;: Hostname and domain name&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Effect&lt;/strong&gt;: Each namespace can have different hostname&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Use&lt;/strong&gt;: Containers have unique hostnames&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;5-user-namespace&quot;&gt;5. User Namespace&lt;/h4&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Isolates&lt;/strong&gt;: User and group ID number space&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Effect&lt;/strong&gt;: Process can be root in container but unprivileged on host&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Use&lt;/strong&gt;: Enhanced security (rootless containers)&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Container: UID 0 (root)
    ↓ mapped to
Host: UID 1000 (regular user)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
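
&lt;p&gt;The kernel exposes this mapping in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/proc/PID/uid_map&lt;/code&gt;, where each line holds three numbers: the first ID inside the namespace, the first ID outside it, and the length of the range. The Python sketch below translates a container UID to a host UID from such text. The sample mapping (root mapped to UID 1000, followed by a subordinate range starting at 100000) is an assumed, typical rootless setup rather than something taken from a real system.&lt;/p&gt;

```python
# Assumed sample contents of /proc/PID/uid_map for a rootless container.
SAMPLE_UID_MAP = "0 1000 1\n1 100000 65536\n"


def parse_uid_map(text):
    """Parse uid_map text into (inside_start, outside_start, length) tuples."""
    return [
        tuple(int(field) for field in line.split())
        for line in text.splitlines()
        if line.strip()
    ]


def to_host_uid(container_uid, mappings):
    """Translate a UID inside the namespace to the UID seen on the host."""
    for inside_start, outside_start, length in mappings:
        if container_uid in range(inside_start, inside_start + length):
            return outside_start + (container_uid - inside_start)
    return None  # the UID has no mapping in this namespace


if __name__ == "__main__":
    mappings = parse_uid_map(SAMPLE_UID_MAP)
    for uid in (0, 1, 33):
        print(uid, to_host_uid(uid, mappings))
```

&lt;p&gt;With this mapping, a process that is root inside the container acts on the host with the privileges of UID 1000 only.&lt;/p&gt;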

&lt;h4 id=&quot;6-ipc-namespace&quot;&gt;6. IPC Namespace&lt;/h4&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Isolates&lt;/strong&gt;: Inter-Process Communication endpoints&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Effect&lt;/strong&gt;: Separate message queues, semaphores, shared memory&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Use&lt;/strong&gt;: Prevent IPC interference between containers&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;namespace-api&quot;&gt;Namespace API&lt;/h3&gt;

&lt;p&gt;Three key system calls:&lt;/p&gt;

&lt;h4 id=&quot;1-clone&quot;&gt;1. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;clone()&lt;/code&gt;&lt;/h4&gt;
&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;// Create a new process in new PID and network namespaces.&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// stack_top points to the top of the child stack; SIGCHLD lets the parent wait for the child.&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;clone&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;child_func&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;stack_top&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;CLONE_NEWPID&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;CLONE_NEWNET&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;SIGCHLD&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;args&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;ul&gt;
  &lt;li&gt;More general version of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fork()&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;Flags specify which resources the child shares with its parent and which get a new namespace&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;2-setns&quot;&gt;2. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;setns()&lt;/code&gt;&lt;/h4&gt;
&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;// Join an existing namespace&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;setns&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;namespace_fd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;CLONE_NEWNET&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;ul&gt;
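  &lt;li&gt;Allows a process to enter an existing namespace&lt;/li&gt;
  &lt;li&gt;Useful for debugging containers (tools such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nsenter&lt;/code&gt; are built on it)&lt;/li&gt;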
&lt;/ul&gt;

&lt;h4 id=&quot;3-unshare&quot;&gt;3. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;unshare()&lt;/code&gt;&lt;/h4&gt;
&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;// Create new namespace for calling process&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;unshare&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;CLONE_NEWPID&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;CLONE_NEWNET&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;ul&gt;
  &lt;li&gt;Moves the calling process into new namespaces without creating a new process&lt;/li&gt;
  &lt;li&gt;Exception: with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CLONE_NEWPID&lt;/code&gt;, the caller stays in its current PID namespace and only the children it creates afterwards enter the new one&lt;/li&gt;
&lt;/ul&gt;
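
&lt;p&gt;The flags passed to these calls are plain bit masks, so several namespaces can be requested at once by combining them with a bitwise OR. The Python sketch below lists the flag values from the Linux header &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;linux/sched.h&lt;/code&gt; and wraps &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;unshare()&lt;/code&gt; through &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ctypes&lt;/code&gt;. The wrapper is an illustrative assumption rather than a recommended interface: it works only on Linux and usually needs root privileges, so the actual call is left commented out.&lt;/p&gt;

```python
import ctypes
import os

# Flag values from the Linux header linux/sched.h.
CLONE_NEWNS = 0x00020000    # mount namespace
CLONE_NEWUTS = 0x04000000   # hostname and domain name
CLONE_NEWIPC = 0x08000000   # IPC endpoints
CLONE_NEWUSER = 0x10000000  # user and group IDs
CLONE_NEWPID = 0x20000000   # process IDs
CLONE_NEWNET = 0x40000000   # network stack


def unshare(flags):
    """Call unshare(2) through libc; typically needs root or CAP_SYS_ADMIN."""
    libc = ctypes.CDLL(None, use_errno=True)
    if libc.unshare(flags) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


if __name__ == "__main__":
    flags = CLONE_NEWUTS | CLONE_NEWIPC
    print(hex(flags))
    # unshare(flags)  # uncomment on a Linux host with sufficient privileges
```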

&lt;h3 id=&quot;viewing-namespaces&quot;&gt;Viewing Namespaces&lt;/h3&gt;

&lt;p&gt;The namespaces of each process appear as symbolic links under &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/proc/PID/ns&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;# View namespaces for your current shell&lt;/span&gt;
&lt;span class=&quot;nb&quot;&gt;ls&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-l&lt;/span&gt; /proc/&lt;span class=&quot;nv&quot;&gt;$$&lt;/span&gt;/ns

&lt;span class=&quot;c&quot;&gt;# Output:&lt;/span&gt;
lrwxrwxrwx 1 user user 0 Jan  5 10:00 ipc -&amp;gt; &lt;span class=&quot;s1&quot;&gt;&apos;ipc:[4026531839]&apos;&lt;/span&gt;
lrwxrwxrwx 1 user user 0 Jan  5 10:00 mnt -&amp;gt; &lt;span class=&quot;s1&quot;&gt;&apos;mnt:[4026531840]&apos;&lt;/span&gt;
lrwxrwxrwx 1 user user 0 Jan  5 10:00 net -&amp;gt; &lt;span class=&quot;s1&quot;&gt;&apos;net:[4026531992]&apos;&lt;/span&gt;
lrwxrwxrwx 1 user user 0 Jan  5 10:00 pid -&amp;gt; &lt;span class=&quot;s1&quot;&gt;&apos;pid:[4026531836]&apos;&lt;/span&gt;
lrwxrwxrwx 1 user user 0 Jan  5 10:00 user -&amp;gt; &lt;span class=&quot;s1&quot;&gt;&apos;user:[4026531837]&apos;&lt;/span&gt;
lrwxrwxrwx 1 user user 0 Jan  5 10:00 uts -&amp;gt; &lt;span class=&quot;s1&quot;&gt;&apos;uts:[4026531838]&apos;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The ID (e.g., &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;[4026531839]&lt;/code&gt;) uniquely identifies the namespace. Processes with the same ID share that namespace.&lt;/p&gt;
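
&lt;p&gt;The same information can be read programmatically. The Python sketch below parses link targets of the form shown in the listing and collects them per process. The function names are hypothetical, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;namespaces_of&lt;/code&gt; works only on Linux, where &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/proc&lt;/code&gt; is available.&lt;/p&gt;

```python
import os
import re

_NS_LINK = re.compile(r"^(\w+):\[(\d+)\]$")


def parse_ns_link(target):
    """Split a link target such as 'net:[4026531992]' into (type, id)."""
    match = _NS_LINK.match(target)
    if match is None:
        raise ValueError(f"unexpected namespace link: {target!r}")
    return match.group(1), int(match.group(2))


def namespaces_of(pid="self"):
    """Return {entry name: namespace id} by reading /proc/PID/ns (Linux only)."""
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    for name in os.listdir(ns_dir):
        _, ns_id = parse_ns_link(os.readlink(os.path.join(ns_dir, name)))
        result[name] = ns_id
    return result


if __name__ == "__main__":
    if os.path.isdir("/proc/self/ns"):
        print(namespaces_of())
```

&lt;p&gt;Two processes share a network namespace exactly when the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;net&lt;/code&gt; entries returned for them are equal. Reading the entries of another user's process may require elevated privileges.&lt;/p&gt;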

&lt;h2 id=&quot;control-groups-cgroups&quot;&gt;Control Groups (Cgroups)&lt;/h2&gt;

&lt;h3 id=&quot;purpose&quot;&gt;Purpose&lt;/h3&gt;

&lt;p&gt;Cgroups allow you to:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Limit&lt;/strong&gt; resources (CPU, memory, I/O)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Prioritize&lt;/strong&gt; resource allocation&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Account&lt;/strong&gt; for resource usage&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Control&lt;/strong&gt; which CPUs/memory nodes processes can use&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;cgroup-hierarchies&quot;&gt;Cgroup Hierarchies&lt;/h3&gt;

&lt;p&gt;Resources can be organized hierarchically:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;All CPU Resources
├── CPU-Faculty (40%)
│   ├── Fac-Web (50%)
│   └── Fac-Non-Web (50%)
└── CPU-Students (60%)
    ├── Student-Web (50%)
    └── Student-Non-Web (50%)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
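
&lt;p&gt;To make the arithmetic concrete, the sketch below computes each leaf group&apos;s share of the whole machine. It assumes that every percentage in the diagram is relative to the parent group; the function name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;effective_shares&lt;/code&gt; is hypothetical.&lt;/p&gt;

```python
# Each node: (share of its parent in percent, children).
# Names mirror the diagram above.
HIERARCHY = {
    "CPU-Faculty": (40, {"Fac-Web": (50, {}), "Fac-Non-Web": (50, {})}),
    "CPU-Students": (60, {"Student-Web": (50, {}), "Student-Non-Web": (50, {})}),
}


def effective_shares(tree, parent_fraction=1.0):
    """Return {leaf name: fraction of all CPU resources}."""
    shares = {}
    for name, (percent, children) in tree.items():
        fraction = parent_fraction * percent / 100
        if children:
            shares.update(effective_shares(children, fraction))
        else:
            shares[name] = fraction
    return shares


for name, fraction in effective_shares(HIERARCHY).items():
    print(f"{name}: {fraction:.0%} of total CPU")
```

&lt;p&gt;Under that assumption, Fac-Web ends up with 50% of 40%, which is 20% of the total, while each student group gets 30%.&lt;/p&gt;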

&lt;h3 id=&quot;creating-cgroups&quot;&gt;Creating Cgroups&lt;/h3&gt;

&lt;p&gt;Cgroups are managed through a virtual filesystem, so no new system calls are needed. The paths below follow the cgroup v1 layout:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;# The cgroup filesystem is mounted at /sys/fs/cgroup/&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# Create a cgroup for limiting memory&lt;/span&gt;
&lt;span class=&quot;nb&quot;&gt;mkdir&lt;/span&gt; /sys/fs/cgroup/memory/mycontainer

&lt;span class=&quot;c&quot;&gt;# Set memory limit to 512MB&lt;/span&gt;
&lt;span class=&quot;nb&quot;&gt;echo &lt;/span&gt;536870912 &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; /sys/fs/cgroup/memory/mycontainer/memory.limit_in_bytes

&lt;span class=&quot;c&quot;&gt;# Add process to cgroup&lt;/span&gt;
&lt;span class=&quot;nb&quot;&gt;echo&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$PID&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; /sys/fs/cgroup/memory/mycontainer/tasks
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
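
&lt;p&gt;Because the interface is just files, any language can drive it. The sketch below repeats the shell steps above in Python, using the same cgroup v1 file names. The function &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;create_memory_cgroup&lt;/code&gt; and its &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;root&lt;/code&gt; parameter are assumptions made for illustration; the parameter exists so the sequence can be tried against a scratch directory without root privileges.&lt;/p&gt;

```python
import os


def create_memory_cgroup(name, limit_bytes, pid, root="/sys/fs/cgroup/memory"):
    """Create a cgroup v1 memory group, set its limit, and add one process.

    Writing under the real root requires root privileges; pass a scratch
    directory as `root` to dry-run the same sequence of writes.
    """
    group = os.path.join(root, name)
    os.makedirs(group, exist_ok=True)

    # Same files as the shell example: the limit first, then the member PID.
    with open(os.path.join(group, "memory.limit_in_bytes"), "w") as f:
        f.write(str(limit_bytes))
    with open(os.path.join(group, "tasks"), "w") as f:
        f.write(str(pid))
    return group


MIB = 1024 * 1024
LIMIT_512_MIB = 512 * MIB  # 536870912, the value used in the shell example
```

&lt;p&gt;On a host that uses cgroup v2 the directory layout and file names differ, so this sketch applies only to the v1 layout shown here.&lt;/p&gt;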

&lt;h3 id=&quot;resource-types&quot;&gt;Resource Types&lt;/h3&gt;

&lt;p&gt;Cgroups can control:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;CPU&lt;/strong&gt;: CPU time, CPU shares, CPU quotas&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Memory&lt;/strong&gt;: Memory limits, swap limits&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Block I/O&lt;/strong&gt;: I/O bandwidth limits&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Network&lt;/strong&gt;: Network priority (with tc)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Devices&lt;/strong&gt;: Device access control&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;CPUsets&lt;/strong&gt;: Which CPUs/memory nodes to use&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;how-to-create-a-container&quot;&gt;How to Create a Container&lt;/h2&gt;

&lt;p&gt;Putting it all together:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;1. Create namespaces for isolation
   ├── PID namespace (process isolation)
   ├── Mount namespace (filesystem isolation)
   ├── Network namespace (network isolation)
   └── User namespace (security)

2. Create and configure cgroups for resource limits
   ├── CPU limits
   ├── Memory limits
   └── I/O limits

3. Create root filesystem
   ├── Base OS files (minimal)
   ├── Application binaries
   ├── Required libraries
   └── Configuration files

4. Enter namespaces, mount rootfs, register in cgroups

5. Execute application or shell

→ Your application is now running in a &quot;container&quot;!
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
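
&lt;p&gt;Steps 1, 4 and 5 can be approximated with the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;unshare&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;chroot&lt;/code&gt; utilities. The sketch below only builds the command line and does not run it. The path &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/tmp/rootfs&lt;/code&gt; and the function &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;unshare_command&lt;/code&gt; are hypothetical, and cgroup setup (step 2) is left out.&lt;/p&gt;

```python
import shlex


def unshare_command(rootfs, command):
    """Build (but do not run) an unshare invocation for steps 1, 4 and 5."""
    return [
        "unshare",
        "--fork",           # fork so the child becomes PID 1 in the new PID namespace
        "--pid",            # PID namespace (process isolation)
        "--mount",          # mount namespace (filesystem isolation)
        "--net",            # network namespace (network isolation)
        "--user",           # user namespace (security)
        "--map-root-user",  # map the calling user to root inside the namespace
        "chroot", rootfs,   # switch to the container's root filesystem
        *command,
    ]


argv = unshare_command("/tmp/rootfs", ["/bin/sh"])
print(shlex.join(argv))
```

&lt;p&gt;Container frameworks perform the equivalent steps through system calls rather than by shelling out, and add image handling and networking on top.&lt;/p&gt;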

&lt;h2 id=&quot;container-frameworks&quot;&gt;Container Frameworks&lt;/h2&gt;

&lt;h3 id=&quot;why-use-frameworks&quot;&gt;Why Use Frameworks?&lt;/h3&gt;

&lt;p&gt;Creating containers manually is complex. Frameworks automate:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Namespace and cgroup configuration&lt;/li&gt;
  &lt;li&gt;Filesystem management&lt;/li&gt;
  &lt;li&gt;Network setup&lt;/li&gt;
  &lt;li&gt;Image distribution&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;lxc-linux-containers&quot;&gt;LXC (Linux Containers)&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;General-purpose&lt;/strong&gt; container framework&lt;/li&gt;
  &lt;li&gt;Provides standard OS shell interface&lt;/li&gt;
  &lt;li&gt;Acts like a lightweight VM&lt;/li&gt;
  &lt;li&gt;Uses namespaces and cgroups under the hood&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;docker&quot;&gt;Docker&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Application-focused&lt;/strong&gt; containers&lt;/li&gt;
  &lt;li&gt;Optimized to run a single application&lt;/li&gt;
  &lt;li&gt;Easy packaging and distribution&lt;/li&gt;
  &lt;li&gt;Dockerfile for reproducible builds&lt;/li&gt;
  &lt;li&gt;Docker Hub for sharing images&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;comparison&quot;&gt;Comparison&lt;/h3&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Feature&lt;/th&gt;
      &lt;th&gt;LXC&lt;/th&gt;
      &lt;th&gt;Docker&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Purpose&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;System containers&lt;/td&gt;
      &lt;td&gt;Application containers&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Interface&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Full OS environment&lt;/td&gt;
      &lt;td&gt;Single application&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Use Case&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;VM replacement&lt;/td&gt;
      &lt;td&gt;App deployment&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Ecosystem&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Smaller&lt;/td&gt;
      &lt;td&gt;Large (Docker Hub)&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;h2 id=&quot;what-containers-can-do&quot;&gt;What Containers CAN Do&lt;/h2&gt;

&lt;p&gt;✓ &lt;strong&gt;Run different Linux distributions&lt;/strong&gt; on the same host&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Ubuntu container on Red Hat host&lt;/li&gt;
  &lt;li&gt;Alpine container on Ubuntu host&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;✓ &lt;strong&gt;Run applications with different dependencies&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Python 3.9 in one container&lt;/li&gt;
  &lt;li&gt;Python 3.11 in another&lt;/li&gt;
  &lt;li&gt;Even if host has no Python installed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;✓ &lt;strong&gt;Use the host’s hardware and system calls&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Access network interfaces&lt;/li&gt;
  &lt;li&gt;Use GPUs (with proper drivers)&lt;/li&gt;
  &lt;li&gt;Access storage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;✓ &lt;strong&gt;Provide isolation and security&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Process isolation&lt;/li&gt;
  &lt;li&gt;Filesystem isolation&lt;/li&gt;
  &lt;li&gt;Network isolation&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;/h2&gt;

&lt;p&gt;Containers provide &lt;strong&gt;lightweight isolation&lt;/strong&gt; with lower overhead than VMs:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Share the same kernel&lt;/strong&gt; but have different root filesystems&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Built on Linux primitives&lt;/strong&gt;:
    &lt;ul&gt;
      &lt;li&gt;Namespaces for isolation&lt;/li&gt;
      &lt;li&gt;Cgroups for resource limits&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Frameworks like Docker and LXC&lt;/strong&gt; provide user-friendly interfaces&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Enable modern architectures&lt;/strong&gt;: microservices, serverless, cloud-native&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the next lecture, we’ll explore &lt;strong&gt;Docker&lt;/strong&gt; in depth, including how to create, manage, and deploy containerized applications.&lt;/p&gt;

&lt;h2 id=&quot;further-reading&quot;&gt;Further Reading&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://lwn.net/Articles/531114/&quot;&gt;Namespaces in operation (lwn.net)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://lwn.net/Articles/524935/&quot;&gt;Documentation/cgroups/cgroups.txt&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://linuxcontainers.org/&quot;&gt;Linux Containers (LXC)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content>
 </entry>
 
 <entry>
   <title>08-01 Virtualization Fundamentals</title>
   <link href="https://nglelinh.github.io/contents/en/chapter08/08_01_Virtualization_Fundamentals/"/>
   <updated>2021-01-01T00:00:00+00:00</updated>
   <id>https://nglelinh.github.io/service-oriented-architecture-and-cloud-computing-iuh/contents/en/chapter08/08_01_Virtualization_Fundamentals</id>
   <content type="html">&lt;p&gt;Virtualization is a foundational technology in modern computing that enables multiple operating systems and applications to run on a single physical machine. This lecture explores the core concepts, history, and types of virtualization.&lt;/p&gt;

&lt;h2 id=&quot;what-is-virtualization&quot;&gt;What is Virtualization?&lt;/h2&gt;

&lt;h3 id=&quot;general-definition&quot;&gt;General Definition&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Virtualization&lt;/strong&gt; refers to creating a virtual (rather than physical) version of computing resources. In the context of this course:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;A machine implemented in software&lt;/strong&gt;, rather than hardware&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;A self-contained environment&lt;/strong&gt; that acts like a computer&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;An abstract specification&lt;/strong&gt; for a computing device (instruction set, memory, etc.)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;common-distinction&quot;&gt;Common Distinction&lt;/h3&gt;

&lt;p&gt;There are two main categories of virtual machines:&lt;/p&gt;

&lt;h4 id=&quot;1-language-based-virtual-machines&quot;&gt;1. Language-Based Virtual Machines&lt;/h4&gt;
&lt;ul&gt;
  &lt;li&gt;Instruction set usually does not resemble any existing architecture&lt;/li&gt;
  &lt;li&gt;Examples: &lt;strong&gt;Java VM&lt;/strong&gt;, &lt;strong&gt;.NET CLR&lt;/strong&gt;, Python VM&lt;/li&gt;
  &lt;li&gt;Designed for platform independence and security&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;2-virtual-machine-monitors-vmm-or-hypervisors&quot;&gt;2. Virtual Machine Monitors (VMM) or Hypervisors&lt;/h4&gt;
&lt;ul&gt;
  &lt;li&gt;Instruction set fully or partially taken from a real architecture&lt;/li&gt;
  &lt;li&gt;Virtualizes complete hardware systems&lt;/li&gt;
  &lt;li&gt;Examples: &lt;strong&gt;VMware&lt;/strong&gt;, &lt;strong&gt;Xen&lt;/strong&gt;, &lt;strong&gt;KVM&lt;/strong&gt;, &lt;strong&gt;Hyper-V&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
  &lt;p&gt;This course focuses primarily on &lt;strong&gt;hypervisor-based virtualization&lt;/strong&gt; used in cloud computing.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2 id=&quot;history-of-virtualization&quot;&gt;History of Virtualization&lt;/h2&gt;

&lt;p&gt;The evolution of virtualization technology:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;1972 → IBM VM/370
       First VM architecture for mainframe machines
       
1997 → Virtual PC for Mac
       Connectix brings virtualization to personal computers
       
1999 → VMware Virtual Platform
       Commercial virtualization for x86 architecture
       
2003 → Xen Hypervisor
       Open-source hypervisor project launched
       
2005 → VMware Player
       Free VM player for end users
       
2007 → VirtualBox
       Cross-platform virtualization software
       
2010s → Cloud Era
       AWS, Azure, GCP built on virtualization
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;virtualization-terminology&quot;&gt;Virtualization Terminology&lt;/h2&gt;

&lt;p&gt;Understanding the key terms:&lt;/p&gt;

&lt;h3 id=&quot;guest-os-vs-host-os&quot;&gt;Guest OS vs Host OS&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Guest OS&lt;/strong&gt;: The operating system running inside the virtual machine&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Host OS&lt;/strong&gt;: The operating system running on the physical machine (for Type 2 hypervisors)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Physical Machine (PM)&lt;/strong&gt;: The actual hardware&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Virtual Machine (VM)&lt;/strong&gt;: The virtualized environment&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;type-1-hypervisor-bare-metal&quot;&gt;Type 1 Hypervisor (Bare Metal)&lt;/h3&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;┌─────────────────────────────────────┐
│     Virtual Machine 1  │  VM 2      │
│  ┌──────────────┐  ┌──────────────┐ │
│  │  Guest OS 1  │  │  Guest OS 2  │ │
│  └──────────────┘  └──────────────┘ │
├─────────────────────────────────────┤
│      Type 1 Hypervisor              │
├─────────────────────────────────────┤
│      Hardware (CPU/RAM/Disk)        │
└─────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Runs directly on hardware&lt;/strong&gt;&lt;/li&gt;
  &lt;li&gt;No need for a host operating system&lt;/li&gt;
  &lt;li&gt;Better performance and efficiency&lt;/li&gt;
  &lt;li&gt;Examples: &lt;strong&gt;VMware ESXi&lt;/strong&gt;, &lt;strong&gt;Xen&lt;/strong&gt;, &lt;strong&gt;Hyper-V&lt;/strong&gt;, &lt;strong&gt;KVM&lt;/strong&gt;&lt;/li&gt;
  &lt;li&gt;Used in: Data centers, cloud providers&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;type-2-hypervisor-hosted&quot;&gt;Type 2 Hypervisor (Hosted)&lt;/h3&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;┌─────────────────────────────────────┐
│     Virtual Machine 1  │  VM 2      │
│  ┌──────────────┐  ┌──────────────┐ │
│  │  Guest OS 1  │  │  Guest OS 2  │ │
│  └──────────────┘  └──────────────┘ │
├─────────────────────────────────────┤
│      Type 2 Hypervisor              │
├─────────────────────────────────────┤
│         Host OS                     │
├─────────────────────────────────────┤
│      Hardware (CPU/RAM/Disk)        │
└─────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Runs as an application&lt;/strong&gt; on top of a host OS&lt;/li&gt;
  &lt;li&gt;Easier to set up and use&lt;/li&gt;
  &lt;li&gt;Slightly lower performance due to host OS overhead&lt;/li&gt;
  &lt;li&gt;Examples: &lt;strong&gt;VMware Workstation&lt;/strong&gt;, &lt;strong&gt;VirtualBox&lt;/strong&gt;, &lt;strong&gt;Parallels Desktop&lt;/strong&gt;&lt;/li&gt;
  &lt;li&gt;Used in: Development, testing, desktop virtualization&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;properties-of-virtual-machines&quot;&gt;Properties of Virtual Machines&lt;/h2&gt;

&lt;p&gt;Virtual machines provide four key properties that make them valuable:&lt;/p&gt;

&lt;h3 id=&quot;1-partitioning&quot;&gt;1. Partitioning&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Run multiple operating systems&lt;/strong&gt; on one physical machine&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Divide system resources&lt;/strong&gt; between virtual machines&lt;/li&gt;
  &lt;li&gt;Each VM gets allocated CPU, memory, disk, and network resources&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Physical Server: 64 GB RAM, 16 CPU cores
├── VM1 (Web Server):    16 GB RAM, 4 cores
├── VM2 (Database):      32 GB RAM, 8 cores
└── VM3 (App Server):    16 GB RAM, 4 cores
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
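
&lt;p&gt;A quick way to sanity-check such a plan is to add up the allocations. The sketch below assumes no overcommitment, that is, the VMs together may not be given more RAM or cores than the host has. The names &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Allocation&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fits&lt;/code&gt; are made up for this example.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Allocation:
    name: str
    ram_gb: int
    cores: int


def fits(host_ram_gb, host_cores, vms):
    """True when the VMs' combined RAM and cores do not exceed the host's."""
    total_ram = sum(vm.ram_gb for vm in vms)
    total_cores = sum(vm.cores for vm in vms)
    return not (total_ram > host_ram_gb or total_cores > host_cores)


vms = [
    Allocation("VM1 (Web Server)", 16, 4),
    Allocation("VM2 (Database)", 32, 8),
    Allocation("VM3 (App Server)", 16, 4),
]
print(fits(64, 16, vms))
```

&lt;p&gt;The three VMs above use exactly 64 GB and 16 cores, so the plan fits with nothing to spare; adding a fourth VM of any size would make the check fail.&lt;/p&gt;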

&lt;h3 id=&quot;2-isolation&quot;&gt;2. Isolation&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Fault isolation&lt;/strong&gt;: If one VM crashes, others continue running&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Security isolation&lt;/strong&gt;: VMs are separated at the hardware level&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Performance isolation&lt;/strong&gt;: Resource controls prevent “noisy neighbor” problems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Benefits:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Secure multi-tenancy in cloud environments&lt;/li&gt;
  &lt;li&gt;Contain security breaches&lt;/li&gt;
  &lt;li&gt;Predictable performance&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;3-encapsulation&quot;&gt;3. Encapsulation&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Save entire VM state to files&lt;/strong&gt;&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Move and copy VMs&lt;/strong&gt; as easily as files&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Snapshot and restore&lt;/strong&gt; VM states&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Files that make up a VM:&lt;/strong&gt;&lt;/p&gt;
&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;# VM files typically include:&lt;/span&gt;
- .vmdk / .vdi  → Virtual disk files
- .vmx / .vbox  → VM configuration
- .nvram        → BIOS settings
- .vmem         → Memory snapshot
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This enables:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Easy backup and disaster recovery&lt;/li&gt;
  &lt;li&gt;VM migration between hosts&lt;/li&gt;
  &lt;li&gt;Template-based deployment&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;4-hardware-independence&quot;&gt;4. Hardware Independence&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Provision or migrate&lt;/strong&gt; any VM to any physical server&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Abstract hardware&lt;/strong&gt; from the operating system&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Standardized virtual hardware&lt;/strong&gt; regardless of physical hardware&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Advantages:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Workload mobility across different hardware&lt;/li&gt;
  &lt;li&gt;Simplified hardware upgrades&lt;/li&gt;
  &lt;li&gt;Cloud provider flexibility&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;how-virtualization-works&quot;&gt;How Virtualization Works&lt;/h2&gt;

&lt;h3 id=&quot;the-illusion&quot;&gt;The Illusion&lt;/h3&gt;

&lt;p&gt;The VM gives users an &lt;strong&gt;illusion of running on a physical machine&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Applications run normally without modification&lt;/li&gt;
  &lt;li&gt;OS believes it has direct hardware access&lt;/li&gt;
  &lt;li&gt;Users interact with VM like a real computer&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;the-reality&quot;&gt;The Reality&lt;/h3&gt;

&lt;p&gt;Behind the scenes:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Operating systems normally run in privileged mode&lt;/strong&gt;
    &lt;ul&gt;
      &lt;li&gt;Direct access to hardware&lt;/li&gt;
      &lt;li&gt;Can execute privileged instructions&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Guest OSs run in user mode&lt;/strong&gt;
    &lt;ul&gt;
      &lt;li&gt;No direct hardware access&lt;/li&gt;
      &lt;li&gt;Privileged instructions are trapped&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Most instructions execute directly&lt;/strong&gt;
    &lt;ul&gt;
      &lt;li&gt;Hardware executes them without hypervisor intervention&lt;/li&gt;
      &lt;li&gt;Provides near-native performance&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Resource management handled by hypervisor&lt;/strong&gt;
    &lt;ul&gt;
      &lt;li&gt;Memory allocation&lt;/li&gt;
      &lt;li&gt;Peripheral access&lt;/li&gt;
      &lt;li&gt;CPU scheduling&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Privileged instructions are “trapped”&lt;/strong&gt;
    &lt;ul&gt;
      &lt;li&gt;Hypervisor intercepts them&lt;/li&gt;
      &lt;li&gt;Emulates the instruction&lt;/li&gt;
      &lt;li&gt;Returns control to VM&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ol&gt;
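
&lt;p&gt;The trap-and-emulate flow above can be pictured with a toy model. This is only an illustration in Python, not how a real hypervisor is written: the instruction names and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ToyHypervisor&lt;/code&gt; class are hypothetical, and on real hardware the CPU itself decides which instructions trap.&lt;/p&gt;

```python
# Hypothetical instruction names standing in for privileged operations.
PRIVILEGED = {"HLT", "OUT", "LOAD_CR3"}


class ToyHypervisor:
    def __init__(self):
        self.trapped = []   # privileged instructions the hypervisor emulated
        self.direct = 0     # instructions the "hardware" ran without help

    def emulate(self, instruction):
        """Stand-in for emulation: record the instruction, then resume the guest."""
        self.trapped.append(instruction)

    def run_guest(self, program):
        for instruction in program:
            if instruction in PRIVILEGED:
                self.emulate(instruction)   # trap: control passes to the hypervisor
            else:
                self.direct += 1            # executes directly on the CPU


hv = ToyHypervisor()
hv.run_guest(["ADD", "MOV", "OUT", "ADD", "HLT"])
print(hv.direct, hv.trapped)
```

&lt;p&gt;In this run three instructions execute directly and two are trapped, which mirrors why performance stays close to native: only the rare privileged instructions pay the cost of hypervisor intervention.&lt;/p&gt;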

&lt;h3 id=&quot;hardware-assistance&quot;&gt;Hardware Assistance&lt;/h3&gt;

&lt;p&gt;Modern CPUs include virtualization support:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Intel VT-x&lt;/strong&gt; (Intel Virtualization Technology)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;AMD-V&lt;/strong&gt; (AMD Virtualization)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These provide:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Hardware-assisted virtualization&lt;/li&gt;
  &lt;li&gt;Improved performance&lt;/li&gt;
  &lt;li&gt;Better security isolation&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;vm-components&quot;&gt;VM Components&lt;/h2&gt;

&lt;p&gt;A virtual machine virtualizes several key components:&lt;/p&gt;

&lt;h3 id=&quot;1-cpu-virtualization&quot;&gt;1. CPU Virtualization&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;Virtual CPUs (vCPUs) mapped to physical CPUs&lt;/li&gt;
  &lt;li&gt;CPU scheduling by hypervisor&lt;/li&gt;
  &lt;li&gt;Support for multiple CPU architectures&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;2-memory-virtualization&quot;&gt;2. Memory Virtualization&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;Virtual memory presented to guest OS&lt;/li&gt;
  &lt;li&gt;Memory management by hypervisor&lt;/li&gt;
  &lt;li&gt;Techniques like memory ballooning and page sharing&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;3-network-virtualization&quot;&gt;3. Network Virtualization&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;Virtual network interfaces&lt;/li&gt;
  &lt;li&gt;Virtual switches and routers&lt;/li&gt;
  &lt;li&gt;Network isolation and bridging&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;4-disk-virtualization&quot;&gt;4. Disk Virtualization&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;Virtual disks stored as files&lt;/li&gt;
  &lt;li&gt;Thin provisioning and snapshots&lt;/li&gt;
  &lt;li&gt;Multiple disk formats (VMDK, VHD, QCOW2)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;storage-and-migration&quot;&gt;Storage and Migration&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;VM is stored as a file&lt;/strong&gt; → Easy to save and migrate&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Can be moved&lt;/strong&gt; along with apps and configuration&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Enables&lt;/strong&gt; cloud elasticity and disaster recovery&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;paravirtualization&quot;&gt;Paravirtualization&lt;/h2&gt;

&lt;p&gt;An alternative approach to full virtualization:&lt;/p&gt;

&lt;h3 id=&quot;concept&quot;&gt;Concept&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;VM OS knows about virtualization&lt;/strong&gt;&lt;/li&gt;
  &lt;li&gt;Makes specific hypercalls instead of privileged instructions&lt;/li&gt;
  &lt;li&gt;Requires modified guest OS&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;advantages&quot;&gt;Advantages&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;Better performance than full virtualization&lt;/li&gt;
  &lt;li&gt;More efficient resource usage&lt;/li&gt;
  &lt;li&gt;Lower overhead&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;example-xen&quot;&gt;Example: Xen&lt;/h3&gt;

&lt;p&gt;Xen hypervisor uses paravirtualization:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Guest OS modified to make hypercalls&lt;/li&gt;
  &lt;li&gt;Direct communication with hypervisor&lt;/li&gt;
  &lt;li&gt;Used by early AWS EC2 instances&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;modern-trend&quot;&gt;Modern Trend&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;Hardware-assisted virtualization has reduced the need for paravirtualization&lt;/li&gt;
  &lt;li&gt;Most modern systems use full virtualization with hardware support&lt;/li&gt;
  &lt;li&gt;Paravirtualization still used for specific optimizations (e.g., virtio drivers)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;comparison-type-1-vs-type-2&quot;&gt;Comparison: Type 1 vs Type 2&lt;/h2&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Aspect&lt;/th&gt;
      &lt;th&gt;Type 1 (Bare Metal)&lt;/th&gt;
      &lt;th&gt;Type 2 (Hosted)&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Installation&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Directly on hardware&lt;/td&gt;
      &lt;td&gt;On top of host OS&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Performance&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Higher (direct hardware access)&lt;/td&gt;
      &lt;td&gt;Lower (host OS overhead)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Use Case&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Production servers, cloud&lt;/td&gt;
      &lt;td&gt;Development, testing&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Examples&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;ESXi, Xen, Hyper-V&lt;/td&gt;
      &lt;td&gt;VirtualBox, VMware Workstation&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Management&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;More complex&lt;/td&gt;
      &lt;td&gt;Easier to use&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Cost&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Often enterprise licensing&lt;/td&gt;
      &lt;td&gt;Often free or low-cost&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;h2 id=&quot;real-world-applications&quot;&gt;Real-World Applications&lt;/h2&gt;

&lt;h3 id=&quot;cloud-computing&quot;&gt;Cloud Computing&lt;/h3&gt;

&lt;p&gt;All major cloud providers use virtualization:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;AWS&lt;/strong&gt;: Xen, KVM (Nitro), bare metal options&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Azure&lt;/strong&gt;: Hyper-V&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Google Cloud&lt;/strong&gt;: KVM&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;enterprise-data-centers&quot;&gt;Enterprise Data Centers&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;Server consolidation (reduce physical servers)&lt;/li&gt;
  &lt;li&gt;Disaster recovery and business continuity&lt;/li&gt;
  &lt;li&gt;Test and development environments&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;desktop-virtualization-vdi&quot;&gt;Desktop Virtualization (VDI)&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;Virtual desktops for remote workers&lt;/li&gt;
  &lt;li&gt;Centralized management&lt;/li&gt;
  &lt;li&gt;Enhanced security&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;/h2&gt;

&lt;p&gt;Virtualization is a cornerstone technology that:&lt;/p&gt;

&lt;p&gt;✓ &lt;strong&gt;Enables multiple VMs&lt;/strong&gt; on a single physical machine&lt;br /&gt;
✓ &lt;strong&gt;Provides isolation&lt;/strong&gt; for security and fault tolerance&lt;br /&gt;
✓ &lt;strong&gt;Allows easy migration&lt;/strong&gt; and backup through encapsulation&lt;br /&gt;
✓ &lt;strong&gt;Abstracts hardware&lt;/strong&gt; for flexibility and portability&lt;br /&gt;
✓ &lt;strong&gt;Powers modern cloud computing&lt;/strong&gt; platforms&lt;/p&gt;

&lt;p&gt;In the next lecture, we’ll dive deeper into &lt;strong&gt;how virtualization works internally&lt;/strong&gt;, including binary translation and dynamic translation techniques.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>07 Introduction to NoSQL and Distributed Databases</title>
   <link href="https://nglelinh.github.io/contents/en/chapter07/07_Introduction/"/>
   <updated>2021-01-01T00:00:00+00:00</updated>
   <id>https://nglelinh.github.io/service-oriented-architecture-and-cloud-computing-iuh/contents/en/chapter07/07_Introduction</id>
   <content type="html">&lt;p&gt;This chapter moves beyond traditional Relational Database Management Systems (RDBMS) to explore the world of NoSQL and Distributed Databases, designed to handle the scale, velocity, and variety of modern big data.&lt;/p&gt;

&lt;h2 id=&quot;learning-objectives&quot;&gt;Learning Objectives&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;Explain the limitations of monolithic RDBMS in distributed environments&lt;/li&gt;
  &lt;li&gt;Understand the CAP Theorem and consistency trade-offs (BASE vs ACID)&lt;/li&gt;
  &lt;li&gt;Explore the four main types of NoSQL databases: Key-Value, Document, Column-family, and Graph&lt;/li&gt;
  &lt;li&gt;Analyze distributed database concepts: Sharding, Replication, and Consistent Hashing&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;the-data-evolution&quot;&gt;The Data Evolution&lt;/h2&gt;

&lt;p&gt;As applications scaled to millions of users, the vertical scaling limits of SQL databases became a bottleneck. This led to the emergence of distributed databases that sacrifice some consistency guarantees for partition tolerance and high availability.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>07-01 NoSQL and Distributed Databases</title>
   <link href="https://nglelinh.github.io/contents/en/chapter07/07_01_NoSQL_and_Distributed_Databases/"/>
   <updated>2021-01-01T00:00:00+00:00</updated>
   <id>https://nglelinh.github.io/service-oriented-architecture-and-cloud-computing-iuh/contents/en/chapter07/07_01_NoSQL_and_Distributed_Databases</id>
   <content type="html">&lt;p&gt;As applications scale, traditional relational databases (RDBMS) often become bottlenecks. This lecture explores Distributed Databases and NoSQL systems designed for scale and flexibility.&lt;/p&gt;

&lt;h2 id=&quot;distributed-databases-ddb&quot;&gt;Distributed Databases (DDB)&lt;/h2&gt;

&lt;p&gt;A distributed database is a collection of multiple, logically interconnected databases that are physically distributed over a computer network.&lt;/p&gt;

&lt;h3 id=&quot;architectures&quot;&gt;Architectures&lt;/h3&gt;
&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Homogeneous&lt;/strong&gt;: Same DBMS (e.g., all Oracle) and schema at all sites. Easier to manage.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Heterogeneous&lt;/strong&gt;: Different DBMSs or schemas. Requires middleware/integration.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;distribution-strategies&quot;&gt;Distribution Strategies&lt;/h3&gt;

&lt;h4 id=&quot;1-replication&quot;&gt;1. Replication&lt;/h4&gt;
&lt;p&gt;Storing multiple copies of data (replicas) at different sites.&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Pros&lt;/strong&gt;: High availability (fault tolerance), fast local reads.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Cons&lt;/strong&gt;: Write performance suffers (must update all copies), consistency challenges.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;2-fragmentation-partitioning&quot;&gt;2. Fragmentation (Partitioning)&lt;/h4&gt;
&lt;p&gt;Splitting the database into fragments.&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Horizontal Fragmentation (Sharding)&lt;/strong&gt;: Splitting a table by rows (e.g., customers with IDs 1-1000 in NY, 1001-2000 in London).&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Vertical Fragmentation&lt;/strong&gt;: Splitting a table by columns (e.g., employee name and address in table A, salary and benefits in table B).&lt;/li&gt;
&lt;/ul&gt;
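
&lt;p&gt;Both kinds of fragmentation can be sketched in a few lines of Python. The ID ranges and site names come from the examples above; the function names, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;id&lt;/code&gt; key and the exact column names are assumptions made for illustration.&lt;/p&gt;

```python
# Horizontal fragmentation: route a row to a site by customer ID range.
SHARDS = [
    (range(1, 1001), "NY"),
    (range(1001, 2001), "London"),
]


def shard_for(customer_id):
    for id_range, site in SHARDS:
        if customer_id in id_range:
            return site
    raise KeyError(f"no shard holds customer {customer_id}")


# Vertical fragmentation: split one record's columns across two tables,
# repeating the key so the fragments can be joined again.
def split_employee(row):
    table_a = {k: row[k] for k in ("id", "name", "address")}
    table_b = {k: row[k] for k in ("id", "salary", "benefits")}
    return table_a, table_b


print(shard_for(1000), shard_for(1001))
```

&lt;p&gt;Range-based routing is easy to reason about but can leave one site much busier than another, which is one motivation for hash-based placement.&lt;/p&gt;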

&lt;h3 id=&quot;consistent-hashing&quot;&gt;Consistent Hashing&lt;/h3&gt;
&lt;p&gt;A technique to distribute data across nodes in a way that minimizes reorganization when nodes are added/removed. Used in DynamoDB, Cassandra, etc.&lt;/p&gt;
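
&lt;p&gt;A minimal hash ring in Python illustrates the idea. This is a sketch under stated assumptions, not the scheme any particular database uses: SHA-256 as the hash, 100 virtual points per node, and the names &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;HashRing&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;node-a&lt;/code&gt; to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;node-c&lt;/code&gt; are all choices made for this example.&lt;/p&gt;

```python
import bisect
import hashlib


def _hash(value):
    """Map a string to a point on the ring (a large integer)."""
    return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes   # virtual points per node, to even out the load
        self._points = []      # sorted ring positions
        self._owner = {}       # maps a ring position to its node name
        for node in nodes:
            self.add(node)

    def add(self, node):
        for i in range(self.vnodes):
            point = _hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def remove(self, node):
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def node_for(self, key):
        """Walk clockwise from the key's position to the next node point."""
        if not self._points:
            raise LookupError("ring is empty")
        index = bisect.bisect_right(self._points, _hash(key))
        return self._owner[self._points[index % len(self._points)]]


ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"user:{i}" for i in range(1000)]
before = {k: ring.node_for(k) for k in keys}
ring.remove("node-b")
moved = [k for k in keys if ring.node_for(k) != before[k]]
print(f"{len(moved)} of {len(keys)} keys moved")
```

&lt;p&gt;Only the keys that lived on the removed node are reassigned; keys owned by the other two nodes stay where they were. With a plain modulo scheme, by contrast, changing the node count would typically remap most keys.&lt;/p&gt;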

&lt;h2 id=&quot;nosql-databases&quot;&gt;NoSQL Databases&lt;/h2&gt;

&lt;p&gt;“NoSQL” is commonly read as “Not Only SQL”. These systems emerged to handle:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Big Data&lt;/strong&gt;: Petabytes of data.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Velocity&lt;/strong&gt;: High read/write throughput.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Variety&lt;/strong&gt;: Unstructured or semi-structured data (JSON, XML).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;cap-theorem&quot;&gt;CAP Theorem&lt;/h3&gt;
&lt;p&gt;A distributed system can provide at most two of the following three guarantees at the same time:&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Consistency&lt;/strong&gt;: Every read receives the most recent write or an error.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Availability&lt;/strong&gt;: Every request receives a (non-error) response, without the guarantee that it contains the most recent write.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Partition Tolerance&lt;/strong&gt;: The system continues to operate despite an arbitrary number of messages being dropped or delayed by the network.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Trade-offs:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;CA (RDBMS)&lt;/strong&gt;: Consistent and Available, but only while no network partition occurs, which limits how far the system can be distributed.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;CP (MongoDB, HBase)&lt;/strong&gt;: Consistent and Partition Tolerant. If a partition occurs, some nodes may reject writes to ensure consistency.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;AP (Cassandra, DynamoDB)&lt;/strong&gt;: Available and Partition Tolerant. Reads always succeed but might return stale data (Eventual Consistency).&lt;/li&gt;
&lt;/ul&gt;
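&lt;p&gt;The difference between CP and AP during a partition can be sketched with a toy replica. The class, the mode labels, and the error message are hypothetical and do not model any real database. The CP replica refuses a write it cannot coordinate, while the AP replica accepts it and leaves its peer stale.&lt;/p&gt;

```python
class Replica:
    """One replica of a single register, cut off from its peer when partitioned."""

    def __init__(self, mode):
        self.mode = mode            # "CP" or "AP" (labels used only in this sketch)
        self.value = "v1"
        self.partitioned = False

    def write(self, value):
        if self.partitioned and self.mode == "CP":
            # CP: refuse the write rather than diverge from the unreachable peer.
            raise RuntimeError("unavailable: cannot reach the other replica")
        self.value = value          # AP: accept locally, reconcile later
        return "ok"

    def read(self):
        return self.value


def demo(mode):
    left, right = Replica(mode), Replica(mode)
    left.partitioned = right.partitioned = True
    try:
        outcome = left.write("v2")
    except RuntimeError:
        outcome = "rejected"
    return outcome, left.read(), right.read()


print(demo("CP"))  # ('rejected', 'v1', 'v1')
print(demo("AP"))  # ('ok', 'v2', 'v1')
```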

&lt;h3 id=&quot;base-model&quot;&gt;BASE Model&lt;/h3&gt;
&lt;p&gt;A softer consistency model for NoSQL systems:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Basically Available&lt;/strong&gt;: The system keeps responding to requests, even if some responses are degraded or stale.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Soft state&lt;/strong&gt;: State may change over time, even without input.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Eventually consistent&lt;/strong&gt;: System will become consistent over time.&lt;/li&gt;
&lt;/ul&gt;
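&lt;p&gt;Eventual consistency can be illustrated with a last-write-wins merge, one of several possible reconciliation strategies. The replica names, timestamps, and the single simplified synchronization round below are assumptions made for this sketch.&lt;/p&gt;

```python
def merge(a, b):
    """Last-write-wins: keep the (timestamp, value) pair with the newer timestamp."""
    return max(a, b, key=lambda pair: pair[0])


def anti_entropy_round(state):
    """Every replica adopts the newest version seen anywhere (one simplified round)."""
    versions = list(state.values())
    newest = versions[0]
    for version in versions[1:]:
        newest = merge(newest, version)
    return {name: newest for name in state}


replicas = {"r1": (1, "draft"), "r2": (1, "draft"), "r3": (1, "draft")}
replicas["r1"] = (2, "published")      # a write lands on r1 only

stale_read = replicas["r3"][1]         # soft state: r3 has not seen the write yet
replicas = anti_entropy_round(replicas)
converged_read = replicas["r3"][1]     # after synchronization all replicas agree
print(stale_read, converged_read)      # draft published
```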

&lt;h3 id=&quot;types-of-nosql-databases&quot;&gt;Types of NoSQL Databases&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Key-Value Stores&lt;/strong&gt;
    &lt;ul&gt;
      &lt;li&gt;&lt;strong&gt;Model&lt;/strong&gt;: Simple Map (Key -&amp;gt; Value).&lt;/li&gt;
      &lt;li&gt;&lt;strong&gt;Use cases&lt;/strong&gt;: Caching, Session storage.&lt;/li&gt;
      &lt;li&gt;&lt;strong&gt;Examples&lt;/strong&gt;: Redis, Amazon DynamoDB, Riak.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Document Stores&lt;/strong&gt;
    &lt;ul&gt;
      &lt;li&gt;&lt;strong&gt;Model&lt;/strong&gt;: Store data as documents (JSON/BSON).&lt;/li&gt;
      &lt;li&gt;&lt;strong&gt;Use cases&lt;/strong&gt;: Content management, catalogs.&lt;/li&gt;
      &lt;li&gt;&lt;strong&gt;Examples&lt;/strong&gt;: MongoDB, CouchDB.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Column-Family Stores&lt;/strong&gt;
    &lt;ul&gt;
      &lt;li&gt;&lt;strong&gt;Model&lt;/strong&gt;: Rows are identified by a key, and their columns are grouped into column families that are stored together. Optimized for high write throughput and wide, sparse rows.&lt;/li&gt;
      &lt;li&gt;&lt;strong&gt;Use cases&lt;/strong&gt;: Time-series data, Big Data analytics.&lt;/li&gt;
      &lt;li&gt;&lt;strong&gt;Examples&lt;/strong&gt;: Apache Cassandra, HBase.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Graph Databases&lt;/strong&gt;
    &lt;ul&gt;
      &lt;li&gt;&lt;strong&gt;Model&lt;/strong&gt;: Nodes and Edges.&lt;/li&gt;
      &lt;li&gt;&lt;strong&gt;Use cases&lt;/strong&gt;: Social networks, Recommendation engines.&lt;/li&gt;
      &lt;li&gt;&lt;strong&gt;Examples&lt;/strong&gt;: Neo4j.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ol&gt;
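&lt;p&gt;To make the four models concrete, the sketch below shapes the same small fact (“Alice follows Bob”) for each of them using plain Python structures. All keys, field names, and values are illustrative; real systems add their own storage formats and query languages.&lt;/p&gt;

```python
import json

# Key-value: an opaque value looked up by key.
key_value = {"user:1": '{"name": "Alice"}', "user:2": '{"name": "Bob"}'}

# Document: a nested, self-describing record.
document = {"_id": 1, "name": "Alice", "follows": [{"_id": 2, "name": "Bob"}]}

# Column-family: row key, then column family, then column, then value.
column_family = {
    "user:1": {"profile": {"name": "Alice"}, "follows": {"user:2": "2021-01-01"}},
    "user:2": {"profile": {"name": "Bob"}, "follows": {}},
}

# Graph: nodes plus labeled edges.
nodes = {1: {"name": "Alice"}, 2: {"name": "Bob"}}
edges = [(1, "FOLLOWS", 2)]


def followed_names(user_id):
    """Traverse outgoing FOLLOWS edges and return the names of the targets."""
    return [nodes[dst]["name"] for src, label, dst in edges
            if src == user_id and label == "FOLLOWS"]


print(json.loads(key_value["user:1"])["name"])  # Alice
print(document["follows"][0]["name"])           # Bob
print(list(column_family["user:1"]["follows"])) # ['user:2']
print(followed_names(1))                        # ['Bob']
```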

&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;/h2&gt;

&lt;p&gt;Many distributed databases and NoSQL systems relax strict ACID guarantees (in particular strong consistency) in exchange for high availability, partition tolerance, and horizontal scalability. Understanding the CAP theorem and the specific data model (Key-Value, Document, Column-Family, Graph) is crucial for choosing the right tool for the job.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>06 Introduction to Spark Ecosystem</title>
   <link href="https://nglelinh.github.io/contents/en/chapter06/06_Introduction/"/>
   <updated>2021-01-01T00:00:00+00:00</updated>
   <id>https://nglelinh.github.io/service-oriented-architecture-and-cloud-computing-iuh/contents/en/chapter06/06_Introduction</id>
   <content type="html">&lt;p&gt;This chapter moves beyond the core Spark engine to explore the powerful libraries built on top of it: Spark SQL for structured data, Spark Streaming for real-time processing, and MLlib for machine learning.&lt;/p&gt;

&lt;h2 id=&quot;learning-objectives&quot;&gt;Learning Objectives&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Spark SQL&lt;/strong&gt;: Query structured data using SQL and DataFrames&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Spark Streaming&lt;/strong&gt;: Process real-time data streams with fault tolerance&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;MLlib&lt;/strong&gt;: Build and deploy scalable machine learning models&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;GraphX&lt;/strong&gt;: (Overview) Analyze graph-structured data&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;the-unified-engine&quot;&gt;The Unified Engine&lt;/h2&gt;

&lt;p&gt;The true power of Spark lies in its unified stack. You can load data using Spark SQL, train a model using MLlib, and apply that model to a real-time stream using Spark Streaming, all within the same application.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>06-04 Scalable Machine Learning with MLlib</title>
   <link href="https://nglelinh.github.io/contents/en/chapter06/06_04_Spark_MLlib/"/>
   <updated>2021-01-01T00:00:00+00:00</updated>
   <id>https://nglelinh.github.io/service-oriented-architecture-and-cloud-computing-iuh/contents/en/chapter06/06_04_Spark_MLlib</id>
   <content type="html">&lt;p&gt;Apache Spark MLlib is a scalable machine learning library that brings high-performance ML algorithms to distributed computing.&lt;/p&gt;

&lt;h2 id=&quot;why-distributed-ml&quot;&gt;Why Distributed ML?&lt;/h2&gt;

&lt;p&gt;Traditional ML libraries (like scikit-learn) run on a single machine. When your dataset becomes too large to fit in memory (TB or PB scale), you need a distributed solution. MLlib is designed to scale horizontally.&lt;/p&gt;

&lt;h2 id=&quot;key-capabilities&quot;&gt;Key Capabilities&lt;/h2&gt;

&lt;p&gt;MLlib covers the standard ML workflow primitives:&lt;/p&gt;

&lt;h3 id=&quot;1-classification--regression&quot;&gt;1. Classification &amp;amp; Regression&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Classification&lt;/strong&gt;: Logistic Regression, Naive Bayes, Decision Trees, Random Forests.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Regression&lt;/strong&gt;: Linear Regression, Generalized Linear Regression (GLM).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;2-clustering&quot;&gt;2. Clustering&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;K-Means&lt;/strong&gt;: Partitioning data into K distinct clusters.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;LDA (Latent Dirichlet Allocation)&lt;/strong&gt;: Topic modeling.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Gaussian Mixture Models&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
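&lt;p&gt;The K-Means idea can be sketched on one-dimensional data in plain Python. This is a single-machine illustration with made-up points and starting centers, not the distributed MLlib implementation: points are assigned to the nearest center, then each center moves to the mean of its points.&lt;/p&gt;

```python
def kmeans_1d(points, centers, iterations=10):
    """Lloyd's algorithm on 1-D data: assign to nearest center, then recompute means."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous center.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters


centers, clusters = kmeans_1d([1.0, 1.2, 0.8, 9.0, 9.5, 10.1], centers=[0.0, 5.0])
print(centers)
print(clusters)
```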

&lt;h3 id=&quot;3-collaborative-filtering&quot;&gt;3. Collaborative Filtering&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;ALS (Alternating Least Squares)&lt;/strong&gt;: Used for recommendation systems (e.g., “Users who bought X also bought Y”).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;4-dimensionality-reduction&quot;&gt;4. Dimensionality Reduction&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;PCA (Principal Component Analysis)&lt;/strong&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;SVD (Singular Value Decomposition)&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;ml-pipelines&quot;&gt;ML Pipelines&lt;/h2&gt;

&lt;p&gt;Inspired by scikit-learn, MLlib provides a &lt;strong&gt;Pipeline API&lt;/strong&gt; to create consistent workflows. A pipeline chains together &lt;strong&gt;Transformers&lt;/strong&gt; and &lt;strong&gt;Estimators&lt;/strong&gt;.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pyspark.ml&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Pipeline&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pyspark.ml.classification&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;LogisticRegression&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pyspark.ml.feature&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;HashingTF&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Tokenizer&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# 1. Prepare data (assumes an active SparkSession named spark)
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;training&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;spark&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;createDataFrame&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;a b c d e spark&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;1.0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;b d&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;spark f g h&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;1.0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;hadoop mapreduce&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;text&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;label&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# 2. Configure an ML pipeline
# Tokenizer: Split text into words (Transformer)
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tokenizer&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Tokenizer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;inputCol&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;text&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;outputCol&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;words&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# HashingTF: Convert words to feature vectors (Transformer)
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;hashingTF&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;HashingTF&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;inputCol&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tokenizer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;getOutputCol&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;outputCol&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;features&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# LogisticRegression: The learning algorithm (Estimator)
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lr&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;LogisticRegression&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;maxIter&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;regParam&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;0.001&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Pipeline: Chain stages together
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pipeline&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Pipeline&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;stages&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tokenizer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;hashingTF&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# 3. Train model
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;model&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pipeline&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;fit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;training&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

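&lt;span class=&quot;c1&quot;&gt;# 4. Make predictions on unlabeled test documents (illustrative sample rows)
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;test_data&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;spark&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;createDataFrame&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;spark i j k&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;l m n&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;text&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;prediction&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;transform&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;test_data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;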
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
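&lt;p&gt;The two feature stages in the pipeline can be approximated in plain Python to show what they compute. This is a rough sketch under stated assumptions: it uses MD5 and 16 buckets for readability, whereas Spark’s HashingTF uses MurmurHash3 and a much larger default feature space, so the resulting indices differ from Spark’s. The function names are hypothetical.&lt;/p&gt;

```python
import hashlib


def tokenize(text):
    """Lowercase and split on whitespace, like a basic tokenizer."""
    return text.lower().split()


def hashing_tf(words, num_features=16):
    """Term-frequency vector built with the hashing trick (no vocabulary needed)."""
    vector = [0] * num_features
    for word in words:
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        vector[int(digest, 16) % num_features] += 1
    return vector


features = hashing_tf(tokenize("a b c d e spark"))
print(features)
print(sum(features))  # 6, one count per token
```

&lt;p&gt;Because the index comes from a hash rather than a learned vocabulary, the stage needs no fitting step, which is why it is a Transformer and not an Estimator. The cost is that distinct words can collide in the same bucket.&lt;/p&gt;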

&lt;h2 id=&quot;rdd-vs-dataframe-api&quot;&gt;RDD vs DataFrame API&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;RDD API (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pyspark.mllib&lt;/code&gt;)&lt;/strong&gt;: The original API. Now in maintenance mode.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;DataFrame API (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pyspark.ml&lt;/code&gt;)&lt;/strong&gt;: The modern, primary API. It provides a more uniform set of APIs and leverages Spark SQL optimizations. &lt;strong&gt;Use this one.&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;best-practices&quot;&gt;Best Practices&lt;/h2&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Data Quality&lt;/strong&gt;: “Garbage In, Garbage Out”. Use Spark SQL to clean your data before feeding it to MLlib.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Overfitting&lt;/strong&gt;: Watch for models that learn the “noise” in your training data instead of the underlying pattern. Use cross-validation to detect it.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Feature Engineering&lt;/strong&gt;: Transforming raw data into meaningful features is often more important than the choice of algorithm.&lt;/li&gt;
&lt;/ol&gt;
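&lt;p&gt;Cross-validation rests on a simple splitting scheme, sketched here in plain Python. The helper name and the round-robin fold assignment are choices made for this illustration; in MLlib the same idea is available through &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CrossValidator&lt;/code&gt; in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pyspark.ml.tuning&lt;/code&gt;.&lt;/p&gt;

```python
def k_fold_indices(n_rows, k):
    """Yield (train, validation) index lists for k roughly equal folds."""
    folds = [list(range(i, n_rows, k)) for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield sorted(train), validation


splits = list(k_fold_indices(n_rows=6, k=3))
for train, validation in splits:
    print(train, validation)
# [1, 2, 4, 5] [0, 3]
# [0, 2, 3, 5] [1, 4]
# [0, 1, 3, 4] [2, 5]
```

&lt;p&gt;Each row is used for validation exactly once, so a model that only memorized its training rows shows a clear gap between training and validation scores.&lt;/p&gt;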

&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;/h2&gt;

&lt;p&gt;MLlib democratizes large-scale machine learning, allowing data engineers and data scientists to build complex models on massive datasets using familiar APIs.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>06-03 Spark SQL and DataFrames</title>
   <link href="https://nglelinh.github.io/contents/en/chapter06/06_03_Spark_SQL/"/>
   <updated>2021-01-01T00:00:00+00:00</updated>
   <id>https://nglelinh.github.io/service-oriented-architecture-and-cloud-computing-iuh/contents/en/chapter06/06_03_Spark_SQL</id>
   <content type="html">&lt;p&gt;Spark SQL is one of the most widely used modules in Apache Spark, bridging the gap between relational data processing and functional programming. It allows you to run SQL queries on distributed data and provides the DataFrame API.&lt;/p&gt;

&lt;h2 id=&quot;traditional-rdbms-vs-spark-sql&quot;&gt;Traditional RDBMS vs. Spark SQL&lt;/h2&gt;

&lt;h3 id=&quot;the-limits-of-rdbms&quot;&gt;The Limits of RDBMS&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Scaling&lt;/strong&gt;: Traditional RDBMS (PostgreSQL, MySQL, Oracle) are designed for vertical scaling. Distributing them is complex (sharding).&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Data Types&lt;/strong&gt;: Optimized for structured data (integers, strings). Struggle with semi-structured (JSON) or unstructured data.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Processing&lt;/strong&gt;: Query optimization happens “under the hood” as a black box that application code cannot inspect or extend.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;the-spark-sql-approach&quot;&gt;The Spark SQL Approach&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Scaling&lt;/strong&gt;: Horizontally scalable across thousands of nodes.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Flexibility&lt;/strong&gt;: Handles structured data with an explicit schema (e.g., Parquet) and semi-structured data (e.g., JSON) natively.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Integration&lt;/strong&gt;: Mix SQL queries with complex code (Java/Scala/Python) in the same application.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;core-concepts&quot;&gt;Core Concepts&lt;/h2&gt;

&lt;h3 id=&quot;1-dataframes&quot;&gt;1. DataFrames&lt;/h3&gt;
&lt;p&gt;A &lt;strong&gt;DataFrame&lt;/strong&gt; is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R or pandas, but with richer optimizations under the hood.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pyspark.sql&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;SparkSession&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;spark&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;SparkSession&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;builder&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;appName&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;SparkSQLExample&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;getOrCreate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Create DataFrame from JSON
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;spark&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;read&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;json&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;people.json&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Show schema
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;printSchema&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# root
#  |-- age: long (nullable = true)
#  |-- name: string (nullable = true)
&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# DSL Operations
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;select&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;show&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;filter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;age&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;21&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;show&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;2-running-sql-queries&quot;&gt;2. Running SQL Queries&lt;/h3&gt;
&lt;p&gt;You can register a DataFrame as a temporary view and run standard SQL queries against it.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;createOrReplaceTempView&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;people&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;sqlDF&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;spark&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;sql&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;SELECT * FROM people WHERE age BETWEEN 13 AND 19&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;sqlDF&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;show&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;3-interoperability&quot;&gt;3. Interoperability&lt;/h3&gt;
&lt;p&gt;Spark SQL supports a wide variety of data sources:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Parquet/ORC&lt;/strong&gt;: Optimized columnar storage formats.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;JSON/CSV&lt;/strong&gt;: Common text formats.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Hive&lt;/strong&gt;: Access existing Hive warehouses.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;JDBC/ODBC&lt;/strong&gt;: Connect to external databases (MySQL, PostgreSQL).&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;the-catalyst-optimizer&quot;&gt;The Catalyst Optimizer&lt;/h2&gt;

&lt;p&gt;The secret sauce of Spark SQL is the &lt;strong&gt;Catalyst Optimizer&lt;/strong&gt;. It leverages advanced functional programming features (like pattern matching in Scala) to build an extensible query optimizer.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Analysis&lt;/strong&gt;: Resolves table and column names.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Logical Optimization&lt;/strong&gt;: Applies standard rules (predicate pushdown, constant folding, projection pruning).&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Physical Planning&lt;/strong&gt;: Generates multiple physical plans and selects the most efficient one (Cost-Based Optimization).&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Code Generation&lt;/strong&gt;: Generates efficient Java bytecode to execute the query (Project Tungsten).&lt;/li&gt;
&lt;/ol&gt;
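&lt;p&gt;Constant folding, one of the logical optimization rules listed above, can be sketched as a recursive rewrite over a small expression tree. Catalyst itself is written in Scala and works on its own tree classes; the tuple-based tree, the operator table, and the function name below are assumptions made for this Python illustration.&lt;/p&gt;

```python
import operator

OPS = {"+": operator.add, "*": operator.mul}


def fold_constants(expr):
    """Recursively replace operator nodes whose operands are all literals."""
    if not isinstance(expr, tuple):
        return expr                      # literal number or column name
    op, left, right = expr
    left, right = fold_constants(left), fold_constants(right)
    if isinstance(left, (int, float)) and isinstance(right, (int, float)):
        return OPS[op](left, right)      # both sides known at planning time
    return (op, left, right)


# price * (60 * 60): the literal product can be computed once, before execution.
plan = ("*", "price", ("*", 60, 60))
print(fold_constants(plan))  # ('*', 'price', 3600)
```

&lt;p&gt;The rewritten plan returns the same result for every row but does less work per row, which is the general goal of rule-based logical optimization.&lt;/p&gt;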

&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;/h2&gt;

&lt;p&gt;Spark SQL provides the best of both worlds: the ease of use of SQL and the power/scalability of distributed processing. It is the foundation for modern Data Lakehouse architectures.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>06 Spark Streaming - Real-time Data Processing</title>
   <link href="https://nglelinh.github.io/contents/en/chapter06/06_02_Spark_Streaming/"/>
   <updated>2021-01-01T00:00:00+00:00</updated>
   <id>https://nglelinh.github.io/service-oriented-architecture-and-cloud-computing-iuh/contents/en/chapter06/06_02_Spark_Streaming</id>
   <content type="html">&lt;p&gt;Spark Streaming is Apache Spark’s scalable and fault-tolerant stream processing engine that enables processing of live data streams. It extends Spark’s core capabilities to handle real-time data ingestion and processing, making it possible to build end-to-end streaming applications.&lt;/p&gt;

&lt;h2 id=&quot;introduction-to-stream-processing&quot;&gt;Introduction to Stream Processing&lt;/h2&gt;

&lt;h3 id=&quot;batch-vs-stream-processing&quot;&gt;Batch vs Stream Processing&lt;/h3&gt;
&lt;p&gt;Traditional batch processing works with finite datasets, while stream processing handles continuous, unbounded data streams in real time.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Batch Processing:
┌─────────────────────────────────────────────────────────────┐
│                    Static Dataset                            │
├─────────────────────────────────────────────────────────────┤
│ Input → Process → Output                                    │
│                                                             │
│ • Fixed size data                                           │
│ • High latency (minutes to hours)                           │
│ • High throughput                                           │
│ • Complete results                                          │
└─────────────────────────────────────────────────────────────┘

Stream Processing:
┌─────────────────────────────────────────────────────────────┐
│                 Continuous Data Stream                       │
├─────────────────────────────────────────────────────────────┤
│ Input Stream → Real-time Process → Output Stream           │
│                                                             │
│ • Unbounded data                                            │
│ • Low latency (milliseconds to seconds)                     │
│ • Variable throughput                                       │
│ • Incremental results                                       │
└─────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
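&lt;p&gt;The contrast can also be sketched in plain Python. This is a simplified illustration with made-up event values: the micro-batches here are cut by event count, whereas Spark Streaming cuts them by time interval. The function names are hypothetical.&lt;/p&gt;

```python
def batch_total(dataset):
    """Batch: the full, finite dataset is available before processing starts."""
    return sum(dataset)


def micro_batches(stream, batch_size):
    """Group an (in principle unbounded) iterator into small lists."""
    batch = []
    for event in stream:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch                      # flush the final partial batch


def running_totals(stream, batch_size):
    """Stream: emit an updated total after every micro-batch."""
    total = 0
    for batch in micro_batches(stream, batch_size):
        total += sum(batch)
        yield total


events = [5, 3, 8, 1, 7]
print(batch_total(events))                      # 24
print(list(running_totals(iter(events), 2)))    # [8, 17, 24]
```

&lt;p&gt;The batch function produces one complete result at the end, while the streaming function produces incremental results that converge to the same total.&lt;/p&gt;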

&lt;h3 id=&quot;real-world-streaming-use-cases&quot;&gt;Real-world Streaming Use Cases&lt;/h3&gt;
&lt;div class=&quot;language-javascript highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;// Common streaming applications&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;streamingUseCases&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// Financial Services&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;fraudDetection&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;input&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;Credit card transactions&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;processing&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;Real-time anomaly detection&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;Fraud alerts within 100ms&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;impact&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;Prevent financial losses&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
  
  &lt;span class=&quot;c1&quot;&gt;// E-commerce&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;recommendationEngine&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;input&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;User clicks and purchases&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;processing&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;Real-time ML inference&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;Personalized recommendations&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;impact&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;Increase conversion rates&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
  
  &lt;span class=&quot;c1&quot;&gt;// IoT and Manufacturing&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;predictiveMaintenance&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;input&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;Sensor data from machinery&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;processing&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;Anomaly detection algorithms&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;Maintenance alerts&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;impact&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;Reduce downtime costs&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
  
  &lt;span class=&quot;c1&quot;&gt;// Social Media&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;trendingTopics&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;input&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;Social media posts and interactions&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;processing&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;Real-time aggregation and ranking&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;Trending topics and hashtags&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;impact&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;Engage users with relevant content&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
  
  &lt;span class=&quot;c1&quot;&gt;// Gaming&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;realTimeAnalytics&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;input&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;Player actions and game events&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;processing&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;Real-time metrics calculation&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;Live dashboards and alerts&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;na&quot;&gt;impact&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;Optimize game experience&lt;/span&gt;&lt;span class=&quot;dl&quot;&gt;&quot;&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;spark-streaming-architecture&quot;&gt;Spark Streaming Architecture&lt;/h2&gt;

&lt;h3 id=&quot;core-concepts&quot;&gt;Core Concepts&lt;/h3&gt;
&lt;p&gt;Spark Streaming divides a continuous input stream into small batches (micro-batches), one per batch interval, and processes each batch with Spark’s batch processing engine.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Spark Streaming Architecture:
┌─────────────────────────────────────────────────────────────┐
│                    Input Sources                             │
├─────────────────┬─────────────────┬─────────────────────────┤
│     Kafka       │      Flume      │    TCP Sockets          │
│   (Messages)    │     (Logs)      │   (Network Data)        │
└─────────────────┴─────────────────┴─────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────────┐
│                 Spark Streaming                             │
│                                                             │
│  ┌─────────────────────────────────────────────────────┐   │
│  │              DStream (Discretized Stream)            │   │
│  │                                                     │   │
│  │  [Batch 1] → [Batch 2] → [Batch 3] → [Batch 4]    │   │
│  │     RDD        RDD        RDD        RDD           │   │
│  └─────────────────────────────────────────────────────┘   │
│                            │                               │
│                            ▼                               │
│  ┌─────────────────────────────────────────────────────┐   │
│  │            Spark Core Engine                        │   │
│  │         (Batch Processing)                          │   │
│  └─────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────────┐
│                   Output Sinks                              │
├─────────────────┬─────────────────┬─────────────────────────┤
│   Databases     │   File Systems  │    Message Queues       │
│  (SQL, NoSQL)   │   (HDFS, S3)    │   (Kafka, RabbitMQ)     │
└─────────────────┴─────────────────┴─────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
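
&lt;p&gt;The middle of the diagram can be approximated in plain Python. The sketch below is a toy model only: ordinary lists stand in for RDDs, and &lt;code&gt;ToyDStream&lt;/code&gt;, &lt;code&gt;flat_map&lt;/code&gt;, and &lt;code&gt;count_by_value&lt;/code&gt; are hypothetical names rather than the PySpark API. It illustrates that a transformation on a stream is applied to each batch independently.&lt;/p&gt;

```python
# Toy model only: plain Python lists stand in for RDDs; this is not the PySpark API.
from collections import Counter


class ToyDStream:
    """A sequence of batches, where each batch plays the role of one RDD."""

    def __init__(self, batches):
        self.batches = [list(batch) for batch in batches]

    def flat_map(self, func):
        # Apply func to every record and flatten the results, batch by batch.
        return ToyDStream(
            [[out for record in batch for out in func(record)] for batch in self.batches]
        )

    def count_by_value(self):
        # Count records independently inside each batch.
        return [dict(Counter(batch)) for batch in self.batches]


# Three batches of text lines; the last batch received no data.
stream = ToyDStream([["a b a"], ["b c"], []])
per_batch_counts = stream.flat_map(str.split).count_by_value()
print(per_batch_counts)
```

&lt;p&gt;Counts are not carried over from one batch to the next in this model, which mirrors how a per-batch word count behaves when no windowed or stateful operation is used.&lt;/p&gt;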

&lt;h3 id=&quot;dstreams-discretized-streams&quot;&gt;DStreams (Discretized Streams)&lt;/h3&gt;
&lt;p&gt;A DStream (discretized stream) is the fundamental abstraction in Spark Streaming: a continuous sequence of RDDs, each holding the data received during one batch interval.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;# DStream concept illustration
&lt;/span&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pyspark.streaming&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;StreamingContext&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pyspark&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;SparkContext&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;DStreamExample&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;__init__&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sc&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;SparkContext&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;local[2]&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;StreamingExample&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ssc&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;StreamingContext&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# 1 second batch interval
&lt;/span&gt;    
    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;create_dstream_from_socket&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;Create DStream from TCP socket&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;
        &lt;span class=&quot;c1&quot;&gt;# Connect to localhost:9999
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;lines&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ssc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;socketTextStream&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;localhost&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;9999&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        
        &lt;span class=&quot;c1&quot;&gt;# &apos;lines&apos; is a DStream; each RDD in it holds the data received during one 1-second batch interval
&lt;/span&gt;        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lines&lt;/span&gt;
    
    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;demonstrate_dstream_operations&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;Show various DStream operations&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;lines&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;create_dstream_from_socket&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
        
        &lt;span class=&quot;c1&quot;&gt;# Transformation: Split lines into words
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;words&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lines&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;flatMap&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;lambda&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;line&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;line&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;split&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt; &lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
        
        &lt;span class=&quot;c1&quot;&gt;# Transformation: Map each word to (word, 1)
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;pairs&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;words&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;map&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;lambda&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;word&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;word&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
        
        &lt;span class=&quot;c1&quot;&gt;# Transformation: Count words in each batch
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;word_counts&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pairs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;reduceByKey&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;lambda&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        
        &lt;span class=&quot;c1&quot;&gt;# Action: Print results
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;word_counts&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;pprint&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
        
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;word_counts&lt;/span&gt;
    
    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;demonstrate_windowed_operations&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;Show windowed operations&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;lines&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;create_dstream_from_socket&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;words&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lines&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;flatMap&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;lambda&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;line&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;line&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;split&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt; &lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;pairs&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;words&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;map&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;lambda&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;word&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;word&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
        
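        &lt;span class=&quot;c1&quot;&gt;# The inverse-reduce form used below requires checkpointing to be enabled;
&lt;/span&gt;        &lt;span class=&quot;c1&quot;&gt;# the directory name is only an example path
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ssc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;checkpoint&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;checkpoint&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        
        &lt;span class=&quot;c1&quot;&gt;# Window operation: count words over the last 30 seconds,
&lt;/span&gt;        &lt;span class=&quot;c1&quot;&gt;# updated every 10 seconds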
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;windowed_word_counts&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pairs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;reduceByKeyAndWindow&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
            &lt;span class=&quot;k&quot;&gt;lambda&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;      &lt;span class=&quot;c1&quot;&gt;# Reduce function
&lt;/span&gt;            &lt;span class=&quot;k&quot;&gt;lambda&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;      &lt;span class=&quot;c1&quot;&gt;# Inverse reduce function
&lt;/span&gt;            &lt;span class=&quot;mi&quot;&gt;30&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;                      &lt;span class=&quot;c1&quot;&gt;# Window duration (30 seconds)
&lt;/span&gt;            &lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;                       &lt;span class=&quot;c1&quot;&gt;# Slide duration (10 seconds)
&lt;/span&gt;        &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        
        &lt;span class=&quot;n&quot;&gt;windowed_word_counts&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;pprint&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;windowed_word_counts&lt;/span&gt;
    
    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;start_streaming&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;Start the streaming context&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ssc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;start&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;             &lt;span class=&quot;c1&quot;&gt;# Start the computation
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ssc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;awaitTermination&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# Wait for termination
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
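
&lt;p&gt;The inverse reduce function passed to &lt;code&gt;reduceByKeyAndWindow&lt;/code&gt; lets a window be updated incrementally instead of being recomputed from every batch it covers. The following is a minimal sketch of that idea in plain Python, not Spark code: &lt;code&gt;slide_window&lt;/code&gt; is a hypothetical helper, and the window is assumed to cover three batches and slide by one batch to keep the arithmetic small.&lt;/p&gt;

```python
# Plain-Python illustration of the incremental window update; not Spark code.
from collections import Counter


def slide_window(window_counts, entering, leaving):
    """Update window totals: add the entering batch, subtract the leaving one."""
    updated = Counter(window_counts)
    updated.update(entering)    # plays the role of the reduce function: x + y
    updated.subtract(leaving)   # plays the role of the inverse reduce function: x - y
    # Drop words whose count fell to zero so they no longer appear in the window.
    return Counter({word: n for word, n in updated.items() if n != 0})


batches = [["a", "b"], ["a"], ["c"], ["b", "b"]]
counts = [Counter(batch) for batch in batches]

# Assumed toy sizes: the window covers 3 batches and slides by 1 batch.
window = counts[0] + counts[1] + counts[2]
incremental = slide_window(window, entering=counts[3], leaving=counts[0])

# Recomputing the new window from scratch gives the same totals.
recomputed = counts[1] + counts[2] + counts[3]
print(incremental == recomputed, dict(incremental))
```

&lt;p&gt;The incremental form touches only the batch entering the window and the batch leaving it, which is why it pays off when the window is long relative to the slide interval.&lt;/p&gt;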

&lt;h3 id=&quot;micro-batch-processing-model&quot;&gt;Micro-batch Processing Model&lt;/h3&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Micro-batch Timeline:
Time:    0s    1s    2s    3s    4s    5s    6s
         │     │     │     │     │     │     │
Input:   ████  ████  ████  ████  ████  ████  ████
         │     │     │     │     │     │     │
Batch:   [B1]  [B2]  [B3]  [B4]  [B5]  [B6]  [B7]
         │     │     │     │     │     │     │
Process:  ▼     ▼     ▼     ▼     ▼     ▼     ▼
         RDD1  RDD2  RDD3  RDD4  RDD5  RDD6  RDD7

Characteristics:
• Batch Interval: 1 second (configurable)
• Each batch becomes an RDD
• Partitions within each batch RDD are processed in parallel
• Results are available once a batch finishes processing
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
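
&lt;p&gt;The timeline above can be sketched as a small grouping function. This is an illustrative approximation, not Spark’s implementation: &lt;code&gt;discretize&lt;/code&gt; is a hypothetical helper, and each event is assumed to carry the time at which it arrived, in seconds since the stream started.&lt;/p&gt;

```python
# Illustrative sketch of micro-batching; not Spark's implementation.
from collections import defaultdict


def discretize(events, batch_interval):
    """Group (arrival_time_seconds, value) events into fixed-length batches."""
    batches = defaultdict(list)
    for arrival_time, value in events:
        # Events arriving within the same interval share a batch index.
        batch_index = int(arrival_time // batch_interval)
        batches[batch_index].append(value)
    return dict(sorted(batches.items()))


events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d"), (2.7, "e")]
batches = discretize(events, batch_interval=1.0)
print(batches)
```

&lt;p&gt;With a 1-second interval, the five events fall into three batches. A larger &lt;code&gt;batch_interval&lt;/code&gt; produces fewer, bigger batches, so results arrive less often.&lt;/p&gt;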

&lt;h2 id=&quot;input-sources-and-data-ingestion&quot;&gt;Input Sources and Data Ingestion&lt;/h2&gt;

&lt;h3 id=&quot;built-in-input-sources&quot;&gt;Built-in Input Sources&lt;/h3&gt;
&lt;div class=&quot;language-scala highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;// Scala examples of various input sources&lt;/span&gt;
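&lt;span class=&quot;k&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;org.apache.kafka.clients.consumer.ConsumerRecord&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;org.apache.kafka.common.serialization.StringDeserializer&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;org.apache.spark.SparkConf&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;org.apache.spark.storage.StorageLevel&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;org.apache.spark.streaming._&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;org.apache.spark.streaming.dstream.DStream&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;org.apache.spark.streaming.kafka010._&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;org.apache.spark.streaming.receiver.Receiver&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;object&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;InputSources&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// Local two-thread configuration, assumed here for illustration&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;sparkConf&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;SparkConf&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;().&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;setMaster&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;local[2]&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;setAppName&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;InputSources&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;ssc&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;StreamingContext&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sparkConf&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Seconds&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;))&lt;/span&gt;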
  
  &lt;span class=&quot;c1&quot;&gt;// 1. File-based sources&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;fileSource&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;DStream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;// Monitor a directory for new files&lt;/span&gt;
    &lt;span class=&quot;nv&quot;&gt;ssc&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;textFileStream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;hdfs://namenode:port/streaming/input/&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
  
  &lt;span class=&quot;c1&quot;&gt;// 2. Socket-based sources  &lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;socketSource&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;DStream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;// TCP socket connection&lt;/span&gt;
    &lt;span class=&quot;nv&quot;&gt;ssc&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;socketTextStream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;localhost&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;9999&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
  
  &lt;span class=&quot;c1&quot;&gt;// 3. Kafka integration&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;kafkaSource&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;DStream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;ConsumerRecord&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt;, &lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;]]&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;kafkaParams&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Map&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt;, &lt;span class=&quot;kt&quot;&gt;Object&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;](&lt;/span&gt;
      &lt;span class=&quot;s&quot;&gt;&quot;bootstrap.servers&quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;localhost:9092&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;s&quot;&gt;&quot;key.deserializer&quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;classOf&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;StringDeserializer&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;],&lt;/span&gt;
      &lt;span class=&quot;s&quot;&gt;&quot;value.deserializer&quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;classOf&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;StringDeserializer&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;],&lt;/span&gt;
      &lt;span class=&quot;s&quot;&gt;&quot;group.id&quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;streaming-consumer-group&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;s&quot;&gt;&quot;auto.offset.reset&quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;latest&quot;&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
    
    &lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;topics&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Array&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;user-events&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;system-logs&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
    
    &lt;span class=&quot;nv&quot;&gt;KafkaUtils&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;createDirectStream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt;, &lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;](&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;ssc&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;nc&quot;&gt;PreferConsistent&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;nc&quot;&gt;Subscribe&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt;, &lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;](&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;topics&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;kafkaParams&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
  
  &lt;span class=&quot;c1&quot;&gt;// 4. Custom receiver&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;customSource&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;DStream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;nv&quot;&gt;ssc&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;receiverStream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;CustomReceiver&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;custom-source-url&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;))&lt;/span&gt;
  &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// Custom receiver implementation&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;CustomReceiver&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;url&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;extends&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Receiver&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;](&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;StorageLevel&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;MEMORY_AND_DISK_2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
  
  &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;onStart&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Unit&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;// Start the thread that receives data&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Thread&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Custom Receiver&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
      &lt;span class=&quot;k&quot;&gt;override&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;run&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Unit&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;receive&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;}.&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;start&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt;
  &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
  
  &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;onStop&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Unit&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;// Cleanup resources&lt;/span&gt;
  &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
  
  &lt;span class=&quot;k&quot;&gt;private&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;receive&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Unit&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;var&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;userInput&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kc&quot;&gt;null&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;try&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
      &lt;span class=&quot;c1&quot;&gt;// Simulate receiving data&lt;/span&gt;
      &lt;span class=&quot;nf&quot;&gt;while&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(!&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;isStopped&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;userInput&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;receiveData&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;();&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;userInput&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;kc&quot;&gt;null&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;})&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;store&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;userInput&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;// Store received data&lt;/span&gt;
      &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;catch&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
      &lt;span class=&quot;k&quot;&gt;case&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;e&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Exception&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&amp;gt;&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;restart&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Error receiving data&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;e&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
  &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
  
  &lt;span class=&quot;k&quot;&gt;private&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;receiveData&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;// Implement actual data receiving logic&lt;/span&gt;
    &lt;span class=&quot;nv&quot;&gt;Thread&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;sleep&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1000&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Data received at ${System.currentTimeMillis()}&quot;&lt;/span&gt;
  &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;kafka-integration-deep-dive&quot;&gt;Kafka Integration Deep Dive&lt;/h3&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;# Python Kafka integration example
&lt;/span&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pyspark.sql&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;SparkSession&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pyspark.sql.functions&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pyspark.sql.types&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;KafkaStreamingProcessor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;__init__&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;spark&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;SparkSession&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;builder&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;appName&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;KafkaStreamingApp&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;config&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;spark.sql.streaming.checkpointLocation&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;/tmp/checkpoint&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;getOrCreate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
        
        &lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;spark&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sparkContext&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;setLogLevel&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;WARN&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    
    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;create_kafka_stream&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;Create streaming DataFrame from Kafka&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;kafka_df&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;spark&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;readStream&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;format&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;kafka&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;option&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;kafka.bootstrap.servers&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;localhost:9092&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;option&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;subscribe&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;user-events,system-logs,transactions&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;option&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;startingOffsets&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;latest&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;load&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
        
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;kafka_df&lt;/span&gt;
    
    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;process_user_events&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;Process user event stream&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;kafka_df&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;create_kafka_stream&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
        
        &lt;span class=&quot;c1&quot;&gt;# Define schema for user events
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;user_event_schema&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;StructType&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;
            &lt;span class=&quot;nc&quot;&gt;StructField&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;user_id&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;StringType&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
            &lt;span class=&quot;nc&quot;&gt;StructField&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;event_type&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;StringType&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
            &lt;span class=&quot;nc&quot;&gt;StructField&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;timestamp&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;LongType&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
            &lt;span class=&quot;nc&quot;&gt;StructField&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;properties&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;MapType&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;StringType&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;StringType&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()),&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;
        
        &lt;span class=&quot;c1&quot;&gt;# Parse JSON messages
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;parsed_df&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;kafka_df&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;filter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;col&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;topic&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;user-events&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;select&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
                &lt;span class=&quot;nf&quot;&gt;from_json&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;col&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;value&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;cast&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;string&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;user_event_schema&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;alias&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
                &lt;span class=&quot;nf&quot;&gt;col&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;timestamp&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;alias&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;kafka_timestamp&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
                &lt;span class=&quot;nf&quot;&gt;col&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;partition&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
                &lt;span class=&quot;nf&quot;&gt;col&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;offset&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
            &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;select&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;data.*&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;kafka_timestamp&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;partition&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;offset&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        
        &lt;span class=&quot;c1&quot;&gt;# Real-time aggregations.
&lt;/span&gt;        &lt;span class=&quot;c1&quot;&gt;# The JSON timestamp field is a LongType, which withWatermark() and window() reject,
&lt;/span&gt;        &lt;span class=&quot;c1&quot;&gt;# so this sketch keys on the Kafka record timestamp (a TimestampType column) instead.
&lt;/span&gt;        &lt;span class=&quot;c1&quot;&gt;# Exact distinct counts are unsupported on streaming DataFrames, hence approx_count_distinct.
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;user_activity&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;parsed_df&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;withWatermark&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;kafka_timestamp&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;10 minutes&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;groupBy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
                &lt;span class=&quot;nf&quot;&gt;window&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;col&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;kafka_timestamp&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;5 minutes&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;1 minute&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
                &lt;span class=&quot;nf&quot;&gt;col&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;event_type&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
            &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;agg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
                &lt;span class=&quot;nf&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;alias&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;event_count&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
                &lt;span class=&quot;nf&quot;&gt;approx_count_distinct&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;user_id&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;alias&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;unique_users&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
            &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;user_activity&lt;/span&gt;
    
    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;detect_anomalies&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;Real-time anomaly detection&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;kafka_df&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;create_kafka_stream&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
        
        &lt;span class=&quot;c1&quot;&gt;# Transaction schema
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;transaction_schema&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;StructType&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;
            &lt;span class=&quot;nc&quot;&gt;StructField&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;transaction_id&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;StringType&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
            &lt;span class=&quot;nc&quot;&gt;StructField&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;user_id&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;StringType&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
            &lt;span class=&quot;nc&quot;&gt;StructField&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;amount&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;DoubleType&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
            &lt;span class=&quot;nc&quot;&gt;StructField&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;merchant&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;StringType&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
            &lt;span class=&quot;nc&quot;&gt;StructField&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;timestamp&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;LongType&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;
        
        &lt;span class=&quot;n&quot;&gt;transactions&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;kafka_df&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;filter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;col&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;topic&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;transactions&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;select&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
                &lt;span class=&quot;nf&quot;&gt;from_json&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;col&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;value&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;cast&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;string&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;transaction_schema&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;alias&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
            &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;select&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;data.*&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        
        &lt;span class=&quot;c1&quot;&gt;# Detect large transactions (simple rule-based)
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;anomalies&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;transactions&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;filter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;col&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;amount&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;10000&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;withColumn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;anomaly_type&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;lit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;large_transaction&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;withColumn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;detected_at&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;current_timestamp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;())&lt;/span&gt;
        
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;anomalies&lt;/span&gt;
    
    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;start_streaming_queries&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;Start all streaming queries&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;
        
        &lt;span class=&quot;c1&quot;&gt;# User activity monitoring
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;user_activity&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;process_user_events&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;activity_query&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;user_activity&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;writeStream&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;outputMode&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;update&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;format&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;console&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;option&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;truncate&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;False&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;trigger&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;processingTime&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;30 seconds&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;start&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
        
        &lt;span class=&quot;c1&quot;&gt;# Anomaly detection
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;anomalies&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;detect_anomalies&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;anomaly_query&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;anomalies&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;writeStream&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;outputMode&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;append&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;format&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;console&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;option&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;truncate&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;False&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;start&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
        
        &lt;span class=&quot;c1&quot;&gt;# Wait for termination
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;spark&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;streams&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;awaitAnyTermination&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Usage
&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;__name__&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;__main__&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;processor&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;KafkaStreamingProcessor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;processor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;start_streaming_queries&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;transformations-and-operations&quot;&gt;Transformations and Operations&lt;/h2&gt;

&lt;h3 id=&quot;stateless-transformations&quot;&gt;Stateless Transformations&lt;/h3&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;# Stateless transformations example
&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;StatelessTransformations&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
    
    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;basic_transformations&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dstream&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;another_dstream&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;Basic stateless transformations&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;
        
        &lt;span class=&quot;c1&quot;&gt;# Map: Transform each element
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;mapped&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dstream&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;map&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;lambda&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;upper&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;())&lt;/span&gt;
        
        &lt;span class=&quot;c1&quot;&gt;# Filter: Select elements based on condition
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;filtered&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dstream&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;filter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;lambda&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        
        &lt;span class=&quot;c1&quot;&gt;# FlatMap: Transform and flatten
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;words&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dstream&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;flatMap&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;lambda&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;line&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;line&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;split&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt; &lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
        
        &lt;span class=&quot;c1&quot;&gt;# Union: Combine multiple DStreams
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;combined&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dstream&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;union&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;another_dstream&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mapped&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;filtered&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;words&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;combined&lt;/span&gt;
    
    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;aggregation_transformations&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dstream&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;Aggregation transformations&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;
        
        &lt;span class=&quot;c1&quot;&gt;# ReduceByKey: Aggregate by key within each batch
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;word_pairs&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dstream&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;map&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;lambda&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;word&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;word&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;word_counts&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;word_pairs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;reduceByKey&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;lambda&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        
        &lt;span class=&quot;c1&quot;&gt;# CountByValue: Count occurrences of each value
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;value_counts&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dstream&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;countByValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
        
        &lt;span class=&quot;c1&quot;&gt;# Reduce: Aggregate all elements in each batch
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;total&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dstream&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;reduce&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;lambda&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;word_counts&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;value_counts&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;total&lt;/span&gt;
    
    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;join_operations&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dstream1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dstream2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;Join operations between DStreams&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;
        
        &lt;span class=&quot;c1&quot;&gt;# Join: Inner join on keys
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;joined&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dstream1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;join&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dstream2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        
        &lt;span class=&quot;c1&quot;&gt;# LeftOuterJoin: Left outer join
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;left_joined&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dstream1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;leftOuterJoin&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dstream2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        
        &lt;span class=&quot;c1&quot;&gt;# CoGroup: Group values from both streams by key
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;cogrouped&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dstream1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;cogroup&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dstream2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;joined&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;left_joined&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cogrouped&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
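
&lt;p&gt;The transformations above are stateless because each micro-batch is processed on its own. The following plain-Python sketch (no Spark required) imitates what &lt;code&gt;reduceByKey&lt;/code&gt; computes per batch. The helper &lt;code&gt;reduce_by_key&lt;/code&gt; and the sample batches are hypothetical stand-ins introduced for illustration, not part of the Spark API.&lt;/p&gt;

```python
def reduce_by_key(batch, func):
    """Combine values that share a key within a single batch.

    Hypothetical plain-Python stand-in for DStream.reduceByKey.
    """
    result = {}
    for key, value in batch:
        result[key] = func(result[key], value) if key in result else value
    return result


# Two hypothetical micro-batches of (word, 1) pairs
batches = [
    [("spark", 1), ("kafka", 1), ("spark", 1)],
    [("spark", 1), ("flink", 1)],
]

# Each batch is reduced independently; no counts carry over between batches
per_batch_counts = [reduce_by_key(batch, lambda a, b: a + b) for batch in batches]
print(per_batch_counts)
# prints [{'spark': 2, 'kafka': 1}, {'spark': 1, 'flink': 1}]
```

&lt;p&gt;Note that the count for &lt;code&gt;spark&lt;/code&gt; restarts in the second batch. Keeping a running total across batches requires a stateful transformation.&lt;/p&gt;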

&lt;h3 id=&quot;stateful-transformations&quot;&gt;Stateful Transformations&lt;/h3&gt;
&lt;div class=&quot;language-scala highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;// Stateful transformations in Scala&lt;/span&gt;
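&lt;span class=&quot;k&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;org.apache.spark.streaming.Minutes&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;org.apache.spark.streaming.State&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;org.apache.spark.streaming.StateSpec&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;org.apache.spark.streaming.dstream.DStream&lt;/span&gt;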

&lt;span class=&quot;k&quot;&gt;object&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;StatefulTransformations&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
  
  &lt;span class=&quot;c1&quot;&gt;// UpdateStateByKey: Maintain state across batches&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;updateStateByKey&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dstream&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;DStream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt;, &lt;span class=&quot;kt&quot;&gt;Int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)])&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;DStream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt;, &lt;span class=&quot;kt&quot;&gt;Int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)]&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
    
    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;updateFunction&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;newValues&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Seq&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;runningCount&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Option&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;])&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Option&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
      &lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;newCount&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;newValues&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;sum&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;runningCount&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;getOrElse&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
      &lt;span class=&quot;nc&quot;&gt;Some&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;newCount&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
    
    &lt;span class=&quot;nv&quot;&gt;dstream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;updateStateByKey&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;](&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;updateFunction&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;_&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
  
  &lt;span class=&quot;c1&quot;&gt;// MapWithState: More efficient state management&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;mapWithState&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dstream&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;DStream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt;, &lt;span class=&quot;kt&quot;&gt;Int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;initialRDD&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;org.apache.spark.rdd.RDD&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt;, &lt;span class=&quot;kt&quot;&gt;Int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)])&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;DStream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt;, &lt;span class=&quot;kt&quot;&gt;Int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)]&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
    
    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;mappingFunction&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;key&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;value&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Option&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;state&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;State&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;])&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt;, &lt;span class=&quot;kt&quot;&gt;Int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
      &lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;currentCount&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;value&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;getOrElse&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
      &lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;previousCount&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;state&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;getOption&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;().&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;getOrElse&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
      &lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;newCount&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;currentCount&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;previousCount&lt;/span&gt;
      
      &lt;span class=&quot;c1&quot;&gt;// State cannot be updated while the key is being timed out&lt;/span&gt;
      &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(!&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;state&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;isTimingOut&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;())&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;state&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;update&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;newCount&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
      &lt;span class=&quot;c1&quot;&gt;// The three-argument mapping function returns the mapped record directly, not an Option&lt;/span&gt;
      &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;key&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;newCount&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
    
    &lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;stateSpec&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;StateSpec&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;function&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mappingFunction&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;_&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
      &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;initialState&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;initialRDD&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;// Optional initial state&lt;/span&gt;
      &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;numPartitions&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;         &lt;span class=&quot;c1&quot;&gt;// Number of partitions for state&lt;/span&gt;
      &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;timeout&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;Minutes&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;30&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;))&lt;/span&gt;      &lt;span class=&quot;c1&quot;&gt;// Timeout inactive keys&lt;/span&gt;
    
    &lt;span class=&quot;nv&quot;&gt;dstream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;mapWithState&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;stateSpec&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
  
  &lt;span class=&quot;c1&quot;&gt;// Session-based analytics example&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// Declared at object level so the method signature below can reference them&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;case&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Event&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;timestamp&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Long&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;action&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;page&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;case&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Session&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;startTime&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Long&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;endTime&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Long&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pageViews&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;actions&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;List&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;])&lt;/span&gt;
  
  &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;sessionAnalytics&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;userEvents&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;DStream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt;, &lt;span class=&quot;kt&quot;&gt;Event&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)])&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;DStream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt;, &lt;span class=&quot;kt&quot;&gt;Session&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)]&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
    
    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;updateSession&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;key&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;value&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Option&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Event&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;state&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;State&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Session&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;])&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Option&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt;, &lt;span class=&quot;kt&quot;&gt;Session&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)]&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
      &lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;event&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;value&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;get&lt;/span&gt;
      &lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;currentSession&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;state&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;getOption&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt;
      
      &lt;span class=&quot;n&quot;&gt;currentSession&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;match&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;case&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Some&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;session&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&amp;gt;&lt;/span&gt;
          &lt;span class=&quot;c1&quot;&gt;// Update existing session&lt;/span&gt;
          &lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;updatedSession&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;session&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;copy&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;endTime&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;event&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;timestamp&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;pageViews&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;session&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;pageViews&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;actions&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;session&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;actions&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:+&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;event&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;action&lt;/span&gt;
          &lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
          &lt;span class=&quot;nv&quot;&gt;state&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;update&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;updatedSession&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
          &lt;span class=&quot;nc&quot;&gt;Some&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;key&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;updatedSession&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;))&lt;/span&gt;
          
        &lt;span class=&quot;k&quot;&gt;case&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;None&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&amp;gt;&lt;/span&gt;
          &lt;span class=&quot;c1&quot;&gt;// Create new session&lt;/span&gt;
          &lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;newSession&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Session&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;startTime&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;event&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;timestamp&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;endTime&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;event&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;timestamp&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;pageViews&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;actions&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;List&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;event&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;action&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
          &lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
          &lt;span class=&quot;nv&quot;&gt;state&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;update&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;newSession&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
          &lt;span class=&quot;nc&quot;&gt;Some&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;key&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;newSession&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;))&lt;/span&gt;
      &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
    
    &lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;stateSpec&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;StateSpec&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;function&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;updateSession&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
      &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;timeout&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;Minutes&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;30&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;))&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;// Session timeout&lt;/span&gt;
    
    &lt;span class=&quot;nv&quot;&gt;userEvents&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;mapWithState&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;stateSpec&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;window-operations&quot;&gt;Window Operations&lt;/h2&gt;

&lt;h3 id=&quot;time-based-windows&quot;&gt;Time-based Windows&lt;/h3&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;# Window operations example
&lt;/span&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pyspark.streaming&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;StreamingContext&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;datetime&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;datetime&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;timedelta&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;WindowOperations&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
    
    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;__init__&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ssc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ssc&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ssc&lt;/span&gt;
    
    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;sliding_window_example&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dstream&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;Sliding window operations&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;
        
        &lt;span class=&quot;c1&quot;&gt;# Count elements in sliding window
&lt;/span&gt;        &lt;span class=&quot;c1&quot;&gt;# Window: 30 seconds, Slide: 10 seconds (both must be multiples of the batch interval; checkpointing must be enabled)
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;windowed_counts&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dstream&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;countByWindow&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;30&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        
        &lt;span class=&quot;c1&quot;&gt;# Reduce over sliding window
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;windowed_sum&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dstream&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;reduceByWindow&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
            &lt;span class=&quot;k&quot;&gt;lambda&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;    &lt;span class=&quot;c1&quot;&gt;# Reduce function
&lt;/span&gt;            &lt;span class=&quot;k&quot;&gt;lambda&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;    &lt;span class=&quot;c1&quot;&gt;# Inverse reduce function (may be None; needs checkpointing when given)
&lt;/span&gt;            &lt;span class=&quot;mi&quot;&gt;30&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;                    &lt;span class=&quot;c1&quot;&gt;# Window duration
&lt;/span&gt;            &lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;                     &lt;span class=&quot;c1&quot;&gt;# Slide duration
&lt;/span&gt;        &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;windowed_counts&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;windowed_sum&lt;/span&gt;
    
    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;keyed_window_operations&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pair_dstream&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;Window operations on key-value pairs&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;
        
        &lt;span class=&quot;c1&quot;&gt;# ReduceByKeyAndWindow: Aggregate by key over window
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;windowed_word_counts&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pair_dstream&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;reduceByKeyAndWindow&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
            &lt;span class=&quot;k&quot;&gt;lambda&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;    &lt;span class=&quot;c1&quot;&gt;# Reduce function
&lt;/span&gt;            &lt;span class=&quot;k&quot;&gt;lambda&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;    &lt;span class=&quot;c1&quot;&gt;# Inverse reduce function
&lt;/span&gt;            &lt;span class=&quot;mi&quot;&gt;60&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;                    &lt;span class=&quot;c1&quot;&gt;# Window duration (60 seconds)
&lt;/span&gt;            &lt;span class=&quot;mi&quot;&gt;20&lt;/span&gt;                     &lt;span class=&quot;c1&quot;&gt;# Slide duration (20 seconds)
&lt;/span&gt;        &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        
        &lt;span class=&quot;c1&quot;&gt;# CountByValueAndWindow: Count values over window
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;windowed_value_counts&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pair_dstream&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;countByValueAndWindow&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;60&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;20&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;windowed_word_counts&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;windowed_value_counts&lt;/span&gt;
    
    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;advanced_window_analytics&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;user_events&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;Advanced window-based analytics&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;
        
        &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;calculate_metrics&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;time&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rdd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
            &lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;Calculate custom metrics for each window&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;
            &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;not&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rdd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;isEmpty&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;():&lt;/span&gt;
                &lt;span class=&quot;c1&quot;&gt;# Convert to DataFrame for complex operations
&lt;/span&gt;                &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rdd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;toDF&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;user_id&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;event_type&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;timestamp&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;
                
                &lt;span class=&quot;c1&quot;&gt;# Calculate various metrics
&lt;/span&gt;                &lt;span class=&quot;n&quot;&gt;total_events&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;unique_users&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;select&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;user_id&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;distinct&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;().&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;event_distribution&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;groupBy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;event_type&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;().&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;collect&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
                
                &lt;span class=&quot;nf&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sa&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;Window ending at &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;time&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
                &lt;span class=&quot;nf&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sa&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;  Total events: &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;total_events&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
                &lt;span class=&quot;nf&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sa&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;  Unique users: &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;unique_users&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
                &lt;span class=&quot;nf&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sa&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;  Event distribution: &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;event_distribution&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        
        &lt;span class=&quot;c1&quot;&gt;# Apply custom function to each window
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;user_events&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;window&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;60&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;20&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;foreachRDD&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;calculate_metrics&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    
    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;real_time_dashboard_data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;metrics_stream&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;Generate data for real-time dashboard&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;
        
        &lt;span class=&quot;c1&quot;&gt;# 5-minute window, updated every minute
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;dashboard_metrics&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;metrics_stream&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;window&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;300&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;60&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;transform&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
            &lt;span class=&quot;k&quot;&gt;lambda&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;time&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rdd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;calculate_dashboard_metrics&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;time&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rdd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dashboard_metrics&lt;/span&gt;
    
    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;calculate_dashboard_metrics&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;time&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rdd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;Calculate metrics for dashboard&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rdd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;isEmpty&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;():&lt;/span&gt;
            &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rdd&lt;/span&gt;
        
        &lt;span class=&quot;c1&quot;&gt;# Convert to DataFrame for SQL operations
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rdd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;toDF&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;metric_name&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;value&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;timestamp&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;
        
        &lt;span class=&quot;c1&quot;&gt;# Register as temporary table
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;createOrReplaceTempView&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;metrics&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        
        &lt;span class=&quot;c1&quot;&gt;# Calculate aggregated metrics using SQL
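&lt;/span&gt;        &lt;span class=&quot;c1&quot;&gt;# SparkContext has no sql() method, so run the query through the active SparkSession
&lt;/span&gt;        &lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pyspark.sql&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;SparkSession&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;SparkSession&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;builder&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;getOrCreate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;().&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;sql&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;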
            SELECT 
                &lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;avg_response_time&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt; as metric,
                AVG(CASE WHEN metric_name = &lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;response_time&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt; THEN value END) as value,
                MAX(timestamp) as timestamp
            FROM metrics
            UNION ALL
            SELECT 
                &lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;total_requests&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt; as metric,
                SUM(CASE WHEN metric_name = &lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;request_count&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt; THEN value END) as value,
                MAX(timestamp) as timestamp
            FROM metrics
            UNION ALL
            SELECT 
                &lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;error_rate&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt; as metric,
                SUM(CASE WHEN metric_name = &lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;error_count&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt; THEN value END) / 
                SUM(CASE WHEN metric_name = &lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;request_count&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt; THEN value END) * 100 as value,
                MAX(timestamp) as timestamp
            FROM metrics
        &lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        
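        &lt;span class=&quot;c1&quot;&gt;# transform() must return an RDD, so convert the DataFrame back
&lt;/span&gt;        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rdd&lt;/span&gt;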
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;output-operations-and-sinks&quot;&gt;Output Operations and Sinks&lt;/h2&gt;

&lt;h3 id=&quot;built-in-output-operations&quot;&gt;Built-in Output Operations&lt;/h3&gt;
&lt;div class=&quot;language-java highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;// Java output operations example&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;org.apache.spark.streaming.api.java.*&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;org.apache.spark.api.java.function.*&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;

&lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;OutputOperations&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
    
    &lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;basicOutputs&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;JavaDStream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dstream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
        
        &lt;span class=&quot;c1&quot;&gt;// print: write the first elements of each batch to the console (for debugging)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;dstream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;();&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// Default: first 10 elements of each batch&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;dstream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;20&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// First 20 elements of each batch&lt;/span&gt;
        
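        &lt;span class=&quot;c1&quot;&gt;// saveAsTextFiles: save the RDD of each batch as text files. JavaDStream does not&lt;/span&gt;
        &lt;span class=&quot;c1&quot;&gt;// expose this method directly, so call it on the underlying Scala DStream.&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;dstream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;dstream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;().&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;saveAsTextFiles&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;hdfs://output/prefix&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;suffix&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;
        
        &lt;span class=&quot;c1&quot;&gt;// saveAsObjectFiles: save each batch as serialized objects (also via the underlying DStream)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;dstream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;dstream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;().&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;saveAsObjectFiles&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;hdfs://output/objects&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;obj&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;
        
        &lt;span class=&quot;c1&quot;&gt;// saveAsHadoopFiles: save using a Hadoop OutputFormat. It is defined on pair streams only,&lt;/span&gt;
        &lt;span class=&quot;c1&quot;&gt;// so map each record to a (key, value) pair first; the value 1 is only a placeholder.&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;dstream&lt;/span&gt;
            &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;mapToPair&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;s&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;scala&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;Tuple2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&amp;gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Text&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;IntWritable&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)))&lt;/span&gt;
            &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;saveAsHadoopFiles&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;
                &lt;span class=&quot;s&quot;&gt;&quot;hdfs://output/hadoop&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt;
                &lt;span class=&quot;s&quot;&gt;&quot;part&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt;
                &lt;span class=&quot;nc&quot;&gt;Text&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;class&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt;
                &lt;span class=&quot;nc&quot;&gt;IntWritable&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;class&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt;
                &lt;span class=&quot;nc&quot;&gt;TextOutputFormat&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;class&lt;/span&gt;
            &lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;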
    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
    
    &lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;customOutputs&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;JavaDStream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dstream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
        
        &lt;span class=&quot;c1&quot;&gt;// foreachRDD: apply a custom function to the RDD of each batch (the function runs on the driver)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;dstream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;foreachRDD&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;VoidFunction&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;JavaRDD&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;nd&quot;&gt;@Override&lt;/span&gt;
            &lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;call&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;JavaRDD&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rdd&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;throws&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Exception&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
                &lt;span class=&quot;c1&quot;&gt;// Custom processing for each RDD&lt;/span&gt;
                &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rdd&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;isEmpty&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;())&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
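                    &lt;span class=&quot;c1&quot;&gt;// Save to a database, send to an external system, etc.&lt;/span&gt;
                    &lt;span class=&quot;c1&quot;&gt;// collect() pulls the whole batch to the driver, so this suits small batches only.&lt;/span&gt;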
                    &lt;span class=&quot;n&quot;&gt;saveToDatabase&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rdd&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;collect&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;());&lt;/span&gt;
                &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
            &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
        &lt;span class=&quot;o&quot;&gt;});&lt;/span&gt;
        
        &lt;span class=&quot;c1&quot;&gt;// Transform and save&lt;/span&gt;
        &lt;span class=&quot;nc&quot;&gt;JavaDStream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;processed&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dstream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;map&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Function&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;nd&quot;&gt;@Override&lt;/span&gt;
            &lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;call&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;throws&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Exception&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
                &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;processRecord&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;
            &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
        &lt;span class=&quot;o&quot;&gt;});&lt;/span&gt;
        
        &lt;span class=&quot;n&quot;&gt;processed&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;foreachRDD&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rdd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;c1&quot;&gt;// Partition-wise processing for better performance&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;rdd&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;foreachPartition&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;partition&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
                &lt;span class=&quot;c1&quot;&gt;// Open one connection per partition (DatabaseConnection is a placeholder class)&lt;/span&gt;
                &lt;span class=&quot;nc&quot;&gt;DatabaseConnection&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;conn&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;DatabaseConnection&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;();&lt;/span&gt;
                
                &lt;span class=&quot;k&quot;&gt;try&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
                    &lt;span class=&quot;k&quot;&gt;while&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;partition&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;hasNext&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;())&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
                        &lt;span class=&quot;nc&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;record&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;partition&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;next&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;();&lt;/span&gt;
                        &lt;span class=&quot;n&quot;&gt;conn&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;insert&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;record&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;
                    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
                &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;finally&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
                    &lt;span class=&quot;c1&quot;&gt;// Close the connection even if an insert fails&lt;/span&gt;
                    &lt;span class=&quot;n&quot;&gt;conn&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;close&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;();&lt;/span&gt;
                &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
            &lt;span class=&quot;o&quot;&gt;});&lt;/span&gt;
        &lt;span class=&quot;o&quot;&gt;});&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
    
    &lt;span class=&quot;kd&quot;&gt;private&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;saveToDatabase&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;List&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;records&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;c1&quot;&gt;// Database saving logic&lt;/span&gt;
        &lt;span class=&quot;nc&quot;&gt;System&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;out&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;println&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Saving &quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;records&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot; records to database&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
    
    &lt;span class=&quot;kd&quot;&gt;private&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;processRecord&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;record&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;c1&quot;&gt;// Record processing logic&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;record&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;toUpperCase&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;();&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;advanced-output-patterns&quot;&gt;Advanced Output Patterns&lt;/h3&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;# Advanced output patterns
&lt;/span&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;json&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;datetime&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;datetime&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pyspark.sql&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;SparkSession&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pyspark.sql.functions&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;AdvancedOutputs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
    
    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;__init__&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;spark&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;SparkSession&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;builder&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;appName&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;AdvancedOutputs&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;getOrCreate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
    
    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;multi_sink_output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;processed_stream&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;Output to multiple sinks simultaneously&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;
        
        &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;multi_sink_foreach_batch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;epoch_id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
            &lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;Process each micro-batch&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;
            
            &lt;span class=&quot;c1&quot;&gt;# Cache the DataFrame since we&apos;ll use it multiple times
&lt;/span&gt;            &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;cache&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
            
            &lt;span class=&quot;k&quot;&gt;try&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
                &lt;span class=&quot;c1&quot;&gt;# Sink 1: Save to Parquet for analytics
&lt;/span&gt;                &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;write&lt;/span&gt; \
                    &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;mode&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;append&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
                    &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;partitionBy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;date&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
                    &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;parquet&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;s3://data-lake/processed-events/&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
                
                &lt;span class=&quot;c1&quot;&gt;# Sink 2: Save alerts to database
&lt;/span&gt;                &lt;span class=&quot;n&quot;&gt;alerts&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;filter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;col&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;severity&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;HIGH&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
                &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;alerts&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
                    &lt;span class=&quot;n&quot;&gt;alerts&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;write&lt;/span&gt; \
                        &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;format&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;jdbc&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
                        &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;option&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;url&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;jdbc:postgresql://db:5432/alerts&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
                        &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;option&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;dbtable&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;real_time_alerts&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
                        &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;option&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;user&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;admin&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
                        &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;option&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;password&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;password&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
                        &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;mode&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;append&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
                        &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;save&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
                
                &lt;span class=&quot;c1&quot;&gt;# Sink 3: Send metrics to monitoring system
&lt;/span&gt;                &lt;span class=&quot;n&quot;&gt;metrics&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;groupBy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;event_type&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
                    &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;agg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;alias&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; \
                    &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;collect&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
                
                &lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;send_metrics_to_monitoring&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;metrics&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;epoch_id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
                
                &lt;span class=&quot;c1&quot;&gt;# Sink 4: Update real-time dashboard cache
&lt;/span&gt;                &lt;span class=&quot;n&quot;&gt;dashboard_data&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;groupBy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;region&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;event_type&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
                    &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;agg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
                        &lt;span class=&quot;nf&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;alias&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;event_count&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
                        &lt;span class=&quot;nf&quot;&gt;avg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;processing_time&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;alias&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;avg_processing_time&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
                    &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
                
                &lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;update_dashboard_cache&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dashboard_data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
                
            &lt;span class=&quot;k&quot;&gt;finally&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;unpersist&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
        
        &lt;span class=&quot;c1&quot;&gt;# Start the streaming query with custom foreach batch
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;query&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;processed_stream&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;writeStream&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;foreachBatch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;multi_sink_foreach_batch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;outputMode&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;update&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;trigger&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;processingTime&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;10 seconds&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;start&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
        
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;query&lt;/span&gt;
    
    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;exactly_once_delivery&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;stream&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;Implement exactly-once delivery semantics&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;
        
        &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;idempotent_write&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;epoch_id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
            &lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;Idempotent write operation&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;
            
            &lt;span class=&quot;c1&quot;&gt;# Add unique identifier for each batch
&lt;/span&gt;            &lt;span class=&quot;n&quot;&gt;df_with_batch_id&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;withColumn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;batch_id&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;lit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;epoch_id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
            
            &lt;span class=&quot;c1&quot;&gt;# Use upsert operation (merge) instead of append
&lt;/span&gt;            &lt;span class=&quot;n&quot;&gt;df_with_batch_id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;createOrReplaceTempView&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;batch_data&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
            
            &lt;span class=&quot;c1&quot;&gt;# Merge through the session that owns the micro-batch (PySpark 3.3+), where the temp view is registered; MERGE INTO needs a table format such as Delta Lake
&lt;/span&gt;            &lt;span class=&quot;n&quot;&gt;df_with_batch_id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sparkSession&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;sql&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;
                MERGE INTO target_table t
                USING batch_data s
                ON t.id = s.id AND t.batch_id = s.batch_id
                WHEN NOT MATCHED THEN
                    INSERT (id, data, batch_id, processed_at)
                    VALUES (s.id, s.data, s.batch_id, current_timestamp())
            &lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        
        &lt;span class=&quot;n&quot;&gt;query&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;stream&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;writeStream&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;foreachBatch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;idempotent_write&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;option&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;checkpointLocation&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;/tmp/checkpoint/exactly-once&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;start&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
        
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;query&lt;/span&gt;
    
    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;send_metrics_to_monitoring&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;metrics&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;batch_id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;Send metrics to external monitoring system&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;
        
        &lt;span class=&quot;n&quot;&gt;monitoring_data&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;timestamp&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;datetime&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;now&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;().&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;isoformat&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt;
            &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;batch_id&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;batch_id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;metrics&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[{&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;event_type&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;row&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;event_type&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;row&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]}&lt;/span&gt; 
                       &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;row&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;metrics&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
        
        &lt;span class=&quot;c1&quot;&gt;# Send to monitoring system (e.g., Prometheus, CloudWatch)
&lt;/span&gt;        &lt;span class=&quot;nf&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sa&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;Sending metrics: &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;json&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;dumps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;monitoring_data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;indent&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    
    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;update_dashboard_cache&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dashboard_data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;Update real-time dashboard cache&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;
        
        &lt;span class=&quot;c1&quot;&gt;# Write to Redis or similar cache
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;dashboard_data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;write&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;format&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;org.apache.spark.sql.redis&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;option&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;table&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;dashboard_metrics&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;option&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;key.column&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;region&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;mode&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;overwrite&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;save&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
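
&lt;p&gt;The idempotent write in &lt;code&gt;exactly_once_delivery&lt;/code&gt; can be illustrated without Spark. The following is a minimal sketch in plain Python, assuming an in-memory dictionary stands in for &lt;code&gt;target_table&lt;/code&gt;; the names &lt;code&gt;InMemorySink&lt;/code&gt; and &lt;code&gt;merge_batch&lt;/code&gt; are hypothetical and are not part of any Spark API.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Hypothetical in-memory stand-in for target_table; not a Spark or Delta Lake API.
class InMemorySink:
    def __init__(self):
        # Rows keyed by (id, batch_id), mirroring the MERGE ON condition.
        self.rows = {}

    def merge_batch(self, records, batch_id):
        '''Insert only rows whose (id, batch_id) key is not already present.'''
        inserted = 0
        for record in records:
            key = (record['id'], batch_id)
            if key not in self.rows:
                self.rows[key] = record['data']
                inserted += 1
        return inserted


sink = InMemorySink()
batch = [{'id': 1, 'data': 'a'}, {'id': 2, 'data': 'b'}]

first = sink.merge_batch(batch, batch_id=7)   # first delivery of batch 7
replay = sink.merge_batch(batch, batch_id=7)  # same batch replayed after a failure
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Replaying the batch inserts nothing, so a retry after a failure leaves the sink unchanged. Because the key includes &lt;code&gt;batch_id&lt;/code&gt;, the same &lt;code&gt;id&lt;/code&gt; arriving in a later batch would still be inserted as a new row.&lt;/p&gt;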

&lt;h2 id=&quot;performance-optimization&quot;&gt;Performance Optimization&lt;/h2&gt;

&lt;h3 id=&quot;tuning-parameters&quot;&gt;Tuning Parameters&lt;/h3&gt;
&lt;div class=&quot;language-scala highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;// Performance tuning configuration&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;org.apache.spark.streaming.&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;Seconds&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;StreamingContext&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;org.apache.spark.SparkConf&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;object&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;PerformanceTuning&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
  
  &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;optimizedSparkConf&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;SparkConf&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;SparkConf&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt;
      &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;setAppName&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;OptimizedStreamingApp&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
      
      &lt;span class=&quot;c1&quot;&gt;// Memory settings&lt;/span&gt;
      &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;set&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;spark.executor.memory&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;4g&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
      &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;set&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;spark.memory.fraction&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;0.8&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;// Share of heap used for execution and storage&lt;/span&gt;
      &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;set&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;spark.streaming.receiver.maxRate&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;10000&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;// Max records per second per receiver&lt;/span&gt;
      
      &lt;span class=&quot;c1&quot;&gt;// Batch interval optimization&lt;/span&gt;
      &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;set&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;spark.streaming.blockInterval&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;200ms&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;     &lt;span class=&quot;c1&quot;&gt;// Block interval for receivers&lt;/span&gt;
      
      &lt;span class=&quot;c1&quot;&gt;// Backpressure (dynamic rate limiting)&lt;/span&gt;
      &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;set&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;spark.streaming.backpressure.enabled&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;true&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
      &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;set&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;spark.streaming.backpressure.initialRate&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;1000&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
      
      &lt;span class=&quot;c1&quot;&gt;// Kafka-specific optimizations&lt;/span&gt;
      &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;set&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;spark.streaming.kafka.maxRatePerPartition&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;2000&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
      
      &lt;span class=&quot;c1&quot;&gt;// Graceful shutdown (finish in-flight batches before stopping)&lt;/span&gt;
      &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;set&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;spark.streaming.stopGracefullyOnShutdown&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;true&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
      
      &lt;span class=&quot;c1&quot;&gt;// Serialization&lt;/span&gt;
      &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;set&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;spark.serializer&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;org.apache.spark.serializer.KryoSerializer&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
      
      &lt;span class=&quot;c1&quot;&gt;// Garbage collection (JDK 8-style logging flag; JDK 9+ uses -Xlog:gc* instead)&lt;/span&gt;
      &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;set&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;spark.executor.extraJavaOptions&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; 
           &lt;span class=&quot;s&quot;&gt;&quot;-XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:+PrintGCDetails&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
  
  &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;createOptimizedStreamingContext&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;StreamingContext&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;conf&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;optimizedSparkConf&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;ssc&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;StreamingContext&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;conf&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Seconds&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;))&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;// 2-second batch interval&lt;/span&gt;
    
    &lt;span class=&quot;c1&quot;&gt;// Set checkpoint directory for fault tolerance&lt;/span&gt;
    &lt;span class=&quot;nv&quot;&gt;ssc&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;checkpoint&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;hdfs://namenode:port/streaming/checkpoint&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
    
    &lt;span class=&quot;n&quot;&gt;ssc&lt;/span&gt;
  &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
  
  &lt;span class=&quot;c1&quot;&gt;// Optimization techniques&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;optimizationTechniques&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ssc&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;StreamingContext&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Unit&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
    
    &lt;span class=&quot;c1&quot;&gt;// 1. Appropriate batch interval&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;// Rule of thumb: processing time should be &amp;lt; batch interval&lt;/span&gt;
    
    &lt;span class=&quot;c1&quot;&gt;// 2. Parallelism optimization&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;inputDStream&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;ssc&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;socketTextStream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;localhost&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;9999&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
    
    &lt;span class=&quot;c1&quot;&gt;// Repartition if needed to increase parallelism&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;repartitioned&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;inputDStream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;repartition&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
    
    &lt;span class=&quot;c1&quot;&gt;// 3. Caching for iterative operations&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;words&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;repartitioned&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;flatMap&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;_&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;split&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot; &quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;))&lt;/span&gt;
    &lt;span class=&quot;nv&quot;&gt;words&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;cache&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;// Cache if used multiple times&lt;/span&gt;
    
    &lt;span class=&quot;c1&quot;&gt;// 4. Efficient state management&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;wordCounts&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;words&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;map&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;_&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;))&lt;/span&gt;
      &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;reduceByKeyAndWindow&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;
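        &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;// Reduce function (explicit types help resolve the overloaded method)&lt;/span&gt;
        &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;// Inverse reduce function (incremental update; requires checkpointing)&lt;/span&gt;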
        &lt;span class=&quot;nc&quot;&gt;Seconds&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;60&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;),&lt;/span&gt;     &lt;span class=&quot;c1&quot;&gt;// Window duration&lt;/span&gt;
        &lt;span class=&quot;nc&quot;&gt;Seconds&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;20&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;      &lt;span class=&quot;c1&quot;&gt;// Slide duration&lt;/span&gt;
      &lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
    
    &lt;span class=&quot;c1&quot;&gt;// 5. Broadcast variables for lookup tables&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;broadcastLookup&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;ssc&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;sparkContext&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;broadcast&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;loadLookupTable&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;())&lt;/span&gt;
    
    &lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;enriched&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;words&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;map&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;word&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&amp;gt;&lt;/span&gt;
      &lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;lookup&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;broadcastLookup&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;value&lt;/span&gt;
      &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;word&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;lookup&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;getOrElse&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;word&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;unknown&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;))&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
  &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
  
  &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;loadLookupTable&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Map&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt;, &lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;// Load lookup data&lt;/span&gt;
    &lt;span class=&quot;nc&quot;&gt;Map&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;hello&quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;greeting&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;world&quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;earth&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
  &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
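
&lt;p&gt;The inverse reduce function passed to &lt;code&gt;reduceByKeyAndWindow&lt;/code&gt; lets a sliding window be updated incrementally: the values entering the window are added and the values leaving it are subtracted, rather than re-reducing the whole window on every slide. The following is a minimal sketch of that idea in plain Python, not Spark code; the names &lt;code&gt;sliding_sums&lt;/code&gt; and &lt;code&gt;counts&lt;/code&gt; are hypothetical, and the window is measured in number of values rather than seconds.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;from collections import deque


def sliding_sums(values, window):
    '''Yield the sum of the last `window` values, updated incrementally.'''
    current = deque()
    total = 0
    for value in values:
        current.append(value)
        total += value                  # reduce step: add the entering value
        if len(current) == window + 1:
            total -= current.popleft()  # inverse reduce step: subtract the leaving value
        yield total


counts = [3, 1, 4, 1, 5, 9]
incremental = list(sliding_sums(counts, window=3))

# Reference result: re-sum every window from scratch.
recomputed = [sum(counts[max(0, i - 2):i + 1]) for i in range(len(counts))]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Both lists hold the same window sums, but the incremental version does a constant amount of work per new value regardless of the window length. This only works when the reduce function has an inverse, as addition does.&lt;/p&gt;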

&lt;h3 id=&quot;memory-management&quot;&gt;Memory Management&lt;/h3&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;# Memory management strategies
&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;MemoryManagement&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
    
    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;configure_memory_settings&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;Configure memory settings for streaming&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;
        
        &lt;span class=&quot;n&quot;&gt;spark_conf&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;c1&quot;&gt;# Executor memory settings
&lt;/span&gt;            &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;spark.executor.memory&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;8g&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;spark.memory.fraction&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;0.75&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;spark.executor.cores&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
            
            &lt;span class=&quot;c1&quot;&gt;# State store maintenance settings (Structured Streaming)
&lt;/span&gt;            &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;spark.sql.streaming.stateStore.maintenanceInterval&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;60s&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;spark.sql.streaming.stateStore.minDeltasForSnapshot&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
            
            &lt;span class=&quot;c1&quot;&gt;# Shuffle and serialization settings
&lt;/span&gt;            &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;spark.sql.shuffle.partitions&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;200&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;spark.serializer&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;org.apache.spark.serializer.KryoSerializer&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
            
            &lt;span class=&quot;c1&quot;&gt;# Garbage collection (JDK 8-style logging flag; JDK 9+ uses -Xlog:gc* instead)
&lt;/span&gt;            &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;spark.executor.extraJavaOptions&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; 
                &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;-XX:+UseG1GC -XX:MaxGCPauseMillis=200&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
        
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;spark_conf&lt;/span&gt;
    
    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;memory_efficient_processing&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;stream&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;Memory-efficient stream processing patterns&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;
        
        &lt;span class=&quot;c1&quot;&gt;# 1. Process data in smaller batches
&lt;/span&gt;        &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;process_partition&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;partition&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
            &lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;Process partition efficiently&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;batch_size&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1000&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;batch&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[]&lt;/span&gt;
            
            &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;record&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;partition&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;batch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;append&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;record&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
                
                &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;batch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;batch_size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
                    &lt;span class=&quot;c1&quot;&gt;# Process batch
&lt;/span&gt;                    &lt;span class=&quot;k&quot;&gt;yield&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;process_batch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;batch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
                    &lt;span class=&quot;n&quot;&gt;batch&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[]&lt;/span&gt;
            
            &lt;span class=&quot;c1&quot;&gt;# Process remaining records
&lt;/span&gt;            &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;batch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
                &lt;span class=&quot;k&quot;&gt;yield&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;process_batch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;batch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        
        &lt;span class=&quot;c1&quot;&gt;# 2. Use mapPartitions for memory efficiency
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;efficient_stream&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;stream&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;mapPartitions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;process_partition&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        
        &lt;span class=&quot;c1&quot;&gt;# 3. Avoid collecting large datasets
&lt;/span&gt;        &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;safe_foreach_batch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;epoch_id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
            &lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;Safe processing without collecting all data&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;
            
            &lt;span class=&quot;c1&quot;&gt;# Process in chunks instead of collecting all
&lt;/span&gt;            &lt;span class=&quot;n&quot;&gt;total_count&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;chunk_size&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;10000&lt;/span&gt;
            
            &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;total_count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;chunk_size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
                &lt;span class=&quot;c1&quot;&gt;# Skip i rows first, then take one chunk (DataFrame.offset needs Spark 3.4+; row order is not guaranteed without a sort)
&lt;/span&gt;                &lt;span class=&quot;n&quot;&gt;chunk&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;offset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;limit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;chunk_size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;process_chunk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;chunk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;collect&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;())&lt;/span&gt;
        
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;efficient_stream&lt;/span&gt;
    
    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;process_batch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;batch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;Process a batch of records&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;
        &lt;span class=&quot;c1&quot;&gt;# Implement batch processing logic
&lt;/span&gt;        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;record&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;upper&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;record&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;batch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
    
    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;process_chunk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;chunk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;Process a chunk of data&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;
        &lt;span class=&quot;c1&quot;&gt;# Implement chunk processing logic
&lt;/span&gt;        &lt;span class=&quot;nf&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sa&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;Processing chunk of &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;chunk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s&quot;&gt; records&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
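&lt;p&gt;The batching pattern in &lt;code&gt;process_partition&lt;/code&gt; can be tried without a Spark cluster. The following is a minimal sketch in plain Python, not the class above: &lt;code&gt;iter_batches&lt;/code&gt; is a hypothetical helper name, and the upper-casing step stands in for &lt;code&gt;process_batch&lt;/code&gt;. Unlike the listing above, where each &lt;code&gt;yield&lt;/code&gt; emits a whole list, this version flattens the processed batches so the consumer receives individual records.&lt;/p&gt;

```python
from itertools import islice


def iter_batches(records, batch_size=1000):
    """Yield lists of at most batch_size records from any iterable."""
    iterator = iter(records)
    while True:
        # islice pulls only batch_size items, so one batch is in memory at a time
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def process_partition(partition, batch_size=1000):
    """Process a partition batch by batch, emitting one record at a time."""
    for batch in iter_batches(partition, batch_size):
        # Stand-in for process_batch: upper-case every record in the batch
        yield from (record.upper() for record in batch)


if __name__ == "__main__":
    sample = ["a", "b", "c", "d", "e"]
    print([len(batch) for batch in iter_batches(sample, batch_size=2)])
    print(list(process_partition(sample, batch_size=2)))
```

&lt;p&gt;Because both functions are generators, a function shaped like this &lt;code&gt;process_partition&lt;/code&gt; could be passed to &lt;code&gt;mapPartitions&lt;/code&gt; without materializing a partition; whether flattening is wanted depends on what the downstream stage expects.&lt;/p&gt;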

&lt;h2 id=&quot;fault-tolerance-and-checkpointing&quot;&gt;Fault Tolerance and Checkpointing&lt;/h2&gt;

&lt;h3 id=&quot;checkpointing-mechanism&quot;&gt;Checkpointing Mechanism&lt;/h3&gt;
&lt;div class=&quot;language-java highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;// Checkpointing and fault tolerance&lt;/span&gt;
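&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;java.util.Arrays&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;

&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;org.apache.spark.SparkConf&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;org.apache.spark.api.java.function.Function0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;org.apache.spark.streaming.Durations&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;org.apache.spark.streaming.api.java.JavaDStream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;org.apache.spark.streaming.api.java.JavaPairDStream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;org.apache.spark.streaming.api.java.JavaStreamingContext&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;scala.Tuple2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;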

&lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;FaultTolerance&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
    
    &lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;JavaStreamingContext&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;createStreamingContext&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;checkpointDir&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
        
        &lt;span class=&quot;c1&quot;&gt;// Function to create new StreamingContext&lt;/span&gt;
        &lt;span class=&quot;nc&quot;&gt;Function0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;JavaStreamingContext&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;createContext&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;nc&quot;&gt;SparkConf&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;conf&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;SparkConf&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt;
                &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;setAppName&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;FaultTolerantStreamingApp&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
                &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;setMaster&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;local[*]&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;
            
            &lt;span class=&quot;nc&quot;&gt;JavaStreamingContext&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;jssc&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;JavaStreamingContext&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;conf&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Durations&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;seconds&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;));&lt;/span&gt;
            
            &lt;span class=&quot;c1&quot;&gt;// Set checkpoint directory&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;jssc&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;checkpoint&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;checkpointDir&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;
            
            &lt;span class=&quot;c1&quot;&gt;// Define streaming computation&lt;/span&gt;
            &lt;span class=&quot;nc&quot;&gt;JavaDStream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lines&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;jssc&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;socketTextStream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;localhost&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;9999&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;
            
            &lt;span class=&quot;nc&quot;&gt;JavaDStream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;words&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lines&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;flatMap&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;line&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; 
                &lt;span class=&quot;nc&quot;&gt;Arrays&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;asList&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;line&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;split&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot; &quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)).&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;iterator&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;());&lt;/span&gt;
            
            &lt;span class=&quot;nc&quot;&gt;JavaPairDStream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Integer&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pairs&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;words&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;mapToPair&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;word&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; 
                &lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Tuple2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&amp;gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;word&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;));&lt;/span&gt;
            
            &lt;span class=&quot;nc&quot;&gt;JavaPairDStream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Integer&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;wordCounts&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pairs&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;reduceByKey&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;
            
            &lt;span class=&quot;n&quot;&gt;wordCounts&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;();&lt;/span&gt;
            
            &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;jssc&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
        &lt;span class=&quot;o&quot;&gt;};&lt;/span&gt;
        
        &lt;span class=&quot;c1&quot;&gt;// Get or create StreamingContext from checkpoint&lt;/span&gt;
        &lt;span class=&quot;nc&quot;&gt;JavaStreamingContext&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;jssc&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;JavaStreamingContext&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;getOrCreate&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;checkpointDir&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;createContext&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;
        
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;jssc&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
    
    &lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;gracefulShutdown&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;JavaStreamingContext&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;jssc&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;throws&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;InterruptedException&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;c1&quot;&gt;// Add shutdown hook for graceful termination&lt;/span&gt;
        &lt;span class=&quot;nc&quot;&gt;Runtime&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;getRuntime&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;().&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;addShutdownHook&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Thread&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;nc&quot;&gt;System&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;out&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;println&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Shutting down streaming application gracefully...&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;jssc&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;stop&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;true&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kc&quot;&gt;true&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;// stopSparkContext = true, stopGracefully = true (finish processing received data first)&lt;/span&gt;
            &lt;span class=&quot;nc&quot;&gt;System&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;out&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;println&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Application stopped.&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;
        &lt;span class=&quot;o&quot;&gt;}));&lt;/span&gt;
        
        &lt;span class=&quot;n&quot;&gt;jssc&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;start&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;();&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;jssc&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;awaitTermination&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;();&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
    
    &lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;recoverFromFailure&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;nc&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;checkpointDir&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;hdfs://namenode:port/streaming/checkpoint&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
        
        &lt;span class=&quot;k&quot;&gt;try&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;nc&quot;&gt;JavaStreamingContext&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;jssc&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;createStreamingContext&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;checkpointDir&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;gracefulShutdown&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;jssc&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;
            
        &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;catch&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;Exception&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;e&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;nc&quot;&gt;System&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;err&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;println&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Failed to start streaming context: &quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;e&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;getMessage&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;());&lt;/span&gt;
            
            &lt;span class=&quot;c1&quot;&gt;// Implement retry logic&lt;/span&gt;
            &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;maxRetries&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
            &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;retryCount&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
            
            &lt;span class=&quot;k&quot;&gt;while&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;retryCount&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;maxRetries&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
                &lt;span class=&quot;k&quot;&gt;try&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
                    &lt;span class=&quot;nc&quot;&gt;Thread&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;sleep&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5000&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;// Wait before retry&lt;/span&gt;
                    &lt;span class=&quot;nc&quot;&gt;JavaStreamingContext&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;jssc&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;createStreamingContext&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;checkpointDir&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;
                    &lt;span class=&quot;n&quot;&gt;gracefulShutdown&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;jssc&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;
                    &lt;span class=&quot;k&quot;&gt;break&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
                    
                &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;catch&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;Exception&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;retryException&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
                    &lt;span class=&quot;n&quot;&gt;retryCount&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;++;&lt;/span&gt;
                    &lt;span class=&quot;nc&quot;&gt;System&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;err&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;println&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Retry &quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;retryCount&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot; failed: &quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; 
                                     &lt;span class=&quot;n&quot;&gt;retryException&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;getMessage&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;());&lt;/span&gt;
                &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
            &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
        &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
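&lt;p&gt;The retry logic in &lt;code&gt;recoverFromFailure&lt;/code&gt; (one initial attempt, then a bounded number of retries with a fixed pause) can be isolated from Spark and tested on its own. The sketch below shows that control flow in Python under stated assumptions: &lt;code&gt;run_with_retries&lt;/code&gt; and its parameters are hypothetical names, &lt;code&gt;start&lt;/code&gt; stands in for creating and starting the streaming context, and the injectable &lt;code&gt;sleep&lt;/code&gt; exists only to keep tests fast. One deliberate difference from the Java listing: when every attempt fails, this version raises instead of returning silently.&lt;/p&gt;

```python
import time


def run_with_retries(start, max_retries=3, delay_seconds=5.0, sleep=time.sleep):
    """Call start() once, then retry up to max_retries times on failure."""
    last_error = None
    for attempt in range(max_retries + 1):
        if attempt != 0:
            # Pause before each retry, but not before the first attempt
            sleep(delay_seconds)
        try:
            return start()
        except Exception as error:  # broad catch mirrors the Java listing
            last_error = error
            print(f"Attempt {attempt + 1} failed: {error}")
    # Surface the final failure instead of giving up silently
    raise RuntimeError("all attempts failed") from last_error
```

&lt;p&gt;A caller would pass a zero-argument function, for example &lt;code&gt;run_with_retries(start_streaming_job)&lt;/code&gt;, where &lt;code&gt;start_streaming_job&lt;/code&gt; is a placeholder for application code.&lt;/p&gt;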

&lt;h2 id=&quot;real-world-example-real-time-analytics-pipeline&quot;&gt;Real-world Example: Real-time Analytics Pipeline&lt;/h2&gt;

&lt;h3 id=&quot;complete-end-to-end-example&quot;&gt;Complete End-to-End Example&lt;/h3&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;# Complete real-time analytics pipeline
&lt;/span&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pyspark.sql&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;SparkSession&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pyspark.sql.functions&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pyspark.sql.types&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;json&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;RealTimeAnalyticsPipeline&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
    
    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;__init__&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;spark&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;SparkSession&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;builder&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;appName&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;RealTimeAnalyticsPipeline&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;config&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;spark.sql.streaming.checkpointLocation&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;/tmp/checkpoint&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;config&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;spark.sql.streaming.stateStore.maintenanceInterval&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;60s&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;getOrCreate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
        
        &lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;spark&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sparkContext&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;setLogLevel&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;WARN&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    
    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;setup_schemas&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;Define schemas for different event types&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;
        
        &lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;user_event_schema&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;StructType&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;
            &lt;span class=&quot;nc&quot;&gt;StructField&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;user_id&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;StringType&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
            &lt;span class=&quot;nc&quot;&gt;StructField&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;session_id&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;StringType&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
            &lt;span class=&quot;nc&quot;&gt;StructField&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;event_type&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;StringType&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
            &lt;span class=&quot;nc&quot;&gt;StructField&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;page_url&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;StringType&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
            &lt;span class=&quot;nc&quot;&gt;StructField&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;timestamp&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;LongType&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
            &lt;span class=&quot;nc&quot;&gt;StructField&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;user_agent&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;StringType&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
            &lt;span class=&quot;nc&quot;&gt;StructField&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;ip_address&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;StringType&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;
        
        &lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;transaction_schema&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;StructType&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;
            &lt;span class=&quot;nc&quot;&gt;StructField&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;transaction_id&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;StringType&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
            &lt;span class=&quot;nc&quot;&gt;StructField&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;user_id&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;StringType&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
            &lt;span class=&quot;nc&quot;&gt;StructField&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;amount&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;DoubleType&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
            &lt;span class=&quot;nc&quot;&gt;StructField&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;currency&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;StringType&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
            &lt;span class=&quot;nc&quot;&gt;StructField&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;merchant_id&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;StringType&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
            &lt;span class=&quot;nc&quot;&gt;StructField&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;timestamp&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;LongType&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
            &lt;span class=&quot;nc&quot;&gt;StructField&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;payment_method&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;StringType&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;
    
    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;create_input_streams&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;Create input streams from Kafka&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;
        
        &lt;span class=&quot;c1&quot;&gt;# User events stream
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;user_events_raw&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;spark&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;readStream&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;format&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;kafka&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;option&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;kafka.bootstrap.servers&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;localhost:9092&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;option&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;subscribe&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;user-events&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;option&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;startingOffsets&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;latest&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;load&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
        
        &lt;span class=&quot;c1&quot;&gt;# Parse user events
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;user_events&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;user_events_raw&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;select&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
                &lt;span class=&quot;nf&quot;&gt;from_json&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;col&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;value&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;cast&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;string&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;user_event_schema&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;alias&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
                &lt;span class=&quot;nf&quot;&gt;col&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;timestamp&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;alias&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;kafka_timestamp&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
            &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;select&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;data.*&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;kafka_timestamp&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;withColumn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;event_time&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;from_unixtime&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;col&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;timestamp&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)).&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;cast&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;timestamp&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;withWatermark&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;event_time&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;10 minutes&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        
        &lt;span class=&quot;c1&quot;&gt;# Transaction events stream
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;transactions_raw&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;spark&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;readStream&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;format&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;kafka&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;option&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;kafka.bootstrap.servers&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;localhost:9092&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;option&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;subscribe&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;transactions&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;option&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;startingOffsets&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;latest&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;load&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
        
        &lt;span class=&quot;c1&quot;&gt;# Parse transactions
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;transactions&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;transactions_raw&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;select&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
                &lt;span class=&quot;nf&quot;&gt;from_json&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;col&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;value&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;cast&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;string&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;transaction_schema&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;alias&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
            &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;select&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;data.*&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;withColumn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;transaction_time&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;from_unixtime&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;col&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;timestamp&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)).&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;cast&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;timestamp&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;withWatermark&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;transaction_time&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;5 minutes&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    
    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;real_time_user_analytics&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;Real-time user behavior analytics&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;
        
        &lt;span class=&quot;c1&quot;&gt;# Page view analytics
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;page_views&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;user_events&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;filter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;col&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;event_type&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;page_view&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;groupBy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
                &lt;span class=&quot;nf&quot;&gt;window&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;col&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;event_time&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;5 minutes&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;1 minute&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
                &lt;span class=&quot;nf&quot;&gt;col&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;page_url&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
            &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;agg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
                &lt;span class=&quot;nf&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;alias&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;view_count&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
                &lt;span class=&quot;c1&quot;&gt;# Exact distinct counts are not supported in streaming aggregations, so use approximate counts
&lt;/span&gt;                &lt;span class=&quot;nf&quot;&gt;approx_count_distinct&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;user_id&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;alias&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;unique_visitors&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
                &lt;span class=&quot;nf&quot;&gt;approx_count_distinct&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;session_id&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;alias&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;unique_sessions&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
            &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;withColumn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;avg_views_per_visitor&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                       &lt;span class=&quot;nf&quot;&gt;col&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;view_count&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;col&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;unique_visitors&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
        
        &lt;span class=&quot;c1&quot;&gt;# User session analytics
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;session_analytics&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;user_events&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;groupBy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
                &lt;span class=&quot;nf&quot;&gt;window&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;col&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;event_time&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;10 minutes&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;2 minutes&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
                &lt;span class=&quot;nf&quot;&gt;col&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;session_id&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
                &lt;span class=&quot;nf&quot;&gt;col&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;user_id&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
            &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;agg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
                &lt;span class=&quot;nf&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;alias&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;events_in_session&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
                &lt;span class=&quot;nf&quot;&gt;collect_list&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;page_url&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;alias&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;page_sequence&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
                &lt;span class=&quot;nf&quot;&gt;min&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;event_time&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;alias&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;session_start&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
                &lt;span class=&quot;nf&quot;&gt;max&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;event_time&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;alias&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;session_end&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
            &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;withColumn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;session_duration_minutes&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                       &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;unix_timestamp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;session_end&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;unix_timestamp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;session_start&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;60&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;page_views&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;session_analytics&lt;/span&gt;
    
    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;fraud_detection&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;Real-time fraud detection&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;
        
        &lt;span class=&quot;c1&quot;&gt;# Detect suspicious transaction patterns
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;suspicious_transactions&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;transactions&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;groupBy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
                &lt;span class=&quot;nf&quot;&gt;window&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;col&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;transaction_time&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;1 minute&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
                &lt;span class=&quot;nf&quot;&gt;col&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;user_id&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
            &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;agg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
                &lt;span class=&quot;nf&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;alias&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;transaction_count&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
                &lt;span class=&quot;nf&quot;&gt;sum&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;amount&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;alias&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;total_amount&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
                &lt;span class=&quot;c1&quot;&gt;# Streaming aggregations do not support exact distinct counts; use the approximate variant (assumes approx_count_distinct is imported from pyspark.sql.functions)&lt;/span&gt;
                &lt;span class=&quot;nf&quot;&gt;approx_count_distinct&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;merchant_id&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;alias&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;unique_merchants&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
                &lt;span class=&quot;nf&quot;&gt;collect_list&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;payment_method&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;alias&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;payment_methods&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
            &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;filter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
                &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;col&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;transaction_count&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# Too many transactions
&lt;/span&gt;                &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;col&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;total_amount&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;10000&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt;    &lt;span class=&quot;c1&quot;&gt;# High amount
&lt;/span&gt;                &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;col&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;unique_merchants&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;      &lt;span class=&quot;c1&quot;&gt;# Too many different merchants
&lt;/span&gt;            &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;withColumn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;fraud_score&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
                       &lt;span class=&quot;nf&quot;&gt;col&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;transaction_count&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.3&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; 
                       &lt;span class=&quot;nf&quot;&gt;col&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;total_amount&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1000&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.4&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; 
                       &lt;span class=&quot;nf&quot;&gt;col&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;unique_merchants&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;filter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;col&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;fraud_score&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;5.0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;suspicious_transactions&lt;/span&gt;
    
    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;real_time_recommendations&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;Generate real-time recommendations&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;
        
        &lt;span class=&quot;c1&quot;&gt;# User behavior patterns
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;user_patterns&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;user_events&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;filter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;col&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;event_type&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;isin&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;page_view&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;click&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;purchase&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]))&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;groupBy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
                &lt;span class=&quot;nf&quot;&gt;window&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;col&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;event_time&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;30 minutes&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;5 minutes&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
                &lt;span class=&quot;nf&quot;&gt;col&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;user_id&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
            &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;agg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
                &lt;span class=&quot;nf&quot;&gt;collect_list&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;page_url&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;alias&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;visited_pages&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
                &lt;span class=&quot;nf&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;alias&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;activity_level&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
            &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        
        &lt;span class=&quot;c1&quot;&gt;# Aggregate transaction data into per-user purchase behavior (joined with user_patterns below)
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;purchase_behavior&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;transactions&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;groupBy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
                &lt;span class=&quot;nf&quot;&gt;window&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;col&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;transaction_time&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;30 minutes&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;5 minutes&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
                &lt;span class=&quot;nf&quot;&gt;col&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;user_id&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
            &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;agg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
                &lt;span class=&quot;nf&quot;&gt;collect_list&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;merchant_id&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;alias&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;purchased_from&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
                &lt;span class=&quot;nf&quot;&gt;avg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;amount&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;alias&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;avg_purchase_amount&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
            &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        
        &lt;span class=&quot;c1&quot;&gt;# Combine for recommendations
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;recommendation_data&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;user_patterns&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;join&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;purchase_behavior&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;window&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;user_id&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
            &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;left_outer&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;recommendation_data&lt;/span&gt;
    
    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;setup_outputs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;Setup output sinks&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;
        
        &lt;span class=&quot;n&quot;&gt;page_views&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;session_analytics&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;real_time_user_analytics&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;suspicious_transactions&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;fraud_detection&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;recommendations&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;real_time_recommendations&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
        
        &lt;span class=&quot;c1&quot;&gt;# Output 1: Page views to console (for monitoring)
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;page_views_query&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;page_views&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;writeStream&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;outputMode&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;update&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;format&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;console&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;option&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;truncate&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;False&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;trigger&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;processingTime&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;30 seconds&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;start&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
        
        &lt;span class=&quot;c1&quot;&gt;# Output 2: Fraud alerts to console (stand-in for a database sink)
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;fraud_query&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;suspicious_transactions&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;writeStream&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;outputMode&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;update&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;format&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;console&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;option&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;truncate&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;False&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;trigger&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;processingTime&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;10 seconds&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;start&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
        
        &lt;span class=&quot;c1&quot;&gt;# Output 3: Session analytics to Parquet files
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;session_query&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;session_analytics&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;writeStream&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;outputMode&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;append&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;format&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;parquet&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;option&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;path&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;/tmp/session-analytics&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;option&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;checkpointLocation&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;/tmp/checkpoint/sessions&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;trigger&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;processingTime&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;60 seconds&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;start&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
        
        &lt;span class=&quot;c1&quot;&gt;# Output 4: Recommendations to Kafka
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;recommendations_query&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;recommendations&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;select&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;to_json&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;struct&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)).&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;alias&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;value&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;writeStream&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;format&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;kafka&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;option&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;kafka.bootstrap.servers&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;localhost:9092&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;option&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;topic&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;recommendations&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;option&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;checkpointLocation&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;/tmp/checkpoint/recommendations&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
            &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;start&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
        
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;page_views_query&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;fraud_query&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;session_query&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;recommendations_query&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
    
    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;run_pipeline&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;Run the complete pipeline&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&quot;&quot;&lt;/span&gt;
        
        &lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;setup_schemas&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;create_input_streams&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;queries&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;setup_outputs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
        
        &lt;span class=&quot;c1&quot;&gt;# Wait for all queries to terminate
&lt;/span&gt;        &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;query&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;queries&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;query&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;awaitTermination&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Usage
&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;__name__&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;__main__&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;pipeline&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;RealTimeAnalyticsPipeline&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;pipeline&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;run_pipeline&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Spark Streaming provides a powerful platform for real-time data processing with the following key advantages:&lt;/p&gt;

&lt;h3 id=&quot;key-benefits&quot;&gt;Key Benefits&lt;/h3&gt;
&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Unified Platform&lt;/strong&gt;: Same API for batch and stream processing&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Fault Tolerance&lt;/strong&gt;: Automatic recovery from failures&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Scalability&lt;/strong&gt;: Scales horizontally; throughput grows close to linearly with cluster size for well-partitioned workloads&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Integration&lt;/strong&gt;: Rich ecosystem integration (Kafka, HDFS, databases)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Exactly-once Processing&lt;/strong&gt;: End-to-end exactly-once guarantees, provided the source is replayable and the sink is idempotent&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;best-practices&quot;&gt;Best Practices&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;Choose appropriate batch intervals based on latency requirements&lt;/li&gt;
  &lt;li&gt;Use Structured Streaming for complex event processing&lt;/li&gt;
  &lt;li&gt;Implement proper checkpointing for fault tolerance&lt;/li&gt;
  &lt;li&gt;Monitor and tune performance continuously&lt;/li&gt;
  &lt;li&gt;Design for exactly-once semantics when needed&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;when-to-use-spark-streaming&quot;&gt;When to Use Spark Streaming&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Real-time Analytics&lt;/strong&gt;: Dashboard updates, metrics calculation&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;ETL Pipelines&lt;/strong&gt;: Continuous data transformation and loading&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Fraud Detection&lt;/strong&gt;: Real-time anomaly detection&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;IoT Processing&lt;/strong&gt;: Sensor data analysis&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Log Processing&lt;/strong&gt;: Real-time log analysis and alerting&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the next chapter, we’ll explore Spark SQL for structured data processing and how it complements streaming workloads.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>05 Apache Spark Fundamentals</title>
   <link href="https://nglelinh.github.io/contents/en/chapter05/05_Introduction/"/>
   <updated>2021-01-01T00:00:00+00:00</updated>
   <id>https://nglelinh.github.io/service-oriented-architecture-and-cloud-computing-iuh/contents/en/chapter05/05_Introduction</id>
   <content type="html">&lt;p&gt;Apache Spark is a unified analytics engine for large-scale data processing, offering significant performance improvements over traditional MapReduce through in-memory computing and advanced optimization techniques.&lt;/p&gt;

&lt;h2 id=&quot;learning-objectives&quot;&gt;Learning Objectives&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;Understand Spark architecture and core concepts&lt;/li&gt;
  &lt;li&gt;Learn RDD (Resilient Distributed Dataset) programming&lt;/li&gt;
  &lt;li&gt;Explore Spark’s execution model and optimization&lt;/li&gt;
  &lt;li&gt;Compare Spark with MapReduce&lt;/li&gt;
  &lt;li&gt;Set up and configure Spark clusters&lt;/li&gt;
  &lt;li&gt;Write Spark applications in Scala, Python, and Java&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;key-topics&quot;&gt;Key Topics&lt;/h2&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Spark Core Architecture&lt;/strong&gt;
    &lt;ul&gt;
      &lt;li&gt;Driver and executor processes&lt;/li&gt;
      &lt;li&gt;Cluster managers (Standalone, YARN, Mesos, Kubernetes)&lt;/li&gt;
      &lt;li&gt;Spark Context and Spark Session&lt;/li&gt;
      &lt;li&gt;Memory management and caching&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;RDD Programming Model&lt;/strong&gt;
    &lt;ul&gt;
      &lt;li&gt;Creating RDDs from data sources&lt;/li&gt;
      &lt;li&gt;Transformations vs Actions&lt;/li&gt;
      &lt;li&gt;Lazy evaluation and lineage&lt;/li&gt;
      &lt;li&gt;Partitioning and data locality&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Performance and Optimization&lt;/strong&gt;
    &lt;ul&gt;
      &lt;li&gt;In-memory computing advantages&lt;/li&gt;
      &lt;li&gt;Catalyst optimizer&lt;/li&gt;
      &lt;li&gt;Tungsten execution engine&lt;/li&gt;
      &lt;li&gt;Broadcast variables and accumulators&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ol&gt;
</content>
 </entry>
 
 <entry>
   <title>05-01 Apache Spark Fundamentals</title>
   <link href="https://nglelinh.github.io/contents/en/chapter05/05_01_Spark_Fundamentals/"/>
   <updated>2021-01-01T00:00:00+00:00</updated>
   <id>https://nglelinh.github.io/service-oriented-architecture-and-cloud-computing-iuh/contents/en/chapter05/05_01_Spark_Fundamentals</id>
   <content type="html">&lt;p&gt;Apache Spark is a unified analytics engine for large-scale data processing. It provides high-level APIs in Java, Scala, Python, and R, and an optimized engine that supports general execution graphs.&lt;/p&gt;

&lt;h2 id=&quot;why-spark&quot;&gt;Why Spark?&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Limitations of MapReduce:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Slow&lt;/strong&gt;: Heavy reliance on disk I/O (reads/writes to HDFS for every stage).&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Inefficient&lt;/strong&gt;: Not good for iterative algorithms (ML, graph processing).&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Complex&lt;/strong&gt;: Verbose code.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Spark Advantages:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Speed&lt;/strong&gt;: Can run workloads up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk, according to the Spark project; actual gains depend on the workload.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Ease of Use&lt;/strong&gt;: Write applications quickly in Java, Scala, Python, R, and SQL.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Generality&lt;/strong&gt;: Combine SQL, streaming, and complex analytics.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;spark-architecture&quot;&gt;Spark Architecture&lt;/h2&gt;

&lt;h3 id=&quot;components&quot;&gt;Components&lt;/h3&gt;
&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Driver&lt;/strong&gt;: The process running the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;main()&lt;/code&gt; function of the application and creating the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SparkContext&lt;/code&gt;. It schedules tasks.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Executor&lt;/strong&gt;: A distributed agent responsible for executing tasks. It runs in a JVM on a worker node.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Cluster Manager&lt;/strong&gt;: External service for acquiring resources (Standalone, YARN, Mesos, Kubernetes).&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;execution-flow&quot;&gt;Execution Flow&lt;/h3&gt;
&lt;ol&gt;
  &lt;li&gt;Driver converts user code into a DAG of stages, each of which is split into tasks.&lt;/li&gt;
  &lt;li&gt;Driver connects to Cluster Manager to negotiate resources.&lt;/li&gt;
  &lt;li&gt;Cluster Manager launches executors on worker nodes.&lt;/li&gt;
  &lt;li&gt;Driver sends tasks to executors.&lt;/li&gt;
  &lt;li&gt;Executors execute tasks and return results to the driver.&lt;/li&gt;
&lt;/ol&gt;
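
&lt;p&gt;The five steps above can be mimicked in plain Python. This is a toy sketch, not Spark code: a thread pool stands in for the executors, and the names &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;make_tasks&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;run_task&lt;/code&gt;, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;run_job&lt;/code&gt; are hypothetical, chosen only for illustration.&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor


def make_tasks(data, num_partitions):
    """Driver side: split the input into partitions, one task per partition."""
    size = max(1, -(-len(data) // num_partitions))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def run_task(partition):
    """Executor side: process one partition and return a partial result."""
    return sum(x * x for x in partition)


def run_job(data, num_executors=3):
    tasks = make_tasks(data, num_executors)
    # The thread pool stands in for executors launched by the cluster manager.
    with ThreadPoolExecutor(max_workers=num_executors) as pool:
        partial_results = list(pool.map(run_task, tasks))
    # Partial results come back to the driver, which combines them.
    return sum(partial_results)


total = run_job(list(range(10)))
print(total)  # 285
```

&lt;p&gt;In a real cluster the executors are separate JVM processes on worker nodes, so tasks and results are serialized and sent over the network rather than shared in memory.&lt;/p&gt;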

&lt;h2 id=&quot;rdd-resilient-distributed-dataset&quot;&gt;RDD (Resilient Distributed Dataset)&lt;/h2&gt;

&lt;p&gt;The RDD is the primary data abstraction in Spark’s core API:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Resilient&lt;/strong&gt;: Fault-tolerant (recomputes missing partitions using lineage).&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Distributed&lt;/strong&gt;: Data resides on multiple nodes.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Dataset&lt;/strong&gt;: Collection of objects.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;characteristics&quot;&gt;Characteristics&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Immutable&lt;/strong&gt;: Once created, cannot be changed.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Lazy Evaluation&lt;/strong&gt;: Transformations are not executed immediately.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Cacheable&lt;/strong&gt;: Can be persisted in memory for fast reuse.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;rdd-operations&quot;&gt;RDD Operations&lt;/h3&gt;

&lt;h4 id=&quot;1-transformations-lazy&quot;&gt;1. Transformations (Lazy)&lt;/h4&gt;
&lt;p&gt;Create a new RDD from an existing one.&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;map(func)&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;filter(func)&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;flatMap(func)&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;groupByKey()&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;reduceByKey(func)&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;2-actions-eager&quot;&gt;2. Actions (Eager)&lt;/h4&gt;
&lt;p&gt;Run a computation on the dataset, then either return a value to the driver program or write the result to external storage.&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;count()&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;collect()&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;take(n)&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;saveAsTextFile(path)&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;lazy-evaluation&quot;&gt;Lazy Evaluation&lt;/h3&gt;

&lt;p&gt;Spark records transformations as a &lt;strong&gt;DAG (Directed Acyclic Graph)&lt;/strong&gt; but does nothing until an action is called.&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Allows Spark to optimize the execution plan (e.g., pipelining maps and filters).&lt;/li&gt;
  &lt;li&gt;Reduces unneeded data transfer.&lt;/li&gt;
&lt;/ul&gt;
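
&lt;p&gt;A minimal pure-Python sketch of the idea, assuming a toy class named &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LazyDataset&lt;/code&gt; (hypothetical, not part of Spark): transformations only append to a recorded lineage, and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;collect()&lt;/code&gt; action runs all recorded steps in a single pass over the data.&lt;/p&gt;

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them on an action."""

    def __init__(self, source, lineage=()):
        self._source = source    # the original data
        self._lineage = lineage  # tuple of (kind, func) steps, not yet executed

    def map(self, func):
        return LazyDataset(self._source, self._lineage + (("map", func),))

    def filter(self, func):
        return LazyDataset(self._source, self._lineage + (("filter", func),))

    def collect(self):
        """Action: stream each element through every recorded step in one pass."""
        out = []
        for item in self._source:
            keep = True
            for kind, func in self._lineage:
                if kind == "map":
                    item = func(item)
                elif not func(item):
                    keep = False
                    break
            if keep:
                out.append(item)
        return out


calls = []


def traced_double(x):
    calls.append(x)  # record that the function really ran
    return x * 2


numbers = LazyDataset([1, 2, 3, 4, 5])
pipeline = numbers.map(traced_double).filter(lambda x: x % 3 != 0)
print(len(calls))          # 0: nothing has run yet
print(pipeline.collect())  # [2, 4, 8, 10]
print(len(calls))          # 5: the action triggered the work
```

&lt;p&gt;Because each element flows through the map and the filter together, no intermediate collection is built between the two steps, which is the pipelining benefit mentioned above. The recorded lineage is also what would let a lost partition be recomputed from its source.&lt;/p&gt;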

&lt;h2 id=&quot;example-word-count-in-pyspark&quot;&gt;Example: Word Count in PySpark&lt;/h2&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pyspark&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;SparkContext&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;sc&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;SparkContext&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;local&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;Word Count&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# 1. Load data (lazy: creates an RDD, nothing is read yet)
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;text_file&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;textFile&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;input.txt&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# 2. Transformations (Lazy)
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;counts&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;text_file&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;flatMap&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;lambda&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;line&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;line&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;split&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt; &lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; \
             &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;map&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;lambda&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;word&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;word&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; \
             &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;reduceByKey&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;lambda&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# 3. Action (Trigger Execution)
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;counts&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;saveAsTextFile&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;output&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;spark-vs-hadoop-mapreduce&quot;&gt;Spark vs. Hadoop MapReduce&lt;/h2&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Feature&lt;/th&gt;
      &lt;th&gt;Hadoop MapReduce&lt;/th&gt;
      &lt;th&gt;Apache Spark&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Processing&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Disk-based (intermediate results written to disk between stages)&lt;/td&gt;
      &lt;td&gt;In-memory (Cacheable)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Speed&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Slower&lt;/td&gt;
      &lt;td&gt;Up to 100x faster in memory (workload-dependent)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Difficulty&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;High (Verbose Java)&lt;/td&gt;
      &lt;td&gt;Low (High-level APIs)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Use cases&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Batch processing&lt;/td&gt;
      &lt;td&gt;Batch, Streaming, ML, Interactive&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;/h2&gt;

&lt;p&gt;Apache Spark is the successor to MapReduce for most modern big data workloads. Its in-memory capability and rich ecosystem (SQL, Streaming, MLlib) make it a versatile tool for data engineers and data scientists.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>04 Introduction to Hadoop MapReduce</title>
   <link href="https://nglelinh.github.io/contents/en/chapter04/04_Introduction/"/>
   <updated>2021-01-01T00:00:00+00:00</updated>
   <id>https://nglelinh.github.io/service-oriented-architecture-and-cloud-computing-iuh/contents/en/chapter04/04_Introduction</id>
   <content type="html">&lt;p&gt;This chapter delves into MapReduce, the programming paradigm that popularized big data processing on commodity hardware.&lt;/p&gt;

&lt;h2 id=&quot;learning-objectives&quot;&gt;Learning Objectives&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;Understand the MapReduce programming model (Map, Shuffle, Reduce)&lt;/li&gt;
  &lt;li&gt;Write MapReduce programs to solve parallelizable problems (e.g., Word Count)&lt;/li&gt;
  &lt;li&gt;Analyze the flow of data: InputSplit → Mapper → Partitioner → Reducer → Output&lt;/li&gt;
  &lt;li&gt;Understand how MapReduce achieves fault tolerance through re-execution&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;the-paradigm-shift&quot;&gt;The Paradigm Shift&lt;/h2&gt;

&lt;p&gt;MapReduce simplified distributed computing by abstracting the complexities of parallelization, fault tolerance, data distribution, and load balancing. Programmers simply define a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Map&lt;/code&gt; function (to process data) and a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Reduce&lt;/code&gt; function (to aggregate results), and the framework handles the rest.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>04-01 Hadoop MapReduce</title>
   <link href="https://nglelinh.github.io/contents/en/chapter04/04_01_MapReduce/"/>
   <updated>2021-01-01T00:00:00+00:00</updated>
   <id>https://nglelinh.github.io/service-oriented-architecture-and-cloud-computing-iuh/contents/en/chapter04/04_01_MapReduce</id>
   <content type="html">&lt;p&gt;MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster.&lt;/p&gt;

&lt;h2 id=&quot;motivation&quot;&gt;Motivation&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The Challenge:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Processing large amounts of raw data (TB to PB).&lt;/li&gt;
  &lt;li&gt;Conceptually straightforward computations (e.g., counting, sorting).&lt;/li&gt;
  &lt;li&gt;Must finish in a reasonable time.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The Constraints:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Processor Speed&lt;/strong&gt;: Single-core performance gains have slowed as clock frequencies stagnated.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Parallelism&lt;/strong&gt;: Parallelism within a single machine is limited (heat, complexity).&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Distributed Computing&lt;/strong&gt;: The only way to scale is horizontal scaling (adding more commodity machines).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The Solution:&lt;/strong&gt;
Restrict the programming model to simple operations that can be automatically parallelized.&lt;/p&gt;

&lt;h2 id=&quot;what-is-mapreduce&quot;&gt;What is MapReduce?&lt;/h2&gt;

&lt;p&gt;Proposed by Google in 2004, MapReduce hides the messy details of distributed systems:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Parallelization&lt;/li&gt;
  &lt;li&gt;Fault-tolerance&lt;/li&gt;
  &lt;li&gt;Data distribution&lt;/li&gt;
  &lt;li&gt;Load balancing&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;the-model&quot;&gt;The Model&lt;/h3&gt;

&lt;p&gt;A MapReduce job usually splits the input dataset into independent chunks, which are processed by the &lt;strong&gt;map tasks&lt;/strong&gt; in parallel. The framework sorts the map outputs, which then become the input to the &lt;strong&gt;reduce tasks&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Functions:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Map&lt;/strong&gt;: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(key_1, value_1) -&amp;gt; list(key_2, value_2)&lt;/code&gt;
    &lt;ul&gt;
      &lt;li&gt;Takes an input pair and produces a set of intermediate key/value pairs.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Reduce&lt;/strong&gt;: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(key_2, list(value_2)) -&amp;gt; list(key_3, value_3)&lt;/code&gt;
    &lt;ul&gt;
      &lt;li&gt;Takes an intermediate key and the set of values for that key, and merges them to form a smaller set of values.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;example-word-count&quot;&gt;Example: Word Count&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Count the occurrences of each word in a large document collection.&lt;/p&gt;

&lt;h3 id=&quot;input&quot;&gt;Input&lt;/h3&gt;
&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&quot;Apple Banana Apple Cherry Banana Apple&quot;&lt;/code&gt;&lt;/p&gt;

&lt;h3 id=&quot;map-phase&quot;&gt;Map Phase&lt;/h3&gt;
&lt;p&gt;The mapper processes the input and emits a key-value pair for each word:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(&quot;Apple&quot;, 1)&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(&quot;Banana&quot;, 1)&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(&quot;Apple&quot;, 1)&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(&quot;Cherry&quot;, 1)&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(&quot;Banana&quot;, 1)&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(&quot;Apple&quot;, 1)&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;shuffle-and-sort-phase&quot;&gt;Shuffle and Sort Phase&lt;/h3&gt;
&lt;p&gt;The framework groups values by key:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&quot;Apple&quot;&lt;/code&gt; -&amp;gt; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;[1, 1, 1]&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&quot;Banana&quot;&lt;/code&gt; -&amp;gt; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;[1, 1]&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&quot;Cherry&quot;&lt;/code&gt; -&amp;gt; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;[1]&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
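
&lt;p&gt;With more than one reducer, the shuffle must also decide which reducer receives each key. The sketch below assumes a simple hash-modulo partitioner; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;partition_for&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;shuffle&lt;/code&gt; are hypothetical helper names, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;zlib.crc32&lt;/code&gt; is used only because it gives the same value on every run.&lt;/p&gt;

```python
import zlib
from collections import defaultdict


def partition_for(key, num_reducers):
    """Pick the reducer that owns a key (stable hash modulo reducer count)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(pairs, num_reducers):
    """Group values by key inside the partition that owns the key."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        partitions[partition_for(key, num_reducers)][key].append(value)
    # Sort keys within each partition, as the framework does before reducing.
    return [dict(sorted(p.items())) for p in partitions]


mapped = [("Apple", 1), ("Banana", 1), ("Apple", 1),
          ("Cherry", 1), ("Banana", 1), ("Apple", 1)]
for reducer_id, groups in enumerate(shuffle(mapped, num_reducers=2)):
    print(reducer_id, groups)
```

&lt;p&gt;Every occurrence of a key lands in the same partition, so a single reducer sees all the values for that key.&lt;/p&gt;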

&lt;h3 id=&quot;reduce-phase&quot;&gt;Reduce Phase&lt;/h3&gt;
&lt;p&gt;The reducer sums the values for each key:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&quot;Apple&quot;&lt;/code&gt; -&amp;gt; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;3&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&quot;Banana&quot;&lt;/code&gt; -&amp;gt; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;2&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&quot;Cherry&quot;&lt;/code&gt; -&amp;gt; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;1&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;python-pseudocode&quot;&gt;Python Pseudocode&lt;/h3&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;# Mapper
&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;map&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;key&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;value&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;# key: document name
&lt;/span&gt;    &lt;span class=&quot;c1&quot;&gt;# value: document contents
&lt;/span&gt;    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;word&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;value&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;split&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;():&lt;/span&gt;
        &lt;span class=&quot;nf&quot;&gt;emitIntermediate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;word&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Reducer
&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;reduce&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;key&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;values&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;# key: word
&lt;/span&gt;    &lt;span class=&quot;c1&quot;&gt;# values: list of counts
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;v&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;values&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;v&lt;/span&gt;
    &lt;span class=&quot;nf&quot;&gt;emit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;key&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
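
&lt;p&gt;The pseudocode above relies on framework-provided &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;emitIntermediate&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;emit&lt;/code&gt; calls. A runnable single-process approximation, assuming generators in place of those calls and a hypothetical &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;run_mapreduce&lt;/code&gt; driver, might look like this:&lt;/p&gt;

```python
from collections import defaultdict


def map_fn(doc_name, contents):
    """Emit (word, 1) for every word in the document."""
    for word in contents.split():
        yield word, 1


def reduce_fn(word, counts):
    """Sum the counts collected for one word."""
    yield word, sum(counts)


def run_mapreduce(documents, mapper, reducer):
    # Map phase: apply the mapper to every input record.
    intermediate = []
    for name, contents in documents.items():
        intermediate.extend(mapper(name, contents))
    # Shuffle and sort phase: group intermediate values by key.
    grouped = defaultdict(list)
    for key, value in intermediate:
        grouped[key].append(value)
    # Reduce phase: apply the reducer to each key in sorted order.
    results = []
    for key in sorted(grouped):
        results.extend(reducer(key, grouped[key]))
    return results


docs = {"doc1": "Apple Banana Apple Cherry Banana Apple"}
print(run_mapreduce(docs, map_fn, reduce_fn))
# [('Apple', 3), ('Banana', 2), ('Cherry', 1)]
```

&lt;p&gt;Only &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;map_fn&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;reduce_fn&lt;/code&gt; are problem-specific; the driver corresponds to the part the framework supplies.&lt;/p&gt;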

&lt;h2 id=&quot;advanced-concepts&quot;&gt;Advanced Concepts&lt;/h2&gt;

&lt;h3 id=&quot;combiner&quot;&gt;Combiner&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Goal&lt;/strong&gt;: Reduce network traffic.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Function&lt;/strong&gt;: Runs on the map node (mini-reducer). Aggregates data locally before sending to the reducer.&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Example&lt;/em&gt;: In Word Count, a mapper might emit &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(&quot;Apple&quot;, 1)&lt;/code&gt; three times. A combiner sums them to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(&quot;Apple&quot;, 3)&lt;/code&gt; before sending over the network.&lt;/li&gt;
&lt;/ul&gt;
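
&lt;p&gt;The effect of a combiner can be sketched in plain Python. This is a minimal single-process illustration, not Hadoop code: the function names &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;map_words&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;combine&lt;/code&gt;, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;reduce_counts&lt;/code&gt; are assumptions made for this example.&lt;/p&gt;

```python
from collections import Counter, defaultdict

def map_words(line):
    """Map phase: emit one (word, 1) pair per word."""
    return [(word, 1) for word in line.split()]

def combine(pairs):
    """Combiner: sum counts per word on the map node, before the shuffle."""
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return list(local.items())

def reduce_counts(all_pairs):
    """Reduce phase: sum the (possibly pre-aggregated) counts per word."""
    totals = defaultdict(int)
    for word, count in all_pairs:
        totals[word] += count
    return dict(totals)

mapper_output = map_words("Apple Banana Apple Apple")
combined = combine(mapper_output)
print(len(mapper_output), "pairs sent without a combiner")
print(len(combined), "pairs sent with a combiner")
```

&lt;p&gt;The reducer produces the same totals either way; the combiner only changes how many pairs cross the network. This works here because addition is associative and commutative, which is the usual condition for reusing reduce logic as a combiner.&lt;/p&gt;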

&lt;h3 id=&quot;data-locality&quot;&gt;Data Locality&lt;/h3&gt;
&lt;p&gt;MapReduce tries to run the map task on the node where the data resides (HDFS block).&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;“Moving computation is cheaper than moving data.”&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;fault-tolerance&quot;&gt;Fault Tolerance&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Worker Failure&lt;/strong&gt;: If a worker node fails, the master re-schedules the task on another worker. Re-execution is possible because Map and Reduce functions are assumed to be deterministic.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Master Failure&lt;/strong&gt;: Rare, but serious. In MRv1 the master (the JobTracker) was a single point of failure; YARN (MRv2) addresses this with ResourceManager High Availability.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;/h2&gt;

&lt;p&gt;MapReduce provides a simple yet powerful model for large-scale data processing. By forcing the computation into Map and Reduce phases, the framework can handle the complexities of distributed execution, allowing developers to focus on the logic.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>03 Introduction to Computing Models and YARN</title>
   <link href="https://nglelinh.github.io/contents/en/chapter03/03_Introduction/"/>
   <updated>2021-01-01T00:00:00+00:00</updated>
   <id>https://nglelinh.github.io/service-oriented-architecture-and-cloud-computing-iuh/contents/en/chapter03/03_Introduction</id>
   <content type="html">&lt;p&gt;This chapter explores various Distributed Computing Models and introduces Apache Hadoop’s resource management layer, YARN.&lt;/p&gt;

&lt;h2 id=&quot;learning-objectives&quot;&gt;Learning Objectives&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;Distinguish between different computing models: Client-Server, P2P, Event-Driven&lt;/li&gt;
  &lt;li&gt;Understand the core components of the Hadoop Ecosystem&lt;/li&gt;
  &lt;li&gt;Explain the role of HDFS (Storage) and YARN (Resource Management)&lt;/li&gt;
  &lt;li&gt;Describe how YARN decouples resource scheduling from data processing applications&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;why-yarn&quot;&gt;Why YARN?&lt;/h2&gt;

&lt;p&gt;In the early days of Hadoop (v1), MapReduce was the only processing engine. YARN (Yet Another Resource Negotiator) was introduced in Hadoop v2 to create a general-purpose cluster operating system, allowing multiple applications (MapReduce, Spark, Flink) to run on the same shared cluster resources.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>03-01 Computing Models and Hadoop YARN</title>
   <link href="https://nglelinh.github.io/contents/en/chapter03/03_01_Computing_Models_and_YARN/"/>
   <updated>2021-01-01T00:00:00+00:00</updated>
   <id>https://nglelinh.github.io/service-oriented-architecture-and-cloud-computing-iuh/contents/en/chapter03/03_01_Computing_Models_and_YARN</id>
   <content type="html">&lt;p&gt;This lecture covers fundamental distributed computing models and introduces the Hadoop ecosystem with a focus on YARN for resource management.&lt;/p&gt;

&lt;h2 id=&quot;distributed-computing-models&quot;&gt;Distributed Computing Models&lt;/h2&gt;

&lt;h3 id=&quot;1-client-server-model&quot;&gt;1. Client-Server Model&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Structure&lt;/strong&gt;: Clients request services, servers provide them.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Examples&lt;/strong&gt;: Web applications, database servers, file servers.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Pros&lt;/strong&gt;: Centralized control, easier maintenance.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Cons&lt;/strong&gt;: Single point of failure (server), scalability bottlenecks.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;2-peer-to-peer-p2p-systems&quot;&gt;2. Peer-to-Peer (P2P) Systems&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Structure&lt;/strong&gt;: All nodes are equal (peers) and share resources directly.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Examples&lt;/strong&gt;: BitTorrent, Blockchain (Bitcoin/Ethereum).&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Pros&lt;/strong&gt;: High scalability, no single point of failure.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Cons&lt;/strong&gt;: Difficult management, security challenges.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;3-multi-tier-architecture&quot;&gt;3. Multi-Tier Architecture&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Structure&lt;/strong&gt;: Splits processing into separate layers (tiers).
    &lt;ul&gt;
      &lt;li&gt;Presentation Tier (UI)&lt;/li&gt;
      &lt;li&gt;Application Tier (Logic)&lt;/li&gt;
      &lt;li&gt;Data Tier (Database)&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Examples&lt;/strong&gt;: Modern web applications (React + Node.js + MongoDB).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;4-thin-clients--compute-servers&quot;&gt;4. Thin Clients &amp;amp; Compute Servers&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Thin Client&lt;/strong&gt;: Lightweight device relying on a server (e.g., Chromebooks, VDI).&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Compute Server&lt;/strong&gt;: Powerful backend that does the heavy processing in response to simple requests from thin clients.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;5-event-driven-architecture&quot;&gt;5. Event-Driven Architecture&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Structure&lt;/strong&gt;: Asynchronous communication based on events/messages.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Examples&lt;/strong&gt;: IoT systems, real-time analytics, serverless functions.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;introduction-to-hadoop&quot;&gt;Introduction to Hadoop&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Apache Hadoop&lt;/strong&gt; is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.&lt;/p&gt;

&lt;h3 id=&quot;key-characteristics&quot;&gt;Key Characteristics&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Scalable&lt;/strong&gt;: From single servers to thousands of machines.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Fault-Tolerant&lt;/strong&gt;: Handles hardware failures automatically.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Shared Nothing Architecture&lt;/strong&gt;: Each node is independent.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Data Locality&lt;/strong&gt;: Move computation to data, not data to computation.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;hadoop-core-components&quot;&gt;Hadoop Core Components&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;HDFS&lt;/strong&gt; (Hadoop Distributed File System): Storage layer.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;YARN&lt;/strong&gt; (Yet Another Resource Negotiator): Resource management layer.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;MapReduce&lt;/strong&gt;: Data processing model.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Hadoop Common&lt;/strong&gt;: Utilities.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;hdfs-architecture&quot;&gt;HDFS Architecture&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;HDFS&lt;/strong&gt; is designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;NameNode (Master)&lt;/strong&gt;: Manages metadata (file names, permissions, location of blocks).&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;DataNode (Worker)&lt;/strong&gt;: Stores actual data blocks.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Block Size&lt;/strong&gt;: Default 128 MB. Large blocks keep seek time small relative to transfer time and reduce the amount of metadata the NameNode must hold.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Replication&lt;/strong&gt;: Default 3 replicas (for fault tolerance).&lt;/li&gt;
&lt;/ul&gt;
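
&lt;p&gt;The block size and replication defaults above translate directly into storage arithmetic. The following is a small sketch; the helper name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hdfs_footprint&lt;/code&gt; is an assumption for this example and not part of any Hadoop API.&lt;/p&gt;

```python
import math

BLOCK_SIZE_MB = 128   # default block size stated above
REPLICATION = 3       # default replication factor stated above

def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return (blocks, block replicas, raw MB stored) for a single file."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    replicas = blocks * replication
    # The final block only occupies its actual size, so raw usage is size x replication.
    raw_mb = file_size_mb * replication
    return blocks, replicas, raw_mb

print(hdfs_footprint(1024))   # a 1 GB file
```

&lt;p&gt;For a 1 GB file this gives 8 blocks, 24 block replicas spread over the DataNodes, and 3 GB of raw disk usage.&lt;/p&gt;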

&lt;h3 id=&quot;fault-tolerance&quot;&gt;Fault Tolerance&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Heartbeats&lt;/strong&gt;: DataNodes send a heartbeat to the NameNode every 3 seconds by default.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Re-replication&lt;/strong&gt;: If a DataNode fails, the NameNode schedules replication of its blocks to other nodes.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Rack Awareness&lt;/strong&gt;: Places replicas on different racks to survive rack failures.&lt;/li&gt;
&lt;/ul&gt;
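
&lt;p&gt;Heartbeat-based failure detection and re-replication can be sketched as follows. This is a simplified model and not NameNode code: the timeout &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DEAD_AFTER_S&lt;/code&gt;, the node names, and both function names are hypothetical values chosen for illustration.&lt;/p&gt;

```python
HEARTBEAT_INTERVAL_S = 3   # interval stated above
DEAD_AFTER_S = 30          # hypothetical timeout chosen for this sketch

def dead_nodes(last_heartbeat, now):
    """Return the DataNodes whose last heartbeat is older than the timeout."""
    return {node for node, seen in last_heartbeat.items() if now - seen > DEAD_AFTER_S}

def under_replicated(block_locations, dead, target=3):
    """Map each block to the number of extra replicas it needs on live nodes."""
    missing = {}
    for block, nodes in block_locations.items():
        live = [n for n in nodes if n not in dead]
        if target > len(live):
            missing[block] = target - len(live)
    return missing

# Timestamps (in seconds) of the most recent heartbeat from each DataNode.
last_heartbeat = {"dn1": 100, "dn2": 98, "dn3": 40, "dn4": 99}
blocks = {"blk_1": ["dn1", "dn2", "dn3"], "blk_2": ["dn1", "dn2", "dn4"]}

dead = dead_nodes(last_heartbeat, now=101)
print(dead, under_replicated(blocks, dead))
```

&lt;p&gt;Here &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dn3&lt;/code&gt; has been silent for too long, so the only block that had a replica on it needs one new copy on another node.&lt;/p&gt;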

&lt;h2 id=&quot;hadoop-yarn&quot;&gt;Hadoop YARN&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;YARN&lt;/strong&gt; decoupled the resource management and job scheduling capabilities from the original MapReduce.&lt;/p&gt;

&lt;h3 id=&quot;architecture&quot;&gt;Architecture&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;ResourceManager (RM)&lt;/strong&gt;: Global master that arbitrates resources among all applications in the system.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;NodeManager (NM)&lt;/strong&gt;: Per-machine agent responsible for containers, monitoring their resource usage (CPU, memory, disk, network) and reporting to the RM.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;ApplicationMaster (AM)&lt;/strong&gt;: Per-application library that negotiates resources from the RM and works with the NM to execute and monitor the tasks.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Container&lt;/strong&gt;: Abstract bundle of resources (memory, CPU, disk, network) on a single node, within which a task runs.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;how-yarn-works&quot;&gt;How YARN Works&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Client submits an application.&lt;/li&gt;
  &lt;li&gt;ResourceManager allocates a container to start the ApplicationMaster.&lt;/li&gt;
  &lt;li&gt;ApplicationMaster asks ResourceManager for more containers.&lt;/li&gt;
  &lt;li&gt;ApplicationMaster contacts NodeManagers to start tasks in allocated containers.&lt;/li&gt;
&lt;/ol&gt;
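
&lt;p&gt;The four steps above can be traced with a toy simulation. It is only a sketch of the control flow: the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ResourceManager&lt;/code&gt; class and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;run_application&lt;/code&gt; function below are invented for this example and do not correspond to the real YARN client API.&lt;/p&gt;

```python
class ResourceManager:
    """Toy ResourceManager: tracks free container slots per node."""

    def __init__(self, free_slots):
        self.free_slots = dict(free_slots)   # node name -> free containers

    def allocate(self, count):
        """Grant up to `count` containers, returned as a list of node names."""
        granted = []
        for node in self.free_slots:
            while self.free_slots[node] > 0 and count > len(granted):
                self.free_slots[node] -= 1
                granted.append(node)
        return granted

def run_application(rm, num_tasks):
    """Walk through the submission steps and return them as log lines."""
    log = ["1. client submits application"]
    am_node = rm.allocate(1)[0]
    log.append(f"2. RM starts the ApplicationMaster in a container on {am_node}")
    task_nodes = rm.allocate(num_tasks)
    log.append(f"3. AM is granted {len(task_nodes)} more containers")
    for i, node in enumerate(task_nodes):
        log.append(f"4. AM asks the NodeManager on {node} to start task {i}")
    return log

rm = ResourceManager({"node-a": 2, "node-b": 2})
for line in run_application(rm, num_tasks=3):
    print(line)
```

&lt;p&gt;Note that the ApplicationMaster itself occupies a container, so an application with three tasks uses four containers in total.&lt;/p&gt;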

&lt;h3 id=&quot;schedulers&quot;&gt;Schedulers&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;FIFO Scheduler&lt;/strong&gt;: First In, First Out. Simple, but a poor fit for shared clusters because one large job can block everything queued behind it.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Capacity Scheduler&lt;/strong&gt;: Designed for multi-tenancy. Queues get a guaranteed capacity.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Fair Scheduler&lt;/strong&gt;: Assigns resources so applications get an equal share of resources over time.&lt;/li&gt;
&lt;/ul&gt;
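
&lt;p&gt;The difference between FIFO and fair sharing shows up clearly in a small allocation example. This sketch counts only containers and ignores queues, weights, and memory, all of which the real schedulers take into account; the application names and demands are made up.&lt;/p&gt;

```python
def fifo_allocate(capacity, demands):
    """Serve applications strictly in arrival order."""
    allocation = {}
    for app, demand in demands.items():
        granted = min(demand, capacity)
        allocation[app] = granted
        capacity -= granted
    return allocation

def fair_allocate(capacity, demands):
    """Max-min fair share: hand out containers one at a time, round-robin."""
    allocation = {app: 0 for app in demands}
    while capacity > 0:
        progressed = False
        for app, demand in demands.items():
            if capacity > 0 and demand > allocation[app]:
                allocation[app] += 1
                capacity -= 1
                progressed = True
        if not progressed:   # every demand is satisfied
            break
    return allocation

demands = {"big-etl": 10, "small-query": 2, "report": 4}
print(fifo_allocate(8, demands))
print(fair_allocate(8, demands))
```

&lt;p&gt;With 8 containers, FIFO gives everything to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;big-etl&lt;/code&gt; and starves the other two, while the fair allocation lets the small query finish and splits the remainder evenly.&lt;/p&gt;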

&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Computing Models&lt;/strong&gt; (Client-Server, P2P) define how distributed components interact.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Hadoop&lt;/strong&gt; revolutionized big data by combining storage (HDFS) and processing on commodity hardware.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;YARN&lt;/strong&gt; enables Hadoop to support multiple processing engines (MapReduce, Spark, Tez) by separating resource management from execution logic.&lt;/li&gt;
&lt;/ul&gt;
</content>
 </entry>
 
 <entry>
   <title>02 Introduction to Distributed Systems</title>
   <link href="https://nglelinh.github.io/contents/en/chapter02/02_Introduction/"/>
   <updated>2021-01-01T00:00:00+00:00</updated>
   <id>https://nglelinh.github.io/service-oriented-architecture-and-cloud-computing-iuh/contents/en/chapter02/02_Introduction</id>
   <content type="html">&lt;p&gt;This chapter introduces the fundamental concepts of Distributed Systems, which form the invisible backbone of modern cloud computing and big data processing. Before we can understand the cloud, we must understand the systems that make it possible.&lt;/p&gt;

&lt;h2 id=&quot;learning-objectives&quot;&gt;Learning Objectives&lt;/h2&gt;

&lt;p&gt;In this chapter, we will:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Define what a Distributed System is and, crucially, what it isn’t.&lt;/li&gt;
  &lt;li&gt;Understand the key characteristics that make them unique: concurrency, the lack of a global clock, and independent failures.&lt;/li&gt;
  &lt;li&gt;Compare centralized vs. distributed architectures to see why we moved away from mainframes.&lt;/li&gt;
  &lt;li&gt;Explore basic design issues including Naming, Communication, and Reliability.&lt;/li&gt;
  &lt;li&gt;Analyze real-world examples from the Local Area Networks (LANs) in your office to the massive Cloud Clusters that power the internet.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;why-distributed-systems&quot;&gt;Why Distributed Systems?&lt;/h2&gt;

&lt;p&gt;A single machine can only be upgraded so far before it hits physical limits (vertical scaling). To handle the scale of modern internet applications, which serve billions of users and process petabytes of data, we have no choice but to coordinate thousands of machines working together (horizontal scaling). This chapter explains the theoretical and practical foundations of that coordination.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>02-01 Distributed Systems Fundamentals</title>
   <link href="https://nglelinh.github.io/contents/en/chapter02/02_01_Distributed_Systems/"/>
   <updated>2021-01-01T00:00:00+00:00</updated>
   <id>https://nglelinh.github.io/service-oriented-architecture-and-cloud-computing-iuh/contents/en/chapter02/02_01_Distributed_Systems</id>
   <content type="html">&lt;p&gt;A distributed system consists of independent computers that communicate to achieve a common goal. This lecture covers the fundamental concepts, characteristics, and design issues of distributed systems.&lt;/p&gt;

&lt;h2 id=&quot;what-is-a-distributed-system&quot;&gt;What is a Distributed System?&lt;/h2&gt;

&lt;h3 id=&quot;definition-and-core-concept&quot;&gt;Definition and Core Concept&lt;/h3&gt;
&lt;p&gt;Understanding the precise nature of a distributed system is the first step in mastering cloud computing. A widely accepted definition states that &lt;strong&gt;“a distributed system is one in which components located at networked computers communicate and coordinate their actions only by passing messages.”&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This seemingly simple definition has profound implications. First, it implies the &lt;strong&gt;concurrency of components&lt;/strong&gt;, meaning that multiple processes are executing simultaneously across the network. Second, it highlights the &lt;strong&gt;lack of a global clock&lt;/strong&gt;. In a distributed environment, keeping time synchronized between machines is a notoriously difficult computer science problem, making coordination a challenge. Finally, it introduces the reality of &lt;strong&gt;independent failures&lt;/strong&gt;. Unlike a monolithic system where a crash often takes down everything, in a distributed system, one component can fail completely while others continue to function normally.&lt;/p&gt;

&lt;p&gt;Leslie Lamport, a Turing Award winner and pioneer in the field, offered a more tongue-in-cheek but equally accurate definition: &lt;em&gt;“... a system in which the failure of a computer you didn’t even know existed can render your own computer unusable.”&lt;/em&gt;&lt;/p&gt;

&lt;h3 id=&quot;centralized-vs-distributed-architectures&quot;&gt;Centralized vs. Distributed Architectures&lt;/h3&gt;
&lt;p&gt;To appreciate the shift to distributed systems, it is helpful to contrast them with traditional centralized systems.&lt;/p&gt;

&lt;p&gt;In a &lt;strong&gt;Centralized System&lt;/strong&gt;, all calculations are performed by a single computer (like a mainframe). Resources are shared and accessible to users at all times (as long as the system is up). There is a single process control and, crucially, a &lt;strong&gt;single point of failure&lt;/strong&gt;. If the main computer goes down, the entire system stops.&lt;/p&gt;

&lt;p&gt;In contrast, a &lt;strong&gt;Distributed System&lt;/strong&gt; is composed of multiple autonomous components. Resources may not always be accessible if a network link fails. Processing is concurrent, occurring on different processors simultaneously. Control is decentralized, meaning there are multiple points of control and, consequently, &lt;strong&gt;multiple points of failure&lt;/strong&gt;. While this adds complexity, it also adds resilience, as the failure of one node does not necessarily doom the entire system.&lt;/p&gt;

&lt;h3 id=&quot;why-go-distributed&quot;&gt;Why Go Distributed?&lt;/h3&gt;
&lt;p&gt;Despite the added complexity, there are compelling reasons to adopt distributed architectures:&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Inherent Distribution&lt;/strong&gt;: Some applications are naturally distributed. For example, a messaging app on two phones requires a system that bridges the physical distance between them.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Reliability&lt;/strong&gt;: By eliminating the single point of failure, distributed systems can offer higher availability. If one server crashes, others can take over its workload.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Performance&lt;/strong&gt;: We can optimize performance by accessing data from a nearby node (reducing latency) or by executing tasks in parallel across massive clusters (increasing throughput).&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Scale&lt;/strong&gt;: This is the primary driver for modern “Big Data” systems. Many problems are simply too large for any single machine to hold in memory or process in a reasonable amount of time.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;examples-of-distributed-systems&quot;&gt;Examples of Distributed Systems&lt;/h2&gt;

&lt;p&gt;Distributed systems are ubiquitous in modern technology. &lt;strong&gt;Local Area Networks (LANs)&lt;/strong&gt; connect computers within an office. &lt;strong&gt;Database Management Systems&lt;/strong&gt; often run across multiple servers to handle large datasets. The global network of &lt;strong&gt;Automatic Teller Machines (ATMs)&lt;/strong&gt; is a distributed system that coordinates financial transactions.&lt;/p&gt;

&lt;p&gt;The most famous example is the &lt;strong&gt;World Wide Web (WWW)&lt;/strong&gt; itself, a massive distributed system of clients and servers sharing resources via URLs. &lt;strong&gt;Mobile and Ubiquitous Computing&lt;/strong&gt; extends this concept to devices moving through physical space. And, of course, &lt;strong&gt;Cloud Computing Clusters&lt;/strong&gt; (like those powering AWS, Azure, and GCP) are the pinnacle of distributed infrastructure, orchestrating millions of servers to provide utility computing. Even &lt;strong&gt;Multi-player Online Games&lt;/strong&gt; rely on distributed systems to synchronize the state of the virtual world for thousands of players simultaneously.&lt;/p&gt;

&lt;h3 id=&quot;case-study-scaling-facebook&quot;&gt;Case Study: Scaling Facebook&lt;/h3&gt;
&lt;p&gt;The evolution of Facebook provides a clear illustration of why distributed systems are necessary:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;2004&lt;/strong&gt;: It started as a single server handling both the web traffic and the database.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Separation&lt;/strong&gt;: As traffic grew, they separated the web server from the database server. While this increased capacity, the system would still go offline if either server failed, and there was no redundancy.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Partitioning&lt;/strong&gt;: They began to deploy pairs of servers for specific communities (e.g., one pair for Harvard, one for Yale). This worked until friends from different schools wanted to connect, revealing the problem of isolated partitions.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Horizontal Scaling&lt;/strong&gt;: Today, Facebook uses a massive distributed architecture. The &lt;strong&gt;Front-end&lt;/strong&gt; consists of a scalable number of stateless web servers, with load balancers distributing user traffic. The &lt;strong&gt;Back-end&lt;/strong&gt; uses sharded databases and sophisticated caching layers, managed by complex distributed systems code to ensure billions of users see a consistent view of their news feed.&lt;/li&gt;
&lt;/ul&gt;
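
&lt;p&gt;One generic way to avoid the isolated partitions described above is to assign users to shards by hashing their identifier instead of by community. The sketch below illustrates that general technique only; it is not a description of Facebook’s actual scheme, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;shard_for&lt;/code&gt; is a hypothetical helper.&lt;/p&gt;

```python
import hashlib

def shard_for(user_id, num_shards):
    """Pick a shard from a stable hash of the user id."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# Any stateless web server can compute the same answer without coordination.
for user in ["alice@harvard", "bob@yale", "carol@harvard"]:
    print(user, "-> shard", shard_for(user, 4))
```

&lt;p&gt;Because the mapping depends only on the identifier, users from the same school can land on different shards, and any front-end server can locate any user’s data. A plain modulo like this reshuffles most keys when the shard count changes, which is why production systems often use consistent hashing or a lookup service instead.&lt;/p&gt;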

&lt;h2 id=&quot;key-characteristics&quot;&gt;Key Characteristics&lt;/h2&gt;

&lt;p&gt;When assessing the quality and robustness of a distributed system, architects evaluate several key characteristics:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Heterogeneity&lt;/strong&gt;: A robust system can operate over a variety of different networks, hardware, operating systems, and programming languages.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Openness&lt;/strong&gt;: The system should be extendable. This is often achieved through published interfaces (APIs) that allow new components to be added without rewriting existing ones.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Security&lt;/strong&gt;: The system must guarantee the confidentiality, integrity, and availability of data, which is harder when data is moving across a network.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Scalability&lt;/strong&gt;: The system should be able to handle growing loads by simply adding more resources, rather than redesigning the architecture.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Failure Handling&lt;/strong&gt;: The system must be able to detect failures, mask them (so the user doesn’t notice), and recover automatically.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Concurrency&lt;/strong&gt;: The system must handle multiple operations executing simultaneously without corrupting data.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Transparency&lt;/strong&gt;: Ideally, the complexity of the distribution should be hidden from the user and the application programmer. The system should appear as a single coherent entity.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;basic-design-issues&quot;&gt;Basic Design Issues&lt;/h2&gt;

&lt;p&gt;Designing a distributed system involves solving several fundamental problems.&lt;/p&gt;

&lt;h3 id=&quot;1-naming&quot;&gt;1. Naming&lt;/h3&gt;
&lt;p&gt;How do we identify resources in a vast network? A &lt;strong&gt;Naming Context&lt;/strong&gt; is required to resolve user-friendly names to machine-readable identifiers. For example, the Domain Name System (DNS) translates &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;google.com&lt;/code&gt; into an IP address. Systems also rely on &lt;strong&gt;Unique Identifiers&lt;/strong&gt; like URIs (Uniform Resource Identifiers) or UUIDs (Universally Unique Identifiers) to distinguish resources globally.&lt;/p&gt;
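
&lt;p&gt;Both ideas can be shown in a few lines. The naming context below is a hypothetical in-memory table standing in for a service such as DNS, and the host names and addresses are placeholders from a range reserved for documentation.&lt;/p&gt;

```python
import uuid

# Hypothetical naming context: a flat table from names to addresses.
# 192.0.2.x is a range reserved for documentation examples.
NAMING_CONTEXT = {
    "db.internal.example": "192.0.2.10",
    "cache.internal.example": "192.0.2.11",
}

def resolve(name, context=NAMING_CONTEXT):
    """Resolve a human-readable name to an address, or raise KeyError."""
    return context[name]

# A random 128-bit identifier; no central coordinator is needed to create it.
resource_id = uuid.uuid4()

print(resolve("db.internal.example"), resource_id)
```

&lt;p&gt;The name can be re-pointed to a different address without changing any client, whereas the UUID stays attached to the resource for its whole lifetime.&lt;/p&gt;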

&lt;h3 id=&quot;2-communication&quot;&gt;2. Communication&lt;/h3&gt;
&lt;p&gt;How do components talk to each other? The fundamental primitive is &lt;strong&gt;Message Passing&lt;/strong&gt; (Send/Receive). Communication can be &lt;strong&gt;Synchronous (Blocking)&lt;/strong&gt;, where the sender waits for a response before continuing, or &lt;strong&gt;Asynchronous (Non-blocking)&lt;/strong&gt;, where the sender continues immediately after sending and either handles the response later or, in the “fire and forget” case, expects none. Common patterns built on top of these include Client-Server interactions (like RPC), Group Multicast, and Publish-Subscribe systems.&lt;/p&gt;
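
&lt;p&gt;The two styles can be compared with a small in-process sketch in which a thread plays the server and queues stand in for the network. The message format (a payload plus a reply queue) is an assumption of this example.&lt;/p&gt;

```python
import queue
import threading

def server(inbox):
    """Receive (payload, reply_queue) messages until a None sentinel arrives."""
    while True:
        message = inbox.get()
        if message is None:
            break
        payload, reply_to = message
        reply_to.put(payload.upper())

inbox = queue.Queue()
worker = threading.Thread(target=server, args=(inbox,))
worker.start()

# Synchronous (blocking): send, then wait for the reply before continuing.
reply_box = queue.Queue()
inbox.put(("ping", reply_box))
sync_reply = reply_box.get()

# Asynchronous (non-blocking): send several messages, collect replies later.
pending = queue.Queue()
for text in ["a", "b", "c"]:
    inbox.put((text, pending))
# ... the sender is free to do unrelated work here ...
async_replies = [pending.get() for _ in range(3)]

inbox.put(None)   # tell the server to stop
worker.join()
print(sync_reply, async_replies)
```

&lt;p&gt;In the synchronous case the sender is idle for a full round trip per message; in the asynchronous case the three requests are in flight at once and the sender only blocks when it finally needs the results.&lt;/p&gt;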

&lt;h3 id=&quot;3-latency-and-bandwidth&quot;&gt;3. Latency and Bandwidth&lt;/h3&gt;
&lt;p&gt;Two physical constraints dominate distributed system performance:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Latency&lt;/strong&gt; is the time delay for a message to arrive. Its lower bound is set by the speed of light. Sending a message within the same building might take about 1 ms, but sending it from continent to continent can take around 100 ms. “Sneakernet” (driving hard drives in a van) has a latency of about a day.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Bandwidth&lt;/strong&gt; is the volume of data that can be transmitted per unit of time. Modern fiber networks offer high bandwidth (Gigabits per second). Interestingly, a “station wagon full of tapes hurtling down the highway” has incredibly high bandwidth (Petabytes per day) despite its terrible latency. This classic tradeoff, famously noted by Andrew Tanenbaum, reminds us that physical transport is still sometimes the fastest way to move massive datasets.&lt;/li&gt;
&lt;/ul&gt;
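
&lt;p&gt;The station wagon argument is easy to check with back-of-the-envelope arithmetic. The numbers below (1,000 drives of 10 TB each, a one-day drive, a 10 Gbit/s link) are hypothetical figures chosen for illustration.&lt;/p&gt;

```python
SECONDS_PER_DAY = 86_400

def effective_gbps(payload_bytes, seconds):
    """Average throughput in Gbit/s when the payload arrives after `seconds`."""
    return payload_bytes * 8 / seconds / 1e9

def transfer_days(payload_bytes, link_gbps):
    """Days needed to push the payload through a link of `link_gbps` Gbit/s."""
    return payload_bytes * 8 / (link_gbps * 1e9) / SECONDS_PER_DAY

payload = 1_000 * 10 * 10**12   # 1,000 drives x 10 TB = 10 PB (hypothetical load)

print(round(effective_gbps(payload, SECONDS_PER_DAY)), "Gbit/s for the van")
print(round(transfer_days(payload, 10), 1), "days over a 10 Gbit/s link")
```

&lt;p&gt;Under these assumptions the van delivers roughly 926 Gbit/s averaged over the day, while the network link would need about 93 days to move the same data. The van’s first byte, however, arrives a full day after departure.&lt;/p&gt;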

&lt;h3 id=&quot;4-software-structure&quot;&gt;4. Software Structure&lt;/h3&gt;
&lt;p&gt;Managing this complexity requires structure. We use &lt;strong&gt;Layers&lt;/strong&gt; of abstraction to separate concerns. &lt;strong&gt;Middleware&lt;/strong&gt; is a crucial layer of software that sits between the operating system and the application. It provides standard services—like Remote Procedure Calls (RPC) or Remote Method Invocation (RMI)—that simplify the development of distributed applications by handling the messy details of networking and coordination.&lt;/p&gt;

&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;/h2&gt;

&lt;p&gt;Distributed systems are the foundation of modern computing, enabling everything from the web to the cloud. While they offer immense benefits in reliability, performance, and scale, they introduce significant complexity. Mastering the challenges of coordination, failure handling, and consistency is essential for any cloud engineer.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>01 Introduction to Cloud Computing</title>
   <link href="https://nglelinh.github.io/contents/en/chapter01/01_Introduction/"/>
   <updated>2021-01-01T00:00:00+00:00</updated>
   <id>https://nglelinh.github.io/service-oriented-architecture-and-cloud-computing-iuh/contents/en/chapter01/01_Introduction</id>
   <content type="html">&lt;p&gt;Cloud computing has fundamentally transformed how organizations approach IT infrastructure, application deployment, and service delivery. This revolutionary paradigm shift represents one of the most significant technological advances of the 21st century, enabling businesses to access computing resources on-demand without the need for substantial upfront investments in hardware and infrastructure.&lt;/p&gt;

&lt;h2 id=&quot;what-is-cloud-computing&quot;&gt;What is Cloud Computing?&lt;/h2&gt;

&lt;p&gt;According to the National Institute of Standards and Technology (NIST), cloud computing is defined as:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;“A model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This definition encapsulates the essence of cloud computing: the ability to access computing resources as easily as turning on a light switch, without worrying about the underlying infrastructure complexity.&lt;/p&gt;

&lt;h2 id=&quot;the-evolution-of-computing-models&quot;&gt;The Evolution of Computing Models&lt;/h2&gt;

&lt;p&gt;To understand the significance of cloud computing, it’s essential to examine the evolution of computing models:&lt;/p&gt;

&lt;h3 id=&quot;1-mainframe-era-1960s-1980s&quot;&gt;1. Mainframe Era (1960s-1980s)&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;Centralized computing with terminals&lt;/li&gt;
  &lt;li&gt;High costs and limited accessibility&lt;/li&gt;
  &lt;li&gt;Batch processing and time-sharing systems&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;2-personal-computing-era-1980s-1990s&quot;&gt;2. Personal Computing Era (1980s-1990s)&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;Distributed computing on individual machines&lt;/li&gt;
  &lt;li&gt;Client-server architectures&lt;/li&gt;
  &lt;li&gt;Local area networks (LANs)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;3-internet-era-1990s-2000s&quot;&gt;3. Internet Era (1990s-2000s)&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;Web-based applications&lt;/li&gt;
  &lt;li&gt;Distributed systems and grid computing&lt;/li&gt;
  &lt;li&gt;Service-oriented architectures (SOA)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;4-cloud-computing-era-2000s-present&quot;&gt;4. Cloud Computing Era (2000s-Present)&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;On-demand resource provisioning&lt;/li&gt;
  &lt;li&gt;Pay-as-you-use models&lt;/li&gt;
  &lt;li&gt;Massive scalability and global accessibility&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;why-cloud-computing-matters&quot;&gt;Why Cloud Computing Matters&lt;/h2&gt;

&lt;p&gt;Cloud computing addresses several critical challenges faced by modern organizations:&lt;/p&gt;

&lt;h3 id=&quot;economic-efficiency&quot;&gt;Economic Efficiency&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Capital Expenditure (CapEx) to Operational Expenditure (OpEx)&lt;/strong&gt;: Organizations can shift from large upfront investments to ongoing, usage-based operating costs&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Economy of Scale&lt;/strong&gt;: Cloud providers can offer services at lower costs due to massive scale operations&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Resource Optimization&lt;/strong&gt;: Pay only for what you use, when you use it&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;technological-advantages&quot;&gt;Technological Advantages&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Rapid Deployment&lt;/strong&gt;: Applications can be deployed in minutes rather than months&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Global Reach&lt;/strong&gt;: Services can be made available worldwide with minimal effort&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Innovation Acceleration&lt;/strong&gt;: Focus on core business logic rather than infrastructure management&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;business-agility&quot;&gt;Business Agility&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Scalability&lt;/strong&gt;: Resources can be scaled up or down based on demand&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Flexibility&lt;/strong&gt;: Support for various programming languages, frameworks, and tools&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Speed to Market&lt;/strong&gt;: Faster development and deployment cycles&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;real-world-impact&quot;&gt;Real-World Impact&lt;/h2&gt;

&lt;p&gt;Cloud computing has enabled numerous innovations and business models:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Startups&lt;/strong&gt;: Companies like Netflix, Airbnb, and Uber have relied heavily on cloud infrastructure to scale their platforms&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Enterprise Transformation&lt;/strong&gt;: Traditional companies like GE and Capital One have migrated critical workloads to the cloud&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Global Collaboration&lt;/strong&gt;: Remote work and distributed teams are enabled by cloud-based collaboration tools&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Data Analytics&lt;/strong&gt;: Big data processing and machine learning at scale&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;learning-objectives&quot;&gt;Learning Objectives&lt;/h2&gt;

&lt;p&gt;By the end of this chapter, you will understand:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;The fundamental characteristics that define cloud computing&lt;/li&gt;
  &lt;li&gt;Different service models (IaaS, PaaS, SaaS) and their use cases&lt;/li&gt;
  &lt;li&gt;Various deployment models and their implications&lt;/li&gt;
  &lt;li&gt;Benefits and challenges associated with cloud adoption&lt;/li&gt;
  &lt;li&gt;Key considerations for cloud strategy and implementation&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;whats-next&quot;&gt;What’s Next?&lt;/h2&gt;

&lt;p&gt;In the following lessons, we’ll dive deeper into each aspect of cloud computing, exploring the technical details, practical implementations, and strategic considerations that will help you make informed decisions about cloud adoption and utilization.&lt;/p&gt;

&lt;p&gt;The journey into cloud computing is not just about understanding technology—it’s about reimagining how we build, deploy, and manage applications in an increasingly connected and digital world.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>01-05 Data Centers and Scaling</title>
   <link href="https://nglelinh.github.io/contents/en/chapter01/01_05_Data_Centers_and_Scaling/"/>
   <updated>2021-01-01T00:00:00+00:00</updated>
   <id>https://nglelinh.github.io/service-oriented-architecture-and-cloud-computing-iuh/contents/en/chapter01/01_05_Data_Centers_and_Scaling</id>
   <content type="html">&lt;p&gt;Behind the abstract concept of the “Cloud” lies massive physical infrastructure: Data Centers. Understanding how computing scales from a single PC to a warehouse-sized computer is fundamental to cloud engineering.&lt;/p&gt;

&lt;h2 id=&quot;the-need-for-scale&quot;&gt;The Need for Scale&lt;/h2&gt;

&lt;p&gt;Modern web services operate at a scale that is difficult to comprehend. A single server is no longer sufficient to handle the data volume and compute requirements of global applications. Large services today process &lt;strong&gt;Petabytes (PB)&lt;/strong&gt; to &lt;strong&gt;Exabytes (EB)&lt;/strong&gt; of data (for context, 1 Petabyte is one million Gigabytes and 1 Exabyte is one billion Gigabytes). Services like Facebook and YouTube serve billions of users daily, requiring huge clusters of machines working in parallel to deliver content with low latency.&lt;/p&gt;

&lt;h2 id=&quot;two-approaches-to-scaling&quot;&gt;Two Approaches to Scaling&lt;/h2&gt;

&lt;p&gt;When a single computer reaches its limit, system architects have two primary strategies to increase capacity: vertical scaling and horizontal scaling.&lt;/p&gt;

&lt;h3 id=&quot;1-vertical-scaling-scale-up&quot;&gt;1. Vertical Scaling (Scale Up)&lt;/h3&gt;

&lt;p&gt;Vertical scaling, often referred to as “scaling up,” involves adding more power (resources like CPU, RAM, or faster storage) to an existing machine. We see this transition in the evolution from a personal computer to a powerful workstation, then to a server, and finally to a &lt;strong&gt;Mainframe&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;While this approach is conceptually simple—you just buy a bigger computer—it has severe limitations. First, there is a hard &lt;strong&gt;hardware limit&lt;/strong&gt;; you can only buy a CPU that is so fast, or a motherboard that supports so much RAM. Second, the cost increases exponentially for high-end hardware; the fastest processor is often disproportionately more expensive than a mid-range one. Finally, and perhaps most critically for cloud reliability, a single massive machine represents a &lt;strong&gt;single point of failure&lt;/strong&gt;. If that one super-server crashes, the entire application goes down.&lt;/p&gt;

&lt;h3 id=&quot;2-horizontal-scaling-scale-out&quot;&gt;2. Horizontal Scaling (Scale Out)&lt;/h3&gt;

&lt;p&gt;Horizontal scaling, or “scaling out,” involves adding more machines to a system rather than making a single machine stronger. This was the breakthrough that enabled the modern internet. Instead of one mainframe, you build a &lt;strong&gt;Cluster&lt;/strong&gt; of standard servers, which then grows into a Data Center, and eventually a global network of Data Centers.&lt;/p&gt;

&lt;p&gt;The advantages of this approach are profound. It allows the use of &lt;strong&gt;commodity hardware&lt;/strong&gt;, which is significantly cheaper than high-end mainframes. It offers linear cost scalability—to double the power, you simply buy double the number of cheap servers. Most importantly, it provides &lt;strong&gt;high fault tolerance&lt;/strong&gt;. In a cluster of 1,000 servers, if one fails, the other 999 pick up the load, and the system continues without interruption. The challenge, however, is the increased &lt;strong&gt;complexity&lt;/strong&gt; in software architecture required to manage distributed state and consistency across thousands of nodes.&lt;/p&gt;
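
&lt;p&gt;The fault-tolerance argument can be checked with a little arithmetic. The sketch below is only an illustration: the function name &lt;code&gt;remaining_capacity&lt;/code&gt; is hypothetical, and it assumes identical servers with load spread evenly across them.&lt;/p&gt;

```python
def remaining_capacity(total_servers: int, failed_servers: int) -> float:
    """Fraction of cluster capacity still available after some servers fail.

    Assumption: all servers are identical and load is balanced evenly.
    """
    if not total_servers > 0:
        raise ValueError("total_servers must be positive")
    if failed_servers > total_servers or 0 > failed_servers:
        raise ValueError("failed_servers must be between 0 and total_servers")
    return (total_servers - failed_servers) / total_servers


# One failure in the 1,000-server cluster from the text.
print(f"cluster of 1000, 1 failed: {remaining_capacity(1000, 1):.1%}")  # 99.9%
# A single scaled-up machine has nothing left when it fails.
print(f"single machine, 1 failed: {remaining_capacity(1, 1):.1%}")      # 0.0%
```

&lt;p&gt;In practice the surviving servers also need enough headroom to absorb the extra load, which this sketch does not model.&lt;/p&gt;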

&lt;h2 id=&quot;the-data-center-as-a-computer&quot;&gt;The Data Center as a Computer&lt;/h2&gt;

&lt;p&gt;A modern Data Center is essentially a “warehouse-sized computer.” It is not just a room with servers; it is a holistic system designed for efficiency and scale.&lt;/p&gt;

&lt;h3 id=&quot;architecture&quot;&gt;Architecture&lt;/h3&gt;
&lt;p&gt;The building block of a data center is the &lt;strong&gt;Server&lt;/strong&gt;. Dozens of these servers are mounted into a physical frame called a &lt;strong&gt;Rack&lt;/strong&gt; (e.g., 40 servers per rack). Each rack has a “Top of Rack” &lt;strong&gt;Switch&lt;/strong&gt; that connects the servers to the larger network. Hundreds or thousands of these racks are organized into a &lt;strong&gt;Cluster&lt;/strong&gt;, working together as a single computing entity.&lt;/p&gt;

&lt;h3 id=&quot;key-characteristics&quot;&gt;Key Characteristics&lt;/h3&gt;
&lt;p&gt;To support this massive scale, data centers require &lt;strong&gt;Massive Networking&lt;/strong&gt; infrastructure, with high-bandwidth, low-latency fabric interconnecting all nodes. &lt;strong&gt;Redundancy&lt;/strong&gt; is built into every layer: backup power supplies (UPS), diesel generators, redundant cooling systems, and multiple network paths ensure that the facility never goes dark. &lt;strong&gt;Security&lt;/strong&gt; is paramount, with strict physical access controls, biometric scanners, and “man traps” preventing unauthorized entry.&lt;/p&gt;

&lt;h3 id=&quot;energy-and-environmental-impact&quot;&gt;Energy and Environmental Impact&lt;/h3&gt;
&lt;p&gt;Data centers are massive consumers of electricity. A single rack can consume more than 4 kW of power, and a hyperscale data center consumes as much energy as a small city. All of that electricity is converted into heat, which must be removed to prevent hardware failure. Consequently, &lt;strong&gt;cooling systems&lt;/strong&gt; often consume 30-50% of the total energy of the facility. This is why many data centers are strategically built near cheap, green energy sources, such as hydroelectric dams in the Columbia River basin, to reduce operational expenses and minimize environmental impact.&lt;/p&gt;
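
&lt;p&gt;One way to read the cooling figures is through Power Usage Effectiveness (PUE), a common industry metric defined as total facility energy divided by IT equipment energy. PUE is not introduced in this lesson, so treat the sketch below as a supplementary illustration. It makes the simplifying assumption that all energy not spent on cooling goes to IT equipment; the function name is hypothetical.&lt;/p&gt;

```python
def pue_from_cooling_share(cooling_share: float) -> float:
    """Estimate PUE = total facility energy / IT equipment energy.

    Simplifying assumption: every unit of energy that is not used for
    cooling is consumed by IT equipment (no other overhead is modeled).
    """
    if not 1 > cooling_share >= 0:
        raise ValueError("cooling_share must be in the range [0, 1)")
    return 1.0 / (1.0 - cooling_share)


# The 30-50% cooling range quoted in the text.
for share in (0.30, 0.50):
    print(f"cooling {share:.0%} of total -> PUE about {pue_from_cooling_share(share):.2f}")
```

&lt;p&gt;Under this assumption, a facility that spends half its energy on cooling needs two units of electricity for every unit that reaches the servers.&lt;/p&gt;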

&lt;h2 id=&quot;modular-and-distributed-data-centers&quot;&gt;Modular and Distributed Data Centers&lt;/h2&gt;

&lt;h3 id=&quot;modular-data-centers&quot;&gt;Modular Data Centers&lt;/h3&gt;
&lt;p&gt;A modern trend to address deployment speed is the &lt;strong&gt;Modular Data Center&lt;/strong&gt;. In this model, servers, networking, and cooling are pre-installed in a standard shipping container. To expand capacity, a company simply ships a new container to the site, plugs in power, water (for cooling), and internet connectivity. This “plug-and-play” data center approach features highly optimized airflow and cooling designs, allowing for rapid expansion.&lt;/p&gt;

&lt;h3 id=&quot;distributed-data-centers&quot;&gt;Distributed Data Centers&lt;/h3&gt;
&lt;p&gt;For global services, a single data center is insufficient due to the laws of physics. &lt;strong&gt;Latency&lt;/strong&gt;—the time it takes for data to travel—is limited by the speed of light. Users in Asia accessing a US data center will experience noticeable lag. Furthermore, relying on a single location creates risk; a natural disaster could wipe out the entire service. To solve this, companies deploy a &lt;strong&gt;global network of distributed data centers&lt;/strong&gt;, replicating data across regions and routing user traffic to the nearest location (the “Edge”). This ensures both high performance for users and resilience for the business.&lt;/p&gt;
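
&lt;p&gt;The speed-of-light limit can be turned into a rough lower bound on round-trip time. The sketch below is illustrative only: the 10,000 km distance is a hypothetical intercontinental path, and the factor of two thirds for light in optical fiber is an approximation. Real latency is higher because of routing, queuing, and processing.&lt;/p&gt;

```python
SPEED_OF_LIGHT_KM_PER_S = 299_792.458  # speed of light in vacuum
FIBER_FACTOR = 2 / 3                   # assumption: light in fiber travels at roughly 2/3 of c


def min_round_trip_ms(distance_km: float) -> float:
    """Lower bound on round-trip time over fiber, in milliseconds."""
    one_way_seconds = distance_km / (SPEED_OF_LIGHT_KM_PER_S * FIBER_FACTOR)
    return 2 * one_way_seconds * 1000


# Hypothetical 10,000 km path versus a nearby 100 km data center.
print(f"10,000 km: at least {min_round_trip_ms(10_000):.0f} ms round trip")
print(f"   100 km: at least {min_round_trip_ms(100):.0f} ms round trip")
```

&lt;p&gt;No amount of hardware removes this floor, which is why serving users from a nearby region matters.&lt;/p&gt;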
</content>
 </entry>
 
 <entry>
   <title>01-04 Benefits and Challenges of Cloud Computing</title>
   <link href="https://nglelinh.github.io/contents/en/chapter01/01_04_Benefits_and_Challenges/"/>
   <updated>2021-01-01T00:00:00+00:00</updated>
   <id>https://nglelinh.github.io/service-oriented-architecture-and-cloud-computing-iuh/contents/en/chapter01/01_04_Benefits_and_Challenges</id>
   <content type="html">&lt;p&gt;Cloud computing offers transformative benefits that have revolutionized how organizations approach IT infrastructure and application development. However, like any significant technological shift, it also presents challenges that must be carefully considered and addressed. Understanding both sides is crucial for making informed decisions about cloud adoption.&lt;/p&gt;

&lt;h2 id=&quot;benefits-of-cloud-computing&quot;&gt;Benefits of Cloud Computing&lt;/h2&gt;

&lt;p&gt;Cloud computing provides tangible strategic value across several dimensions, from financial efficiency to technical agility.&lt;/p&gt;

&lt;h3 id=&quot;1-cost-efficiency-and-financial-advantages&quot;&gt;1. Cost Efficiency and Financial Advantages&lt;/h3&gt;
&lt;p&gt;One of the most compelling arguments for cloud adoption is the shift from &lt;strong&gt;Capital Expenditure (CapEx)&lt;/strong&gt; to &lt;strong&gt;Operational Expenditure (OpEx)&lt;/strong&gt;. In a traditional model, companies had to make significant upfront investments in hardware, software licenses, and data center facilities—often over-provisioning infrastructure to handle potential peak loads that might only occur once a year. Cloud computing eliminates this burden.&lt;/p&gt;

&lt;p&gt;Instead, IT becomes a utility like electricity: you pay only for what you consume. This “pay-as-you-go” model frees up capital for other strategic investments and aligns infrastructure costs directly with business usage. Additionally, cloud providers achieve massive economies of scale, purchasing hardware at volumes that individual enterprises cannot match, and passing those savings on to customers.&lt;/p&gt;

&lt;h3 id=&quot;2-scalability-and-elasticity&quot;&gt;2. Scalability and Elasticity&lt;/h3&gt;
&lt;p&gt;Cloud platforms provide scalability that is hard to match on-premises. Through &lt;strong&gt;horizontal scaling&lt;/strong&gt; (adding more servers) or &lt;strong&gt;vertical scaling&lt;/strong&gt; (adding more power to an existing server), organizations can respond to demand changes within minutes. This means a retailer can handle the massive influx of traffic on Black Friday without crashing, and then scale back down to save money on quiet days. This elasticity keeps performance consistent while minimizing the idle capacity you pay for.&lt;/p&gt;
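
&lt;p&gt;A minimal sketch of the horizontal-scaling decision behind elasticity is shown below. The request rates and the per-server capacity are made-up numbers, and &lt;code&gt;servers_needed&lt;/code&gt; is a hypothetical helper, not a provider API.&lt;/p&gt;

```python
import math


def servers_needed(requests_per_second: float, capacity_per_server: float) -> int:
    """Smallest number of identical servers that covers the given demand."""
    if not capacity_per_server > 0:
        raise ValueError("capacity_per_server must be positive")
    # Always keep at least one server running, even with no traffic.
    return max(1, math.ceil(requests_per_second / capacity_per_server))


CAPACITY = 500  # hypothetical: requests per second one server can handle

print("quiet day:   ", servers_needed(2_000, CAPACITY), "servers")
print("Black Friday:", servers_needed(20_000, CAPACITY), "servers")
```

&lt;p&gt;Real auto-scaling policies usually react to metrics such as CPU utilization and add a safety margin, but the underlying calculation is the same.&lt;/p&gt;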

&lt;h3 id=&quot;3-flexibility-and-agility&quot;&gt;3. Flexibility and Agility&lt;/h3&gt;
&lt;p&gt;The cloud enables rapid deployment. In a traditional data center, provisioning a new server could take weeks. In the cloud, developers can spin up a complete customized environment in minutes, drastically reducing the time-to-market for new applications. This agility fosters innovation, allowing teams to experiment with new technologies (like AI or IoT) without the risk of expensive hardware procurement.&lt;/p&gt;

&lt;h3 id=&quot;4-reliability-and-availability&quot;&gt;4. Reliability and Availability&lt;/h3&gt;
&lt;p&gt;Major cloud providers offer reliability that is difficult for a single enterprise to match. With massive global networks, data is often replicated across multiple geographic regions and “Availability Zones.” Cloud services are designed for &lt;strong&gt;High Availability (HA)&lt;/strong&gt; and &lt;strong&gt;Disaster Recovery (DR)&lt;/strong&gt;. If a physical server fails, the system automatically migrates your workload to a healthy instance, often without the user ever noticing a disruption. Service Level Agreements (SLAs) guarantee uptime, often reaching 99.99% or higher.&lt;/p&gt;
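
&lt;p&gt;To make an uptime percentage concrete, it helps to convert it into allowed downtime. The short sketch below does this for a few availability levels; the function name is hypothetical and a 365-day year is assumed.&lt;/p&gt;

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Maximum yearly downtime that still meets the availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for sla in (99.9, 99.99, 99.999):
    print(f"{sla}% uptime allows about {allowed_downtime_minutes(sla):.1f} minutes of downtime per year")
```

&lt;p&gt;At 99.99%, the budget is roughly 52.6 minutes per year, so each extra “nine” cuts the permitted downtime by a factor of ten.&lt;/p&gt;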

&lt;h3 id=&quot;5-security-and-compliance&quot;&gt;5. Security and Compliance&lt;/h3&gt;
&lt;p&gt;While security is often cited as a concern, major cloud providers invest billions in security infrastructure that exceeds what most individual companies can afford. They employ world-class security experts and adhere to strict compliance certifications (such as ISO 27001, SOC 2, and HIPAA). The &lt;strong&gt;Shared Responsibility Model&lt;/strong&gt; ensures that while the provider secures the “cloud” (the physical infrastructure), the customer secures what is “in the cloud” (data and applications), creating a robust security partnership.&lt;/p&gt;

&lt;h2 id=&quot;challenges-of-cloud-computing&quot;&gt;Challenges of Cloud Computing&lt;/h2&gt;

&lt;p&gt;Despite its benefits, cloud computing introduces specific challenges that must be managed to ensure a successful implementation.&lt;/p&gt;

&lt;h3 id=&quot;1-security-and-privacy-concerns&quot;&gt;1. Security and Privacy Concerns&lt;/h3&gt;
&lt;p&gt;Entrusting sensitive data to a third-party provider requires a strategic leap of faith. While providers secure the infrastructure, the risk of data breaches often shifts to &lt;strong&gt;customer misconfiguration&lt;/strong&gt;—such as leaving a storage bucket public or failing to implement proper access controls. Furthermore, the multi-tenant nature of the cloud raises theoretical concerns about data isolation, although serious inter-tenant exploits are extremely rare in practice.&lt;/p&gt;

&lt;h3 id=&quot;2-downtime-and-internet-dependency&quot;&gt;2. Downtime and Internet Dependency&lt;/h3&gt;
&lt;p&gt;Cloud services rely entirely on internet connectivity. A network outage at your office means you cannot access your critical applications. Additionally, even the largest cloud providers experience outages due to technical errors, software bugs, or cyberattacks. These outages can impact thousands of customers simultaneously, potentially taking down widespread services for hours.&lt;/p&gt;

&lt;h3 id=&quot;3-limited-control-and-vendor-lock-in&quot;&gt;3. Limited Control and Vendor Lock-in&lt;/h3&gt;
&lt;p&gt;When you build your application using a provider’s proprietary tools (e.g., a specific database service like AWS DynamoDB or a messaging system like Azure Service Bus), you risk &lt;strong&gt;vendor lock-in&lt;/strong&gt;. Moving that application to another provider later can be difficult and expensive, requiring significant code rewriting. You also surrender some control over backend infrastructure upgrades and maintenance windows, which are managed by the provider.&lt;/p&gt;

&lt;h3 id=&quot;4-cost-management-and-bill-shock&quot;&gt;4. Cost Management and “Bill Shock”&lt;/h3&gt;
&lt;p&gt;While the cloud can save money, it can also lead to runaway costs if not carefully monitored. The ease of spinning up resources means developers might start servers and forget to shut them down. “Shadow IT”—where departments purchase cloud services without IT approval—can also lead to budget overruns. Without proper governance and monitoring (often called &lt;strong&gt;FinOps&lt;/strong&gt;), the monthly bill can be shockingly higher than anticipated.&lt;/p&gt;

&lt;h3 id=&quot;5-data-sovereignty-and-legal-issues&quot;&gt;5. Data Sovereignty and Legal Issues&lt;/h3&gt;
&lt;p&gt;Data stored in the cloud may physically reside in servers across different countries, each with its own laws regarding data access and privacy. For example, the GDPR in Europe imposes strict rules on personal data processing, which might conflict with laws in other jurisdictions where the data is stored. Organizations must ensure that their data placement strategies comply with all relevant local and international regulations.&lt;/p&gt;

&lt;h2 id=&quot;risk-mitigation-strategies&quot;&gt;Risk Mitigation Strategies&lt;/h2&gt;

&lt;p&gt;To navigate these challenges, organizations employ several strategic approaches:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Multi-Cloud Strategy&lt;/strong&gt;: Using services from different providers (e.g., AWS for compute, Google for analytics) to avoid lock-in and increase redundancy.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;FinOps&lt;/strong&gt;: Implementing financial operations practices to monitor cloud spend in real-time, enforce accountability, and optimize costs.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Zero Trust Security&lt;/strong&gt;: Adopting a security model that strictly verifies every person and device trying to access resources, regardless of whether they are sitting within or outside of the network perimeter.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Cloud computing provides tangible advantages in cost, agility, and innovation, but it is not a magic bullet. Success requires a well-thought-out strategy that leverages the benefits while actively managing the risks of security, cost, and lock-in. Organizations must develop new skills and governance models to thrive in this new environment.&lt;/p&gt;

&lt;p&gt;In the next chapter, we’ll explore specific cloud technologies and services that enable organizations to realize these benefits while addressing the associated challenges.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>01-03 Cloud Deployment Models</title>
   <link href="https://nglelinh.github.io/contents/en/chapter01/01_03_Deployment_Models/"/>
   <updated>2021-01-01T00:00:00+00:00</updated>
   <id>https://nglelinh.github.io/service-oriented-architecture-and-cloud-computing-iuh/contents/en/chapter01/01_03_Deployment_Models</id>
   <content type="html">&lt;p&gt;Cloud deployment models define how cloud infrastructure is deployed, who has access to it, and how it’s managed. Understanding these models is crucial for organizations to choose the right cloud strategy that aligns with their security, compliance, and business requirements.&lt;/p&gt;

&lt;h2 id=&quot;overview-of-deployment-models&quot;&gt;Overview of Deployment Models&lt;/h2&gt;

&lt;p&gt;The four primary cloud deployment models each offer different levels of control, security, and cost considerations:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;┌─────────────────┬─────────────────┬─────────────────┬─────────────────┐
│   Public Cloud  │  Private Cloud  │  Hybrid Cloud   │ Community Cloud │
│ Shared          │ Dedicated       │ Mixed           │ Shared by Group │
│ Multi-tenant    │ Single-tenant   │ Best of Both    │ Common Interests│
│ Cost-effective  │ High Control    │ Flexible        │ Cost Sharing    │
│ Scalable        │ Secure          │ Complex         │ Specialized     │
└─────────────────┴─────────────────┴─────────────────┴─────────────────┘
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;public-cloud&quot;&gt;Public Cloud&lt;/h2&gt;

&lt;p&gt;Public cloud creates a shared environment where computing resources are accessible to the general public over the internet.&lt;/p&gt;

&lt;h3 id=&quot;definition-and-characteristics&quot;&gt;Definition and Characteristics&lt;/h3&gt;
&lt;p&gt;In a public cloud, third-party providers like AWS, Microsoft Azure, and Google Cloud Platform own and operate the infrastructure. They deliver computing resources—servers, storage, and applications—over the internet. Multiple organizations, or “tenants,” share the same physical hardware, though their data remains logically isolated. This &lt;strong&gt;shared infrastructure&lt;/strong&gt; model allows for massive economies of scale, making public clouds highly cost-effective.&lt;/p&gt;

&lt;h3 id=&quot;advantages-and-disadvantages&quot;&gt;Advantages and Disadvantages&lt;/h3&gt;
&lt;p&gt;The primary appeal of the public cloud is &lt;strong&gt;cost efficiency&lt;/strong&gt;. With no upfront capital investment required for hardware, businesses can treat IT as an operational expense, paying only for what they use. It offers virtually unlimited &lt;strong&gt;scalability&lt;/strong&gt;, allowing you to spin up thousands of servers in minutes to handle traffic spikes. However, this model comes with trade-offs. You have &lt;strong&gt;limited control&lt;/strong&gt; over where your data physically resides and how the underlying infrastructure is configured, which can be a concern for highly regulated industries. Additionally, since resources are shared, there is a theoretical risk of “noisy neighbors” affecting performance, although modern hypervisors have largely mitigated this.&lt;/p&gt;

&lt;h3 id=&quot;use-cases&quot;&gt;Use Cases&lt;/h3&gt;
&lt;p&gt;Public cloud is the default choice for most modern applications, including web servers, development environments, and data analytics platforms. It is ideal for startups needing to launch quickly and enterprises looking to offload variable workloads.&lt;/p&gt;

&lt;h2 id=&quot;private-cloud&quot;&gt;Private Cloud&lt;/h2&gt;

&lt;p&gt;Private cloud offers a dedicated environment where computing resources are used exclusively by a single business or organization.&lt;/p&gt;

&lt;h3 id=&quot;definition-and-characteristics-1&quot;&gt;Definition and Characteristics&lt;/h3&gt;
&lt;p&gt;A private cloud can be physically located at your organization’s on-site data center or hosted by a third-party service provider. Regardless of location, the key distinction is that the services and infrastructure are maintained on a private network dedicated solely to your organization. This model provides the highest level of security and control, as resources are not shared with other tenants.&lt;/p&gt;

&lt;h3 id=&quot;types-of-private-cloud&quot;&gt;Types of Private Cloud&lt;/h3&gt;
&lt;p&gt;Private clouds can take several forms. An &lt;strong&gt;On-Premises Private Cloud&lt;/strong&gt; is hosted within your own data center, giving you complete control but requiring significant internal expertise to manage the virtualization stack (e.g., VMware, OpenStack). A &lt;strong&gt;Hosted Private Cloud&lt;/strong&gt; involves renting dedicated servers from a provider who manages the hardware for you. A &lt;strong&gt;Virtual Private Cloud (VPC)&lt;/strong&gt; is a hybrid concept where a public cloud provider creates a logically isolated section of their public cloud for your exclusive use, bridging the gap between public and private models.&lt;/p&gt;

&lt;h3 id=&quot;advantages-and-disadvantages-1&quot;&gt;Advantages and Disadvantages&lt;/h3&gt;
&lt;p&gt;The main advantage of a private cloud is &lt;strong&gt;security and control&lt;/strong&gt;. You can customize the environment to meet specific regulatory requirements (like HIPAA or GDPR) and ensure predictable performance. However, this comes at a steep price: &lt;strong&gt;high cost&lt;/strong&gt;. Building an on-premises private cloud requires substantial capital investment in hardware and ongoing operational costs for power, cooling, and IT staff. It also lacks the massive elasticity of the public cloud; if you run out of capacity, you must physically buy and install more servers.&lt;/p&gt;

&lt;h3 id=&quot;use-cases-1&quot;&gt;Use Cases&lt;/h3&gt;
&lt;p&gt;Private clouds are often necessary for highly regulated industries such as &lt;strong&gt;finance, healthcare, and government&lt;/strong&gt;, where data privacy laws strictly control where and how data is stored. They are also used for mission-critical legacy applications that require specific hardware configurations not available in the public cloud.&lt;/p&gt;

&lt;h2 id=&quot;hybrid-cloud&quot;&gt;Hybrid Cloud&lt;/h2&gt;

&lt;p&gt;Hybrid cloud combines public and private clouds, bound together by technology that allows data and applications to be shared between them.&lt;/p&gt;

&lt;h3 id=&quot;definition-and-characteristics-2&quot;&gt;Definition and Characteristics&lt;/h3&gt;
&lt;p&gt;A hybrid cloud gives you the “best of both worlds” by creating a unified environment. You can keep sensitive data and critical applications in your secure private cloud while leveraging the public cloud’s computational power for less sensitive tasks. For this to work, there must be seamless connectivity and orchestration between the two environments, often achieved through VPNs, Direct Connect links, or container orchestration platforms like Kubernetes.&lt;/p&gt;

&lt;h3 id=&quot;architecture-patterns&quot;&gt;Architecture Patterns&lt;/h3&gt;
&lt;p&gt;One common pattern is &lt;strong&gt;Cloud Bursting&lt;/strong&gt;. An application runs in a private cloud during normal operations but “bursts” into the public cloud during peak demand to handle the overflow traffic. Another pattern is &lt;strong&gt;Data Tiering&lt;/strong&gt;, where sensitive customer data is stored on-premises for compliance, while anonymized data is sent to the public cloud for machine learning analysis.&lt;/p&gt;
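
&lt;p&gt;The cloud bursting pattern can be sketched as a simple split of demand between the two environments. The capacity and demand figures below are hypothetical, and &lt;code&gt;split_load&lt;/code&gt; is an illustrative helper rather than part of any orchestration tool.&lt;/p&gt;

```python
def split_load(demand: int, private_capacity: int) -> tuple[int, int]:
    """Return (handled_in_private_cloud, burst_to_public_cloud).

    The private cloud is filled first; only the overflow goes public.
    """
    handled_privately = min(demand, private_capacity)
    return handled_privately, demand - handled_privately


PRIVATE_CAPACITY = 1000  # hypothetical: requests per second the private cloud can serve

print(split_load(800, PRIVATE_CAPACITY))   # normal operations: (800, 0)
print(split_load(1500, PRIVATE_CAPACITY))  # peak demand: (1000, 500)
```

&lt;p&gt;A real implementation also has to decide which requests are safe to send to the public cloud, since sensitive workloads may need to stay on-premises.&lt;/p&gt;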

&lt;h3 id=&quot;advantages-and-disadvantages-2&quot;&gt;Advantages and Disadvantages&lt;/h3&gt;
&lt;p&gt;The hybrid model offers unparalleled &lt;strong&gt;flexibility&lt;/strong&gt;. You can optimize costs by using public cloud resources for temporary workloads while maintaining compliance for sensitive data on-premises. It allows for a gradual migration strategy, moving workloads to the cloud at your own pace. However, it is the most &lt;strong&gt;complex&lt;/strong&gt; model to manage. It requires sophisticated networking, consistent security policies across different environments, and a high level of technical expertise to ensure interoperability.&lt;/p&gt;

&lt;h2 id=&quot;community-cloud&quot;&gt;Community Cloud&lt;/h2&gt;

&lt;p&gt;Community cloud is a collaborative effort where infrastructure is shared between several organizations from a specific community with common concerns.&lt;/p&gt;

&lt;h3 id=&quot;definition-and-characteristics-3&quot;&gt;Definition and Characteristics&lt;/h3&gt;
&lt;p&gt;In a community cloud, the infrastructure is shared by several organizations that have shared concerns (e.g., mission, security requirements, policy, and compliance considerations). It can be managed by the organizations themselves or a third party. This model sits somewhere between public and private: it is not open to everyone, but it is not restricted to just one organization.&lt;/p&gt;

&lt;h3 id=&quot;advantages-and-disadvantages-3&quot;&gt;Advantages and Disadvantages&lt;/h3&gt;
&lt;p&gt;The key benefit is &lt;strong&gt;cost sharing&lt;/strong&gt;. Organizations with similar needs can pool their resources to build high-quality infrastructure that would be too expensive individually. It fosters &lt;strong&gt;collaboration&lt;/strong&gt; and ensures that all members meet the same industry-specific standards. The downside is &lt;strong&gt;shared governance&lt;/strong&gt;, which can lead to conflicts over resource allocation and policy updates.&lt;/p&gt;

&lt;h3 id=&quot;use-cases-2&quot;&gt;Use Cases&lt;/h3&gt;
&lt;p&gt;Community clouds are common in &lt;strong&gt;government&lt;/strong&gt;, where different agencies share resources on a secure network. They are also found in &lt;strong&gt;healthcare&lt;/strong&gt; (sharing patient records among hospitals) and &lt;strong&gt;academic research&lt;/strong&gt; (universities sharing high-performance computing clusters).&lt;/p&gt;

&lt;h2 id=&quot;choosing-the-right-deployment-model&quot;&gt;Choosing the Right Deployment Model&lt;/h2&gt;

&lt;p&gt;Choosing the right deployment model is a strategic decision that balances cost, control, and compliance.&lt;/p&gt;

&lt;h3 id=&quot;decision-framework&quot;&gt;Decision Framework&lt;/h3&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Public Cloud&lt;/strong&gt;: Choose this for general-purpose workloads, web applications, and when cost and speed are primary drivers.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Private Cloud&lt;/strong&gt;: Choose this if you have strict regulatory requirements, need absolute control over data sovereignty, or have predictable, consistent workloads.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Hybrid Cloud&lt;/strong&gt;: Choose this if you need to keep some data on-premises for compliance but want the scalability of the public cloud for other parts of your application.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Community Cloud&lt;/strong&gt;: Choose this if you are part of a consortium or industry group with shared compliance and infrastructure needs.&lt;/li&gt;
&lt;/ul&gt;
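
&lt;p&gt;The framework above can be written down as a small rule-based function. This is only a sketch: the three yes/no questions and the order in which they are checked are assumptions made for illustration, and a real decision would weigh cost, skills, and existing investments as well.&lt;/p&gt;

```python
def suggest_deployment_model(
    strict_regulation: bool,
    shared_consortium: bool,
    needs_public_scalability: bool,
) -> str:
    """Map three simplified yes/no answers to a deployment model."""
    if shared_consortium:
        # Part of an industry group with shared compliance needs.
        return "Community Cloud"
    if strict_regulation and needs_public_scalability:
        # Keep regulated data on-premises, scale the rest publicly.
        return "Hybrid Cloud"
    if strict_regulation:
        return "Private Cloud"
    # General-purpose workloads where cost and speed dominate.
    return "Public Cloud"


print(suggest_deployment_model(False, False, True))  # Public Cloud
print(suggest_deployment_model(True, False, True))   # Hybrid Cloud
```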

&lt;h2 id=&quot;future-trends-in-deployment-models&quot;&gt;Future Trends in Deployment Models&lt;/h2&gt;

&lt;p&gt;The landscape is evolving toward &lt;strong&gt;Multi-Cloud&lt;/strong&gt;, where organizations use services from multiple public cloud providers (e.g., using AWS for compute and Google Cloud for AI) to avoid vendor lock-in. &lt;strong&gt;Edge Computing&lt;/strong&gt; is also rising, pushing cloud capabilities closer to the data source (like IoT devices) to reduce latency.&lt;/p&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Understanding cloud deployment models is essential for making informed decisions about cloud strategy. Each model offers different trade-offs in terms of cost, control, security, and complexity. The choice depends on your organization’s specific requirements, including data sensitivity, budget, and scalability needs.&lt;/p&gt;

&lt;p&gt;In the next lesson, we’ll explore the benefits and challenges of cloud computing, helping you understand the full impact of cloud adoption on your organization.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>01-02 Cloud Service Models</title>
   <link href="https://nglelinh.github.io/contents/en/chapter01/01_02_Service_Models/"/>
   <updated>2021-01-01T00:00:00+00:00</updated>
   <id>https://nglelinh.github.io/service-oriented-architecture-and-cloud-computing-iuh/contents/en/chapter01/01_02_Service_Models</id>
   <content type="html">&lt;p&gt;Cloud computing services are typically categorized into three primary service models, each offering different levels of control, flexibility, and management responsibility. Understanding these models is crucial for selecting the right cloud strategy for your organization’s needs.&lt;/p&gt;

&lt;h2 id=&quot;the-cloud-service-stack&quot;&gt;The Cloud Service Stack&lt;/h2&gt;

&lt;p&gt;The cloud service models can be visualized as a stack, where each layer builds upon the previous one:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;┌─────────────────────────────────────┐
│        Software as a Service        │  ← SaaS
│              (SaaS)                 │
├─────────────────────────────────────┤
│       Platform as a Service         │  ← PaaS
│              (PaaS)                 │
├─────────────────────────────────────┤
│     Infrastructure as a Service     │  ← IaaS
│              (IaaS)                 │
├─────────────────────────────────────┤
│        Physical Infrastructure      │  ← On-Premises
└─────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;infrastructure-as-a-service-iaas&quot;&gt;Infrastructure as a Service (IaaS)&lt;/h2&gt;

&lt;p&gt;Infrastructure as a Service (IaaS) is the foundation of cloud computing. It provides virtualized computing resources over the internet, allowing businesses to rent rather than buy IT infrastructure.&lt;/p&gt;

&lt;h3 id=&quot;definition-and-core-concept&quot;&gt;Definition and Core Concept&lt;/h3&gt;
&lt;p&gt;At its core, IaaS offers the most basic building blocks of cloud computing: virtual machines (VMs), storage, networks, and operating systems. Instead of purchasing physical servers and housing them in an on-premises data center, organizations provision these resources on a pay-as-you-go basis from a cloud provider. This model gives you the highest level of flexibility and management control over your IT resources, effectively mimicking a traditional data center but in a virtualized environment.&lt;/p&gt;

&lt;h3 id=&quot;key-components&quot;&gt;Key Components&lt;/h3&gt;
&lt;p&gt;A typical IaaS environment consists of several key components. &lt;strong&gt;Compute resources&lt;/strong&gt; are the workhorses, ranging from standard virtual machines to bare metal servers. &lt;strong&gt;Storage services&lt;/strong&gt; provide scalable options for different needs, such as block storage for databases or object storage for vast amounts of unstructured data like backups and media files. &lt;strong&gt;Networking&lt;/strong&gt; capabilities allow you to define your own virtual network topology, including subnets, route tables, and firewalls, just as you would with physical switches and routers.&lt;/p&gt;

&lt;h3 id=&quot;responsibility-model&quot;&gt;Responsibility Model&lt;/h3&gt;
&lt;p&gt;In the IaaS model, the cloud provider manages the underlying physical infrastructure—the host servers, virtualization layer, storage hardware, and physical networking. However, you, the customer, are responsible for everything above the hypervisor. This includes the operating system, middleware, runtime environment, data, and applications. Crucially, you are also responsible for the security configuration of these components, such as patching the OS and configuring firewalls.&lt;/p&gt;

&lt;h3 id=&quot;use-cases&quot;&gt;Use Cases&lt;/h3&gt;
&lt;p&gt;IaaS is particularly well-suited for scenarios requiring granular control. It is ideal for &lt;strong&gt;development and testing&lt;/strong&gt;, as teams can spin up temporary environments in minutes and dismantle them just as quickly. It supports &lt;strong&gt;disaster recovery&lt;/strong&gt; strategies by allowing you to replicate critical infrastructure in a different geographic region without the cost of a second physical site. Additionally, &lt;strong&gt;High-Performance Computing (HPC)&lt;/strong&gt; workloads, which often require specific hardware configurations for scientific simulations or financial modeling, thrive on the scalable compute power of IaaS.&lt;/p&gt;

&lt;h3 id=&quot;advantages-and-disadvantages&quot;&gt;Advantages and Disadvantages&lt;/h3&gt;
&lt;p&gt;The primary advantage of IaaS is &lt;strong&gt;control&lt;/strong&gt;. You have complete freedom to configure the environment to your exact specifications. It avoids the large capital expenditure of buying hardware and allows for rapid scaling. However, this freedom comes with a burden: &lt;strong&gt;management overhead&lt;/strong&gt;. Your team must possess the technical expertise to manage operating systems, security patches, and network configurations, which can be complex and time-consuming.&lt;/p&gt;

&lt;h2 id=&quot;platform-as-a-service-paas&quot;&gt;Platform as a Service (PaaS)&lt;/h2&gt;

&lt;p&gt;Platform as a Service (PaaS) removes the burden of managing the underlying infrastructure, allowing you to focus on application development.&lt;/p&gt;

&lt;h3 id=&quot;definition-and-core-concept-1&quot;&gt;Definition and Core Concept&lt;/h3&gt;
&lt;p&gt;PaaS provides a complete development and deployment environment in the cloud. It includes not just the infrastructure (servers, storage, and networking) but also the middleware, development tools, business intelligence services, database management systems, and more. This model is designed to support the complete web application lifecycle: building, testing, deploying, managing, and updating.&lt;/p&gt;

&lt;h3 id=&quot;key-components-1&quot;&gt;Key Components&lt;/h3&gt;
&lt;p&gt;PaaS offerings typically include a suite of &lt;strong&gt;development tools&lt;/strong&gt; and &lt;strong&gt;runtime environments&lt;/strong&gt; that support various programming languages like Java, Python, and Node.js. They often provide managed &lt;strong&gt;database services&lt;/strong&gt; (both SQL and NoSQL), &lt;strong&gt;caching layers&lt;/strong&gt;, and &lt;strong&gt;message queues&lt;/strong&gt;, removing the need to install and configure these complex systems manually. Furthermore, PaaS solutions usually come with built-in &lt;strong&gt;deployment pipelines (CI/CD)&lt;/strong&gt; and &lt;strong&gt;auto-scaling&lt;/strong&gt; capabilities, ensuring your application can handle traffic spikes without manual intervention.&lt;/p&gt;

&lt;h3 id=&quot;responsibility-model-1&quot;&gt;Responsibility Model&lt;/h3&gt;
&lt;p&gt;The responsibility shift in PaaS is significant. The cloud provider takes on the management of the operating system, middleware, and runtime environment, in addition to the physical infrastructure. Your responsibility usually narrows to just two things: your &lt;strong&gt;applications&lt;/strong&gt; and your &lt;strong&gt;data&lt;/strong&gt;. This allows developers to focus on writing code rather than patching servers.&lt;/p&gt;

&lt;h3 id=&quot;use-cases-1&quot;&gt;Use Cases&lt;/h3&gt;
&lt;p&gt;PaaS is the go-to model for &lt;strong&gt;web and mobile application development&lt;/strong&gt;. It allows diverse teams to collaborate on projects regardless of their physical location. It is also excellent for implementing &lt;strong&gt;APIs and microservices&lt;/strong&gt;, where small, independent components can be deployed and managed easily.&lt;/p&gt;

&lt;h3 id=&quot;advantages-and-disadvantages-1&quot;&gt;Advantages and Disadvantages&lt;/h3&gt;
&lt;p&gt;The biggest benefit of PaaS is &lt;strong&gt;speed&lt;/strong&gt;. It significantly reduces the time to market by handling the “plumbing” of application delivery. It reduces development complexity and offers built-in scalability. On the downside, PaaS can lead to &lt;strong&gt;vendor lock-in&lt;/strong&gt;, as applications might be built using proprietary tools or APIs that are difficult to migrate to another platform. You also have &lt;strong&gt;less control&lt;/strong&gt; over the underlying environment, which might be a constraint for applications with very specific system-level requirements.&lt;/p&gt;

&lt;h2 id=&quot;software-as-a-service-saas&quot;&gt;Software as a Service (SaaS)&lt;/h2&gt;

&lt;p&gt;Software as a Service (SaaS) is the most familiar model for end-users, delivering fully functional applications over the internet.&lt;/p&gt;

&lt;h3 id=&quot;definition-and-core-concept-2&quot;&gt;Definition and Core Concept&lt;/h3&gt;
&lt;p&gt;SaaS allows users to connect to and use cloud-based apps over the Internet. Common examples are email, calendaring, and office tools. In this model, the cloud provider manages the entire technology stack—from the physical servers up to the application code itself. Users typically access the software via a web browser or a lightweight client app, usually on a subscription basis.&lt;/p&gt;

&lt;h3 id=&quot;key-characteristics&quot;&gt;Key Characteristics&lt;/h3&gt;
&lt;p&gt;SaaS is defined by &lt;strong&gt;multi-tenancy&lt;/strong&gt;, where a single instance of the software serves multiple customers (tenants) while keeping their data isolated. It typically operates on a &lt;strong&gt;subscription model&lt;/strong&gt; (monthly or annual fees) and features &lt;strong&gt;automatic updates&lt;/strong&gt;. Users always have access to the latest version of the software without needing to download patches or perform upgrades.&lt;/p&gt;
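
&lt;p&gt;As a rough illustration of multi-tenancy, the sketch below keeps one shared application object and scopes every read and write by a tenant identifier. The class name &lt;code&gt;MultiTenantNotes&lt;/code&gt; and its methods are hypothetical and exist only for this example; real SaaS products enforce isolation at many more layers, such as authentication and the database.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;class MultiTenantNotes:
    '''One shared application instance that serves many tenants (hypothetical).'''

    def __init__(self):
        # Maps tenant_id to that tenant's private list of notes.
        self._notes_by_tenant = {}

    def add_note(self, tenant_id, text):
        # Every write is stored under the caller's tenant_id.
        self._notes_by_tenant.setdefault(tenant_id, []).append(text)

    def list_notes(self, tenant_id):
        # Every read is filtered by tenant_id; a copy protects the stored list.
        return list(self._notes_by_tenant.get(tenant_id, []))


app = MultiTenantNotes()
app.add_note('acme', 'Renew contract')
app.add_note('globex', 'Ship order 17')
print(app.list_notes('acme'))  # only the notes written by tenant 'acme'
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Both tenants use the same running instance, yet neither can list the notes of the other because no method returns data without a tenant identifier.&lt;/p&gt;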

&lt;h3 id=&quot;responsibility-model-2&quot;&gt;Responsibility Model&lt;/h3&gt;
&lt;p&gt;In the SaaS model, the customer has the least amount of responsibility, primarily limited to &lt;strong&gt;managing their data&lt;/strong&gt; and &lt;strong&gt;user access&lt;/strong&gt;. The provider handles everything else: application software, security, databases, servers, and network infrastructure.&lt;/p&gt;
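
&lt;p&gt;The responsibility splits described for IaaS, PaaS, and SaaS can be summarized as a small lookup. The sketch below is only one way to encode them: the layer names and the helper functions &lt;code&gt;customer_layers&lt;/code&gt; and &lt;code&gt;responsible_party&lt;/code&gt; are assumptions made for illustration, and user access management (which stays with the customer in SaaS) is not modeled as a stack layer.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Stack layers ordered from the physical bottom to the top (illustrative names).
LAYERS = [
    'physical infrastructure',
    'virtualization',
    'operating system',
    'middleware',
    'runtime',
    'application',
    'data',
]

# Lowest layer that the customer manages in each service model.
FIRST_CUSTOMER_LAYER = {
    'IaaS': 'operating system',  # everything above the hypervisor
    'PaaS': 'application',       # applications and data
    'SaaS': 'data',              # data only (plus user access, not modeled here)
}


def customer_layers(model):
    '''Return the layers the customer is responsible for under a model.'''
    start = LAYERS.index(FIRST_CUSTOMER_LAYER[model])
    return LAYERS[start:]


def responsible_party(model, layer):
    '''Return 'customer' or 'provider' for one layer of one model.'''
    return 'customer' if layer in customer_layers(model) else 'provider'


for model in ('IaaS', 'PaaS', 'SaaS'):
    print(model, customer_layers(model))
&lt;/code&gt;&lt;/pre&gt;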

&lt;h3 id=&quot;categories-of-saas-applications&quot;&gt;Categories of SaaS Applications&lt;/h3&gt;
&lt;p&gt;SaaS spans a vast array of categories. &lt;strong&gt;Productivity suites&lt;/strong&gt; like Microsoft 365 and Google Workspace enable collaboration. &lt;strong&gt;Customer Relationship Management (CRM)&lt;/strong&gt; tools like Salesforce help businesses manage client interactions. &lt;strong&gt;Enterprise Resource Planning (ERP)&lt;/strong&gt; systems like NetSuite integrate core business processes. Even specialized creative tools like Adobe Creative Cloud are now delivered as SaaS.&lt;/p&gt;

&lt;h3 id=&quot;advantages-and-disadvantages-2&quot;&gt;Advantages and Disadvantages&lt;/h3&gt;
&lt;p&gt;SaaS removes the need for installation, maintenance, and hardware acquisition, making it extremely &lt;strong&gt;accessible&lt;/strong&gt; and easy to deploy. It provides predictable costs through subscriptions. However, it offers the &lt;strong&gt;least amount of control&lt;/strong&gt; and customization. You are bound by the features provided by the vendor, and &lt;strong&gt;data security&lt;/strong&gt; relies heavily on the provider’s measures.&lt;/p&gt;

&lt;h2 id=&quot;choosing-the-right-service-model&quot;&gt;Choosing the Right Service Model&lt;/h2&gt;

&lt;p&gt;Selecting the appropriate service model is a trade-off between control and convenience.&lt;/p&gt;

&lt;h3 id=&quot;decision-framework&quot;&gt;Decision Framework&lt;/h3&gt;
&lt;p&gt;When deciding which model to use, consider the following:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Choose &lt;strong&gt;IaaS&lt;/strong&gt; when you need maximum control, are migrating legacy applications that require specific OS configurations, or have a strong operations team.&lt;/li&gt;
  &lt;li&gt;Choose &lt;strong&gt;PaaS&lt;/strong&gt; when you are building new applications and want to optimize for development speed and minimize administrative overhead.&lt;/li&gt;
  &lt;li&gt;Choose &lt;strong&gt;SaaS&lt;/strong&gt; for standard business processes (email, CRM, HR) where building a custom solution would not provide a competitive advantage.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Many modern organizations adopt a &lt;strong&gt;hybrid approach&lt;/strong&gt;, utilizing SaaS for productivity, PaaS for new customer-facing apps, and IaaS for specialized workloads that need deep customization.&lt;/p&gt;
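
&lt;p&gt;The decision framework above can be sketched as a small function. This is a simplification, not a definitive rule: the function name &lt;code&gt;suggest_service_model&lt;/code&gt;, its three yes/no inputs, and the order in which the checks run are all assumptions chosen for illustration.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;def suggest_service_model(standard_business_process, needs_os_control, building_new_app):
    '''Map three yes/no answers to a service model (hypothetical heuristic).'''
    # Standard processes such as email or CRM rarely justify a custom build.
    if standard_business_process:
        return 'SaaS'
    # Legacy workloads with specific OS requirements need maximum control.
    if needs_os_control:
        return 'IaaS'
    # New applications optimized for development speed fit a managed platform.
    if building_new_app:
        return 'PaaS'
    # None of the listed criteria applies, so the sketch makes no recommendation.
    return 'undecided'


print(suggest_service_model(False, True, False))
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;An organization following the hybrid approach would call such a function once per workload rather than once for the whole company.&lt;/p&gt;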

&lt;h3 id=&quot;comparison-matrix&quot;&gt;Comparison Matrix&lt;/h3&gt;
&lt;p&gt;| Aspect | IaaS | PaaS | SaaS |
|——–|——|——|——|
| &lt;strong&gt;Control&lt;/strong&gt; | High | Medium | Low |
| &lt;strong&gt;Flexibility&lt;/strong&gt; | High | Medium | Low |
| &lt;strong&gt;Management Overhead&lt;/strong&gt; | High | Medium | Low |
| &lt;strong&gt;Time to Market&lt;/strong&gt; | Slow | Fast | Immediate |
| &lt;strong&gt;Customization&lt;/strong&gt; | High | Medium | Low |
| &lt;strong&gt;Cost Predictability&lt;/strong&gt; | Variable | Predictable | Predictable |
| &lt;strong&gt;Technical Expertise&lt;/strong&gt; | High | Medium | Low |&lt;/p&gt;

&lt;h2 id=&quot;future-trends-in-service-models&quot;&gt;Future Trends in Service Models&lt;/h2&gt;

&lt;p&gt;As cloud computing evolves, the lines between these models are blurring, and new models are emerging. &lt;strong&gt;Function as a Service (FaaS)&lt;/strong&gt;, or serverless computing, is gaining popularity as it abstracts even more infrastructure management than PaaS, executing code only in response to events. &lt;strong&gt;Container as a Service (CaaS)&lt;/strong&gt; sits between IaaS and PaaS, offering a managed environment for deploying containerized applications.&lt;/p&gt;
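
&lt;p&gt;To make the event-driven nature of FaaS more concrete, the sketch below imitates it in plain Python: functions are registered for an event type and run only when a matching event arrives. The names &lt;code&gt;on&lt;/code&gt;, &lt;code&gt;dispatch&lt;/code&gt;, and &lt;code&gt;make_thumbnail&lt;/code&gt;, as well as the event fields, are hypothetical and do not correspond to the API of any real FaaS platform.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Registry that maps an event type to the function handling it.
HANDLERS = {}


def on(event_type):
    '''Decorator that registers a function for one event type.'''
    def register(func):
        HANDLERS[event_type] = func
        return func
    return register


@on('file_uploaded')
def make_thumbnail(event):
    # Stand-in for real work; runs only when an upload event is dispatched.
    return 'thumbnail for ' + event['name']


def dispatch(event):
    '''Invoke the handler registered for the event, if there is one.'''
    handler = HANDLERS.get(event['type'])
    if handler is None:
        return None  # no function registered, so nothing runs
    return handler(event)


print(dispatch({'type': 'file_uploaded', 'name': 'cat.png'}))
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;On a real platform the provider plays the role of &lt;code&gt;dispatch&lt;/code&gt;, starting and stopping the execution environment around each invocation.&lt;/p&gt;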

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Understanding the three primary cloud service models—IaaS, PaaS, and SaaS—is fundamental to making informed decisions about cloud adoption. Each model offers different trade-offs between control, flexibility, and management overhead. The choice depends on your organization’s technical expertise, business requirements, and strategic objectives.&lt;/p&gt;

&lt;p&gt;In the next lesson, we’ll explore cloud deployment models and how they complement these service models to provide comprehensive cloud solutions.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>01-01 Essential Characteristics of Cloud Computing</title>
   <link href="https://nglelinh.github.io/contents/en/chapter01/01_01_Cloud_Computing_Characteristics/"/>
   <updated>2021-01-01T00:00:00+00:00</updated>
   <id>https://nglelinh.github.io/service-oriented-architecture-and-cloud-computing-iuh/contents/en/chapter01/01_01_Cloud_Computing_Characteristics</id>
   <content type="html">&lt;p&gt;The National Institute of Standards and Technology (NIST) defines five essential characteristics that distinguish cloud computing from traditional computing models. Understanding these characteristics is crucial for recognizing true cloud services and making informed decisions about cloud adoption. These features—On-demand self-service, Broad network access, Resource pooling, Rapid elasticity, and Measured service—collectively define what we know today as the “Cloud.”&lt;/p&gt;

&lt;h2 id=&quot;1-on-demand-self-service&quot;&gt;1. On-Demand Self-Service&lt;/h2&gt;

&lt;p&gt;On-demand self-service enables consumers to unilaterally provision computing capabilities, such as server time and network storage, automatically and without requiring human interaction with the service provider. In a traditional IT environment, requesting a new server often involved a lengthy process: submitting tickets, waiting for approvals from the finance department, and scheduling manual configuration by IT staff. In the cloud model, this friction is effectively eliminated.&lt;/p&gt;

&lt;p&gt;This characteristic empowers users to access resources immediately. Whether a developer needs a staging environment for a few hours or a data scientist requires a high-performance cluster for a complex simulation, they can obtain these resources within minutes—or even seconds—through a web-based dashboard or a programmable API. This level of automation and speed fundamentally shifts the focus from infrastructure procurement to innovation and deployment, giving users complete control over their resource lifecycle.&lt;/p&gt;

&lt;h3 id=&quot;business-impact&quot;&gt;Business Impact&lt;/h3&gt;
&lt;p&gt;The shift from weeks to minutes for resource provisioning dramatically accelerates time-to-market. Businesses can experiment with new ideas, “fail fast,” and iterate rapidly without the penalty of long lead times or sunk costs in unused hardware.&lt;/p&gt;

&lt;h2 id=&quot;2-broad-network-access&quot;&gt;2. Broad Network Access&lt;/h2&gt;

&lt;p&gt;Cloud capabilities are available over the network and accessed through standard mechanisms that promote use by heterogeneous client platforms. This means that cloud services are not bound to a specific physical location or a specialized device; they are accessible from anywhere with an internet connection, whether on mobile phones, tablets, laptops, or enterprise workstations.&lt;/p&gt;

&lt;p&gt;By relying on standard internet protocols such as HTTP and HTTPS, and on conventions built on top of them such as REST APIs, cloud services ensure pervasive access. This ubiquity supports modern work patterns, allowing remote teams to collaborate seamlessly and providing developers with the flexibility to build applications that serve users globally, regardless of their device or underlying operating system.&lt;/p&gt;

&lt;h3 id=&quot;practical-implications&quot;&gt;Practical Implications&lt;/h3&gt;
&lt;p&gt;This accessibility unifies the experience across different interfaces. A user might upload a file via a web browser, a mobile app might read that file, and a backend server might process it via an API call—all interacting with the same cloud storage service seamlessly over the internet.&lt;/p&gt;

&lt;h2 id=&quot;3-resource-pooling&quot;&gt;3. Resource Pooling&lt;/h2&gt;

&lt;p&gt;The provider’s computing resources are pooled to serve multiple consumers using a multi-tenant model, with different physical and virtual resources dynamically assigned and reassigned according to consumer demand. This concept is similar to how a utility company generates electricity for an entire city; individual customers don’t own the generator, they simply draw power from the shared grid.&lt;/p&gt;

&lt;h3 id=&quot;multi-tenancy-and-abstraction&quot;&gt;Multi-Tenancy and Abstraction&lt;/h3&gt;
&lt;p&gt;Under the hood, multiple customers (tenants) may share the same physical server, storage array, or network switch, yet they remain logically isolated and secure from one another. This “multi-tenancy” allows providers to achieve significant economies of scale, optimizing equipment usage and energy consumption. For the user, the physical location of the resource is usually abstracted away: they might specify a general region (e.g., “US East” or “Europe”) for latency or compliance reasons, but they rarely know or care about the exact rack or server where their application resides.&lt;/p&gt;

&lt;h2 id=&quot;4-rapid-elasticity&quot;&gt;4. Rapid Elasticity&lt;/h2&gt;

&lt;p&gt;Capabilities can be elastically provisioned and released, often automatically, to scale rapidly outward and inward commensurate with demand. To the consumer, the capabilities available for provisioning often appear to be unlimited and can be appropriated in any quantity at any time.&lt;/p&gt;

&lt;p&gt;This elasticity allows systems to adapt to workload changes in real time. For example, an e-commerce website can automatically “scale out” (add more web servers) during a Black Friday sale to handle the traffic surge. Conversely, once the event is over, the system can “scale in” (remove the extra servers). This dynamic adjustment keeps performance consistent during traffic peaks and reduces costs during valleys, largely removing the need to over-provision hardware for “worst-case” scenarios.&lt;/p&gt;
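
&lt;p&gt;The scale-out and scale-in decision can be sketched as a single calculation. The function &lt;code&gt;desired_servers&lt;/code&gt;, the per-server capacity, and the minimum and maximum fleet sizes below are assumptions for illustration; real autoscalers typically add cooldown periods and use several metrics.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import math


def desired_servers(requests_per_second, capacity_per_server,
                    min_servers=1, max_servers=20):
    '''Return how many servers the current load calls for (hypothetical policy).'''
    # Round up so that the fleet always covers the whole load.
    needed = math.ceil(requests_per_second / capacity_per_server)
    # Clamp the result between the configured floor and ceiling.
    return max(min_servers, min(max_servers, needed))


# Normal day, sale peak, and the quiet period after the sale.
for load in (250, 950, 50):
    print(load, desired_servers(load, capacity_per_server=100))
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;With a capacity of 100 requests per second per server, the three sample loads call for 3, 10, and 1 servers respectively, so the fleet grows for the peak and shrinks again afterwards.&lt;/p&gt;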

&lt;h2 id=&quot;5-measured-service&quot;&gt;5. Measured Service&lt;/h2&gt;

&lt;p&gt;Cloud systems automatically control and optimize resource use by leveraging a metering capability at some level of abstraction appropriate to the type of service. Just as you pay a utility bill for the water or electricity you consume, cloud computing introduces a pay-as-you-go model.&lt;/p&gt;

&lt;p&gt;Resource usage—whether it be distinct compute instances, storage volume, bandwidth, or number of active accounts—is constantly monitored, controlled, and reported. This transparency provides benefits for both the provider and the consumer. Providers can efficiently manage their infrastructure, while consumers get a clear, itemized view of their consumption. This granular metering enables cost transparency, robust chargeback mechanisms for internal budgeting, and the ability to optimize spending by identifying and shutting down unused resources.&lt;/p&gt;
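
&lt;p&gt;A chargeback report of the kind mentioned above boils down to multiplying metered quantities by unit prices and grouping the result by consumer. In the sketch below, the function &lt;code&gt;chargeback&lt;/code&gt;, the team names, the resource names, and the prices (in cents) are all invented for illustration.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from collections import defaultdict

# Hypothetical unit prices in cents per metered unit.
UNIT_PRICE_CENTS = {
    'vm_hours': 10,
    'storage_gb': 2,
}

# Hypothetical metering records: (team, resource, quantity).
USAGE_RECORDS = [
    ('web', 'vm_hours', 100),
    ('web', 'storage_gb', 50),
    ('analytics', 'vm_hours', 300),
]


def chargeback(usage_records, unit_price_cents):
    '''Sum the cost in cents of all metered usage for each team.'''
    totals = defaultdict(int)
    for team, resource, quantity in usage_records:
        totals[team] += quantity * unit_price_cents[resource]
    return dict(totals)


print(chargeback(USAGE_RECORDS, UNIT_PRICE_CENTS))
&lt;/code&gt;&lt;/pre&gt;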

&lt;h3 id=&quot;pricing-models&quot;&gt;Pricing Models&lt;/h3&gt;
&lt;p&gt;This metering supports various flexible pricing models, such as on-demand pricing for short-term needs, reserved instances for predictable long-term workloads (offering significant discounts), and spot pricing for fault-tolerant tasks that can take advantage of unused capacity at a lower rate.&lt;/p&gt;
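
&lt;p&gt;The trade-off between these pricing models can be illustrated with a small cost calculation. All hourly rates below are invented, and the sketch assumes that reserved capacity is billed for every hour of the month whether it is used or not, while on-demand and spot capacity are billed only for hours used; actual provider terms vary.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Hypothetical hourly rates in cents; real prices differ by provider and region.
HOURLY_RATE_CENTS = {
    'on_demand': 10,
    'reserved': 6,
    'spot': 3,
}


def monthly_cost_cents(model, hours_used, hours_in_month=730):
    '''Estimate the monthly cost of one instance under a pricing model.'''
    rate = HOURLY_RATE_CENTS[model]
    # Assumption: a reservation is paid for the full month regardless of use.
    billable_hours = hours_in_month if model == 'reserved' else hours_used
    return rate * billable_hours


# A lightly used instance versus one that runs all month.
for hours in (200, 730):
    print(hours, {m: monthly_cost_cents(m, hours) for m in HOURLY_RATE_CENTS})
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Under these assumed rates, on-demand is cheaper than a reservation for the lightly used instance, while the reservation wins for the instance that runs continuously, which matches the guidance that reservations suit predictable long-term workloads.&lt;/p&gt;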

&lt;h2 id=&quot;interconnected-nature-of-characteristics&quot;&gt;Interconnected Nature of Characteristics&lt;/h2&gt;

&lt;p&gt;These five characteristics are not standalone features but rather an interconnected system that creates the cloud computing experience:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-mermaid&quot;&gt;graph TD
    A[On-Demand Self-Service] --&amp;gt; E[Measured Service]
    B[Broad Network Access] --&amp;gt; A
    C[Resource Pooling] --&amp;gt; D[Rapid Elasticity]
    D --&amp;gt; E
    E --&amp;gt; A
    
    style A fill:#e1f5fe
    style B fill:#f3e5f5
    style C fill:#e8f5e8
    style D fill:#fff3e0
    style E fill:#fce4ec
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;For instance, &lt;strong&gt;Resource Pooling&lt;/strong&gt; creates the massive surplus capacity necessary for &lt;strong&gt;Rapid Elasticity&lt;/strong&gt;. &lt;strong&gt;Broad Network Access&lt;/strong&gt; ensures that the &lt;strong&gt;On-Demand Self-Service&lt;/strong&gt; portal is available to users everywhere. Finally, &lt;strong&gt;Measured Service&lt;/strong&gt; ties it all together by ensuring that this dynamic, self-serviced consumption is accurately tracked and billed.&lt;/p&gt;

&lt;h2 id=&quot;verification-checklist&quot;&gt;Verification Checklist&lt;/h2&gt;

&lt;p&gt;To determine if a service truly embodies cloud computing, you can ask the following questions:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Self-Service&lt;/strong&gt;: Can users provision resources immediately without human intervention?&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Network Access&lt;/strong&gt;: Is the service accessible from multiple devices and locations via standard networks?&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Resource Pooling&lt;/strong&gt;: Are resources shared efficiently among multiple users?&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Elasticity&lt;/strong&gt;: Can the service scale up and down automatically based on demand?&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Measured Service&lt;/strong&gt;: Is usage monitored, measured, and billed transparently based on consumption?&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Understanding these five essential characteristics provides the foundation for evaluating cloud services and making informed decisions about cloud adoption. Each characteristic contributes to the overall value proposition of cloud computing: increased agility, reduced costs, and improved scalability.&lt;/p&gt;

&lt;p&gt;In the next lesson, we’ll explore how these characteristics manifest in different service models: Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS).&lt;/p&gt;
</content>
 </entry>
 

</feed>
