🎬

Video Platform

System Design

Architecture Solid · Staff / Principal Level

VOD Ingestion Platform · High-Throughput Transcoding · Global Edge Caching

Design a Video Streaming Service

YouTube is a global online video-sharing and social media platform where users can upload, view, rate, share, comment on, and subscribe to digital video content. Launched in 2005, it operates as a major search engine and entertainment hub. Netflix is a global subscription-based streaming service that allows users to watch a vast library of TV shows, movies, documentaries, and specials on internet-connected devices. In this guide, we will design a highly scalable Video on Demand (VOD) service. We will explore how to balance the write-heavy creator pipeline (uploading, transcoding, and processing) with the read-heavy distribution network (CDN edge-caching and adaptive chunk playback) to support hundreds of millions of concurrent viewers worldwide.

Beginner's Guide

How YouTube & Netflix Work (For Beginners)

💡

In Plain Terms

Both YouTube and Netflix deliver streaming video, but their background workloads are entirely different.

YouTube is a public stage. Anyone can upload video files at any time (User-Generated Content). The system must quickly ingest, transcode, and catalog these raw uploads so viewers can discover and watch them globally in seconds.

Netflix is a curated cinema. Only administrators publish content in scheduled, high-quality batches. The engineering focus is strictly on caching and delivering a zero-lag, stutter-free playback experience to viewers globally.

The Post-Office Analogy: Imagine a global post office. An author writes a heavy book (raw video upload). The central office slices it up into small 6-second chapters (segmentation), prints them in multiple languages and sizes (transcoding), and ships them to thousands of local stands worldwide (CDNs). When a reader wants to read, they fetch chapter 1, and while reading, their desk automatically fetches the next few chapters based on how fast they read (adaptive streaming).


Step 01

Functional Requirements

Must-Have Features

📤 1. Upload & Ingest

✓ Chunked Video Upload: System must support uploading large video payloads (up to 20GB) without exhausting memory on application servers.
✓ Upload Pause & Resume: CRITICAL REQUIREMENT: System must support fault-tolerant uploads. If a creator's connection drops, the upload must resume from the last successful chunk without restarting.
✓ Creator Dashboard Upload Status: Allows video publishers to monitor upload progress and processing states.

📺 2. Playback & Streaming

✓ Adaptive Bitrate Streaming: Delivers media using segmented protocols (HLS/DASH) that adapt to varying client network conditions.
✓ Signed Playback Indexing: Issues authenticated clients signed, geo-restricted playlist manifests bound to short-lived tokens and the client IP.

🔍 3. Search & Discovery

✓ Fuzzy Text Vector Search: Uses search indices to look up video records by matching descriptions and metadata, tolerating typos.
✓ Autocomplete Type-ahead Suggestions: Generates query-matching suggestions in under 50ms based on historical text models.

Nice-to-Have

○ Content Creator Studio with in-browser clip trims.
○ Real-time chat overlay for premier view countdowns.
○ Interactive dynamic ad inserts mapping client sessions.
○ Multi-channel playlist groupings and video collections.

Out of Scope

✕ Live-streaming pipeline ingestion (focus is strictly VOD architecture).
✕ Turnkey digital content licensing and copyright audit tools.
✕ Payment splits or recurring subscriber subscription billing structures.
✕ Real-time view counter processing pipelines (e.g. streaming engines like Apache Flink or Spark). However, we briefly evaluate how these streaming engines integrate with Redis state layers if requested.

Step 02

Non-Functional Requirements

To build a bulletproof streaming platform, we focus strictly on measurable targets. We separate actual physical performance targets (Latency, Availability, and Durability) from our structural and database engine designs:

⚡Playback Latency

›Video playback startup: < 200ms anywhere globally
›Edge segment retrieval: < 30ms latency boundaries
›Player buffering recovery switches in < 1s
›Optimizes cache efficiency using strict TTL separations for dynamic playlists and static segments

🛡Availability targets

›Uptime target: 99.99% for segment retrieval
›Failover routing: Edge CDNs degrade to regional rings if key hubs fail
›Decoupled architecture protects video playing during upload spikes

📈System Scale Limits

›Supports 500M Daily Active Users (DAUs)
›Accepts over 500 hours of uploaded media/min
›Egress capacity handles 75 Tbps during peak hours
›Leverages event-driven worker clustering triggered asynchronously by Kafka streams for GPU codec transcoding

⚖️System Consistency

›Strict consistency for profile changes and upload confirmations
›Eventual consistency (< 3 seconds) for playlist indexing and video list results
›Near-Real-Time (NRT) indexing for global system text searches (~1s refresh limit)

🔑Asset Security

›Signed URL hashes restricted by timestamp limits and client IP bindings
›DRM encryption loops (Widevine, FairPlay) protecting licensed video content
›WAF rate limit triggers filtering automated bot networks

💾Upload Durability

›Raw video assets preserved with 11-nines reliability via S3 storage classes
›Transactional tracking of multipart file write hashes
›Immediate write-ahead log safety for crucial transaction registers

Step 03

Back-of-the-Envelope Estimation

Throughput Math

Constant Variable	Calculated Target	Underlying Formula
Global Footprint	500M DAUs	Active system base
Peak Concurrent streams	25M users	5% of daily active footprint at peak
New content Ingestion	500 hrs/min	500 hrs × 1,440 min = 720,000 hrs uploaded per day
Average video length	8 minutes	Avg UGC size profile
Videos created / day	5.4M records	720K hrs × 60 min / 8 min length
Daily play sessions	1.5B streams	500M DAUs × 3 streams/day average
Average Query Rate	17,400 QPS	1.5B play calls / 86,400 seconds
Peak Query Burst	87,000 QPS	5x average load multiplier
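
The throughput figures can be reproduced with a few lines of arithmetic. The sketch below is only a worked check of the table, using the stated inputs (500M DAUs, 500 hours uploaded per minute, 8-minute average video, 3 streams per user per day, 5x peak multiplier); the constant names are illustrative.

```typescript
// Inputs taken from the estimation table; names are illustrative.
const DAILY_ACTIVE_USERS = 500_000_000;
const UPLOAD_HOURS_PER_MINUTE = 500;
const AVG_VIDEO_MINUTES = 8;
const STREAMS_PER_USER_PER_DAY = 3;
const PEAK_CONCURRENCY_PERCENT = 5;
const PEAK_MULTIPLIER = 5;
const SECONDS_PER_DAY = 86_400;

// 500 hrs/min × 60 min × 24 h = 720,000 hours of video per day
const uploadHoursPerDay = UPLOAD_HOURS_PER_MINUTE * 60 * 24;
// 720,000 hrs × 60 min / 8 min = 5,400,000 videos per day
const videosPerDay = (uploadHoursPerDay * 60) / AVG_VIDEO_MINUTES;
// 500M × 3 = 1.5B play sessions per day
const playsPerDay = DAILY_ACTIVE_USERS * STREAMS_PER_USER_PER_DAY;
// 1.5B / 86,400 s ≈ 17,361 QPS (the table rounds to 17,400)
const averageQps = playsPerDay / SECONDS_PER_DAY;
// ≈ 86,806 QPS (the table rounds to 87,000)
const peakQps = averageQps * PEAK_MULTIPLIER;
// 5% of DAUs = 25,000,000 concurrent streams
const peakConcurrentStreams = (DAILY_ACTIVE_USERS * PEAK_CONCURRENCY_PERCENT) / 100;

console.log({ uploadHoursPerDay, videosPerDay, playsPerDay, averageQps, peakQps, peakConcurrentStreams });
```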

Storage & Network Math

Storage Ingestion Rate

Assumes an average of 5 transcoding ladder renditions (2.25 GB of output per video)

12 PB / day

Year-1 Catalog Footprint

Incremental database growth before cold archive sweeps

4.3 EB

Peak Streaming Output

Concurrent 25M active streams × average 3 Mbps bitrate

75 Tbps

Ingress Video Pipeline

5.4M uploads × average 500MB raw payload sizes

250 Gbps

Daily Edge Egress Traffic

1.5B global view transactions × 18 MB media segments

27 PB / day

Transcode Instance Farm

Based on 360,000 GPU execution hours per day (15,000 GPUs at 100% utilization; the figure below implies roughly 3x headroom)

45,000 GPU-instances
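
As a sanity check, the storage and network figures can be recomputed from their stated inputs. This is a rough sketch in decimal units (1 PB = 10^6 GB); the variable names are illustrative. Two results differ slightly from the headline numbers: the year-one footprint comes out nearer 4.4 EB in decimal units, and 360,000 GPU-hours per day corresponds to 15,000 GPUs at full utilization, so the 45,000 figure presumably includes headroom.

```typescript
// Inputs are the figures quoted in this section; names are illustrative.
const SECONDS_PER_DAY = 86_400;
const videosPerDay = 5_400_000;

// 5.4M videos × 2.25 GB, converted GB -> PB: 12.15 PB/day
const storagePbPerDay = (videosPerDay * 2.25) / 1e6;
// 12.15 PB × 365 days ≈ 4,435 PB (about 4.4 EB)
const yearOneStoragePb = storagePbPerDay * 365;
// 25M streams × 3 Mbps, converted Mbps -> Tbps: 75 Tbps
const peakEgressTbps = (25_000_000 * 3) / 1e6;
// 5.4M uploads × 500 MB × 8 bits, spread over a day, converted Mbps -> Gbps: 250 Gbps
const ingressGbps = (videosPerDay * 500 * 8) / SECONDS_PER_DAY / 1e3;
// 1.5B views × 18 MB, converted MB -> PB: 27 PB/day
const dailyEgressPb = (1_500_000_000 * 18) / 1e9;
// 360,000 GPU-hours / 24 h = 15,000 GPUs if every GPU is busy all day
const gpusAtFullUtilization = 360_000 / 24;

console.log({ storagePbPerDay, yearOneStoragePb, peakEgressTbps, ingressGbps, dailyEgressPb, gpusAtFullUtilization });
```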

Step 04

High-Level Design

💡

In Plain Terms

The pipeline cleanly separates duties. When a video is uploaded, the request hits the Upload Service, which tells the client where in S3 to store the heavy raw bytes directly. Once S3 confirms it has the file, a message is dropped on the Kafka event bus, which alerts the background conversion workers.

On the playback side, viewers query the Streaming Service to get a signed, customized chapters map. From that point forward, the viewer talks strictly to close-by CDN Edge nodes to fetch individual chapters, keeping our core databases fast and quiet.

Our services run statelessly. Dynamic API calls are processed via the gateway layers, and all heavy assets bypass the application servers entirely by writing directly to S3 bucket keys:

🎬 System Data Flow Direction: Left to Right (Client → Gateways → App Core → Storage)

🛡️ Networking Breakdown: WAF vs. Load Balancer vs. API Gateway

In a large-scale architecture, the Web Application Firewall (WAF), Load Balancer (LB), and API Gateway (APIGW) do not represent the same server. They are distinct, decoupled infrastructure layers operating sequentially:

1. WAF Proxy (Edge Security)

OSI Layer 7 Security

Located nearest to the network perimeter (often integrated at the CDN layer). Its sole job is traffic inspection: filtering SQL injection, cross-site scripting (XSS), bot scraper clusters, and Layer-7 DDoS floods before they can even touch internal services.

2. Load Balancer (Infrastructure Entry)

High-Availability Distribution

A highly specialized appliance (such as AWS ALB or an NGINX ring) optimized to route massive traffic volumes. It terminates TLS and distributes the filtered requests across a cluster of API Gateway servers, serving as the first layer of failover protection.

3. API Gateway (App Orchestration)

Stateless Routing & Logic

The entrance to your internal microservice mesh. Unlike LBs, the API Gateway runs custom software logic. It coordinates downstream calls, routes paths to individual microservices (e.g., /upload vs. /search), checks request rate-limits, and communicates directly with the Auth Service.

🧩 The Core Concept: Slicing the Loaf of Bread

In system engineering, we never send a single raw video file (which could be several gigabytes) directly down a wire to a user's phone. That would cause massive buffer stalls and high data usage!

Instead, we treat a video like a loaf of bread. During the transcoding phase, we slice the video into small, 6-second segment files (like thin slices of bread) formatted in fragmented MP4 (fMP4) or TS containers.

📁1 Raw Movie

→

⚙️Transcoder

→

seg_01.m4s

seg_02.m4s

seg_03.m4s

When you hit play on YouTube or Netflix, your player fetches a map index file (called an HLS playlist or manifest). It then requests these individual 6-second slices one-by-one. If your Wi-Fi speed changes suddenly, the player seamlessly upshifts or downshifts the resolution of the next slice without interrupting playback!
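
To make the "map index file" concrete, here is a minimal sketch that generates an HLS media playlist for one quality level with fixed-length fMP4 segments. The helper name `buildMediaPlaylist` and the `init.mp4` initialization segment name are assumptions for illustration; a real packager (such as FFmpeg) writes this file for you.

```typescript
// Hypothetical helper: builds a VOD media playlist for a single rendition.
function buildMediaPlaylist(segmentCount: number, segmentSeconds: number, initUri = "init.mp4"): string {
  const lines: string[] = [
    "#EXTM3U",
    "#EXT-X-VERSION:7",
    // Target duration must be an integer >= the longest segment duration
    `#EXT-X-TARGETDURATION:${Math.ceil(segmentSeconds)}`,
    "#EXT-X-PLAYLIST-TYPE:VOD",
    // fMP4 segments need an initialization segment declared via EXT-X-MAP
    `#EXT-X-MAP:URI="${initUri}"`,
  ];
  for (let i = 1; i <= segmentCount; i++) {
    const index = (i < 10 ? "0" : "") + i;
    lines.push(`#EXTINF:${segmentSeconds.toFixed(3)},`);
    lines.push(`seg_${index}.m4s`);
  }
  // ENDLIST tells the player the playlist is complete (no live refresh needed)
  lines.push("#EXT-X-ENDLIST");
  return lines.join("\n") + "\n";
}

console.log(buildMediaPlaylist(3, 6));
```

The player reads this list top to bottom and fetches `seg_01.m4s`, `seg_02.m4s`, and so on; switching quality simply means fetching the next segment from a different rendition's playlist.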

Direct-to-S3 Upload Path

Client sends video configurations -> Upload service validates JWT -> generates presigned multi-part S3 keys -> client streams chunks directly.

Kafka Event Bus Spine

Durable log distributing upload notifications, transcode milestones, analytics heartbeats, and database updates.

GPU Auto-Scaling Farm

NVIDIA accelerated worker clusters (Kafka Consumers) scale based on queue lag to process multi-format ladders.

Multi-Tier Cache

Three caching layers: local Guava heap storage, distributed Redis arrays, and global Edge CDNs.

Segmented Streams (fMP4)

Slices files into 6s standalone segments. Solves network switches instantly without interrupting playback.

Tokenized CDN Signatures

Prevents stream link sharing. CDN edge servers confirm token HMAC hashes and client IP bindings locally.
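
The tokenized signature check can be sketched as an HMAC over the path, client IP, and expiry time. This is a minimal illustration using Node's built-in `crypto` module; the function names, the `|`-joined payload layout, and the shared-secret distribution are assumptions, and commercial CDNs each define their own token format.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Hypothetical token format: HMAC-SHA256 over "path|clientIp|expiry"
function signPlaybackToken(secret: string, path: string, clientIp: string, expiresAtSec: number): string {
  const payload = `${path}|${clientIp}|${expiresAtSec}`;
  return createHmac("sha256", secret).update(payload).digest("hex");
}

// Runs at the edge: no database call, only the shared secret is needed
function verifyPlaybackToken(
  secret: string, path: string, clientIp: string, expiresAtSec: number, token: string, nowSec: number
): boolean {
  if (nowSec > expiresAtSec) return false; // link has expired
  const expected = Buffer.from(signPlaybackToken(secret, path, clientIp, expiresAtSec), "hex");
  const given = Buffer.from(token, "hex");
  // timingSafeEqual throws on length mismatch, so compare lengths first
  return expected.length === given.length && timingSafeEqual(expected, given);
}

const demoToken = signPlaybackToken("edge-secret", "/video-12/1080p/index.m3u8", "203.0.113.7", 1_700_000_600);
console.log(verifyPlaybackToken("edge-secret", "/video-12/1080p/index.m3u8", "203.0.113.7", 1_700_000_600, demoToken, 1_700_000_000));
```

A link copied to another machine fails verification because the IP in the recomputed payload no longer matches the one that was signed.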

High-Level CAP Strategy

User Metadata (RDBMS) · CP (Consistent / Partition-tolerant)

Strict transactional profile mapping. Relational tables guarantee absolute consistency for account state management.

Video Catalogue (NoSQL) · AP (Available / Partition-tolerant)

Eventual consistency of metadata. Lets write throughput scale horizontally; index delays of 1-3 seconds are imperceptible to viewers.

Activity Metrics (NoSQL) · AP (Available / Partition-tolerant)

Optimizes write streams globally. Comments and activity logs are replicated continuously across regions.

⚖️ Architectural Alternatives & Design Decisions

S3 Intelligent Tiering vs. Static Storage Policies

🎯Chosen: S3 Intelligent Tiering (Hot/Cold Class Lifecycle Management)

✓ Pro: Reduces raw media costs by up to 50% by automatically shifting older, stale, unviewed long-tail assets down to Glacier archive layers.

✗ Con: Restoring retired archive objects to active nodes can introduce brief retrieval latency spikes if cold items are randomly requested.

In-house CDN Infrastructure vs. Third-Party CDNs (Akamai/Fastly)

🎯Chosen: Third-Party CDNs (Edge PoPs) + Layer 2 Origin Shield

✓ Pro: Eliminates immense capital expenditure (CapEx) of building global physical data centers while keeping edge retrieval times under 30ms.

✗ Con: Puts us at the mercy of egress network traffic fees from cloud partners at extreme global scales.

Direct-to-S3 Upload vs. Gateway Proxied Ingest

🎯Chosen: Direct-to-Storage Ingestion (Presigned Multi-Part Chunk Uploads)

✓ Pro: Completely bypasses application servers, eliminating CPU and network memory constraints during massive creator spikes.

✗ Con: Increases orchestration complexity in upload clients, which must track the state of many concurrent presigned URLs.

Pre-transcoding All Video Ladders vs. On-Demand Transcoding

🎯Chosen: Pre-transcoding All Quality Ladders (Asynchronous Encoding Paths)

✓ Pro: Guarantees instantaneous playback startup metrics (< 200ms) globally since target segment slices are fully cached and waiting.

✗ Con: Increases the active storage footprint by 4-5x for unviewed long-tail creator catalog assets.

Synchronous vs. Asynchronous Transcode Triggering

🎯Chosen: Asynchronous Transcode Triggering (Event-Driven Kafka Consuming)

✓ Pro: Event-driven asynchronous consumer loops protect server clusters from cascade bottlenecks when heavy raw files land.

✗ Con: Forces creators to check active processing status indicators on their dashboard while worker queues churn.

Active-Active Global Databases vs. Partitioned Region Masters

🎯Chosen: Partitioned Region Masters + High-Read Replicas

✓ Pro: Provides predictable transactional writes and clean consistency models without high risks of active-active split-brain collisions.

✗ Con: Cross-region users accessing foreign home nodes can face slight read-path delays due to replication lag limits.

Step 05

Data Model

Relational Identity Storage (PostgreSQL)

Used strictly for structured authentication and transaction histories requiring ACID guarantees.

sql

-- Core User Schema (Relational PostgreSQL)
CREATE TABLE users (
  id               UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  username         VARCHAR(50)  UNIQUE NOT NULL,
  email            VARCHAR(255) UNIQUE NOT NULL,
  password_hash    TEXT         NOT NULL,
  created_at       TIMESTAMPTZ  DEFAULT NOW()
);

High-Volume Scale Catalog (DynamoDB NoSQL)

Used for horizontally scalable video catalog metadata. Partitioned by key across many nodes to support hundreds of millions of objects without a single-node bottleneck.

json

// Video Catalog Schema (Amazon DynamoDB Representation)
{
  "TableName": "videos",
  "KeySchema": [
    { "AttributeName": "id", "KeyType": "HASH" },       // Partition Key (UUID)
    { "AttributeName": "user_id", "KeyType": "RANGE" }  // Sort Key (Creator UUID)
  ],
  "AttributeDefinitions": [
    { "AttributeName": "id", "AttributeType": "S" },
    { "AttributeName": "user_id", "AttributeType": "S" }
  ],
  "BillingMode": "PAY_PER_REQUEST"
}

📊 Architectural Evaluation: PostgreSQL vs. NoSQL

Architectural Metric	Relational (PostgreSQL)	NoSQL Key-Value (DynamoDB)	Winner & Selection Rationale
Write Scaling	Limited by single-primary writes. Sharding is manual and complex.	Horizontal scale-out. Seamless multi-partition routing.	NoSQL: Matches the 5.4M uploads/day footprint.
Schema Flex	Strict DDL constraints. Schema migrations require careful lock handling.	Schema-less. Easy attribute addition.	NoSQL: Allows adding flexible video transcoder profiles over time.
Query Operations	Supports complex multi-join relational queries.	Key-value lookups only. Joins must be done at the application level.	Tie: RDBMS for core identity and transaction records; NoSQL for playback catalog paths.

Elasticsearch Index Mapping

Transforms metadata records into high-performance search-as-you-type indices. By defining the primary search target as a search_as_you_type type field, Elasticsearch automatically breaks text inputs down into structured edge n-grams (e.g. "sy", "sys", "syst", "system").

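This indexing step eliminates the need for expensive leading-wildcard scans (LIKE '%query%') against the database in production, which cannot use a standard B-tree index and degrade into full table scans. Instead, autocomplete queries resolve as term lookups against the pre-built inverted index, so the cost is paid once at index time rather than on every keystroke.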

json

{
  "index": "videos",
  "mappings": {
    "properties": {
      "video_id":      { "type": "keyword" },
      "title": {
        "type": "text",
        "analyzer": "standard",
        "fields": {
          "autocomplete": { "type": "search_as_you_type" }
        }
      },
      "tags":          { "type": "keyword" },
      "view_count":    { "type": "long" },
      "published_at":  { "type": "date" }
    }
  }
}
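
The edge n-gram idea behind this mapping can be illustrated with a toy in-memory prefix index. This sketch only shows the concept (pay the tokenization cost at index time, answer prefixes with a direct lookup); it is not how Elasticsearch implements the field internally, and the function names are illustrative.

```typescript
// Emits every prefix of a term from minLength up to the full term
function edgeNGrams(term: string, minLength = 2): string[] {
  const grams: string[] = [];
  for (let n = minLength; n <= term.length; n++) {
    grams.push(term.slice(0, n));
  }
  return grams;
}

// Index time: map each prefix of each word to the titles containing it
function buildPrefixIndex(titles: string[]): Map<string, Set<string>> {
  const index = new Map<string, Set<string>>();
  for (const title of titles) {
    for (const word of title.toLowerCase().split(/\s+/)) {
      for (const gram of edgeNGrams(word)) {
        let bucket = index.get(gram);
        if (!bucket) {
          bucket = new Set<string>();
          index.set(gram, bucket);
        }
        bucket.add(title);
      }
    }
  }
  return index;
}

// Query time: a single map lookup, no scan over the catalog
function suggest(index: Map<string, Set<string>>, prefix: string): string[] {
  const hits = index.get(prefix.toLowerCase());
  return hits ? Array.from(hits).sort() : [];
}

const demoIndex = buildPrefixIndex(["System Design Basics", "Systolic Arrays", "Cooking Pasta"]);
console.log(edgeNGrams("system")); // ["sy", "sys", "syst", "syste", "system"]
console.log(suggest(demoIndex, "sys")); // ["System Design Basics", "Systolic Arrays"]
```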

Transient High-Write Schemas (Cassandra)

Used for high-throughput time-series records. Tuned consistency levels (QUORUM for writes, ONE for reads) prioritize availability over strict transactional guarantees.

sql

-- Threaded Video Comments
CREATE TABLE comments (
  video_id    uuid,
  created_at  timestamp,
  comment_id  uuid,
  user_id     uuid,
  content     text,
  PRIMARY KEY ((video_id), created_at, comment_id)
) WITH CLUSTERING ORDER BY (created_at DESC);

⚖️ Architectural Alternatives & Design Decisions

Cassandra vs. Postgres for Viewer History

🎯Chosen: Apache Cassandra Wide-Column Segments

✓ Pro: Cassandra provides linear write scaling, low disk footprint for wide tables, and localized partition reads.

✗ Con: No JOIN support; user profile fields needed for display must be denormalized into the Cassandra rows or fetched separately.

Step 06

API Design

REST · JSON over HTTPS · Token Authorized

Begins a chunked multi-part upload session. Returns the presigned URL map for TUS.

REQUEST BODY

json

{
  "title": "My Scale System Guide",
  "file_size": 2516582400,
  "mime_type": "video/mp4"
}

RESPONSE

json

{
  "upload_id": "ul_01F9A...",
  "video_id": "vid_01F9A...",
  "chunk_size": 5242880,
  "part_urls": [
    { "part_number": 1, "url": "https://s3.amazonaws.com/raw/part1?sig=..." }
  ]
}

KNOWN ERRORS

400 Bad Request – Unsupported codec or format
413 Payload Too Large

Step 07

Deep Dive Subsystems

Direct Ingestion via TUS

💡

In Plain Terms

Instead of uploading a massive 20GB video file all at once (where a brief Wi-Fi drop forces you to restart from scratch), the TUS protocol (an open protocol for resumable file uploads, essential for reliable direct-to-S3 ingestion of video files up to 20GB) acts like a bookmark. It splits the file into 5MB chunks. If your internet disconnects on chunk 300, the upload resumes from chunk 300 immediately upon reconnecting.
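
The "bookmark" behaviour boils down to working out which chunks are still missing. The sketch below assumes the client (or server) already knows the set of part numbers that were uploaded successfully; `planRemainingParts` and `PartRange` are illustrative names, not part of any protocol.

```typescript
interface PartRange {
  partNumber: number; // 1-based, matching S3 multipart numbering
  start: number;      // inclusive byte offset
  end: number;        // exclusive byte offset
}

// Returns the byte ranges that still need uploading after an interruption
function planRemainingParts(fileSize: number, chunkSize: number, uploaded: ReadonlySet<number>): PartRange[] {
  const totalParts = Math.ceil(fileSize / chunkSize);
  const remaining: PartRange[] = [];
  for (let n = 1; n <= totalParts; n++) {
    if (uploaded.has(n)) continue; // already stored, skip on resume
    const start = (n - 1) * chunkSize;
    // The final part may be shorter than chunkSize
    remaining.push({ partNumber: n, start, end: Math.min(start + chunkSize, fileSize) });
  }
  return remaining;
}

// Example: connection dropped during chunk 300 of a 2,516,582,400-byte file
const alreadyUploaded = new Set<number>();
for (let n = 1; n < 300; n++) alreadyUploaded.add(n);
const todo = planRemainingParts(2_516_582_400, 5 * 1024 * 1024, alreadyUploaded);
console.log(todo.length, todo[0]);
```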

The client negotiates with the Upload Service to generate secure, presigned S3 URLs. The client then pushes 5MB chunks directly to the S3 bucket:

VOD Upload Swimlanes

💡 Interview Tip: This code block is strictly for architectural illustration of concepts. In a system design interview, you are expected to map components and explain API bounds rather than write detailed TypeScript lines.

typescript

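// Client-side chunked concurrent uploader using S3 Multipart
// PartUrl mirrors the part_urls entries from the upload-session response (camel-cased)
interface PartUrl { partNumber: number; url: string }
interface UploadedPart { partNumber: number; etag: string }

const CHUNK_SIZE = 5 * 1024 * 1024; // 5MB blocks (the S3 minimum for non-final parts)
const CONCURRENCY_LIMIT = 4;
const MAX_ATTEMPTS = 3;

// PUTs one chunk directly to S3, retrying failed attempts with exponential backoff
async function uploadPartWithRetry(file: File, part: PartUrl): Promise<UploadedPart> {
  const start = (part.partNumber - 1) * CHUNK_SIZE;
  const chunk = file.slice(start, start + CHUNK_SIZE);

  for (let attempt = 1; ; attempt++) {
    try {
      const res = await fetch(part.url, { method: "PUT", body: chunk });
      // The bucket CORS policy must expose the ETag header for the browser to read it
      const etag = res.headers.get("ETag");
      if (!res.ok || !etag) throw new Error(`Part ${part.partNumber} failed: HTTP ${res.status}`);
      return { partNumber: part.partNumber, etag };
    } catch (err) {
      if (attempt >= MAX_ATTEMPTS) throw err;
      await new Promise((resolve) => setTimeout(resolve, 500 * 2 ** attempt));
    }
  }
}

// Uploads all parts in batches of CONCURRENCY_LIMIT, then finalizes the upload
async function uploadVideo(file: File, uploadId: string, parts: PartUrl[]): Promise<void> {
  const etags: UploadedPart[] = [];

  for (let i = 0; i < parts.length; i += CONCURRENCY_LIMIT) {
    const batch = parts.slice(i, i + CONCURRENCY_LIMIT);
    etags.push(...(await Promise.all(batch.map((part) => uploadPartWithRetry(file, part)))));
  }

  // Confirms state completion and triggers transcoding queue
  const res = await fetch(`/api/v1/videos/${uploadId}/complete`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ parts: etags })
  });
  if (!res.ok) throw new Error(`Complete failed: HTTP ${res.status}`);
}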

Internal Lifecycle State machine

GPU Video Transcoding DAG (Directed Acyclic Graph)

To convert high-definition raw videos into streamable segments without blocking system threads, we split the work into independent tasks arranged as a Directed Acyclic Graph (DAG). This lets demuxing, multi-resolution scaling, image extraction, and watermarking proceed along parallel paths with isolated failure recovery:

Video Processing Directed Acyclic Graph (DAG) Subsystem
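
One way to see how a DAG yields parallelism is to group tasks into stages, where every task in a stage has all of its dependencies satisfied by earlier stages. The sketch below is a minimal scheduler under that idea; the task names and the dependency layout are hypothetical and stand in for the real pipeline's nodes.

```typescript
// Maps each task to the tasks it depends on (hypothetical pipeline layout)
type TaskGraph = Record<string, string[]>;

// Groups tasks into stages; tasks within a stage can run in parallel
function planStages(graph: TaskGraph): string[][] {
  const done = new Set<string>();
  const pending = new Set<string>(Object.keys(graph));
  const stages: string[][] = [];

  while (pending.size > 0) {
    const ready = Array.from(pending)
      .filter((task) => (graph[task] ?? []).every((dep) => done.has(dep)))
      .sort();
    // No runnable task means a cycle or a dependency that was never declared
    if (ready.length === 0) throw new Error("Graph has a cycle or an unknown dependency");
    for (const task of ready) {
      pending.delete(task);
      done.add(task);
    }
    stages.push(ready);
  }
  return stages;
}

const transcodeGraph: TaskGraph = {
  demux: [],
  scale_1080p: ["demux"],
  scale_720p: ["demux"],
  thumbnails: ["demux"],
  watermark_1080p: ["scale_1080p"],
  watermark_720p: ["scale_720p"],
  package: ["watermark_1080p", "watermark_720p"],
};

console.log(planStages(transcodeGraph));
```

Because each task is its own node, a failed `scale_720p` can be retried without redoing `demux` or disturbing the 1080p branch.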


We leverage NVIDIA hardware-accelerated encoding (h264_nvenc) to execute these tasks concurrently. Why hardware acceleration? Modern GPUs carry dedicated fixed-function encoder blocks (NVENC) that perform pixel conversion and video compression much faster and more efficiently than general-purpose multi-core CPUs. Offloading frame-level computation to these circuits can lower total CPU utilization by up to 95% and cut infrastructure costs by as much as tenfold, enabling concurrent encoding of multiple 4K/1080p streaming ladder renditions.

bash

# FFmpeg: GPU-decode, scale to 1080p, NVENC-encode, and package as 6s fMP4 HLS segments
# (FFmpeg has no native s3:// output, so write locally and sync to the bucket afterwards)
mkdir -p out/1080p
ffmpeg -y -hwaccel cuda -hwaccel_output_format cuda -i input_raw.mov \
  -vf "scale_cuda=1920:1080" \
  -c:v h264_nvenc -preset p4 -b:v 5000k -maxrate 5500k -bufsize 11000k \
  -c:a aac -b:a 192k \
  -f hls \
  -hls_time 6 \
  -hls_playlist_type vod \
  -hls_segment_type fmp4 \
  -hls_segment_filename "out/1080p/seg_%03d.m4s" \
  "out/1080p/index.m3u8"
aws s3 sync out/1080p "s3://prod-media/video-12/1080p/"

Adaptive Bitrate (ABR) Controller

Implemented directly on the client player via standard BOLA buffer-driven logic (BOLA, the Buffer-Occupancy-based Lyapunov Algorithm, is an ABR logic that selects video qualities primarily based on current player buffer size). It monitors player buffer health and switches quality classes to prevent playback stutters:

typescript

// Simplified client-side quality decider matching buffer capacities
class ABRController {
  private currentQuality = "720p";
  private bufferLevelSeconds = 30; // Active player buffer
  private estimatedBandwidthBps = 4500000;

  public getTargetQuality(): string {
    // Safety drop-down boundaries
    if (this.bufferLevelSeconds < 5) {
      return "360p"; // Emergency fallback
    }
    if (this.bufferLevelSeconds > 40 && this.estimatedBandwidthBps > 8000000) {
      return "1080p"; // Upshift
    }
    return this.currentQuality;
  }
}

Multi-Tier Cache & Origin Shield

We leverage an Origin Shield (an L2 regional CDN shield: an extra caching layer that collapses multi-CDN request stampedes, or thundering herds, into a single call) in front of S3 storage bucket arrays. This collapses thundering herd request patterns on popular video releases into single calls.
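
The collapsing behaviour is often called request coalescing or "single flight": concurrent misses for the same key share one in-flight origin fetch. The class below is a minimal in-process sketch of that idea; `SingleFlight` and `loadFromOrigin` are illustrative names, and a real shield would add caching, TTLs, and timeouts on top.

```typescript
// Coalesces concurrent requests for the same key into one origin fetch
class SingleFlight<T> {
  private inFlight = new Map<string, Promise<T>>();

  fetch(key: string, loadFromOrigin: () => Promise<T>): Promise<T> {
    const existing = this.inFlight.get(key);
    if (existing) return existing; // piggyback on the request already running

    const request = loadFromOrigin();
    this.inFlight.set(key, request);
    // Clear the entry on success or failure so later requests can retry
    const cleanup = () => { this.inFlight.delete(key); };
    request.then(cleanup, cleanup);
    return request;
  }
}

let originCalls = 0;
const shield = new SingleFlight<string>();
const loadSegment = () => {
  originCalls++;
  return Promise.resolve("segment-bytes");
};

// Two viewers request the same segment at the same moment
const first = shield.fetch("video-12/1080p/seg_001.m4s", loadSegment);
const second = shield.fetch("video-12/1080p/seg_001.m4s", loadSegment);
console.log(originCalls, first === second);
```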

Step 08

Bottlenecks & Scaling Mitigations

💡

In Plain Terms

Scale breaks everything. At 500M DAUs, directly writing play events to a database or serving video segments from origin servers will crash the platform. We must implement rate limiters, circuit breakers, and batching layers to safeguard our services.

🔥 Hot Celebrity Content Stampedes · HIGH: Platform-Breaking

An account with 10M subscribers publishes a video, triggering a thundering herd of requests for segments that are not yet in the edge CDN caches, so the misses fall through to origin.

🛠️ System Mitigations:

→ Pre-warm Edge CDN caches: Check the creator's subscriber count upon publishing. If subs > 100K, pre-warm the first 3 segments of all qualities.

→ Implement L2 Regional Origin Shields to merge redundant, concurrent S3 read calls into a single query.
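
The pre-warm rule can be expressed as a small function that decides which segment paths to push to the edge at publish time. The path layout and the function name are hypothetical; the threshold and segment count come from the mitigation above.

```typescript
const PREWARM_SUBSCRIBER_THRESHOLD = 100_000;
const PREWARM_SEGMENT_COUNT = 3;

// Returns the segment paths to pre-warm, or an empty list for small channels
function prewarmPaths(videoId: string, subscriberCount: number, qualities: string[]): string[] {
  if (subscriberCount <= PREWARM_SUBSCRIBER_THRESHOLD) return [];
  const paths: string[] = [];
  for (const quality of qualities) {
    for (let i = 1; i <= PREWARM_SEGMENT_COUNT; i++) {
      // Hypothetical layout: /<video>/<quality>/seg_00N.m4s
      paths.push(`/${videoId}/${quality}/seg_00${i}.m4s`);
    }
  }
  return paths;
}

console.log(prewarmPaths("video-12", 10_000_000, ["1080p", "720p", "360p"]));
```

Warming only the first few segments keeps the push cheap while covering the moment when every viewer presses play at once; later segments are pulled through the cache normally.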

🗄️ Database View-Count Saturation · HIGH: Platform-Breaking

Millions of viewers trigger concurrent database writes, overwhelming the primary relational database locks.

🛠️ System Mitigations:

→ View-counters are kept OUT of primary transactional paths to protect DB performance.

→ To scale metrics at peak: Route view events through Apache Kafka, aggregate them with a streaming engine such as Apache Flink, accumulate counts in Redis (INCR), and flush the batched totals to the database every 60 seconds.
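
The batching step can be sketched with an in-memory buffer that stands in for the Redis counters: increments are cheap and local, and a periodic drain turns millions of events into one write per video. `ViewCountBuffer` is an illustrative name, and the database write itself is left out.

```typescript
// In-process stand-in for Redis INCR plus a periodic flush
class ViewCountBuffer {
  private pending = new Map<string, number>();

  // Called once per view event; no database access here
  record(videoId: string): void {
    this.pending.set(videoId, (this.pending.get(videoId) ?? 0) + 1);
  }

  // Called on a timer (for example every 60 seconds); returns deltas to persist
  drain(): Map<string, number> {
    const batch = this.pending;
    this.pending = new Map<string, number>();
    return batch;
  }
}

const viewBuffer = new ViewCountBuffer();
viewBuffer.record("video-12");
viewBuffer.record("video-12");
viewBuffer.record("video-99");

// One "UPDATE ... SET view_count = view_count + delta" per video instead of one per view
const viewBatch = viewBuffer.drain();
console.log(Array.from(viewBatch.entries()));
```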

💾 Storage Cost Explosion · MEDIUM: Performance Degrading

The infinite growth of uploaded UGC videos quickly balloons S3 storage costs.

🛠️ System Mitigations:

→ Apply S3 lifecycle policies: Migrate assets to Infrequent Access (IA) at 30 days, then to cold Glacier at 180 days.

→ Automate the cleanup of older, unpopular 4K transcoded folders to reclaim disk space.
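
The lifecycle thresholds above reduce to a simple age-based rule. In practice this would be configured as a bucket lifecycle policy rather than application code; the function below only illustrates the decision, and the class labels are shorthand chosen for this sketch.

```typescript
type StorageTier = "STANDARD" | "INFREQUENT_ACCESS" | "GLACIER";

// Picks the storage tier for an asset based on its age in days
function storageTierForAge(ageDays: number): StorageTier {
  if (ageDays >= 180) return "GLACIER";
  if (ageDays >= 30) return "INFREQUENT_ACCESS";
  return "STANDARD";
}

console.log(storageTierForAge(7), storageTierForAge(45), storageTierForAge(400));
```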

📊 Transcoding Pipeline Congestion · MEDIUM: Performance Degrading

A sudden influx of creators uploading content delays transcoding times, leaving videos stuck in a pending queue.

🛠️ System Mitigations:

→ Kubernetes KEDA scales GPU transcoding workers dynamically based on Kafka consumer lag metrics.

→ Prioritize the transcoding queue: Route uploads from popular, verified creators to high-priority Kafka topics.
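
Lag-based scaling can be approximated as "one worker per N queued messages, within fixed bounds". The function below is a simplified model of that rule, not KEDA's actual implementation; the parameter names and the example numbers are assumptions.

```typescript
// Simplified model of lag-driven scaling: ceil(lag / target), clamped to bounds
function desiredTranscodeWorkers(
  totalConsumerLag: number,
  targetLagPerWorker: number,
  minWorkers: number,
  maxWorkers: number
): number {
  const wanted = Math.ceil(totalConsumerLag / targetLagPerWorker);
  return Math.min(maxWorkers, Math.max(minWorkers, wanted));
}

console.log(desiredTranscodeWorkers(4_500, 1_000, 2, 40));  // moderate backlog
console.log(desiredTranscodeWorkers(45_000, 1_000, 2, 40)); // upload spike, capped at max
console.log(desiredTranscodeWorkers(0, 1_000, 2, 40));      // idle, held at min
```

The upper bound matters for GPU fleets in particular, since it caps spend during a spike and leaves the priority topics to decide whose videos are processed first.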


VOD Streaming Platform Design Walkthrough · 8-Step Framework · Built for Staff Engineering Candidates

HLS · MPEG-DASH · TUS Protocol · NVIDIA Transcoding · Origin Shield · Elasticsearch · Cassandra · PostgreSQL Sharding