Design a Video Streaming Service
Youtube is a global online video-sharing and social media platform where users can upload, view, rate, share, comment on, and subscribe to digital video content. Launched in 2005, it operates as a major search engine and entertainment hub. Netflix is a global subscription-based streaming service that allows users to watch a vast library of TV shows, movies, documentaries, and specials on internet-connected devices. In this guide, we will design a highly scalable Video on Demand (VOD) service. We will explore how to balance the write-heavy creator pipeline (uploading, transcoding, and processing) with the read-heavy distribution network (CDN edge-caching and adaptive chunk playback) to support hundreds of millions of concurrent viewers worldwide.
How YouTube & Netflix Work (For Beginners)
Both YouTube and Netflix deliver streaming video, but their background workloads are entirely different.
YouTube is a public stage. Anyone can upload video files at any time (User-Generated Content). The system must quickly ingest, transcode, and catalog these raw uploads so viewers can discover and watch them globally in seconds.
Netflix is a curated cinema. Only administrators publish content in scheduled, high-quality batches. The engineering focus is strictly on caching and delivering a zero-lag, stutter-free playback experience to viewers globally.
The Post-Office Analogy: Imagine a global post office. An author writes a heavy book (raw video upload). The central office slices it up into small 6-second chapters (segmentation), prints them in multiple languages and sizes (transcoding), and ships them to thousands of local stands worldwide (CDNs). When a reader wants to read, they fetch chapter 1, and while reading, their desk automatically fetches the next few chapters based on how fast they read (adaptive streaming).
In this guide, we will design a highly scalable Video on Demand (VOD) service. We will explore how to balance the write-heavy creator pipeline (uploading, transcoding, and processing) with the read-heavy distribution network (CDN edge-caching and adaptive chunk playback) to support hundreds of millions of concurrent viewers worldwide.
Functional Requirements
Must-Have Features
๐ค 1. Upload & Ingest
- โChunked Video Upload: System must support uploading large video payloads (up to 20GB) without consuming memory limits on application servers.
- โUpload Pause & Resume: CRITICAL REQUIREMENT: System must support fault-tolerant uploads. If a creator's connection drops, the upload must resume from the last successful chunk without restarting.
- โCreator Dashboard Upload Status: Allows video publishers to monitor active upload completion and processing states.
๐บ 2. Playback & Streaming
- โAdaptive Bitrate Streaming: Transports media seamlessly using segmented protocols (HLS/DASH) across varying client network profiles.
- โSigned Playback Indexing: Grants authenticated clients geo-restricted playlist maps tied to transient IP keys.
๐ 3. Search & Discovery
- โFuzzy Text Vector Search: Leverages index catalogs to lookup video records, matching descriptions and metadata instantly.
- โAutocomplete Type-ahead Suggestions: Generates high-speed query-matching listings in under 50ms based on historical text models.
Nice-to-Have
- โ Content Creator Studio with in-browser clip trims.
- โ Real-time chat overlay for premier view countdowns.
- โ Interactive dynamic ad inserts mapping client sessions.
- โ Multi-channel playlist groupings and video collections.
Out of Scope
- โ Live-streaming pipeline ingestion (focus is strictly VOD architecture).
- โ Turnkey digital content licensing and copyright audit tools.
- โ Payment splits or recurring subscriber subscription billing structures.
- โ Real-time view counter processing pipelines (e.g. streaming engines like Apache Flink or Spark. However, we briefly evaluate how these batch engines integrate with Redis state layers if requested).
Non-Functional Requirements
- โบVideo playback startup: < 200ms anywhere globally
- โบEdge segment retrieval: < 30ms latency boundaries
- โบPlayer buffering recovery switches in < 1s
- โบOptimizes cache efficiency using strict TTL separations for dynamic playlists and static segments
- โบUptime target: 99.99% for segment retrieval
- โบFailover routing: Edge CDNs degrade to regional rings if key hubs fail
- โบDecoupled architecture protects video playing during upload spikes
- โบSupports 500M Daily Active Users (DAUs)
- โบAccepts over 500 hours of uploaded media/min
- โบEgress capacity handles 75 Tbps during peak hours
- โบLeverages event-driven worker clustering triggered asynchronously by Kafka streams for GPU codec transcoding
- โบStrict consistency for profile changes and upload confirmations
- โบEventual consistency (< 3 seconds) for playlist indexing and video list results
- โบNear-Real-Time (NRT) indexing for global system text searches (~1s refresh limit)
- โบSigned URL hashes restricted by timestamp limits and client IP bindings
- โบDRM encryption loops (Widevine, FairPlay) protecting licensed video content
- โบWAF rate limit triggers filtering automated bot networks
- โบRaw video assets preserved with 11-nines reliability via S3 storage classes
- โบTransactional tracking of multipart file write hashes
- โบImmediate write-ahead log safety for crucial transaction registers
Back-of-the-Envelope Estimation
Throughput Math
| Constant Variable | Calculated Target | Underlying Formula |
|---|---|---|
| Global Footprint | 500M DAUs | Active system base |
| Peak Concurrent streams | 25M users | 5% of daily active footprint at peak |
| New content Ingestion | 500 min/hr | 720,000 hrs uploaded per day |
| Average video length | 8 minutes | Avg UGC size profile |
| Videos created / day | 5.4M records | 720K hrs ร 60 min / 8 min length |
| Daily play sessions | 1.5B streams | 500M DAUs ร 3 streams/day average |
| Average Query Rate | 17,400 QPS | 1.5B play calls / 86,400 seconds |
| Peak Query Burst | 87,000 QPS | 5x average load multiplier |
Storage & Network Math
High-Level Design
The pipeline divides clean duties. When a video is sent: it hits the Upload Service which tells S3 where to store the heavy raw bytes directly. Once S3 confirms it has the file, it drops a message on the Kafka event bus, which alerts the background conversion workers.
On the playback side, viewers query the Streaming Service to get a signed, customized chapters map. From that point forward, the viewer talks strictly to close-by CDN Edge nodes to fetch individual chapters, keeping our core databases fast and quiet.
Our services run statelessly. Dynamic API calls are processed via the gateway layers, and all heavy assets bypass the application servers entirely by writing directly to S3 bucket keys:
๐ Click components to trace architecture lineages
๐ก๏ธ Networking Breakdown: WAF vs. Load Balancer vs. API Gateway
In a large-scale architecture, the Web Application Firewall (WAF), Load Balancer (LB), and API Gateway (APIGW) do not represent the same server. They are distinct, decoupled infrastructure layers operating sequentially:
Located nearest to the network perimeter (often integrated at the CDN layer). Its sole job is traffic inspection: filtering SQL injection, cross-site scripting (XSS), bot scraper clusters, and Layer-7 DDoS floods before they can even touch internal services.
A highly specialized appliance (such as AWS ALB or an NGINX ring) optimized to route massive traffic. It distributes the filtered, decrypted HTTPS payloads across a cluster of API Gateway servers, acting as the primary point of failure protection.
The entrance to your internal microservice mesh. Unlike LBs, the API Gateway runs custom software logic. It coordinates downstream calls, routes paths to individual microservices (e.g., /upload vs. /search), checks request rate-limits, and communicates directly with the Auth Service.
๐งฉ The Core Concept: Slicing the Loaf of Bread
In system engineering, we never send a single raw video file (which could be several gigabytes) directly down a wire to a user's phone. That would cause massive buffer stalls and high data usage!
Instead, we treat a video like a loaf of bread. During the transcoding phase, we slice the video into small, 6-second segment files (like thin slices of bread) formatted in fragmented MP4 (fMP4) or TS containers.
When you hit play on YouTube or Netflix, your player fetches a map index file (called an HLS playlist or manifest). It then requests these individual 6-second slices one-by-one. If your Wi-Fi speeds slow down suddenly, the player seamlessly upshifts or downshifts the resolution of the *next* slice without crashing your viewing experience!
High-Level CAP Strategy
Strict transactional profile mapping. Relational tables guarantee absolute consistency for account state management.
Eventual consistency of metadata. Allows write speeds to scale infinitely; index delays of 1-3 seconds are visually imperceptible.
Optimizes write streams globally. Active comments and activity logs continuous replication targets.
โ๏ธ Architectural Alternatives & Design Decisions
Data Model
Relational Identity Storage (PostgreSQL)
Used strictly for structured authentication and transaction histories requiring ACID guarantees.
-- Core User Schema (Relational PostgreSQL)
CREATE TABLE users (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
username VARCHAR(50) UNIQUE NOT NULL,
email VARCHAR(255) UNIQUE NOT NULL,
password_hash TEXT NOT NULL,
created_at TIMESTAMPTZ DEFAULT NOW()
);High-Volume Scale Catalog (DynamoDB NoSQL)
Used for infinitely scalable video catalog metadata. Sharded across wide-key partitions to support hundreds of millions of objects without scaling limits.
// Video Catalog Schema (Amazon DynamoDB Representation)
{
"TableName": "videos",
"KeySchema": [
{ "AttributeName": "id", "KeyType": "HASH" }, // Partition Key (UUID)
{ "AttributeName": "user_id", "KeyType": "RANGE" } // Sort Key (Creator UUID)
],
"AttributeDefinitions": [
{ "AttributeName": "id", "AttributeType": "S" },
{ "AttributeName": "user_id", "AttributeType": "S" }
],
"BillingMode": "PAY_PER_REQUEST"
}๐ Architectural Evaluation: PostgreSQL vs. NoSQL
| Architectural Metric | Relational (PostgreSQL) | NoSQL Key-Value (DynamoDB) | Winner & Selection Rationale |
|---|---|---|---|
| Write Scaling | Limited by single primary writes. Sharding is manual and complex. | Infinite scale out. Seamless multi-partition routing. | NoSQL Matches 5.4M uploads/day footprint seamlessly. |
| Schema Flex | Strict DDL constraints. Schema migrations require lock safety. | Schema-less. Easy attribute addition. | NoSQL Allows adding flexible video transcoder profiles over time. |
| Query Operations | Supports complex multi-join relational query states. | Key-value lookups only. Joins require custom app-level joins. | Tie RDBMS for core payment ledgers; NoSQL for playback catalog paths. |
Elasticsearch Index Mapping
Transforms metadata records into high-performance search-as-you-type indices. By defining the primary search target as a search_as_you_type type field, Elasticsearch automatically breaks text inputs down into structured edge n-grams (e.g. "sy", "sys", "syst", "system").
This indexing step eliminates the need for expensive, platform-crashing database wildcard regex scans (LIKE %query%) in production. Instead, autocomplete responses resolve in O(1) time complexity directly from fast pre-tokenized memory banks.
{
"index": "videos",
"mappings": {
"properties": {
"video_id": { "type": "keyword" },
"title": {
"type": "text",
"analyzer": "standard",
"fields": {
"autocomplete": { "type": "search_as_you_type" }
}
},
"tags": { "type": "keyword" },
"view_count": { "type": "long" },
"published_at": { "type": "date" }
}
}
}Transient High-Write Schemas (Cassandra)
Used for high-throughput time-series records. Tuned consistency levels (QUORUM for writes, ONE for reads) prioritize availability over strict transactional guarantees.
-- Threaded Video Comments
CREATE TABLE comments (
video_id uuid,
created_at timestamp,
comment_id uuid,
user_id uuid,
content text,
PRIMARY KEY ((video_id), created_at, comment_id)
) WITH CLUSTERING ORDER BY (created_at DESC);โ๏ธ Architectural Alternatives & Design Decisions
API Design
Begins chunked multi-part session. Returns presigned URL map for TUS.
{
"title": "My Scale System Guide",
"file_size": 2516582400,
"mime_type": "video/mp4"
}{
"upload_id": "ul_01F9A...",
"video_id": "vid_01F9A...",
"chunk_size": 5242880,
"part_urls": [
{ "part_number": 1, "url": "https://s3.amazonaws.com/raw/part1?sig=..." }
]
}Deep Dive Subsystems
Direct Ingestion via TUS
The client negotiates with the Upload Service to generate secure, presigned S3 URLs. The client then pushes 5MB chunks directly to the S3 bucket:
VOD Upload Swimlanes
// Client-side chunked concurrent uploader using S3 Multipart
async function uploadPartWithRetry(file: File, uploadId: string, parts: PartUrl[]) {
const CHUNK_SIZE = 5 * 1024 * 1024; // 5MB standard blocks
const etags: { partNumber: number; etag: string }[] = [];
const CONCURRENCY_LIMIT = 4;
for (let i = 0; i < parts.length; i += CONCURRENCY_LIMIT) {
const batch = parts.slice(i, i + CONCURRENCY_LIMIT);
const results = await Promise.all(
batch.map(async (part) => {
const start = (part.partNumber - 1) * CHUNK_SIZE;
const chunk = file.slice(start, start + CHUNK_SIZE);
// Direct-to-S3 PUT write
const res = await fetch(part.url, {
method: "PUT",
body: chunk,
headers: { "Content-Type": file.type }
});
return { partNumber: part.partNumber, etag: res.headers.get("ETag")! };
})
);
etags.push(...results);
}
// Confirms state completion and triggers transcoding queue
await fetch(`/api/v1/videos/${uploadId}/complete`, {
method: "POST",
body: JSON.stringify({ parts: etags })
});
}Internal Lifecycle State machine
GPU Video Transcoding DAG (Directed Acyclic Graph)
To convert high-definition raw videos into streamable packets without blocking system threads, we split tasks into an independent **Directed Acyclic Graph (DAG)**. This ensures demuxing, multi-resolution scaling, image extraction, and watermark additions proceed in parallel pathways with isolated failure recovery:
Video Processing Directed Acyclic Graph (DAG) Subsystem
h264_nvenc) to execute these tasks concurrently. Why hardware acceleration? Dedicated physical silicon ASIC block arrays on modern GPUs process pixel conversions and video compression matrix math much faster and more efficiently than standard multi-core CPUs. Offloading raw framing computations to these dedicated circuits lowers total CPU utilization by up to 95% and reduces infrastructure costs tenfold, enabling concurrent rendering of multiple 4K/1080p target streaming ladders.# FFmpeg segment scale command targeting visual VMAF perceptual scores
ffmpeg -y -hwaccel cuda -hwaccel_output_format cuda -i input_raw.mov \
-vf "scale_cuda=1920:1080" \
-c:v h264_nvenc -preset p4 -b:v 5000k -maxrate 5500k -bufsize 11000k \
-c:a aac -b:a 192k \
-f hls \
-hls_time 6 \
-hls_playlist_type vod \
-hls_segment_type fmp4 \
-hls_segment_filename "s3://prod-media/video-12/1080p/seg_%03d.m4s" \
"s3://prod-media/video-12/1080p/index.m3u8"Adaptive Bitrate (ABR) Controller
Implemented directly on the client player via standard BOLABuffer-Occupancy-based Lyapunov Algorithm. An ABR logic that selects video qualities primarily based on current player buffer size. buffer-driven logic. It monitors player buffer health and switches quality classes to prevent playback stutters:
// Simplified client-side quality decider matching buffer capacities
class ABRController {
private currentQuality = "720p";
private bufferLevelSeconds = 30; // Active player buffer
private estimatedBandwidthBps = 4500000;
public getTargetQuality(): string {
// Safety drop-down boundaries
if (this.bufferLevelSeconds < 5) {
return "360p"; // Emergency fallback
}
if (this.bufferLevelSeconds > 40 && this.estimatedBandwidthBps > 8000000) {
return "1080p"; // Upshift
}
return this.currentQuality;
}
}Multi-Tier Cache & Origin Shield
We leverage an Origin ShieldAn extra caching layer in front of S3 storage to collapse multi-CDN request stampedes (thundering herds) into a single call. (L2 regional CDN shield) in front of S3 storage bucket arrays. This collapses thundering herd request patterns on popular video releases into single calls:
Bottlenecks & Scaling Mitigations
An account with 10M subscribers publishes a video, triggering a thundering herd request pattern that bypasses local CDN caches.
Millions of viewers trigger concurrent database writes, overwhelming the primary relational database locks.
The infinite growth of uploaded UGC videos quickly balloons S3 storage costs.
A sudden influx of creators uploading content delays transcoding times, leaving videos stuck in a pending queue.
๐ Quiz: Test Your Understanding
Check how well you learned the URL shortener system design. 20 questions.