URL Shortener
System Design
System Design Interview · Distributed Systems · bit.ly / tinyurl clone

Design URL Shortener

A production-grade URL shortening service capable of handling billions of redirects. This walkthrough covers high-throughput read paths, distributed Snowflake ID generation, caching strategies, robust rate limiting, and includes an interactive quiz.

Step 01

Functional Requirements

Core Features

  • URL Shortening: Given a long URL, generate a unique short URL (≤8 chars)
  • URL Redirection: Redirect short URL to original with <50ms p99 latency
  • Custom Expiry: Support TTL / expiration dates per link
  • Custom Aliases: Allow users to define custom slugs (e.g. /my-promo)
  • Analytics: Track click count, referrer, geo, user-agent
  • Access Control: Optional password protection or private links

Out of Scope (V1)

  • Real-time click stream dashboards
  • Link-in-bio landing pages
  • QR code generation (V2)
  • Team / organization management
  • Browser extension
  • Bulk URL upload

User Journeys

Anonymous User
→ Shorten URL (limited)
→ Follow redirect
→ View public analytics
Registered User
→ Create custom alias
→ Set expiry
→ View own analytics dashboard
→ Manage & delete links
API Consumer
→ API key authentication
→ Bulk shortening
→ Webhook on clicks
→ Programmatic analytics

Alternatives & Tradeoffs

Option: Custom Aliases vs. Auto-generated only
✓ Pro: Higher user value, brand-friendly links
✗ Con: Namespace conflicts, reservation squatting
Option: Analytics in same service vs. separate
✓ Pro: Fewer network hops if co-located
✗ Con: Analytics load shouldn't impact redirect latency; separate is better
Option: Expiry via TTL vs. background job
✓ Pro: Redis TTL expires cached entries automatically with negligible overhead
✗ Con: Stale DB entries if only using Redis TTL; need DB cleanup job too
Step 02

Non-Functional Requirements

Performance
  • Read Latency: < 10ms p50 / < 50ms p99
  • Write Latency: < 200ms p99
  • Throughput: 100K reads/sec, 1K writes/sec at peak
Availability
  • SLA: 99.99% uptime (52 min downtime/year)
  • Failover: Auto-failover < 30 seconds
  • Redundancy: Multi-AZ deployment; no SPOF
Scalability
  • URLs stored: 100M+ unique short URLs
  • Horizontal: Stateless services, scale-out pattern
  • DB sharding: Shard on short_code hash
Durability
  • Persistence: URLs must not be lost
  • Backup: Daily snapshots + point-in-time recovery
  • Replication: 3× replicas across 2+ AZs
Security
  • Auth: JWT tokens + API key
  • Rate Limiting: 100 URLs/hr anon, 10K/hr premium
  • Malicious URLs: Google Safe Browsing API check
Consistency
  • Model: Eventual consistency for analytics OK
  • URL reads: Strong consistency (read-your-writes)
  • Uniqueness: Strict guarantee for short_code

CAP Theorem Positioning

The URL shortener prioritizes Availability (A) + Partition Tolerance (P), i.e. it is an AP system.

If a node goes down, we prefer serving stale cache over returning errors. For the write path (URL creation), we briefly sacrifice availability for consistency to guarantee unique short codes.

[CAP diagram: Consistency, Availability, Partition tolerance; chosen position: AP]

Alternatives & Tradeoffs

Option: Strong vs. Eventual Consistency for redirects
✓ Pro: Eventual: cache can serve redirects without DB round-trip, massive latency gain
✗ Con: A deleted URL might still redirect for seconds (cache TTL window)
Option: 99.99% vs 99.999% SLA
✓ Pro: 99.99% is achievable with multi-AZ; far lower cost
✗ Con: 99.999% needs active-active multi-region, significantly higher infra cost
Option: Synchronous vs Async analytics write
✓ Pro: Async: redirect path stays at <10ms; async queue absorbs spikes
✗ Con: Analytics may have seconds delay; small risk of click loss on queue failure
Step 03

Back-of-the-Envelope Estimation

Traffic Assumptions

DAU: 100 million
Read:Write Ratio: 100:1
New URLs/day: 1 million
Redirects/day: 100 million
Write RPS: ~12 req/s
Read RPS: ~1,200 req/s
Peak Read RPS (10×): ~12,000 req/s

Storage & Cache

URL record size: ~500 bytes
New records/year: 365M
Storage/year: ~182 GB
5-year storage: ~1 TB
With replication (3×): ~3 TB
Click events/day: 100M @ 200 B = 20 GB
Click data (1 yr): ~7.3 TB
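
The figures above follow from simple arithmetic. A minimal sketch that recomputes them, assuming decimal units (1 GB = 10⁹ bytes) and 365-day years; the constant and field names are illustrative, not part of the original design:

```typescript
// Inputs taken from the assumption tables; names are illustrative.
const SECONDS_PER_DAY = 86_400;
const newUrlsPerDay = 1_000_000;
const redirectsPerDay = 100_000_000;
const urlRecordBytes = 500;
const clickEventBytes = 200;
const peakMultiplier = 10;
const replicationFactor = 3;

const GB = 1e9; // decimal gigabyte (assumption)
const TB = 1e12; // decimal terabyte (assumption)

const estimate = {
  writeRps: newUrlsPerDay / SECONDS_PER_DAY,
  readRps: redirectsPerDay / SECONDS_PER_DAY,
  peakReadRps: (redirectsPerDay / SECONDS_PER_DAY) * peakMultiplier,
  urlStoragePerYearGB: (newUrlsPerDay * 365 * urlRecordBytes) / GB,
  urlStorage5yReplicatedTB:
    (newUrlsPerDay * 365 * 5 * urlRecordBytes * replicationFactor) / TB,
  clickStoragePerDayGB: (redirectsPerDay * clickEventBytes) / GB,
  clickStoragePerYearTB: (redirectsPerDay * clickEventBytes * 365) / TB,
};

console.log(estimate);
```

Unrounded, the write rate is about 11.6 req/s and five years of replicated URL storage is about 2.7 TB; the tables round these up to 12 req/s and 3 TB.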

Short Code Math

Base62 charset: a-z A-Z 0-9 = 62 chars

6-char code: 62⁶ ≈ 56.8 billion
7-char code: 62⁷ ≈ 3.5 trillion (chosen)
8-char code: 62⁸ ≈ 218 trillion

At 1M new URLs/day, a 7-char code space lasts roughly 9,600 years.
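
The capacity numbers can be checked directly. The sketch below is illustrative (the function names are not from the original) and assumes a constant creation rate. Using the exact value of 62⁷ rather than a rounded 3.5 trillion, the 7-character space works out to about 9,648 years at 1M URLs/day.

```typescript
const ALPHABET_SIZE = 62;

// Number of distinct codes of exactly `length` characters.
// Stays exact as a JS number for length <= 8 (62^8 < 2^53).
function codeSpace(length: number): number {
  let total = 1;
  for (let i = 0; i < length; i++) total *= ALPHABET_SIZE;
  return total;
}

// Whole years until the space is used up at a constant creation rate.
function yearsUntilExhausted(length: number, newUrlsPerDay: number): number {
  return Math.floor(codeSpace(length) / (newUrlsPerDay * 365));
}

for (const len of [6, 7, 8]) {
  console.log(len, codeSpace(len), yearsUntilExhausted(len, 1_000_000));
}
```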

Alternatives & Tradeoffs

Option: 7-char vs 6-char codes
✓ Pro: 7-char gives 3.5 trillion permutations, effectively unlimited for any real system
✗ Con: Costs 1 extra char per URL (negligible); a 6-char space (~56.8 billion) would already last about 155 years at 1M URLs/day
Option: Cache all vs LRU eviction
✓ Pro: LRU eviction keeps cache small; hot URLs naturally stay in cache
✗ Con: Cold start problem: first access always misses; warm-up strategy needed post-deploy
Step 04

High-Level Design

Architecture Overview

This High-Level Design outlines how the URL Shortener handles both read-heavy redirect traffic and spikes in URL creation. Client requests hit the CDN and Load Balancer, which route traffic to stateless Read or Write microservices. A caching layer provides sub-millisecond lookups, while an asynchronous message queue buffers database writes and analytics so the core system stays responsive under load.

[Architecture diagram. Request flow: Client → CDN / Edge → Load Balancer → services. Services layer: Write Service, Read/Redirect Service, Analytics Service. Data layer: Redis Cache, URL DB (NoSQL), Meta DB (SQL), Message Queue. External / ID: ID Generator, Analytics DB. Dashed lines = async paths.]

Client
User interfaces making API calls to shorten URLs or browsers following redirect links.
CDN / Edge
Caches static assets and highly popular URL redirects at the edge, reducing latency and backend load.
Load Balancer
Distributes incoming network traffic evenly across stateless backend services.
Write Service
Validates input, requests unique IDs, and stores the short-to-long URL mappings in the database and cache.
Read/Redirect Service
Performs cache-first lookups for short codes and returns 302 HTTP redirects. Falls back to DB on cache misses.
Analytics Service
Consumes click events asynchronously from the message queue and aggregates them for reporting.
Redis Cache
In-memory datastore holding popular URL mappings for sub-millisecond read access.
URL DB (NoSQL)
Highly scalable Key-Value store (e.g., DynamoDB) serving as the primary source of truth for URL mappings.
Metadata DB (SQL)
Relational database (e.g., PostgreSQL) for user accounts, billing, and custom domain configurations.
Message Queue
Event bus (e.g., Kafka) that buffers writes during flash traffic and decouples the critical path from analytics.
ID Generator
Standalone distributed service (e.g., Twitter Snowflake) providing guaranteed unique IDs without database locking.
Analytics DB
Columnar database (e.g., ClickHouse) highly optimized for fast, large-scale OLAP queries on click data.

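Write Path

Write Path Sequence
Participants: Client, LB, Write Svc, ID Generator, Message Queue, DB Worker, URL DB, Cache
  1. Client → LB: POST /shorten {long_url}
  2. LB → Write Svc: route to write service
  3. Write Svc → ID Generator: generate unique ID
  4. ID Generator → Write Svc: return short_code
  5. Write Svc → Message Queue: enqueue {short_code, long_url}
  6. Write Svc → Cache: CACHE SET short_code
  7. Write Svc → LB: 201 {short_url}
  8. LB → Client: return short URL
Async (background):
  9. Message Queue → DB Worker: async dequeue
  10. DB Worker → URL DB: PutItem(url_record)
  11. URL DB → DB Worker: ack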
Short Code Generation Strategies
MD5 / Base62 Hash
FLOW
  1. Input: long URL
  2. MD5(long_url) → 128-bit hash
  3. Take first 41 bits (2⁴¹ < 62⁷, so the value always fits in 7 Base62 chars)
  4. Base62 encode → up to 7 chars
  5. Collision check in DB
  6. If collision → append counter
  7. Return short_code
ANALYSIS
Pros
✓ Deterministic (same URL → same code)
✓ No central coordinator needed
Cons
✗ Collision risk
✗ Sensitive to the exact URL string → tiny changes produce a different code
Auto-Increment + Base62
FLOW
  1. Input: any URL
  2. Increment global counter (atomic)
  3. Get next integer ID (e.g. 12345678)
  4. Base62 encode → 'PNFQ' (with the 0-9a-zA-Z alphabet)
  5. Store (id, short_code, long_url)
  6. Return short_code
ANALYSIS
Pros
✓ No collisions possible
✓ Simple and fast
Cons
✗ Single point of failure (counter svc)
✗ Predictable → enumerable by attackers
Cryptographically Random
FLOW
  1. Input: any URL
  2. Generate 7 random Base62 chars
  3. Check uniqueness in DB
  4. If exists → retry (low probability)
  5. Store (short_code, long_url)
  6. Return short_code
ANALYSIS
Pros
✓ Unpredictable codes
✓ No central state
Cons
✗ Probability of collision grows with dataset
✗ Requires DB round-trip to verify
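
The random strategy with collision retry can be sketched as follows. This is an illustration under stated assumptions, not the service's actual code: the `taken` set stands in for the uniqueness check against the URL DB, `RandomInt` is a hypothetical injection point for the random source, and the default uses `Math.random`, which is not cryptographically secure and would need to be replaced by a CSPRNG in practice.

```typescript
const CHARSET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

// Hypothetical source of random integers in [0, max).
type RandomInt = (max: number) => number;

// Placeholder only: Math.random is NOT cryptographically secure.
const insecureRandomInt: RandomInt = (max) => Math.floor(Math.random() * max);

function randomCode(length: number, randomInt: RandomInt): string {
  let code = "";
  for (let i = 0; i < length; i++) {
    code += CHARSET[randomInt(CHARSET.length)];
  }
  return code;
}

// `taken` stands in for the uniqueness check against the URL DB.
function generateUniqueCode(
  taken: Set<string>,
  randomInt: RandomInt = insecureRandomInt,
  length = 7,
  maxAttempts = 5,
): string {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const code = randomCode(length, randomInt);
    if (!taken.has(code)) {
      taken.add(code); // reserve the code so later calls see it as used
      return code;
    }
  }
  throw new Error(`no free code after ${maxAttempts} attempts`);
}
```

Against a real store, the check and the insert would need to be one conditional write (for example a put that only succeeds if the key does not exist), otherwise two writers could claim the same code between the check and the store.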

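Read Path

Read / Redirect Sequence
Participants: Client, Load Balancer, Read Service, Redis Cache, URL DB (NoSQL), Analytics
Cache HIT Flow
  1. Client → Load Balancer: GET /abc123
  2. Load Balancer → Read Service: forward request
  3. Read Service → Redis Cache: CACHE GET abc123
  4. Redis Cache → Read Service: Cache HIT → url
  5. Read Service → Analytics: async click event
  6. Read Service → Load Balancer: 302 Redirect
  7. Load Balancer → Client: 302 to long URL
Cache MISS Flow
  1. Client → Load Balancer: GET /abc123
  2. Load Balancer → Read Service: forward request
  3. Read Service → Redis Cache: CACHE GET abc123
  4. Redis Cache → Read Service: Cache MISS
  5. Read Service → URL DB (NoSQL): GetItem(short_code)
  6. URL DB (NoSQL) → Read Service: return long_url
  7. Read Service → Redis Cache: CACHE SET abc123
  8. Read Service → Analytics: async click event
  9. Read Service → Load Balancer: 302 Redirect
  10. Load Balancer → Client: 302 to long URL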

Redirect Status Codes

HTTP 301: Moved Permanently
Permanent redirect; browser caches it → fewer server hits but analytics won't capture repeats
HTTP 302: Found (Temporary)
Not cached by default, so every click reaches the server → accurate analytics. ✅ Recommended for bit.ly-style services

Cache Strategy

Cache-aside
Application checks cache, falls back to DB, then populates cache. Default choice.
Write-through
Write to cache and DB simultaneously. Higher write latency but no cache miss on immediate read.
TTL
Set 24h TTL by default; shorter for expiring URLs. LRU eviction for memory management.
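
The cache-aside flow with a default TTL might look like the sketch below. It is a simplified in-memory model: two `Map`s stand in for Redis and the URL DB, time is passed in explicitly as epoch seconds, and the class and field names are illustrative.

```typescript
interface CacheEntry {
  longUrl: string;
  evictAt: number; // epoch seconds after which the entry is treated as gone
}

class CacheAsideResolver {
  private readonly cache = new Map<string, CacheEntry>();
  private readonly db: Map<string, string>;
  private readonly ttlSeconds: number;
  dbReads = 0; // exposed only so the example can show when the DB is hit

  constructor(db: Map<string, string>, ttlSeconds: number = 24 * 60 * 60) {
    this.db = db;
    this.ttlSeconds = ttlSeconds;
  }

  resolve(shortCode: string, nowSec: number): string | undefined {
    const hit = this.cache.get(shortCode);
    if (hit !== undefined && hit.evictAt > nowSec) {
      return hit.longUrl; // cache hit: no DB round-trip
    }
    this.dbReads++;
    const longUrl = this.db.get(shortCode); // cache miss: fall back to the DB
    if (longUrl !== undefined) {
      // Populate the cache so the next lookup is a hit.
      this.cache.set(shortCode, { longUrl, evictAt: nowSec + this.ttlSeconds });
    }
    return longUrl;
  }
}
```

Only found records are cached here; caching "not found" results (negative caching) is a separate design decision that this sketch leaves out.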

Alternatives & Tradeoffs

Option: Monolith vs Microservices
✓ Pro: Microservices: read/write can scale independently; analytics isolation prevents perf impact
✗ Con: Microservices: ops complexity, network latency between services; monolith fine for early stage
Option: Kafka vs SQS vs Redis Streams
✓ Pro: Kafka: high throughput, replay, exactly-once processing via transactions; ideal for analytics pipeline
✗ Con: SQS: simpler ops, managed, but no replay; Redis Streams: in-memory, risk of data loss
Option: CDN caching redirects vs server-side
✓ Pro: Edge caching eliminates server load entirely for popular URLs (<1ms)
✗ Con: 301 cached by browsers kills analytics; need 302 + Cache-Control headers carefully tuned
Step 05

Data Model

Schema

NoSQL: urls (DynamoDB / Cassandra)
Core mapping. Optimized for massive scale key-value lookups. Billions of rows.
json
{
  "PK": "abc123def",            // Partition Key (short_code)
  "long_url": "https://example.com/very/long/path",
  "user_id": "987654321",       // Used for Global Secondary Index (GSI)
  "created_at": 1705312800,
  "expires_at": 1736848800,
  "is_active": true
}
PostgreSQL: users & custom_domains
Relational metadata. Strict ACID compliance for billing and account states.
sql
CREATE TABLE users (
  id           BIGINT       PRIMARY KEY,
  email        VARCHAR(255) NOT NULL UNIQUE,
  plan_tier    VARCHAR(50)  DEFAULT 'free',
  created_at   TIMESTAMP    DEFAULT NOW()
);

CREATE TABLE custom_domains (
  id           BIGINT       PRIMARY KEY,
  user_id      BIGINT       NOT NULL REFERENCES users(id),
  domain       VARCHAR(255) NOT NULL UNIQUE
);
Columnar: clicks (ClickHouse)
Append-only event log. Optimized for fast aggregations.
sql
-- MergeTree has no enforced unique PRIMARY KEY; the ORDER BY sorting key
-- doubles as the primary index, so id is a plain column here.
CREATE TABLE clicks (
  id           UInt64,
  short_code   String,
  clicked_at   DateTime,
  country_code FixedString(2),
  device_type  LowCardinality(String)
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(clicked_at)
ORDER BY (short_code, clicked_at);

ER Diagram

urls (NoSQL): short_code (PK), long_url String, user_id Number (GSI), created_at Number, expires_at Number
users (SQL): id BIGINT PK, email VARCHAR(255), api_key VARCHAR(64), plan_tier ENUM
clicks (OLAP): id BIGINT PK, short_code VARCHAR(8), clicked_at TIMESTAMP
custom_domains (SQL): id BIGINT PK, user_id BIGINT FK, domain VARCHAR(255)

Polyglot Persistence Note: While this ER Diagram shows the logical relationships between our data entities, the physical implementation is distributed. Foreign keys like user_id in the URLs table are enforced at the application layer, not the database layer, because they live in entirely different database systems (PostgreSQL vs DynamoDB).
users (SQL) → urls (NoSQL): 1:N via App Logic
urls (NoSQL) → clicks (Columnar): 1:N via App Logic
users (SQL) → custom_domains (SQL): 1:N strict DB Foreign Key

Storage Choice

PostgreSQL / MySQL
Used for: User data, billing, API keys
Why: ACID, strong consistency, easy relational queries for user management
Sharding: Standard replication (Primary + Replicas) scales well for metadata
Redis
Used for: Redirect cache (short_code → long_url)
Why: Sub-millisecond lookups, TTL support, LRU eviction, atomic counters
Sharding: Redis Cluster with hash slots (16384 slots auto-distributed)
ClickHouse / BigQuery
Used for: Analytics clicks table
Why: Columnar storage, fast aggregations on billions of rows
Sharding: Partition by month; distributed across shards by short_code
DynamoDB / Cassandra
Used for: URL mappings (Core Redirect Path)
Why: Good fit for Key-Value access patterns, seamless horizontal scale, high availability
Sharding: Partition key = short_code. Handles billions of rows natively without manual sharding.

Alternatives & Tradeoffs

Option: SQL vs NoSQL for URL mappings
✓ Pro: NoSQL completely eliminates the need for manual database sharding as the number of URLs grows to the billions.
✗ Con: Using NoSQL for URLs means you can't easily JOIN with the users table. You must rely on secondary indexes (e.g., GSI in DynamoDB) to query all URLs for a specific user dashboard.
Option: Separate analytics DB vs append to OLTP
✓ Pro: Separate ClickHouse: OLAP queries don't impact redirect latency
✗ Con: Extra infra; eventual consistency between OLTP and analytics store
Option: Denormalized click_count in urls vs COUNT(*) on clicks table
✓ Pro: Denormalized: O(1) read for total clicks display
✗ Con: Needs atomic increment (Redis INCR + periodic sync); slight inconsistency acceptable
Step 06

API Design

Endpoints
Auth Header
Authorization: Bearer <jwt>
or API key via X-API-Key header
POST

/api/v1/shorten

Create a shortened URL

REQUEST
json
{
  "long_url":     "https://www.example.com/very/long/path?q=1",
  "custom_alias": "my-promo",      // optional
  "expires_at":   "2025-12-31T23:59:59Z", // optional ISO-8601
  "title":        "My Campaign Link",     // optional
  "tags":         ["marketing", "q4"]     // optional
}
RESPONSE
json
{
  "data": {
    "id":           "abc123def",
    "short_url":    "https://bit.ly/my-promo",
    "short_code":   "my-promo",
    "long_url":     "https://www.example.com/very/long/path?q=1",
    "created_at":   "2024-01-15T10:30:00Z",
    "expires_at":   "2025-12-31T23:59:59Z",
    "qr_code_url":  "https://bit.ly/my-promo/qr"
  }
}
ERROR CODES
400: Invalid URL format or malicious URL detected
409: Custom alias already taken
429: Rate limit exceeded
GET

/{short_code}

Redirect to original URL (public endpoint)

REQUEST
json
No request body. Path param: short_code (e.g. my-promo)
RESPONSE
json
HTTP/1.1 302 Found
Location: https://www.example.com/very/long/path?q=1
X-RateLimit-Remaining: 999
Cache-Control: no-store
ERROR CODES
404: Short code not found or never existed
410: URL has expired (TTL passed)
GET

/api/v1/urls/{short_code}

Get metadata for a short URL (authenticated)

REQUEST
json
Headers: Authorization: Bearer <jwt_token>
RESPONSE
json
{
  "data": {
    "id":           "abc123def",
    "short_url":    "https://bit.ly/my-promo",
    "long_url":     "https://www.example.com/very/long/path?q=1",
    "click_count":  42891,
    "created_at":   "2024-01-15T10:30:00Z",
    "expires_at":   "2025-12-31T23:59:59Z",
    "is_active":    true
  }
}
ERROR CODES
401: Missing or invalid token
403: URL belongs to another user
404: Short code not found
DELETE

/api/v1/urls/{short_code}

Deactivate / delete a short URL

REQUEST
json
Headers: Authorization: Bearer <jwt_token>
RESPONSE
json
HTTP/1.1 204 No Content
ERROR CODES
401: Not authenticated
403: Not the owner
404: Short code not found
GET

/api/v1/urls/{short_code}/analytics

Get analytics for a URL

REQUEST
json
Query params:
  from=2024-01-01&to=2024-01-31
  granularity=day   // hour | day | week | month
  group_by=country  // country | device | referrer
RESPONSE
json
{
  "data": {
    "total_clicks": 42891,
    "unique_visitors": 18234,
    "time_series": [
      { "date": "2024-01-01", "clicks": 1423 },
      { "date": "2024-01-02", "clicks": 2105 }
    ],
    "breakdown": {
      "by_country":  [{ "country": "US", "clicks": 18000 }],
      "by_device":   [{ "device": "mobile", "clicks": 25000 }],
      "by_referrer": [{ "referrer": "twitter.com", "clicks": 9000 }]
    }
  }
}
ERROR CODES
401: Not authenticated
400: Invalid date range or granularity
PATCH

/api/v1/urls/{short_code}

Update URL metadata (title, expiry, destination)

REQUEST
json
{
  "long_url":   "https://new-destination.com",  // optional
  "title":      "Updated title",                 // optional
  "expires_at": "2026-06-30T00:00:00Z",          // optional
  "is_active":  true                             // optional
}
RESPONSE
json
{
  "data": { /* full url object */ }
}
ERROR CODES
401: Not authenticated
400: Invalid input

Rate Limiting Headers (all responses)

http
X-RateLimit-Limit:     1000
X-RateLimit-Remaining: 987
X-RateLimit-Reset:     1705312800  (Unix timestamp)
Retry-After:           60          (seconds, only on 429)

Alternatives & Tradeoffs

Option: REST vs GraphQL
✓ Pro: REST: simpler caching (GET is naturally cacheable), CDN-friendly, universal tooling
✗ Con: GraphQL: flexible queries for analytics dashboard, but harder to cache, over-engineering for this use case
Option: 302 vs 307 for redirect
✓ Pro: 302 is universally supported for simple GET redirects and ensures analytics tracking.
✗ Con: 307 strictly preserves POST payloads, but since redirects are always GETs in this scenario, this strictness adds no value here.
Option: JWT vs API Key auth
✓ Pro: JWT: stateless, contains claims, self-expiring; API Key: simple, good for server-to-server
✗ Con: JWT: harder to revoke without blocklist; API Key: must be stored securely, simpler to revoke
Step 07

Deep Dive

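ID Generation

Snowflake ID Generation
[Diagram: Twitter Snowflake ID generation. An incoming request combines an epoch timestamp (41 bits), a machine ID (10 bits), and a sequence (12 bits) into a 64-bit Snowflake ID.]

A 64-bit Snowflake ID is composed of a 41-bit timestamp (milliseconds since a custom epoch), a 10-bit machine ID (datacenter + worker), and a 12-bit sequence number (4,096 IDs per millisecond per node); the remaining top bit is an unused sign bit. This gives us uniqueness without coordination between nodes. The numeric ID is then Base62-encoded to produce the short code. Note that a full 64-bit value encodes to as many as 11 Base62 characters, so meeting the 7-character target requires a narrower ID layout (for example fewer timestamp or sequence bits).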
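
A minimal sketch of a generator with this bit layout is shown below. It is an illustration, not Twitter's implementation: the class name, the custom epoch (2024-01-01T00:00:00Z), and the injectable clock are assumptions made for the example, and it simply refuses to issue IDs if the clock moves backward.

```typescript
// Bit layout from the text: 41-bit timestamp | 10-bit machine ID | 12-bit sequence.
const MACHINE_BITS = 10n;
const SEQUENCE_BITS = 12n;
const MAX_SEQUENCE = (1n << SEQUENCE_BITS) - 1n; // 4095

// Hypothetical custom epoch (2024-01-01T00:00:00Z); any fixed past instant works.
const CUSTOM_EPOCH_MS = 1704067200000n;

class SnowflakeGenerator {
  private lastMs = -1n;
  private sequence = 0n;
  private readonly machineId: bigint;
  private readonly nowMs: () => bigint;

  constructor(machineId: number, nowMs: () => bigint = () => BigInt(Date.now())) {
    if (!Number.isInteger(machineId) || machineId < 0 || machineId >= 1024) {
      throw new RangeError("machineId must fit in 10 bits");
    }
    this.machineId = BigInt(machineId);
    this.nowMs = nowMs;
  }

  next(): bigint {
    let now = this.nowMs();
    if (now < this.lastMs) {
      throw new Error("clock moved backwards; refusing to generate IDs");
    }
    if (now === this.lastMs) {
      this.sequence = (this.sequence + 1n) & MAX_SEQUENCE;
      if (this.sequence === 0n) {
        // Sequence exhausted for this millisecond: spin until the clock advances.
        while (now <= this.lastMs) now = this.nowMs();
      }
    } else {
      this.sequence = 0n;
    }
    this.lastMs = now;
    return (
      ((now - CUSTOM_EPOCH_MS) << (MACHINE_BITS + SEQUENCE_BITS)) |
      (this.machineId << SEQUENCE_BITS) |
      this.sequence
    );
  }
}
```

Two calls within the same millisecond on the same machine differ only in the sequence bits, so IDs from one node are strictly increasing.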

Base62 Encoding
typescript
// Base62 charset: 0-9a-zA-Z (62 characters)
const CHARSET = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ';

function toBase62(num: bigint): string {
  if (num === 0n) return CHARSET[0];
  let result = '';
  while (num > 0n) {
    result = CHARSET[Number(num % 62n)] + result;
    num = num / 62n;
  }
  return result;
}

// ID 12345678901 → "dtvd3f"
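
Decoding is the inverse operation and is useful when the numeric ID has to be recovered from a short code. A sketch using the same 0-9a-zA-Z alphabet; the names `encodeBase62` and `decodeBase62` are illustrative:

```typescript
const ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

function encodeBase62(num: bigint): string {
  if (num < 0n) throw new RangeError("only non-negative IDs can be encoded");
  if (num === 0n) return ALPHABET[0];
  let result = "";
  while (num > 0n) {
    result = ALPHABET[Number(num % 62n)] + result;
    num = num / 62n; // bigint division truncates
  }
  return result;
}

function decodeBase62(code: string): bigint {
  let value = 0n;
  for (const ch of code) {
    const digit = ALPHABET.indexOf(ch);
    if (digit < 0) throw new Error(`invalid Base62 character: ${ch}`);
    value = value * 62n + BigInt(digit);
  }
  return value;
}

console.log(encodeBase62(12345678901n)); // "dtvd3f"
console.log(encodeBase62(2n ** 63n - 1n).length); // 11
```

The second line of output shows why code length depends on the ID width: the largest 63-bit value needs 11 Base62 characters.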
Clock Skew Problem

If the system clock moves backward (e.g. after an NTP correction), Snowflake may generate duplicate IDs.

Solutions:
  • Wait until clock catches up (if skew < 10ms)
  • Use logical clock + atomic monotonic sequence
  • Use Boundary for strict ordering
Alternatives
UUID v4: 16 bytes, random; no ordering, collision risk if truncated to fit a short code
MD5 of URL: Deterministic but collision risk
Counter + ZK: Simple but ZK is SPOF
Snowflake ✓: Best: ordered, distributed, compact

URL Expiry

Strategy Comparison
Option 1: Lazy Expiry
  1. User requests short URL
  2. Read service fetches URL
  3. Check expires_at field
  4. If expired → return 410 Gone
  5. Background job deletes old records
⚠️ Expired URLs linger in DB until cleanup job runs. Simple but wastes storage.
Option 2: Redis TTL + DB Job (Chosen)
  1. Set Redis key with TTL = expires_at - now()
  2. Cache miss on TTL expiry → DB lookup
  3. Read service returns 410 if expires_at has passed
  4. Daily cron: DELETE FROM urls WHERE expires_at < NOW()
  5. Or: Mark is_active = FALSE for soft delete
✅ Cache evicts automatically; DB cleaned up async. Best of both worlds.
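
The expiry check and the cache TTL calculation can be sketched as two small pure functions. The type and function names are illustrative, timestamps are epoch seconds, and treating a soft-deleted (`isActive = false`) link as 410 is an assumption rather than something the design states.

```typescript
interface StoredUrl {
  longUrl: string;
  expiresAt: number | null; // epoch seconds; null means the link never expires
  isActive: boolean;
}

type RedirectResult =
  | { status: 302; location: string }
  | { status: 404 }
  | { status: 410 };

// Lazy expiry: decide the response at read time from the stored record.
function resolveRedirect(record: StoredUrl | undefined, nowSec: number): RedirectResult {
  if (record === undefined) return { status: 404 };
  const expired = record.expiresAt !== null && record.expiresAt <= nowSec;
  // Assumption: soft-deleted links are reported as Gone, like expired ones.
  if (expired || !record.isActive) return { status: 410 };
  return { status: 302, location: record.longUrl };
}

// Cache TTL: never keep an entry past its expiry, and cap at the default TTL.
function cacheTtlSeconds(record: StoredUrl, nowSec: number, defaultTtl = 86_400): number {
  if (record.expiresAt === null) return defaultTtl;
  return Math.max(0, Math.min(defaultTtl, record.expiresAt - nowSec));
}
```

Capping the TTL at `expires_at - now` means the cache entry disappears no later than the link itself, so the read path falls through to the record check once the link has expired.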

Rate Limiting

Tier Architecture
Anonymous
Rate: 100 URLs/hr
Burst: 10 req/min
Free User
Rate: 1,000 URLs/hr
Burst: 100 req/min
Pro / Enterprise
Rate: 100K URLs/hr
Burst: 10K req/min
Token Bucket Algorithm (Redis)
lua
-- Lua script (runs atomically in Redis)
-- KEYS[1] = per-user bucket key, e.g. "rate:<user_id>"
local key = KEYS[1]
local now = tonumber(ARGV[1])    -- current time in seconds
local rate = tonumber(ARGV[2])   -- refill rate in tokens/sec
local burst = tonumber(ARGV[3])  -- max bucket size

local bucket = redis.call("HMGET", key, "tokens", "last")
local tokens = tonumber(bucket[1]) or burst
local last = tonumber(bucket[2]) or now

-- Refill tokens for the time elapsed since the last allowed request
local delta = math.max(0, now - last)
tokens = math.min(burst, tokens + delta * rate)

if tokens >= 1 then
  tokens = tokens - 1
  redis.call("HSET", key, "tokens", tokens, "last", now)
  redis.call("EXPIRE", key, 3600)
  return 1  -- allowed
end
return 0  -- denied
Sliding Window Alternative
lua
-- Sliding window log in a Redis sorted set
-- KEYS[1] = per-user key, e.g. "swl:<user_id>" (one key per user, not per
-- minute, so the window slides instead of resetting at minute boundaries)
local key = KEYS[1]
local now = tonumber(ARGV[1])  -- current time in seconds
local window = 60  -- 60 second window
local limit = 100  -- max requests

-- Remove old entries outside window
redis.call("ZREMRANGEBYSCORE", key, 0, now - window)

-- Count current window
local count = redis.call("ZCARD", key)

if count < limit then
  redis.call("ZADD", key, now, now .. math.random())
  redis.call("EXPIRE", key, window)
  return 1  -- allowed
end
return 0  -- denied
Sliding window is accurate but uses O(requests) memory. Token bucket is O(1) per user.
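
For readers less familiar with Lua, the token bucket logic can be mirrored in a few lines of TypeScript. This is an in-memory illustration only: a `Map` replaces Redis, so it has none of the atomicity or cross-process guarantees of the script, and the class name is hypothetical.

```typescript
interface Bucket {
  tokens: number;
  last: number; // time of the last allowed request, in seconds
}

class TokenBucketLimiter {
  private readonly buckets = new Map<string, Bucket>();
  private readonly ratePerSec: number;
  private readonly burst: number;

  constructor(ratePerSec: number, burst: number) {
    this.ratePerSec = ratePerSec;
    this.burst = burst;
  }

  allow(key: string, nowSec: number): boolean {
    // A new key starts with a full bucket.
    const bucket = this.buckets.get(key) ?? { tokens: this.burst, last: nowSec };
    const elapsed = Math.max(0, nowSec - bucket.last);
    const tokens = Math.min(this.burst, bucket.tokens + elapsed * this.ratePerSec);
    if (tokens < 1) {
      // Denied: state is left untouched, so elapsed time keeps accumulating.
      return false;
    }
    this.buckets.set(key, { tokens: tokens - 1, last: nowSec });
    return true;
  }
}

const limiter = new TokenBucketLimiter(1, 2); // 1 token/sec, burst of 2
console.log(limiter.allow("user:42", 0)); // true
console.log(limiter.allow("user:42", 0)); // true
console.log(limiter.allow("user:42", 0)); // false (bucket empty)
console.log(limiter.allow("user:42", 1)); // true (one token refilled)
```

State per key is two numbers regardless of traffic, which is the O(1) memory property noted above.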

Custom Domains

Setup Flow
  1. User adds domain in dashboard (e.g. go.acme.com)
  2. System provides CNAME record: go.acme.com → us.bitly.com
  3. User adds CNAME at their DNS provider
  4. System polls DNS; verifies CNAME resolution
  5. Auto-provisions TLS certificate (Let's Encrypt / ACM)
  6. Domain is activated; URLs with prefix work
Routing Architecture
Nginx / Load Balancer Config
nginx
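# Wildcard cert + domain routing. Note: the *.bitly.com wildcard certificate
# does not cover go.acme.com; each custom domain needs its own certificate
# (provisioned in setup step 5).
server {
  listen 443 ssl;
  server_name *.bitly.com go.acme.com;

  ssl_certificate     /etc/certs/wildcard.crt;
  ssl_certificate_key /etc/certs/wildcard.key;  # assumed path; nginx needs the key with the cert

  location / {
    # Pass Host header; app resolves domain → user_id
    proxy_pass http://read-service;
    proxy_set_header Host $host;
  }
}
On redirect request: extract Host header → lookup custom_domains table → get user_id → namespace short code to that user.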
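
The Host-header lookup described above could be sketched like this. The `Map` stands in for the custom_domains table, and the `userId:shortCode` key scheme is a hypothetical way to namespace codes per user, not something the design specifies.

```typescript
// Either the shared default namespace or a namespace owned by one user.
type Namespace = { kind: "global" } | { kind: "user"; userId: string };

// `customDomains` stands in for the custom_domains table (domain -> user_id).
function resolveNamespace(
  hostHeader: string,
  customDomains: Map<string, string>,
): Namespace | undefined {
  // Host headers are case-insensitive and may carry a port.
  const host = hostHeader.toLowerCase().replace(/:\d+$/, "");
  if (host === "bitly.com" || host.endsWith(".bitly.com")) {
    return { kind: "global" };
  }
  const userId = customDomains.get(host);
  return userId === undefined ? undefined : { kind: "user", userId };
}

// Hypothetical key scheme: prefix user-scoped codes with the owner's id.
function lookupKey(ns: Namespace, shortCode: string): string {
  return ns.kind === "global" ? shortCode : `${ns.userId}:${shortCode}`;
}
```

A request for an unknown host returns `undefined`, which the caller would presumably map to a 404 rather than falling back to the global namespace.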
Step 08

Bottlenecks & Mitigation

Read Path Scaling
Bottleneck: 100K+ redirects/sec; a single service instance can't handle it
Mitigations:
→ Horizontal scaling: stateless Read Service pods behind LB
→ Redis Cluster: shard cache across nodes; 90%+ hit rate
→ CDN edge caching: cache popular redirects at edge
→ Read replicas: 3× PostgreSQL read replicas
Write Path Scaling
Bottleneck: 1K writes/sec peak; DB can't absorb synchronous writes
Mitigations:
→ Message Queue (Kafka): buffer writes; Service returns early
→ Async DB workers: consume from Kafka, batch-insert
→ Connection pooling: PgBouncer to avoid exhaustion
→ Horizontal Write Service: stateless pods; ID generator is external
Database Sharding
Bottleneck: 100M+ URLs; single DB hits storage/throughput limits
Mitigations:
→ Hash-based sharding: shard_id = hash(short_code) % N_shards
→ Consistent hashing ring: minimal rehashing on scale
→ Virtual nodes (vnodes): distribute data evenly
→ Avoid user-based sharding: prevents popular user hotspots
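
A consistent hashing ring with virtual nodes can be sketched as follows. This is an illustration under stated assumptions: the hash is FNV-1a followed by the MurmurHash3 32-bit finalizer, chosen only because it is short and dependency-free, the lookup is a linear scan where a real implementation would use binary search, and the class and shard names are hypothetical.

```typescript
// 32-bit FNV-1a over UTF-16 code units, then the MurmurHash3 finalizer
// to spread similar keys around the ring.
function hash32(input: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    h ^= input.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  h ^= h >>> 16;
  h = Math.imul(h, 0x85ebca6b);
  h ^= h >>> 13;
  h = Math.imul(h, 0xc2b2ae35);
  h ^= h >>> 16;
  return h >>> 0; // force unsigned 32-bit
}

class HashRing {
  private ring: { point: number; shard: string }[] = [];
  private readonly vnodes: number;

  constructor(shards: string[], vnodes = 100) {
    this.vnodes = vnodes;
    for (const shard of shards) this.addShard(shard);
  }

  addShard(shard: string): void {
    // Each shard is placed on the ring at `vnodes` pseudo-random points.
    for (let v = 0; v < this.vnodes; v++) {
      this.ring.push({ point: hash32(`${shard}#${v}`), shard });
    }
    this.ring.sort((x, y) => x.point - y.point);
  }

  shardFor(shortCode: string): string {
    if (this.ring.length === 0) throw new Error("ring has no shards");
    const h = hash32(shortCode);
    // First virtual node clockwise from the key's hash, wrapping around.
    const owner = this.ring.find((node) => node.point >= h) ?? this.ring[0];
    return owner.shard;
  }
}
```

When a shard is added, a key either keeps its owner or moves to the new shard; keys never shuffle between existing shards. That is the property that makes resharding cheaper than with `hash % N`.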
Hot URL / Celebrity Problem
Bottleneck: Single URL getting millions of redirects/min
Mitigations:
→ Cache hot URL at all layers: CDN + Redis + local in-memory LRU
→ Sticky routing: route hot short_code to dedicated node
→ Rate limit incoming clicks per short_code to prevent abuse
→ Auto-detect: if click_rate > threshold, pre-load to CDN edge
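
The local in-memory LRU layer mentioned above can be very small. A sketch that relies on `Map` preserving insertion order; the class name and capacity are illustrative:

```typescript
class LruCache<V> {
  private readonly entries = new Map<string, V>();
  private readonly capacity: number;

  constructor(capacity: number) {
    if (capacity < 1) throw new RangeError("capacity must be at least 1");
    this.capacity = capacity;
  }

  get(key: string): V | undefined {
    const value = this.entries.get(key);
    if (value === undefined) return undefined;
    // Re-insert to mark the key as most recently used.
    this.entries.delete(key);
    this.entries.set(key, value);
    return value;
  }

  set(key: string, value: V): void {
    this.entries.delete(key);
    this.entries.set(key, value);
    if (this.entries.size > this.capacity) {
      // Map iterates in insertion order, so the first key is the least recently used.
      const oldest = this.entries.keys().next().value as string;
      this.entries.delete(oldest);
    }
  }
}
```

A per-process cache like this absorbs repeated hits on a hot short code before they reach Redis, at the cost of each pod holding its own possibly stale copy.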

Alternatives & Tradeoffs

Option: Consistent hashing vs modulo sharding
✓ Pro: Consistent hashing: adding a shard only rebalances 1/N of keys, not all
✗ Con: Modulo sharding: simpler, no extra infrastructure, but resharding requires full migration
Option: Read replicas vs CQRS pattern
✓ Pro: CQRS: separate read model (denormalized) can be optimized differently from write model
✗ Con: CQRS: significant complexity; read replicas are sufficient for 99% of use cases at this scale

Quiz: Test Your Understanding

Check how well you learned the URL shortener system design. 20 questions.


What trade-off is made regarding the CAP theorem for the URL shortener system?

URL Shortener System Design · 8-Step Framework · Built for Senior/Staff Engineering Interviews
PostgreSQL · Redis · Kafka · ClickHouse · Base62 · Microservices · API Gateway · Snowflake IDs