
Design URL Shortener

A production-grade URL shortening service capable of handling billions of redirects. This walkthrough covers high-throughput read paths, distributed Snowflake ID generation, caching strategies, and rate limiting.

Step 01

Functional Requirements

Core Features

🔗

URL Shortening

Given a long URL, generate a unique short URL (≤8 chars)

↗️

URL Redirection

Redirect short URL to original with <50ms p99 latency

⏰

Custom Expiry

Support TTL / expiration dates per link

✏️

Custom Aliases

Allow users to define custom slugs (e.g. /my-promo)

📊

Analytics

Track click count, referrer, geo, user-agent

🔒

Access Control

Optional password protection or private links

Out of Scope (V1)

○ Real-time click stream dashboards
○ Link-in-bio landing pages
○ QR code generation (V2)
○ Team / organization management
○ Browser extension
○ Bulk URL upload

User Journeys

👤 Anonymous User

→ Shorten URL (limited)

→ Follow redirect

→ View public analytics

👤✓ Registered User

→ Create custom alias

→ Set expiry

→ View own analytics dashboard

→ Manage & delete links

⚙️ API Consumer

→ API key authentication

→ Bulk shortening

→ Webhook on clicks

→ Programmatic analytics

⚖️ Alternatives & Tradeoffs

Option: Custom Aliases vs. Auto-generated only

✓ Pro: Higher user value, brand-friendly links

✗ Con: Namespace conflicts, reservation squatting

Option: Analytics in same service vs. separate

✓ Pro: Fewer network hops if co-located

✗ Con: Analytics load shouldn't impact redirect latency; separate is better

Option: Expiry via TTL vs. background job

✓ Pro: Redis TTL expires keys automatically, with no application code and negligible overhead

✗ Con: Stale DB entries if only using Redis TTL; need DB cleanup job too

Step 02

Non-Functional Requirements

⚡Performance

›Read Latency: < 10ms p50 / < 50ms p99
›Write Latency: < 200ms p99
›Throughput: 100K reads/sec, 1K writes/sec at peak

🟢Availability

›SLA: 99.99% uptime (52 min downtime/year)
›Failover: Auto-failover < 30 seconds
›Redundancy: Multi-AZ deployment; no SPOF

📈Scalability

›URLs stored: 100M+ unique short URLs
›Horizontal: Stateless services, scale-out pattern
›DB sharding: Shard on short_code hash

🛡️Durability

›Persistence: URLs must not be lost
›Backup: Daily snapshots + point-in-time recovery
›Replication: 3× replicas across 2+ AZs

🔐Security

›Auth: JWT tokens + API key
›Rate Limiting: 100 URLs/hr anonymous, up to 100K/hr on paid tiers
›Malicious URLs: Google Safe Browsing API check

🔄Consistency

›Model: Eventual consistency for analytics OK
›URL reads: Strong consistency (read-your-writes)
›Uniqueness: Strict guarantee for short_code

CAP Theorem Positioning

The URL shortener prioritizes Availability (A) + Partition Tolerance (P) — i.e. an AP system.

If a node goes down, we prefer serving stale cache over returning errors. For the write path (URL creation), we briefly sacrifice availability for consistency to guarantee unique short codes.

⚖️ Alternatives & Tradeoffs

Option: Strong vs. Eventual Consistency for redirects

✓ Pro: Eventual: cache can serve redirects without DB round-trip, massive latency gain

✗ Con: A deleted URL might still redirect for seconds (cache TTL window)

Option: 99.99% vs 99.999% SLA

✓ Pro: 99.99% is achievable with multi-AZ; far lower cost

✗ Con: 99.999% needs active-active multi-region, significantly higher infra cost

Option: Synchronous vs Async analytics write

✓ Pro: Async: redirect path stays at <10ms; async queue absorbs spikes

✗ Con: Analytics may have seconds delay; small risk of click loss on queue failure

Step 03

Back-of-the-Envelope Estimation

📊 Traffic Assumptions

DAU	100 million
Read:Write Ratio	100:1
New URLs/day	1 million
Redirects/day	100 million
Write RPS	~12 req/s
Read RPS	~1,200 req/s
Peak Read RPS (10×)	~12,000 req/s

🗄️ Storage & Cache

URL record size	~500 bytes
New records/year	365M
Storage/year	~182 GB
5-year storage	~1 TB
With replication (3×)	~3 TB
Click events/day	100M @ 200B = 20 GB
Click data (1 yr)	~7.3 TB
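
The figures in the two tables above follow from a handful of multiplications. The sketch below recomputes them as a sanity check; the constant names are illustrative, and it assumes a 365-day year and decimal units (1 GB = 10^9 bytes).

```typescript
const SECONDS_PER_DAY = 86400;

// Inputs taken from the assumption tables.
const newUrlsPerDay = 1000000;
const redirectsPerDay = 100000000;
const urlRecordBytes = 500;
const clickEventBytes = 200;
const peakFactor = 10;
const replicationFactor = 3;

// Traffic
const writeRps = newUrlsPerDay / SECONDS_PER_DAY;     // about 11.6, quoted as ~12
const readRps = redirectsPerDay / SECONDS_PER_DAY;    // about 1,157, quoted as ~1,200
const peakReadRps = readRps * peakFactor;             // about 11,574, quoted as ~12,000

// Storage
const urlStoragePerYearGB = (newUrlsPerDay * 365 * urlRecordBytes) / 1e9;   // 182.5
const urlStorage5YearsTB = (urlStoragePerYearGB * 5) / 1000;                // about 0.91
const replicatedStorageTB = urlStorage5YearsTB * replicationFactor;         // about 2.74
const clickStoragePerDayGB = (redirectsPerDay * clickEventBytes) / 1e9;     // 20
const clickStoragePerYearTB = (clickStoragePerDayGB * 365) / 1000;          // 7.3
```

The table rounds these up slightly (for example ~1 TB instead of 0.91 TB), which is the usual practice for capacity estimates.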

🔢 Short Code Math

Base62 charset: a-z A-Z 0-9 = 62 chars

6-char code

62⁶ = ~56 billion

7-char code

62⁷ = ~3.5 trillion ← chosen

8-char code

62⁸ = ~218 trillion

At 1M new URLs/day, a 7-char code space lasts ~9,589 years.
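
As a rough check of the code-space numbers, the sketch below computes 62^n exactly and the time to exhaust each length at a constant creation rate. The function names are illustrative. With the exact value of 62^7 the 7-char figure comes out near 9,650 years; the ~9,589 quoted above uses the rounded 3.5 trillion.

```typescript
const ALPHABET_SIZE = 62;

// Number of distinct codes of exactly `length` characters (62^length).
// Integer multiplication keeps the result exact for the lengths used here.
function codeSpace(length: number): number {
  let total = 1;
  for (let i = 0; i < length; i++) total *= ALPHABET_SIZE;
  return total;
}

// Years until every code is used, assuming a constant creation rate
// and no reuse of expired codes.
function yearsToExhaust(length: number, urlsPerDay: number): number {
  return codeSpace(length) / urlsPerDay / 365;
}

const sixCharYears = Math.floor(yearsToExhaust(6, 1000000));   // 155
const sevenCharYears = Math.floor(yearsToExhaust(7, 1000000)); // 9648
```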

⚖️ Alternatives & Tradeoffs

Option: 7-char vs 6-char codes

✓ Pro: 7-char gives 3.5 trillion permutations, effectively infinite for any real system

✗ Con: The extra character costs almost nothing, but it is not strictly needed: 6 chars (~56.8 billion codes) already covers roughly 155 years at 1M new URLs/day

Option: Cache all vs LRU eviction

✓ Pro: LRU eviction keeps cache small; hot URLs naturally stay in cache

✗ Con: Cold start problem: first access always misses; warm-up strategy needed post-deploy

Step 04

High-Level Design

🏗️ Architecture Overview

This high-level design shows how the URL shortener handles both read-heavy redirect traffic and spikes in URL creation. Client requests hit the CDN and load balancer, which route traffic to stateless Read or Write services. A caching layer provides sub-millisecond lookups, while an asynchronous message queue buffers database writes and analytics events so the redirect path stays responsive under load.


Client

User interfaces making API calls to shorten URLs or browsers following redirect links.

CDN / Edge

Caches static assets and highly popular URL redirects at the edge, reducing latency and backend load.

Load Balancer

Distributes incoming network traffic evenly across stateless backend services.

Write Service

Validates input, requests unique IDs, and stores the short-to-long URL mappings in the database and cache.

Read/Redirect Service

Performs cache-first lookups for short codes and returns 302 HTTP redirects. Falls back to DB on cache misses.

Analytics Service

Consumes click events asynchronously from the message queue and aggregates them for reporting.

Redis Cache

In-memory datastore holding popular URL mappings for sub-millisecond read access.

URL DB (NoSQL)

Highly scalable Key-Value store (e.g., DynamoDB) serving as the primary source of truth for URL mappings.

Metadata DB (SQL)

Relational database (e.g., PostgreSQL) for user accounts, billing, and custom domain configurations.

Message Queue

Event bus (e.g., Kafka) that buffers writes during flash traffic and decouples the critical path from analytics.

ID Generator

Standalone distributed service (e.g., Twitter Snowflake) providing guaranteed unique IDs without database locking.

Analytics DB

Columnar database (e.g., ClickHouse) highly optimized for fast, large-scale OLAP queries on click data.

✍️ Write Path

Write Path Sequence

Short Code Generation Strategies

MD5 / Base62 Hash

FLOW

Input: long URL

MD5(long_url) → 128-bit hash

Take first 41 bits (2⁴¹ < 62⁷, so the value always fits in 7 Base62 chars; 43 bits can overflow into an 8th)

Base62 encode → 7 chars

Collision check in DB

If collision → append counter

Return short_code

ANALYSIS

Pros

✓ Deterministic (same URL → same code)

✓ No central coordinator needed

Cons

✗ Collision risk

✗ Full-length URL matters → tiny changes change code

Auto-Increment + Base62

FLOW

Input: any URL

Increment global counter (atomic)

Get next integer ID (e.g. 12345678)

Base62 encode → 'PNFQ' (with the 0-9a-zA-Z charset used in the Base62 encoder)

Store (id, short_code, long_url)

Return short_code

ANALYSIS

Pros

✓ No collisions possible

✓ Simple and fast

Cons

✗ Single point of failure (counter svc)

✗ Predictable → enumerable by attackers

Cryptographically Random

FLOW

Input: any URL

Generate 7 random Base62 chars

Check uniqueness in DB

If exists → retry (low probability)

Store (short_code, long_url)

Return short_code

ANALYSIS

Pros

✓ Unpredictable codes

✓ No central state

Cons

✗ Probability of collision grows with dataset

✗ Requires DB round-trip to verify
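
The random strategy can be sketched as follows. This is a minimal illustration, not a definitive implementation: `RandomBytes`, `randomCode`, `generateUniqueCode` and the `exists` callback are hypothetical names, the byte source is injected so the example stays testable (in production it would be a CSPRNG), and `exists` stands in for the uniqueness check against the database.

```typescript
const BASE62 = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ';

// Source of random bytes. Assumed to be backed by a CSPRNG in production.
type RandomBytes = (count: number) => Uint8Array;

function randomCode(length: number, randomBytes: RandomBytes): string {
  let code = '';
  while (code.length < length) {
    const bytes = randomBytes(length);
    for (let i = 0; i < bytes.length && code.length < length; i++) {
      const byte = bytes[i];
      // Rejection sampling: 248 = 62 * 4, so bytes 0..247 map evenly onto
      // the 62 symbols; bytes 248..255 are discarded to avoid modulo bias.
      if (byte < 248) code += BASE62.charAt(byte % 62);
    }
  }
  return code;
}

// Retry on collision, giving up after a few attempts.
function generateUniqueCode(
  exists: (code: string) => boolean,
  randomBytes: RandomBytes,
  maxAttempts: number = 5,
): string {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const code = randomCode(7, randomBytes);
    if (!exists(code)) return code;
  }
  throw new Error('could not find a free short code');
}
```

In a real service the check-then-insert pair would need to be atomic (for example a conditional write), otherwise two writers can pass the check with the same code.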

↗️ Read Path

Read / Redirect Sequence

Cache HIT Flow

Cache MISS Flow

Redirect Status Codes

HTTP 301 — Moved Permanently

Permanent redirect; browser caches it → fewer server hits but analytics won't capture repeats

HTTP 302 — Found (Temporary)

Always hits server → accurate analytics. ✅ Recommended for bit.ly-style services

Cache Strategy

Cache-aside

Application checks cache, falls back to DB, then populates cache. Default choice.

Write-through

Write to cache and DB simultaneously. Higher write latency but no cache miss on immediate read.

TTL

Set 24h TTL by default; shorter for expiring URLs. LRU eviction for memory management.
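
The cache-aside flow with a default TTL can be sketched as below. Two in-memory maps stand in for Redis and the URL database, and `RedirectResolver`, `resolve` and `dbReads` are illustrative names rather than part of the design above.

```typescript
interface CacheEntry {
  longUrl: string;
  expiresAtMs: number;
}

class RedirectResolver {
  private readonly cache = new Map<string, CacheEntry>(); // stands in for Redis
  private readonly db: Map<string, string>;               // stands in for the URL DB
  private readonly ttlMs: number;
  dbReads = 0; // exposed only so the example can show hit vs. miss

  constructor(db: Map<string, string>, ttlMs: number = 24 * 60 * 60 * 1000) {
    this.db = db;
    this.ttlMs = ttlMs;
  }

  resolve(shortCode: string, nowMs: number): string | undefined {
    // 1. Check the cache first.
    const cached = this.cache.get(shortCode);
    if (cached !== undefined && cached.expiresAtMs > nowMs) return cached.longUrl;

    // 2. Cache miss (or expired entry): fall back to the database.
    this.dbReads++;
    const longUrl = this.db.get(shortCode);

    // 3. Populate the cache so the next lookup is a hit.
    if (longUrl !== undefined) {
      this.cache.set(shortCode, { longUrl, expiresAtMs: nowMs + this.ttlMs });
    } else {
      this.cache.delete(shortCode);
    }
    return longUrl;
  }
}
```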

⚖️ Alternatives & Tradeoffs

Option: Monolith vs Microservices

✓ Pro: Microservices: read/write can scale independently; analytics isolation prevents perf impact

✗ Con: Microservices: ops complexity, network latency between services; monolith fine for early stage

Option: Kafka vs SQS vs Redis Streams

✓ Pro: Kafka: high throughput, replay, exactly-once; ideal for analytics pipeline

✗ Con: SQS: simpler ops, managed, but no replay; Redis Streams: in-memory, risk of data loss

Option: CDN caching redirects vs server-side

✓ Pro: Edge caching takes popular URLs off the origin entirely; the redirect is served from the nearest edge node

✗ Con: 301 cached by browsers kills analytics; need 302 + Cache-Control headers carefully tuned

Step 05

Data Model

📋 Schema

NoSQL: urls (DynamoDB / Cassandra)

Core mapping. Optimized for massive scale key-value lookups. Billions of rows.

json

{
  "PK": "abc123def",            // Partition Key (short_code)
  "long_url": "https://example.com/very/long/path",
  "user_id": "987654321",       // Used for Global Secondary Index (GSI)
  "created_at": 1705312800,
  "expires_at": 1736848800,
  "is_active": true
}

PostgreSQL: users & custom_domains

Relational metadata. Strict ACID compliance for billing and account states.

sql

CREATE TABLE users (
  id           BIGINT       PRIMARY KEY,
  email        VARCHAR(255) NOT NULL UNIQUE,
  plan_tier    VARCHAR(50)  DEFAULT 'free',
  created_at   TIMESTAMP    DEFAULT NOW()
);

CREATE TABLE custom_domains (
  id           BIGINT       PRIMARY KEY,
  user_id      BIGINT       NOT NULL REFERENCES users(id),
  domain       VARCHAR(255) NOT NULL UNIQUE
);

Columnar: clicks (ClickHouse)

Append-only event log. Optimized for blazing fast aggregations.

sql

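-- ClickHouse does not enforce unique keys, and a MergeTree primary key must be
-- a prefix of the sorting key, so the sorting key below serves as the primary key.
CREATE TABLE clicks (
  id           UInt64,                  -- event ID (not enforced unique)
  short_code   String,
  clicked_at   DateTime,
  country_code LowCardinality(String),  -- empty when unknown
  device_type  LowCardinality(String)   -- empty when unknown
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(clicked_at)
ORDER BY (short_code, clicked_at);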

🔗 ER Diagram

💡 Polyglot Persistence Note: While this ER Diagram shows the logical relationships between our data entities, the physical implementation is distributed. Foreign keys like user_id in the URLs table are enforced at the application layer, not the database layer, because they live in entirely different database systems (PostgreSQL vs DynamoDB).

users (SQL) → urls (NoSQL): 1:N via App Logic

urls (NoSQL) → clicks (Columnar): 1:N via App Logic

users (SQL) → custom_domains (SQL): 1:N strict DB Foreign Key

🗄️ Storage Choice

🐘PostgreSQL / MySQL

Used for: User data, billing, API keys

Why: ACID, strong consistency, easy relational queries for user management

Sharding: Standard replication (Primary + Replicas) scales well for metadata

🔴Redis

Used for: Redirect cache (short_code → long_url)

Why: Sub-millisecond lookups, TTL support, LRU eviction, atomic counters

Sharding: Redis Cluster with hash slots (16384 slots auto-distributed)

📊ClickHouse / BigQuery

Used for: Analytics clicks table

Why: Columnar storage, blazing fast aggregations on billions of rows

Sharding: Partition by month; distributed across shards by short_code

🔱DynamoDB / Cassandra

Used for: URL mappings (Core Redirect Path)

Why: Perfect fit for Key-Value access patterns, seamless horizontal scale, high availability

Sharding: Partition key = short_code. Handles billions of rows natively without manual sharding.

⚖️ Alternatives & Tradeoffs

Option: SQL vs NoSQL for URL mappings

✓ Pro: NoSQL completely eliminates the need for manual database sharding as the number of URLs grows to the billions.

✗ Con: Using NoSQL for URLs means you can't easily JOIN with the users table. You must rely on secondary indexes (e.g., GSI in DynamoDB) to query all URLs for a specific user dashboard.

Option: Separate analytics DB vs append to OLTP

✓ Pro: Separate ClickHouse: OLAP queries don't impact redirect latency

✗ Con: Extra infra; eventual consistency between OLTP and analytics store

Option: Denormalized click_count in urls vs COUNT(*) on clicks table

✓ Pro: Denormalized: O(1) read for total clicks display

✗ Con: Needs atomic increment (Redis INCR + periodic sync); slight inconsistency acceptable

Step 06

API Design

Endpoints

Auth Header

Authorization: Bearer <jwt>

or API key via X-API-Key header

POST

/api/v1/shorten

Create a shortened URL

REQUEST

json

{
  "long_url":     "https://www.example.com/very/long/path?q=1",
  "custom_alias": "my-promo",      // optional
  "expires_at":   "2025-12-31T23:59:59Z", // optional ISO-8601
  "title":        "My Campaign Link",     // optional
  "tags":         ["marketing", "q4"]     // optional
}

RESPONSE

json

{
  "data": {
    "id":           "abc123def",
    "short_url":    "https://bit.ly/my-promo",
    "short_code":   "my-promo",
    "long_url":     "https://www.example.com/very/long/path?q=1",
    "created_at":   "2024-01-15T10:30:00Z",
    "expires_at":   "2025-12-31T23:59:59Z",
    "qr_code_url":  "https://bit.ly/my-promo/qr"
  }
}

ERROR CODES

400: Invalid URL format or malicious URL detected

409: Custom alias already taken

429: Rate limit exceeded

GET

/{short_code}

Redirect to original URL (public endpoint)

REQUEST

No request body. Path parameter: short_code (e.g. my-promo)

RESPONSE

http

HTTP/1.1 302 Found
Location: https://www.example.com/very/long/path?q=1
X-RateLimit-Remaining: 999
Cache-Control: no-store

ERROR CODES

404: Short code not found

410: URL has expired (TTL passed)
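
The status-code rules for the redirect endpoint can be sketched as a small pure function. The names `UrlRecord`, `RedirectResponse` and `buildRedirect` are illustrative, and treating a deactivated link as 404 is an assumption; the endpoint description above only specifies 404 for unknown codes and 410 for expired ones.

```typescript
interface UrlRecord {
  longUrl: string;
  expiresAt?: number; // Unix seconds; absent means the link never expires
  isActive: boolean;
}

interface RedirectResponse {
  status: 302 | 404 | 410;
  headers: Record<string, string>;
}

function buildRedirect(record: UrlRecord | undefined, nowSeconds: number): RedirectResponse {
  // Unknown short code (or, by assumption, a deactivated one): 404.
  if (record === undefined || !record.isActive) {
    return { status: 404, headers: {} };
  }
  // Known but past its expiry: 410 Gone.
  if (record.expiresAt !== undefined && record.expiresAt <= nowSeconds) {
    return { status: 410, headers: {} };
  }
  // Temporary redirect; no-store keeps browsers from caching it,
  // so every click reaches the service and can be counted.
  return {
    status: 302,
    headers: { 'Location': record.longUrl, 'Cache-Control': 'no-store' },
  };
}
```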

GET

/api/v1/urls/{short_code}

Get metadata for a short URL (authenticated)

REQUEST

http

Authorization: Bearer <jwt_token>

RESPONSE

json

{
  "data": {
    "id":           "abc123def",
    "short_url":    "https://bit.ly/my-promo",
    "long_url":     "https://www.example.com/very/long/path?q=1",
    "click_count":  42891,
    "created_at":   "2024-01-15T10:30:00Z",
    "expires_at":   "2025-12-31T23:59:59Z",
    "is_active":    true
  }
}

ERROR CODES

401: Missing or invalid token

403: URL belongs to another user

404: Short code not found

DELETE

/api/v1/urls/{short_code}

Deactivate / delete a short URL

REQUEST

http

Authorization: Bearer <jwt_token>

RESPONSE

http

HTTP/1.1 204 No Content

ERROR CODES

401: Not authenticated

403: Not the owner

404: Short code not found

GET

/api/v1/urls/{short_code}/analytics

Get analytics for a URL

REQUEST

json

Query params:
  from=2024-01-01&to=2024-01-31
  granularity=day   // hour | day | week | month
  group_by=country  // country | device | referrer

RESPONSE

json

{
  "data": {
    "total_clicks": 42891,
    "unique_visitors": 18234,
    "time_series": [
      { "date": "2024-01-01", "clicks": 1423 },
      { "date": "2024-01-02", "clicks": 2105 }
    ],
    "breakdown": {
      "by_country":  [{ "country": "US", "clicks": 18000 }],
      "by_device":   [{ "device": "mobile", "clicks": 25000 }],
      "by_referrer": [{ "referrer": "twitter.com", "clicks": 9000 }]
    }
  }
}

ERROR CODES

401: Not authenticated

400: Invalid date range or granularity

PATCH

/api/v1/urls/{short_code}

Update URL metadata (title, expiry, destination)

REQUEST

json

{
  "long_url":   "https://new-destination.com",  // optional
  "title":      "Updated title",                 // optional
  "expires_at": "2026-06-30T00:00:00Z",          // optional
  "is_active":  true                             // optional
}

RESPONSE

json

{
  "data": { /* full url object */ }
}

ERROR CODES

401: Not authenticated

400: Invalid input

Rate Limiting Headers (all responses)

http

X-RateLimit-Limit:     1000
X-RateLimit-Remaining: 987
X-RateLimit-Reset:     1705312800  (Unix timestamp)
Retry-After:           60          (seconds, only on 429)

⚖️ Alternatives & Tradeoffs

Option: REST vs GraphQL

✓ Pro: REST: simpler caching (GET is naturally cacheable), CDN-friendly, universal tooling

✗ Con: GraphQL: flexible queries for analytics dashboard, but harder to cache, over-engineering for this use case

Option: 302 vs 307 for redirect

✓ Pro: 302 is universally supported for simple GET redirects and ensures analytics tracking.

✗ Con: 307 strictly preserves POST payloads, but since redirects are always GETs in this scenario, this strictness adds no value here.

Option: JWT vs API Key auth

✓ Pro: JWT: stateless, contains claims, self-expiring; API Key: simple, good for server-to-server

✗ Con: JWT: harder to revoke without blocklist; API Key: must be stored securely, simpler to revoke

Step 07

Deep Dive

🔑 ID Generation

Snowflake ID Generation

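A 64-bit Snowflake ID is composed of an unused sign bit, a 41-bit timestamp (milliseconds since a custom epoch), a 10-bit machine ID (datacenter + worker), and a 12-bit sequence number (4,096 IDs per millisecond per node). This gives us uniqueness without coordination between nodes. The numeric ID is then Base62-encoded to produce the short code. Note that a full 63-bit value encodes to as many as 11 Base62 characters (62¹⁰ < 2⁶³ < 62¹¹), so hitting the 7-character target requires a narrower ID layout, for example fewer timestamp or sequence bits.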
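
The bit layout above can be sketched as a small generator. This is a minimal illustration under stated assumptions: `EPOCH_MS` is a hypothetical custom epoch, `SnowflakeGenerator` is an illustrative name, the clock is injectable so the example can be tested, and a backward clock jump simply throws.

```typescript
const EPOCH_MS = 1704067200000n;     // hypothetical custom epoch: 2024-01-01T00:00:00Z
const MACHINE_BITS = 10n;
const SEQUENCE_BITS = 12n;
const MAX_MACHINE_ID = (1n << MACHINE_BITS) - 1n;  // 1023
const MAX_SEQUENCE = (1n << SEQUENCE_BITS) - 1n;   // 4095

class SnowflakeGenerator {
  private readonly machineId: bigint;
  private readonly nowMs: () => bigint;
  private lastMs = -1n;
  private sequence = 0n;

  constructor(machineId: bigint, nowMs: () => bigint = () => BigInt(Date.now())) {
    if (machineId < 0n || machineId > MAX_MACHINE_ID) {
      throw new RangeError('machineId must fit in 10 bits');
    }
    this.machineId = machineId;
    this.nowMs = nowMs;
  }

  next(): bigint {
    let ts = this.nowMs();
    if (ts < this.lastMs) {
      throw new Error('clock moved backwards; refusing to issue IDs');
    }
    if (ts === this.lastMs) {
      // Same millisecond: bump the sequence.
      this.sequence = (this.sequence + 1n) & MAX_SEQUENCE;
      if (this.sequence === 0n) {
        // 4,096 IDs already issued this millisecond: spin until the next one.
        while (ts <= this.lastMs) ts = this.nowMs();
      }
    } else {
      this.sequence = 0n;
    }
    this.lastMs = ts;
    // Layout: [timestamp:41][machine:10][sequence:12]
    return ((ts - EPOCH_MS) << (MACHINE_BITS + SEQUENCE_BITS))
      | (this.machineId << SEQUENCE_BITS)
      | this.sequence;
  }
}
```

Because the timestamp occupies the high bits, IDs from one node sort by creation time, and IDs from different nodes never collide as long as machine IDs are unique.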

Base62 Encoding

typescript

// Base62 charset: 0-9a-zA-Z (62 characters)
const CHARSET = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ';

function toBase62(num: bigint): string {
  if (num === 0n) return CHARSET[0];
  let result = '';
  while (num > 0n) {
    result = CHARSET[Number(num % 62n)] + result;
    num = num / 62n;
  }
  return result;
}

// ID 12345678901 → "dtvd3f"
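
For completeness, the inverse mapping can be sketched as below. `fromBase62` is an illustrative name that does not appear in the original; the encoder is repeated (written with `charAt`) so the example is self-contained and the round trip can be checked.

```typescript
const CHARSET = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ';

function toBase62(num: bigint): string {
  if (num === 0n) return CHARSET.charAt(0);
  let remaining = num;
  let result = '';
  while (remaining > 0n) {
    result = CHARSET.charAt(Number(remaining % 62n)) + result;
    remaining = remaining / 62n; // bigint division truncates
  }
  return result;
}

// Decode a short code back to its numeric ID.
function fromBase62(code: string): bigint {
  if (code.length === 0) throw new Error('empty code');
  let value = 0n;
  for (let i = 0; i < code.length; i++) {
    const digit = CHARSET.indexOf(code.charAt(i));
    if (digit < 0) throw new Error('invalid Base62 character: ' + code.charAt(i));
    value = value * 62n + BigInt(digit);
  }
  return value;
}
```

Decoding is mainly useful when the numeric ID is the storage key, so a redirect can look up the row without a separate index on the code.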

Clock Skew Problem

If the system clock moves backward (for example after an NTP correction), Snowflake may generate duplicate IDs.

Solutions:

Wait until clock catches up (if skew < 10ms)
Use logical clock + atomic monotonic sequence
Use Boundary for strict ordering
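
The first option (wait out small skews) can be sketched as a pure decision function. `checkClock`, `ClockDecision` and the reject branch for larger skews are illustrative assumptions; only the 10 ms threshold comes from the list above.

```typescript
type ClockDecision =
  | { action: 'use'; timestampMs: number }   // clock is fine, issue the ID
  | { action: 'wait'; waitMs: number }       // small skew: sleep, then retry
  | { action: 'reject'; skewMs: number };    // large skew: fail and alert

const MAX_TOLERATED_SKEW_MS = 10;

function checkClock(nowMs: number, lastIssuedMs: number): ClockDecision {
  if (nowMs >= lastIssuedMs) {
    return { action: 'use', timestampMs: nowMs };
  }
  const skewMs = lastIssuedMs - nowMs;
  if (skewMs < MAX_TOLERATED_SKEW_MS) {
    // Waiting skewMs brings the clock back to the last issued timestamp.
    return { action: 'wait', waitMs: skewMs };
  }
  return { action: 'reject', skewMs };
}
```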

Alternatives

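UUID v4: 16 bytes, random; no ordering, and about 22 Base62 characters, far too long for a short code

MD5 of URL: deterministic, but truncating the hash creates collision risk

Counter + ZK: simple, but the ZooKeeper ensemble becomes a critical dependency for every write

Snowflake ✓: best fit here; ordered, distributed, compact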

⏰ URL Expiry

Strategy Comparison

Option 1: Lazy Expiry

User requests short URL
Read service fetches URL
Check expires_at field
If expired → return 410 Gone
Background job deletes old records

⚠️ Expired URLs linger in DB until cleanup job runs. Simple but wastes storage.

Option 2: Redis TTL + DB Job (Chosen)

Set Redis key with TTL = expires_at - now()
Cache miss on TTL expiry → DB lookup
Read service returns 410 if expires_at has passed
Daily cron: DELETE FROM urls WHERE expires_at < NOW()
Or: Mark is_active = FALSE for soft delete

✅ Cache evicts automatically; DB cleaned up async. Best of both worlds.
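
The TTL computation in the chosen option can be sketched as below. Capping the cache TTL at the 24h default is an assumption that combines this option with the default TTL from the cache strategy; `cacheTtlSeconds` is an illustrative name.

```typescript
const DEFAULT_CACHE_TTL_SECONDS = 24 * 60 * 60; // 24h default cache TTL

// Returns how long a mapping may stay in the cache, in seconds.
// 0 means "already expired, do not cache".
function cacheTtlSeconds(nowSeconds: number, expiresAt?: number): number {
  if (expiresAt === undefined) return DEFAULT_CACHE_TTL_SECONDS; // link never expires
  const remaining = expiresAt - nowSeconds;
  if (remaining <= 0) return 0;
  // Never cache past the link's own expiry.
  return Math.min(DEFAULT_CACHE_TTL_SECONDS, remaining);
}
```

With this rule a cached entry can never outlive its link, so the cache-miss path is the only place that needs the expires_at check.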

🚦 Rate Limiting

Tier Architecture

Anonymous

Rate: 100 URLs/hr

Burst: 10 req/min

Free User

Rate: 1,000 URLs/hr

Burst: 100 req/min

Pro / Enterprise

Rate: 100K URLs/hr

Burst: 10K req/min

Token Bucket Algorithm (Redis)

lua

-- Lua script (runs atomically in Redis)
-- KEYS[1] = bucket key, e.g. "rate:<user_id>"
-- ARGV[1] = current time in seconds
-- ARGV[2] = refill rate (tokens/sec), ARGV[3] = max bucket size
local key = KEYS[1]
local now = tonumber(ARGV[1])
local rate = tonumber(ARGV[2])
local burst = tonumber(ARGV[3])

-- Missing fields mean a new bucket: start full
local bucket = redis.call("HMGET", key, "tokens", "last")
local tokens = tonumber(bucket[1]) or burst
local last = tonumber(bucket[2]) or now

-- Refill tokens for the elapsed time, capped at the bucket size
local delta = math.max(0, now - last)
tokens = math.min(burst, tokens + delta * rate)

if tokens >= 1 then
  tokens = tokens - 1
  redis.call("HSET", key, "tokens", tokens, "last", now)
  redis.call("EXPIRE", key, 3600)
  return 1  -- allowed
end
return 0  -- denied
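
To connect the tier table to the script's parameters, the sketch below maps each tier to a refill rate and bucket size and mirrors the refill logic in process, which can be handy for unit tests. Reading "Burst" as the bucket size is an assumption, and `TIER_LIMITS`, `bucketParams` and `TokenBucket` are illustrative names.

```typescript
type Tier = 'anonymous' | 'free' | 'pro';

interface TierLimit {
  urlsPerHour: number;
  burstPerMinute: number;
}

const TIER_LIMITS: Record<Tier, TierLimit> = {
  anonymous: { urlsPerHour: 100, burstPerMinute: 10 },
  free: { urlsPerHour: 1000, burstPerMinute: 100 },
  pro: { urlsPerHour: 100000, burstPerMinute: 10000 },
};

// Token-bucket parameters: refill rate in tokens/sec and maximum bucket size.
function bucketParams(tier: Tier): { ratePerSecond: number; burst: number } {
  const limit = TIER_LIMITS[tier];
  return { ratePerSecond: limit.urlsPerHour / 3600, burst: limit.burstPerMinute };
}

// In-process mirror of the refill-then-consume logic.
class TokenBucket {
  private tokens: number;
  private last: number;
  private readonly rate: number;
  private readonly burst: number;

  constructor(ratePerSecond: number, burst: number, nowSeconds: number) {
    this.rate = ratePerSecond;
    this.burst = burst;
    this.tokens = burst; // a new bucket starts full
    this.last = nowSeconds;
  }

  tryConsume(nowSeconds: number): boolean {
    const elapsed = Math.max(0, nowSeconds - this.last);
    this.tokens = Math.min(this.burst, this.tokens + elapsed * this.rate);
    this.last = nowSeconds;
    if (this.tokens < 1) return false; // denied
    this.tokens -= 1;
    return true;                       // allowed
  }
}
```

For the anonymous tier this gives a bucket of 10 tokens that refills one token every 36 seconds, so short bursts are allowed while the hourly average stays at 100.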

Sliding Window Alternative

lua

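-- Sliding window log in a Redis sorted set
-- KEYS[1] = per-user key, e.g. "swl:<user_id>" (one key per user; putting the
--           minute in the key would turn this into a fixed window)
-- ARGV[1] = current time in seconds
-- ARGV[2] = unique request ID supplied by the caller (an assumption here;
--           math.random() is not a reliable source of uniqueness inside Redis scripts)
local key = KEYS[1]
local now = tonumber(ARGV[1])
local window = 60  -- 60 second window
local limit = 100  -- max requests

-- Remove old entries outside window
redis.call("ZREMRANGEBYSCORE", key, 0, now - window)

-- Count current window
local count = redis.call("ZCARD", key)

if count < limit then
  redis.call("ZADD", key, now, ARGV[2])
  redis.call("EXPIRE", key, window)
  return 1  -- allowed
end
return 0  -- denied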

Sliding window is accurate but uses O(requests) memory. Token bucket is O(1) per user.

🌐 Custom Domains

Setup Flow

1. User adds domain in dashboard (e.g. go.acme.com)
2. System provides CNAME record: go.acme.com → us.bitly.com
3. User adds CNAME at their DNS provider
4. System polls DNS; verifies CNAME resolution
5. Auto-provisions TLS certificate (Let's Encrypt / ACM)
6. Domain is activated; URLs with prefix work

Routing Architecture

Nginx / Load Balancer Config

nginx

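# Wildcard cert + domain routing
# Note: a *.bitly.com wildcard certificate does not cover go.acme.com; each
# custom domain needs the certificate provisioned in step 5 of the setup flow.
server {
  listen 443 ssl;
  server_name *.bitly.com go.acme.com;

  ssl_certificate     /etc/certs/wildcard.crt;
  ssl_certificate_key /etc/certs/wildcard.key;  # key path assumed for illustration

  location / {
    # Pass Host header; app resolves domain → user_id
    proxy_pass http://read-service;
    proxy_set_header Host $host;
  }
}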

On redirect request: extract Host header → lookup custom_domains table → get user_id → namespace short code to that user.
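
The Host-based lookup can be sketched as follows. The names `lookupKey`, `CustomDomain` and `DEFAULT_HOSTS` are illustrative, an in-memory map stands in for the custom_domains table, and the `userId:shortCode` key format is an assumption about how the namespace might be encoded.

```typescript
interface CustomDomain {
  domain: string;
  userId: string;
}

// Hosts that serve the shared, global short-code namespace (assumed).
const DEFAULT_HOSTS = new Set<string>(['bit.ly']);

// Returns the key to look up in the URL store, or undefined for an unknown host.
function lookupKey(
  host: string,
  shortCode: string,
  customDomains: Map<string, CustomDomain>,
): string | undefined {
  // Host headers are case-insensitive and may carry a port.
  const normalizedHost = host.toLowerCase().replace(/:\d+$/, '');
  if (DEFAULT_HOSTS.has(normalizedHost)) return shortCode;

  const domain = customDomains.get(normalizedHost);
  if (domain === undefined) return undefined; // unverified or unknown domain

  // Namespace the code to the domain's owner so aliases cannot clash across users.
  return domain.userId + ':' + shortCode;
}
```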

Step 08

Bottlenecks & Mitigation

↗️Read Path Scaling

⚡ Bottleneck: 100K+ redirects/sec — single service can't handle it

🛠️ Mitigations:

→ Horizontal scaling: stateless Read Service pods behind LB

→ Redis Cluster: shard cache across nodes; 90%+ hit rate

→ CDN edge caching: cache popular redirects at edge

→ Read replicas: the URL store's own replicas absorb cache misses; 3× PostgreSQL read replicas serve metadata queries

✍️Write Path Scaling

⚡ Bottleneck: 1K writes/sec peak; DB can't absorb synchronous writes

🛠️ Mitigations:

→ Message Queue (Kafka): buffer writes; Service returns early

→ Async DB workers: consume from Kafka, batch-insert

→ Connection pooling: PgBouncer to avoid exhaustion

→ Horizontal Write Service: stateless pods; ID generator is external

🛢️Database Sharding

⚡ Bottleneck: 100M+ URLs; single DB hits storage/throughput limits

🛠️ Mitigations:

→ Hash-based sharding: shard_id = hash(short_code) % N_shards

→ Consistent hashing ring: minimal rehashing on scale

→ Virtual nodes (vnodes): distribute data evenly

→ Avoid user-based sharding: prevents popular user hotspots
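
A consistent hashing ring with virtual nodes can be sketched as below. This is an illustration only: `HashRing` and `hash32` are hypothetical names, and the hash is FNV-1a with a murmur3-style finalizer chosen for brevity, where a production system would more likely use a well-tested library hash.

```typescript
// 32-bit string hash: FNV-1a followed by a murmur3-style avalanche step.
function hash32(input: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    h ^= input.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  h ^= h >>> 16;
  h = Math.imul(h, 0x85ebca6b);
  h ^= h >>> 13;
  h = Math.imul(h, 0xc2b2ae35);
  h ^= h >>> 16;
  return h >>> 0; // unsigned 32-bit
}

class HashRing {
  private ring: { point: number; shard: string }[] = [];
  private readonly vnodes: number;

  constructor(vnodes: number = 100) {
    this.vnodes = vnodes;
  }

  // Each shard is placed on the ring at `vnodes` pseudo-random points.
  addShard(shard: string): void {
    for (let i = 0; i < this.vnodes; i++) {
      this.ring.push({ point: hash32(shard + '#' + i), shard });
    }
    this.ring.sort((a, b) => a.point - b.point);
  }

  // A key belongs to the first ring point at or after its hash, wrapping around.
  shardFor(shortCode: string): string {
    if (this.ring.length === 0) throw new Error('ring is empty');
    const target = hash32(shortCode);
    let lo = 0;
    let hi = this.ring.length;
    while (lo < hi) {
      const mid = (lo + hi) >>> 1;
      if (this.ring[mid].point < target) lo = mid + 1;
      else hi = mid;
    }
    return this.ring[lo % this.ring.length].shard;
  }
}
```

The property that matters for resharding is that when a shard is added, a key either stays where it was or moves to the new shard; it never moves between two existing shards.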

🔥Hot URL / Celebrity Problem

⚡ Bottleneck: Single URL getting millions of redirects/min

🛠️ Mitigations:

→ Cache hot URL at all layers: CDN + Redis + local in-memory LRU

→ Sticky routing: route hot short_code to dedicated node

→ Rate limit incoming clicks per short_code to prevent abuse

→ Auto-detect: if click_rate > threshold, pre-load to CDN edge
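
The local in-memory LRU layer mentioned above can be sketched with a plain `Map`, which iterates in insertion order. `LruCache` is an illustrative name; a real deployment would also want a short TTL on entries so deleted or expired links do not linger in process memory.

```typescript
class LruCache<V> {
  private readonly entries = new Map<string, V>();
  private readonly capacity: number;

  constructor(capacity: number) {
    if (capacity < 1) throw new RangeError('capacity must be at least 1');
    this.capacity = capacity;
  }

  get(key: string): V | undefined {
    const value = this.entries.get(key);
    if (value === undefined) return undefined;
    // Re-insert so the key becomes the most recently used.
    this.entries.delete(key);
    this.entries.set(key, value);
    return value;
  }

  set(key: string, value: V): void {
    this.entries.delete(key);
    if (this.entries.size >= this.capacity) {
      // The first key in iteration order is the least recently used.
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
    this.entries.set(key, value);
  }

  get size(): number {
    return this.entries.size;
  }
}
```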

⚖️ Alternatives & Tradeoffs

Option: Consistent hashing vs modulo sharding

✓ Pro: Consistent hashing: adding a shard only rebalances 1/N of keys, not all

✗ Con: Modulo sharding: simpler, no extra infrastructure, but resharding requires full migration

Option: Read replicas vs CQRS pattern

✓ Pro: CQRS: separate read model (denormalized) can be optimized differently from write model

✗ Con: CQRS: significant complexity; read replicas are sufficient for 99% of use cases at this scale


URL Shortener System Design · 8-Step Framework · Built for Senior/Staff Engineering Interviews
