Scalability Patterns in Distributed Systems
A practical, implementation-first tour of the patterns I reach for when a backend has to survive real load — sharding, load balancing, caching and back-pressure — with concrete notes from production Python and Go services.
1. What is a distributed system?
A distributed system is a set of independent processes — usually running on different machines — that cooperate over a network to look like a single service. The defining property is not "many servers"; it's that partial failure is normal. A node can be slow, a link can drop packets, a disk can fill up, and the system as a whole still has to make forward progress.
Every scalability pattern below is really a way to manage one of three finite resources: state (data that has to live somewhere), compute (CPU cycles to act on it) and bandwidth (the bytes moving between them).
2. Sharding — splitting state by key
Sharding (a.k.a. horizontal partitioning) splits one logical dataset across N physical stores using a routing key. Each shard owns a disjoint slice of the keyspace, so reads and writes for a given key always land on the same node. This is how Postgres-per-tenant, Kafka topic partitions, Redis Cluster and DynamoDB all scale writes past a single box.
Pick a shard key first, everything else second
The shard key decides your ceiling. A good key is high-cardinality, evenly distributed, and matches your most common access pattern — usually tenant_id, user_id or a hash of one. A bad key (a date, a status enum, a country) concentrates traffic on a few shards and you'll be back where you started.
Consistent hashing in Python
Naive modulo sharding (shard = hash(key) % N) breaks the moment you change N — almost every key remaps. Consistent hashing keeps most keys in place when nodes are added or removed:
pythonimport bisect import hashlib class HashRing: def __init__(self, nodes: list[str], vnodes: int = 128): self.ring: list[tuple[int, str]] = [] for node in nodes: for i in range(vnodes): h = self._hash(f"{node}#{i}") self.ring.append((h, node)) self.ring.sort() @staticmethod def _hash(key: str) -> int: return int(hashlib.blake2b(key.encode(), digest_size=8).hexdigest(), 16) def route(self, key: str) -> str: h = self._hash(key) i = bisect.bisect(self.ring, (h, "")) return self.ring[i % len(self.ring)][1] ring = HashRing(["db-a", "db-b", "db-c"]) ring.route("tenant:42") # -> "db-b"
Virtual nodes (the vnodes loop) flatten the load distribution — without them small rings get very lumpy.
Range vs hash sharding in Go
Hash sharding spreads writes evenly but kills range scans. Range sharding keeps neighbors together — great for time-series and "last 100 events for user X", terrible if the newest range is also the hottest. In Go services I usually wrap the choice behind a small interface so the rest of the code never sees it:
gotype Router interface { Shard(key string) string } type HashRouter struct{ ring *HashRing } func (r *HashRouter) Shard(key string) string { return r.ring.Route(key) } type RangeRouter struct{ ranges []Range } // sorted by Start func (r *RangeRouter) Shard(key string) string { i := sort.Search(len(r.ranges), func(i int) bool { return r.ranges[i].End > key }) return r.ranges[i].Node }
3. Load balancing — splitting work across replicas
Where sharding splits state, load balancing splits requests across interchangeable replicas. The algorithm you pick matters more than people expect:
- Round-robin — fine when requests are uniform and replicas are identical. Falls over the moment one request is 100× heavier than the next.
- Least-connections — good default for variable-latency work (LLM calls, scraping, image processing). Naturally drains slow replicas.
- Power of two choices (P2C) — pick two replicas at random, send the request to the less-loaded one. Almost as good as least-connections, far cheaper to coordinate.
- Consistent-hash routing — sticky by key, so caches on each replica stay warm.
P2C in Go
gotype Backend struct { URL string InFlight atomic.Int64 } func (p *Pool) Pick() *Backend { a := p.backends[rand.IntN(len(p.backends))] b := p.backends[rand.IntN(len(p.backends))] if a.InFlight.Load() <= b.InFlight.Load() { return a } return b }
At Vivox AI this exact pattern, plus per-replica circuit breakers, let a single Go Scraping Gateway hold 1–2k+ concurrent outbound requests without any one upstream tipping over.
Health checks beat clever algorithms
A balancer that routes to a dead node is worse than a dumb one. Pair any strategy with active health checks (periodic probe) and passive ones (mark a replica unhealthy after N consecutive errors, retry after a backoff). Without that, a single bad pod takes down your success rate.
4. Caching — buying time and money back
Caching is the cheapest scalability lever you have, and the easiest to get subtly wrong. Three strategies cover ~90% of real systems:
Cache-aside (lazy)
App reads from cache, falls back to DB on miss, writes back. Simple, tolerant of cache failures, and what you want by default.
pythonasync def get_user(user_id: str) -> User: key = f"user:{user_id}" if cached := await redis.get(key): return User.model_validate_json(cached) user = await db.fetch_user(user_id) await redis.set(key, user.model_dump_json(), ex=300) return user
Write-through and write-behind
Write-through updates cache and DB synchronously — stronger consistency, slower writes. Write-behind buffers writes in the cache and flushes asynchronously — fast, but you can lose data on crash and ordering becomes your problem.
The three things that actually bite
- Thundering herd — a hot key expires and 10k requests hit the DB at once. Fix with a per-key lock (
SET NXin Redis) or request coalescing (Go'ssingleflight, Python'sasyncio.Lockkeyed by ID). - Stale-while-revalidate — serve the stale value, refresh in the background. Latency stays flat, the DB sees a trickle instead of a spike.
- Invalidation — pick one owner per key. Writers invalidate; readers never. TTL is your safety net, not your strategy.
singleflight in Go
govar group singleflight.Group func GetUser(ctx context.Context, id string) (*User, error) { v, err, _ := group.Do("user:"+id, func() (any, error) { if u, ok := cache.Get(id); ok { return u, nil } u, err := db.FetchUser(ctx, id) if err == nil { cache.Set(id, u, 5*time.Minute) } return u, err }) if err != nil { return nil, err } return v.(*User), nil }
5. Back-pressure, retries & timeouts
Sharding and load balancing buy capacity. Back-pressure decides what happens when you still run out. The default for most services should be: fail fast, fail loud, shed load. A request stuck waiting 60 seconds is worse than a 503 in 50ms — the slow one is holding a goroutine, a DB connection and a queue slot the next request needs.
- Every external call has a timeout. No exceptions. Default 1–3s for sync paths, longer only for explicit background jobs.
- Retry only idempotent calls and only with exponential backoff + jitter. Blind retries turn a 5-second blip into a 5-minute outage.
- Bounded queues, bounded pools. Unbounded channels and connection pools are how memory graphs go vertical.
- Circuit breakers stop hammering a downstream that's already on fire — close the breaker after a probe succeeds, not on a timer.
6. A practical checklist
Before calling a backend "scalable", I run through this list. If any answer is "we'll see", that's the next thing to fix:
- □ What is the shard key, and what happens when it skews?
- □ Which load-balancing strategy, and how are unhealthy replicas removed?
- □ For each hot read: which cache strategy, what TTL, who invalidates?
- □ Does every external call have a timeout, a retry policy, and a circuit breaker?
- □ Are queues and pools bounded, with a defined behavior when full?
- □ What metric tells me, in under a minute, that one of the above is failing?
None of these patterns are exotic. The discipline is applying them before you need them, and being honest about which finite resource you're actually defending.