Sharding
Learn what sharding is, how MongoDB distributes data across multiple servers using shard keys, and when your application actually needs it.
Sharding
Everything in this guide so far has assumed your data fits on one server — even with replica sets (for transactions and change streams), every server in the replica set holds a full copy of the data. That works for the vast majority of applications, including a school management system, even with hundreds of thousands of students.
Sharding is for a different problem — when your data becomes too large, or your traffic too high, for any single server to handle. Sharding splits your data horizontally across multiple servers, where each server holds only a portion of the data.
Vertical vs Horizontal Scaling
Before sharding, there is a simpler option — make your server bigger. More RAM, faster disks, more CPU cores. This is vertical scaling, and it is always the first thing to try.
Vertical scaling has a ceiling — eventually you cannot buy a bigger server, or it becomes prohibitively expensive. Horizontal scaling — adding more servers, each holding part of the data — has no such ceiling. Sharding is MongoDB's horizontal scaling solution.
| Vertical Scaling | Horizontal Scaling (Sharding) | |
|---|---|---|
| How | Bigger server | More servers |
| Complexity | Simple — no app changes | Complex — requires shard key design |
| Ceiling | Limited by hardware | Effectively unlimited |
| When | First option, always try this | When vertical scaling hits its limit |
How Sharding Works
A sharded cluster splits one logical database across multiple shards — each shard is itself a replica set. Data is distributed based on a shard key you choose.
Shards — each shard stores a portion of the total data. Each shard is itself a replica set for redundancy.
Query Router (mongos) — your application connects to mongos, not directly to shards. It routes each query to the correct shard(s) based on the shard key.
Config Servers — store metadata about which data lives on which shard.
From your application's perspective, a sharded cluster looks like a single MongoDB server — you connect, run queries, get results. The complexity of routing and distribution is handled internally.
Choosing a Shard Key
The shard key is the field (or fields) MongoDB uses to decide which shard a document belongs to. This is the single most important decision in sharding — choosing badly leads to uneven data distribution and poor performance, and it is very difficult to change later.
Example — Sharding Students by _id
sh.shardCollection("school.students", { _id: "hashed" }){ _id: "hashed" } means MongoDB hashes each document's _id and uses the hash to determine which shard it goes to. This spreads documents evenly across shards — good for write distribution.
A Good Shard Key Has:
High cardinality — many distinct values. _id is unique per document — excellent cardinality. A field like grade (only 4 possible values) is terrible — all 9th graders would pile onto whichever shard handles "9th".
Even distribution — writes spread evenly across shards, not concentrated on one.
Matches your query patterns — queries that include the shard key can be routed to specific shards. Queries without it must check every shard (called a "scatter-gather" query) — slow.
Example — Bad Shard Key
// BAD — only 4 possible values
sh.shardCollection("school.students", { grade: 1 })All "10th" grade students end up on the same shard. If 10th grade has the most students, that shard becomes overloaded while others sit idle — called a hot shard.
Example — Better Shard Key for Query Patterns
// Better — combines a high-cardinality field with a query-relevant field
sh.shardCollection("school.students", { grade: 1, _id: "hashed" })This is a compound shard key — grade groups related documents for query efficiency, while the hashed _id ensures even distribution within each grade.
Changing a shard key after data is already sharded is a major operation — historically it required completely re-sharding the collection. Choose carefully, based on your actual query patterns, before sharding a collection with significant data.
When You Need Sharding
Sharding solves real problems — but only specific ones:
Data exceeds what one server can store — your dataset is hundreds of gigabytes or terabytes, larger than what fits comfortably on a single server's disk.
Working set exceeds available RAM — MongoDB performs best when frequently-accessed data fits in RAM. If your "hot" data is larger than any single server's RAM, performance degrades badly.
Write throughput exceeds a single server's capacity — even with good indexes, a single primary node can only handle so many writes per second.
Geographic distribution — you need data physically located close to users in different regions, with shards in different data centers.
When You Do NOT Need Sharding
This is the more important section for most developers — the vast majority of applications never need sharding.
Our school system — even a large school district with 500,000 students, each with a few KB of data, totals a few GB. A single modern server handles this with ease — sharding would add massive operational complexity for zero benefit.
500,000 students × ~5KB per document ≈ 2.5GB
This fits comfortably in RAM on almost any modern server.
A replica set (for redundancy) is all you need — no sharding.Most startups and small-to-medium applications — unless you are operating at the scale of a major social network, a large e-commerce platform, or an IoT system with millions of sensors sending data every second, a properly indexed, well-designed replica set handles your load.
"We might need it later" — sharding adds operational complexity, requires careful shard key design upfront, and is hard to undo. Premature sharding is a common and costly mistake. Build with a replica set, monitor your actual growth, and shard only when you have real evidence you need to.
What Would Push Our School System Toward Sharding
For our single-school system, sharding will never be necessary. But imagine evolving this into a multi-tenant SaaS platform — one system serving thousands of schools, each with their own students, teachers, and courses.
1,000 schools × 1,000 students each = 1,000,000 students
Plus attendance records — 1,000,000 students × 200 school days/year
= 200,000,000 attendance documents per yearAt this scale:
- Total data size grows into hundreds of GB or more, especially with several years of attendance history
- Write load from 1,000 schools simultaneously recording attendance, grades, and enrollments could exceed a single primary's capacity
- Different schools might need data residency in different regions (data privacy regulations)
A Reasonable Shard Key for Multi-Tenant SaaS
// Shard by schoolId — keeps each school's data together,
// distributes different schools across shards
sh.shardCollection("platform.students", { schoolId: 1, _id: "hashed" })schoolId as the primary shard key field means:
- All data for one school tends to live on the same shard — queries scoped to a school (the vast majority of queries) hit only one shard
- Different schools are distributed across shards — no single shard handles all the load
- The hashed
_idwithin eachschoolIdprevents any single very-large school from creating a hot spot
This is the kind of architecture decision that matters only at this scale — and is exactly the kind of decision that requires careful upfront shard key design, because changing it later is costly.
The Decision Framework
Quick Reference
| Concept | What it means |
|---|---|
| Shard | A replica set holding a portion of the data |
| mongos | Query router — your app connects here in a sharded cluster |
| Config servers | Store metadata about data distribution |
| Shard key | Field(s) determining which shard a document goes to |
| Hashed shard key | Even distribution, but no range-based co-location |
| Compound shard key | Combines a query-relevant field with a high-cardinality field |
| Hot shard | A shard receiving disproportionate load due to a poor shard key |
If you take one thing from this file — sharding is a scaling tool for a specific, measurable problem (data size, throughput, or geography exceeding a single server's limits), not a default architecture choice. For our school management system, and for the overwhelming majority of real applications, a well-indexed replica set — everything we covered from Basics through Performance — is the right architecture, full stop.