Databases Deep Dive — Types, Storage Formats, and MySQL vs NoSQL
A practical, in‑depth explainer on how different databases store data, with a special focus on MySQL (Relational) and NoSQL families. Includes examples, trade‑offs, and when to choose what.
Quick Glossary
- ACID — Atomicity, Consistency, Isolation, Durability
- CAP — Consistency, Availability, Partition tolerance
- MVCC — Multi‑Version Concurrency Control
- B+Tree — balanced tree index with values in linked leaf pages
- LSM Tree — Log‑Structured Merge tree
- OLTP — Online Transaction Processing
- OLAP — Online Analytical Processing
- DDL — Data Definition Language
- DML — Data Manipulation Language
Major Types of Databases & Their Data Models
1) Relational (RDBMS)
Data is organized into tables (relations) of rows and columns. Schema is defined upfront (schema‑on‑write). Joins are first‑class. Examples: MySQL, PostgreSQL, SQL Server, Oracle.
Storage style: generally row‑oriented; indexes via B+Trees. Strong ACID with MVCC.
2) Key‑Value Stores
Everything is a large dictionary: key → opaque value. Blazing fast for simple get/put. Examples: Redis (in‑memory), Riak, Amazon Dynamo‑style systems.
Storage style: hash tables, skip lists, or LSM. Values can be JSON, blobs, or binary.
3) Document Stores
Data stored as self‑describing documents (JSON/BSON). Flexible schema (schema‑on‑read). Examples: MongoDB, Couchbase, Firebase Firestore.
Storage style: B‑Trees or LSM + per‑document indexes; nested fields can be indexed.
4) Wide‑Column (Column‑Family)
Rows with sparse, dynamic columns grouped into families; great for time‑series/high write. Examples: Apache Cassandra, HBase.
Storage style: LSM‑based SSTables; sequential appends + compactions.
5) Graph Databases
Data as nodes and edges with properties; optimal for traversals (e.g., shortest path). Examples: Neo4j, JanusGraph.
Storage style: adjacency lists, index‑free adjacency, B‑Trees/LSM for properties.
6) Columnar Analytics (OLAP)
Stores by column for compression and vectorized scans; ideal for analytics dashboards. Examples: ClickHouse, BigQuery, Redshift.
Storage style: column files, dictionary encoding, run‑length encoding, Parquet/ORC.
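Run‑length encoding, one of the codecs mentioned above, is easy to see in a few lines of Python — a toy sketch, not any warehouse's actual implementation. A column of repeated values collapses into (value, count) pairs, which is why low‑cardinality columns (country codes, statuses) compress so well in column stores:

```python
# Run-length encoding of a column: repeated adjacent values compress to
# (value, count) pairs. Toy sketch of one columnar compression codec.
def rle_encode(column):
    runs = []
    for value in column:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([value, 1])   # start a new run
    return runs

def rle_decode(runs):
    # Expand (value, count) pairs back into the original column.
    return [value for value, count in runs for _ in range(count)]
```

Real engines layer this under dictionary encoding and store the runs in compressed column files (e.g., Parquet/ORC), but the space saving comes from the same idea.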
7) Time‑Series
Optimized for timestamped metrics/events. Examples: InfluxDB, TimescaleDB (Postgres ext.).
Storage style: append‑only + downsampling/retention policies.
8) Others (Legacy / Niche)
Hierarchical (IMS), Network (CODASYL), Object‑Oriented (db4o), In‑Memory OLTP (Hekaton), NewSQL (CockroachDB, TiDB).
Physical Storage & Indexing (What lives on disk?)
Row‑Store vs Column‑Store
- Row‑store: entire row laid out together → great for OLTP (many small inserts/updates). Typical for MySQL/InnoDB.
- Column‑store: each column stored separately → superb for aggregation/analytics (OLAP). Typical for ClickHouse/Redshift.
B+Tree vs LSM Tree
- B+Tree: balanced tree; great for random reads, range scans; steady writes.
- LSM Tree: buffer writes in memory, flush sorted runs to disk, then compact. High write throughput; reads may touch multiple runs until compacted.
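The LSM write and read paths above can be sketched as a toy Python model (illustrative only — not any real engine's code): writes land in an in‑memory memtable; when it fills, it is flushed as an immutable sorted run; reads check the memtable first, then runs from newest to oldest; compaction merges runs so reads touch fewer files.

```python
# Toy LSM tree: memtable + sorted runs + compaction. Illustrative only.
class ToyLSM:
    def __init__(self, memtable_limit=4):
        self.memtable = {}              # in-memory buffer for recent writes
        self.runs = []                  # flushed sorted runs, newest last
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Persist the memtable as an immutable sorted run (an "SSTable").
        self.runs.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:        # newest data wins
            return self.memtable[key]
        for run in reversed(self.runs): # then newest run to oldest
            for k, v in run:
                if k == key:
                    return v
        return None

    def compact(self):
        # Merge all runs into one; newer runs overwrite older keys.
        merged = {}
        for run in self.runs:
            merged.update(dict(run))
        self.runs = [sorted(merged.items())]
```

Note how `get` may scan several runs until `compact` runs — exactly the read amplification the bullet describes.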
Indexes
- Primary/Clustered index: defines the physical order (InnoDB clusters by primary key).
- Secondary index: stores the primary key as the row locator (InnoDB) or a direct record pointer (heap‑organized tables).
- Other: hash, bitmap, full‑text, geospatial (R‑tree), partial/functional indexes.
Durability Files
- Redo/Write‑Ahead Logs (WAL): crash recovery.
- Undo/Version logs: MVCC snapshots & rollbacks.
- Checkpoints & snapshots: fast restart, backup/restore.
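The write‑ahead‑log idea behind crash recovery fits in a short sketch (conceptual — real redo logs record page‑level changes, not JSON lines): append the change to the log and force it to disk before touching the data, so that replaying the log after a crash rebuilds every committed write.

```python
# Minimal write-ahead-log sketch. Conceptual only: real WALs log
# page-level redo records, batch fsyncs, and checkpoint to truncate.
import json
import os

class ToyWAL:
    def __init__(self, path):
        self.path = path
        self.data = {}

    def set(self, key, value):
        # 1) Append the intended change to the log and force it to disk...
        with open(self.path, "a") as log:
            log.write(json.dumps({"key": key, "value": value}) + "\n")
            log.flush()
            os.fsync(log.fileno())
        # 2) ...only then mutate the in-memory / on-page state.
        self.data[key] = value

    def recover(self):
        # After a crash, replay the log to rebuild committed state.
        self.data = {}
        if os.path.exists(self.path):
            with open(self.path) as log:
                for line in log:
                    rec = json.loads(line)
                    self.data[rec["key"]] = rec["value"]
```

Checkpoints exist precisely so recovery does not have to replay this log from the beginning of time.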
Deep Dive: MySQL (Relational RDBMS)
MySQL is an ACID‑compliant RDBMS with SQL, joins, constraints, transactions, and robust indexing. The default storage engine is InnoDB (row‑store, clustered primary key, MVCC).
Storage Engine (InnoDB)
- Tablespaces & Pages: data and indexes live in 16KB (typical) pages within tablespaces (file‑per‑table or shared).
- Clustered Primary Key: table data is ordered by the primary key; each secondary index stores the primary key as the row locator.
- B+Trees: for both primary and secondary indexes; excellent range scans.
Logs & Durability
- Redo Log (WAL): guarantees durability; flushing on commit is configurable via innodb_flush_log_at_trx_commit.
- Undo Log: supports MVCC snapshots and rollbacks.
- Binary Log (binlog): drives replication and point‑in‑time recovery (PITR); row, statement, or mixed formats.
Query Engine
- Parser/Planner/Optimizer: builds plans, uses statistics, chooses indexes, join orders, access methods.
- Isolation Levels: READ COMMITTED, REPEATABLE READ (default), SERIALIZABLE.
- Locks: row‑level locking; gap/next‑key locks prevent phantom reads at RR.
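The MVCC mechanism that these isolation levels build on can be sketched as a version chain (a toy model, far simpler than InnoDB's undo‑log implementation): each write appends a version tagged with a transaction id, and a snapshot sees only versions created at or before the point it was taken.

```python
# Toy MVCC version chain: readers see the newest version visible to their
# snapshot, so writers never block readers. Heavily simplified vs InnoDB.
class ToyMVCC:
    def __init__(self):
        self.versions = {}    # key -> [(txid, value), ...] ascending txid
        self.next_txid = 1

    def write(self, key, value):
        txid = self.next_txid
        self.next_txid += 1
        self.versions.setdefault(key, []).append((txid, value))
        return txid

    def snapshot(self):
        # A snapshot "sees" everything committed up to this moment.
        return self.next_txid - 1

    def read(self, key, snap):
        # Walk the version chain newest-first; return the first visible one.
        for txid, value in reversed(self.versions.get(key, [])):
            if txid <= snap:
                return value
        return None
```

This is why a REPEATABLE READ transaction keeps seeing the same row even while other transactions commit newer versions underneath it.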
Schema & Constraints
- DDL: CREATE TABLE, ALTER, etc. Strong types: INT, DECIMAL, VARCHAR, JSON, DATE/TIME, spatial.
- Constraints: PRIMARY/UNIQUE KEY, FOREIGN KEY (referential integrity), CHECK, NOT NULL.
- Transactions: BEGIN/COMMIT/ROLLBACK with ACID guarantees.
Scaling & HA
- Replication: async/semisync, GTID; read replicas for scale‑out reads.
- Sharding: app‑level or proxy‑based (Vitess); partitioning supported per table.
- Backups: logical (mysqldump) or physical (xtrabackup); snapshots via cloud.
Strengths
- Strong consistency & integrity with joins and transactions.
- Mature tooling & ecosystem; great for OLTP.
- Powerful indexing, query optimization, and stored programs.
Deep Dive: NoSQL (Families & Storage Models)
"NoSQL" isn’t one product — it’s a family of stores designed around specific access patterns, typically prioritizing horizontal scale and flexible schema. Consistency models vary from strong to eventual depending on the system.
Document Stores (e.g., MongoDB)
- Model: JSON/BSON documents in collections; schema optional, validation rules possible.
- Indexes: B‑Trees on single fields, compound, text, geospatial; secondary indexes are common.
- Transactions: Single‑document atomic by default; multi‑document ACID supported in modern versions.
- Sharding: automatic via shard keys; replica sets for HA.
Key‑Value (e.g., Redis)
- Model: in‑memory structures (strings, lists, sets, sorted sets, hashes) keyed by a string.
- Use: caching, sessions, rate limiting, leaderboards; persistence optional via AOF/RDB.
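One of those uses — rate limiting — shows the key‑value access pattern well. Below is a fixed‑window limiter sketched in Python, with a plain dict standing in for Redis so the example is self‑contained; on real Redis you would implement the same counter with INCR plus EXPIRE:

```python
# Fixed-window rate limiter. A dict stands in for Redis here; in production
# the (key, window) counter would be a Redis key bumped with INCR + EXPIRE.
import time

class FixedWindowLimiter:
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.counters = {}   # (key, window_id) -> request count

    def allow(self, key, now=None):
        now = time.time() if now is None else now
        bucket = (key, int(now // self.window))   # current window id
        count = self.counters.get(bucket, 0) + 1  # the "INCR"
        self.counters[bucket] = count
        return count <= self.limit
```

Every operation is a single get/put on one key — exactly the shape key‑value stores are optimized for.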
Wide‑Column (e.g., Cassandra)
- Model: partition key → ordered rows with flexible columns; denormalized for specific queries.
- Storage: LSM‑based SSTables; high write throughput, tunable consistency.
- Topology: masterless ring; linear horizontal scale, high availability.
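The masterless ring can be sketched with a toy consistent‑hash model (illustrative only — real Cassandra uses Murmur3 tokens and many virtual nodes per host): a partition key hashes to a position on the ring, and the first nodes clockwise from that position own the row and its replicas.

```python
# Toy token ring: a partition key hashes onto the ring and is owned by the
# first node clockwise, plus the next distinct nodes as replicas.
# Illustrative only; real rings use vnodes and Murmur3 hashing.
import bisect
import hashlib

class ToyRing:
    def __init__(self, nodes):
        # One token per node here; real systems assign many virtual tokens.
        self.ring = sorted((self._hash(n), n) for n in nodes)
        self.tokens = [t for t, _ in self.ring]

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def owners(self, key, replicas=2):
        # Walk clockwise from the key's token, collecting replica nodes.
        start = bisect.bisect(self.tokens, self._hash(key)) % len(self.ring)
        return [self.ring[(start + i) % len(self.ring)][1]
                for i in range(replicas)]
```

Because any node can compute this mapping, there is no master to consult — which is what makes linear scale‑out and tunable consistency possible.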
Graph (e.g., Neo4j)
- Model: nodes/edges with labels and properties; query with Cypher.
- Use: relationships‑heavy queries: recommendations, fraud, knowledge graphs.
General NoSQL Trade‑offs
- Schema flexibility: evolve fields anytime; fits event/user‑profile/content domains.
- Scale‑out first: sharding/partitioning are built‑in (varies by system).
- Consistency: from strong to eventual (tunable).
- Joins: usually not native — model with denormalization or app‑side joins.
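The "app‑side join" mentioned in the last bullet is a common pattern worth seeing concretely. A minimal sketch (function names are hypothetical; `fetch_users_by_ids` stands in for whatever batched lookup your store provides): fetch the parent records, collect the referenced ids, fetch those in one batch, and stitch the results in application code.

```python
# App-side join: with no native JOIN, stitch two result sets in code.
# fetch_users_by_ids is a stand-in for a batched lookup (e.g., an $in query).
def join_orders_with_users(orders, fetch_users_by_ids):
    user_ids = {o["user_id"] for o in orders}             # collect references
    users = {u["user_id"]: u for u in fetch_users_by_ids(user_ids)}
    return [{**o, "user": users.get(o["user_id"])} for o in orders]
```

The batched second fetch matters: issuing one lookup per order would recreate the N+1 query problem in a new disguise.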
When NoSQL Shines
- Massive write throughput (logs, metrics, IoT).
- Flexible, nested content (catalogs, user content, CMS).
- Globally distributed, high‑availability services.
MySQL vs NoSQL — Key Differences
| Dimension | MySQL (Relational) | NoSQL (Family) |
|---|---|---|
| Data Model | Tables, rows, fixed schema; strong types; foreign keys. | Documents (JSON/BSON), key‑value, wide‑column, graphs; flexible schema. |
| Schema | Schema‑on‑write; ALTER TABLE changes. | Schema‑on‑read; evolve documents/columns freely. |
| Joins | Native, optimized joins with indexes. | Generally avoided; denormalize or app‑side joins. |
| Transactions | Strong ACID across multiple rows/tables. | Varies: single‑record atomicity is common; some support multi‑record ACID (e.g., MongoDB) but with constraints. |
| Consistency Model | Strong (per transaction) with MVCC. | Tunable/Eventually consistent common in distributed NoSQL; some offer strong consistency modes. |
| Scaling | Vertical + read replicas; sharding via app/Vitess. | Horizontal sharding/partitioning built‑in (varies by system). |
| Query Language | SQL (declarative joins, window functions, aggregates). | API‑style or JSON queries (Mongo), CQL (Cassandra), Cypher (Neo4j), commands (Redis). |
| Access Patterns | Ad‑hoc queries across normalized data; complex joins. | Pre‑planned queries; denormalized, partition‑friendly models. |
| Typical Use‑Cases | Financial apps, orders, inventory, CRM, core business OLTP. | Event streams, catalogs, content feeds, IoT metrics, caching, graph relationships. |
| Storage Structures | Row‑store; B+Tree indexes; redo/undo logs. | Document blobs (BSON/JSON), LSM or B‑Trees; compaction & replicas. |
| Analytics | Works, but heavy OLAP better offloaded to columnar warehouses. | Some provide aggregations, but serious OLAP → columnar/warehouse. |
Rule of thumb: If you need strong integrity, complex joins, and transactional guarantees → start with MySQL. If you need flexible schemas, massive write throughput, global distribution, or graph traversals → pick the appropriate NoSQL type.
Hands‑on: Modeling the Same Domain
Domain: a simple Orders system with users, products, and orders containing line‑items.
MySQL (Normalized, Joins)
-- DDL
CREATE TABLE users (
user_id BIGINT PRIMARY KEY,
name VARCHAR(120) NOT NULL,
email VARCHAR(255) UNIQUE NOT NULL
);
CREATE TABLE products (
product_id BIGINT PRIMARY KEY,
name VARCHAR(160) NOT NULL,
price DECIMAL(10,2) NOT NULL
);
CREATE TABLE orders (
order_id BIGINT PRIMARY KEY,
user_id BIGINT NOT NULL,
created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
FOREIGN KEY (user_id) REFERENCES users(user_id)
);
CREATE TABLE order_items (
order_id BIGINT,
product_id BIGINT,
qty INT NOT NULL,
price_at_purchase DECIMAL(10,2) NOT NULL,
PRIMARY KEY (order_id, product_id),
FOREIGN KEY (order_id) REFERENCES orders(order_id),
FOREIGN KEY (product_id) REFERENCES products(product_id)
);
-- Query: total spent by a user in a date range
SELECT u.user_id, u.name, SUM(oi.qty * oi.price_at_purchase) AS total_spent
FROM users u
JOIN orders o ON o.user_id = u.user_id
JOIN order_items oi ON oi.order_id = o.order_id
WHERE u.user_id = 42 AND o.created_at >= '2025-01-01' AND o.created_at < '2026-01-01'
GROUP BY u.user_id, u.name;
MongoDB (Document, Denormalized)
// Orders collection: embed items; reference user by id
{
_id: ObjectId("..."),
user_id: 42,
created_at: ISODate("2025-06-01T10:20:00Z"),
items: [
{ product_id: 1001, name: "Mouse", qty: 2, price_at_purchase: 499.00 },
{ product_id: 2002, name: "Keyboard", qty: 1, price_at_purchase: 1599.00 }
]
}
// Index for reporting by user and date
// db.orders.createIndex({ user_id: 1, created_at: 1 })
// Aggregation: total spent by a user in a year
[
{ $match: { user_id: 42, created_at: { $gte: ISODate('2025-01-01'), $lt: ISODate('2026-01-01') } } },
{ $unwind: "$items" },
{ $group: { _id: "$user_id", total_spent: { $sum: { $multiply: ["$items.qty", "$items.price_at_purchase"] } } } }
]
Notes on the Models
- MySQL: Referential integrity and normalized tables avoid duplication; joins reconstruct order views.
- MongoDB: Embedding keeps order + items together for fast order reads; product names/prices copied for historical accuracy (denormalized).
Performance Considerations
- MySQL: Index joins; beware of N+1 queries; use covering indexes and careful PK design.
- MongoDB: Choose shard key wisely; keep documents under size limits; avoid unbounded array growth.
How to Choose (Checklist)
Pick MySQL if…
- You need strong ACID transactions across multiple tables.
- Your queries require complex joins and constraints.
- Data is highly structured and stable; integrity matters (finance, inventory).
- You want mature tooling and predictable performance for OLTP.
Pick NoSQL (Document) if…
- Schemas evolve rapidly (user profiles, content, catalogs).
- You mostly read/write whole aggregates (a document at a time).
- You need auto‑sharding and easy horizontal scale‑out.
Pick NoSQL (Wide‑Column / KV / Graph) if…
- Extreme write throughput with predictable access patterns (Cassandra).
- Low‑latency caching & ephemeral data (Redis).
- Relationship‑centric queries (Neo4j).
Hybrid architectures are common: MySQL for core transactions, Redis for caching, Elasticsearch for search, and a warehouse (e.g., BigQuery/ClickHouse) for analytics.