Databases Deep Dive — Types, Storage Formats, and MySQL vs NoSQL

A practical, in‑depth explainer on how different databases store data, with a special focus on MySQL (Relational) and NoSQL families. Includes examples, trade‑offs, and when to choose what.

Quick Glossary

  • ACID (Atomicity, Consistency, Isolation, Durability): the classic transactional guarantees.
  • CAP (Consistency, Availability, Partition tolerance): under a network partition, a distributed system must trade consistency against availability.
  • MVCC (Multi‑Version Concurrency Control): readers see a consistent snapshot without blocking writers.
  • B+Tree: balanced, ordered index structure with linked leaf pages; good for point and range lookups.
  • LSM Tree (Log‑Structured Merge tree): writes buffered in memory, flushed as sorted runs, merged by compaction.
  • OLTP (Online Transaction Processing): many small, concurrent reads and writes.
  • OLAP (Online Analytical Processing): large scans and aggregations over historical data.
  • DDL (Data Definition Language): CREATE, ALTER, DROP.
  • DML (Data Manipulation Language): SELECT, INSERT, UPDATE, DELETE.

Major Types of Databases & Their Data Models

1) Relational (RDBMS)

Data is organized into tables (relations) of rows and columns. Schema is defined upfront (schema‑on‑write). Joins are first‑class. Examples: MySQL, PostgreSQL, SQL Server, Oracle.

Storage style: generally row‑oriented; indexes via B+Trees. Strong ACID with MVCC.

2) Key‑Value Stores

Everything is a large dictionary: key → opaque value. Blazing fast for simple get/put. Examples: Redis (in‑memory), Riak, Amazon Dynamo‑style systems.

Storage style: hash tables, skip lists, or LSM. Values can be JSON, blobs, or binary.

3) Document Stores

Data stored as self‑describing documents (JSON/BSON). Flexible schema (schema‑on‑read). Examples: MongoDB, Couchbase, Firebase Firestore.

Storage style: B‑Trees or LSM + per‑document indexes; nested fields can be indexed.

4) Wide‑Column (Column‑Family)

Rows with sparse, dynamic columns grouped into families; great for time‑series and high‑write workloads. Examples: Apache Cassandra, HBase.

Storage style: LSM‑based SSTables; sequential appends + compactions.

5) Graph Databases

Data as nodes and edges with properties; optimal for traversals (e.g., shortest path). Examples: Neo4j, JanusGraph.

Storage style: adjacency lists, index‑free adjacency, B‑Trees/LSM for properties.

6) Columnar Analytics (OLAP)

Stores by column for compression and vectorized scans; ideal for analytics dashboards. Examples: ClickHouse, BigQuery, Redshift.

Storage style: column files, dictionary encoding, run‑length encoding, Parquet/ORC.

7) Time‑Series

Optimized for timestamped metrics/events. Examples: InfluxDB, TimescaleDB (Postgres ext.).

Storage style: append‑only + downsampling/retention policies.

8) Others (Legacy / Niche)

Hierarchical (IMS), Network (CODASYL), Object‑Oriented (db4o), In‑Memory OLTP (Hekaton), NewSQL (CockroachDB, TiDB).

Physical Storage & Indexing (What lives on disk?)

Row‑Store vs Column‑Store

  • Row‑store: entire row laid out together → great for OLTP (many small inserts/updates). Typical for MySQL/InnoDB.
  • Column‑store: each column stored separately → superb for aggregation/analytics (OLAP). Typical for ClickHouse/Redshift.
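To make the layout difference concrete, here is a minimal Python sketch with made‑up data: a columnar SUM touches one contiguous array, while the row‑store must walk every record object even though it needs a single field.

```python
# Row-store: each record kept together -> cheap to fetch/update one row (OLTP).
row_store = [
    {"order_id": 1, "user_id": 42, "amount": 499.0},
    {"order_id": 2, "user_id": 42, "amount": 1599.0},
    {"order_id": 3, "user_id": 7,  "amount": 250.0},
]

# Column-store: each column kept together -> an aggregate scans one
# contiguous array and never touches order_id/user_id (OLAP).
col_store = {
    "order_id": [1, 2, 3],
    "user_id":  [42, 42, 7],
    "amount":   [499.0, 1599.0, 250.0],
}

def total_row_store():
    # Must visit every row object to extract one field.
    return sum(r["amount"] for r in row_store)

def total_col_store():
    # Touches only the 'amount' column.
    return sum(col_store["amount"])
```

Real column stores add dictionary and run‑length encoding on top of this layout, which is why they compress so well.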

B+Tree vs LSM Tree

  • B+Tree: balanced tree; great for random reads, range scans; steady writes.
  • LSM Tree: buffer writes in memory, flush sorted runs to disk, then compact. High write throughput; reads may touch multiple runs until compacted.
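The LSM write path can be sketched in a few lines of Python. This toy (illustrative only, not a real engine) shows the three moving parts: the memtable flush, the newest‑run‑wins read, and compaction.

```python
import bisect

class ToyLSM:
    """Toy LSM tree: in-memory memtable + immutable sorted runs on 'disk'."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}            # recent writes, in memory
        self.runs = []                # immutable sorted (key, value) runs
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            # Flush: write the memtable out as a sorted run (an "SSTable").
            self.runs.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:      # freshest data first
            return self.memtable[key]
        for run in reversed(self.runs):          # newest run wins
            i = bisect.bisect_left(run, (key,))  # binary search in sorted run
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

    def compact(self):
        # Merge all runs into one; newer values overwrite older ones.
        merged = {}
        for run in self.runs:
            merged.update(run)
        self.runs = [sorted(merged.items())]
```

Note how a read may probe several runs until `compact()` merges them; that is exactly the read amplification the bullet above describes.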

Indexes

  • Primary/Clustered index: defines the physical order (InnoDB clusters by primary key).
  • Secondary index: points to primary key (InnoDB) or direct record pointer.
  • Other: hash, bitmap, full‑text, geospatial (R‑tree), partial/functional indexes.

Durability Files

  • Redo/Write‑Ahead Logs (WAL): crash recovery.
  • Undo/Version logs: MVCC snapshots & rollbacks.
  • Checkpoints & snapshots: fast restart, backup/restore.
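The WAL idea is simple enough to sketch directly (hypothetical record format): log first, apply second, replay on recovery. A real engine fsyncs the log and replays only past the last checkpoint.

```python
import json

class ToyWAL:
    """Toy write-ahead log: changes are logged before they are applied."""

    def __init__(self):
        self.log = []        # append-only redo log (fsync'd to disk in reality)
        self.data = {}       # in-memory table state

    def set(self, key, value):
        record = json.dumps({"op": "set", "key": key, "value": value})
        self.log.append(record)     # 1) log first -- the durability point
        self.data[key] = value      # 2) then apply to the table

    def recover(self):
        # After a crash, rebuild state by replaying the log in order.
        state = {}
        for record in self.log:
            entry = json.loads(record)
            if entry["op"] == "set":
                state[entry["key"]] = entry["value"]
        return state
```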

Deep Dive: MySQL (Relational RDBMS)

MySQL = ACID RDBMS with SQL, joins, constraints, transactions, and robust indexing. Default engine is InnoDB (row‑store, clustered primary key, MVCC).

Storage Engine (InnoDB)

  • Tablespaces & Pages: data and indexes live in 16KB (typical) pages within tablespaces (file‑per‑table or shared).
  • Clustered Primary Key: table data is ordered by the primary key; each secondary index stores the primary key as the row locator.
  • B+Trees: for both primary and secondary indexes; excellent range scans.
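The resulting lookup paths can be sketched with plain dicts standing in for the two B+Trees (example rows are made up): a secondary‑index hit is a two‑step hop, indexed value → primary key → clustered row.

```python
clustered = {  # clustered index: primary key -> full row (the table data)
    1: {"user_id": 1, "email": "a@example.com", "name": "Ana"},
    2: {"user_id": 2, "email": "b@example.com", "name": "Bo"},
}

email_index = {  # secondary index: email -> primary key (the row locator)
    "a@example.com": 1,
    "b@example.com": 2,
}

def find_by_email(email):
    pk = email_index.get(email)     # step 1: probe the secondary index
    return clustered.get(pk)        # step 2: look up the clustered index

def find_by_pk(pk):
    return clustered.get(pk)        # single step: clustered lookup
```

This is why wide primary keys hurt in InnoDB: every secondary index entry carries a copy of the primary key.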

Logs & Durability

  • Redo Log (WAL): guarantees durability; flushed on commit (configurable innodb_flush_log_at_trx_commit).
  • Undo Log: supports MVCC and rollbacks.
  • Binary Log (binlog): for replication & PITR backups (row/statement/mixed formats).

Query Engine

  • Parser/Planner/Optimizer: builds plans, uses statistics, chooses indexes, join orders, access methods.
  • Isolation Levels: READ UNCOMMITTED, READ COMMITTED, REPEATABLE READ (default), SERIALIZABLE.
  • Locks: row‑level locking; gap/next‑key locks prevent phantom reads under REPEATABLE READ.
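The MVCC mechanics behind those isolation levels can be sketched as a toy versioned store: each write appends a version stamped with a logical commit time, and a transaction reads the newest version no later than its snapshot, so readers never block writers.

```python
class ToyMVCC:
    """Toy MVCC store: versioned values read against a snapshot timestamp."""

    def __init__(self):
        self.versions = {}   # key -> list of (commit_ts, value)
        self.clock = 0       # logical commit timestamp

    def write(self, key, value):
        self.clock += 1
        self.versions.setdefault(key, []).append((self.clock, value))

    def snapshot(self):
        # A transaction's read view: everything committed up to now.
        return self.clock

    def read(self, key, snapshot_ts):
        # Newest version visible to this snapshot.
        visible = [v for ts, v in self.versions.get(key, []) if ts <= snapshot_ts]
        return visible[-1] if visible else None
```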

Schema & Constraints

  • DDL: CREATE TABLE, ALTER, etc. Strong types: INT, DECIMAL, VARCHAR, JSON, DATE/TIME, spatial.
  • Constraints: PRIMARY/UNIQUE KEY, FOREIGN KEY (referential integrity), CHECK, NOT NULL.
  • Transactions: BEGIN/COMMIT/ROLLBACK with ACID guarantees.

Scaling & HA

  • Replication: async/semisync, GTID; read replicas for scale‑out reads.
  • Sharding: app‑level or proxy‑based (Vitess); partitioning supported per table.
  • Backups: logical (mysqldump) or physical (xtrabackup); snapshots via cloud.

Strengths

  • Strong consistency & integrity with joins and transactions.
  • Mature tooling & ecosystem; great for OLTP.
  • Powerful indexing, query optimization, and stored programs.

Deep Dive: NoSQL (Families & Storage Models)

"NoSQL" isn’t one product — it’s a family of stores designed around specific access patterns, typically prioritizing horizontal scale and flexible schema. Consistency models vary from strong to eventual depending on the system.

Document Stores (e.g., MongoDB)

  • Model: JSON/BSON documents in collections; schema optional, validation rules possible.
  • Indexes: B‑Trees on single fields, compound, text, geospatial; secondary indexes are common.
  • Transactions: Single‑document writes are atomic by default; multi‑document ACID transactions are supported since MongoDB 4.0 (4.2 for sharded clusters).
  • Sharding: automatic via shard keys; replica sets for HA.

Key‑Value (e.g., Redis)

  • Model: in‑memory structures (strings, lists, sets, sorted sets, hashes) keyed by a string.
  • Use: caching, sessions, rate limiting, leaderboards; persistence optional via AOF/RDB.
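For example, the classic Redis rate limiter is an INCR on a per‑window counter plus an EXPIRE. The sketch below mimics that fixed‑window pattern in plain Python so it runs standalone; the injectable clock stands in for Redis key expiry, and in production the counter would live in Redis so all app servers share it.

```python
import time

class FixedWindowLimiter:
    """Fixed-window rate limiter mimicking the Redis INCR + EXPIRE pattern."""

    def __init__(self, limit, window_seconds, clock=time.time):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self.counters = {}   # (key, window_start) -> request count

    def allow(self, key):
        # Bucket requests by the window they fall into (EXPIRE analogue:
        # old buckets simply stop being consulted).
        window_start = int(self.clock() // self.window) * self.window
        bucket = (key, window_start)
        self.counters[bucket] = self.counters.get(bucket, 0) + 1  # INCR analogue
        return self.counters[bucket] <= self.limit
```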

Wide‑Column (e.g., Cassandra)

  • Model: partition key → ordered rows with flexible columns; denormalized for specific queries.
  • Storage: LSM‑based SSTables; high write throughput, tunable consistency.
  • Topology: masterless ring; linear horizontal scale, high availability.
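The masterless ring rests on consistent hashing: a row's partition key hashes to a point on the ring, and the next distinct nodes clockwise hold the replicas. A small sketch (node names are made up; real Cassandra uses Murmur3 tokens and virtual nodes):

```python
import bisect
import hashlib

class HashRing:
    """Toy consistent-hash ring: keys map to the next nodes clockwise."""

    def __init__(self, nodes):
        # (token, node) pairs sorted by token position on the ring.
        self.ring = sorted((self._hash(n), n) for n in nodes)

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def replicas(self, partition_key, rf=2):
        token = self._hash(partition_key)
        i = bisect.bisect(self.ring, (token, ""))   # first node clockwise
        out = []
        for step in range(len(self.ring)):
            node = self.ring[(i + step) % len(self.ring)][1]
            if node not in out:
                out.append(node)
            if len(out) == rf:
                break
        return out
```

Adding a node only shifts ownership of the range just before its token, which is what makes scale‑out linear.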

Graph (e.g., Neo4j)

  • Model: nodes/edges with labels and properties; query with Cypher.
  • Use: relationships‑heavy queries: recommendations, fraud, knowledge graphs.
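Index‑free adjacency means a traversal walks stored neighbor lists rather than running repeated joins. Conceptually, a shortest‑path query reduces to BFS over adjacency lists, as in this sketch with a made‑up friendship graph:

```python
from collections import deque

friends = {  # adjacency lists: node -> neighbors
    "ana": ["bo", "cy"],
    "bo":  ["ana", "dee"],
    "cy":  ["ana", "dee"],
    "dee": ["bo", "cy", "eli"],
    "eli": ["dee"],
}

def shortest_path(graph, start, goal):
    # Breadth-first search: the first path to reach the goal is shortest.
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in graph.get(path[-1], []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None   # goal unreachable
```

In a relational model the same query would need one self‑join per hop, which is why traversal depth hurts SQL far more than it hurts a graph store.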

General NoSQL Trade‑offs

  • Schema flexibility: evolve fields anytime; fits event/user‑profile/content domains.
  • Scale‑out first: sharding/partitioning are built‑in (varies by system).
  • Consistency: from strong to eventual (tunable).
  • Joins: usually not native — model with denormalization or app‑side joins.
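An app‑side join typically means: fetch the primary records, collect the foreign keys, batch‑fetch the referenced records, then stitch the results together in memory. A sketch with hypothetical in‑memory "tables" standing in for two separate stores:

```python
users = {42: {"name": "Ana"}, 7: {"name": "Bo"}}   # e.g., a users collection
orders = [                                          # e.g., an orders collection
    {"order_id": 1, "user_id": 42, "amount": 499.0},
    {"order_id": 2, "user_id": 7,  "amount": 250.0},
]

def orders_with_user_names():
    # 1) gather the foreign keys from the primary result set
    wanted = {o["user_id"] for o in orders}
    # 2) one batch lookup against the other store (not one query per order!)
    user_map = {uid: users[uid] for uid in wanted if uid in users}
    # 3) merge in memory
    return [
        {**o, "user_name": user_map[o["user_id"]]["name"]}
        for o in orders if o["user_id"] in user_map
    ]
```

The batch step is the important part: issuing one lookup per order reintroduces the N+1 query problem at the application layer.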

When NoSQL Shines

  • Massive write throughput (logs, metrics, IoT).
  • Flexible, nested content (catalogs, user content, CMS).
  • Globally distributed, high‑availability services.

MySQL vs NoSQL — Key Differences

  • Data Model. MySQL: tables, rows, fixed schema; strong types; foreign keys. NoSQL: documents (JSON/BSON), key‑value, wide‑column, graphs; flexible schema.
  • Schema. MySQL: schema‑on‑write; changes via ALTER TABLE. NoSQL: schema‑on‑read; evolve documents/columns freely.
  • Joins. MySQL: native, optimized joins with indexes. NoSQL: generally avoided; denormalize or join in the application.
  • Transactions. MySQL: strong ACID across multiple rows/tables. NoSQL: varies; single‑record atomicity is common, and some systems support multi‑record ACID (e.g., MongoDB) with constraints.
  • Consistency Model. MySQL: strong (per transaction) with MVCC. NoSQL: tunable/eventual consistency is common in distributed systems; some offer strong modes.
  • Scaling. MySQL: vertical + read replicas; sharding via app or Vitess. NoSQL: horizontal sharding/partitioning built in (varies by system).
  • Query Language. MySQL: SQL (declarative joins, window functions, aggregates). NoSQL: API‑style or JSON queries (MongoDB), CQL (Cassandra), Cypher (Neo4j), commands (Redis).
  • Access Patterns. MySQL: ad‑hoc queries across normalized data; complex joins. NoSQL: pre‑planned queries over denormalized, partition‑friendly models.
  • Typical Use‑Cases. MySQL: financial apps, orders, inventory, CRM, core business OLTP. NoSQL: event streams, catalogs, content feeds, IoT metrics, caching, graph relationships.
  • Storage Structures. MySQL: row‑store; B+Tree indexes; redo/undo logs. NoSQL: document blobs (BSON/JSON), LSM or B‑Trees; compaction and replicas.
  • Analytics. MySQL: works, but heavy OLAP is better offloaded to columnar warehouses. NoSQL: some provide aggregations, but serious OLAP belongs in a columnar warehouse.

Rule of thumb: If you need strong integrity, complex joins, and transactional guarantees → start with MySQL. If you need flexible schemas, massive write throughput, global distribution, or graph traversals → pick the appropriate NoSQL type.

Hands‑on: Modeling the Same Domain

Domain: a simple Orders system with users, products, and orders containing line‑items.

MySQL (Normalized, Joins)

-- DDL
CREATE TABLE users (
  user_id BIGINT PRIMARY KEY,
  name VARCHAR(120) NOT NULL,
  email VARCHAR(255) UNIQUE NOT NULL
);

CREATE TABLE products (
  product_id BIGINT PRIMARY KEY,
  name VARCHAR(160) NOT NULL,
  price DECIMAL(10,2) NOT NULL
);

CREATE TABLE orders (
  order_id BIGINT PRIMARY KEY,
  user_id BIGINT NOT NULL,
  created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
  FOREIGN KEY (user_id) REFERENCES users(user_id)
);

CREATE TABLE order_items (
  order_id BIGINT,
  product_id BIGINT,
  qty INT NOT NULL,
  price_at_purchase DECIMAL(10,2) NOT NULL,
  PRIMARY KEY (order_id, product_id),
  FOREIGN KEY (order_id) REFERENCES orders(order_id),
  FOREIGN KEY (product_id) REFERENCES products(product_id)
);

-- Query: total spent by a user in a date range
SELECT u.user_id, u.name, SUM(oi.qty * oi.price_at_purchase) AS total_spent
FROM users u
JOIN orders o ON o.user_id = u.user_id
JOIN order_items oi ON oi.order_id = o.order_id
WHERE u.user_id = 42 AND o.created_at >= '2025-01-01' AND o.created_at < '2026-01-01'
GROUP BY u.user_id, u.name;

MongoDB (Document, Denormalized)

// Orders collection: embed items; reference user by id
{
  _id: ObjectId("..."),
  user_id: 42,
  created_at: ISODate("2025-06-01T10:20:00Z"),
  items: [
    { product_id: 1001, name: "Mouse", qty: 2, price_at_purchase: 499.00 },
    { product_id: 2002, name: "Keyboard", qty: 1, price_at_purchase: 1599.00 }
  ]
}

// Index for reporting by user and date
// db.orders.createIndex({ user_id: 1, created_at: 1 })

// Aggregation: total spent by a user in a year
[
  { $match: { user_id: 42, created_at: { $gte: ISODate('2025-01-01'), $lt: ISODate('2026-01-01') } } },
  { $unwind: "$items" },
  { $group: { _id: "$user_id", total_spent: { $sum: { $multiply: ["$items.qty", "$items.price_at_purchase"] } } } }
]

Notes on the Models

  • MySQL: Referential integrity and normalized tables avoid duplication; joins reconstruct order views.
  • MongoDB: Embedding keeps order + items together for fast order reads; product names/prices copied for historical accuracy (denormalized).

Performance Considerations

  • MySQL: Index joins; beware of N+1 queries; use covering indexes and careful PK design.
  • MongoDB: Choose shard key wisely; keep documents under size limits; avoid unbounded array growth.

How to Choose (Checklist)

Pick MySQL if…

  • You need strong ACID transactions across multiple tables.
  • Your queries require complex joins and constraints.
  • Data is highly structured and stable; integrity matters (finance, inventory).
  • You want mature tooling and predictable performance for OLTP.

Pick NoSQL (Document) if…

  • Schemas evolve rapidly (user profiles, content, catalogs).
  • You mostly read/write whole aggregates (a document at a time).
  • You need auto‑sharding and easy horizontal scale‑out.

Pick NoSQL (Wide‑Column / KV / Graph) if…

  • Extreme write throughput with predictable access patterns (Cassandra).
  • Low‑latency caching & ephemeral data (Redis).
  • Relationship‑centric queries (Neo4j).

Hybrid architectures are common: MySQL for core transactions, Redis for caching, Elasticsearch for search, and a warehouse (e.g., BigQuery/ClickHouse) for analytics.

References & Further Reading