I built a database without using a database

What building a file-based event store from scratch taught me about byte offsets, crash recovery, and how real storage engines actually work.

Tags: backend, node.js, typescript, systems, storage

[Image: terminal output showing a Node.js server recovering events from an append-only log file on startup]

I had three options for my HNG task: a retry engine, a notification batcher, and an append-only event store.

I picked the event store because I couldn't fully explain how it would work. That's usually a good sign.

The constraint was interesting: build an HTTP API that stores and retrieves events — no database allowed. No PostgreSQL, no SQLite, no Redis. Just the filesystem and whatever I could build on top of it. What followed was one of the more clarifying backend problems I've worked on. By the end, I understood databases differently.

Why "just scan the file" doesn't cut it

My first instinct was simple: append events to a flat file, scan the whole file on every read. It works. You can ship it. But it's O(N) — every read gets slower as the file grows. Ten events, you don't notice. Ten thousand, you do. A million, your server is effectively broken.

Real storage systems don't scan. They know exactly where to look.

The insight that unlocked the whole design: if you record the byte offset of each event at write time, reads become trivial. "Seek to this position, read this many bytes." The file can grow forever and the read cost stays constant. That single idea is the backbone of the entire project.
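
To make the idea concrete before any file I/O is involved, here is a minimal in-memory sketch. A Buffer stands in for events.log, and the names `appendLine`, `readAt`, and `Pointer` are illustrative assumptions, not taken from the project.

```typescript
// Illustrative only: a Buffer plays the role of the log file.
type Pointer = { offset: number; length: number };

let log: Buffer = Buffer.alloc(0);
const pointers = new Map<string, Pointer>();

function appendLine(id: string, payload: unknown): Pointer {
  const line = Buffer.from(JSON.stringify(payload) + "\n", "utf8");
  // The new record starts where the log currently ends.
  const pointer = { offset: log.length, length: line.length };
  log = Buffer.concat([log, line]);
  pointers.set(id, pointer);
  return pointer;
}

function readAt(id: string): unknown {
  const p = pointers.get(id);
  if (!p) return null;
  // Slice exactly the recorded byte range; no other record is touched.
  return JSON.parse(log.subarray(p.offset, p.offset + p.length).toString("utf8"));
}

appendLine("a", { id: "a", type: "signup" });
appendLine("b", { id: "b", type: "login" });
console.log(readAt("b")); // { id: 'b', type: 'login' }
```

The lookup cost depends on the size of one record, not on how many records came before it.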

The write path

Every POST /events does four things in sequence. It serializes the incoming JSON into a single newline-terminated string, measures the exact byte length of that string, appends it to events.log on disk, and stores the offset and length in an in-memory index.

store.ts — append and index
TS
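// Serialize once; the same string is both measured and written.
const line = JSON.stringify(event) + "\n";
// Byte length, not line.length: offsets are positions on disk.
const length = Buffer.byteLength(line, "utf8");
// The new record starts where the file currently ends.
// This assumes appends run one at a time, never overlapping.
const offset = currentFileSize;
await fileHandle.write(line);
// Index the record only after the write has succeeded.
index.set(event.id, { offset, length });
currentFileSize += length;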

The in-memory index is a Map<string, { offset: number; length: number }>. It's the thing that makes reads fast. Without it, you're back to scanning.

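One thing worth being explicit about: an append never touches previous entries. A crash can only damage the line being written at that moment, never the ones before it. The write itself is not guaranteed to be all-or-nothing, though. A crash partway through can leave a truncated final line, and data the operating system has not yet flushed to disk can still be lost unless you call fsync. That matters a lot for crash recovery, which I'll get to.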
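
Two things the snippet above leaves open are ordering and durability. The Node.js documentation warns against calling filehandle.write() again on the same file before the previous promise settles, and a successful write only means the data reached the operating system. The sketch below is one way to handle both, under stated assumptions: `AppendLog` and its methods are hypothetical names, appends are chained on a promise so they run one at a time, and `datasync()` is called after every write, which trades throughput for durability.

```typescript
import { open } from "node:fs/promises";
import type { FileHandle } from "node:fs/promises";

type IndexEntry = { offset: number; length: number };

// Hypothetical wrapper, not the project's actual store class.
class AppendLog {
  readonly index = new Map<string, IndexEntry>();
  private tail: Promise<unknown> = Promise.resolve();
  private size: number;
  private readonly handle: FileHandle;

  private constructor(handle: FileHandle, size: number) {
    this.handle = handle;
    this.size = size;
  }

  static async create(path: string): Promise<AppendLog> {
    const handle = await open(path, "a"); // "a": writes always land at end of file
    const { size } = await handle.stat(); // start counting from the existing size
    return new AppendLog(handle, size);
  }

  append(event: { id: string }): Promise<IndexEntry> {
    // Chain onto the previous append so writes never overlap.
    const run = this.tail.then(async () => {
      const line = JSON.stringify(event) + "\n";
      const length = Buffer.byteLength(line, "utf8");
      const offset = this.size;
      await this.handle.write(line);
      await this.handle.datasync(); // ask the OS to flush file data to the device
      this.size += length;
      const entry = { offset, length };
      this.index.set(event.id, entry);
      return entry;
    });
    // A failed append must not block the ones queued after it.
    this.tail = run.catch(() => undefined);
    return run;
  }

  close(): Promise<void> {
    return this.tail.then(() => this.handle.close());
  }
}
```

Calling `append` three times without awaiting still produces three lines in call order, each with the offset of the line before it plus that line's length. A failed write can still leave partial bytes in the file, which is why recovery has to tolerate a damaged tail.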

The read path

GET /events/:id is almost embarrassingly simple once the index exists.

store.ts — seek and read
TS
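const entry = index.get(id);
if (!entry) return null;

// Allocate exactly the recorded number of bytes.
const buffer = Buffer.alloc(entry.length);
// Arguments: target buffer, start within the buffer, bytes to read, position in the file.
const { bytesRead } = await fileHandle.read(buffer, 0, entry.length, entry.offset);
// A short read means the index and the file disagree.
if (bytesRead !== entry.length) throw new Error(`Short read for event ${id}`);
return JSON.parse(buffer.toString("utf8"));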

Look up the index, get the offset and length, seek the file, read exactly those bytes, parse and return. The read complexity is O(1) to locate and O(length) to fetch. The length is always tiny — it's one JSON object, not the entire log.

No scanning. No loading the file into memory. Direct seeks.

The bug I didn't see coming: string length vs byte length

Early in development I was computing offsets using line.length. This appeared to work fine. Then I tested with emoji in the event payload.

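JavaScript's line.length counts UTF-16 code units, not bytes. A typical emoji such as 😀 is two code units in a JavaScript string but four bytes in UTF-8 on disk, and even a single accented character like é is one code unit but two bytes. If your offset tracking uses string length instead of byte count, your seek pointer lands in the wrong place. You either get a parse error or, worse, silently read the wrong data.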

The fix is Buffer.byteLength(line, "utf8") everywhere you measure something that will be written to disk. Once I made that change, everything worked correctly regardless of what characters the events contained. This is one of those bugs that's invisible until it isn't, and then it's confusing until you understand the underlying model.
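
The mismatch is easy to see in isolation. This small sketch compares the two measurements for a few characters, then shows how far a length-based offset drifts after a single line containing an emoji. The sample payload is made up for illustration.

```typescript
// Compare JavaScript string length (UTF-16 code units) with UTF-8 byte length.
const samples = ["a", "\u00e9", "\u{1F600}"]; // "a", "é", "😀"
for (const s of samples) {
  // length / bytes for the three samples: 1/1, 1/2, 2/4
  console.log(JSON.stringify(s), s.length, Buffer.byteLength(s, "utf8"));
}

// How far a length-based offset drifts after one line containing an emoji.
const firstLine = JSON.stringify({ id: "1", note: "shipped \u{1F600}" }) + "\n";
const offsetFromLength = firstLine.length;
const offsetFromBytes = Buffer.byteLength(firstLine, "utf8");
console.log(offsetFromBytes - offsetFromLength); // 2
```

Every later record inherits that two-byte error, so one emoji early in the log shifts every subsequent read.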

Crash recovery

This is the part I found most satisfying to implement.

The in-memory index doesn't persist between server restarts. When the process exits, it's gone. On the next startup, the server knows nothing — all it has is the log file on disk.

The recovery approach: stream events.log line by line using Node's readline module, track the byte position manually as you go, and re-populate the index from each parsed line. You end up with the same index you had before the crash, without ever loading the full file into memory.

store.ts — startup recovery
TS
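import * as fs from "node:fs";
import * as readline from "node:readline";

type IndexEntry = { offset: number; length: number };

async function recover(logPath: string, index: Map<string, IndexEntry>) {
  let offset = 0;
  const rl = readline.createInterface({
    input: fs.createReadStream(logPath),
  });
  for await (const line of rl) {
    // Measure every line, blank or not, so later offsets stay aligned.
    const byteLength = Buffer.byteLength(line + "\n", "utf8");
    if (line.trim()) {
      try {
        const event = JSON.parse(line);
        index.set(event.id, { offset, length: byteLength });
      } catch {
        // Most likely a write cut short by a crash; skip it instead of aborting startup.
        console.warn(`Skipping unparsable line at byte ${offset}`);
      }
    }
    offset += byteLength;
  }
  console.log(`Recovered ${index.size} events from ${logPath}`);
}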

When the server comes back up, you see Recovered N events from /path/to/events.log in the terminal. Then every event that was fully written before the crash is immediately retrievable. No manual intervention, no external state store.

The key thing readline gives you here is streaming. You're not doing fs.readFileSync and loading gigabytes of log into memory. You're reading one line at a time, tracking position, and building the index incrementally. Startup time still grows linearly with the size of the log, since every line is read once, but memory use is bounded by the index rather than by the file.
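
Because readline decodes text and applies its own line-splitting rules (it also treats \r\n as a line break, for instance), the byte position has to be reconstructed from each decoded line. An alternative sketch, offered here as an assumption rather than what the project does, is to scan the raw bytes for the newline byte 0x0A and never decode anything except complete lines. `recoverByBytes` and `RecoveryResult` are hypothetical names.

```typescript
import { createReadStream } from "node:fs";

type IndexEntry = { offset: number; length: number };
type RecoveryResult = { validBytes: number; skippedLines: number };

// Hypothetical alternative to the readline version: split on raw 0x0A bytes.
async function recoverByBytes(
  logPath: string,
  index: Map<string, IndexEntry>,
): Promise<RecoveryResult> {
  let carry: Buffer = Buffer.alloc(0); // unterminated bytes left over from the previous chunk
  let offset = 0; // absolute file offset of the next unprocessed line
  let skippedLines = 0;

  for await (const chunk of createReadStream(logPath)) {
    let data: Buffer = Buffer.concat([carry, chunk as Buffer]);
    let newline = data.indexOf(0x0a);
    while (newline !== -1) {
      const length = newline + 1; // the stored length includes the "\n"
      try {
        const event = JSON.parse(data.subarray(0, newline).toString("utf8")) as { id: string };
        index.set(event.id, { offset, length });
      } catch {
        skippedLines += 1; // blank or damaged line: skip it but keep counting bytes
      }
      offset += length;
      data = data.subarray(length);
      newline = data.indexOf(0x0a);
    }
    carry = data;
  }
  // Anything still in `carry` never received its newline: a torn final write.
  return { validBytes: offset, skippedLines };
}
```

`validBytes` is the end of the last newline-terminated line. One option is to truncate the file to that length before reopening it for appends, so a torn tail from a crash cannot merge with the next event written.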

How this connects to real storage systems

Once I had this working, I kept noticing how the same patterns appear everywhere in production storage.

Write-ahead logs. PostgreSQL maintains a WAL, a sequence of records describing every change made to the database. On crash recovery, Postgres replays the WAL from its last checkpoint to get back to a consistent state. What I built is a stripped-down version of the same idea: the log file is the source of truth, and the in-memory index is derived from it.

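Kafka's storage model. Kafka stores messages in append-only partition log files and identifies each record by its offset, a sequential position within the partition. Consumers don't ask "give me message X" by ID. They ask "give me records starting at offset N." Kafka's offsets are logical record numbers rather than byte positions, and brokers keep an index that maps them to positions in the log files. The mental model is close to what I built, just distributed across brokers and replicated.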

LSM trees. Log-structured merge trees (used in LevelDB, RocksDB, Cassandra) are built on a related principle: data on disk is written sequentially and never modified in place, reads go through indexes, and compaction happens in the background to merge files and reclaim space. The append-only constraint isn't a limitation. It's a deliberate design choice that makes writes fast and crash recovery clean.

The architecture end to end

The system has three layers that interact in a straightforward way.

The log file (events.log) is the permanent record. Newline-delimited JSON, one event per line, never modified. If the process dies and the in-memory index goes with it, the log file alone is enough to reconstruct everything.

The in-memory index is a Map that lives only in the running process. It's fast to query and cheap to rebuild. It holds no data — just pointers into the log file.

The HTTP layer is thin Express routes that delegate everything to the store. It doesn't know about files or offsets — it calls store.append(event) and store.get(id) and trusts the store to handle the rest.

TEXT
Write path
POST /events
└─ serialize to JSON line
└─ measure Buffer.byteLength
└─ append to events.log
└─ update index { offset, length }
└─ return 201 + stored event
TEXT
Read path
GET /events/:id
└─ index.get(id)
└─ seek file to offset
└─ read length bytes
└─ JSON.parse + return
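
The boundary between the HTTP layer and the store can be sketched without any framework. The version below deliberately swaps Express for a plain function so it runs with no dependencies. The `EventStore` interface, the `route` function, and the rule that the client supplies a string id are all assumptions made for illustration.

```typescript
type StoredEvent = { id: string; [key: string]: unknown };

// The only surface the HTTP layer is assumed to see: no files, no offsets.
interface EventStore {
  append(event: StoredEvent): Promise<void>;
  get(id: string): Promise<StoredEvent | null>;
}

type Reply = { status: number; body: unknown };

function isStoredEvent(value: unknown): value is StoredEvent {
  return (
    typeof value === "object" &&
    value !== null &&
    typeof (value as { id?: unknown }).id === "string"
  );
}

// Hypothetical framework-agnostic router mirroring the two paths above.
async function route(
  store: EventStore,
  method: string,
  url: string,
  body: unknown,
): Promise<Reply> {
  const prefix = "/events/";
  if (method === "POST" && url === "/events") {
    if (!isStoredEvent(body)) {
      return { status: 400, body: { error: "event must be an object with a string id" } };
    }
    await store.append(body);
    return { status: 201, body };
  }
  if (method === "GET" && url.startsWith(prefix) && url.length > prefix.length) {
    const id = decodeURIComponent(url.slice(prefix.length));
    const event = await store.get(id);
    return event
      ? { status: 200, body: event }
      : { status: 404, body: { error: "event not found" } };
  }
  return { status: 404, body: { error: "no such route" } };
}
```

In an Express app, each route handler would call the store the same way and pass the result to res.status(...).json(...). Because the handler depends only on the interface, tests can hand it an in-memory fake.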

Separating the store from the HTTP layer also made testing significantly easier. The unit tests exercise the store directly without spinning up a server. The integration tests start the full server, write events, restart it, and verify that reads still work after recovery. That restart proof is the most important test in the suite.
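
A store-level version of that restart proof might look like the sketch below. It is not the project's test suite: `MiniStore` is a hypothetical stand-in small enough to fit here, it rebuilds its index by reading the whole file at once (fine for a test-sized log), and it assumes the log has no damaged tail.

```typescript
import { deepStrictEqual } from "node:assert";
import { mkdtemp, open, readFile, rm } from "node:fs/promises";
import type { FileHandle } from "node:fs/promises";
import { tmpdir } from "node:os";
import { join } from "node:path";

type Entry = { offset: number; length: number };
type StoredEvent = { id: string; [key: string]: unknown };

// Hypothetical stand-in for the real store.
class MiniStore {
  private index = new Map<string, Entry>();
  private size = 0;
  private handle: FileHandle | null = null;

  async start(logPath: string): Promise<void> {
    this.handle = await open(logPath, "a+"); // appending writes, positional reads
    const bytes = await readFile(logPath);
    let offset = 0;
    let newline = bytes.indexOf(0x0a, offset);
    while (newline !== -1) {
      const length = newline + 1 - offset;
      const text = bytes.subarray(offset, newline).toString("utf8");
      const event = JSON.parse(text) as StoredEvent;
      this.index.set(event.id, { offset, length });
      offset += length;
      newline = bytes.indexOf(0x0a, offset);
    }
    this.size = offset;
  }

  async append(event: StoredEvent): Promise<void> {
    if (!this.handle) throw new Error("store not started");
    const line = JSON.stringify(event) + "\n";
    const length = Buffer.byteLength(line, "utf8");
    await this.handle.write(line);
    this.index.set(event.id, { offset: this.size, length });
    this.size += length;
  }

  async get(id: string): Promise<StoredEvent | null> {
    const entry = this.index.get(id);
    if (!entry || !this.handle) return null;
    const buffer = Buffer.alloc(entry.length);
    await this.handle.read(buffer, 0, entry.length, entry.offset);
    return JSON.parse(buffer.toString("utf8")) as StoredEvent;
  }

  async stop(): Promise<void> {
    await this.handle?.close();
    this.handle = null;
    this.index.clear();
  }
}

async function restartProof(): Promise<void> {
  const dir = await mkdtemp(join(tmpdir(), "event-store-"));
  const logPath = join(dir, "events.log");
  try {
    const before = new MiniStore();
    await before.start(logPath);
    await before.append({ id: "1", type: "signup", note: "h\u00e9llo \u{1F600}" });
    await before.append({ id: "2", type: "login" });
    await before.stop(); // simulate the process going away

    const after = new MiniStore(); // fresh instance: empty index, same file
    await after.start(logPath);
    deepStrictEqual(await after.get("2"), { id: "2", type: "login" });
    deepStrictEqual(await after.get("1"), { id: "1", type: "signup", note: "h\u00e9llo \u{1F600}" });
    deepStrictEqual(await after.get("missing"), null);
  } finally {
    await rm(dir, { recursive: true, force: true });
  }
}
```

Putting a multi-byte payload in the first event is deliberate: if offsets were computed from string length, the read of event 2 after the restart would fail.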

What TypeScript made hard

Node's FileHandle API has some rough edges when you're working with it in TypeScript. The read method signature has changed across Node versions, and the types don't always reflect what the runtime actually does. I spent more time than I'd like to admit fighting nullability around the append handle and figuring out the correct overload for fileHandle.read with a pre-allocated Buffer.

The lesson there wasn't really about TypeScript — it was about reading the actual Node.js documentation instead of guessing from types. The fs.promises API docs have the exact method signatures, buffer behavior, and position semantics spelled out. Once I went there instead of relying on IDE autocomplete, things clicked.
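
For reference, here is a small sketch of the two things that caused the friction: narrowing a nullable handle once, and the positional form of read with each argument named. `requireHandle` and `readExactly` are hypothetical helpers, not part of Node or of the project.

```typescript
import type { FileHandle } from "node:fs/promises";

// Narrow `FileHandle | null` in one place instead of at every call site.
function requireHandle(handle: FileHandle | null): FileHandle {
  if (handle === null) throw new Error("log file is not open");
  return handle;
}

// Reads until `length` bytes have arrived or the file ends.
// Returns fewer bytes than requested only when end of file is hit first.
async function readExactly(
  handle: FileHandle,
  position: number,
  length: number,
): Promise<Buffer> {
  const buffer = Buffer.alloc(length);
  let filled = 0;
  while (filled < length) {
    // read(buffer, offsetInBuffer, bytesToRead, positionInFile)
    const { bytesRead } = await handle.read(
      buffer,
      filled,
      length - filled,
      position + filled,
    );
    if (bytesRead === 0) break; // end of file
    filled += bytesRead;
  }
  return buffer.subarray(0, filled);
}
```

The second argument is an offset into the buffer, while the fourth is the position in the file. Mixing those two up is an easy mistake because both are plain numbers.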

What changed about how I think about databases

Before this project, databases were opaque to me. You connect, you query, data appears. The internals felt irrelevant to using them well.

That's still mostly true for day-to-day work. But knowing what's underneath changes how you reason about tradeoffs. Why does PostgreSQL have a WAL? Because crash recovery requires a log. Why does Kafka address records by offset? Because looking up a known position is cheap and scanning is not. Why do write-heavy systems reach for append-only storage? Because you can't corrupt data you never modify.

These aren't abstract concepts anymore. I implemented tiny versions of all of them in a weekend. The implementation was small, but what I got from it wasn't.

If you want to understand how a database works, don't read about databases. Build one, however small, with whatever constraints you can find. The constraint here was "no database" — which turned out to be exactly the right constraint for learning what databases do.

The source code is on GitHub if you want to read through it: github.com/DammyCodes-all. The store layer is small enough to read in one sitting, and the tests are a good place to start if you want to understand how the pieces fit together.
