Engineering a compliance-grade SaaS: multi-tenancy, RBAC, and audit trails
How I replaced a Belgian accounting practice's email-and-spreadsheet workflow with one type-safe API serving three role-scoped portals — a full Belgian AML compliance module, 92+ RBAC permissions, and an append-only audit trail underneath.
Regulated accounting work is, at its core, an evidence problem. A firm has to prove who its clients are, that it screened them, and when it did so — and it has to be able to reproduce that proof years later for a supervisor. The firm I built this platform for was carrying all of that on email threads, shared drives, and a sprawl of spreadsheets. It worked, until you asked the one question regulators always ask: *show me the record*.
Under Belgium's anti-money-laundering law (the Anti-witwaswet, AWW) and the professional norms of the institute that supervises accountants (ITAA), an accounting practice is a gatekeeper. It owes auditable client identification, ongoing sanctions and politically-exposed-person screening, a documented risk assessment, and record-keeping that survives the loss of any one employee's inbox. Ad-hoc tooling cannot guarantee any of that — not because spreadsheets are bad, but because they have no notion of *who changed what, when, and under what authority*.
So I built a single platform to be the firm's source of truth: one type-safe API serving three role-scoped portals, a complete Belgian AML compliance lifecycle, real-time cross-portal collaboration, and an append-only audit trail under every sensitive action. This post is the engineering teardown — anonymized, but technically faithful. I'll walk through the multi-tenant architecture, an RBAC model that stays sane at 92+ permissions, the compliance module, and the real-time layer, then close with the lessons that generalize to any regulated SaaS.
This is a real, in-production client project under confidentiality. The firm, its people, and the product's brand name are deliberately withheld. Every number here (three portals, 36+ modules, 80+ events, 92+ permissions) is a public, client-approved figure. Code samples are generic illustrations of the patterns, not copies of the client's source.
1. One API, three portals
The hardest product decision came first: how many applications is this, really? A naive read says three — a back-office app for the firm, a tool for delegated staff, and a self-service app for the firm's clients. Build it that way and you get three codebases, three drifting copies of the same business rules, and a compliance nightmare where the audit trail depends on which app happened to write the row.
I built it as one modular NestJS 11 API with one PostgreSQL schema behind Prisma 7, fronted by a single React 19 SPA that renders three role-scoped experiences. The portals differ in what they expose, never in the logic underneath. A client uploading a document and an assistant approving it hit the same service, the same validation, the same audit hook — they just see different surfaces. 36+ feature modules live in that one API; the portal you're in decides which ones light up.
1.1 Three portals, one type-safe contract
Each portal maps to a role: Admin (the firm — operations, client management, staff, periods, compliance oversight), Assistant (delegated staff — assigned work, validation, task management), and Client (the firm's own clients — multi-company self-service for documents, questionnaires, and submissions). Capabilities are additive and role-scoped: a client can submit, an assistant can validate, an admin can configure.
| Capability | Admin | Assistant | Client |
|---|---|---|---|
| Manage clients & staff | Full | View assigned | — |
| Upload documents | Yes | Yes | Own company only |
| Validate / approve submissions | Yes | Yes (assigned) | — |
| Submit documents & questionnaires | Yes | — | Yes |
| Run AML identification & screening | Yes | Limited | — |
| Configure roles & permissions | Yes | — | — |
| Read the audit trail | Yes | — | — |
| Multi-company switching | All clients | Assigned | Own companies |
The contract between SPA and API is type-safe end to end. Prisma-generated types flow from the database through the service layer and pin the shape at compile time; Zod schemas and class-validator DTOs enforce the same shape at the boundary at runtime. The frontend consumes it through typed TanStack Query hooks, with Zustand holding session state. There is no hand-written `any` at the seam between client and server, which matters enormously when a single shape (say, a client's identification record) is read by three different portals.
1.2 Tenant data isolation
The trust boundary that matters most: a client must never see another client's data, and an assistant must only see the clients assigned to them. I treat the firm's clients (and, one level down, each client's companies) as the unit of tenancy. Isolation is enforced at the data-access layer — every client-scoped query is constrained by the caller's identity and assignment, never by what the UI chose to request.
Concretely, authorization is layered: a JWT auth guard establishes *who you are*, role guards establish *what kind of actor you are*, a resource-owner guard establishes *whether this specific record is yours to touch*, and the permission engine establishes *whether your role grants this action*. A request only succeeds when all four agree. The portal you're in is a UI affordance, not a security control — the same record requested by the wrong actor is rejected at the service layer regardless of which portal asked.
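The four layers can be sketched as a chain of small checks where the first failure rejects the request. This is a framework-free illustration, not the platform's guards: `Ctx`, `Resource`, `authorize`, and the rule that admins may touch any record are all assumptions made for the example.

```typescript
// Hypothetical request context; field names are illustrative only.
type Role = "admin" | "assistant" | "client";
type Ctx = { userId: string | null; role: Role; permissions: string[] };
type Resource = { ownerId: string; assignedTo: string[] };

// Each layer returns null to pass, or a reason string to reject.
type Check = (ctx: Ctx, res: Resource) => string | null;

// Layer 1: who you are (stands in for the JWT auth guard).
const authenticated: Check = (ctx) =>
  ctx.userId !== null ? null : "unauthenticated";

// Layer 2: what kind of actor you are.
const roleIn = (...roles: Role[]): Check => (ctx) =>
  roles.includes(ctx.role) ? null : "role not allowed";

// Layer 3: whether this record is yours to touch.
// Assumption: admins may touch any record; others must own it or be assigned.
const ownsOrAssigned: Check = (ctx, res) => {
  if (ctx.userId === null) return "unauthenticated";
  const allowed =
    ctx.role === "admin" ||
    res.ownerId === ctx.userId ||
    res.assignedTo.includes(ctx.userId);
  return allowed ? null : "not your record";
};

// Layer 4: whether the resolved permissions grant this action.
const hasPermission = (key: string): Check => (ctx) =>
  ctx.permissions.includes(key) ? null : `missing permission: ${key}`;

// A request succeeds only when every layer agrees; the first failure wins.
function authorize(ctx: Ctx, res: Resource, checks: Check[]): string | null {
  for (const check of checks) {
    const reason = check(ctx, res);
    if (reason !== null) return reason;
  }
  return null;
}
```

Note that nothing in the chain knows which portal sent the request, which is the point: the portal never participates in the decision.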
The cheapest place to leak data is a list endpoint that trusts a client-supplied filter. Bind every tenant-scoped query to the authenticated subject server-side. If the UI wants to show 'my companies,' the server should already only be capable of returning the caller's companies — the filter is a convenience, never the boundary.
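A minimal sketch of that rule, with an in-memory array standing in for the real database query. The `Caller` and `Company` shapes and the `listCompanies` helper are hypothetical names chosen for the example; the key property is that the client-supplied filter can only narrow the result, never widen it.

```typescript
// Hypothetical shapes for illustration; the real schema is not shown in this post.
type Caller =
  | { role: "admin" }
  | { role: "assistant"; assignedClientIds: string[] }
  | { role: "client"; clientId: string };

type Company = { id: string; clientId: string; name: string };

// The tenant scope is derived from the authenticated caller and nothing else.
function tenantScope(caller: Caller): (c: Company) => boolean {
  switch (caller.role) {
    case "admin":
      return () => true;
    case "assistant": {
      const assigned = caller.assignedClientIds;
      return (c) => assigned.includes(c.clientId);
    }
    case "client": {
      const ownId = caller.clientId;
      return (c) => c.clientId === ownId;
    }
  }
}

// The optional filter is a convenience applied inside the scope, not the boundary.
function listCompanies(
  all: Company[],
  caller: Caller,
  filter?: { clientId?: string },
): Company[] {
  const inScope = tenantScope(caller);
  return all.filter(
    (c) =>
      inScope(c) &&
      (filter?.clientId === undefined || c.clientId === filter.clientId),
  );
}
```

With this shape, a client asking for another client's id simply gets an empty list, because the scope predicate has already excluded those rows.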
2. RBAC that stays sane at 92+ permissions
Three roles sounds simple. Then the firm asks: "Can this one assistant approve questionnaires but not delete clients?" and "Give this senior the audit log but nothing else extra." Coarse roles can't express that without spawning a new role for every exception — and role explosion is how RBAC systems rot. The platform settled on 92+ fine-grained permissions with a three-tier resolution model that keeps per-user flexibility without abandoning role defaults.
Permissions are named `domain.action`, for example `documents.upload`, `questionnaires.approve`, `audit.read`. Roles are bundles of those keys. On top of a role, an individual user can carry explicit overrides: a grant that adds a permission their role lacks, or a revoke that takes one away. Resolution runs in strict priority order: a per-user override wins; absent that, the role's grants apply (with wildcard support, so `documents.*` covers every action in the `documents` domain); absent that, a legacy fallback. The result is cached briefly per user-and-permission so the check is cheap on the hot path, and the cache is invalidated the instant permissions change.
```typescript
// Generic illustration of a domain.action RBAC check.
// Priority: explicit per-user override > role grant (wildcards) > deny.
type Override = { key: string; granted: boolean };

function matches(required: string, owned: string[]): boolean {
  for (const grant of owned) {
    if (grant === required) return true;
    // Wildcard: "documents.*" covers "documents.upload"
    if (grant.endsWith(".*") && required.startsWith(grant.slice(0, -1))) {
      return true;
    }
  }
  return false;
}

function can(
  required: string, // e.g. "questionnaires.approve"
  rolePermissions: string[], // bundle granted by the user's role
  userOverrides: Override[], // explicit per-user grants/revokes
): boolean {
  // 1. An explicit per-user override always wins (grant OR revoke).
  const override = userOverrides.find((o) => o.key === required);
  if (override) return override.granted;

  // 2. Otherwise fall back to the role bundle (with wildcard support).
  return matches(required, rolePermissions);
}
```

In the running system this lives in a guard. An endpoint declares its requirement with a decorator, `@RequirePermission('questionnaires.approve')`, and a single `PermissionGuard` reads that metadata, resolves the caller's effective permissions, and either lets the request through or throws a 403 naming the missing permission. The controller method stays clean; the authorization decision is declarative and lives next to the route it protects.
- Make permissions first-class rows, not hard-coded enums sprinkled through controllers. Then a non-engineer can compose a role in an admin screen without a deploy.
- Support per-user overrides *on top of* roles. It's the pressure valve that prevents role explosion: you grant one exception instead of cloning a role.
- Use a `domain.action` naming scheme with wildcards. `audit.read` reads at a glance, and `audit.*` future-proofs the role bundle when you add `audit.export` later.
- Cache the resolved decision, but invalidate aggressively. A permission change a user can't feel until a 5-minute TTL expires is a support ticket and, for a revoke, a small security gap.
That last point bit me once and is worth dwelling on: an early version invalidated the permission cache and pushed the live "refetch your permissions" event as fire-and-forget, *after* returning 200 to the admin who made the change. The admin saw success, but affected users kept their stale cached permissions until the TTL lapsed. The fix was to await invalidation before responding — the admin save is deliberately synchronous, so paying ~100ms to guarantee correctness is the right trade. In a regulated system, a revoke that doesn't take effect promptly isn't a UX nit; it's a control that didn't fire.
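The ordering fix can be sketched as follows. This is an illustration under assumptions, not the production code: `PermissionCache` is an in-memory stand-in (a shared cache would have a genuinely async API), `savePermissions` and its `persist`/`notify` callbacks are hypothetical, and awaiting the notification as well as the invalidation is a choice made for the example.

```typescript
type CacheEntry = { allowed: boolean; expiresAt: number };

// In-memory stand-in for a short-TTL cache keyed by user and permission.
class PermissionCache {
  private entries = new Map<string, CacheEntry>();
  private ttlMs: number;
  private now: () => number;

  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(userId: string, permission: string): boolean | undefined {
    const key = `${userId}:${permission}`;
    const hit = this.entries.get(key);
    if (hit === undefined) return undefined;
    if (hit.expiresAt <= this.now()) {
      this.entries.delete(key); // expired: treat as a miss
      return undefined;
    }
    return hit.allowed;
  }

  set(userId: string, permission: string, allowed: boolean): void {
    const expiresAt = this.now() + this.ttlMs;
    this.entries.set(`${userId}:${permission}`, { allowed, expiresAt });
  }

  // Drop every cached decision for one user.
  async invalidateUser(userId: string): Promise<void> {
    for (const key of Array.from(this.entries.keys())) {
      if (key.startsWith(`${userId}:`)) this.entries.delete(key);
    }
  }
}

// The response is only produced after invalidation has completed.
async function savePermissions(
  userId: string,
  persist: () => Promise<void>,
  cache: PermissionCache,
  notify: (userId: string) => Promise<void>,
): Promise<number> {
  await persist();
  await cache.invalidateUser(userId); // awaited, not fire-and-forget
  await notify(userId); // "refetch your permissions" nudge
  return 200;
}
```

The buggy variant differs only in dropping the two `await`s after `persist()`, which is why it is easy to miss in review.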
3. The compliance module
This is the part that justified building bespoke software at all. The compliance module implements the firm's full Belgian AML (AWW) obligations as a first-class workflow, not a folder of PDFs. It is by far the most complex area of the system — a multi-stage lifecycle per client, fed by live business-registry lookups and external screening services, all of it producing durable evidence.
3.1 The AML lifecycle: identify, verify, screen, score
For each client the firm onboards, the module walks a defined lifecycle and records the outcome of every step. The shape of it:
- Identification. Capture the client and its representatives, then validate the legal entity against Belgium's company register (KBO/BCE) and its VAT identity against the EU's VIES service. A registry mismatch is a signal, not a paperwork error — so the lookup result is stored, not just consumed.
- Ultimate beneficial owners (UBO). Resolve and record the people who ultimately own or control the entity, cross-checked against the UBO register. This is where shell structures reveal themselves, so the ownership chain is captured as evidence.
- Screening. Run each relevant person against politically-exposed-person (PEP) lists, international sanctions lists, and adverse-media sources. Screening is not one-and-done: it reruns on a schedule (ongoing vigilance), because a clean client today can appear on a list tomorrow.
- Risk scoring. Combine the above into a structured risk assessment across four pillars — client-related, sector, geography, and service — yielding a risk level (e.g. standard vs. enhanced due diligence) with every contributing factor recorded.
- Output & monitoring. Generate the engagement documentation, schedule periodic rechecks, and keep an investigation file with configurable retention — so the whole assessment can be reconstructed on demand.
The external integrations were the genuinely hard engineering. Government and registry services are slow, occasionally down, and unforgiving about request shape (one of them speaks a SOAP dialect that took real effort to satisfy). I wrapped each behind a resilient client with timeouts, rate limiting, retries, and a circuit breaker, so a flaky upstream degrades one screening check rather than wedging the whole onboarding. Crucially, the *result* of each call — including failures — is persisted. In a compliance context, "we tried to screen and the sanctions service was down" is itself a fact you must be able to prove you recorded.
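A simplified sketch of that wrapper, assuming a consecutive-failure circuit breaker and immediate retries (rate limiting and backoff are left out for brevity). `CircuitBreaker`, `guardedCall`, and the `Outcome` shape are hypothetical names; the property worth noticing is that `record` is called exactly once per check, whether the call succeeded, failed, or was skipped because the circuit was open.

```typescript
// Every path produces an outcome that can be persisted as evidence.
type Outcome<T> =
  | { status: "ok"; value: T; at: number }
  | { status: "failed"; error: string; at: number }
  | { status: "skipped"; error: string; at: number };

class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;
  private threshold: number;
  private cooldownMs: number;

  constructor(threshold: number, cooldownMs: number) {
    this.threshold = threshold;
    this.cooldownMs = cooldownMs;
  }

  // Closed, or open long enough to allow a single trial call (half-open).
  canRequest(now: number): boolean {
    if (this.openedAt === null) return true;
    return now - this.openedAt >= this.cooldownMs;
  }

  recordSuccess(): void {
    this.failures = 0;
    this.openedAt = null;
  }

  recordFailure(now: number): void {
    this.failures += 1;
    if (this.failures >= this.threshold) this.openedAt = now;
  }
}

async function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
  });
  try {
    return await Promise.race([work, timeout]);
  } finally {
    clearTimeout(timer);
  }
}

async function guardedCall<T>(
  breaker: CircuitBreaker,
  call: () => Promise<T>,
  record: (outcome: Outcome<T>) => void,
  opts: { retries: number; timeoutMs: number } = { retries: 2, timeoutMs: 5000 },
  now: () => number = () => Date.now(),
): Promise<Outcome<T>> {
  let outcome: Outcome<T>;
  if (!breaker.canRequest(now())) {
    outcome = { status: "skipped", error: "circuit open", at: now() };
  } else {
    outcome = { status: "failed", error: "not attempted", at: now() };
    for (let attempt = 0; attempt <= opts.retries; attempt++) {
      try {
        const value = await withTimeout(call(), opts.timeoutMs);
        breaker.recordSuccess();
        outcome = { status: "ok", value, at: now() };
        break;
      } catch (err) {
        breaker.recordFailure(now());
        const error = err instanceof Error ? err.message : String(err);
        outcome = { status: "failed", error, at: now() };
        if (!breaker.canRequest(now())) break; // circuit just opened: stop retrying
      }
    }
  }
  record(outcome); // persisted on success, failure, and skip alike
  return outcome;
}
```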
A risk level that's just a number is useless to an auditor. The model records each pillar's contributing factors — a high-risk NACE sector, a non-EU jurisdiction, a PEP hit on a beneficial owner — so the score is explainable after the fact. The output isn't 'risk: high'; it's 'risk: enhanced, because X, Y, Z,' with the evidence attached.
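One way to keep the score explainable is to make the factors, not the level, the primary data. The sketch below is hypothetical throughout: the `Factor` shape, the factor codes, and especially the rule that any elevated factor yields enhanced due diligence are stand-ins for whatever the firm's documented risk methodology actually prescribes.

```typescript
type Pillar = "client" | "sector" | "geography" | "service";

type Factor = { pillar: Pillar; code: string; note: string; elevated: boolean };

type Assessment = {
  level: "standard" | "enhanced";
  byPillar: Record<Pillar, Factor[]>; // every factor kept, grouped by pillar
  because: string[]; // human-readable reasons behind the level
};

// Hypothetical rule for illustration: any elevated factor means enhanced due diligence.
function assess(factors: Factor[]): Assessment {
  const byPillar: Record<Pillar, Factor[]> = {
    client: [],
    sector: [],
    geography: [],
    service: [],
  };
  for (const f of factors) byPillar[f.pillar].push(f);

  const elevated = factors.filter((f) => f.elevated);
  return {
    level: elevated.length > 0 ? "enhanced" : "standard",
    byPillar,
    because: elevated.map((f) => `${f.pillar}: ${f.note}`),
  };
}
```

Because the assessment carries its own reasons, rendering "enhanced, because X, Y, Z" is a formatting step rather than a reconstruction.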
3.2 The append-only audit trail
Underneath every sensitive action is an append-only audit log. Create or change a client record, make an AML decision, upload a document, alter someone's permissions — each writes an immutable entry capturing the actor, the action, the resource (type and id), the result (success or failure), the change set, and request context like IP and user agent. Nothing in the application updates or deletes these rows; the table only grows. Retention is handled by an explicit, separate lifecycle process, never by ad-hoc edits.
Two design choices made this trustworthy and cheap. First, audit writes are non-blocking: the action enqueues an audit job (via Redis/BullMQ) and returns; a background worker persists the row. Audit logging must never fail or slow a business operation — but it must also never silently drop one, so the queue gives durability and retries without putting the database write on the request's critical path. Second, failures are logged too. A denied permission check or a rejected upload is exactly the kind of event a supervisor cares about, so the trail records the attempt, not just the success.
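The shape of that pipeline, sketched with in-memory stand-ins so it runs on its own. `AuditEntry` follows the fields described above, but its exact names are assumptions, and `AuditQueue` is a toy substitute for the durable Redis/BullMQ queue: it shows the contract (enqueue returns immediately, a worker retries, nothing is silently discarded), not the real queue API.

```typescript
// Field names are illustrative, following the description in the text.
type AuditEntry = {
  actorId: string;
  action: string; // e.g. "client.update"
  resource: { type: string; id: string };
  result: "success" | "failure";
  changes: Record<string, { before: unknown; after: unknown }>;
  context: { ip: string; userAgent: string };
  at: string; // ISO timestamp
};

// Insert-only store: there is deliberately no update or delete method.
class AuditStore {
  private rows: AuditEntry[] = [];

  append(entry: AuditEntry): void {
    this.rows.push(Object.freeze({ ...entry }));
  }

  all(): readonly AuditEntry[] {
    return this.rows;
  }
}

type Job = { entry: AuditEntry; attempts: number };

// In-memory stand-in for a durable queue with retries.
class AuditQueue {
  private pending: Job[] = [];
  private dead: Job[] = []; // exhausted jobs are kept for inspection, never dropped

  // Called on the request path: cheap, and it never touches the database.
  enqueue(entry: AuditEntry): void {
    this.pending.push({ entry, attempts: 0 });
  }

  // Worker step: persist what it can, keep failures for a later retry.
  drain(store: { append(entry: AuditEntry): void }, maxAttempts = 3): number {
    const retry: Job[] = [];
    let written = 0;
    for (const job of this.pending) {
      try {
        store.append(job.entry);
        written += 1;
      } catch {
        job.attempts += 1;
        (job.attempts < maxAttempts ? retry : this.dead).push(job);
      }
    }
    this.pending = retry;
    return written;
  }

  size(): number {
    return this.pending.length;
  }

  deadLetters(): readonly Job[] {
    return this.dead;
  }
}
```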
- Don't let the trail mutate. The moment a row can be updated or deleted from the app, it stops being evidence. Make audit writes insert-only and route retention through a deliberate, logged process, not a DELETE someone can run.
- Don't drop entries under load. Fire-and-forget without durability means the one record you needed is the one that vanished during a spike. A persistent queue with retries gives you non-blocking writes *and* a guarantee.
- Watch for audit feedback loops. When audit events themselves emit real-time notifications, those notifications can be auditable actions, which emit again. I had to add explicit guards so audit-derived events don't re-enter the audit pipeline and flood the queue.
- Capture before/after, not just 'changed'. "User X updated client Y" is nearly worthless in a review. Record the change set so the question 'what exactly changed?' has an answer years later.
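Two of those points are small enough to sketch directly. `diff` builds a shallow before/after change set (it compares values via `JSON.stringify`, which is a simplification that is sensitive to key order in nested objects), and `shouldAudit` shows one possible loop guard, assuming each event carries a hypothetical `origin` tag saying whether it came from business logic or from the audit pipeline itself.

```typescript
type ChangeSet = Record<string, { before: unknown; after: unknown }>;

// Shallow change set: one entry per top-level field whose value differs.
function diff(
  before: Record<string, unknown>,
  after: Record<string, unknown>,
): ChangeSet {
  const changes: ChangeSet = {};
  for (const key of Object.keys({ ...before, ...after })) {
    if (JSON.stringify(before[key]) !== JSON.stringify(after[key])) {
      changes[key] = { before: before[key], after: after[key] };
    }
  }
  return changes;
}

// Hypothetical event envelope: `origin` marks where the event was produced.
type DomainEvent = {
  name: string;
  actorId: string;
  origin: "business" | "audit";
};

// Loop guard: events produced by the audit pipeline never re-enter it.
function shouldAudit(event: DomainEvent): boolean {
  return event.origin !== "audit";
}
```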
4. Real-time without chaos
Three portals looking at shared data create a coherence problem: when an assistant approves a submission, the admin's queue and the client's status should both update *now*, without a refresh. The platform handles this with a Socket.io gateway and an event-driven core — 80+ typed domain events that flow whenever state changes.
The model is rooms and audiences. Connections join rooms keyed by identity and scope — a user room, a role room (admin/assistant), a client room, a company room, a period room. Every mutation emits a typed event addressed to the relevant audiences; the most common pattern is a tenant fan-out: notify the admins, the assistants, and the one specific client a change concerns, and nobody else. Because audiences are a small typed vocabulary rather than ad-hoc string concatenation, it's hard to accidentally broadcast a client's data into the wrong room.
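The "small typed vocabulary" idea can be sketched as a discriminated union with a single function that turns an audience into a room name. The `Audience` type, the `kind:id` room format, and `tenantFanOut` are illustrative assumptions rather than the platform's actual naming.

```typescript
// The only audiences an event can be addressed to.
type Audience =
  | { kind: "user"; userId: string }
  | { kind: "role"; role: "admin" | "assistant" }
  | { kind: "client"; clientId: string }
  | { kind: "company"; companyId: string }
  | { kind: "period"; periodId: string };

// Single place where room names are built; no string concatenation elsewhere.
function roomName(audience: Audience): string {
  switch (audience.kind) {
    case "user":
      return `user:${audience.userId}`;
    case "role":
      return `role:${audience.role}`;
    case "client":
      return `client:${audience.clientId}`;
    case "company":
      return `company:${audience.companyId}`;
    case "period":
      return `period:${audience.periodId}`;
  }
}

// Tenant fan-out: admins, assistants, and the one client the change concerns.
function tenantFanOut(clientId: string): Audience[] {
  return [
    { kind: "role", role: "admin" },
    { kind: "role", role: "assistant" },
    { kind: "client", clientId },
  ];
}
```

With Socket.IO, each resulting name would be used as the target of `io.to(room).emit(eventName, payload)`; a typo such as `{ kind: "cleint" }` fails to compile instead of silently addressing a room nobody has joined.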
On the receiving side, the SPA doesn't naively trust pushed payloads to mutate UI — it uses events to invalidate the right query keys, letting TanStack Query refetch authoritative data. That keeps the source of truth on the server while still feeling instant. Optimistic updates make the actor's own change appear immediately, with rollback on failure.
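As a sketch of that receiving side, the function below maps a pushed event to the query keys that should be refetched, rather than writing the payload into UI state. The event names and key layout are hypothetical; only the pattern is taken from the text.

```typescript
// Hypothetical events; the real system has 80+ typed domain events.
type AppEvent =
  | { type: "submission.approved"; clientId: string; submissionId: string }
  | { type: "document.uploaded"; clientId: string; companyId: string };

type QueryKey = readonly string[];

// Pure mapping from event to stale query keys; no payload is trusted for rendering.
function keysToInvalidate(event: AppEvent): QueryKey[] {
  switch (event.type) {
    case "submission.approved":
      return [
        ["submissions", event.clientId],
        ["submission", event.submissionId],
        ["approval-queue"],
      ];
    case "document.uploaded":
      return [
        ["documents", event.companyId],
        ["clients", event.clientId, "summary"],
      ];
  }
}
```

In a TanStack Query app, a socket listener would pass each returned key to `queryClient.invalidateQueries({ queryKey })`, so the refetched server response, not the socket payload, is what the user sees.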
The subtle bug in any real-time system: you make a change, your UI updates optimistically, and then the broadcast of your own change arrives and re-renders you — a flicker, or worse, a fight with your optimistic state. Every emit carries the originating actor's id, so a client can recognize and ignore the echo of its own action while still reacting to everyone else's. Tagging events with their actor is what makes optimistic UI and live broadcast coexist.
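Echo suppression itself is only a few lines. In this sketch the `Envelope` shape and `createReceiver` are hypothetical names; the receiver returns `false` when it ignores an event so the behaviour is easy to observe.

```typescript
// Every pushed event carries the id of the actor who caused it.
type Envelope<T> = { type: string; actorId: string; payload: T };

// Build a receiver bound to the current user's id.
function createReceiver<T>(
  selfId: string,
  onRemoteChange: (event: Envelope<T>) => void,
): (event: Envelope<T>) => boolean {
  return (event) => {
    // Our own change is already on screen optimistically: ignore the echo.
    if (event.actorId === selfId) return false;
    onRemoteChange(event);
    return true;
  };
}
```

Usage is a single call per connection, for example `const receive = createReceiver(currentUserId, handleRemote)`, with `receive` registered as the socket listener.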
The same event stream powers more than UI sync. Permission changes push a 'refetch your permissions' nudge to the affected user's browser so a revoke takes hold immediately. Background jobs (20+ scheduled tasks plus on-demand queues over Redis/BullMQ) drive periodic AML rechecks, reminders, and document-collection chases — the slow, scheduled spine of a compliance workflow, decoupled from the request cycle.
5. Lessons for regulated SaaS
Stripped of the Belgian specifics, here's what generalizes to any vertical where the software has to defend its own decisions — fintech, healthtech, legaltech, anything supervised.
- One API, many surfaces beats many apps. Role-scoped portals over a single type-safe contract gave me one place for every business rule and one audit trail regardless of who acted. Three separate apps would have triplicated the logic and fractured the evidence.
- Compliance is an evidence pipeline, not a feature. The deliverable isn't 'we screened the client' — it's 'here is the durable, reconstructable record that we screened, when, against what, and with what result.' Design for the auditor's question 'show me,' and persist intermediate results, including failures.
- Make the audit trail append-only and non-blocking. Insert-only rows with before/after change sets, written through a durable queue so logging never blocks or silently drops. If a row can be edited from the app, it has stopped being evidence.
- Permissions are data; resolve them in tiers. Fine-grained `domain.action` permissions, composed into roles, with per-user overrides as the pressure valve against role explosion. Cache the resolution, but invalidate it the instant it changes: a stale revoke is a control that didn't fire.
- Treat external compliance services as hostile infrastructure. Registry and screening APIs are slow and flaky. Wrap each in timeouts, rate limits, retries, and a circuit breaker so one bad upstream degrades a single check, not the whole onboarding, and record the outcome either way.
- Tag every real-time event with its actor. It's the small detail that lets optimistic UI and live cross-portal broadcast coexist without echo flicker, and it doubles as provenance for the event log.
- Type-safety at the boundary pays compound interest. When one record is read by three portals and written by one service, Prisma-generated types plus Zod/class-validator at the edge turn a whole class of cross-portal shape bugs into compile errors.
- You don't need a team to ship agency-scale regulated software. This was a solo build (three portals, a full AML module, real-time collaboration, an audit trail) taken to production. The leverage came from ruthless consistency: one contract, one set of patterns, one source of truth.
The firm now runs its fiscal periods, document collection, and AML obligations on a single system it owns outright — one that can answer *show me the record* in seconds instead of a frantic search through inboxes. That's the whole point of compliance-grade software: not that it's prettier than spreadsheets, but that it can prove what happened.