
Engineering Blog

Context-Aware Compliance Intelligence
Precise Guidance With Strong Controls for Sensitive Data

Nimesh Patel · April 8, 2026 · 10 min read

Can you provide a business with context-aware, precise compliance guidance, while keeping sensitive data out of the guidance model?

Short answer: yes. Longer answer: it took six layers of data defense, a context-aware retrieval engine, and more paranoia than any of us expected. This is a look at how BizNerva's compliance AI actually works, including the architecture decisions, the trade-offs, and the stuff we got wrong before we got it right.


Why This Is Hard (And Why Spreadsheets Don't Cut It)

Picture a mid-size company operating in three U.S. states. On any given Tuesday, they might need to track California's workplace violence prevention law, federal OSHA recordkeeping, state-by-state wage requirements, and privacy obligations, all at once. Each framework has its own deadlines, evidence retention windows, training mandates, and audit expectations.

Most companies handle this with a mix of spreadsheets, calendar reminders, and optimism. The lucky ones hire a compliance officer. Everyone else just hopes they don't get audited.

We wanted to build something that actually solves this: an AI assistant that knows your specific compliance posture and tells you exactly what to fix next. Not generic advice. Not a ChatGPT wrapper with a compliance skin. Not "you should probably have an OSHA plan." Specific, grounded, actionable guidance referencing your actual data.

But here's the problem nobody talks about. To give you that kind of guidance, the AI needs to see your compliance data: incident reports, training records, hazard assessments, employee participation logs. That data can include PII and other sensitive information, such as names, SSNs, medical details, and witness narratives. Sending all of that raw to a language model was never an option for us.

16 Regulatory Modules · 100s of PII Fields Mapped · 6 Defense Layers · 17 Security Steps

The Architecture at a Glance

Every handoff in the architecture below represents a data boundary we had to think carefully about.

Layer 1 — Client-Side Security Gate

- File Size + Type Check: per-context limits
- PII Pre-Screen: pattern detection on text files
- Binary File Warning: user chooses OCR mode

Layer 2 — Server-Side Validation

- Magic Bytes & MIME: file integrity verification
- Virus Scan: malware detection (fail-closed)
- Content Disarm: strip macros, JS, EXIF

Layer 3 — PII-Safe Context-Aware AI

- Deep Context Retrieval: module-aware, tenant-scoped
- Hybrid RAG Search: vector similarity + full-text fusion
- 6-Layer PII Redaction: pre-screen → sensitivity classifier → field nulling → regex scrub → JSON walk → LLM gate
- Token Budget Controller: fixed budget, prioritized trim
- LLM: sees only redacted, scoped context

Part 1: The Six Walls Between Your PII and Our AI

When someone uploads an incident report that says "Jane Doe, SSN 123-45-6789, witnessed the altercation," we need the AI to understand "there was a workplace violence incident with a witness" without ever learning Jane's name or Social Security number.

We don't rely on one filter for this. We stack six, each one a safety net for the layers above it. If any single wall has a gap, the next one catches it.

  1. Client-Side Pre-Screen. Before a text-readable file ever leaves the browser, we scan it for common PII patterns. If we detect a match, the upload is blocked and the user sees exactly what was found and why. They can remove the sensitive data and re-upload, or choose not to upload at all. For binary files (PDFs, images), this layer can't inspect the content. Those are caught by the server-side sensitivity classifier and redaction layers after text extraction.
  2. Sensitivity Classifier. A fast, lightweight LLM reads the beginning of the document and classifies it: public, internal, confidential, or restricted. Restricted documents, such as those containing raw SSNs, medical diagnoses, or bank account numbers, are blocked by policy. They never reach the guidance model.
  3. Structured Field Nulling. We maintain an explicit map of hundreds of PII fields across dozens of database tables. When we pull your compliance records to build AI context, every mapped field (names, emails, phone numbers, witness lists, employee IDs) gets set to null before it enters the prompt. This isn't pattern matching. It's a hand-curated, table-by-table field registry that evolves with every schema migration.
  4. Regex Text Scrub. Free-form text (document content, user chat messages) gets pattern-matched for common PII types. Each match is replaced with a redaction placeholder. Best-effort, yes, but it catches the common patterns that structured nulling can't reach.
  5. Deep JSON Walk. Our AI pipeline produces intermediate artifacts like classifier outputs, extraction results, and decision traces. Before any of these are written to the database, we recursively walk every JSON object and apply text-level scrubbing to every string leaf. No PII leaks into our audit trail.
  6. The LLM Client Gate. The final wall for text-based AI calls. Every chat message, system prompt, conversation context, and uploaded document content passes through PII redaction inside our LLM client wrapper before reaching the model. For binary documents (scanned PDFs, images), OCR extraction requires the model to see the raw content. That's the tradeoff we put in the user's hands via the Quality vs. PII-Safe mode choice described below. But once text is extracted, the gate catches it.

Each layer has its own failure mode tuned to the risk it guards against. Some fail toward caution, others toward availability. Getting those defaults right matters more than you'd think.
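As a rough sketch of how the regex scrub and the deep JSON walk compose, here is a minimal version in Python. The patterns and placeholder format are illustrative, not our production rules:

```python
import re

# Illustrative pattern set -- the production registry is much larger.
PII_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def scrub_text(text: str) -> str:
    """Replace each PII match with a typed redaction placeholder."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[REDACTED:{label}]", text)
    return text

def scrub_json(value):
    """Recursively walk a JSON-like object, scrubbing every string leaf."""
    if isinstance(value, str):
        return scrub_text(value)
    if isinstance(value, dict):
        return {k: scrub_json(v) for k, v in value.items()}
    if isinstance(value, list):
        return [scrub_json(v) for v in value]
    return value  # numbers, booleans, None pass through unchanged
```

The important property is that intermediate artifacts and free-form text go through the exact same scrubber, so a pattern added for chat messages automatically protects the audit trail too.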

Why six layers instead of one good one?

Because no single layer can be perfect. Regex misses novel PII formats. Field maps miss free-text. Classifiers have confidence thresholds. Each layer catches what the others miss. We run hundreds of automated PII-safety tests against all six layers in CI, on every pull request and every deploy. A regression in any one layer lights up the build before it gets anywhere near production.
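A stripped-down version of what one of those CI checks looks like. The cases and the stand-in redactor here are illustrative; the real suite exercises all six layers:

```python
import re

SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")

def redact(text: str) -> str:
    # Minimal stand-in redactor; the production pipeline stacks six layers.
    return EMAIL.sub("[REDACTED]", SSN.sub("[REDACTED]", text))

# Each case pairs a raw string with fragments that must never survive redaction.
CASES = [
    ("Jane Doe, SSN 123-45-6789, witnessed the altercation", ["123-45-6789"]),
    ("Contact the coordinator at jdoe@example.com", ["jdoe@example.com"]),
]

def test_no_pii_leaks():
    for raw, forbidden in CASES:
        redacted = redact(raw)
        for fragment in forbidden:
            assert fragment not in redacted, f"PII leaked: {fragment!r}"
```

Testing for the *absence* of forbidden fragments, rather than for an exact redacted output, keeps the tests stable when placeholder formats change.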

One design choice worth calling out: encryption at rest is separate from all of this. PII columns in our database are encrypted at rest with automatic encryption triggers. The six walls above are about what the AI sees. Encryption is about what happens if someone gets access to the database itself. They're complementary, not redundant.

The OCR Tradeoff: Quality vs. Zero Exposure

There's one scenario where the six walls face a genuine tension: scanned PDFs and images. A photograph of an incident report can't be redacted before extraction because the PII is embedded in pixels, not text. To extract text from these documents, an AI vision model needs to see the raw image.

We decided this tradeoff belongs to the user, not to us. When you upload documents for AI review, you choose between two extraction modes: Quality mode, where a vision model reads the raw file to extract text, and PII-Safe mode, which keeps the raw content away from the model entirely.

Text-based PDFs (the majority of compliance documents) work identically in both modes, since the text layer is extracted locally with no AI involved. The choice only matters for scanned or image-based files. And regardless of which mode you pick, the extracted text still passes through the server-side redaction layers (sensitivity classification, regex scrub, and the LLM client gate) before reaching the guidance model.


Part 2: Context-Aware Retrieval. Not Just RAG, But the Right RAG

"You should have an OSHA safety plan" is useless. What a compliance manager actually needs to hear is: "Your Form 300A annual summary is 12 days overdue. You have 3 stale evidence items in your NERC CIP program. And California just proposed an amendment to SB 553 that may change your hazard assessment frequency."

Getting to that level of specificity took three things working together.

Module Detection via Keyword Maps

When a user asks a question, we detect which compliance modules are relevant using lightweight pattern matching. "What's our WVPP training rate?" matches the workplace violence prevention module. "Are we current on CIP requirements?" matches NERC. No expensive inference call needed for routing.
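A minimal sketch of that routing step. The module names and keywords here are invented for illustration:

```python
# Hypothetical keyword-to-module routing map.
MODULE_KEYWORDS = {
    "workplace_violence": ["wvpp", "workplace violence", "sb 553"],
    "nerc_cip": ["nerc", "cip"],
    "osha_recordkeeping": ["osha", "form 300", "recordkeeping"],
}

def detect_modules(question: str) -> list[str]:
    """Return every module whose keywords appear in the question (cheap routing)."""
    q = question.lower()
    return [module for module, keys in MODULE_KEYWORDS.items()
            if any(k in q for k in keys)]
```

A production version would want word-boundary matching so short keys like "cip" don't fire inside unrelated words, but the shape is the same: a dictionary lookup instead of an inference call.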

Deep Context Retrieval

Once we know which modules are relevant, we pull the organization's actual data for those modules: compliance plans, open tasks, evidence items with freshness status, hazard assessments, training completion rates, incident history. Every query is explicitly scoped by organization ID with strict tenant isolation.

The retrieved data goes through the server-side PII redaction layers before it's assembled into the prompt. By the time the LLM sees it, it knows your WVPP plan is in version 3 and was approved on January 15th, but it has no idea who approved it or who the safety coordinator is.
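The structured field nulling in that path amounts to a registry lookup per row. A minimal sketch, with hypothetical table and column names:

```python
# Hand-curated field registry sketch; the real one spans dozens of tables
# and evolves with every schema migration.
PII_FIELDS = {
    "incident_reports": {"reporter_name", "witness_list", "employee_id"},
    "training_records": {"employee_name", "employee_email"},
}

def null_pii_fields(table: str, row: dict) -> dict:
    """Null every registered PII column before the row enters an AI prompt."""
    protected = PII_FIELDS.get(table, set())
    return {col: (None if col in protected else val) for col, val in row.items()}
```

Because the registry is explicit rather than pattern-based, a schema migration that adds a new sensitive column fails review until the registry is updated to cover it.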

Hybrid Search: Vectors Meet Full-Text

For document retrieval from the evidence vault, we use a technique called Reciprocal Rank Fusion (RRF). It combines two search strategies: semantic vector similarity and full-text keyword search.

RRF ranks each result set independently and then merges them. A document that scores well in both strategies floats to the top. The result is more robust retrieval than either approach alone.
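RRF itself is only a few lines. A sketch, using the standard k=60 constant from the original RRF paper:

```python
def rrf_fuse(vector_hits: list[str], fulltext_hits: list[str], k: int = 60) -> list[str]:
    """Merge two ranked result lists with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) across every list it appears in,
    so agreement between the two strategies pushes a document up the fused list.
    """
    scores: dict[str, float] = {}
    for hits in (vector_hits, fulltext_hits):
        for rank, doc_id in enumerate(hits, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF only needs ranks, not raw scores, it sidesteps the problem of normalizing cosine similarities against full-text relevance scores.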

Token Budget Controller

You can't dump everything into the prompt. We allocate a context budget across multiple sources and intelligently trim when space is tight, ensuring the most relevant information always makes it into the prompt.
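A toy version of that trimming logic. The real controller counts actual model tokens and carries richer per-source priorities; this sketch approximates tokens as whitespace-split words:

```python
def assemble_context(sections: list[tuple[str, str]], budget: int) -> str:
    """Fit prioritized (label, text) sections into a rough token budget.

    Sections arrive highest-priority first. A section that does not fit
    whole is trimmed to the remaining budget; anything after the budget
    is exhausted is dropped entirely.
    """
    parts, used = [], 0
    for label, text in sections:
        remaining = budget - used
        if remaining <= 0:
            break
        kept = text.split()[:remaining]
        parts.append(f"## {label}\n{' '.join(kept)}")
        used += len(kept)
    return "\n\n".join(parts)
```

Ordering sections by priority before trimming is what guarantees the most relevant context survives when space runs out.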

The net effect: the LLM sees a precise, PII-scrubbed, budget-controlled snapshot of your compliance reality, scoped to the modules relevant to your question. It answers with grounded references to relevant requirements and your current compliance records.


What We Got Wrong Along the Way

We underestimated the PII surface area. Our first PII field map was embarrassingly small. By the time we audited every migration file, it had grown by an order of magnitude. PII hides in fields you don't expect: a completer's title on a violence incident log, an escort name on a physical security access record, payout details on a partner record. The lesson: PII mapping isn't a one-time exercise. It has to evolve with your schema.

One security gate, not multiple. We originally built PII pre-screening into individual upload components. When a new upload path was added, it didn't get the check. We learned the hard way: one shared function that every upload component calls, enforcing every step in order. Now every upload entry point goes through a single gate covering file size, file type, PII scan, and binary file warning. No component can skip a step.


Where We're Headed

The foundation is solid, but we're nowhere near done. We're improving retrieval accuracy with hybrid search tuning, expanding PII detection patterns beyond regex, and building deeper cross-module context so the AI can connect dots between your OSHA training records and your WVPP hazard assessments without ever seeing who was involved.

We're also building agentic capabilities. Today the AI tells you what needs attention. Soon it will be able to act on your behalf: drafting remediation plans, scheduling reviews, filing reports, updating requirement statuses. Always with your explicit approval and full transparency into what it's doing and why. The goal is not to replace your compliance team, but to give them an assistant that can execute, not just advise.

Making AI genuinely useful for compliance without compromising on data safety is core to how we build. Every layer, every decision, every line of code.


Interested in how we monitor regulatory changes across federal and state sources in real time? Read Regulations Change. Your Compliance Shouldn't Fall Behind.

See it in action

See how BizNerva's compliance AI works for your business. No contracts, no pressure, just a walkthrough with our team.

BizNerva is built by MorPhoe Tech Inc., compliance operations software for businesses that refuse to leave compliance to chance.