
Engineering Blog

Context-Aware Compliance Intelligence
Precise Guidance With Strong Controls for Sensitive Data

Nimesh Patel · April 8, 2026 · 10 min read

Can you provide a business with context-aware, precise compliance guidance, while keeping sensitive data out of the guidance model?

Short answer: yes. Longer answer: it took six layers of data defense, a context-aware retrieval engine, and more paranoia than any of us expected. This is a look at how BizNerva's compliance AI actually works, including the architecture decisions, the trade-offs, and the stuff we got wrong before we got it right.


Why This Is Hard (And Why Spreadsheets Don't Cut It)

Picture a mid-size company operating in three U.S. states. On any given Tuesday, they might need to track California's workplace violence prevention law, federal OSHA recordkeeping, state-by-state wage requirements, and privacy obligations, all at once. Each framework has its own deadlines, evidence retention windows, training mandates, and audit expectations.

Most companies handle this with a mix of spreadsheets, calendar reminders, and optimism. The lucky ones hire a compliance officer. Everyone else just hopes they don't get audited.

We wanted to build something that actually solves this: an AI assistant that knows your specific compliance posture and tells you exactly what to fix next. Not generic advice. Not a ChatGPT wrapper with a compliance skin. Not "you should probably have an OSHA plan." Specific, grounded, actionable guidance referencing your actual data.

But here's the problem nobody talks about. To give you that kind of guidance, the AI needs to see your compliance data: incident reports, training records, hazard assessments, employee participation logs. That data can include PII and other sensitive information, such as names, SSNs, medical details, and witness narratives. Sending all of that raw to a language model was never an option for us.

16 Regulatory Modules · 100s of PII Fields Mapped · 6 Defense Layers · 17 Security Steps

The Architecture at a Glance

Every handoff in the architecture below represents a data boundary we had to think carefully about.

Layer 1 — Client-Side Security Gate

- File Size + Type Check: per-context limits
- PII Pre-Screen: pattern detection on text files
- Binary File Warning: user chooses OCR mode

Layer 2 — Server-Side Validation

- Magic Bytes & MIME: file integrity verification
- Virus Scan: malware detection (fail-closed)
- Content Disarm: strip macros, JS, EXIF

Layer 3 — PII-Safe Context-Aware AI

- Deep Context Retrieval: module-aware, tenant-scoped
- Hybrid RAG Search: vector similarity + full-text fusion
- 6-Layer PII Redaction: pre-screen → sensitivity classifier → field nulling → regex scrub → JSON walk → LLM gate
- Token Budget Controller: fixed budget, prioritized trim
- LLM: sees only redacted, scoped context

Part 1: The Six Walls Between Your PII and Our AI

When someone uploads an incident report that says "Jane Doe, SSN 123-45-6789, witnessed the altercation," we need the AI to understand "there was a workplace violence incident with a witness" without ever learning Jane's name or Social Security number.

We don't rely on one filter for this. We stack six, each one a safety net for the layers above it. If any single wall has a gap, the next one catches it.

  1. Client-Side Pre-Screen. Before a text-readable file ever leaves the browser, we scan it for common PII patterns. If we detect a match, the upload is blocked and the user sees exactly what was found and why. They can remove the sensitive data and re-upload, or choose not to upload at all. For binary files (PDFs, images), this layer can't inspect the content. Those are caught by the server-side sensitivity classifier and redaction layers after text extraction.
  2. Sensitivity Classifier. A fast, lightweight LLM reads the beginning of the document and classifies it: public, internal, confidential, or restricted. Restricted documents, such as those containing raw SSNs, medical diagnoses, or bank account numbers, are blocked by policy. They never reach the guidance model.
  3. Structured Field Nulling. We maintain an explicit map of hundreds of PII fields across dozens of database tables. When we pull your compliance records to build AI context, every mapped field (names, emails, phone numbers, witness lists, employee IDs) gets set to null before it enters the prompt. This isn't pattern matching. It's a hand-curated, table-by-table field registry that evolves with every schema migration.
  4. Regex Text Scrub. Free-form text (document content, user chat messages) gets pattern-matched for common PII types. Each match is replaced with a redaction placeholder. Best-effort, yes, but it catches the common patterns that structured nulling can't reach.
  5. Deep JSON Walk. Our AI pipeline produces intermediate artifacts like classifier outputs, extraction results, and decision traces. Before any of these are written to the database, we recursively walk every JSON object and apply text-level scrubbing to every string leaf. No PII leaks into our audit trail.
  6. The LLM Client Gate. The final wall for text-based AI calls. Every chat message, system prompt, conversation context, and uploaded document content passes through PII redaction inside our LLM client wrapper before reaching the model. For binary documents (scanned PDFs, images), OCR extraction requires the model to see the raw content. That's the tradeoff we put in the user's hands via the Quality vs. PII-Safe mode choice described below. But once text is extracted, the gate catches it.

Each layer has its own failure mode tuned to the risk it guards against. Some fail toward caution, others toward availability. Getting those defaults right matters more than you'd think.
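As a rough sketch of how the regex scrub and the deep JSON walk compose, here is a minimal version in Python. The patterns and placeholder format are illustrative, not our production rules:

```python
import re

# Illustrative pattern set -- the production registry is much larger.
PII_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def scrub_text(text: str) -> str:
    """Replace each PII match with a typed redaction placeholder."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[REDACTED:{label}]", text)
    return text

def scrub_json(value):
    """Recursively walk a JSON-like object, scrubbing every string leaf."""
    if isinstance(value, str):
        return scrub_text(value)
    if isinstance(value, dict):
        return {k: scrub_json(v) for k, v in value.items()}
    if isinstance(value, list):
        return [scrub_json(v) for v in value]
    return value  # numbers, booleans, None pass through unchanged
```

The important property is that intermediate artifacts and free-form text go through the exact same scrubber, so a pattern added for chat messages automatically protects the audit trail too.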

Why six layers instead of one good one?

Because no single layer can be perfect. Regex misses novel PII formats. Field maps miss free-text. Classifiers have confidence thresholds. Each layer catches what the others miss. We run hundreds of automated PII-safety tests against all six layers in CI, on every pull request and every deploy. A regression in any one layer lights up the build before it gets anywhere near production.
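A stripped-down version of what one of those CI checks looks like. The cases and the stand-in redactor here are illustrative; the real suite exercises all six layers:

```python
import re

SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")

def redact(text: str) -> str:
    # Minimal stand-in redactor; the production pipeline stacks six layers.
    return EMAIL.sub("[REDACTED]", SSN.sub("[REDACTED]", text))

# Each case pairs a raw string with fragments that must never survive redaction.
CASES = [
    ("Jane Doe, SSN 123-45-6789, witnessed the altercation", ["123-45-6789"]),
    ("Contact the coordinator at jdoe@example.com", ["jdoe@example.com"]),
]

def test_no_pii_leaks():
    for raw, forbidden in CASES:
        redacted = redact(raw)
        for fragment in forbidden:
            assert fragment not in redacted, f"PII leaked: {fragment!r}"
```

Testing for the *absence* of forbidden fragments, rather than for an exact redacted output, keeps the tests stable when placeholder formats change.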

One design choice worth calling out: encryption at rest is separate from all of this. PII columns in our database are encrypted at rest with automatic encryption triggers. The six walls above are about what the AI sees. Encryption is about what happens if someone gets access to the database itself. They're complementary, not redundant.

The OCR Tradeoff: Quality vs. Zero Exposure

There's one scenario where the six walls face a genuine tension: scanned PDFs and images. A photograph of an incident report can't be redacted before extraction because the PII is embedded in pixels, not text. To extract text from these documents, an AI vision model needs to see the raw image.

We decided this tradeoff belongs to the user, not to us. When you upload documents for AI review, you choose between two extraction modes: Quality mode, where a vision model reads the raw file to extract text, and PII-Safe mode, which keeps the raw content away from the model entirely.

Text-based PDFs (the majority of compliance documents) work identically in both modes, since the text layer is extracted locally with no AI involved. The choice only matters for scanned or image-based files. And regardless of which mode you pick, the extracted text still passes through the server-side redaction layers (sensitivity classification, regex scrub, and the LLM client gate) before reaching the guidance model.


Part 2: Context-Aware Retrieval. Not Just RAG, But the Right RAG

"You should have an OSHA safety plan" is useless. What a compliance manager actually needs to hear is: "Your Form 300A annual summary is 12 days overdue. You have 3 stale evidence items in your NERC CIP program. And California just proposed an amendment to SB 553 that may change your hazard assessment frequency."

Getting to that level of specificity took three things working together.

Module Detection via Keyword Maps

When a user asks a question, we detect which compliance modules are relevant using lightweight pattern matching. "What's our WVPP training rate?" matches the workplace violence prevention module. "Are we current on CIP requirements?" matches NERC. No expensive inference call needed for routing.
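A minimal sketch of that routing step. The module names and keywords here are invented for illustration:

```python
# Hypothetical keyword-to-module routing map.
MODULE_KEYWORDS = {
    "workplace_violence": ["wvpp", "workplace violence", "sb 553"],
    "nerc_cip": ["nerc", "cip"],
    "osha_recordkeeping": ["osha", "form 300", "recordkeeping"],
}

def detect_modules(question: str) -> list[str]:
    """Return every module whose keywords appear in the question (cheap routing)."""
    q = question.lower()
    return [module for module, keys in MODULE_KEYWORDS.items()
            if any(k in q for k in keys)]
```

A production version would want word-boundary matching so short keys like "cip" don't fire inside unrelated words, but the shape is the same: a dictionary lookup instead of an inference call.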

Deep Context Retrieval

Once we know which modules are relevant, we pull the organization's actual data for those modules: compliance plans, open tasks, evidence items with freshness status, hazard assessments, training completion rates, incident history. Every query is explicitly scoped by organization ID with strict tenant isolation.

The retrieved data goes through the server-side PII redaction layers before it's assembled into the prompt. By the time the LLM sees it, it knows your WVPP plan is in version 3 and was approved on January 15th, but it has no idea who approved it or who the safety coordinator is.
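The structured field nulling in that path amounts to a registry lookup per row. A minimal sketch, with hypothetical table and column names:

```python
# Hand-curated field registry sketch; the real one spans dozens of tables
# and evolves with every schema migration.
PII_FIELDS = {
    "incident_reports": {"reporter_name", "witness_list", "employee_id"},
    "training_records": {"employee_name", "employee_email"},
}

def null_pii_fields(table: str, row: dict) -> dict:
    """Null every registered PII column before the row enters an AI prompt."""
    protected = PII_FIELDS.get(table, set())
    return {col: (None if col in protected else val) for col, val in row.items()}
```

Because the registry is explicit rather than pattern-based, a schema migration that adds a new sensitive column fails review until the registry is updated to cover it.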

Hybrid Search: Vectors Meet Full-Text

For document retrieval from the evidence vault, we use a technique called Reciprocal Rank Fusion (RRF). It combines two search strategies: semantic vector similarity and full-text keyword search.

RRF ranks each result set independently and then merges them. A document that scores well in both strategies floats to the top. The result is more robust retrieval than either approach alone.
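RRF itself is only a few lines. A sketch, using the standard k=60 constant from the original RRF paper:

```python
def rrf_fuse(vector_hits: list[str], fulltext_hits: list[str], k: int = 60) -> list[str]:
    """Merge two ranked result lists with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) across every list it appears in,
    so agreement between the two strategies pushes a document up the fused list.
    """
    scores: dict[str, float] = {}
    for hits in (vector_hits, fulltext_hits):
        for rank, doc_id in enumerate(hits, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF only needs ranks, not raw scores, it sidesteps the problem of normalizing cosine similarities against full-text relevance scores.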

Token Budget Controller

You can't dump everything into the prompt. We allocate a context budget across multiple sources and intelligently trim when space is tight, ensuring the most relevant information always makes it into the prompt.
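A toy version of that trimming logic. The real controller counts actual model tokens and carries richer per-source priorities; this sketch approximates tokens as whitespace-split words:

```python
def assemble_context(sections: list[tuple[str, str]], budget: int) -> str:
    """Fit prioritized (label, text) sections into a rough token budget.

    Sections arrive highest-priority first. A section that does not fit
    whole is trimmed to the remaining budget; anything after the budget
    is exhausted is dropped entirely.
    """
    parts, used = [], 0
    for label, text in sections:
        remaining = budget - used
        if remaining <= 0:
            break
        kept = text.split()[:remaining]
        parts.append(f"## {label}\n{' '.join(kept)}")
        used += len(kept)
    return "\n\n".join(parts)
```

Ordering sections by priority before trimming is what guarantees the most relevant context survives when space runs out.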

The net effect: the LLM sees a precise, PII-scrubbed, budget-controlled snapshot of your compliance reality, scoped to the modules relevant to your question. It answers with grounded references to relevant requirements and your current compliance records.


What We Got Wrong Along the Way

We underestimated the PII surface area. Our first PII field map was embarrassingly small. By the time we audited every migration file, it had grown by an order of magnitude. PII hides in fields you don't expect: a completer's title on a violence incident log, an escort name on a physical security access record, payout details on a partner record. The lesson: PII mapping isn't a one-time exercise. It has to evolve with your schema.

One security gate, not multiple. We originally built PII pre-screening into individual upload components. When a new upload path was added, it didn't get the check. We learned the hard way: one shared function that every upload component calls, enforcing every step in order. Now every upload entry point goes through a single gate covering file size, file type, PII scan, and binary file warning. No component can skip a step.


Where We're Headed

The foundation is solid, but we're nowhere near done. We're improving retrieval accuracy with hybrid search tuning, expanding PII detection patterns beyond regex, and building deeper cross-module context so the AI can connect dots between your OSHA training records and your WVPP hazard assessments without ever seeing who was involved.

We're also building agentic capabilities. Today the AI tells you what needs attention. Soon it will be able to act on your behalf: drafting remediation plans, scheduling reviews, filing reports, updating requirement statuses. Always with your explicit approval and full transparency into what it's doing and why. The goal is not to replace your compliance team, but to give them an assistant that can execute, not just advise.

Making AI genuinely useful for compliance without compromising on data safety is core to how we build. Every layer, every decision, every line of code.


Interested in how we monitor regulatory changes across federal and state sources in real time? Read Regulations Change. Your Compliance Shouldn't Fall Behind.

See it in action

See how BizNerva's compliance AI works for your business. No contracts, no pressure, just a walkthrough with our team.

BizNerva is built by MorPhoe Tech Inc., compliance operations software for businesses that refuse to leave compliance to chance.