Guardrails Evaluation System Design
Local mirror of Confluence page Guardrails Evaluation System Design.
- Confluence page ID: 6060376591
- Parent folder ID: 6018662571
- Remote version: 1
- Last remote update: 2026-06-20T10:23:47.793Z
- Sync status: Published to Confluence
Status
Draft for implementation planning.
Context
Palisade is the edge guardrails runtime. It decides whether AI traffic should be allowed, flagged, remediated, or blocked.
This design covers the evaluation system around that runtime. The evaluation system helps the team prove whether guardrail modules, policy profiles, prompts, and judge models behave well before they are used in enforcement.
MLflow is the proposed evaluation platform because it can track datasets, runs, metrics, parameters, artifacts, traces, and LLM evaluation results in one place. It is not the runtime enforcement plane for Palisade.
Goals
- Store versioned evaluation datasets for text, image, and audio guardrail cases.
- Run repeatable evaluations against a mock target first and a Cloudflare-hosted Palisade endpoint later.
- Use an LLM judge with explicit rubrics for all use cases.
- Record predictions, judge outputs, metrics, artifacts, prompts, target versions, and dataset versions in MLflow.
- Compare guardrail quality across modules, policy profiles, modalities, and prototype versions.
- Keep the first implementation small enough to support a PoC.
Non-Goals
- Deploy MLflow in this change.
- Add dataset upload or evaluation scripts in this change.
- Replace Palisade runtime observability.
- Treat LLM judge scores as clinical, regulatory, or production approval by themselves.
- Store raw PHI, secrets, or unrestricted user data in MLflow.
System Overview
The PoC should start with a local MLflow server and local evaluation runner. Once the evaluation workflow is useful, the team can validate whether MLflow itself should run on Cloudflare.
Evaluation Lifecycle
MVP Architecture
The first implementation should be local-first:
| Component | MVP choice | Later option |
|---|---|---|
| MLflow server | Local MLflow tracking server | Cloudflare-hosted MLflow if storage and auth are validated |
| Evaluation runner | Local script | CI job or Cloudflare-triggered batch process |
| Target | mock or local function adapter | Cloudflare Palisade endpoint adapter |
| Dataset storage | Local dataset folder logged to MLflow | Governed object storage |
| Artifact storage | Local or MLflow-managed artifact store | R2 if S3 compatibility works for MLflow |
| Judge | Local LLM call wrapped by evaluator code | MLflow GenAI scorer/custom judge if useful |
This keeps the first PoC focused on evaluation quality, not infrastructure.
Current PoC Source-Of-Truth Decision
The local PoC should treat MLflow as the durable evaluation source of truth after upload:
- SQL tracking metadata stores experiments, runs, params, metrics, tags, and dataset records where the current workflow supports MLflow datasets.
- Artifact storage keeps the uploaded dataset manifests, small media fixtures, predictions, judge outputs, failures, summaries, and evaluation reports.
- Local dataset folders are upload sources, not the long-term reference once MLflow has captured the dataset and artifacts.
- The first committed fixtures should stay small and synthetic: four text cases, four audio cases, and four image cases are enough for smoke validation.
- Larger media corpora should remain local or external until the team explicitly decides they are governed, useful, and small enough to commit or move into approved artifact storage.
This keeps the evaluation workflow inspectable in the MLflow UI while avoiding repository bloat and premature infrastructure choices.
Cloudflare MLflow Deployment Direction
MLflow is a Python web application with a tracking server, backend metadata store, and artifact store. Cloudflare Workers alone are not the right fit for hosting the MLflow server. Cloudflare Containers are the likely Cloudflare-first candidate because they are designed for existing container images and custom runtimes.
Candidate deployment shape:
- Run the MLflow tracking server from a container image.
- Put the MLflow UI/API behind Cloudflare Access.
- Use a database-backed MLflow backend store for experiments, runs, metrics, tags, and traces.
- Use an object artifact store for datasets, predictions, judge outputs, and reports.
- Use Wrangler and Cloudflare container configuration once the account and product availability are confirmed.
Storage questions to validate before implementation:
- MLflow commonly uses SQL-backed metadata stores such as PostgreSQL. D1 should not be assumed compatible until a working SQLAlchemy path is proven.
- R2 has an S3-compatible API and is a good artifact-store candidate, but MLflow client behavior with the R2 endpoint, credentials, signed URLs, and large multimodal artifacts must be tested.
- If the backend store needs PostgreSQL, decide whether to use an approved managed PostgreSQL service, Cloudflare Hyperdrive to an external database, or another Abbott-approved option.
- Confirm whether Cloudflare Containers, R2, Access, and any database path are approved for the data classification used in the PoC.
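As a starting point for the R2 validation, the sketch below assembles the environment variables that MLflow's S3 artifact client reads when pointed at a custom endpoint. It only builds configuration and proves nothing about compatibility; the helper names, the bucket name, and the artifact prefix are hypothetical, and the endpoint pattern assumes the standard R2 S3 API hostname.

```python
def r2_artifact_env(account_id: str, access_key_id: str, secret_access_key: str) -> dict:
    """Build env vars that direct MLflow's S3 artifact client at an R2 endpoint.

    Assumption: R2 is reached through its S3-compatible hostname. Whether MLflow
    behaves correctly against it is exactly what the PoC still has to test.
    """
    return {
        "MLFLOW_S3_ENDPOINT_URL": f"https://{account_id}.r2.cloudflarestorage.com",
        "AWS_ACCESS_KEY_ID": access_key_id,
        "AWS_SECRET_ACCESS_KEY": secret_access_key,
    }


def artifact_root(bucket: str, prefix: str = "mlflow-artifacts") -> str:
    """Return an s3:// URI usable as an MLflow artifact location (prefix is hypothetical)."""
    return f"s3://{bucket}/{prefix.strip('/')}"
```

A validation run would export these variables before starting the tracking server, then try logging one small and one large artifact to see where behavior diverges from S3.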
Dataset Convention
Future scripts should accept one dataset folder per evaluation dataset:

    datasets/
      palisade-safety-smoke/
        dataset.yaml
        text.csv
        images/
          manifest.csv
          case-001.png
        audio/
          manifest.csv
          case-010.wav

dataset.yaml should describe the dataset:
| Field | Purpose |
|---|---|
| `dataset_id` | Stable local identifier |
| `name` | Human-readable dataset name |
| `version` | Dataset version, controlled by the author |
| `owner` | Team or person accountable for the dataset |
| `description` | What the dataset is meant to prove |
| `allowed_data_classification` | Safety boundary for the dataset |
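The metadata fields above could be checked with a small loader like the following sketch. It assumes `dataset.yaml` has already been parsed into a dictionary (YAML parsing is left out to stay within the standard library), and it treats every listed field as required, which the design does not state explicitly. The class and function names are hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class DatasetMetadata:
    dataset_id: str
    name: str
    version: str
    owner: str
    description: str
    allowed_data_classification: str


def parse_metadata(raw: dict) -> DatasetMetadata:
    """Validate a parsed dataset.yaml mapping and return a typed record."""
    required = [f.name for f in fields(DatasetMetadata)]
    # A field counts as missing when it is absent, null, or blank.
    missing = [k for k in required if raw.get(k) is None or not str(raw[k]).strip()]
    if missing:
        raise ValueError(f"dataset.yaml is missing fields: {', '.join(missing)}")
    # Values are stringified so that a YAML number such as 1.0 still works as a version.
    return DatasetMetadata(**{k: str(raw[k]).strip() for k in required})
```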
Each case row should use the same logical fields across modalities:
| Field | Required | Purpose |
|---|---|---|
| `case_id` | Yes | Stable case identifier |
| `modality` | Yes | text, image, audio, or multimodal |
| `input_text` | For text | Text prompt or content to evaluate |
| `input_ref` | For files | Relative path to image or audio asset |
| `expected_output` | Optional | Expected remediation or response |
| `ground_truth_label` | Yes | Expected guardrail outcome |
| `policy_profile` | Yes | Palisade policy profile to evaluate |
| `rubric_id` | Yes | Judge rubric to apply |
| `metadata` | Optional | JSON string for case attributes |
Images and audio should use a manifest file plus local assets. Text can use text.csv or text.jsonl.
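A per-row check for these fields might look like the sketch below. It returns a list of problems instead of raising, so an upload script could report every bad row at once. The function name is hypothetical, and the sketch assumes nothing extra is enforced for `multimodal` cases because the design does not yet say which inputs they require.

```python
import json

MODALITIES = {"text", "image", "audio", "multimodal"}
ALWAYS_REQUIRED = ("case_id", "modality", "ground_truth_label", "policy_profile", "rubric_id")


def validate_case(row: dict) -> list:
    """Return a list of problems for one case row; an empty list means it is acceptable."""
    def value(name: str) -> str:
        return (row.get(name) or "").strip()

    problems = [f"missing {name}" for name in ALWAYS_REQUIRED if not value(name)]
    modality = value("modality")
    if modality and modality not in MODALITIES:
        problems.append(f"unknown modality {modality!r}")
    if modality == "text" and not value("input_text"):
        problems.append("text case needs input_text")
    if modality in {"image", "audio"} and not value("input_ref"):
        problems.append(f"{modality} case needs input_ref")
    if value("metadata"):
        try:
            json.loads(value("metadata"))
        except json.JSONDecodeError:
            problems.append("metadata is not valid JSON")
    return problems
```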
Future Script Contracts
The dataset upload script should:
- Accept a dataset folder path.
- Validate required metadata and manifest fields.
- Verify referenced image and audio files exist.
- Register or log the dataset in MLflow.
- Log source manifests and assets as MLflow artifacts.
- Record dataset version, hash, owner, modality counts, and policy profiles as MLflow tags or params.
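The file checks and summary values in this contract can be computed with the standard library before anything touches MLflow. The sketch below is one possible shape: the helper names are hypothetical, it assumes each `input_ref` is relative to the folder holding its manifest, and it assumes manifests are CSV files with `modality` and `input_ref` columns.

```python
import csv
import hashlib
from collections import Counter
from pathlib import Path


def dataset_hash(folder: Path) -> str:
    """SHA-256 over relative paths and file bytes, visited in sorted order for stability."""
    digest = hashlib.sha256()
    files = (p for p in folder.rglob("*") if p.is_file())
    for path in sorted(files, key=lambda p: p.relative_to(folder).as_posix()):
        digest.update(path.relative_to(folder).as_posix().encode("utf-8"))
        digest.update(b"\0")  # separator so path and content cannot blur together
        digest.update(path.read_bytes())
    return digest.hexdigest()


def missing_assets(manifest: Path) -> list:
    """List input_ref values that do not resolve to a file next to the manifest."""
    with manifest.open(newline="", encoding="utf-8") as handle:
        refs = [row["input_ref"] for row in csv.DictReader(handle) if row.get("input_ref")]
    return [ref for ref in refs if not (manifest.parent / ref).is_file()]


def modality_counts(manifests: list) -> dict:
    """Count cases per modality across all manifest files."""
    counts = Counter()
    for manifest in manifests:
        with manifest.open(newline="", encoding="utf-8") as handle:
            counts.update(row["modality"] for row in csv.DictReader(handle))
    return dict(counts)
```

Inside an MLflow run, the hash and counts could then be recorded with `mlflow.log_params` or `mlflow.set_tags`, and the folder itself with `mlflow.log_artifacts`. Dataset registration is left out here because the design has not fixed that workflow yet.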
The evaluation runner should:
- Accept a dataset ID or folder path.
- Accept a target adapter: `mock`, `local`, or future `cloudflare_endpoint`.
- Accept a rubric config and experiment name.
- Run each dataset case through the selected target.
- Run the LLM judge against the target output and ground truth.
- Log predictions, judge JSON, failures, aggregate metrics, and sampled artifacts to MLflow.
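The core of the runner is a loop over cases, sketched below with the target and judge passed in as plain callables. The function name and record shapes are hypothetical. This sketch also assumes that "failures" means cases where the target or judge raised an error; cases the judge marks as failing stay in the judge output.

```python
from typing import Callable

Target = Callable[[dict], dict]       # case -> target output
Judge = Callable[[dict, dict], dict]  # (case, target output) -> judge JSON


def run_evaluation(cases: list, target: Target, judge: Judge) -> dict:
    """Run every case through the target, then the judge, collecting errors separately."""
    predictions, judgements, failures = [], [], []
    for case in cases:
        try:
            output = target(case)
            verdict = judge(case, output)
        except Exception as exc:  # one bad case should not abort the whole run
            failures.append({"case_id": case["case_id"], "error": repr(exc)})
            continue
        predictions.append({"case_id": case["case_id"], **output})
        judgements.append({"case_id": case["case_id"], **verdict})
    return {"predictions": predictions, "judgements": judgements, "failures": failures}
```

Each returned list could be written to the run as a JSON artifact, for example with `mlflow.log_dict`, so that sampled failures stay inspectable in the MLflow UI.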
The target adapter should hide where Palisade runs:
| Adapter | Purpose |
|---|---|
| `mock` | Prove the evaluation loop without a live service |
| `local` | Evaluate a local prototype function or process |
| `cloudflare_endpoint` | Call a future deployed Palisade Worker or API endpoint |
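One way to hide where Palisade runs is a shared interface with one class per adapter, as in the sketch below. The class names, the `evaluate` method, the `outcome` key, and the default `allowed` outcome are all assumptions for illustration. The `cloudflare_endpoint` branch deliberately raises because no deployed endpoint exists yet.

```python
from typing import Callable, Optional, Protocol


class TargetAdapter(Protocol):
    def evaluate(self, case: dict) -> dict:
        """Return at least {'outcome': ...} for one dataset case."""
        ...


class MockAdapter:
    """Returns canned outcomes keyed by case_id, so the loop runs without a service."""

    def __init__(self, canned: Optional[dict] = None, default: str = "allowed") -> None:
        self.canned = canned or {}
        self.default = default

    def evaluate(self, case: dict) -> dict:
        return {"outcome": self.canned.get(case["case_id"], self.default), "adapter": "mock"}


class LocalAdapter:
    """Wraps a local prototype function that maps a case to an outcome string."""

    def __init__(self, fn: Callable[[dict], str]) -> None:
        self.fn = fn

    def evaluate(self, case: dict) -> dict:
        return {"outcome": self.fn(case), "adapter": "local"}


def make_adapter(kind: str, **kwargs) -> TargetAdapter:
    if kind == "mock":
        return MockAdapter(**kwargs)
    if kind == "local":
        return LocalAdapter(**kwargs)
    if kind == "cloudflare_endpoint":
        raise NotImplementedError("no deployed Palisade endpoint exists yet")
    raise ValueError(f"unknown adapter {kind!r}")
```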
LLM Judge Design
Rubrics should be defined in code or config, with one system prompt per rubric. The judge should always return structured JSON so the evaluator can aggregate results without parsing prose.
Minimum judge output:
| Field | Purpose |
|---|---|
| `verdict` | pass, fail, or needs_review |
| `score` | Numeric quality score, for example 1 to 5 |
| `expected_guardrail_outcome` | Ground truth outcome used by the case |
| `actual_guardrail_outcome` | Outcome returned by the target |
| `error_categories` | List such as false_positive, false_negative, bad_remediation, or judge_uncertain |
| `rationale` | Short reason for the verdict |
| `confidence` | Judge confidence from 0 to 1 |
| `policy_references` | Policy or rubric sections used |
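To keep aggregation free of prose parsing, the evaluator could reject any judge reply that does not match this schema. The sketch below is a minimal parser under that assumption; the class and function names are hypothetical, and it treats all eight fields as required.

```python
import json
from dataclasses import dataclass, fields

VERDICTS = {"pass", "fail", "needs_review"}


@dataclass(frozen=True)
class JudgeOutput:
    verdict: str
    score: float
    expected_guardrail_outcome: str
    actual_guardrail_outcome: str
    error_categories: tuple
    rationale: str
    confidence: float
    policy_references: tuple


def parse_judge_output(raw: str) -> JudgeOutput:
    """Parse the judge's JSON reply, raising ValueError on any schema violation."""
    data = json.loads(raw)  # json.JSONDecodeError is itself a ValueError
    if not isinstance(data, dict):
        raise ValueError("judge output must be a JSON object")
    missing = [f.name for f in fields(JudgeOutput) if f.name not in data]
    if missing:
        raise ValueError(f"judge output is missing: {', '.join(missing)}")
    if data["verdict"] not in VERDICTS:
        raise ValueError(f"unknown verdict {data['verdict']!r}")
    confidence = float(data["confidence"])
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return JudgeOutput(
        verdict=data["verdict"],
        score=float(data["score"]),
        expected_guardrail_outcome=str(data["expected_guardrail_outcome"]),
        actual_guardrail_outcome=str(data["actual_guardrail_outcome"]),
        error_categories=tuple(data["error_categories"]),
        rationale=str(data["rationale"]),
        confidence=confidence,
        policy_references=tuple(data["policy_references"]),
    )
```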
Judge prompt rules:
- Include the policy objective, allowed outcomes, scoring scale, and failure categories.
- Tell the judge to prefer `needs_review` when the case is ambiguous.
- Keep the rationale short and avoid revealing hidden chain-of-thought.
- Log the rubric ID, rubric version, judge model, and prompt version with every run.
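These rules can be turned into a prompt builder so every rubric produces a system prompt with the same sections. The sketch below assumes a rubric is a dictionary with the keys shown. Those key names, the wording of the instructions, and the tag names are all hypothetical, and this is not a tuned prompt.

```python
def build_judge_prompt(rubric: dict) -> str:
    """Assemble one system prompt from a rubric config (key names are assumptions)."""
    lines = [
        f"Rubric {rubric['rubric_id']} (version {rubric['rubric_version']})",
        f"Policy objective: {rubric['policy_objective']}",
        "Allowed outcomes: " + ", ".join(rubric["allowed_outcomes"]),
        f"Scoring scale: {rubric['scoring_scale']}",
        "Failure categories: " + ", ".join(rubric["failure_categories"]),
        "If the case is ambiguous, return the verdict needs_review.",
        "Keep the rationale to one or two sentences and omit step-by-step reasoning.",
        "Respond with a single JSON object and nothing else.",
    ]
    return "\n".join(lines)


def run_tags(rubric: dict, judge_model: str, prompt_version: str) -> dict:
    """Collect the identifiers that should accompany every run."""
    return {
        "rubric_id": rubric["rubric_id"],
        "rubric_version": rubric["rubric_version"],
        "judge_model": judge_model,
        "prompt_version": prompt_version,
    }
```

The dictionary from `run_tags` could be passed to `mlflow.set_tags` at the start of each run so that results stay traceable to the exact rubric and prompt.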
For the PoC, run the judge locally from the evaluation runner and log its outputs to MLflow. MLflow GenAI scorers or custom judges can be evaluated later if they reduce local code without hiding important behavior.
Initial Metrics
The first dashboard should answer whether the guardrails are good enough to continue prototyping:
| Metric | Meaning |
|---|---|
| Outcome accuracy | Percent of cases where actual outcome matches ground truth |
| False positive rate | Safe cases incorrectly flagged, remediated, or blocked |
| False negative rate | Unsafe cases incorrectly allowed |
| Remediation success rate | Cases where remediation meets the rubric |
| Judge pass rate | Percent of cases the LLM judge marks as passing |
| Needs-review rate | Percent of cases the judge cannot confidently score |
| Modality breakdown | Metrics by text, image, audio, and multimodal case |
| Policy breakdown | Metrics by policy profile and module |
| Dataset coverage | Number of cases by modality, policy, and rubric |
| Cost and latency | Placeholder until live targets and judge providers are known |
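Several of these metrics reduce to simple ratios over per-case records, as the sketch below illustrates. It assumes each record carries `expected`, `actual`, and `verdict` keys (hypothetical names) and that `allowed` is the only safe outcome, so any other ground-truth label counts as unsafe. Rates over an empty group are reported as 0.0 to avoid division by zero.

```python
def summarize(records: list, safe_label: str = "allowed") -> dict:
    """Compute headline rates from records with expected, actual, and verdict keys."""
    def rate(hits: int, total: int) -> float:
        return hits / total if total else 0.0

    safe = [r for r in records if r["expected"] == safe_label]
    unsafe = [r for r in records if r["expected"] != safe_label]
    total = len(records)
    return {
        "outcome_accuracy": rate(sum(r["expected"] == r["actual"] for r in records), total),
        # Safe cases that were flagged, remediated, or blocked.
        "false_positive_rate": rate(sum(r["actual"] != safe_label for r in safe), len(safe)),
        # Unsafe cases that were allowed through.
        "false_negative_rate": rate(sum(r["actual"] == safe_label for r in unsafe), len(unsafe)),
        "judge_pass_rate": rate(sum(r["verdict"] == "pass" for r in records), total),
        "needs_review_rate": rate(sum(r["verdict"] == "needs_review" for r in records), total),
    }
```

Modality and policy breakdowns could reuse the same function by grouping records on `modality` or `policy_profile` first and summarizing each group.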
Review Workflow
- Dataset owner adds or updates cases locally.
- Evaluation owner uploads the dataset to MLflow.
- Evaluation owner runs the target adapter and judge.
- Team reviews aggregate metrics and sampled failures in MLflow.
- Failures are assigned to one of four actions: fix policy, fix prompt, fix target implementation, or fix dataset/rubric.
- A passing run becomes validation evidence for the tested policy profile and target version.
Risks and Decisions
| Risk or decision | Current stance |
|---|---|
| PHI and sensitive data | Keep PoC datasets synthetic or de-identified until governance approves otherwise |
| MLflow on Cloudflare | Validate with Containers before committing |
| Backend metadata store | Do not assume D1 compatibility; validate SQLAlchemy requirements first |
| Artifact store | R2 is a candidate because of S3 compatibility, but must be tested with MLflow |
| LLM judge reliability | Use clear rubrics, confidence, needs-review, and sampled human review |
| Multimodal size | Start with small smoke datasets before storing large media artifacts |
| No live endpoint yet | Use mock and local adapters first |
Open Questions Before Implementation
- Which data classification is allowed for the first multimodal evaluation datasets?
- Which LLM provider and model should be used as the judge for the PoC?
- Should the first evaluation target be a mocked Palisade response, a local prototype function, or both?
- Which MLflow backend store is approved if MLflow runs beyond a local machine?
- Should the evaluation runner live inside this repo or in a separate validation repo once implementation starts?