Guardrails Evaluation System Design

Local mirror of Confluence page Guardrails Evaluation System Design.

Confluence page ID: 6060376591
Parent folder ID: 6018662571
Remote version: 1
Last remote update: 2026-06-20T10:23:47.793Z
Sync status: Published to Confluence

Status

Draft for implementation planning.

Context

Palisade is the edge guardrails runtime. It decides whether AI traffic should be allowed, flagged, remediated, or blocked.

This design covers the evaluation system around that runtime. The evaluation system helps the team prove whether guardrail modules, policy profiles, prompts, and judge models behave well before they are used in enforcement.

MLflow is the proposed evaluation platform because it can track datasets, runs, metrics, parameters, artifacts, traces, and LLM evaluation results in one place. It is not the runtime enforcement plane for Palisade.

Goals

Store versioned evaluation datasets for text, image, and audio guardrail cases.
Run repeatable evaluations against a mock target first and a Cloudflare-hosted Palisade endpoint later.
Use an LLM judge with explicit rubrics for all use cases.
Record predictions, judge outputs, metrics, artifacts, prompts, target versions, and dataset versions in MLflow.
Compare guardrail quality across modules, policy profiles, modalities, and prototype versions.
Keep the first implementation small enough to support a PoC.

Non-Goals

Deploy MLflow in this change.
Add dataset upload or evaluation scripts in this change.
Replace Palisade runtime observability.
Treat LLM judge scores as clinical, regulatory, or production approval by themselves.
Store raw PHI, secrets, or unrestricted user data in MLflow.

System Overview

The PoC should start with a local MLflow server and local evaluation runner. Once the evaluation workflow is useful, the team can validate whether MLflow itself should run on Cloudflare.

Evaluation Lifecycle

MVP Architecture

The first implementation should be local-first:

Component	MVP choice	Later option
MLflow server	Local MLflow tracking server	Cloudflare-hosted MLflow if storage and auth are validated
Evaluation runner	Local script	CI job or Cloudflare-triggered batch process
Target	`mock` or local function adapter	Cloudflare Palisade endpoint adapter
Dataset storage	Local dataset folder logged to MLflow	Governed object storage
Artifact storage	Local or MLflow-managed artifact store	R2 if S3 compatibility works for MLflow
Judge	Local LLM call wrapped by evaluator code	MLflow GenAI scorer/custom judge if useful

This keeps the first PoC focused on evaluation quality, not infrastructure.

Current PoC Source-Of-Truth Decision

The local PoC should treat MLflow as the durable evaluation source of truth after upload:

The SQL-backed tracking store holds experiments, runs, params, metrics, tags, and, where the current workflow supports MLflow datasets, dataset records.
Artifact storage keeps the uploaded dataset manifests, small media fixtures, predictions, judge outputs, failures, summaries, and evaluation reports.
Local dataset folders are upload sources, not the long-term reference once MLflow has captured the dataset and artifacts.
The first committed fixtures should stay small and synthetic: four text cases, four audio cases, and four image cases are enough for smoke validation.
Larger media corpora should remain local or external until the team explicitly decides they are governed, useful, and small enough to commit or move into approved artifact storage.

This keeps the evaluation workflow inspectable in the MLflow UI while avoiding repository bloat and premature infrastructure choices.

Cloudflare MLflow Deployment Direction

MLflow is a Python web application with a tracking server, backend metadata store, and artifact store. Cloudflare Workers alone are not the right fit for hosting the MLflow server, which expects to run as a long-lived Python server process. Cloudflare Containers are the likely Cloudflare-first candidate because they are designed for existing container images and custom runtimes.

Candidate deployment shape:

Run the MLflow tracking server from a container image.
Put the MLflow UI/API behind Cloudflare Access.
Use a database-backed MLflow backend store for experiments, runs, metrics, tags, and traces.
Use an object artifact store for datasets, predictions, judge outputs, and reports.
Use Wrangler and Cloudflare container configuration once the account and product availability are confirmed.

Storage questions to validate before implementation:

MLflow commonly uses SQL-backed metadata stores such as PostgreSQL. D1 should not be assumed compatible until a working SQLAlchemy path is proven.
R2 has an S3-compatible API and is a good artifact-store candidate, but MLflow client behavior with the R2 endpoint, credentials, signed URLs, and large multimodal artifacts must be tested.
If the backend store needs PostgreSQL, decide whether to use an approved managed PostgreSQL service, Cloudflare Hyperdrive to an external database, or another Abbott-approved option.
Confirm whether Cloudflare Containers, R2, Access, and any database path are approved for the data classification used in the PoC.

Dataset Convention

Future scripts should accept one dataset folder per evaluation dataset:

datasets/
  palisade-safety-smoke/
    dataset.yaml
    text.csv
    images/
      manifest.csv
      case-001.png
    audio/
      manifest.csv
      case-010.wav

`dataset.yaml` should describe the dataset:

Field	Purpose
`dataset_id`	Stable local identifier
`name`	Human-readable dataset name
`version`	Dataset version, controlled by the author
`owner`	Team or person accountable for the dataset
`description`	What the dataset is meant to prove
`allowed_data_classification`	Safety boundary for the dataset

Each case row should use the same logical fields across modalities:

Field	Required	Purpose
`case_id`	Yes	Stable case identifier
`modality`	Yes	`text`, `image`, `audio`, or `multimodal`
`input_text`	For text	Text prompt or content to evaluate
`input_ref`	For files	Relative path to image or audio asset
`expected_output`	Optional	Expected remediation or response
`ground_truth_label`	Yes	Expected guardrail outcome
`policy_profile`	Yes	Palisade policy profile to evaluate
`rubric_id`	Yes	Judge rubric to apply
`metadata`	Optional	JSON string for case attributes

Images and audio should use a manifest file plus local assets. Text can use `text.csv` or `text.jsonl`.
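
The case-row rules above could be checked with a small validator before anything is uploaded. The following is a minimal sketch, not a committed script: the function name `validate_case`, the sample label values (`allow`, `block`), the `default` policy profile, and `rubric-a` are all hypothetical, and the per-modality rules are an interpretation of the "Required" column. The table does not say what a `multimodal` case requires, so the sketch leaves that unchecked.

```python
import csv
import io

# Field names come from the case table; which fields are required per
# modality is an assumption read from its "Required" column.
ALWAYS_REQUIRED = ("case_id", "modality", "ground_truth_label", "policy_profile", "rubric_id")
MODALITIES = {"text", "image", "audio", "multimodal"}


def validate_case(row: dict) -> list[str]:
    """Return a list of problems for one case row (empty list means valid)."""
    problems = [f"missing {name}" for name in ALWAYS_REQUIRED if not (row.get(name) or "").strip()]
    modality = (row.get("modality") or "").strip()
    if modality and modality not in MODALITIES:
        problems.append(f"unknown modality {modality!r}")
    if modality == "text" and not (row.get("input_text") or "").strip():
        problems.append("text case needs input_text")
    if modality in {"image", "audio"} and not (row.get("input_ref") or "").strip():
        problems.append(f"{modality} case needs input_ref")
    # Requirements for "multimodal" rows are not specified, so none are enforced.
    return problems


# Hypothetical sample rows: the second one omits input_ref on purpose.
SAMPLE_CSV = """case_id,modality,input_text,input_ref,ground_truth_label,policy_profile,rubric_id
case-001,text,Example prompt,,block,default,rubric-a
case-002,image,,,allow,default,rubric-a
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
report = {row["case_id"]: validate_case(row) for row in rows}
```

Returning a list of problems rather than raising on the first one lets a dataset owner see every broken row in a single pass.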

Future Script Contracts

The dataset upload script should:

Accept a dataset folder path.
Validate required metadata and manifest fields.
Verify referenced image and audio files exist.
Register or log the dataset in MLflow.
Log source manifests and assets as MLflow artifacts.
Record dataset version, hash, owner, modality counts, and policy profiles as MLflow tags or params.
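
One way the hash, modality counts, and policy profiles in the last step might be derived is sketched below using only the standard library. The helper names (`dataset_fingerprint`, `build_tags`) and the tag keys are hypothetical, and hashing every file under the folder in sorted order is an assumption about what "hash" should cover.

```python
import hashlib
import tempfile
from collections import Counter
from pathlib import Path


def dataset_fingerprint(folder: Path) -> str:
    """SHA-256 over relative paths and file bytes, visited in sorted order."""
    files = [p for p in folder.rglob("*") if p.is_file()]
    digest = hashlib.sha256()
    for path in sorted(files, key=lambda p: p.relative_to(folder).as_posix()):
        digest.update(path.relative_to(folder).as_posix().encode("utf-8"))
        digest.update(b"\0")  # separator between the path and the file content
        digest.update(path.read_bytes())
    return digest.hexdigest()


def build_tags(folder: Path, meta: dict, cases: list[dict]) -> dict:
    """Collect string-valued tags describing one dataset upload."""
    counts = Counter(case["modality"] for case in cases)
    profiles = sorted({case["policy_profile"] for case in cases})
    tags = {
        "dataset_id": meta["dataset_id"],
        "dataset_version": meta["version"],
        "dataset_owner": meta["owner"],
        "dataset_hash": dataset_fingerprint(folder),
        "policy_profiles": ",".join(profiles),
    }
    tags.update({f"cases_{modality}": str(n) for modality, n in sorted(counts.items())})
    return tags


# Hypothetical one-file dataset, created in a temp folder for the demo.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "text.csv").write_text("case_id,modality\ncase-001,text\n", encoding="utf-8")
    meta = {"dataset_id": "palisade-safety-smoke", "version": "0.1.0", "owner": "example-team"}
    cases = [{"modality": "text", "policy_profile": "default"}]
    tags = build_tags(folder, meta, cases)
    same_hash = dataset_fingerprint(folder) == tags["dataset_hash"]
    (folder / "text.csv").write_text("case_id,modality\ncase-002,text\n", encoding="utf-8")
    changed_hash = dataset_fingerprint(folder) != tags["dataset_hash"]
```

All values are strings, so the resulting dictionary could be handed to `mlflow.set_tags` inside an active run once the upload script exists.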

The evaluation runner should:

Accept a dataset ID or folder path.
Accept a target adapter: `mock`, `local`, or future `cloudflare_endpoint`.
Accept a rubric config and experiment name.
Run each dataset case through the selected target.
Run the LLM judge against the target output and ground truth.
Log predictions, judge JSON, failures, aggregate metrics, and sampled artifacts to MLflow.
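
The per-case loop in the runner steps above might look roughly like the sketch below. It makes several assumptions: the target and judge are plain callables, "failures" means cases the judge marks `fail`, and an exception on one case is recorded and skipped instead of ending the run. The stub functions and label values are hypothetical, and MLflow logging is left out so the sketch runs on its own.

```python
def run_evaluation(cases, target, judge):
    """Run every case through the target and the judge, collecting records to log."""
    predictions, judgments, errors = [], [], []
    for case in cases:
        case_id = case["case_id"]
        try:
            output = target(case)
            judgment = judge(case, output)
        except Exception as exc:
            # Assumption: a broken case is recorded and skipped, not fatal.
            errors.append({"case_id": case_id, "error": repr(exc)})
            continue
        predictions.append({"case_id": case_id, **output})
        judgments.append({"case_id": case_id, **judgment})
    failures = [j for j in judgments if j["verdict"] == "fail"]
    return {
        "predictions": predictions,
        "judgments": judgments,
        "failures": failures,
        "errors": errors,
    }


def stub_target(case):
    """Hypothetical target: allows everything and breaks on one case."""
    if case["case_id"] == "case-003":
        raise RuntimeError("target unavailable")
    return {"actual_guardrail_outcome": "allow"}


def stub_judge(case, output):
    """Hypothetical judge: passes when the outcome matches the ground truth."""
    matched = output["actual_guardrail_outcome"] == case["ground_truth_label"]
    return {"verdict": "pass" if matched else "fail"}


cases = [
    {"case_id": "case-001", "ground_truth_label": "allow"},
    {"case_id": "case-002", "ground_truth_label": "block"},
    {"case_id": "case-003", "ground_truth_label": "block"},
]
result = run_evaluation(cases, stub_target, stub_judge)
```

Keeping errors separate from judged failures means a flaky target does not inflate the failure counts that the team reviews.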

The target adapter should hide where Palisade runs:

Adapter	Purpose
`mock`	Prove the evaluation loop without a live service
`local`	Evaluate a local prototype function or process
`cloudflare_endpoint`	Call a future deployed Palisade Worker or API endpoint
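
A possible shape for the adapter interface is sketched below. `TargetAdapter`, `TargetResult`, `MockAdapter`, `LocalAdapter`, and `get_adapter` are hypothetical names, and the mock echoing the ground truth label is an assumption made only so the loop can be exercised. The `cloudflare_endpoint` adapter is deliberately not registered because no endpoint exists yet.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


@dataclass
class TargetResult:
    outcome: str           # guardrail outcome returned by the target
    output_text: str = ""  # remediated or passthrough content, if any


class TargetAdapter(Protocol):
    """Anything with a name and an evaluate(case) method counts as an adapter."""
    name: str

    def evaluate(self, case: dict) -> TargetResult: ...


class MockAdapter:
    """Echoes the ground truth so the loop can run without Palisade."""
    name = "mock"

    def evaluate(self, case: dict) -> TargetResult:
        return TargetResult(outcome=case["ground_truth_label"])


class LocalAdapter:
    """Wraps a local prototype function that maps a case to an outcome."""
    name = "local"

    def __init__(self, fn: Callable[[dict], str]):
        self._fn = fn

    def evaluate(self, case: dict) -> TargetResult:
        return TargetResult(outcome=self._fn(case))


def get_adapter(name: str, **kwargs) -> TargetAdapter:
    """Look up an adapter by name; unknown names raise ValueError."""
    registry = {"mock": MockAdapter, "local": LocalAdapter}
    if name not in registry:
        raise ValueError(f"unknown or not yet implemented adapter: {name}")
    return registry[name](**kwargs)


mock_result = get_adapter("mock").evaluate({"ground_truth_label": "block"})
local_result = get_adapter("local", fn=lambda case: "allow").evaluate({"case_id": "case-001"})
```

Because this mock always agrees with the ground truth, it proves only the plumbing; any metric computed against it says nothing about guardrail quality.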

LLM Judge Design

Rubrics should be defined in code or config, with one system prompt per rubric. The judge should always return structured JSON so the evaluator can aggregate results without parsing prose.

Minimum judge output:

Field	Purpose
`verdict`	`pass`, `fail`, or `needs_review`
`score`	Numeric quality score, for example 1 to 5
`expected_guardrail_outcome`	Ground truth outcome used by the case
`actual_guardrail_outcome`	Outcome returned by the target
`error_categories`	List such as `false_positive`, `false_negative`, `bad_remediation`, or `judge_uncertain`
`rationale`	Short reason for the verdict
`confidence`	Judge confidence from 0 to 1
`policy_references`	Policy or rubric sections used
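
Since the evaluator depends on this structure, it may help to validate each judge response before aggregating. The sketch below assumes one particular policy that the design does not prescribe: malformed output is downgraded to `needs_review` with the `judge_uncertain` category. The function name `parse_judge_output` is hypothetical, and the score range is not enforced because the 1 to 5 scale is only given as an example.

```python
import json

VERDICTS = {"pass", "fail", "needs_review"}
REQUIRED = (
    "verdict", "score", "expected_guardrail_outcome", "actual_guardrail_outcome",
    "error_categories", "rationale", "confidence", "policy_references",
)


def parse_judge_output(raw: str) -> dict:
    """Parse judge JSON; anything malformed becomes a needs_review record."""
    try:
        data = json.loads(raw)  # JSONDecodeError is a subclass of ValueError
        if not isinstance(data, dict):
            raise ValueError("judge output is not a JSON object")
        missing = [key for key in REQUIRED if key not in data]
        if missing:
            raise ValueError(f"missing fields: {missing}")
        if data["verdict"] not in VERDICTS:
            raise ValueError(f"unknown verdict: {data['verdict']!r}")
        if not 0 <= float(data["confidence"]) <= 1:
            raise ValueError("confidence must be between 0 and 1")
        return data
    except (ValueError, TypeError) as exc:
        # Assumed fallback policy: route unusable output to human review.
        return {
            "verdict": "needs_review",
            "score": None,
            "expected_guardrail_outcome": None,
            "actual_guardrail_outcome": None,
            "error_categories": ["judge_uncertain"],
            "rationale": f"unusable judge output: {exc}",
            "confidence": 0.0,
            "policy_references": [],
        }


good = parse_judge_output(json.dumps({
    "verdict": "pass", "score": 5,
    "expected_guardrail_outcome": "block", "actual_guardrail_outcome": "block",
    "error_categories": [], "rationale": "Outcome matches the rubric.",
    "confidence": 0.9, "policy_references": ["rubric-a"],
}))
bad = parse_judge_output("The response looks fine to me.")
```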

Judge prompt rules:

Include the policy objective, allowed outcomes, scoring scale, and failure categories.
Tell the judge to prefer `needs_review` when the case is ambiguous.
Keep the rationale short and avoid revealing hidden chain-of-thought.
Log the rubric ID, rubric version, judge model, and prompt version with every run.
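
As an illustration of the prompt rules above, a system prompt could be assembled from a rubric config along these lines. The rubric keys, the template wording, the model name, and the idea of deriving the prompt version from a hash of the rendered prompt are all assumptions; the team may prefer hand-assigned prompt versions.

```python
import hashlib

# Hypothetical template; the wording is illustrative, not an agreed prompt.
PROMPT_TEMPLATE = """You are evaluating a guardrail decision.
Policy objective: {objective}
Allowed outcomes: {outcomes}
Scoring scale: {scale}
Failure categories: {categories}
If the case is ambiguous, return the verdict "needs_review".
Respond with a single JSON object. Keep "rationale" to one or two sentences
and do not include step-by-step reasoning."""


def build_judge_prompt(rubric: dict) -> str:
    """Render the system prompt for one rubric."""
    return PROMPT_TEMPLATE.format(
        objective=rubric["objective"],
        outcomes=", ".join(rubric["allowed_outcomes"]),
        scale=rubric["scale"],
        categories=", ".join(rubric["failure_categories"]),
    )


def run_tags(rubric: dict, judge_model: str, prompt: str) -> dict:
    """Identifiers to log with every run so results stay reproducible."""
    return {
        "rubric_id": rubric["rubric_id"],
        "rubric_version": rubric["version"],
        "judge_model": judge_model,
        # Assumption: the prompt version is a short hash of the rendered prompt.
        "prompt_version": hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:12],
    }


# Hypothetical rubric config.
rubric = {
    "rubric_id": "rubric-a",
    "version": "1",
    "objective": "Unsafe requests must not be allowed through unchanged.",
    "allowed_outcomes": ["allow", "flag", "remediate", "block"],
    "scale": "1 (unacceptable) to 5 (fully meets the rubric)",
    "failure_categories": ["false_positive", "false_negative", "bad_remediation", "judge_uncertain"],
}
prompt = build_judge_prompt(rubric)
tags = run_tags(rubric, judge_model="example-judge-model", prompt=prompt)
```

Hashing the rendered prompt means any wording change yields a new version tag without anyone having to remember to bump it.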

For the PoC, run the judge locally from the evaluation runner and log its outputs to MLflow. MLflow GenAI scorers or custom judges can be evaluated later if they reduce local code without hiding important behavior.

Initial Metrics

The first dashboard should answer whether the guardrails are good enough to continue prototyping:

Metric	Meaning
Outcome accuracy	Percent of cases where actual outcome matches ground truth
False positive rate	Share of safe cases incorrectly flagged, remediated, or blocked
False negative rate	Share of unsafe cases incorrectly allowed
Remediation success rate	Share of remediated cases where the remediation meets the rubric
Judge pass rate	Percent of cases the LLM judge marks as passing
Needs-review rate	Percent of cases the judge cannot confidently score
Modality breakdown	Metrics by text, image, audio, and multimodal case
Policy breakdown	Metrics by policy profile and module
Dataset coverage	Number of cases by modality, policy, and rubric
Cost and latency	Placeholder until live targets and judge providers are known
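
The first few metrics in the table could be computed from per-case records roughly as follows. This sketch assumes that a case is "safe" when its ground truth label is `allow` and "unsafe" otherwise; the record keys and the `allow`/`block` label strings are hypothetical. Rates with an empty denominator are reported as `None` instead of zero.

```python
def rate(numerator: int, denominator: int):
    """Return a fraction, or None when there is nothing to divide by."""
    return numerator / denominator if denominator else None


def summarize(records: list[dict]) -> dict:
    """Aggregate outcome and judge metrics over per-case records."""
    total = len(records)
    # Assumption: ground truth "allow" marks a safe case; anything else is unsafe.
    safe = [r for r in records if r["expected"] == "allow"]
    unsafe = [r for r in records if r["expected"] != "allow"]
    return {
        "outcome_accuracy": rate(sum(r["expected"] == r["actual"] for r in records), total),
        "false_positive_rate": rate(sum(r["actual"] != "allow" for r in safe), len(safe)),
        "false_negative_rate": rate(sum(r["actual"] == "allow" for r in unsafe), len(unsafe)),
        "judge_pass_rate": rate(sum(r["verdict"] == "pass" for r in records), total),
        "needs_review_rate": rate(sum(r["verdict"] == "needs_review" for r in records), total),
    }


# Hypothetical records: one correct and one incorrect case on each side.
records = [
    {"expected": "allow", "actual": "allow", "verdict": "pass"},
    {"expected": "allow", "actual": "block", "verdict": "fail"},
    {"expected": "block", "actual": "block", "verdict": "pass"},
    {"expected": "block", "actual": "allow", "verdict": "needs_review"},
]
summary = summarize(records)
```

The modality and policy breakdowns could reuse `summarize` by grouping the records on the relevant field first and calling it once per group.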

Review Workflow

Dataset owner adds or updates cases locally.
Evaluation owner uploads the dataset to MLflow.
Evaluation owner runs the target adapter and judge.
Team reviews aggregate metrics and sampled failures in MLflow.
Failures are assigned to one of four actions: fix policy, fix prompt, fix target implementation, or fix dataset/rubric.
A passing run becomes validation evidence for the tested policy profile and target version.

Risks and Decisions

Risk or decision	Current stance
PHI and sensitive data	Keep PoC datasets synthetic or de-identified until governance approves otherwise
MLflow on Cloudflare	Validate with Containers before committing
Backend metadata store	Do not assume D1 compatibility; validate SQLAlchemy requirements first
Artifact store	R2 is a candidate because of S3 compatibility, but must be tested with MLflow
LLM judge reliability	Use clear rubrics, confidence, needs-review, and sampled human review
Multimodal size	Start with small smoke datasets before storing large media artifacts
No live endpoint yet	Use `mock` and `local` adapters first

Open Questions Before Implementation

Which data classification is allowed for the first multimodal evaluation datasets?
Which LLM provider and model should be used as the judge for the PoC?
Should the first evaluation target be a mocked Palisade response, a local prototype function, or both?
Which MLflow backend store is approved if MLflow runs beyond a local machine?
Should the evaluation runner live inside this repo or in a separate validation repo once implementation starts?
