Guardrails Evaluation System Design

Local mirror of Confluence page Guardrails Evaluation System Design.

Confluence page ID: 6060376591
Parent folder ID: 6018662571
Remote version: 1
Last remote update: 2026-06-20T10:23:47.793Z
Sync status: Published to Confluence

Status

Draft for implementation planning.

Context

Palisade is the edge guardrails runtime. It decides whether AI traffic should be allowed, flagged, remediated, or blocked.

This design covers the evaluation system around that runtime. The evaluation system helps the team prove whether guardrail modules, policy profiles, prompts, and judge models behave well before they are used in enforcement.

MLflow is the proposed evaluation platform because it can track datasets, runs, metrics, parameters, artifacts, traces, and LLM evaluation results in one place. It is not the runtime enforcement plane for Palisade.

Goals

  • Store versioned evaluation datasets for text, image, and audio guardrail cases.
  • Run repeatable evaluations against a mock target first and a Cloudflare-hosted Palisade endpoint later.
  • Use an LLM judge with explicit rubrics for all use cases.
  • Record predictions, judge outputs, metrics, artifacts, prompts, target versions, and dataset versions in MLflow.
  • Compare guardrail quality across modules, policy profiles, modalities, and prototype versions.
  • Keep the first implementation small enough to support a PoC.

Non-Goals

  • Deploy MLflow in this change.
  • Add dataset upload or evaluation scripts in this change.
  • Replace Palisade runtime observability.
  • Treat LLM judge scores as clinical, regulatory, or production approval by themselves.
  • Store raw PHI, secrets, or unrestricted user data in MLflow.

System Overview

The PoC should start with a local MLflow server and local evaluation runner. Once the evaluation workflow is useful, the team can validate whether MLflow itself should run on Cloudflare.

Evaluation Lifecycle

MVP Architecture

The first implementation should be local-first:

| Component | MVP choice | Later option |
| --- | --- | --- |
| MLflow server | Local MLflow tracking server | Cloudflare-hosted MLflow if storage and auth are validated |
| Evaluation runner | Local script | CI job or Cloudflare-triggered batch process |
| Target | mock or local function adapter | Cloudflare Palisade endpoint adapter |
| Dataset storage | Local dataset folder logged to MLflow | Governed object storage |
| Artifact storage | Local or MLflow-managed artifact store | R2 if S3 compatibility works for MLflow |
| Judge | Local LLM call wrapped by evaluator code | MLflow GenAI scorer/custom judge if useful |

This keeps the first PoC focused on evaluation quality, not infrastructure.

Current PoC Source-Of-Truth Decision

The local PoC should treat MLflow as the durable evaluation source of truth after upload:

  • The SQL-backed tracking store holds experiments, runs, params, metrics, tags, and, where the current workflow supports MLflow datasets, dataset records.
  • Artifact storage keeps the uploaded dataset manifests, small media fixtures, predictions, judge outputs, failures, summaries, and evaluation reports.
  • Local dataset folders are upload sources, not the long-term reference once MLflow has captured the dataset and artifacts.
  • The first committed fixtures should stay small and synthetic: four text cases, four audio cases, and four image cases are enough for smoke validation.
  • Larger media corpora should remain local or external until the team explicitly decides they are governed, useful, and small enough to commit or move into approved artifact storage.

This keeps the evaluation workflow inspectable in the MLflow UI while avoiding repository bloat and premature infrastructure choices.

Cloudflare MLflow Deployment Direction

MLflow is a Python web application with a tracking server, backend metadata store, and artifact store. Cloudflare Workers alone are not the right fit for hosting the MLflow server. Cloudflare Containers are the likely Cloudflare-first candidate because they are designed for existing container images and custom runtimes.

Candidate deployment shape:

  • Run the MLflow tracking server from a container image.
  • Put the MLflow UI/API behind Cloudflare Access.
  • Use a database-backed MLflow backend store for experiments, runs, metrics, tags, and traces.
  • Use an object artifact store for datasets, predictions, judge outputs, and reports.
  • Use Wrangler and Cloudflare container configuration once the account and product availability are confirmed.

Storage questions to validate before implementation:

  • MLflow commonly uses SQL-backed metadata stores such as PostgreSQL. D1 should not be assumed compatible until a working SQLAlchemy path is proven.
  • R2 has an S3-compatible API and is a good artifact-store candidate, but MLflow client behavior with the R2 endpoint, credentials, signed URLs, and large multimodal artifacts must be tested.
  • If the backend store needs PostgreSQL, decide whether to use an approved managed PostgreSQL service, Cloudflare Hyperdrive to an external database, or another Abbott-approved option.
  • Confirm whether Cloudflare Containers, R2, Access, and any database path are approved for the data classification used in the PoC.
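
As a starting point for the R2 test, the sketch below builds the environment an MLflow client would need in order to treat R2 as an S3-compatible artifact store. `MLFLOW_S3_ENDPOINT_URL`, `AWS_ACCESS_KEY_ID`, and `AWS_SECRET_ACCESS_KEY` are the variables MLflow's S3 artifact support reads. The helper names, the bucket name, and the `mlflow` prefix are assumptions for illustration. Whether MLflow behaves correctly against R2 remains the open question this section lists.

```python
def r2_artifact_env(account_id: str, access_key_id: str, secret_access_key: str) -> dict:
    """Build environment variables that point MLflow's S3 artifact client at R2.

    Hypothetical helper: only the variable names come from MLflow and the
    endpoint pattern from R2; everything else is an assumption of this sketch.
    """
    if not account_id:
        raise ValueError("account_id is required")
    return {
        # R2 exposes its S3-compatible API on a per-account endpoint.
        "MLFLOW_S3_ENDPOINT_URL": f"https://{account_id}.r2.cloudflarestorage.com",
        "AWS_ACCESS_KEY_ID": access_key_id,
        "AWS_SECRET_ACCESS_KEY": secret_access_key,
    }


def artifact_root(bucket: str, prefix: str = "mlflow") -> str:
    """Return an s3:// URI usable as an MLflow artifact location."""
    return f"s3://{bucket}/{prefix.strip('/')}"
```

A validation run would export these variables, start a tracking server with the returned URI as its artifact root, and then log one small and one large artifact to confirm upload, download, and UI rendering.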

Dataset Convention

Future scripts should accept one dataset folder per evaluation dataset:

datasets/
  palisade-safety-smoke/
    dataset.yaml
    text.csv
    images/
      manifest.csv
      case-001.png
    audio/
      manifest.csv
      case-010.wav

dataset.yaml should describe the dataset:

| Field | Purpose |
| --- | --- |
| dataset_id | Stable local identifier |
| name | Human-readable dataset name |
| version | Dataset version, controlled by the author |
| owner | Team or person accountable for the dataset |
| description | What the dataset is meant to prove |
| allowed_data_classification | Safety boundary for the dataset |

Each case row should use the same logical fields across modalities:

| Field | Required | Purpose |
| --- | --- | --- |
| case_id | Yes | Stable case identifier |
| modality | Yes | text, image, audio, or multimodal |
| input_text | For text | Text prompt or content to evaluate |
| input_ref | For files | Relative path to image or audio asset |
| expected_output | Optional | Expected remediation or response |
| ground_truth_label | Yes | Expected guardrail outcome |
| policy_profile | Yes | Palisade policy profile to evaluate |
| rubric_id | Yes | Judge rubric to apply |
| metadata | Optional | JSON string for case attributes |

Images and audio should use a manifest file plus local assets. Text can use text.csv or text.jsonl.
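
The case fields above could be checked with a small validator like the one below. This is a minimal sketch, not the planned script: the function name is hypothetical, and the rule that a multimodal case needs at least one of `input_text` or `input_ref` is an assumption, since the design does not define it.

```python
import json

MODALITIES = {"text", "image", "audio", "multimodal"}
REQUIRED_FIELDS = ("case_id", "modality", "ground_truth_label", "policy_profile", "rubric_id")


def validate_case(row: dict) -> list:
    """Return a list of human-readable problems for one case row (empty if valid)."""
    errors = []
    for field in REQUIRED_FIELDS:
        if not (row.get(field) or "").strip():
            errors.append(f"missing required field: {field}")

    modality = (row.get("modality") or "").strip()
    has_text = bool((row.get("input_text") or "").strip())
    has_ref = bool((row.get("input_ref") or "").strip())

    if modality and modality not in MODALITIES:
        errors.append(f"unknown modality: {modality}")
    if modality == "text" and not has_text:
        errors.append("text case needs input_text")
    if modality in {"image", "audio"} and not has_ref:
        errors.append(f"{modality} case needs input_ref")
    # Assumption: a multimodal case must carry at least one input.
    if modality == "multimodal" and not (has_text or has_ref):
        errors.append("multimodal case needs input_text or input_ref")

    metadata = (row.get("metadata") or "").strip()
    if metadata:
        try:
            json.loads(metadata)
        except json.JSONDecodeError:
            errors.append("metadata is not valid JSON")
    return errors
```

Rows read with `csv.DictReader` from `text.csv` or a `manifest.csv` can be passed to this function directly, because every value arrives as a string.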

Future Script Contracts

The dataset upload script should:

  • Accept a dataset folder path.
  • Validate required metadata and manifest fields.
  • Verify referenced image and audio files exist.
  • Register or log the dataset in MLflow.
  • Log source manifests and assets as MLflow artifacts.
  • Record dataset version, hash, owner, modality counts, and policy profiles as MLflow tags or params.
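
The hash and summary step in the list above could look like the following sketch. It uses only the standard library and stops short of calling MLflow. The tag key names such as `dataset.hash` and both function names are assumptions, not an agreed naming scheme.

```python
import hashlib
from collections import Counter
from pathlib import Path


def dataset_hash(folder: Path) -> str:
    """Hash every file in the dataset folder, in a stable order, into one SHA-256 digest."""
    digest = hashlib.sha256()
    files = [p for p in folder.rglob("*") if p.is_file()]
    for path in sorted(files, key=lambda p: p.relative_to(folder).as_posix()):
        # Include the relative path so that renaming a file changes the hash.
        digest.update(path.relative_to(folder).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()


def dataset_tags(version: str, owner: str, content_hash: str, cases: list) -> dict:
    """Summarise a dataset as flat string tags (key names are hypothetical)."""
    counts = Counter(case["modality"] for case in cases)
    profiles = sorted({case["policy_profile"] for case in cases})
    tags = {
        "dataset.version": version,
        "dataset.owner": owner,
        "dataset.hash": content_hash,
        "dataset.policy_profiles": ",".join(profiles),
    }
    for modality, count in sorted(counts.items()):
        tags[f"dataset.cases.{modality}"] = str(count)
    return tags
```

The resulting dictionary is flat and string-valued, so it could be passed to `mlflow.set_tags` inside an active run once the upload script exists.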

The evaluation runner should:

  • Accept a dataset ID or folder path.
  • Accept a target adapter: mock, local, or future cloudflare_endpoint.
  • Accept a rubric config and experiment name.
  • Run each dataset case through the selected target.
  • Run the LLM judge against the target output and ground truth.
  • Log predictions, judge JSON, failures, aggregate metrics, and sampled artifacts to MLflow.

The target adapter should hide where Palisade runs:

| Adapter | Purpose |
| --- | --- |
| mock | Prove the evaluation loop without a live service |
| local | Evaluate a local prototype function or process |
| cloudflare_endpoint | Call a future deployed Palisade Worker or API endpoint |
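
One way to keep the runner independent of where Palisade runs is a small adapter interface. The sketch below shows the idea with a mock adapter only. The class and function names, the `TargetResult` shape, and the outcome labels `allow`, `flag`, `remediate`, and `block` are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class TargetResult:
    """What a target returns for one case (hypothetical shape)."""
    outcome: str            # assumed labels: allow, flag, remediate, block
    output_text: str = ""   # remediated or passed-through content, if any


class TargetAdapter(Protocol):
    """Anything with a name and an evaluate() method can act as a target."""
    name: str

    def evaluate(self, case: dict) -> TargetResult: ...


class MockAdapter:
    """Returns canned outcomes per case_id so the loop runs without a live service."""
    name = "mock"

    def __init__(self, canned: dict, default: str = "allow") -> None:
        self._canned = canned
        self._default = default

    def evaluate(self, case: dict) -> TargetResult:
        return TargetResult(outcome=self._canned.get(case["case_id"], self._default))


def run_cases(adapter: TargetAdapter, cases: list) -> list:
    """Run every case through the adapter and collect flat prediction rows."""
    predictions = []
    for case in cases:
        result = adapter.evaluate(case)
        predictions.append({
            "case_id": case["case_id"],
            "adapter": adapter.name,
            "expected": case["ground_truth_label"],
            "actual": result.outcome,
        })
    return predictions
```

A `local` or `cloudflare_endpoint` adapter would implement the same `evaluate` method, so `run_cases` and everything downstream of it would not need to change.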

LLM Judge Design

Rubrics should be defined in code or config, with one system prompt per rubric. The judge should always return structured JSON so the evaluator can aggregate results without parsing prose.

Minimum judge output:

| Field | Purpose |
| --- | --- |
| verdict | pass, fail, or needs_review |
| score | Numeric quality score, for example 1 to 5 |
| expected_guardrail_outcome | Ground truth outcome used by the case |
| actual_guardrail_outcome | Outcome returned by the target |
| error_categories | List such as false_positive, false_negative, bad_remediation, or judge_uncertain |
| rationale | Short reason for the verdict |
| confidence | Judge confidence from 0 to 1 |
| policy_references | Policy or rubric sections used |
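
Because the evaluator depends on structured output, it helps to reject malformed judge responses before aggregation. The sketch below checks the minimum fields listed above. The function name is hypothetical, and the score is only checked for being numeric because the scale is given here only as an example.

```python
import json

VERDICTS = {"pass", "fail", "needs_review"}
REQUIRED_FIELDS = (
    "verdict", "score", "expected_guardrail_outcome", "actual_guardrail_outcome",
    "error_categories", "rationale", "confidence", "policy_references",
)


def _is_number(value) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def parse_judge_output(raw: str) -> dict:
    """Parse one judge response and raise ValueError if it breaks the contract."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"judge output is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("judge output must be a JSON object")

    missing = [field for field in REQUIRED_FIELDS if field not in data]
    if missing:
        raise ValueError(f"judge output is missing fields: {', '.join(missing)}")
    if data["verdict"] not in VERDICTS:
        raise ValueError(f"unknown verdict: {data['verdict']!r}")
    if not _is_number(data["score"]):
        raise ValueError("score must be numeric")
    if not _is_number(data["confidence"]) or not 0 <= data["confidence"] <= 1:
        raise ValueError("confidence must be a number from 0 to 1")
    if not isinstance(data["error_categories"], list):
        raise ValueError("error_categories must be a list")
    return data
```

A response that fails validation could be logged as a failure artifact rather than silently dropped, which keeps judge reliability visible in the run.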

Judge prompt rules:

  • Include the policy objective, allowed outcomes, scoring scale, and failure categories.
  • Tell the judge to prefer needs_review when the case is ambiguous.
  • Keep the rationale short and avoid revealing hidden chain-of-thought.
  • Log the rubric ID, rubric version, judge model, and prompt version with every run.

For the PoC, run the judge locally from the evaluation runner and log its outputs to MLflow. MLflow GenAI scorers or custom judges can be evaluated later if they reduce local code without hiding important behavior.
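
The prompt rules above can be encoded so that every rubric produces its system prompt the same way. This is a minimal sketch: the `Rubric` fields, the function names, and the exact prompt wording are assumptions, not an agreed rubric format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rubric:
    """Hypothetical rubric config; field names are assumptions of this sketch."""
    rubric_id: str
    version: str
    policy_objective: str
    allowed_outcomes: tuple
    score_range: tuple          # (lowest, highest), for example (1, 5)
    failure_categories: tuple


def build_system_prompt(rubric: Rubric) -> str:
    """Render one system prompt that covers each of the judge prompt rules."""
    low, high = rubric.score_range
    return "\n".join([
        f"Policy objective: {rubric.policy_objective}",
        f"Allowed outcomes: {', '.join(rubric.allowed_outcomes)}",
        f"Scoring scale: integer from {low} to {high}",
        f"Failure categories: {', '.join(rubric.failure_categories)}",
        "If the case is ambiguous, return the verdict needs_review.",
        "Keep the rationale to one or two sentences and do not include step-by-step reasoning.",
        "Respond with a single JSON object and no surrounding prose.",
    ])


def run_tags(rubric: Rubric, judge_model: str, prompt_version: str) -> dict:
    """Collect the identifiers that should be logged with every run."""
    return {
        "rubric_id": rubric.rubric_id,
        "rubric_version": rubric.version,
        "judge_model": judge_model,
        "prompt_version": prompt_version,
    }
```

Keeping the prompt text generated from config means a prompt change shows up as a new `prompt_version` in MLflow instead of an untracked edit.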

Initial Metrics

The first dashboard should answer whether the guardrails are good enough to continue prototyping:

| Metric | Meaning |
| --- | --- |
| Outcome accuracy | Percent of cases where actual outcome matches ground truth |
| False positive rate | Safe cases incorrectly flagged, remediated, or blocked |
| False negative rate | Unsafe cases incorrectly allowed |
| Remediation success rate | Cases where remediation meets the rubric |
| Judge pass rate | Percent of cases the LLM judge marks as passing |
| Needs-review rate | Percent of cases the judge cannot confidently score |
| Modality breakdown | Metrics by text, image, audio, and multimodal case |
| Policy breakdown | Metrics by policy profile and module |
| Dataset coverage | Number of cases by modality, policy, and rubric |
| Cost and latency | Placeholder until live targets and judge providers are known |
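
Several of these metrics reduce to simple ratios over per-case rows. The sketch below computes five of them under stated assumptions: a case is "safe" when its ground truth is `allow`, every other ground truth counts as unsafe, and each row carries `expected`, `actual`, and `verdict` keys. Those key names and the `allow` label are illustrative, not part of the design.

```python
from typing import Optional

SAFE_OUTCOME = "allow"  # assumption: the label meaning "no intervention expected"


def _rate(numerator: int, denominator: int) -> Optional[float]:
    """Return a ratio, or None when there are no cases to divide by."""
    return numerator / denominator if denominator else None


def compute_metrics(rows: list) -> dict:
    """Aggregate per-case rows into headline metrics as fractions in [0, 1]."""
    safe = [r for r in rows if r["expected"] == SAFE_OUTCOME]
    unsafe = [r for r in rows if r["expected"] != SAFE_OUTCOME]
    return {
        "outcome_accuracy": _rate(sum(r["actual"] == r["expected"] for r in rows), len(rows)),
        # Safe cases that were flagged, remediated, or blocked.
        "false_positive_rate": _rate(sum(r["actual"] != SAFE_OUTCOME for r in safe), len(safe)),
        # Unsafe cases that were allowed through.
        "false_negative_rate": _rate(sum(r["actual"] == SAFE_OUTCOME for r in unsafe), len(unsafe)),
        "judge_pass_rate": _rate(sum(r["verdict"] == "pass" for r in rows), len(rows)),
        "needs_review_rate": _rate(sum(r["verdict"] == "needs_review" for r in rows), len(rows)),
    }
```

Running the same function over rows filtered by modality or policy profile would give the breakdown metrics without extra aggregation code. Returning `None` for an empty group avoids reporting a misleading zero.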

Review Workflow

  1. Dataset owner adds or updates cases locally.
  2. Evaluation owner uploads the dataset to MLflow.
  3. Evaluation owner runs the target adapter and judge.
  4. Team reviews aggregate metrics and sampled failures in MLflow.
  5. Failures are assigned to one of four actions: fix policy, fix prompt, fix target implementation, or fix dataset/rubric.
  6. A passing run becomes validation evidence for the tested policy profile and target version.

Risks and Decisions

| Risk or decision | Current stance |
| --- | --- |
| PHI and sensitive data | Keep PoC datasets synthetic or de-identified until governance approves otherwise |
| MLflow on Cloudflare | Validate with Containers before committing |
| Backend metadata store | Do not assume D1 compatibility; validate SQLAlchemy requirements first |
| Artifact store | R2 is a candidate because of S3 compatibility, but must be tested with MLflow |
| LLM judge reliability | Use clear rubrics, confidence, needs-review, and sampled human review |
| Multimodal size | Start with small smoke datasets before storing large media artifacts |
| No live endpoint yet | Use mock and local adapters first |

Open Questions Before Implementation

  • Which data classification is allowed for the first multimodal evaluation datasets?
  • Which LLM provider and model should be used as the judge for the PoC?
  • Should the first evaluation target be a mocked Palisade response, a local prototype function, or both?
  • Which MLflow backend store is approved if MLflow runs beyond a local machine?
  • Should the evaluation runner live inside this repo or in a separate validation repo once implementation starts?
