Preserve dataset lineage and feature provenance for production ML artifacts
data-003
Intent
Keep production ML artifacts traceable back to the exact datasets and feature definitions that produced them.
Applicability
Applies to training jobs, experiment tracking, feature registration, and model publication. Return unknown when artifact tracking is handled entirely by an external system that cannot be inspected from the code under review.
What to inspect
Run metadata, dataset source identifiers, snapshot references, feature definition registries, and artifact manifests.
Pass criteria
Training artifacts record the dataset source or snapshot and the feature definitions or lineage needed to reproduce the run.
Fail criteria
Published or promoted artifacts lack dataset origin, feature provenance, or enough metadata to trace regressions back to source data.
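The pass and fail criteria can be approximated as a check over an artifact manifest. The following is a minimal sketch, not a prescribed schema: the manifest is assumed to be a flat mapping, and the field names (`dataset_source`, `dataset_snapshot`, `feature_definitions`, `feature_lineage`) and the function `lineage_gaps` are hypothetical.

```python
from typing import Any, List, Mapping

# Hypothetical field names; adapt them to what your tracking system records.
DATASET_FIELDS = ("dataset_source", "dataset_snapshot")
FEATURE_FIELDS = ("feature_definitions", "feature_lineage")


def lineage_gaps(manifest: Mapping[str, Any]) -> List[str]:
    """Return the lineage groups for which the manifest has no non-empty field."""
    gaps = []
    # Dataset origin is satisfied by either a source identifier or a snapshot.
    if not any(manifest.get(field) for field in DATASET_FIELDS):
        gaps.append("dataset origin")
    # Feature provenance is satisfied by either definitions or lineage.
    if not any(manifest.get(field) for field in FEATURE_FIELDS):
        gaps.append("feature provenance")
    return gaps
```

An empty result corresponds to a pass. Any returned entry names a gap that would fail a published or promoted artifact.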
Do not flag
Ephemeral exploratory notebooks that do not produce shared or production ML artifacts.
Confidence guidance
HIGH when artifact metadata visibly omits dataset or feature lineage. MEDIUM when lineage is not visible in the code but a tracking system may capture it elsewhere. LOW when only model code changed and data loading and artifact logging are untouched.
Remediation
Attach dataset identifiers, source locations, version or snapshot info, and feature lineage to tracked runs and promoted artifacts.
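One way to assemble the metadata described above is to build a flat tag mapping before a run is logged or an artifact is promoted. This is a sketch under stated assumptions: the tag keys (`dataset.source`, `dataset.snapshot`, `dataset.sha256`), the helper names, and the use of a SHA-256 content hash as the snapshot fingerprint are illustrative choices, not requirements of this rule.

```python
import hashlib
from typing import Dict, Iterable


def dataset_fingerprint(rows: Iterable[bytes]) -> str:
    """Hash dataset content so the exact snapshot can be identified later."""
    digest = hashlib.sha256()
    for row in rows:
        digest.update(row)
    return digest.hexdigest()


def lineage_tags(
    source_uri: str, snapshot: str, feature_set: str, fingerprint: str
) -> Dict[str, str]:
    """Build a flat tag mapping to attach to a tracked run or promoted artifact."""
    # Tag keys are hypothetical; keep them consistent across runs so
    # regressions can be traced back to the same fields.
    return {
        "dataset.source": source_uri,
        "dataset.snapshot": snapshot,
        "dataset.sha256": fingerprint,
        "feature_set": feature_set,
    }
```

With MLflow, a mapping like this could be passed to `mlflow.set_tags` inside an active run, alongside `mlflow.log_input` for the dataset itself.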
Pass example
import mlflow
# `dataset` is an mlflow.data Dataset built from the training data.
mlflow.log_input(dataset, context="training")
mlflow.set_tag("feature_set", "user_risk_v3")
Fail example
# The model is logged with no dataset input or feature lineage on the run.
mlflow.sklearn.log_model(model, "model")