Keep missing values distinguishable from real values

data-004

Intent

Avoid collapsing missingness into a valid business or model value.

Applicability

Applies to feature engineering, ETL cleanup, and imputation logic. Return unknown when missing-value handling is hidden.

What to inspect

Imputers, fill defaults, sentinel values, missingness indicator columns, and schema nullability.

Pass criteria

Missing values remain distinguishable through explicit nulls, sentinels outside the valid domain, or paired missingness indicators.

Fail criteria

Missing values are replaced with ordinary real values such as 0, empty strings, or valid categories without any way to tell they were absent.

Do not flag

Domains where the replacement value is provably impossible and is documented as a sentinel.

Confidence guidance

HIGH when a valid domain value is used as an unmarked fill. MEDIUM when the valid range is inferred. LOW when downstream indicators may exist out of scope.

Remediation

Preserve nullability or add explicit missingness markers so models and business logic can distinguish absent from observed data.

Pass example

df["age_missing"] = df["age"].isna()
df["age"] = df["age"].fillna(df["age"].median())

Fail example

df["age"] = df["age"].fillna(0)

Sources

  • Designing Machine Learning Systems — Chip Huyen, 2022 book