Data Python (ML) active any

Prevent leakage in Python ML splitting and preprocessing

python-ml-001

Intent

Prevent target, time, group, and preprocessing leakage in Python ML workflows so evaluation stays trustworthy.

Applicability

Applies to scikit-learn and similar Python training code, notebooks, and pipelines. Return unknown when splitting and transform fitting happen entirely outside the diff.

What to inspect

train_test_split, GroupKFold, time-based splits, Pipeline or ColumnTransformer, target-derived features, augmentation, imputers, scalers, encoders, and feature selectors.

Pass criteria

Code splits before fitting learned transforms, uses grouped or chronological splits where the data demands it, keeps augmentation and resampling on training data only, and packages preprocessing with the estimator so serving and evaluation use the same fitted path.
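As a rough illustration of the "resampling on training data only" criterion, the sketch below splits first, oversamples the minority class inside the training partition, and scores an untouched holdout through the same fitted pipeline. The data is synthetic and the variable names (`X_train_bal`, `holdout_accuracy`) are made up for this example; `sklearn.utils.resample` is just one simple way to oversample, not a required one.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample

# Synthetic, imbalanced data used only for illustration (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.array([1] * 20 + [0] * 180)

# 1. Split first; stratify so the holdout keeps the original class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# 2. Oversample the minority class inside the training partition only.
minority = X_train[y_train == 1]
majority = X_train[y_train == 0]
minority_up = resample(
    minority, replace=True, n_samples=len(majority), random_state=0
)
X_train_bal = np.vstack([majority, minority_up])
y_train_bal = np.array([0] * len(majority) + [1] * len(minority_up))

# 3. Scaler and model are fit together, on (resampled) training rows only.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train_bal, y_train_bal)

# 4. The untouched holdout goes through the same fitted pipeline.
holdout_accuracy = model.score(X_test, y_test)
```

The holdout keeps its original class balance, so the reported score reflects the distribution the model would actually see.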

Fail criteria

Transformers or selectors are fit on all rows before splitting, holdout examples are augmented or oversampled, rows from the same group land in more than one partition, features rely on information that is unavailable at prediction time, or inference code manually reimplements training transforms instead of reusing the fitted pipeline.
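To make the grouped-entity failure concrete, the sketch below compares a row-level random split with a group-aware one and checks which group ids end up on both sides. The data and the helper `shared_groups` are hypothetical and exist only for this illustration.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit, train_test_split

# Hypothetical data: 40 entities (for example, patients) with 5 rows each.
groups = np.repeat(np.arange(40), 5)
X = np.arange(len(groups)).reshape(-1, 1)


def shared_groups(train_idx, test_idx):
    """Return the group ids that appear in both partitions."""
    return set(groups[train_idx]) & set(groups[test_idx])


# Row-level random split: rows of one entity can land on both sides.
idx = np.arange(len(groups))
train_idx, test_idx = train_test_split(idx, test_size=0.25, random_state=0)
leaky_overlap = shared_groups(train_idx, test_idx)

# Group-aware split: each entity is assigned to exactly one side.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
g_train_idx, g_test_idx = next(splitter.split(X, groups=groups))
clean_overlap = shared_groups(g_train_idx, g_test_idx)
```

A non-empty `leaky_overlap` is the pattern this rule treats as a failure; `clean_overlap` should be empty.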

Do not flag

Pure rule-based feature extraction, stateless text cleanup, or visible shared training utilities that already enforce correct split-fit order.

Confidence guidance

HIGH when .fit() or .fit_transform() touches all data before the split, or when grouped or time-aware splitting is clearly missing. MEDIUM when a helper that is not visible in the diff may already wrap the correct pipeline. LOW when only config is visible.

Remediation

Move split logic ahead of any learned preprocessing, wrap preprocessing and model steps in one fitted pipeline, and use group-aware or time-aware split APIs when rows are correlated within groups or over time.
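The time-aware part of this remediation could look roughly like the sketch below. It assumes the rows are already sorted oldest to newest and uses synthetic data; `Ridge` and the fold count are arbitrary choices for the example, not recommendations.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic rows, assumed to be sorted oldest to newest.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=120)

# Each fold trains on an earlier window and validates on the rows after it.
cv = TimeSeriesSplit(n_splits=4)
for train_idx, test_idx in cv.split(X):
    assert train_idx.max() < test_idx.min()

# Because the scaler lives inside the pipeline, cross_val_score refits it
# on the training window of every fold; validation rows never influence it.
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
scores = cross_val_score(model, X, y, cv=cv)
```

For grouped data the same pattern should carry over by passing a `GroupKFold` instance as `cv` together with the `groups` argument of `cross_val_score`.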

Pass example

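# Split first; shuffle=False keeps row order, so time-sorted data gets a chronological holdout.
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=False)
# Scaler and classifier are fit together, on training rows only.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)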

Fail example

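# Leakage: the scaler and the selector are both fit on every row, including future test rows.
X_all = StandardScaler().fit_transform(X)
selected = SelectKBest(k=20).fit_transform(X_all, y)
# The split comes too late: test rows have already shaped the scaling statistics and feature scores.
X_train, X_test, y_train, y_test = train_test_split(selected, y)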
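One possible repair of the fail example is sketched below: the split moves to the top, and the scaler and selector become pipeline steps so they are fit on training rows only. The synthetic `X` and `y`, the explicit `f_classif` score function, and `max_iter=1000` are assumptions made to keep the sketch self-contained.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X and y (assumption for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 50))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Split before anything is fit.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scaling, feature selection, and the classifier share one fitted path.
model = make_pipeline(
    StandardScaler(),
    SelectKBest(f_classif, k=20),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)

# Test rows are only ever transformed, never used for fitting.
test_accuracy = model.score(X_test, y_test)
```

The same `model` object can then be serialized and reused at inference time, which avoids reimplementing the transforms by hand.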

Sources

  • Machine Learning Engineering, Andriy Burkov (book)
  • Common pitfalls and recommended practices (article)