preprocessing
optimizer.preprocessing
Custom sklearn-compatible preprocessing transformers.
SectorImputer
Bases: BaseEstimator, TransformerMixin
Fill NaN values using sector cross-sectional averages.
For each timestep (row), missing values in a given asset column are
replaced with the mean of all other assets in the same sector for
that timestep (leave-one-out sector average). If the entire sector is
NaN for a timestep, the global cross-sectional mean is used as a
fallback.
When sector_mapping is None, all assets are treated as
belonging to a single sector — effectively a global cross-sectional
mean imputation.
Parameters
sector_mapping : dict[str, str] or None, default=None
Maps column name (ticker) → sector label. Columns absent from
the mapping are assigned to a "__unmapped__" catch-all sector.
fallback_strategy : str, default="global_mean"
What to do when the sector has no data for a timestep. Currently
only "global_mean" is supported.
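The leave-one-out behaviour described above can be illustrated with plain pandas. This is a minimal sketch, not the class's actual code: `sector_mean_impute` is a hypothetical helper, and it assumes that taking the NaN-skipping sector mean per row is equivalent to averaging the other assets in the sector, since the cell being filled is itself NaN.

```python
import numpy as np
import pandas as pd


def sector_mean_impute(returns, sector_mapping=None):
    """Hypothetical helper: fill NaNs with the same-row sector mean."""
    mapping = sector_mapping or {}
    # Columns missing from the mapping share one catch-all sector, so a
    # mapping of None puts every asset in the same (single) sector.
    labels = {col: mapping.get(col, "__unmapped__") for col in returns.columns}
    global_mean = returns.mean(axis=1)  # fallback, NaNs are skipped
    filled = returns.copy()
    for sector in set(labels.values()):
        cols = [c for c in returns.columns if labels[c] == sector]
        # A missing cell contributes nothing to this mean, so the value
        # is the average of the other assets in the sector for that row.
        sector_mean = returns[cols].mean(axis=1)
        for col in cols:
            filled[col] = filled[col].fillna(sector_mean).fillna(global_mean)
    return filled


returns = pd.DataFrame(
    {
        "AAA": [0.01, np.nan, np.nan],
        "BBB": [0.03, 0.02, np.nan],
        "CCC": [0.05, 0.04, 0.06],
    }
)
mapping = {"AAA": "tech", "BBB": "tech", "CCC": "energy"}
filled = sector_mean_impute(returns, mapping)
# Row 1: AAA takes 0.02, the only other "tech" value in that row.
# Row 2: the whole "tech" sector is NaN, so both cells take the row mean 0.06.
```

With `sector_mapping=None` the same helper reduces to a row-wise global mean fill.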
OutlierTreater
Bases: BaseEstimator, TransformerMixin
Three-group outlier methodology on per-column z-scores.
During fit, compute per-column mean (mu_) and standard deviation
(sigma_) from the training data.
During transform, classify each observation into one of three groups
based on its z-score z = (x - mu) / sigma:
- Data errors: |z| > remove_threshold → replaced with NaN.
- Outliers: winsorize_threshold <= |z| <= remove_threshold → winsorised to mu ± winsorize_threshold * sigma.
- Normal: |z| < winsorize_threshold → kept as-is.
Parameters
winsorize_threshold : float, default=3.0
Z-score boundary between normal observations and outliers.
remove_threshold : float, default=10.0
Z-score boundary between outliers and data errors.
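The three-group rule can be sketched as a clip followed by a mask. This is an illustration under stated assumptions rather than the transformer's implementation: `treat_outliers` is a hypothetical function, and the per-column statistics are passed in explicitly (in practice they would come from the training data, for example `train.mean()` and `train.std()`; which degrees-of-freedom convention the class uses is not stated here).

```python
import numpy as np
import pandas as pd


def treat_outliers(X, mu, sigma, winsorize_threshold=3.0, remove_threshold=10.0):
    """Hypothetical helper applying the three-group z-score rule."""
    z = (X - mu) / sigma  # per-column z-scores
    lower = mu - winsorize_threshold * sigma
    upper = mu + winsorize_threshold * sigma
    # Values inside the band are unchanged; values outside are pulled
    # back to mu +/- winsorize_threshold * sigma.
    clipped = X.clip(lower=lower, upper=upper, axis=1)
    # Anything beyond remove_threshold is treated as a data error.
    return clipped.mask(z.abs() > remove_threshold)


mu = pd.Series({"A": 0.0})
sigma = pd.Series({"A": 0.01})
X = pd.DataFrame({"A": [0.005, 0.05, -0.04, 0.5]})  # z = 0.5, 5, -4, 50
treated = treat_outliers(X, mu, sigma)
# 0.005 is kept, 0.05 and -0.04 are winsorised to +/-0.03, 0.5 becomes NaN.
```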
RegressionImputer
Bases: BaseEstimator, TransformerMixin
Fill NaN values using OLS regression from top-K correlated assets.
For each asset with missing data, fits a linear regression over the
training window using the n_neighbors most correlated assets as
predictors:
r_{i,t} = α + Σ_j β_j · r_{j,t} + ε_{i,t}
This preserves the covariance structure of imputed values better than sector-mean imputation.
Cold-start handling: if fewer than min_train_periods complete
observations exist for an asset in training, the asset falls back to
the fallback strategy at transform time. The same fallback applies
per-row when any neighbor is itself NaN at the imputation timestep.
Parameters
n_neighbors : int, default=5
Number of most-correlated assets used as regression predictors.
min_train_periods : int, default=60
Minimum complete-row count required to fit the OLS regression for
an asset. Assets below this threshold use the fallback strategy.
fallback : str, default="sector_mean"
Imputation strategy when regression is unavailable. Only
"sector_mean" is currently supported (delegates to
SectorImputer).
sector_mapping : dict[str, str] or None, default=None
Maps ticker → sector label. Passed to the internal
SectorImputer used for fallback imputation. When
None, the fallback uses a global cross-sectional mean.
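The regression step for a single asset can be sketched with NumPy least squares. The helpers `fit_neighbor_ols` and `impute_with_ols` are hypothetical and are not part of the module; the sketch also assumes that "most correlated" means largest absolute Pearson correlation, which the documentation does not specify. Rows that cannot be predicted are left as NaN for a fallback imputer to handle.

```python
import numpy as np
import pandas as pd


def fit_neighbor_ols(train, target, n_neighbors=5, min_train_periods=60):
    """Hypothetical helper: fit r_i = alpha + sum_j beta_j * r_j by OLS."""
    # Assumption: rank candidate predictors by absolute correlation.
    corr = train.corr()[target].drop(target).abs()
    neighbors = corr.nlargest(n_neighbors).index.tolist()
    complete = train[[target] + neighbors].dropna()
    if len(complete) < min_train_periods:
        return None  # cold start: the caller should use the fallback
    design = np.column_stack(
        [np.ones(len(complete)), complete[neighbors].to_numpy()]
    )
    coef, *_ = np.linalg.lstsq(design, complete[target].to_numpy(), rcond=None)
    return neighbors, coef[0], coef[1:]


def impute_with_ols(data, target, model):
    """Fill NaNs in `target`; rows with a NaN neighbor stay NaN."""
    neighbors, alpha, beta = model
    pred = pd.Series(alpha + data[neighbors].to_numpy() @ beta, index=data.index)
    return data[target].fillna(pred)


rng = np.random.default_rng(0)
b, c = rng.normal(size=100), rng.normal(size=100)
train = pd.DataFrame({"A": 0.5 + 2.0 * b - 1.0 * c, "B": b, "C": c})
model = fit_neighbor_ols(train, "A", n_neighbors=2)

new = pd.DataFrame({"A": [np.nan, np.nan], "B": [1.0, np.nan], "C": [2.0, 0.0]})
imputed = impute_with_ols(new, "A", model)
# Row 0 is predicted from B and C; row 1 stays NaN because B is missing.
```

Returning `None` on a short training window mirrors the cold-start rule: the decision to fall back is made at fit time and applied at transform time.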
DataValidator
Bases: BaseEstimator, TransformerMixin
Replace infinities and extreme values with NaN.
Operates on a return DataFrame. Designed as the first step in a pre-selection pipeline so that downstream transformers receive well-formed numeric data.
Parameters
max_abs_return : float, default=10.0
Any return whose absolute value exceeds this threshold is replaced
with NaN. The default of 10.0 (i.e. 1,000%) is deliberately
generous: it catches data errors while preserving legitimate
large moves.
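The validation rule amounts to two masking steps. The function below is a hypothetical stand-in written with plain pandas, shown only to make the threshold semantics concrete; it is not the transformer itself.

```python
import numpy as np
import pandas as pd


def validate_returns(returns, max_abs_return=10.0):
    """Hypothetical helper: replace infinities and extreme returns with NaN."""
    cleaned = returns.replace([np.inf, -np.inf], np.nan)
    # Strictly greater than the threshold, so a return of exactly
    # max_abs_return is kept.
    return cleaned.mask(cleaned.abs() > max_abs_return)


raw = pd.DataFrame({"A": [0.01, np.inf, -12.0, 9.5]})
clean = validate_returns(raw)
# 0.01 and 9.5 survive; inf and -12.0 become NaN.
```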
apply_delisting_returns(returns, delisting_returns)
Replace each ticker's last valid return with its delisting return.
This mitigates survivorship bias by incorporating the terminal return that investors would have experienced when a stock was delisted.
Parameters
returns : pd.DataFrame
Dates x tickers return matrix.
delisting_returns : dict[str, float]
Mapping of ticker to delisting return value. Each ticker's last
valid (non-NaN) return is replaced with this value.
Returns
pd.DataFrame
A copy of returns with delisting returns applied.
Raises
DataError
If a ticker in delisting_returns is not in returns columns.
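The documented behaviour can be reproduced with `Series.last_valid_index`. The function below is a standalone sketch, not the module's code: it raises `ValueError` as a stand-in for the library's `DataError` (whose import path is not shown here), and it assumes that a column with no valid returns is left unchanged.

```python
import numpy as np
import pandas as pd


def apply_delisting_returns_sketch(returns, delisting_returns):
    """Sketch: overwrite each ticker's last non-NaN return."""
    out = returns.copy()  # the input frame is not modified
    for ticker, value in delisting_returns.items():
        if ticker not in out.columns:
            # Stand-in for the DataError documented above.
            raise ValueError(f"{ticker!r} is not a column of returns")
        last = out[ticker].last_valid_index()
        if last is not None:  # assumption: all-NaN columns are skipped
            out.loc[last, ticker] = value
    return out


idx = pd.date_range("2024-01-01", periods=4, freq="D")
returns = pd.DataFrame(
    {"XYZ": [0.01, -0.02, np.nan, np.nan], "ABC": [0.00, 0.01, 0.02, 0.01]},
    index=idx,
)
adjusted = apply_delisting_returns_sketch(returns, {"XYZ": -0.30})
# XYZ's last valid return (2024-01-02) is now -0.30; ABC is untouched.
```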