

optimizer.preprocessing

Custom sklearn-compatible preprocessing transformers.

SectorImputer

Bases: BaseEstimator, TransformerMixin

Fill NaN values using sector cross-sectional averages.

For each timestep (row), missing values in a given asset column are replaced with the mean of all other assets in the same sector for that timestep (leave-one-out sector average). If the entire sector is NaN for a timestep, the global cross-sectional mean is used as a fallback.

When sector_mapping is None, all assets are treated as belonging to a single sector — effectively a global cross-sectional mean imputation.

Parameters

sector_mapping : dict[str, str] or None, default=None
    Maps column name (ticker) → sector label. Columns absent from the mapping are assigned to a "__unmapped__" catch-all sector.
fallback_strategy : str, default="global_mean"
    What to do when the sector has no data for a timestep. Currently only "global_mean" is supported.

fit(X, y=None)

Build the internal sector → columns index.

transform(X)

Fill NaN with leave-one-out sector averages.

get_feature_names_out(input_features=None)

Return feature names (pass-through).
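The leave-one-out sector-mean rule above can be illustrated with a short pandas sketch. The helper name `sector_mean_impute` and the toy data are illustrative, not part of the module's API; the class itself wraps this logic in the fit/transform interface.

```python
import numpy as np
import pandas as pd

# Toy returns: dates x tickers, with AAPL missing on the second date.
returns = pd.DataFrame(
    {"AAPL": [0.01, np.nan], "MSFT": [0.02, 0.04], "XOM": [0.05, 0.06]},
    index=pd.to_datetime(["2024-01-02", "2024-01-03"]),
)
sector_mapping = {"AAPL": "Tech", "MSFT": "Tech", "XOM": "Energy"}

def sector_mean_impute(df, mapping):
    """Fill each NaN with the mean of the other same-sector assets on
    that date. Because only missing cells are filled, the row-wise
    sector mean (which skips NaN) already equals the leave-one-out
    mean for the missing asset."""
    out = df.copy()
    sectors = pd.Series({c: mapping.get(c, "__unmapped__") for c in df.columns})
    for sector, cols in sectors.groupby(sectors).groups.items():
        block = df[list(cols)]
        sector_mean = block.mean(axis=1)
        # Global cross-sectional mean as fallback when the whole sector is NaN.
        fill = sector_mean.fillna(df.mean(axis=1))
        for c in cols:
            out[c] = out[c].fillna(fill)
    return out

filled = sector_mean_impute(returns, sector_mapping)
# AAPL on 2024-01-03 takes the Tech sector mean, i.e. MSFT's 0.04.
```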

OutlierTreater

Bases: BaseEstimator, TransformerMixin

Three-group outlier methodology on per-column z-scores.

During fit, compute per-column mean (mu_) and standard deviation (sigma_) from the training data.

During transform, classify each observation into one of three groups based on its z-score z = (x - mu) / sigma:

  1. Data errors: |z| > remove_threshold → replaced with NaN.
  2. Outliers: winsorize_threshold <= |z| <= remove_threshold → winsorised to mu ± winsorize_threshold * sigma.
  3. Normal: |z| < winsorize_threshold → kept as-is.
Parameters

winsorize_threshold : float, default=3.0
    Z-score boundary between normal observations and outliers.
remove_threshold : float, default=10.0
    Z-score boundary between outliers and data errors.

fit(X, y=None)

Compute per-column mean and std from training data.

transform(X)

Apply three-group treatment based on z-scores.

get_feature_names_out(input_features=None)

Return feature names (pass-through).
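The three-group treatment reduces to a mask-then-clip on z-scores. This sketch uses an assumed helper name `three_group_treat` with mu and sigma passed in directly (in the class they are learned during fit); it is not the library's implementation.

```python
import numpy as np
import pandas as pd

def three_group_treat(x, mu, sigma, winsorize_threshold=3.0, remove_threshold=10.0):
    """Classify each value by z-score and treat accordingly:
    |z| > remove_threshold                      -> NaN (data error)
    winsorize_threshold <= |z| <= remove_thr.   -> clipped (outlier)
    |z| < winsorize_threshold                   -> unchanged (normal)
    """
    z = (x - mu) / sigma
    out = x.astype(float).copy()
    out[np.abs(z) > remove_threshold] = np.nan        # data errors removed
    lo = mu - winsorize_threshold * sigma
    hi = mu + winsorize_threshold * sigma
    return out.clip(lo, hi)                           # outliers winsorised; NaN untouched

col = pd.Series([0.01, 0.02, 0.5, 50.0])  # 0.5 is an outlier, 50.0 a data error
treated = three_group_treat(col, mu=0.0, sigma=0.05)
# 0.5 (z = 10) is winsorised to 0.15 (= 0 + 3 * 0.05); 50.0 (z = 1000) becomes NaN.
```

Note that clipping leaves NaN values alone, so the removal step and the winsorisation step compose cleanly.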

RegressionImputer

Bases: BaseEstimator, TransformerMixin

Fill NaN values using OLS regression from top-K correlated assets.

For each asset with missing data, fits a linear regression over the training window using the n_neighbors most correlated assets as predictors:

r_{i,t} = α + Σ_j β_j · r_{j,t} + ε_{i,t}

This preserves the covariance structure of imputed values better than sector-mean imputation.

Cold-start handling: if fewer than min_train_periods complete observations exist for an asset in training, the asset falls back to the fallback strategy at transform time. The same fallback applies per-row when any neighbor is itself NaN at the imputation timestep.

Parameters

n_neighbors : int, default=5
    Number of most-correlated assets used as regression predictors.
min_train_periods : int, default=60
    Minimum complete-row count required to fit the OLS regression for an asset. Assets below this threshold use the fallback strategy.
fallback : str, default="sector_mean"
    Imputation strategy when regression is unavailable. Only "sector_mean" is currently supported (delegates to :class:`SectorImputer`).
sector_mapping : dict[str, str] or None, default=None
    Maps ticker → sector label. Passed to the internal :class:`SectorImputer` used for fallback imputation. When None, the fallback uses a global cross-sectional mean.

fit(X, y=None)

Compute neighbor rankings and fit per-asset OLS regressions.

Parameters

X : pd.DataFrame
    Asset return DataFrame (dates × assets). May contain NaN.
y : ignored

Returns

self

transform(X)

Impute NaN values using fitted regressions and/or fallback.

Parameters

X : pd.DataFrame
    Asset return DataFrame (dates × assets). May contain NaN.

Returns

pd.DataFrame
    Copy of X with NaN values filled.

get_feature_names_out(input_features=None)

Return feature names (pass-through).
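The neighbor-ranking and regression steps can be sketched end to end for a single asset. Everything here (the synthetic data, the single-neighbor choice) is illustrative; the class fits one such regression per asset and handles the min_train_periods and fallback logic described above.

```python
import numpy as np
import pandas as pd

# Synthetic returns: A and B share a common factor, C is independent.
rng = np.random.default_rng(0)
n = 120
base = rng.normal(0, 0.01, n)
returns = pd.DataFrame({
    "A": base + rng.normal(0, 0.002, n),
    "B": base + rng.normal(0, 0.002, n),
    "C": rng.normal(0, 0.01, n),
})
returns.loc[returns.index[-1], "A"] = np.nan  # value to impute

target, n_neighbors = "A", 1

# Rank candidate predictors by absolute correlation with the target.
corr = returns.corr()[target].drop(target).abs().sort_values(ascending=False)
neighbors = list(corr.index[:n_neighbors])

# Fit OLS r_A = alpha + sum_j beta_j * r_j on complete rows only.
train = returns.dropna()
X = np.column_stack([np.ones(len(train))] + [train[c].to_numpy() for c in neighbors])
coef, *_ = np.linalg.lstsq(X, train[target].to_numpy(), rcond=None)

# Impute the missing timestep from the fitted regression.
row = returns.iloc[-1]
pred = coef[0] + sum(b * row[c] for b, c in zip(coef[1:], neighbors))
returns.loc[returns.index[-1], target] = pred
```

Because the prediction is a linear combination of contemporaneous neighbor returns, the imputed value co-moves with its neighbors, which is what preserves the cross-asset covariance structure.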

DataValidator

Bases: BaseEstimator, TransformerMixin

Replace infinities and extreme values with NaN.

Operates on a return DataFrame. Designed as the first step in a pre-selection pipeline so that downstream transformers receive well-formed numeric data.

Parameters

max_abs_return : float, default=10.0
    Any return whose absolute value exceeds this threshold is replaced with NaN. The default of 10.0 (i.e. 1,000%) is deliberately generous: it catches data errors while preserving legitimate large moves.

fit(X, y=None)

Store metadata. This transformer is stateless.

transform(X)

Replace inf / -inf and extreme returns with NaN.

get_feature_names_out(input_features=None)

Return feature names (pass-through).
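The validation step is a two-line pandas operation. The helper name `validate_returns` is illustrative, not the class itself.

```python
import numpy as np
import pandas as pd

def validate_returns(df, max_abs_return=10.0):
    """Replace +/-inf and any return with |r| > max_abs_return by NaN."""
    out = df.replace([np.inf, -np.inf], np.nan)
    return out.mask(out.abs() > max_abs_return)

raw = pd.DataFrame({"A": [0.01, np.inf, -0.02], "B": [12.5, 0.03, -np.inf]})
clean = validate_returns(raw)
# inf, -inf and the 12.5 (1,250%) entry all become NaN; normal returns survive.
```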

apply_delisting_returns(returns, delisting_returns)

Replace each ticker's last valid return with its delisting return.

This prevents survivorship bias by incorporating the terminal return that investors would have experienced when a stock was delisted.

Parameters

returns : pd.DataFrame
    Dates × tickers return matrix.
delisting_returns : dict[str, float]
    Mapping of ticker to delisting return value. Each ticker's last valid (non-NaN) return is replaced with this value.

Returns

pd.DataFrame
    A copy of returns with delisting returns applied.

Raises

DataError
    If a ticker in delisting_returns is not present in the columns of returns.
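The "last valid return" replacement can be sketched with pandas' `last_valid_index`. The helper name `apply_delisting` and the use of `KeyError` in place of the module's `DataError` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

def apply_delisting(returns, delisting_returns):
    """Overwrite each ticker's last non-NaN return with its delisting return."""
    out = returns.copy()
    for ticker, dl_ret in delisting_returns.items():
        if ticker not in out.columns:
            # The documented function raises DataError here instead.
            raise KeyError(f"{ticker} not in returns")
        last_valid = out[ticker].last_valid_index()
        if last_valid is not None:
            out.loc[last_valid, ticker] = dl_ret
    return out

returns = pd.DataFrame(
    {"GONE": [0.01, 0.02, np.nan], "LIVE": [0.03, 0.01, 0.02]},
    index=pd.date_range("2024-01-02", periods=3),
)
adjusted = apply_delisting(returns, {"GONE": -0.55})
# GONE's last valid return (0.02 on 2024-01-03) is replaced with -0.55.
```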