Pre-Selection¶

Assemble data cleaning and asset filtering into a single sklearn Pipeline.

The pre-selection module takes a raw return DataFrame, cleans it (validation, outlier treatment, imputation), and then progressively narrows the asset universe through a series of skfolio selectors. The result is a tidy, NaN-free DataFrame containing only the assets that pass every filter -- ready to feed into moment estimation and portfolio optimization.

Overview¶

The module follows the same frozen dataclass config + factory function pattern used throughout the optimizer library:

Component	Role
`PreSelectionConfig`	Frozen `@dataclass` holding every pipeline parameter as a plain primitive, enum, or `None`. Serialisable and suitable for hyperparameter sweeps.
`build_preselection_pipeline()`	Factory function that reads a `PreSelectionConfig` and returns a fully assembled `sklearn.pipeline.Pipeline`.

Because the config stores only primitives, it can be serialised to JSON/YAML, persisted to a database, or passed across process boundaries without issue. Non-serialisable objects (such as the sector_mapping dictionary) are passed as keyword arguments to the factory, not stored in the config.

from optimizer.pre_selection import PreSelectionConfig, build_preselection_pipeline

config = PreSelectionConfig(correlation_threshold=0.90, top_k=30)
pipeline = build_preselection_pipeline(config, sector_mapping={"AAPL": "Tech", "JPM": "Financials"})

clean_returns = pipeline.fit_transform(returns_df)
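The serialisation claim above can be illustrated with a small round trip. This is a minimal sketch, not the library's code: `DemoConfig` is a hypothetical stand-in that mirrors only three of the real config fields, and it assumes nothing beyond the standard library.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class DemoConfig:
    """Hypothetical stand-in for PreSelectionConfig (three fields only)."""

    correlation_threshold: float = 0.95
    top_k: Optional[int] = None
    use_pareto: bool = False


cfg = DemoConfig(correlation_threshold=0.90, top_k=30)

# Every field is a primitive, so json needs no custom encoder.
payload = json.dumps(asdict(cfg))

# Rebuilding from the parsed dict yields a config equal to the original.
restored = DemoConfig(**json.loads(payload))
```

A dictionary such as sector_mapping would also survive JSON, but keeping it out of the config means the config stays small and hashable, which is convenient when it is used as a key in a sweep.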

Pipeline Steps¶

build_preselection_pipeline assembles the following steps in this exact order. The first six steps are always present; the last three are conditional on config flags.

validate --> outliers --> impute --> select_complete --> drop_zero_variance
--> drop_correlated --> [select_k] --> [select_pareto]
--> [select_non_expiring]

Steps in brackets are optional and only added when the corresponding config parameter is set.
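The conditional assembly can be sketched as a function that returns only the step names. This is an illustration of the rules described on this page, not the factory's actual source; `planned_steps` is a hypothetical helper, while the step names and config fields are the ones documented here.

```python
# Step names that are always present, in pipeline order.
ALWAYS_ON = [
    "validate",
    "outliers",
    "impute",
    "select_complete",
    "drop_zero_variance",
    "drop_correlated",
]


def planned_steps(
    top_k=None,
    use_pareto=False,
    use_non_expiring=False,
    expiration_lookahead=None,
):
    """Return the step names a config with these fields would produce."""
    steps = list(ALWAYS_ON)
    if top_k is not None:
        steps.append("select_k")
    if use_pareto:
        steps.append("select_pareto")
    # Both conditions must hold; otherwise the step is silently skipped.
    if use_non_expiring and expiration_lookahead is not None:
        steps.append("select_non_expiring")
    return steps


default_steps = planned_steps()
custom_steps = planned_steps(top_k=30, use_pareto=True)
skipped = planned_steps(use_non_expiring=True)  # no lookahead given
```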

1. `validate` -- DataValidator¶

Replaces inf, -inf, and returns whose absolute value exceeds max_abs_return with NaN. This is a stateless transformer that acts as a first-pass sanity check, catching data errors (e.g. a return of 50 000%) before they corrupt downstream statistics.

Parameter	Config field	Default
`max_abs_return`	`max_abs_return`	`10.0` (i.e. 1 000%)

Why so generous?

The default threshold of 10.0 (1 000%) is deliberately high. It catches obvious data errors while preserving legitimate large moves such as penny-stock spikes or circuit-breaker events. Tighten it to 5.0 or lower for conservative universes.
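To make the rule concrete, here is a minimal sketch of the masking logic on a plain list of returns. `mask_invalid` is a hypothetical helper written for illustration; the real DataValidator operates on DataFrames and its internals may differ.

```python
import math


def mask_invalid(returns, max_abs_return=10.0):
    """Replace infinities and implausibly large returns with NaN."""
    cleaned = []
    for r in returns:
        # abs(inf) exceeds any finite threshold, so one test covers
        # inf, -inf and oversized finite values. Existing NaN stays NaN.
        if math.isnan(r) or abs(r) > max_abs_return:
            cleaned.append(math.nan)
        else:
            cleaned.append(r)
    return cleaned


# 500.0 is a return of 50 000%, an obvious data error.
raw = [0.01, -0.03, math.inf, 500.0, -math.inf, 0.02]
cleaned = mask_invalid(raw)

# A tighter threshold also removes a 600% move that the default keeps.
strict = mask_invalid([6.0], max_abs_return=5.0)
```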

2. `outliers` -- OutlierTreater¶

Three-group z-score methodology applied per-column:

Group	Condition	Action
Data errors	`|z| >= remove_threshold`	Replaced with `NaN`
Outliers	`winsorize_threshold <= |z| < remove_threshold`	Winsorised to `mu +/- winsorize_threshold * sigma`
Normal	`|z| < winsorize_threshold`	Kept as-is

The z-scores are computed from the training data statistics (mu_ and sigma_ stored during fit). Zero-variance columns (sigma = 0, i.e. constant series) are assigned a z-score of 0 and left for DropZeroVariance to handle.

Parameter	Config field	Default
`winsorize_threshold`	`winsorize_threshold`	`3.0`
`remove_threshold`	`remove_threshold`	`10.0`

Validation constraint

winsorize_threshold must be strictly less than remove_threshold. The config raises ValueError at construction time if this invariant is violated.
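The three groups can be sketched for a single column as follows. This is an illustrative reimplementation under stated assumptions, not OutlierTreater itself: `treat` is a hypothetical function, and mu and sigma are passed in directly to stand for the statistics stored during fit.

```python
import math


def treat(values, mu, sigma, winsorize_threshold=3.0, remove_threshold=10.0):
    """Apply the three-group z-score rule to one column of returns."""
    if not winsorize_threshold < remove_threshold:
        raise ValueError("winsorize_threshold must be < remove_threshold")
    out = []
    for x in values:
        if math.isnan(x):
            out.append(x)  # already missing; left for the imputer
            continue
        # Zero-variance columns get z = 0 and pass through unchanged.
        z = 0.0 if sigma == 0 else (x - mu) / sigma
        if abs(z) >= remove_threshold:
            out.append(math.nan)  # data error
        elif abs(z) >= winsorize_threshold:
            # Outlier: clip to the winsorisation boundary on the same side.
            out.append(mu + math.copysign(winsorize_threshold * sigma, z))
        else:
            out.append(x)  # normal observation
    return out


# With mu = 0 and sigma = 1%, the z-scores are roughly 0.5, 4, -5 and 20.
treated = treat([0.005, 0.04, -0.05, 0.2], mu=0.0, sigma=0.01)
```

Here the first value is kept, the second and third are pulled in to plus and minus 3%, and the last becomes NaN.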

3. `impute` -- SectorImputer¶

Fills remaining NaN values using leave-one-out sector cross-sectional averages. For each timestep and each missing cell, the imputer computes the mean of all other assets in the same sector. When the entire sector is NaN for a given row, it falls back to the global cross-sectional mean.

When sector_mapping is None, all assets are treated as a single sector, which reduces to plain global cross-sectional mean imputation.

Parameter	Config field	Default
`fallback_strategy`	`imputation_fallback`	`"global_mean"`
`sector_mapping`	Factory kwarg (not in config)	`None`

sector_mapping is a factory argument

The sector mapping is a dict[str, str] passed directly to build_preselection_pipeline(sector_mapping=...), not stored in the frozen config. This keeps the config serialisable. Columns absent from the mapping are assigned to a catch-all "__unmapped__" sector.
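For a single timestep, the imputation rule can be sketched like this. `impute_row` is a hypothetical helper operating on a dict of ticker to return; it assumes the global fallback is the mean of the values observed in that row, which is one reading of the description above rather than a confirmed detail of SectorImputer.

```python
import math


def impute_row(row, sector_mapping=None):
    """Fill NaN cells in one cross-section with sector (or global) means."""

    def sector_of(name):
        if sector_mapping is None:
            return "__all__"  # single sector: plain cross-sectional mean
        return sector_mapping.get(name, "__unmapped__")

    observed = {k: v for k, v in row.items() if not math.isnan(v)}
    global_mean = sum(observed.values()) / len(observed) if observed else math.nan

    filled = {}
    for name, value in row.items():
        if not math.isnan(value):
            filled[name] = value
            continue
        # The missing cell is not in `observed`, so the sector mean is
        # automatically leave-one-out.
        peers = [v for k, v in observed.items() if sector_of(k) == sector_of(name)]
        filled[name] = sum(peers) / len(peers) if peers else global_mean
    return filled


mapping = {"AAPL": "Tech", "MSFT": "Tech", "JPM": "Financials",
           "BAC": "Financials", "JNJ": "Healthcare"}
row = {"AAPL": 0.02, "MSFT": math.nan, "JPM": -0.01,
       "BAC": math.nan, "JNJ": math.nan}

filled = impute_row(row, mapping)   # sector-aware
flat = impute_row(row)              # sector_mapping=None
```

In this row MSFT takes the AAPL value, BAC takes the JPM value, and JNJ, whose sector has no observed peer, falls back to the global mean of 0.5%.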

4. `select_complete` -- SelectComplete¶

Drops any asset (column) that still contains NaN after imputation. In practice, when SectorImputer runs correctly this step is a no-op, but it acts as a safety net to guarantee a fully complete matrix for downstream selectors that cannot handle missing data.

This step has no configurable parameters.

5. `drop_zero_variance` -- DropZeroVariance¶

Drops any asset with zero variance (constant return series). Constant columns add no information and cause numerical issues in covariance estimation.

This step has no configurable parameters.

6. `drop_correlated` -- DropCorrelated¶

Drops one asset from each pair whose pairwise correlation exceeds the threshold. This reduces redundancy in the universe and improves conditioning of the covariance matrix.

Parameter	Config field	Default
`threshold`	`correlation_threshold`	`0.95`
`absolute`	`correlation_absolute`	`False`

Absolute correlation

When correlation_absolute=True, the selector uses |corr| rather than raw correlation, so that strong negative correlations are also flagged. This is useful when you want to reduce all forms of linear dependence.
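The effect of the absolute flag can be seen in a small greedy filter. This sketch is a deliberate simplification: it keeps the first asset of each correlated pair in column order, whereas the actual DropCorrelated selector may choose which asset to drop by a different rule. `pearson` and `drop_correlated` are hypothetical helpers.

```python
import math


def pearson(x, y):
    """Sample Pearson correlation of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    # Zero-variance columns were removed by the previous step.
    return cov / math.sqrt(vx * vy)


def drop_correlated(returns, threshold=0.95, absolute=False):
    """Keep an asset only if it is not too correlated with any kept asset."""
    kept = []
    for name, series in returns.items():
        corrs = (pearson(series, returns[other]) for other in kept)
        if not any((abs(c) if absolute else c) > threshold for c in corrs):
            kept.append(name)
    return kept


base = [0.01, -0.02, 0.03, 0.00]
returns = {
    "a": base,
    "b": [2 * r for r in base],        # perfectly correlated with a
    "c": [-r for r in base],           # perfectly anti-correlated with a
    "d": [0.02, 0.01, -0.01, 0.03],    # only moderately related
}

raw_kept = drop_correlated(returns)                  # raw correlation
abs_kept = drop_correlated(returns, absolute=True)   # |corr|
```

With raw correlation only b is dropped; with absolute correlation the mirror-image asset c is dropped as well.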

7. `select_k` -- SelectKExtremes (optional)¶

Only added when top_k is not None. Uses SelectKExtremes to keep the k assets with the highest mean return, or the lowest when top_k_highest=False.

Parameter	Config field	Default
`k`	`top_k`	`None` (step omitted)
`highest`	`top_k_highest`	`True`

8. `select_pareto` -- SelectNonDominated (optional)¶

Only added when use_pareto=True. Applies a Pareto non-dominance filter in mean return vs. variance space, retaining only assets that are not dominated, i.e. no other asset is at least as good on both mean return and variance and strictly better on one.

Parameter	Config field	Default
`min_n_assets`	`pareto_min_assets`	`None`
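Non-dominance in mean return vs. variance can be sketched in a few lines. This is a generic Pareto filter written for illustration, assuming higher mean and lower variance are both preferred; `non_dominated` is a hypothetical helper, it does not model pareto_min_assets, and the measures used by SelectNonDominated itself may differ.

```python
def dominates(a, b):
    """True if a = (mean, variance) is at least as good as b on both
    axes and strictly better on at least one."""
    return a[0] >= b[0] and a[1] <= b[1] and (a[0] > b[0] or a[1] < b[1])


def non_dominated(stats):
    """Return the names of assets that no other asset dominates."""
    return [
        name
        for name, s in stats.items()
        if not any(dominates(o, s) for other, o in stats.items() if other != name)
    ]


# name -> (mean return, variance); illustrative numbers only
stats = {
    "A": (0.10, 0.04),
    "B": (0.08, 0.02),
    "C": (0.07, 0.03),  # dominated by B: lower mean and higher variance
    "D": (0.12, 0.09),
}

survivors = non_dominated(stats)
```

Only C is removed: B beats it on both axes, while A, B and D each offer a trade-off that no other asset improves on.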

9. `select_non_expiring` -- SelectNonExpiring (optional)¶

Only added when both use_non_expiring=True and expiration_lookahead is not None. Removes assets that expire within the specified lookahead window, which is relevant for futures and options universes.

Parameter	Config field	Default
`expiration_lookahead`	`expiration_lookahead`	`None` (step omitted)

Both flags required

Setting use_non_expiring=True without providing expiration_lookahead silently skips this step. The step is only added when both conditions are met.

Configuration Reference¶

All fields of PreSelectionConfig with their types, defaults, and the pipeline step they control:

Field	Type	Default	Pipeline step	Description
`max_abs_return`	`float`	`10.0`	`validate`	Maximum absolute return before treating as data error
`winsorize_threshold`	`float`	`3.0`	`outliers`	Z-score boundary between normal observations and outliers
`remove_threshold`	`float`	`10.0`	`outliers`	Z-score boundary between outliers and data errors
`outlier_method`	`str`	`"time_series"`	`outliers`	Outlier detection approach (only `"time_series"` supported)
`imputation_fallback`	`str`	`"global_mean"`	`impute`	Fallback when sector data unavailable
`correlation_threshold`	`float`	`0.95`	`drop_correlated`	Pairwise correlation above which an asset is dropped
`correlation_absolute`	`bool`	`False`	`drop_correlated`	Whether to use absolute correlation values
`top_k`	`int | None`	`None`	`select_k`	If set, keep only the k assets with highest/lowest mean return
`top_k_highest`	`bool`	`True`	`select_k`	Select highest (`True`) or lowest (`False`) mean return
`use_pareto`	`bool`	`False`	`select_pareto`	Whether to apply Pareto non-dominance filter
`pareto_min_assets`	`int | None`	`None`	`select_pareto`	Minimum assets to retain after Pareto filtering
`use_non_expiring`	`bool`	`False`	`select_non_expiring`	Whether to remove soon-expiring assets
`expiration_lookahead`	`int | None`	`None`	`select_non_expiring`	Calendar days to look ahead for expiring assets
`is_log_normal`	`bool`	`True`	(stored for downstream use)	Whether returns are assumed log-normal for multi-period scaling

Validation rules¶

The config validates the following constraints at construction time (__post_init__):

winsorize_threshold < remove_threshold -- the winsorisation boundary must lie strictly below the removal boundary.
0.0 < correlation_threshold <= 1.0 -- must be a valid correlation value.
max_abs_return > 0 -- must be strictly positive.

Violating any of these raises ValueError immediately.
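A frozen dataclass can enforce such rules in `__post_init__`, because validation only reads fields and never assigns them. The sketch below uses a hypothetical `ThresholdConfig` with four of the documented fields; the error messages are illustrative, not the library's.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThresholdConfig:
    """Hypothetical stand-in showing the three documented constraints."""

    max_abs_return: float = 10.0
    winsorize_threshold: float = 3.0
    remove_threshold: float = 10.0
    correlation_threshold: float = 0.95

    def __post_init__(self):
        if not self.winsorize_threshold < self.remove_threshold:
            raise ValueError("winsorize_threshold must be < remove_threshold")
        if not 0.0 < self.correlation_threshold <= 1.0:
            raise ValueError("correlation_threshold must be in (0, 1]")
        if not self.max_abs_return > 0:
            raise ValueError("max_abs_return must be > 0")


def is_valid(**fields):
    """Return True if a config with these fields can be constructed."""
    try:
        ThresholdConfig(**fields)
    except ValueError:
        return False
    return True


ok_default = is_valid()
bad_thresholds = is_valid(winsorize_threshold=10.0, remove_threshold=10.0)
```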

Presets¶

PreSelectionConfig provides two class-method presets for common scenarios.

`for_daily_annual()`¶

Sensible defaults for daily equity returns over an approximately one-year horizon. This is equivalent to PreSelectionConfig() with all defaults.

cfg = PreSelectionConfig.for_daily_annual()
# max_abs_return=10.0, winsorize_threshold=3.0, remove_threshold=10.0,
# correlation_threshold=0.95, is_log_normal=True
# No optional steps (top_k, pareto, non_expiring all off)

`for_conservative()`¶

Tighter filters for a more conservative universe. Lowers the data-error and outlier thresholds, tightens the correlation filter, and activates SelectKExtremes to cap the universe at 50 assets.

cfg = PreSelectionConfig.for_conservative()
# max_abs_return=5.0, winsorize_threshold=2.5, remove_threshold=8.0,
# correlation_threshold=0.85, top_k=50, top_k_highest=True,
# is_log_normal=True

Code Examples¶

Basic usage with default config¶

from skfolio.datasets import load_sp500_dataset
from skfolio.preprocessing import prices_to_returns
from optimizer.pre_selection import PreSelectionConfig, build_preselection_pipeline

# Load data and convert to returns
prices = load_sp500_dataset()
returns = prices_to_returns(prices)

# Build pipeline with sensible defaults
config = PreSelectionConfig.for_daily_annual()
pipeline = build_preselection_pipeline(config)

# Fit and transform
clean_returns = pipeline.fit_transform(returns)
print(f"Input: {returns.shape[1]} assets -> Output: {clean_returns.shape[1]} assets")

Conservative preset with sector-aware imputation¶

from optimizer.pre_selection import PreSelectionConfig, build_preselection_pipeline

# Sector mapping for imputation
sector_mapping = {
    "AAPL": "Technology",
    "MSFT": "Technology",
    "JPM": "Financials",
    "BAC": "Financials",
    "JNJ": "Healthcare",
    "PFE": "Healthcare",
    # ... more tickers
}

config = PreSelectionConfig.for_conservative()
pipeline = build_preselection_pipeline(config, sector_mapping=sector_mapping)
clean_returns = pipeline.fit_transform(returns)

Custom configuration¶

from optimizer.pre_selection import PreSelectionConfig, build_preselection_pipeline

config = PreSelectionConfig(
    max_abs_return=5.0,              # Strict data-error threshold
    winsorize_threshold=2.5,         # Tighter winsorisation
    remove_threshold=8.0,            # Lower removal boundary
    correlation_threshold=0.90,      # Drop assets correlated above 90%
    correlation_absolute=True,       # Use |corr| (catches negative correlation too)
    top_k=30,                        # Keep top 30 by mean return
    top_k_highest=True,              # Highest mean return
    use_pareto=True,                 # Apply Pareto filter after top-k
    pareto_min_assets=15,            # Keep at least 15 assets from Pareto
)

pipeline = build_preselection_pipeline(config)
clean_returns = pipeline.fit_transform(returns)

Futures universe with expiration filtering¶

from optimizer.pre_selection import PreSelectionConfig, build_preselection_pipeline

config = PreSelectionConfig(
    use_non_expiring=True,
    expiration_lookahead=90,  # Drop contracts expiring within 90 days
)

pipeline = build_preselection_pipeline(config)
clean_returns = pipeline.fit_transform(futures_returns)

Inspecting and tuning pipeline parameters¶

from optimizer.pre_selection import build_preselection_pipeline

pipeline = build_preselection_pipeline()

# List all accessible parameters
params = pipeline.get_params()
for key in sorted(params):
    if "__" in key:
        print(f"  {key} = {params[key]}")

# Modify parameters after construction
pipeline.set_params(
    outliers__winsorize_threshold=2.5,
    drop_correlated__threshold=0.90,
    validate__max_abs_return=5.0,
)

Using pre-selection inside a full optimization pipeline¶

from skfolio.preprocessing import prices_to_returns
from optimizer.pre_selection import PreSelectionConfig, build_preselection_pipeline
from optimizer.pipeline import run_full_pipeline

prices = ...  # pd.DataFrame of asset prices

# Pre-selection is handled internally by run_full_pipeline,
# but you can also run it explicitly for inspection:
config = PreSelectionConfig(correlation_threshold=0.90, top_k=50)
preselection_pipe = build_preselection_pipeline(config)

returns = prices_to_returns(prices)
clean_returns = preselection_pipe.fit_transform(returns)
print(f"Selected {clean_returns.shape[1]} assets from {returns.shape[1]}")

Gotchas¶

Pre-selection must run inside CV folds

When using cross-validation (walk-forward, CPCV, etc.), the pre-selection pipeline must be part of the overall sklearn pipeline that gets re-fit on each training fold. If you run pre-selection once on the full dataset and then cross-validate, you introduce data leakage -- the OutlierTreater z-score statistics and DropCorrelated correlation matrix will have been computed on data that includes the validation period.

The optimizer library handles this correctly when the pre-selection pipeline is composed inside the broader sklearn Pipeline that run_full_pipeline builds.
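A toy calculation shows how strongly leaked statistics can distort the outlier step. The numbers are invented for illustration, and the sample standard deviation is an assumption about how sigma is estimated.

```python
import statistics

# Five training observations followed by one validation observation
# that contains an extreme move.
returns = [0.01, -0.02, 0.015, 0.00, 0.005, 0.50]
train, valid = returns[:5], returns[5:]

# Correct: statistics come from the training fold only.
mu_train = statistics.fmean(train)
sigma_train = statistics.stdev(train)
z_clean = (valid[0] - mu_train) / sigma_train

# Leaky: statistics computed on the full sample, validation included.
mu_full = statistics.fmean(returns)
sigma_full = statistics.stdev(returns)
z_leaky = (valid[0] - mu_full) / sigma_full

print(f"z with training-fold statistics: {z_clean:.1f}")
print(f"z with full-sample statistics:   {z_leaky:.1f}")
```

With training-fold statistics the move scores far above the default remove_threshold of 10 and would be treated as a data error. With full-sample statistics the same observation inflates its own mean and standard deviation, scores below the winsorize_threshold of 3, and passes through as a normal value.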

Parameter names use double-underscore notation

All transformer hyper-parameters are accessible via get_params() using sklearn's step_name__param_name notation. For example:

validate__max_abs_return
outliers__winsorize_threshold
outliers__remove_threshold
drop_correlated__threshold
drop_correlated__absolute
select_k__k (only when top_k is set)

This is the notation you must use for set_params() and for hyperparameter tuning grids.

prices_to_returns runs outside the pipeline

The pre-selection pipeline operates on a return DataFrame, not a price DataFrame. The conversion from prices to returns (skfolio.preprocessing.prices_to_returns) changes data semantics and is therefore performed upstream, before the pipeline runs. This is a project-wide convention.

SelectNonExpiring requires both flags

Setting use_non_expiring=True alone does not add the step. You must also provide expiration_lookahead (an integer number of calendar days). Without it, the step is silently skipped.

The config is frozen

PreSelectionConfig is a frozen dataclass. You cannot mutate fields after construction. To change a parameter, create a new config instance:

# This raises dataclasses.FrozenInstanceError (a subclass of AttributeError):
config.correlation_threshold = 0.85

# Do this instead:
from dataclasses import replace
new_config = replace(config, correlation_threshold=0.85)

Quick Reference¶

from optimizer.pre_selection import PreSelectionConfig, build_preselection_pipeline

# Presets
cfg = PreSelectionConfig.for_daily_annual()    # sensible defaults
cfg = PreSelectionConfig.for_conservative()    # tighter filters, top_k=50

# Factory
pipe = build_preselection_pipeline(config=cfg, sector_mapping=None)

# Pipeline step names (default)
# validate -> outliers -> impute -> select_complete -> drop_zero_variance -> drop_correlated

# Optional steps (added when config flags are set)
# select_k            (top_k is not None)
# select_pareto       (use_pareto=True)
# select_non_expiring (use_non_expiring=True AND expiration_lookahead is not None)

# Key parameter paths for tuning
# validate__max_abs_return
# outliers__winsorize_threshold
# outliers__remove_threshold
# drop_correlated__threshold
# drop_correlated__absolute
# select_k__k
# select_k__highest
