Pre-Selection
Assemble data cleaning and asset filtering into a single sklearn Pipeline.
The pre-selection module takes a raw return DataFrame, cleans it (validation,
outlier treatment, imputation), and then progressively narrows the asset
universe through a series of skfolio selectors. The result is a tidy,
NaN-free DataFrame containing only the assets that pass every filter --
ready to feed into moment estimation and portfolio optimization.
Overview
The module follows the same frozen dataclass config + factory function pattern used throughout the optimizer library:
| Component | Role |
|---|---|
| PreSelectionConfig | Frozen @dataclass holding every pipeline parameter as a plain primitive, enum, or None. Serialisable and suitable for hyperparameter sweeps. |
| build_preselection_pipeline() | Factory function that reads a PreSelectionConfig and returns a fully assembled sklearn.pipeline.Pipeline. |
Because the config stores only primitives, it can be serialised to JSON/YAML,
persisted to a database, or passed across process boundaries without issue.
Non-serialisable objects (such as the sector_mapping dictionary) are passed
as keyword arguments to the factory, not stored in the config.
from optimizer.pre_selection import PreSelectionConfig, build_preselection_pipeline
config = PreSelectionConfig(correlation_threshold=0.90, top_k=30)
pipeline = build_preselection_pipeline(config, sector_mapping={"AAPL": "Tech", "JPM": "Financials"})
clean_returns = pipeline.fit_transform(returns_df)
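The serialisation property described above can be illustrated with a stand-in dataclass. This is a minimal sketch, not the library's code: SerialisableSketch is a hypothetical class that borrows three field names from PreSelectionConfig, and the JSON round trip uses only the standard library.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class SerialisableSketch:
    """Hypothetical stand-in for PreSelectionConfig, reduced to three fields."""

    max_abs_return: float = 10.0
    correlation_threshold: float = 0.95
    top_k: Optional[int] = None


cfg = SerialisableSketch(correlation_threshold=0.90, top_k=30)

# The fields are primitives only, so the config maps onto a JSON object.
payload = json.dumps(asdict(cfg), sort_keys=True)

# Rebuilding from the payload yields a config equal to the original.
restored = SerialisableSketch(**json.loads(payload))
print(payload)
print(restored == cfg)
```

A dictionary such as sector_mapping could also be serialised, but keeping it out of the config keeps the swept parameters small and hashable.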
Pipeline Steps
build_preselection_pipeline assembles the following steps in this exact
order. The first six steps are always present; the last three are
conditional on config flags.
validate --> outliers --> impute --> SelectComplete --> DropZeroVariance
--> DropCorrelated --> [SelectKExtremes] --> [SelectNonDominated]
--> [SelectNonExpiring]
Steps in brackets are optional and only added when the corresponding config parameter is set.
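The rules for which steps appear can be written down as a small function. This sketch only mirrors the conditions documented on this page; planned_step_names is a hypothetical helper, not part of the optimizer library, and it returns step names rather than building a real Pipeline.

```python
# Step names that are always present, in pipeline order.
ALWAYS_PRESENT = [
    "validate",
    "outliers",
    "impute",
    "select_complete",
    "drop_zero_variance",
    "drop_correlated",
]


def planned_step_names(
    top_k=None,
    use_pareto=False,
    use_non_expiring=False,
    expiration_lookahead=None,
):
    """Return the step names implied by the documented config flags."""
    names = list(ALWAYS_PRESENT)
    if top_k is not None:
        names.append("select_k")
    if use_pareto:
        names.append("select_pareto")
    # Both conditions must hold, otherwise the step is skipped.
    if use_non_expiring and expiration_lookahead is not None:
        names.append("select_non_expiring")
    return names


print(planned_step_names())
print(planned_step_names(top_k=30, use_pareto=True))
print(planned_step_names(use_non_expiring=True))  # lookahead missing: no extra step
```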
1. validate -- DataValidator
Replaces inf, -inf, and returns whose absolute value exceeds
max_abs_return with NaN. This is a stateless transformer that acts as a
first-pass sanity check, catching data errors (e.g. a return of 50 000%)
before they corrupt downstream statistics.
| Parameter | Config field | Default |
|---|---|---|
| max_abs_return | max_abs_return | 10.0 (i.e. 1 000%) |
Why so generous?
The default threshold of 10.0 (1 000%) is deliberately high. It catches obvious data errors while preserving legitimate large moves such as penny-stock spikes or circuit-breaker events. Tighten it to 5.0 or lower for conservative universes.
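The validation rule is simple enough to sketch in plain Python. The function below is an illustration of the documented behaviour on a single list of returns, not the DataValidator implementation; validate_returns is a hypothetical name.

```python
import math


def validate_returns(values, max_abs_return=10.0):
    """Replace inf, -inf, and returns with |r| > max_abs_return by NaN."""
    cleaned = []
    for r in values:
        if math.isinf(r) or abs(r) > max_abs_return:
            cleaned.append(math.nan)
        else:
            # Existing NaN values pass through unchanged.
            cleaned.append(r)
    return cleaned


raw = [0.01, -0.02, math.inf, 500.0, -math.inf, 9.5]
cleaned = validate_returns(raw)
print(cleaned)
```

A return of 9.5 (950%) survives the default threshold, while 500.0 (50 000%) does not.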
2. outliers -- OutlierTreater
Three-group z-score methodology applied per-column:
| Group | Condition | Action |
|---|---|---|
| Data errors | \|z\| >= remove_threshold | Replaced with NaN |
| Outliers | winsorize_threshold <= \|z\| < remove_threshold | Winsorised to mu +/- winsorize_threshold * sigma |
| Normal | \|z\| < winsorize_threshold | Kept as-is |
The z-scores are computed from the training data statistics (mu_ and
sigma_ stored during fit). Zero-variance (constant) columns (sigma = 0) are
assigned a z-score of 0 and left for DropZeroVariance to handle.
| Parameter | Config field | Default |
|---|---|---|
| winsorize_threshold | winsorize_threshold | 3.0 |
| remove_threshold | remove_threshold | 10.0 |
Validation constraint
winsorize_threshold must be strictly less than remove_threshold.
The config raises ValueError at construction time if this invariant is
violated.
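The three-group rule can be sketched for a single column as follows. This is an illustration of the documented thresholds rather than the OutlierTreater code; treat_column is a hypothetical name, and mu and sigma are passed in directly to stand for the statistics stored during fit.

```python
import math


def treat_column(values, mu, sigma, winsorize_threshold=3.0, remove_threshold=10.0):
    """Apply the three-group z-score rule to one column of returns."""
    treated = []
    for x in values:
        if math.isnan(x):
            treated.append(x)
            continue
        # Constant columns get z = 0 and are left for a later step.
        z = 0.0 if sigma == 0 else (x - mu) / sigma
        if abs(z) >= remove_threshold:
            treated.append(math.nan)  # data error
        elif abs(z) >= winsorize_threshold:
            # Outlier: clip to mu +/- winsorize_threshold * sigma.
            treated.append(mu + math.copysign(winsorize_threshold * sigma, z))
        else:
            treated.append(x)  # normal observation
    return treated


# Pretend these statistics were stored during fit on the training data.
mu, sigma = 0.0, 0.01
treated = treat_column([0.005, 0.05, -0.04, 0.2], mu, sigma)
print(treated)
```

With these statistics the four inputs have z-scores of roughly 0.5, 5, -4, and 20, so they land in the normal, outlier, outlier, and data-error groups respectively.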
3. impute -- SectorImputer
Fills remaining NaN values using leave-one-out sector cross-sectional
averages. For each timestep and each missing cell, the imputer computes the
mean of all other assets in the same sector. When the entire sector is
NaN for a given row, it falls back to the global cross-sectional mean.
When sector_mapping is None, all assets are treated as a single sector,
which reduces to plain global cross-sectional mean imputation.
| Parameter | Config field | Default |
|---|---|---|
| fallback_strategy | imputation_fallback | "global_mean" |
| sector_mapping | Factory kwarg (not in config) | None |
sector_mapping is a factory argument
The sector mapping is a dict[str, str] passed directly to
build_preselection_pipeline(sector_mapping=...), not stored in the
frozen config. This keeps the config serialisable. Columns absent from
the mapping are assigned to a catch-all "__unmapped__" sector.
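For a single timestep, the imputation logic described in this section can be sketched like this. It is a simplified illustration, not the SectorImputer source: impute_row is a hypothetical helper that works on one row represented as a dict, and the "__all__" label for the single-sector case is an assumption.

```python
import math


def impute_row(row, sector_mapping=None):
    """Fill NaN cells in one cross-section (dict: asset -> return)."""

    def sector_of(asset):
        if sector_mapping is None:
            return "__all__"  # every asset in one sector
        return sector_mapping.get(asset, "__unmapped__")

    def mean_or_nan(values):
        observed = [v for v in values if not math.isnan(v)]
        return sum(observed) / len(observed) if observed else math.nan

    global_mean = mean_or_nan(row.values())
    filled = {}
    for asset, value in row.items():
        if not math.isnan(value):
            filled[asset] = value
            continue
        # Mean of the other assets in the same sector.
        peers = [v for a, v in row.items() if a != asset and sector_of(a) == sector_of(asset)]
        sector_mean = mean_or_nan(peers)
        # Fall back to the global mean when the whole sector is missing.
        filled[asset] = global_mean if math.isnan(sector_mean) else sector_mean
    return filled


mapping = {"AAPL": "Tech", "MSFT": "Tech", "JPM": "Financials", "BAC": "Financials"}
row = {"AAPL": math.nan, "MSFT": 0.02, "JPM": math.nan, "BAC": math.nan, "XOM": 0.04}
filled = impute_row(row, mapping)
print(filled)
```

Here AAPL takes the value of its only sector peer, while JPM and BAC fall back to the global mean because the whole Financials sector is missing in this row.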
4. select_complete -- SelectComplete
Drops any asset (column) that still contains NaN after imputation. In
practice, when SectorImputer runs correctly this step is a no-op, but it
acts as a safety net to guarantee a fully complete matrix for downstream
selectors that cannot handle missing data.
This step has no configurable parameters.
5. drop_zero_variance -- DropZeroVariance
Drops any asset with zero variance (constant return series). Constant columns add no information and cause numerical issues in covariance estimation.
This step has no configurable parameters.
6. drop_correlated -- DropCorrelated
Drops one asset from each pair whose pairwise correlation exceeds the threshold. This reduces redundancy in the universe and improves conditioning of the covariance matrix.
| Parameter | Config field | Default |
|---|---|---|
| threshold | correlation_threshold | 0.95 |
| absolute | correlation_absolute | False |
Absolute correlation
When correlation_absolute=True, the selector uses |corr| rather than
raw correlation, so that strong negative correlations are also flagged.
This is useful when you want to reduce all forms of linear dependence.
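The effect of the absolute flag is easiest to see on a toy universe. The sketch below uses a simplified greedy rule (keep an asset only if it is not too correlated with any asset already kept). That rule is an assumption made for illustration; the actual DropCorrelated selector may choose which member of a pair to drop differently. The function names are hypothetical, and the sketch assumes zero-variance columns were already removed.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def drop_correlated(returns, threshold=0.95, absolute=False):
    """Greedy filter over a dict: asset -> list of returns."""
    kept = []
    for asset, series in returns.items():
        correlations = [pearson(series, returns[other]) for other in kept]
        if absolute:
            correlations = [abs(c) for c in correlations]
        if all(c <= threshold for c in correlations):
            kept.append(asset)
    return kept


base = [0.01, 0.02, -0.01, 0.03]
returns = {
    "A": base,
    "B": [2 * r for r in base],   # perfectly positively correlated with A
    "C": [-r for r in base],      # perfectly negatively correlated with A
    "D": [0.02, -0.01, 0.01, 0.00],
}
print(drop_correlated(returns))
print(drop_correlated(returns, absolute=True))
```

With raw correlation only B is redundant; with absolute=True the mirror-image asset C is flagged as well.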
7. select_k -- SelectKExtremes (optional)
Only added when top_k is not None. Keeps the k assets with the highest
(top_k_highest=True) or lowest (top_k_highest=False) mean return, using
skfolio's SelectKExtremes selector.
| Parameter | Config field | Default |
|---|---|---|
| k | top_k | None (step omitted) |
| highest | top_k_highest | True |
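A plain-Python sketch of top-k selection by mean return, following the description above. It is illustrative only: select_k_by_mean is a hypothetical helper, and it ranks strictly by arithmetic mean return as this page describes.

```python
def select_k_by_mean(returns, k, highest=True):
    """Keep the k assets with the highest (or lowest) mean return."""
    means = {asset: sum(series) / len(series) for asset, series in returns.items()}
    ranked = sorted(means, key=means.get, reverse=highest)
    chosen = set(ranked[:k])
    # Preserve the original column order in the output.
    return [asset for asset in returns if asset in chosen]


returns = {
    "A": [0.01, 0.03],
    "B": [0.00, 0.01],
    "C": [0.05, 0.01],
    "D": [-0.02, 0.00],
}
print(select_k_by_mean(returns, k=2))
print(select_k_by_mean(returns, k=2, highest=False))
```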
8. select_pareto -- SelectNonDominated (optional)
Only added when use_pareto=True. Applies a Pareto non-dominance filter
on mean return and variance, retaining only assets that no other asset
matches or beats on both dimensions while strictly beating on at least one.
| Parameter | Config field | Default |
|---|---|---|
| min_n_assets | pareto_min_assets | None |
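Non-dominance on mean return and variance can be sketched as below. This is a minimal illustration of the concept, not the SelectNonDominated implementation: non_dominated is a hypothetical helper, it uses the population variance, and it ignores the min_n_assets parameter.

```python
import statistics


def non_dominated(returns):
    """Return assets not Pareto-dominated on (mean return, variance)."""
    stats = {
        asset: (statistics.fmean(series), statistics.pvariance(series))
        for asset, series in returns.items()
    }

    def dominates(p, q):
        # p dominates q: at least as good on both axes, strictly better on one.
        (mean_p, var_p), (mean_q, var_q) = p, q
        return mean_p >= mean_q and var_p <= var_q and (mean_p > mean_q or var_p < var_q)

    return [
        asset
        for asset in returns
        if not any(dominates(stats[other], stats[asset]) for other in returns if other != asset)
    ]


returns = {
    "A": [0.01, 0.03],    # mid mean, low variance
    "B": [-0.01, 0.03],   # lower mean and higher variance than A
    "C": [0.014, 0.016],  # lowest variance
    "D": [0.00, 0.06],    # highest mean
}
print(non_dominated(returns))
```

Asset B is removed because A offers a higher mean with a lower variance; the other three each win on at least one axis.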
9. select_non_expiring -- SelectNonExpiring (optional)
Only added when both use_non_expiring=True and
expiration_lookahead is not None. Removes assets that expire within the
specified lookahead window, which is relevant for futures and options
universes.
| Parameter | Config field | Default |
|---|---|---|
| expiration_lookahead | expiration_lookahead | None (step omitted) |
Both flags required
Setting use_non_expiring=True without providing expiration_lookahead
silently skips this step. The step is only added when both conditions
are met.
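The lookahead filter itself amounts to a date comparison. The sketch below rests on several assumptions that this page does not confirm: that expiration dates are available as a mapping (here called expiration_dates), that the window is measured from the last observation date, and that a contract expiring exactly on the cutoff is dropped. The helper name and tickers are hypothetical.

```python
from datetime import date, timedelta


def keep_non_expiring(assets, expiration_dates, last_observation, expiration_lookahead):
    """Drop assets that expire within the lookahead window (calendar days)."""
    cutoff = last_observation + timedelta(days=expiration_lookahead)
    return [
        asset
        for asset in assets
        # Assets without an expiration date (e.g. equities) are always kept.
        if asset not in expiration_dates or expiration_dates[asset] > cutoff
    ]


expiration_dates = {
    "CLH4": date(2024, 2, 20),   # expires inside the window
    "CLZ4": date(2024, 11, 20),  # expires well after the window
}
kept = keep_non_expiring(
    ["CLH4", "CLZ4", "SPY"],
    expiration_dates,
    last_observation=date(2024, 1, 31),
    expiration_lookahead=90,
)
print(kept)
```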
Configuration Reference
All fields of PreSelectionConfig with their types, defaults, and the
pipeline step they control:
| Field | Type | Default | Pipeline step | Description |
|---|---|---|---|---|
| max_abs_return | float | 10.0 | validate | Maximum absolute return before treating as data error |
| winsorize_threshold | float | 3.0 | outliers | Z-score boundary between normal observations and outliers |
| remove_threshold | float | 10.0 | outliers | Z-score boundary between outliers and data errors |
| outlier_method | str | "time_series" | outliers | Outlier detection approach (only "time_series" supported) |
| imputation_fallback | str | "global_mean" | impute | Fallback when sector data unavailable |
| correlation_threshold | float | 0.95 | drop_correlated | Pairwise correlation above which an asset is dropped |
| correlation_absolute | bool | False | drop_correlated | Whether to use absolute correlation values |
| top_k | int \| None | None | select_k | If set, keep only the k assets with highest/lowest mean return |
| top_k_highest | bool | True | select_k | Select highest (True) or lowest (False) mean return |
| use_pareto | bool | False | select_pareto | Whether to apply Pareto non-dominance filter |
| pareto_min_assets | int \| None | None | select_pareto | Minimum assets to retain after Pareto filtering |
| use_non_expiring | bool | False | select_non_expiring | Whether to remove soon-expiring assets |
| expiration_lookahead | int \| None | None | select_non_expiring | Calendar days to look ahead for expiring assets |
| is_log_normal | bool | True | (stored for downstream use) | Whether returns are assumed log-normal for multi-period scaling |
Validation rules
The config validates the following constraints at construction time
(__post_init__):
- winsorize_threshold < remove_threshold -- winsorisation boundary must be stricter than the removal boundary.
- 0.0 < correlation_threshold <= 1.0 -- must be a valid correlation value.
- max_abs_return > 0 -- must be strictly positive.
Violating any of these raises ValueError immediately.
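The three constraints can be expressed in a __post_init__ hook as follows. This is a sketch under the assumption that the checks look roughly like this; ValidatedSketch is a hypothetical stand-in with four of the documented fields, and the error messages are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValidatedSketch:
    """Hypothetical stand-in for PreSelectionConfig with its three checks."""

    max_abs_return: float = 10.0
    winsorize_threshold: float = 3.0
    remove_threshold: float = 10.0
    correlation_threshold: float = 0.95

    def __post_init__(self):
        # Runs once, right after the generated __init__ has set the fields.
        if not self.winsorize_threshold < self.remove_threshold:
            raise ValueError("winsorize_threshold must be < remove_threshold")
        if not 0.0 < self.correlation_threshold <= 1.0:
            raise ValueError("correlation_threshold must be in (0, 1]")
        if not self.max_abs_return > 0:
            raise ValueError("max_abs_return must be > 0")


def is_valid(**kwargs):
    """Return True when the config can be constructed without error."""
    try:
        ValidatedSketch(**kwargs)
    except ValueError:
        return False
    return True


print(is_valid())                            # defaults
print(is_valid(winsorize_threshold=10.0))    # equal thresholds
print(is_valid(correlation_threshold=1.0))   # upper bound is inclusive
```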
Presets
PreSelectionConfig provides two class-method presets for common scenarios.
for_daily_annual()
Sensible defaults for daily equity returns over an approximately one-year
horizon. This is equivalent to PreSelectionConfig() with all defaults.
cfg = PreSelectionConfig.for_daily_annual()
# max_abs_return=10.0, winsorize_threshold=3.0, remove_threshold=10.0,
# correlation_threshold=0.95, is_log_normal=True
# No optional steps (top_k, pareto, non_expiring all off)
for_conservative()
Tighter filters for a more conservative universe. Lowers the data-error
and outlier thresholds, tightens the correlation filter, and activates
SelectKExtremes to cap the universe at 50 assets.
cfg = PreSelectionConfig.for_conservative()
# max_abs_return=5.0, winsorize_threshold=2.5, remove_threshold=8.0,
# correlation_threshold=0.85, top_k=50, top_k_highest=True,
# is_log_normal=True
Code Examples
Basic usage with default config
from skfolio.datasets import load_sp500_dataset
from skfolio.preprocessing import prices_to_returns
from optimizer.pre_selection import PreSelectionConfig, build_preselection_pipeline
# Load data and convert to returns
prices = load_sp500_dataset()
returns = prices_to_returns(prices)
# Build pipeline with sensible defaults
config = PreSelectionConfig.for_daily_annual()
pipeline = build_preselection_pipeline(config)
# Fit and transform
clean_returns = pipeline.fit_transform(returns)
print(f"Input: {returns.shape[1]} assets -> Output: {clean_returns.shape[1]} assets")
Conservative preset with sector-aware imputation
from optimizer.pre_selection import PreSelectionConfig, build_preselection_pipeline
# Sector mapping for imputation
sector_mapping = {
    "AAPL": "Technology",
    "MSFT": "Technology",
    "JPM": "Financials",
    "BAC": "Financials",
    "JNJ": "Healthcare",
    "PFE": "Healthcare",
    # ... more tickers
}
config = PreSelectionConfig.for_conservative()
pipeline = build_preselection_pipeline(config, sector_mapping=sector_mapping)
clean_returns = pipeline.fit_transform(returns)
Custom configuration
from optimizer.pre_selection import PreSelectionConfig, build_preselection_pipeline
config = PreSelectionConfig(
    max_abs_return=5.0,          # Strict data-error threshold
    winsorize_threshold=2.5,     # Tighter winsorisation
    remove_threshold=8.0,        # Lower removal boundary
    correlation_threshold=0.90,  # Drop assets correlated above 90%
    correlation_absolute=True,   # Use |corr| (catches negative correlation too)
    top_k=30,                    # Keep top 30 by mean return
    top_k_highest=True,          # Highest mean return
    use_pareto=True,             # Apply Pareto filter after top-k
    pareto_min_assets=15,        # Keep at least 15 assets from Pareto
)
pipeline = build_preselection_pipeline(config)
clean_returns = pipeline.fit_transform(returns)
Futures universe with expiration filtering
from optimizer.pre_selection import PreSelectionConfig, build_preselection_pipeline
config = PreSelectionConfig(
    use_non_expiring=True,
    expiration_lookahead=90,  # Drop contracts expiring within 90 days
)
pipeline = build_preselection_pipeline(config)
clean_returns = pipeline.fit_transform(futures_returns)
Inspecting and tuning pipeline parameters
from optimizer.pre_selection import build_preselection_pipeline
pipeline = build_preselection_pipeline()
# List all accessible parameters (nested step parameters contain "__")
params = pipeline.get_params()
for key in sorted(params):
    if "__" in key:
        print(f" {key} = {params[key]}")
# Modify parameters after construction
pipeline.set_params(
    outliers__winsorize_threshold=2.5,
    drop_correlated__threshold=0.90,
    validate__max_abs_return=5.0,
)
Using pre-selection inside a full optimization pipeline
from skfolio.preprocessing import prices_to_returns
from optimizer.pre_selection import PreSelectionConfig, build_preselection_pipeline
from optimizer.pipeline import run_full_pipeline
prices = ... # pd.DataFrame of asset prices
# Pre-selection is handled internally by run_full_pipeline,
# but you can also run it explicitly for inspection:
config = PreSelectionConfig(correlation_threshold=0.90, top_k=50)
preselection_pipe = build_preselection_pipeline(config)
returns = prices_to_returns(prices)
clean_returns = preselection_pipe.fit_transform(returns)
print(f"Selected {clean_returns.shape[1]} assets from {returns.shape[1]}")
Gotchas
Pre-selection must run inside CV folds
When using cross-validation (walk-forward, CPCV, etc.), the
pre-selection pipeline must be part of the overall sklearn pipeline
that gets re-fit on each training fold. If you run pre-selection once on
the full dataset and then cross-validate, you introduce data leakage --
the OutlierTreater z-score statistics and DropCorrelated correlation
matrix will have been computed on data that includes the validation
period.
The optimizer library handles this correctly when the pre-selection
pipeline is composed inside the broader sklearn Pipeline that
run_full_pipeline builds.
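The leakage effect can be made concrete with a toy series. The sketch below is not tied to any library class: it compares the z-score of one validation observation under statistics fitted on the training fold only against statistics fitted on the full sample. The data values are invented, and the use of the sample standard deviation is an assumption.

```python
import statistics

# Toy return series: six calm training observations, then two extreme
# observations that fall in the validation period.
returns = [0.01, -0.01, 0.02, -0.02, 0.01, -0.01, 0.30, -0.25]
train, validation = returns[:6], returns[6:]


def z_score(x, sample):
    """z-score of x against the mean and sample std of `sample`."""
    return (x - statistics.fmean(sample)) / statistics.stdev(sample)


# Correct: statistics fitted on the training fold only.
z_train_only = z_score(validation[0], train)

# Leaky: statistics fitted on the full dataset, validation period included.
z_full_sample = z_score(validation[0], returns)

print(f"train-only z = {z_train_only:.1f}, full-sample z = {z_full_sample:.1f}")
```

With the default thresholds (3.0 and 10.0), the first validation observation would be removed as a data error under train-only statistics but left untouched under full-sample statistics, because the extreme values inflate the standard deviation they are judged against.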
Parameter names use double-underscore notation
All transformer hyper-parameters are accessible via get_params() using
sklearn's step_name__param_name notation. For example:
- validate__max_abs_return
- outliers__winsorize_threshold
- outliers__remove_threshold
- drop_correlated__threshold
- drop_correlated__absolute
- select_k__k (only when top_k is set)
This is the notation you must use for set_params() and for
hyperparameter tuning grids.
prices_to_returns runs outside the pipeline
The pre-selection pipeline operates on a return DataFrame, not a
price DataFrame. The conversion from prices to returns
(skfolio.preprocessing.prices_to_returns) changes data semantics and
is therefore performed upstream, before the pipeline runs. This is a
project-wide convention.
SelectNonExpiring requires both flags
Setting use_non_expiring=True alone does not add the step.
You must also provide expiration_lookahead (an integer number of
calendar days). Without it, the step is silently skipped.
The config is frozen
PreSelectionConfig is a frozen dataclass. You cannot mutate fields
after construction. To change a parameter, create a new config instance.
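One standard-library way to derive a modified copy of a frozen dataclass is dataclasses.replace. The sketch below uses a hypothetical two-field stand-in (FrozenSketch) rather than the real PreSelectionConfig, so only the mechanism carries over.

```python
from dataclasses import FrozenInstanceError, dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class FrozenSketch:
    """Hypothetical stand-in for PreSelectionConfig, reduced to two fields."""

    correlation_threshold: float = 0.95
    top_k: Optional[int] = None


base = FrozenSketch()

# Direct assignment on a frozen dataclass raises FrozenInstanceError.
try:
    base.top_k = 30
    mutated = True
except FrozenInstanceError:
    mutated = False

# replace() builds a new instance with the given fields overridden.
tuned = replace(base, top_k=30)
print(mutated, base, tuned)
```

Because replace() goes through the constructor, any __post_init__ validation runs again on the new instance.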
Quick Reference
from optimizer.pre_selection import PreSelectionConfig, build_preselection_pipeline
# Presets
cfg = PreSelectionConfig.for_daily_annual() # sensible defaults
cfg = PreSelectionConfig.for_conservative() # tighter filters, top_k=50
# Factory
pipe = build_preselection_pipeline(config=cfg, sector_mapping=None)
# Pipeline step names (default)
# validate -> outliers -> impute -> select_complete -> drop_zero_variance -> drop_correlated
# Optional steps (added when config flags are set)
# select_k (top_k is not None)
# select_pareto (use_pareto=True)
# select_non_expiring (use_non_expiring=True AND expiration_lookahead is not None)
# Key parameter paths for tuning
# validate__max_abs_return
# outliers__winsorize_threshold
# outliers__remove_threshold
# drop_correlated__threshold
# drop_correlated__absolute
# select_k__k
# select_k__highest