factors

optimizer.factors

Factor construction, scoring, and selection for stock pre-selection.

CompositeMethod

Bases: str, Enum

Composite scoring method.

CompositeScoringConfig dataclass

Configuration for composite score construction.

Parameters

method : CompositeMethod
    Equal-weight, IC-weighted, ICIR-weighted, ridge, or GBT composite.
ic_lookback : int
    Number of periods for IC estimation when using IC weighting.
core_weight : float
    Relative weight for core factor groups.
supplementary_weight : float
    Relative weight for supplementary factor groups.
ridge_alpha : float
    L2 regularisation strength for RIDGE_WEIGHTED. Passed as the single
    candidate to RidgeCV; increase for more shrinkage.
gbt_max_depth : int
    Maximum tree depth for GBT_WEIGHTED.
gbt_n_estimators : int
    Number of boosting rounds for GBT_WEIGHTED.

for_equal_weight() classmethod

Equal-weight composite scoring.

for_ic_weighted() classmethod

IC-weighted composite scoring (raw IC magnitude).

for_icir_weighted() classmethod

ICIR-weighted composite scoring (mean IC / std IC).

Penalises inconsistent predictors by dividing mean IC by IC volatility before normalising weights.
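The weighting scheme can be sketched in a few lines. This is an illustrative helper (the name icir_weights is hypothetical, not part of this module) showing why a steady predictor earns more weight than an equally strong but noisy one:

```python
import pandas as pd

def icir_weights(ic_series_per_group: dict[str, pd.Series]) -> dict[str, float]:
    """Weight each group by |mean(IC) / std(IC)|, normalised to sum to 1."""
    raw = {}
    for name, ic in ic_series_per_group.items():
        std = ic.std()
        raw[name] = abs(ic.mean() / std) if std > 0 else 0.0
    total = sum(raw.values())
    if total == 0:  # all groups have zero/undefined ICIR: fall back to equal weight
        return {name: 1.0 / len(raw) for name in raw}
    return {name: v / total for name, v in raw.items()}

# Both groups have mean IC 0.05, but "value" delivers it far more consistently.
ics = {
    "value": pd.Series([0.05, 0.04, 0.06, 0.05]),       # low IC volatility
    "momentum": pd.Series([0.20, -0.10, 0.15, -0.05]),  # high IC volatility
}
weights = icir_weights(ics)
```

With these inputs the "value" group dominates the composite weight despite the identical mean IC.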

for_ridge_weighted() classmethod

Ridge regression composite scoring.

Learns optimal linear factor weights from historical data with L2 regularisation, avoiding the need for IC proxies.

for_gbt_weighted() classmethod

Gradient-boosted tree composite scoring.

Captures non-linear factor interactions (e.g. high value + improving momentum = stronger combined signal).

FactorConstructionConfig dataclass

Configuration for factor computation.

Parameters

factors : tuple[FactorType, ...]
    Which factors to compute.
momentum_lookback : int
    Lookback window for momentum in trading days.
momentum_skip : int
    Recent days to skip for momentum (reversal avoidance).
volatility_lookback : int
    Lookback window for volatility in trading days.
beta_lookback : int
    Lookback window for beta estimation in trading days.
amihud_lookback : int
    Lookback window for Amihud illiquidity in trading days.
publication_lag : PublicationLagConfig
    Per-source publication lags for point-in-time correctness. Pass a plain
    int for a uniform lag across all sources (backward-compatible; converted
    to :class:PublicationLagConfig automatically).

for_core_factors() classmethod

Core factors with strongest empirical support.

for_all_factors() classmethod

All 17 factors.

FactorGroupType

Bases: str, Enum

Factor group taxonomy.

FactorIntegrationConfig dataclass

Configuration for bridging factor scores to optimization.

Parameters

risk_free_rate : float
    Annual risk-free rate for expected return mapping.
market_risk_premium : float
    Annual equity risk premium.
use_black_litterman : bool
    Whether to generate Black-Litterman views from factor scores.
exposure_lower_bound : float
    Lower bound for factor exposure constraints.
exposure_upper_bound : float
    Upper bound for factor exposure constraints.

for_linear_mapping() classmethod

Direct factor score to expected return mapping.

for_black_litterman() classmethod

Factor-based Black-Litterman views.

FactorType

Bases: str, Enum

Individual factor identifiers.

FactorValidationConfig dataclass

Configuration for factor validation and statistical testing.

Parameters

newey_west_lags : int
    Number of lags for Newey-West t-statistic.
t_stat_threshold : float
    Minimum absolute t-statistic for significance.
fdr_alpha : float
    False discovery rate alpha level.
n_quantiles : int
    Number of quantiles for spread analysis.
fmp_top_pct : float
    Top percentile for factor-mimicking portfolios.
fmp_bottom_pct : float
    Bottom percentile for factor-mimicking portfolios.

for_strict() classmethod

Strict validation thresholds.

for_standard() classmethod

Standard validation thresholds.

GroupWeight

Bases: str, Enum

Weight tier for factor groups.

MacroRegime

Bases: str, Enum

Macro-economic regime classification.

PublicationLagConfig dataclass

Differentiated publication lags by data source type.

Each source has an independent delay between the period end date and the date the data is reliably available for use in factor construction. Using source-specific lags avoids look-ahead bias when aligning fundamental data to price dates.

Parameters

annual_days : int
    Lag for annual financial statements (days after fiscal year end).
    Default: 90 days (~3 months for 10-K filing).
quarterly_days : int
    Lag for quarterly financial statements (days after quarter end).
    Default: 45 days (~6 weeks for 10-Q filing).
analyst_days : int
    Lag for analyst estimates and recommendations. Default: 5 days
    (short dissemination buffer).
macro_days : int
    Lag for macroeconomic indicators (release lag + revision lag).
    Default: 63 days (~2 months).

uniform(days) classmethod

Create a config with the same lag applied to all sources.

RegimeTiltConfig dataclass

Configuration for macro regime factor tilts.

Per-regime multiplicative tilts stored as tuples of (group_name, tilt_factor) for frozen-dataclass compatibility.

Parameters

enable : bool
    Whether to apply regime tilts.
expansion_tilts : tuple[tuple[str, float], ...]
    Group tilts during expansion.
slowdown_tilts : tuple[tuple[str, float], ...]
    Group tilts during slowdown.
recession_tilts : tuple[tuple[str, float], ...]
    Group tilts during recession.
recovery_tilts : tuple[tuple[str, float], ...]
    Group tilts during recovery.

for_moderate_tilts() classmethod

Enable moderate regime-conditional tilts.

for_no_tilts() classmethod

Disable regime tilts (default).

SelectionConfig dataclass

Configuration for stock selection from scored universe.

Parameters

method : SelectionMethod
    Fixed-count or quantile-based selection.
target_count : int
    Number of stocks to select (for FIXED_COUNT).
target_quantile : float
    Quantile threshold for selection (for QUANTILE, 0-1).
exit_quantile : float
    Exit quantile for hysteresis (for QUANTILE).
buffer_fraction : float
    Buffer zone fraction around the selection boundary.
sector_balance : bool
    Whether to enforce sector-proportional representation.
sector_tolerance : float
    Maximum deviation from parent universe sector weights.

for_top_100() classmethod

Select top 100 stocks by composite score.

for_top_quintile() classmethod

Select top quintile by composite score.

for_concentrated() classmethod

Concentrated portfolio of top 30 stocks.

SelectionMethod

Bases: str, Enum

Stock selection method.

StandardizationConfig dataclass

Configuration for cross-sectional factor standardization.

Parameters

method : StandardizationMethod
    Z-score or rank-normal standardization.
winsorize_lower : float
    Lower percentile for winsorization (0-1).
winsorize_upper : float
    Upper percentile for winsorization (0-1).
neutralize_sector : bool
    Whether to sector-neutralize scores.
neutralize_country : bool
    Whether to country-neutralize scores.

for_heavy_tailed() classmethod

Rank-normal for heavy-tailed distributions (e.g. value ratios).

for_normal() classmethod

Z-score for approximately normal factors (e.g. momentum).

StandardizationMethod

Bases: str, Enum

Cross-sectional standardization method.

FactorPCAResult dataclass

Principal component analysis result for a factor score matrix.

Attributes

explained_variance_ratio : ndarray, shape (n_components,)
    Fraction of variance explained by each principal component, sorted in
    descending order.
loadings : pd.DataFrame, shape (n_factors, n_components)
    PCA loading matrix. Rows are factor names; columns are PC1, PC2, ....
    Each column is a unit eigenvector of the correlation matrix of the
    factor scores.
n_components_95pct : int
    Smallest number of components whose cumulative explained variance ratio
    is ≥ 0.95.

FactorExposureConstraints dataclass

Enforceable linear constraints on portfolio factor exposure.

Encodes the set of per-factor inequalities::

lb_g <= sum_i w_i * z_{i,g} <= ub_g

as a pair of matrices ready to be passed directly to :class:skfolio.optimization.MeanRisk (or any optimizer that accepts left_inequality / right_inequality).

Parameters

left_inequality : np.ndarray of shape (2 * n_factors, n_assets)
    Inequality matrix A in the constraint A @ w <= b. Two rows per factor:
    -z (lower bound) and +z (upper bound).
right_inequality : np.ndarray of shape (2 * n_factors,)
    Bound vector b in the constraint A @ w <= b.
factor_names : list[str]
    Names of the constrained factors (in the same order as the row pairs in
    left_inequality).
lower_bounds : np.ndarray of shape (n_factors,)
    Lower exposure bound per factor.
upper_bounds : np.ndarray of shape (n_factors,)
    Upper exposure bound per factor.

NetAlphaResult dataclass

Result of net alpha calculation after transaction cost deduction.

Attributes

gross_alpha : float
    Annualised IC-based alpha proxy: mean(IC) * sqrt(annualisation).
avg_turnover : float
    Mean one-way turnover across consecutive rebalancing dates, computed via
    :func:~optimizer.rebalancing._rebalancer.compute_turnover.
total_cost : float
    Cost deduction: avg_turnover * cost_bps / 10_000.
net_alpha : float
    Net annualised alpha after cost deduction: gross_alpha - total_cost.
net_icir : float
    Net information coefficient information ratio:
    net_alpha / (std(IC) * sqrt(annualisation)). 0.0 when the IC series has
    zero variance.

QuintileSpreadResult dataclass

Quintile spread analysis result for a single factor.

Attributes

quintile_returns : pd.DataFrame
    Dates × Q1..Qn equal-weight portfolio returns per quantile bucket.
    Q1 = bottom (lowest scores), Qn = top (highest scores).
spread_returns : pd.Series
    Qn − Q1 long-short spread return series indexed by date. Equals
    quintile_returns.iloc[:, -1] - quintile_returns.iloc[:, 0] element-wise.
annualised_mean : float
    spread_returns.mean() * 252.
t_stat : float
    Two-tailed t-statistic: mean / (std / sqrt(T)).
sharpe : float
    Annualised Sharpe ratio: mean * sqrt(252) / std.

FactorOOSConfig dataclass

Configuration for rolling block OOS validation.

Parameters

train_months : int
    Length of the training window in months. Default: 36.
val_months : int
    Length of the validation window in months. Default: 12.
step_months : int
    Number of months to roll forward between folds. Default: 6.

FactorOOSResult dataclass

Results from rolling block OOS factor validation.

Attributes

per_fold_ic : pd.DataFrame
    n_folds × factors matrix of mean IC per fold per factor.
per_fold_spread : pd.DataFrame
    n_folds × factors matrix of mean quintile spread per fold.
mean_oos_ic : pd.Series
    Mean OOS IC aggregated across folds (one value per factor).
mean_oos_icir : pd.Series
    OOS ICIR (mean IC / std IC across folds) per factor.
n_folds : int
    Number of folds generated.

CorrectedPValues dataclass

Multiple-testing corrected p-values.

Attributes

holm : ndarray
    Holm-Bonferroni adjusted p-values (controls FWER).
bh : ndarray
    Benjamini-Hochberg adjusted p-values (controls FDR).

FactorValidationReport dataclass

Complete validation report for all factors.

ICResult dataclass

Information coefficient analysis results for a single factor.

ICStats dataclass

Full IC statistics for a single factor including Newey-West inference.

Attributes

mean : float
    Mean IC over the evaluation period.
variance_nw : float
    Newey-West HAC variance of the IC series.
t_stat_nw : float
    Newey-West adjusted t-statistic: IC_mean / sqrt(Var_NW / T).
p_value : float
    Two-tailed p-value derived from the Newey-West t-statistic.
icir : float
    Information Coefficient Information Ratio: mean(IC) / std(IC).

QuantileSpreadResult dataclass

Quantile spread analysis results for a single factor.

compute_gross_alpha(net_alpha, avg_turnover, cost_bps=10.0)

Compute gross alpha by adding back estimated transaction costs.

Formula::

gross = net_alpha + avg_turnover * cost_bps / 10_000
Parameters

net_alpha : float
    Net alpha after transaction costs (annualised).
avg_turnover : float
    Average one-way turnover (e.g. 0.5 means 50% of the portfolio traded per
    period).
cost_bps : float
    One-way transaction cost in basis points.

Returns

float
    Gross alpha before transaction costs.
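As a worked instance of the formula, with hypothetical numbers (the helper name is illustrative, not the module function):

```python
def gross_alpha_from_net(net_alpha: float, avg_turnover: float,
                         cost_bps: float = 10.0) -> float:
    # add back the deducted cost: one-way turnover times cost in decimal terms
    return net_alpha + avg_turnover * cost_bps / 10_000

# 4% net alpha, 50% one-way turnover, 10 bps one-way cost:
# cost = 0.5 * 10 / 10_000 = 0.0005, so gross = 0.0405
gross = gross_alpha_from_net(0.04, 0.5, 10.0)
```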

factor_scores_to_expected_returns(scores, betas, factor_premiums, risk_free_rate=0.0)

Convert factor Z-scores to expected returns via linear model.

Implements the formula::

E[r_i] = r_f + λ_mkt · β_i + Σ_g λ_g · z_{i,g}

where λ_mkt is read from factor_premiums["market"] and each λ_g is read from factor_premiums[g] for factor group g.

Parameters

scores : pd.DataFrame
    Assets × factor-groups matrix of standardised Z-scores. Rows are ticker
    symbols; columns are factor group names (e.g. "value", "momentum").
betas : pd.Series
    Market (CAPM) beta per asset, indexed by ticker. Assets missing from
    this Series are treated as having a beta of 1.0 (market neutral
    assumption).
factor_premiums : dict[str, float]
    Mapping of premium label → annualised premium (e.g. {"market": 0.05,
    "value": 0.03, "momentum": 0.04}). The reserved "market" key provides
    λ_mkt; all other keys are matched against columns in scores.
risk_free_rate : float, default 0.0
    Annualised risk-free rate r_f.

Returns

pd.Series
    Annualised expected return per ticker, indexed by scores.index.

Examples

>>> import pandas as pd
>>> scores = pd.DataFrame(
...     {"value": [1.0, -1.0], "momentum": [0.5, 0.0]},
...     index=["AAPL", "MSFT"],
... )
>>> betas = pd.Series({"AAPL": 1.2, "MSFT": 0.8})
>>> factor_premiums = {"market": 0.05, "value": 0.03, "momentum": 0.04}
>>> factor_scores_to_expected_returns(scores, betas, factor_premiums, 0.02)
AAPL    0.132
MSFT    0.018
dtype: float64

align_to_pit(data, period_date_col, as_of_date, lag_days, ticker_col='ticker')

Filter time-series data to records published before as_of_date.

A record with period end date D is considered published lag_days calendar days after D. A record is available as of as_of_date only when D + lag_days <= as_of_date, equivalently when D <= as_of_date - lag_days.

For each ticker, the most recent record satisfying the availability constraint is returned so that callers receive a cross-sectional view as of as_of_date.

Parameters

data : pd.DataFrame
    Time-series data containing period_date_col and optionally ticker_col.
period_date_col : str
    Name of the column holding the period end date.
as_of_date : pd.Timestamp or str
    The computation date. Only records available on or before this date
    (after the lag has elapsed) are returned.
lag_days : int
    Calendar days between period end and data availability.
ticker_col : str
    Column holding the ticker identifier. Defaults to "ticker".

Returns

pd.DataFrame
    Cross-sectional view: one row per ticker (the most recent available
    record), indexed by ticker_col when present. Returns an empty DataFrame
    with the same columns if no records pass the cutoff.
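The availability rule above can be sketched with plain pandas. This is a minimal illustration of the cutoff logic (align_to_pit_sketch is a hypothetical stand-in, not the module function):

```python
import pandas as pd

def align_to_pit_sketch(data, period_date_col, as_of_date, lag_days,
                        ticker_col="ticker"):
    """Keep, per ticker, the latest record whose period end + lag_days has elapsed."""
    cutoff = pd.Timestamp(as_of_date) - pd.Timedelta(days=lag_days)
    available = data[pd.to_datetime(data[period_date_col]) <= cutoff]
    if available.empty:
        return available
    # most recent available record per ticker, indexed by ticker
    return (available.sort_values(period_date_col)
                     .groupby(ticker_col)
                     .tail(1)
                     .set_index(ticker_col))

fundamentals = pd.DataFrame({
    "ticker": ["AAPL", "AAPL", "MSFT"],
    "period_end": pd.to_datetime(["2024-03-31", "2024-06-30", "2024-03-31"]),
    "eps": [1.5, 1.6, 2.0],
})
# On 2024-08-01 with a 45-day lag the cutoff is 2024-06-17, so the AAPL
# 2024-06-30 quarter is not yet published and AAPL resolves to 2024-03-31.
view = align_to_pit_sketch(fundamentals, "period_end", "2024-08-01", lag_days=45)
```

Sorting before the per-ticker tail(1) guarantees the most recent surviving record wins, which is the look-ahead-safe choice.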

compute_all_factors(fundamentals, price_history, volume_history=None, analyst_data=None, insider_data=None, config=None)

Compute all configured factors.

Parameters

fundamentals : pd.DataFrame
    Cross-sectional data indexed by ticker.
price_history : pd.DataFrame
    Price matrix (dates x tickers).
volume_history : pd.DataFrame or None
    Volume matrix.
analyst_data : pd.DataFrame or None
    Analyst recommendation data.
insider_data : pd.DataFrame or None
    Insider transaction data.
config : FactorConstructionConfig or None
    Construction parameters.

Returns

pd.DataFrame
    Tickers x factors matrix.

compute_factor(factor_type, fundamentals, price_history, volume_history=None, analyst_data=None, insider_data=None, config=None)

Compute a single factor.

Parameters

factor_type : FactorType
    Which factor to compute.
fundamentals : pd.DataFrame
    Cross-sectional data indexed by ticker.
price_history : pd.DataFrame
    Price matrix (dates x tickers).
volume_history : pd.DataFrame or None
    Volume matrix (dates x tickers).
analyst_data : pd.DataFrame or None
    Analyst recommendation data.
insider_data : pd.DataFrame or None
    Insider transaction data.
config : FactorConstructionConfig or None
    Construction parameters.

Returns

pd.Series
    Factor values indexed by ticker.

check_survivorship_bias(returns, final_periods=12, zero_threshold=1e-10)

Check for potential survivorship bias in a return panel.

Survivorship bias occurs when delisted or failed assets are excluded from the sample. A simple heuristic: if no asset has near-zero returns in the final final_periods rows (i.e., no asset appears to have stopped trading), the panel may suffer from survivorship bias.

Parameters

returns : pd.DataFrame
    Dates × assets return matrix.
final_periods : int
    Number of trailing periods to inspect.
zero_threshold : float
    Absolute threshold below which a return is considered "zero".

Returns

bool
    True if survivorship bias is suspected, False otherwise.
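One plausible reading of the heuristic is sketched below, treating an asset with only near-zero trailing returns as "delisted". The helper name and the exact dead-asset test are assumptions for illustration:

```python
import pandas as pd

def survivorship_bias_suspected(returns: pd.DataFrame,
                                final_periods: int = 12,
                                zero_threshold: float = 1e-10) -> bool:
    """True when no asset looks delisted (no trailing run of near-zero returns)."""
    tail = returns.tail(final_periods)
    # an asset that stopped trading shows only near-zero returns at the end
    any_dead = (tail.abs() < zero_threshold).all(axis=0).any()
    return not any_dead

live = pd.DataFrame(0.01, index=range(20), columns=["A", "B", "C"])
dead = live.copy()
dead["C"] = 0.0  # asset C stopped trading: the panel retains a "failure"
```

A panel where every asset trades to the very end (live) triggers the warning; a panel that still contains a dead asset (dead) does not.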

compute_factor_pca(scores, n_components=None)

Compute PCA on a cross-sectional factor score matrix.

Rows with any NaN are dropped before fitting. Scores are standardised (zero mean, unit variance per factor) so that PCA operates on the correlation structure rather than the covariance structure.

Parameters

scores : pd.DataFrame
    Tickers × factors matrix of factor scores. Columns are factor names;
    rows are asset observations.
n_components : int or None, default None
    Number of principal components to retain. None keeps all components
    (min(n_samples, n_features)).

Returns

FactorPCAResult
    See :class:FactorPCAResult for field descriptions.

Raises

ValueError
    If fewer than 2 factors or fewer than 2 observations are available after
    dropping NaN rows.

flag_redundant_factors(scores, vif_threshold=10.0)

Return factor names whose VIF exceeds vif_threshold.

A VIF above the threshold indicates that the factor's variance is largely explained by the remaining factors, making it a candidate for merging or removal from the composite score.

Parameters

scores : pd.DataFrame
    Tickers × factors matrix of factor scores. Must contain at least 2
    factor columns.
vif_threshold : float, default 10.0
    VIF cutoff above which a factor is considered redundant. Commonly used
    values: 5 (conservative) or 10 (standard).

Returns

list[str]
    Factor names with VIF > vif_threshold, in the order they appear in
    scores.columns. Empty list if none exceed the threshold.

Raises

ValueError
    Propagated from :func:compute_vif if fewer than 2 factors are provided.
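The VIF calculation underlying this flagging can be sketched with plain NumPy least squares. This is an illustrative reimplementation under the standard definition VIF_j = 1 / (1 - R²_j), not the module's own code:

```python
import numpy as np
import pandas as pd

def flag_redundant_sketch(scores: pd.DataFrame, vif_threshold: float = 10.0) -> list:
    """Flag factors whose VIF (from regressing on the other factors) exceeds the cutoff."""
    X = scores.dropna()
    flagged = []
    for col in X.columns:
        y = X[col].to_numpy()
        others = X.drop(columns=col).to_numpy()
        A = np.column_stack([np.ones(len(X)), others])  # OLS with intercept
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        ss_tot = ((y - y.mean()) ** 2).sum()
        r2 = 1.0 - (resid @ resid) / ss_tot if ss_tot > 0 else 0.0
        vif = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
        if vif > vif_threshold:
            flagged.append(col)
    return flagged

rng = np.random.default_rng(42)
a, b = rng.normal(size=200), rng.normal(size=200)
# "c" is almost an exact linear combination of "a" and "b", so its VIF explodes
df = pd.DataFrame({"a": a, "b": b, "c": a + b + 0.01 * rng.normal(size=200)})
```

Two independent factors alone produce VIFs near 1 and an empty result; adding the near-collinear "c" flags it for merging or removal.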

build_factor_bl_views(factor_scores, factor_premia, selected_tickers)

Generate Black-Litterman views from factor scores.

Creates relative views: top-scored assets outperform bottom-scored by the factor premium.

Parameters

factor_scores : pd.DataFrame
    Tickers x factors matrix of standardized scores.
factor_premia : dict[str, float]
    Expected premium per factor.
selected_tickers : pd.Index
    Tickers in the portfolio.

Returns

tuple[list[tuple[str, ...]], list[float]]
    (views, confidences) for Black-Litterman.

build_factor_exposure_constraints(factor_scores, bounds)

Build enforceable linear factor exposure constraints.

For each factor g, the constraint enforces::

lb_g <= sum_i w_i * z_{i,g} <= ub_g

The result is expressed as left_inequality @ w <= right_inequality (two rows per factor) and can be passed directly to :class:skfolio.optimization.MeanRisk via its left_inequality / right_inequality constructor arguments.

Parameters

factor_scores : pd.DataFrame
    Tickers x factors matrix of standardised factor scores. The tickers must
    match the assets used in the optimizer fit.
bounds : tuple[float, float] or dict[str, tuple[float, float]]
    Exposure bounds applied to every factor (uniform) when given as a single
    (lower, upper) tuple, or per-factor bounds when given as a dict mapping
    factor name → (lower, upper).

Returns

FactorExposureConstraints
    Dataclass holding left_inequality, right_inequality, and metadata. Pass
    left_inequality and right_inequality as keyword arguments to the
    optimizer.

Warns

UserWarning
    If the equal-weight portfolio exposure lies outside [lb, ub] for any
    factor (i.e. the constraint may be infeasible or very tight under a
    balanced allocation).
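The matrix encoding can be sketched directly: each factor contributes a (-z, +z) row pair so that both bounds become rows of a single A @ w <= b system. This is an illustrative construction with uniform bounds only (the helper name is hypothetical; exact row ordering in the real dataclass may differ):

```python
import numpy as np
import pandas as pd

def exposure_constraints_sketch(factor_scores: pd.DataFrame, lb: float, ub: float):
    """Encode lb <= z_g @ w <= ub per factor as A @ w <= b, two rows per factor."""
    rows, bounds = [], []
    for g in factor_scores.columns:
        z = factor_scores[g].to_numpy()
        rows.append(-z); bounds.append(-lb)  # -z @ w <= -lb  <=>  z @ w >= lb
        rows.append(z);  bounds.append(ub)   #  z @ w <=  ub
    return np.array(rows), np.array(bounds)

scores = pd.DataFrame({"value": [1.0, -1.0], "momentum": [0.5, -0.5]},
                      index=["AAPL", "MSFT"])
A, b = exposure_constraints_sketch(scores, lb=-0.2, ub=0.2)

w_balanced = np.array([0.5, 0.5])  # factor exposures are 0.0: feasible
w_tilted = np.array([1.0, 0.0])    # value exposure 1.0 breaches the upper bound
```

Checking A @ w <= b row-wise is exactly what a linear optimizer does with these inputs.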

compute_net_alpha(ic_series, weights_history, cost_bps=10.0, annualisation=252)

Compute factor net alpha after deducting turnover-based transaction costs.

Combines IC-based gross alpha with the turnover cost from a weights history to produce a single net performance metric::

gross_alpha  = mean(IC) * sqrt(annualisation)
avg_turnover = mean one-way turnover across rebalancing dates
total_cost   = avg_turnover * cost_bps / 10_000
net_alpha    = gross_alpha - total_cost
net_icir     = net_alpha / (std(IC) * sqrt(annualisation))
Parameters

ic_series : pd.Series
    Time series of period information coefficients (Spearman or Pearson rank
    correlation between factor scores and forward returns), one value per
    rebalancing period.
weights_history : pd.DataFrame
    Portfolio weights at each rebalancing date: rows = dates, columns =
    assets. Turnover is computed between every pair of consecutive rows.
cost_bps : float, default=10.0
    One-way transaction cost in basis points.
annualisation : int, default=252
    Number of periods per year (252 for daily, 12 for monthly).

Returns

NetAlphaResult
    Dataclass with gross_alpha, avg_turnover, total_cost, net_alpha, and
    net_icir.
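The pipeline above can be sketched end to end. The half-sum one-way turnover convention used here is an assumption; the library's compute_turnover may define it differently, and the helper returns a plain dict rather than the NetAlphaResult dataclass:

```python
import numpy as np
import pandas as pd

def net_alpha_sketch(ic_series: pd.Series, weights_history: pd.DataFrame,
                     cost_bps: float = 10.0, annualisation: int = 252) -> dict:
    gross_alpha = ic_series.mean() * np.sqrt(annualisation)
    # one-way turnover per rebalance: half the sum of absolute weight changes
    turnover = weights_history.diff().abs().sum(axis=1).iloc[1:] / 2.0
    avg_turnover = float(turnover.mean())
    total_cost = avg_turnover * cost_bps / 10_000
    net_alpha = gross_alpha - total_cost
    ic_std = ic_series.std()
    net_icir = net_alpha / (ic_std * np.sqrt(annualisation)) if ic_std > 0 else 0.0
    return {"gross_alpha": gross_alpha, "avg_turnover": avg_turnover,
            "total_cost": total_cost, "net_alpha": net_alpha, "net_icir": net_icir}

ic = pd.Series([0.04, 0.06, 0.05, 0.05])
# fully rotate from A into B over two rebalances: 0.5 one-way turnover each
weights = pd.DataFrame([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]], columns=["A", "B"])
result = net_alpha_sketch(ic, weights)
```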

estimate_factor_premia(factor_mimicking_returns)

Estimate annualized factor premia from long-short returns.

Parameters

factor_mimicking_returns : pd.DataFrame
    Dates x factors matrix of factor-mimicking portfolio returns.

Returns

dict[str, float]
    Annualized premium per factor.

build_factor_mimicking_portfolios(scores, returns, quantile=0.3, weighting='equal', beta_neutral=False, market_returns=None)

Build long-short factor-mimicking portfolio return time series.

For each date the top quantile fraction of assets (by factor score) are held long and the bottom quantile fraction are held short. The long-short return is the equal- or value-weighted long leg minus the corresponding short leg.

The function handles one factor at a time: scores is a dates × assets DataFrame encoding cross-sectional scores for a single factor. For multiple factors, call once per factor and concatenate the results::

factor_returns = pd.concat(
    [
        build_factor_mimicking_portfolios(scores_value, returns)
            .rename(columns={"factor_return": "value"}),
        build_factor_mimicking_portfolios(scores_mom, returns)
            .rename(columns={"factor_return": "momentum"}),
    ],
    axis=1,
)
Parameters

scores : pd.DataFrame
    Dates × assets matrix of cross-sectional factor scores. Index = dates;
    columns = asset tickers.
returns : pd.DataFrame
    Dates × assets matrix of asset returns, aligned with scores on the date
    index. Columns may be a superset or subset of scores columns; the
    intersection is used.
quantile : float, default 0.30
    Fraction of the asset universe assigned to each leg. Must be in
    (0, 0.5].
weighting : {"equal", "value"}, default "equal"
    Weighting scheme within each leg. "equal" assigns every asset in the leg
    the same weight; "value" weights assets by the absolute value of their
    factor score.
beta_neutral : bool, default False
    When True, hedge the long-short portfolio against market beta exposure.
    The hedge ratio adjusts the short-leg weight so that the portfolio beta
    is approximately zero.
market_returns : pd.Series or None
    Market return series, required when beta_neutral=True.

Returns

pd.DataFrame
    Dates × 1 DataFrame of long-short portfolio returns. Column name is
    "factor_return". Index is the intersection of scores and returns dates.
    Missing periods (fewer than 2 * k valid observations) are filled with
    NaN.

Raises

ValueError
    If quantile is outside (0, 0.5] or weighting is unknown.

compute_cross_factor_correlation(factor_returns)

Compute the Pearson correlation matrix across factor-mimicking portfolios.

Parameters

factor_returns : pd.DataFrame
    Dates × factors DataFrame of long-short factor returns, as returned by
    build_factor_mimicking_portfolios (possibly concatenated across multiple
    factors).

Returns

pd.DataFrame
    Factors × factors symmetric correlation matrix. Diagonal entries are
    exactly 1.0. Computed on the rows where all factors have non-NaN returns
    (pairwise-complete otherwise).

compute_quintile_spread(scores, returns, n_quantiles=5)

Compute quintile portfolio returns and spread for a single factor.

At each date assets are ranked by factor score and split into n_quantiles equal-count buckets (Q1 = lowest scores, Qn = highest). Each bucket return is the equal-weight average of its members. The long-short spread is Qn − Q1.

Ties in scores are broken by rank order (method="first"), ensuring every bucket is populated at every date.

Parameters

scores : pd.DataFrame
    Dates × assets matrix of cross-sectional factor scores.
returns : pd.DataFrame
    Dates × assets matrix of asset returns, aligned with scores.
n_quantiles : int, default 5
    Number of equal-count buckets. 5 = quintiles, 10 = deciles. Must be ≥ 2.

Returns

QuintileSpreadResult
    See :class:QuintileSpreadResult for field descriptions.

Raises

ValueError
    If n_quantiles < 2.
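The bucketing and spread logic can be sketched with pandas qcut on tie-broken ranks. This is an illustrative reduction that returns only the spread series, not the full QuintileSpreadResult:

```python
import pandas as pd

def quintile_spread_sketch(scores: pd.DataFrame, returns: pd.DataFrame,
                           n_quantiles: int = 5) -> pd.Series:
    """Equal-weight Qn minus Q1 spread per date; ties broken by rank order."""
    spreads = {}
    for date in scores.index.intersection(returns.index):
        s = scores.loc[date].dropna()
        r = returns.loc[date].reindex(s.index)
        # rank(method="first") breaks ties so every bucket is populated
        buckets = pd.qcut(s.rank(method="first"), n_quantiles, labels=False)
        by_bucket = r.groupby(buckets).mean()
        spreads[date] = by_bucket.iloc[-1] - by_bucket.iloc[0]
    return pd.Series(spreads)

tickers = list("ABCDEFGHIJ")
scores = pd.DataFrame([list(range(1, 11))], index=["2024-01-31"],
                      columns=tickers, dtype=float)
returns = scores / 100.0  # returns monotone in scores: a perfect factor
spread = quintile_spread_sketch(scores, returns)
```

With a perfectly monotone factor the top quintile (returns 0.09, 0.10) beats the bottom (0.01, 0.02) by 0.08.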

fit_gbt_composite(scores, forward_returns, max_depth=3, n_estimators=50)

Fit a gradient-boosted tree model mapping factor scores to forward returns.

Parameters

scores : pd.DataFrame
    Historical tickers x factors matrix (training observations).
forward_returns : pd.Series
    Forward return per ticker for the training period.
max_depth : int
    Maximum depth of individual regression trees (3–5 recommended to limit
    extrapolation and retain interpretability).
n_estimators : int
    Number of boosting rounds.

Returns

GradientBoostingRegressor
    Fitted GBT model.

fit_ridge_composite(scores, forward_returns, alpha=1.0)

Fit a ridge regression model mapping factor scores to forward returns.

Parameters

scores : pd.DataFrame
    Historical tickers x factors matrix (training observations). Must be
    aligned with forward_returns on the index.
forward_returns : pd.Series
    Forward return per ticker for the training period.
alpha : float
    L2 regularisation strength. A single-element array is passed to RidgeCV
    so cross-validation still runs internally if multiple alphas are
    desired; here we keep one alpha for determinism.

Returns

RidgeCV
    Fitted ridge model. Call predict(scores) for new data.
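The fit this function performs can be sketched with the closed-form ridge solution instead of scikit-learn's RidgeCV (an illustrative substitute, assuming centred features and no penalty on the intercept):

```python
import numpy as np
import pandas as pd

def fit_ridge_sketch(scores: pd.DataFrame, forward_returns: pd.Series,
                     alpha: float = 1.0):
    """Closed-form ridge: beta = (Xc'Xc + alpha*I)^-1 Xc'yc on centred data."""
    X = scores.to_numpy()
    y = forward_returns.reindex(scores.index).to_numpy()
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    beta = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    intercept = y_mean - x_mean @ beta
    return beta, intercept

rng = np.random.default_rng(0)
scores = pd.DataFrame(rng.normal(size=(300, 2)), columns=["value", "momentum"])
# forward returns generated from known weights 0.02 and -0.01 (no noise)
fwd = pd.Series(scores["value"] * 0.02 - scores["momentum"] * 0.01)
beta, intercept = fit_ridge_sketch(scores, fwd, alpha=1e-6)
```

With negligible shrinkage and noiseless data the learned weights recover the generating coefficients, which is the behaviour the docstring promises for the factor-weight mapping.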

predict_composite_scores(model, scores)

Apply a fitted ridge or GBT model to produce normalised composite scores.

The raw predictions are standardised to zero mean and unit variance so the output is on the same scale as z-score factor inputs.

Parameters

model : RidgeCV or GradientBoostingRegressor
    A model returned by :func:fit_ridge_composite or :func:fit_gbt_composite.
scores : pd.DataFrame
    Current-period tickers x factors matrix.

Returns

pd.Series
    Normalised composite score per ticker (zero mean, unit variance).
    Tickers with all-NaN factor rows receive NaN.

run_factor_oos_validation(scores, returns, config=None, cpcv_config=None)

Rolling block or CPCV out-of-sample validation of factor IC and spreads.

Parameters

scores : pd.DataFrame
    Panel of standardised factor scores with a two-level row MultiIndex
    (date, ticker) and one column per factor.
returns : pd.DataFrame
    Forward returns panel with the same (date, ticker) MultiIndex and a
    single return column.
config : FactorOOSConfig or None
    Rolling window parameters. Defaults to FactorOOSConfig(). Ignored when
    cpcv_config is provided.
cpcv_config : CPCVConfig or None
    When provided, uses combinatorial purged cross-validation instead of
    rolling blocks. Overrides config.

Returns

FactorOOSResult
    Per-fold and aggregate IC and quintile spread statistics.

Notes

The validation window computation uses only val-window dates; no training-window data is used. Fold count equals floor((total_months - train_months) / step_months) for rolling, or C(n_folds, n_test_folds) for CPCV.

apply_regime_tilts(group_weights, regime, config=None)

Apply regime-conditional multiplicative tilts to group weights.

Parameters

group_weights : dict[FactorGroupType, float]
    Base group weights.
regime : MacroRegime
    Current macro regime.
config : RegimeTiltConfig or None
    Tilt configuration.

Returns

dict[FactorGroupType, float]
    Tilted group weights (re-normalized to sum to the original total).
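The tilt-then-renormalise step can be sketched with plain dicts (string keys here stand in for FactorGroupType; the helper is illustrative):

```python
def apply_tilts_sketch(group_weights: dict, tilts: dict) -> dict:
    """Multiply by per-group tilts, then rescale to preserve the original total."""
    original_total = sum(group_weights.values())
    # groups without an explicit tilt keep a multiplier of 1.0
    tilted = {g: w * tilts.get(g, 1.0) for g, w in group_weights.items()}
    scale = original_total / sum(tilted.values())
    return {g: w * scale for g, w in tilted.items()}

base = {"value": 0.5, "momentum": 0.5}
out = apply_tilts_sketch(base, {"value": 1.2})  # tilt value up in this regime
```

Renormalisation keeps the total weight at 1.0 while shifting relative emphasis toward the tilted group.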

classify_regime(macro_data)

Classify the current macro-economic regime.

Uses a simple heuristic based on GDP growth and leading indicators. The regime is determined by the latest observation's position relative to trend.

Parameters

macro_data : pd.DataFrame
    Macro indicators with columns that may include gdp_growth,
    leading_indicator, yield_spread, unemployment_rate. Index is date.

Returns

MacroRegime
    Current regime classification.

get_regime_tilts(regime, config=None)

Get multiplicative tilts for a given regime.

Parameters

regime : MacroRegime
    Current macro regime.
config : RegimeTiltConfig or None
    Tilt configuration.

Returns

dict[FactorGroupType, float]
    Multiplicative tilt per group. Groups not listed get a tilt of 1.0.

compute_composite_score(standardized_factors, coverage, config=None, ic_history=None, training_scores=None, training_returns=None, group_weights=None)

Compute composite score from standardized factors.

Parameters

standardized_factors : pd.DataFrame
    Tickers x factors matrix.
coverage : pd.DataFrame
    Boolean coverage matrix.
config : CompositeScoringConfig or None
    Scoring configuration.
ic_history : pd.DataFrame or None
    Required when config.method is IC_WEIGHTED or ICIR_WEIGHTED. Columns
    must match group names; each column is treated as the IC time series for
    that group.
training_scores : pd.DataFrame or None
    Required when config.method is RIDGE_WEIGHTED or GBT_WEIGHTED.
    Historical tickers x factors matrix used to train the ML model (must not
    overlap with current-period data).
training_returns : pd.Series or None
    Required when config.method is RIDGE_WEIGHTED or GBT_WEIGHTED. Forward
    returns aligned with training_scores.
group_weights : dict[str, float] or None
    Pre-computed group weights (e.g. from regime tilts). Threaded through to
    the inner scoring functions.

Returns

pd.Series
    Composite score per ticker.

compute_equal_weight_composite(group_scores, config=None, group_weights=None)

Equal-weight composite with core/supplementary tiering.

Parameters

group_scores : pd.DataFrame
    Tickers x groups matrix.
config : CompositeScoringConfig or None
    Scoring configuration.
group_weights : dict[str, float] or None
    Pre-computed group weights (e.g. from regime tilts). When provided, skip
    tier-based derivation and use these weights directly.

Returns

pd.Series
    Composite score per ticker.

compute_group_scores(standardized_factors, coverage)

Average factor scores within each group.

Parameters

standardized_factors : pd.DataFrame
    Tickers x factors matrix of standardized scores.
coverage : pd.DataFrame
    Boolean matrix of non-NaN coverage.

Returns

pd.DataFrame
    Tickers x groups matrix of group-level scores.

compute_ic_weighted_composite(group_scores, ic_history, config=None, group_weights=None)

IC-weighted composite score.

Uses trailing information coefficient history to weight groups.

Parameters

group_scores : pd.DataFrame
    Tickers x groups matrix.
ic_history : pd.DataFrame
    Periods x groups matrix of IC values.
config : CompositeScoringConfig or None
    Scoring configuration.
group_weights : dict[str, float] or None
    Pre-computed group weights (e.g. from regime tilts). When provided, use
    as tier multipliers instead of config core/supplementary weights.

Returns

pd.Series
    Composite score per ticker.

compute_icir_weighted_composite(group_scores, ic_series_per_group, config=None, group_weights=None)

ICIR-weighted composite score.

Weights each group by |ICIR| = |mean(IC) / std(IC)|, normalised to sum to 1. Groups with zero or undefined ICIR receive zero weight. Falls back to equal-weight when all groups have ICIR = 0.

Parameters

group_scores : pd.DataFrame
    Tickers x groups matrix.
ic_series_per_group : dict[str, pd.Series]
    Per-group IC time series. Keys must match group_scores columns.
config : CompositeScoringConfig or None
    Scoring configuration.
group_weights : dict[str, float] or None
    Pre-computed group weights (e.g. from regime tilts). When provided, use
    as tier multipliers instead of config core/supplementary weights.

Returns

pd.Series
    Composite score per ticker.

compute_ml_composite(standardized_factors, training_scores, training_returns, config)

ML composite score using ridge regression or gradient-boosted trees.

Trains the model on historical (training_scores, training_returns) and predicts on the current-period standardized_factors. The prediction is normalised to zero mean and unit variance.

The training window must end strictly before the prediction date to avoid look-ahead bias; callers are responsible for this temporal split.

Parameters

standardized_factors : pd.DataFrame
    Current-period tickers x factors matrix (prediction target).
training_scores : pd.DataFrame
    Historical tickers x factors matrix aligned with training_returns.
training_returns : pd.Series
    Forward return per ticker for the training period.
config : CompositeScoringConfig
    Must have method set to RIDGE_WEIGHTED or GBT_WEIGHTED.

Returns

pd.Series
    Normalised composite score per ticker (zero mean, unit variance).
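The ridge branch can be sketched with closed-form ridge regression in plain numpy. This is an assumption-laden illustration: the library uses RidgeCV internally, while this sketch solves beta = (X'X + alpha I)^-1 X'y directly and skips the intercept. The train/predict split and final normalisation mirror the contract above.

```python
import numpy as np
import pandas as pd

def ridge_composite(standardized_factors: pd.DataFrame,
                    training_scores: pd.DataFrame,
                    training_returns: pd.Series,
                    alpha: float = 1.0) -> pd.Series:
    # align training rows: drop tickers missing either scores or returns
    X = training_scores.dropna()
    y = training_returns.reindex(X.index).dropna()
    X = X.loc[y.index]
    Xm = X.to_numpy()
    # closed-form ridge solution (no intercept, for simplicity)
    k = Xm.shape[1]
    beta = np.linalg.solve(Xm.T @ Xm + alpha * np.eye(k), Xm.T @ y.to_numpy())
    pred = pd.Series(standardized_factors.to_numpy() @ beta,
                     index=standardized_factors.index)
    return (pred - pred.mean()) / pred.std()    # zero mean, unit variance
```

The temporal split is the caller's job, exactly as the docstring warns: `training_scores`/`training_returns` must end strictly before the date `standardized_factors` refers to.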

apply_sector_balance(selected, scores, sector_labels, parent_universe, tolerance=0.05)

Adjust selection for sector-proportional representation.

Ensures no sector is over- or under-represented relative to the parent universe by more than tolerance.

Parameters

selected : pd.Index
    Initially selected tickers.
scores : pd.Series
    Composite scores for all candidates.
sector_labels : pd.Series
    Sector label per ticker.
parent_universe : pd.Index
    Full universe for computing target sector weights.
tolerance : float
    Maximum deviation from parent sector weights.

Returns

pd.Index
    Sector-balanced selection.

compute_selection_turnover(current, new, universe)

Compute selection turnover as fraction of universe changed.

Parameters

current : pd.Index
    Currently selected tickers.
new : pd.Index
    Newly selected tickers.
universe : pd.Index
    Full investable universe.

Returns

float
    len(added | removed) / len(universe), or 0.0 if universe is empty.
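The formula in the return description is simple enough to state directly. A minimal sketch (added and removed sets are disjoint, so their sizes just sum):

```python
import pandas as pd

def selection_turnover(current: pd.Index, new: pd.Index,
                       universe: pd.Index) -> float:
    """Fraction of the universe that changed selection membership."""
    if len(universe) == 0:
        return 0.0
    added = new.difference(current)
    removed = current.difference(new)
    return (len(added) + len(removed)) / len(universe)
```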

select_fixed_count(scores, target_count, buffer_fraction=0.1, current_members=None)

Select top N stocks by composite score with buffer.

Parameters

scores : pd.Series
    Composite scores indexed by ticker.
target_count : int
    Target number of stocks.
buffer_fraction : float
    Buffer as a fraction of target_count. Current members within the buffer
    zone are retained.
current_members : pd.Index or None
    Tickers currently selected.

Returns

pd.Index
    Selected tickers.
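One plausible reading of the buffer rule is: rank by score, retain incumbents that sit anywhere in the top `target_count * (1 + buffer_fraction)` ranks, then fill the remaining slots with the best non-incumbents. This sketch is an interpretation, not the library's exact logic (in particular it does not cap the result if many incumbents crowd the buffer):

```python
import pandas as pd

def select_top_n_with_buffer(scores, target_count,
                             buffer_fraction=0.1, current_members=None):
    ranked = scores.sort_values(ascending=False)
    if current_members is None:
        return ranked.index[:target_count]
    buffer_rank = int(target_count * (1 + buffer_fraction))
    # incumbents anywhere inside the buffer zone are retained
    keep = ranked.index[:buffer_rank].intersection(current_members)
    # fill remaining slots with the highest-scoring non-retained names
    fill = [t for t in ranked.index
            if t not in keep][: max(target_count - len(keep), 0)]
    return keep.union(pd.Index(fill))
```

The buffer reduces churn: a current member slightly below rank N is not swapped out for a marginally better newcomer.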

select_quantile(scores, target_quantile=0.8, exit_quantile=None, current_members=None)

Select stocks above a quantile threshold.

Parameters

scores : pd.Series
    Composite scores indexed by ticker.
target_quantile : float
    Quantile threshold for entry (0-1).
exit_quantile : float or None
    Quantile threshold for exit (hysteresis). If None, uses target_quantile.
current_members : pd.Index or None
    Currently selected tickers.

Returns

pd.Index
    Selected tickers.
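The entry/exit hysteresis can be sketched as two cuts: new names must clear the entry quantile, while incumbents only drop out once they fall below the (lower) exit quantile. An illustrative sketch, not the library's implementation:

```python
import pandas as pd

def select_above_quantile(scores, target_quantile=0.8,
                          exit_quantile=None, current_members=None):
    exit_q = target_quantile if exit_quantile is None else exit_quantile
    entry_cut = scores.quantile(target_quantile)
    selected = scores[scores >= entry_cut].index
    if current_members is not None:
        # hysteresis: incumbents stay until they fall below the exit cut
        exit_cut = scores.quantile(exit_q)
        incumbents = scores.reindex(current_members).dropna()
        selected = selected.union(incumbents[incumbents >= exit_cut].index)
    return selected
```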

select_stocks(scores, config=None, current_members=None, sector_labels=None, parent_universe=None, return_turnover=False)

Select stocks from scored universe.

Parameters

scores : pd.Series
    Composite scores indexed by ticker.
config : SelectionConfig or None
    Selection configuration.
current_members : pd.Index or None
    Currently selected tickers for buffer/hysteresis.
sector_labels : pd.Series or None
    Sector labels for sector balancing.
parent_universe : pd.Index or None
    Full universe for sector weight targets.
return_turnover : bool
    When True, return (selected, turnover) tuple.

Returns

pd.Index or tuple[pd.Index, float]
    Selected tickers, optionally with turnover.

neutralize_sector(scores, sector_labels, country_labels=None)

Demean scores within each sector (and optionally country).

Parameters

scores : pd.Series
    Standardized factor scores.
sector_labels : pd.Series
    Sector label per ticker.
country_labels : pd.Series or None
    Country label per ticker for country neutralization.

Returns

pd.Series
    Sector-neutralized scores.
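Demeaning within groups is a one-line pandas idiom. A minimal sketch (for the combined sector-and-country case, one would group by both label series, e.g. `scores.groupby([sector_labels, country_labels])`):

```python
import pandas as pd

def neutralize_by_group(scores: pd.Series, labels: pd.Series) -> pd.Series:
    # subtract each group's mean so every group averages to exactly zero
    return scores - scores.groupby(labels).transform("mean")
```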

orthogonalize_factors(factor_scores, method='pca', min_variance_explained=0.95)

Project factor scores onto orthogonal principal components.

Eliminates multicollinearity among factor scores by projecting them into a lower-dimensional PCA space. Retains the minimum number of components that explain at least min_variance_explained of the total variance.

Parameters

factor_scores : pd.DataFrame
    Tickers × factors matrix of factor scores.
method : str
    Projection method. Only "pca" is supported.
min_variance_explained : float
    Minimum cumulative explained variance ratio for retained components.
    Must be in (0, 1].

Returns

pd.DataFrame
    Tickers × PCs matrix with columns named PC1, PC2, .... Rows with NaN in
    the input are filled with NaN in the output but otherwise preserve the
    original index.

Raises

ConfigurationError
    If method is not "pca".
DataError
    If fewer than 2 factors or fewer than 2 non-NaN observations.
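The projection can be sketched with a plain SVD: centre the complete rows, take the fewest components whose cumulative explained variance reaches the threshold, and reindex so NaN rows come back as NaN. This sketch skips the library's error handling:

```python
import numpy as np
import pandas as pd

def pca_orthogonalize(factor_scores: pd.DataFrame,
                      min_variance_explained: float = 0.95) -> pd.DataFrame:
    complete = factor_scores.dropna()              # rows with full coverage
    X = complete.to_numpy() - complete.to_numpy().mean(axis=0)
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    explained = (s ** 2) / (s ** 2).sum()
    # fewest components whose cumulative explained variance hits the threshold
    k = int(np.searchsorted(np.cumsum(explained), min_variance_explained)) + 1
    pcs = X @ Vt[:k].T
    out = pd.DataFrame(pcs, index=complete.index,
                       columns=[f"PC{i + 1}" for i in range(k)])
    return out.reindex(factor_scores.index)        # NaN rows restored
```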

rank_normal_standardize(scores)

Rank-normal (inverse normal) standardization.

Uses Phi^-1((rank - 0.5) / N) to map ranks to a normal distribution, robust to heavy-tailed distributions.

Parameters

scores : pd.Series
    Factor scores (may contain NaN).

Returns

pd.Series
    Rank-normalized scores.
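The Phi^-1((rank - 0.5) / N) mapping can be sketched with the standard library's inverse normal CDF. An illustration, not the library's implementation:

```python
from statistics import NormalDist

import pandas as pd

def rank_normal_transform(scores: pd.Series) -> pd.Series:
    valid = scores.dropna()
    ranks = valid.rank(method="average")          # 1..N, ties averaged
    quantiles = (ranks - 0.5) / len(valid)        # strictly inside (0, 1)
    normal = NormalDist()
    out = pd.Series([normal.inv_cdf(q) for q in quantiles], index=valid.index)
    return out.reindex(scores.index)              # NaNs preserved
```

The half-rank offset keeps the quantiles away from 0 and 1, so the inverse CDF never diverges.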

standardize_all_factors(raw_factors, config=None, sector_labels=None, country_labels=None)

Standardize all factors and compute coverage.

Parameters

raw_factors : pd.DataFrame
    Tickers x factors matrix of raw values.
config : StandardizationConfig or None
    Standardization parameters.
sector_labels : pd.Series or None
    Sector labels for neutralization.
country_labels : pd.Series or None
    Country labels for neutralization.

Returns

tuple[pd.DataFrame, pd.DataFrame]
    (standardized_scores, coverage) where coverage is a boolean DataFrame
    indicating non-NaN values.

standardize_factor(raw_scores, config=None, sector_labels=None, country_labels=None)

Full standardization pipeline for a single factor.

Parameters

raw_scores : pd.Series
    Raw factor values.
config : StandardizationConfig or None
    Standardization parameters.
sector_labels : pd.Series or None
    Sector labels for neutralization.
country_labels : pd.Series or None
    Country labels for neutralization.

Returns

pd.Series
    Standardized factor scores.

winsorize_cross_section(scores, lower_pct=0.01, upper_pct=0.99)

Clip scores at percentile boundaries.

Parameters

scores : pd.Series
    Raw factor scores.
lower_pct : float
    Lower percentile (0-1).
upper_pct : float
    Upper percentile (0-1).

Returns

pd.Series
    Winsorized scores.
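Percentile clipping is a two-liner in pandas. A minimal sketch:

```python
import pandas as pd

def winsorize(scores: pd.Series, lower_pct: float = 0.01,
              upper_pct: float = 0.99) -> pd.Series:
    # clip extreme values to the empirical percentile boundaries
    lo, hi = scores.quantile([lower_pct, upper_pct])
    return scores.clip(lower=lo, upper=hi)
```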

z_score_standardize(scores)

Z-score standardization: (x - mean) / std.

Parameters

scores : pd.Series
    Factor scores (may contain NaN).

Returns

pd.Series
    Standardized scores with mean 0 and std 1.
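The formula maps directly to pandas, which skips NaN in both the mean and the standard deviation:

```python
import pandas as pd

def z_score(scores: pd.Series) -> pd.Series:
    # pandas mean/std skip NaN, so missing values pass through untouched
    return (scores - scores.mean()) / scores.std()
```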

benjamini_hochberg(p_values, alpha=0.05)

Benjamini-Hochberg FDR correction.

Parameters

p_values : pd.Series
    Raw p-values indexed by factor name.
alpha : float
    FDR significance level.

Returns

pd.Series
    Boolean series indicating significant factors.
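The BH step-up rule can be sketched directly: sort the p-values, find the largest k with p_(k) <= alpha * k / m, and reject everything at or below that cutoff. An illustration, not the library's implementation:

```python
import numpy as np
import pandas as pd

def bh_significant(p_values: pd.Series, alpha: float = 0.05) -> pd.Series:
    ordered = p_values.sort_values()
    m = len(ordered)
    # step-up comparison against the BH line alpha * k / m
    below = ordered.to_numpy() <= alpha * np.arange(1, m + 1) / m
    if not below.any():
        return pd.Series(False, index=p_values.index)
    cutoff = ordered.iloc[np.nonzero(below)[0].max()]
    return p_values <= cutoff
```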

compute_ic_series(factor_scores_history, returns_history, factor_name)

Compute IC time series for a factor.

Parameters

factor_scores_history : pd.DataFrame
    Dates x tickers matrix of factor scores.
returns_history : pd.DataFrame
    Dates x tickers matrix of forward returns.
factor_name : str
    Used only for labeling.

Returns

pd.Series
    IC values indexed by date.

compute_ic_stats(ic_series, lags=5)

Compute full IC statistics including Newey-West t-stat and ICIR.

Parameters

ic_series : pd.Series
    Time series of IC values (one per cross-section date).
lags : int
    Number of lags for Newey-West HAC standard errors.

Returns

ICStats
    Dataclass containing mean, variance_nw, t_stat_nw, p_value, and icir.

compute_icir(ic_series)

Compute the IC Information Ratio (mean IC / std IC).

ICIR penalises factors with high average IC but also high IC volatility (inconsistent predictors). Use this as the weighting signal in ICIR-weighted composite scoring.

Parameters

ic_series : pd.Series
    Time series of IC values (one per cross-section date).

Returns

float
    ICIR value, or 0.0 if std(IC) == 0 or fewer than 2 non-NaN observations.

compute_monthly_ic(factor_scores, forward_returns)

Compute rank information coefficient (Spearman correlation).

Parameters

factor_scores : pd.Series
    Cross-sectional factor scores.
forward_returns : pd.Series
    Forward returns for the same tickers.

Returns

float
    Rank IC (Spearman correlation).
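A minimal sketch of the rank IC, including the alignment step that drops tickers missing either input:

```python
import pandas as pd

def rank_ic(factor_scores: pd.Series, forward_returns: pd.Series) -> float:
    # align on the common tickers and drop pairs with missing data
    paired = pd.concat([factor_scores, forward_returns], axis=1,
                       keys=["factor", "ret"]).dropna()
    return paired["factor"].corr(paired["ret"], method="spearman")
```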

compute_newey_west_tstat(ic_series, n_lags=6)

Compute Newey-West t-statistic for IC significance.

Parameters

ic_series : pd.Series
    Time series of IC values.
n_lags : int
    Number of lags for HAC standard errors.

Returns

tuple[float, float]
    (t_statistic, p_value).
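The HAC correction can be sketched with a Bartlett-kernel long-run variance: weight autocovariances by 1 - lag/(L+1), form the standard error of the mean, and read a two-sided p-value off the normal distribution. The normal approximation (rather than a t distribution) is an assumption of this sketch, not necessarily what the library does:

```python
from statistics import NormalDist

import numpy as np

def newey_west_tstat(ic_series, n_lags: int = 6):
    ic = np.asarray(ic_series, dtype=float)
    T = len(ic)
    x = ic - ic.mean()
    lrv = (x @ x) / T                           # gamma_0
    for lag in range(1, n_lags + 1):
        w = 1.0 - lag / (n_lags + 1)            # Bartlett kernel weight
        lrv += 2.0 * w * (x[lag:] @ x[:-lag]) / T
    se = np.sqrt(lrv / T)                       # HAC std error of the mean
    t = ic.mean() / se
    p = 2.0 * (1.0 - NormalDist().cdf(abs(t)))  # two-sided, normal approx
    return t, p
```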

compute_quantile_spread(factor_scores, forward_returns, n_quantiles=5)

Compute long-short quantile spread return.

Parameters

factor_scores : pd.Series
    Cross-sectional factor scores.
forward_returns : pd.Series
    Forward returns.
n_quantiles : int
    Number of quantile buckets.

Returns

float
    Top quantile return minus bottom quantile return.
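Bucketing and differencing can be sketched with `pd.qcut`. Ranking first is a common trick (assumed here, not confirmed for the library) so that tied scores never collapse a bucket:

```python
import pandas as pd

def quantile_spread(factor_scores: pd.Series, forward_returns: pd.Series,
                    n_quantiles: int = 5) -> float:
    # rank first so ties never collapse a quantile bucket
    buckets = pd.qcut(factor_scores.rank(method="first"),
                      n_quantiles, labels=False)
    bucket_means = forward_returns.groupby(buckets).mean()
    return bucket_means.iloc[-1] - bucket_means.iloc[0]   # top minus bottom
```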

compute_vif(factor_matrix)

Compute variance inflation factors for multicollinearity.

Parameters

factor_matrix : pd.DataFrame
    Tickers x factors matrix (no NaN). Must contain at least 2 factors.

Returns

pd.Series
    VIF per factor. Values are ≥ 1.0 by construction.

Raises

ValueError
    If fewer than 2 factor columns are provided.
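VIF_j is 1 / (1 - R²_j), where R²_j comes from regressing factor j on the remaining factors. A numpy-only sketch of that definition (not the library's implementation):

```python
import numpy as np
import pandas as pd

def vif(factor_matrix: pd.DataFrame) -> pd.Series:
    if factor_matrix.shape[1] < 2:
        raise ValueError("need at least 2 factor columns")
    out = {}
    for col in factor_matrix.columns:
        y = factor_matrix[col].to_numpy()
        X = factor_matrix.drop(columns=col).to_numpy()
        X = np.column_stack([np.ones(len(X)), X])       # intercept term
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        r_squared = 1.0 - resid.var() / y.var()
        out[col] = np.inf if r_squared >= 1.0 else 1.0 / (1.0 - r_squared)
    return pd.Series(out)
```

With an intercept included, R² is non-negative, which is why every VIF is at least 1.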

correct_pvalues(p_values, alpha=0.05)

Apply Holm-Bonferroni and Benjamini-Hochberg multiple testing corrections.

Parameters

p_values : ndarray, shape (m,)
    Raw p-values in any order.
alpha : float
    Significance level used to compute the adjustments (does not filter here;
    callers compare adjusted p-values against alpha).

Returns

CorrectedPValues
    holm : FWER-controlling Holm-Bonferroni adjusted p-values.
    bh : FDR-controlling Benjamini-Hochberg adjusted p-values.
    Both arrays are returned in the same order as the input.
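Both adjustments can be sketched in vectorised numpy: Holm is a step-down (m - i) * p_(i) made monotone with a running max, BH is a step-up p_(i) * m / i made monotone with a running min from the largest p. This sketch returns a plain tuple rather than the CorrectedPValues container:

```python
import numpy as np

def holm_bh_adjust(p_values):
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)
    ps = p[order]
    # Holm: step-down (m - i) * p_(i), made monotone with a running max
    holm = np.minimum(np.maximum.accumulate((m - np.arange(m)) * ps), 1.0)
    # BH: step-up p_(i) * m / i, made monotone with a running min from the top
    raw_bh = ps * m / np.arange(1, m + 1)
    bh = np.minimum(np.minimum.accumulate(raw_bh[::-1])[::-1], 1.0)
    holm_out, bh_out = np.empty(m), np.empty(m)
    holm_out[order] = holm                 # restore the caller's ordering
    bh_out[order] = bh
    return holm_out, bh_out
```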

run_factor_validation(factor_scores_history, returns_history, config=None)

Run complete factor validation suite.

Parameters

factor_scores_history : dict[str, pd.DataFrame]
    Factor name -> (dates x tickers) score history.
returns_history : pd.DataFrame
    Dates x tickers forward return matrix.
config : FactorValidationConfig or None
    Validation parameters.

Returns

FactorValidationReport
    Complete validation results.

validate_factor_universe(ic_matrix, lags=5, alpha=0.05)

Validate all factors simultaneously with multiple testing correction.

Parameters

ic_matrix : pd.DataFrame
    Dates × factors matrix of IC values (one IC per period per factor).
lags : int
    Number of Newey-West HAC lags.
alpha : float
    Significance level for both FWER and FDR rejection decisions.

Returns

pd.DataFrame
    Factor × statistic summary with columns: ic_mean, icir, t_stat_nw,
    p_value_raw, p_value_holm, p_value_bh, significant_holm, significant_bh.