factors¶
optimizer.factors
¶
Factor construction, scoring, and selection for stock pre-selection.
CompositeMethod
¶
Bases: str, Enum
Composite scoring method.
CompositeScoringConfig
dataclass
¶
Configuration for composite score construction.
Parameters¶
method : CompositeMethod
Equal-weight, IC-weighted, ICIR-weighted, ridge, or GBT composite.
ic_lookback : int
Number of periods for IC estimation when using IC weighting.
core_weight : float
Relative weight for core factor groups.
supplementary_weight : float
Relative weight for supplementary factor groups.
ridge_alpha : float
L2 regularisation strength for RIDGE_WEIGHTED. Passed as the
single candidate to RidgeCV; increase for more shrinkage.
gbt_max_depth : int
Maximum tree depth for GBT_WEIGHTED.
gbt_n_estimators : int
Number of boosting rounds for GBT_WEIGHTED.
for_equal_weight()
classmethod
¶
Equal-weight composite scoring.
for_ic_weighted()
classmethod
¶
IC-weighted composite scoring (raw IC magnitude).
for_icir_weighted()
classmethod
¶
ICIR-weighted composite scoring (mean IC / std IC).
Penalises inconsistent predictors by dividing mean IC by IC volatility before normalising weights.
for_ridge_weighted()
classmethod
¶
Ridge regression composite scoring.
Learns optimal linear factor weights from historical data with L2 regularisation, avoiding the need for IC proxies.
for_gbt_weighted()
classmethod
¶
Gradient-boosted tree composite scoring.
Captures non-linear factor interactions (e.g. high value + improving momentum = stronger combined signal).
FactorConstructionConfig
dataclass
¶
Configuration for factor computation.
Parameters¶
factors : tuple[FactorType, ...]
Which factors to compute.
momentum_lookback : int
Lookback window for momentum in trading days.
momentum_skip : int
Recent days to skip for momentum (reversal avoidance).
volatility_lookback : int
Lookback window for volatility in trading days.
beta_lookback : int
Lookback window for beta estimation in trading days.
amihud_lookback : int
Lookback window for Amihud illiquidity in trading days.
publication_lag : PublicationLagConfig
Per-source publication lags for point-in-time correctness.
Pass a plain int for a uniform lag across all sources
(backward-compatible; converted to :class:PublicationLagConfig
automatically).
FactorGroupType
¶
Bases: str, Enum
Factor group taxonomy.
FactorIntegrationConfig
dataclass
¶
Configuration for bridging factor scores to optimization.
Parameters¶
risk_free_rate : float
Annual risk-free rate for expected return mapping.
market_risk_premium : float
Annual equity risk premium.
use_black_litterman : bool
Whether to generate Black-Litterman views from factor scores.
exposure_lower_bound : float
Lower bound for factor exposure constraints.
exposure_upper_bound : float
Upper bound for factor exposure constraints.
FactorType
¶
Bases: str, Enum
Individual factor identifiers.
FactorValidationConfig
dataclass
¶
Configuration for factor validation and statistical testing.
Parameters¶
newey_west_lags : int
Number of lags for Newey-West t-statistic.
t_stat_threshold : float
Minimum absolute t-statistic for significance.
fdr_alpha : float
False discovery rate alpha level.
n_quantiles : int
Number of quantiles for spread analysis.
fmp_top_pct : float
Top percentile for factor-mimicking portfolios.
fmp_bottom_pct : float
Bottom percentile for factor-mimicking portfolios.
GroupWeight
¶
Bases: str, Enum
Weight tier for factor groups.
MacroRegime
¶
Bases: str, Enum
Macro-economic regime classification.
PublicationLagConfig
dataclass
¶
Differentiated publication lags by data source type.
Each source has an independent delay between the period end date and the date the data is reliably available for use in factor construction. Using source-specific lags avoids look-ahead bias when aligning fundamental data to price dates.
Parameters¶
annual_days : int
Lag for annual financial statements (days after fiscal year end).
Default: 90 days (~3 months for 10-K filing).
quarterly_days : int
Lag for quarterly financial statements (days after quarter end).
Default: 45 days (~6 weeks for 10-Q filing).
analyst_days : int
Lag for analyst estimates and recommendations. Default: 5 days
(short dissemination buffer).
macro_days : int
Lag for macroeconomic indicators (release lag + revision lag).
Default: 63 days (~2 months).
uniform(days)
classmethod
¶
Create a config with the same lag applied to all sources.
RegimeTiltConfig
dataclass
¶
Configuration for macro regime factor tilts.
Per-regime multiplicative tilts stored as tuples of
(group_name, tilt_factor) for frozen-dataclass compatibility.
Parameters¶
enable : bool
Whether to apply regime tilts.
expansion_tilts : tuple[tuple[str, float], ...]
Group tilts during expansion.
slowdown_tilts : tuple[tuple[str, float], ...]
Group tilts during slowdown.
recession_tilts : tuple[tuple[str, float], ...]
Group tilts during recession.
recovery_tilts : tuple[tuple[str, float], ...]
Group tilts during recovery.
SelectionConfig
dataclass
¶
Configuration for stock selection from scored universe.
Parameters¶
method : SelectionMethod
Fixed-count or quantile-based selection.
target_count : int
Number of stocks to select (for FIXED_COUNT).
target_quantile : float
Quantile threshold for selection (for QUANTILE, 0-1).
exit_quantile : float
Exit quantile for hysteresis (for QUANTILE).
buffer_fraction : float
Buffer zone fraction around selection boundary.
sector_balance : bool
Whether to enforce sector-proportional representation.
sector_tolerance : float
Maximum deviation from parent universe sector weights.
SelectionMethod
¶
Bases: str, Enum
Stock selection method.
StandardizationConfig
dataclass
¶
Configuration for cross-sectional factor standardization.
Parameters¶
method : StandardizationMethod
Z-score or rank-normal standardization.
winsorize_lower : float
Lower percentile for winsorization (0-1).
winsorize_upper : float
Upper percentile for winsorization (0-1).
neutralize_sector : bool
Whether to sector-neutralize scores.
neutralize_country : bool
Whether to country-neutralize scores.
StandardizationMethod
¶
Bases: str, Enum
Cross-sectional standardization method.
FactorPCAResult
dataclass
¶
Principal component analysis result for a factor score matrix.
Attributes¶
explained_variance_ratio : ndarray, shape (n_components,)
Fraction of variance explained by each principal component,
sorted in descending order.
loadings : pd.DataFrame, shape (n_factors, n_components)
PCA loading matrix. Rows are factor names; columns are
PC1, PC2, ... . Each column is a unit eigenvector of
the correlation matrix of the factor scores.
n_components_95pct : int
Smallest number of components whose cumulative explained
variance ratio is ≥ 0.95.
FactorExposureConstraints
dataclass
¶
Enforceable linear constraints on portfolio factor exposure.
Encodes the set of per-factor inequalities::
lb_g <= sum_i w_i * z_{i,g} <= ub_g
as a pair of matrices ready to be passed directly to
:class:skfolio.optimization.MeanRisk (or any optimizer that
accepts left_inequality / right_inequality).
Parameters¶
left_inequality : np.ndarray of shape (2 * n_factors, n_assets)
Inequality matrix A in the constraint A @ w <= b.
Two rows per factor: -z (lower bound) and +z (upper bound).
right_inequality : np.ndarray of shape (2 * n_factors,)
Bound vector b in the constraint A @ w <= b.
factor_names : list[str]
Names of the constrained factors (in the same order as the row
pairs in left_inequality).
lower_bounds : np.ndarray of shape (n_factors,)
Lower exposure bound per factor.
upper_bounds : np.ndarray of shape (n_factors,)
Upper exposure bound per factor.
NetAlphaResult
dataclass
¶
Result of net alpha calculation after transaction cost deduction.
Attributes¶
gross_alpha : float
Annualised IC-based alpha proxy: mean(IC) * sqrt(annualisation).
avg_turnover : float
Mean one-way turnover across consecutive rebalancing dates, computed
via :func:~optimizer.rebalancing._rebalancer.compute_turnover.
total_cost : float
Cost deduction: avg_turnover * cost_bps / 10_000.
net_alpha : float
Net annualised alpha after cost deduction:
gross_alpha - total_cost.
net_icir : float
Net information coefficient information ratio:
net_alpha / (std(IC) * sqrt(annualisation)).
0.0 when the IC series has zero variance.
QuintileSpreadResult
dataclass
¶
Quintile spread analysis result for a single factor.
Attributes¶
quintile_returns : pd.DataFrame
Dates × Q1..Qn equal-weight portfolio returns per quantile bucket.
Q1 = bottom (lowest scores), Qn = top (highest scores).
spread_returns : pd.Series
Qn − Q1 long-short spread return series indexed by date.
Equals quintile_returns.iloc[:, -1] - quintile_returns.iloc[:, 0]
element-wise.
annualised_mean : float
spread_returns.mean() * 252.
t_stat : float
Two-tailed t-statistic: mean / (std / sqrt(T)).
sharpe : float
Annualised Sharpe ratio: mean * sqrt(252) / std.
FactorOOSConfig
dataclass
¶
Configuration for rolling block OOS validation.
Parameters¶
train_months : int
Length of the training window in months. Default: 36.
val_months : int
Length of the validation window in months. Default: 12.
step_months : int
Number of months to roll forward between folds. Default: 6.
FactorOOSResult
dataclass
¶
Results from rolling block OOS factor validation.
Attributes¶
per_fold_ic : pd.DataFrame
n_folds × factors matrix of mean IC per fold per factor.
per_fold_spread : pd.DataFrame
n_folds × factors matrix of mean quintile spread per fold.
mean_oos_ic : pd.Series
Mean OOS IC aggregated across folds (one value per factor).
mean_oos_icir : pd.Series
OOS ICIR (mean IC / std IC across folds) per factor.
n_folds : int
Number of folds generated.
CorrectedPValues
dataclass
¶
Multiple-testing corrected p-values.
Attributes¶
holm : ndarray
Holm-Bonferroni adjusted p-values (controls FWER).
bh : ndarray
Benjamini-Hochberg adjusted p-values (controls FDR).
FactorValidationReport
dataclass
¶
Complete validation report for all factors.
ICResult
dataclass
¶
Information coefficient analysis results for a single factor.
ICStats
dataclass
¶
Full IC statistics for a single factor including Newey-West inference.
Attributes¶
mean : float
Mean IC over the evaluation period.
variance_nw : float
Newey-West HAC variance of the IC series.
t_stat_nw : float
Newey-West adjusted t-statistic: IC_mean / sqrt(Var_NW / T).
p_value : float
Two-tailed p-value derived from the Newey-West t-statistic.
icir : float
Information Coefficient Information Ratio: mean(IC) / std(IC).
QuantileSpreadResult
dataclass
¶
Quantile spread analysis results for a single factor.
compute_gross_alpha(net_alpha, avg_turnover, cost_bps=10.0)
¶
Compute gross alpha by adding back estimated transaction costs.
Formula::
gross = net_alpha + avg_turnover * cost_bps / 10_000
Parameters¶
net_alpha : float
Net alpha after transaction costs (annualised).
avg_turnover : float
Average one-way turnover (e.g. 0.5 means 50% of portfolio
traded per period).
cost_bps : float
One-way transaction cost in basis points.
Returns¶
float
Gross alpha before transaction costs.
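The formula is simple enough to restate as a runnable sketch (an illustrative re-implementation of the documented formula, not the library source):

```python
def compute_gross_alpha(net_alpha: float, avg_turnover: float, cost_bps: float = 10.0) -> float:
    """Add estimated transaction costs back onto an annualised net alpha."""
    return net_alpha + avg_turnover * cost_bps / 10_000

# 3% net alpha, 50% average one-way turnover, 10 bps one-way cost:
gross = compute_gross_alpha(0.03, 0.5, cost_bps=10.0)  # 0.03 + 0.0005 = 0.0305
```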
factor_scores_to_expected_returns(scores, betas, factor_premiums, risk_free_rate=0.0)
¶
Convert factor Z-scores to expected returns via linear model.
Implements the formula::
E[r_i] = r_f + λ_mkt · β_i + Σ_g λ_g · z_{i,g}
where λ_mkt is read from factor_premiums["market"] and each
λ_g is read from factor_premiums[g] for factor group g.
Parameters¶
scores : pd.DataFrame
Assets × factor-groups matrix of standardised Z-scores. Rows are
ticker symbols; columns are factor group names (e.g. "value",
"momentum").
betas : pd.Series
Market (CAPM) beta per asset, indexed by ticker. Assets missing
from this Series are treated as having a beta of 1.0 (market
neutral assumption).
factor_premiums : dict[str, float]
Mapping of premium label → annualised premium (e.g.
{"market": 0.05, "value": 0.03, "momentum": 0.04}). The
reserved "market" key provides λ_mkt; all other keys are
matched against columns in scores.
risk_free_rate : float, default 0.0
Annualised risk-free rate r_f.
Returns¶
pd.Series
Annualised expected return per ticker, indexed by scores.index.
Examples¶
>>> import pandas as pd
>>> scores = pd.DataFrame(
...     {"value": [1.0, -1.0], "momentum": [0.5, 0.0]},
...     index=["AAPL", "MSFT"],
... )
>>> betas = pd.Series({"AAPL": 1.2, "MSFT": 0.8})
>>> factor_premiums = {"market": 0.05, "value": 0.03, "momentum": 0.04}
>>> factor_scores_to_expected_returns(scores, betas, factor_premiums, 0.02)
AAPL    0.132
MSFT    0.018
dtype: float64
align_to_pit(data, period_date_col, as_of_date, lag_days, ticker_col='ticker')
¶
Filter time-series data to records published before as_of_date.
A record with period end date D is considered published
lag_days calendar days after D. A record is available as of
as_of_date only when D + lag_days <= as_of_date, equivalently
when D <= as_of_date - lag_days.
For each ticker, the most recent record satisfying the availability
constraint is returned so that callers receive a cross-sectional view
as of as_of_date.
Parameters¶
data : pd.DataFrame
Time-series data containing period_date_col and optionally
ticker_col.
period_date_col : str
Name of the column holding the period end date.
as_of_date : pd.Timestamp or str
The computation date. Only records available on or before this
date (after the lag has elapsed) are returned.
lag_days : int
Calendar days between period end and data availability.
ticker_col : str
Column holding the ticker identifier. Defaults to "ticker".
Returns¶
pd.DataFrame
Cross-sectional view: one row per ticker (the most recent
available record), indexed by ticker_col when present.
Returns an empty DataFrame with the same columns if no records
pass the cutoff.
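The availability rule can be sketched with plain pandas; this is a minimal re-implementation of the documented behaviour, and the column names and EPS figures below are purely illustrative:

```python
import pandas as pd

def align_to_pit(data, period_date_col, as_of_date, lag_days, ticker_col="ticker"):
    # A record with period end D is available only when D + lag_days <= as_of_date,
    # equivalently D <= as_of_date - lag_days.
    cutoff = pd.Timestamp(as_of_date) - pd.Timedelta(days=lag_days)
    available = data[data[period_date_col] <= cutoff]
    if ticker_col in available.columns:
        # Keep the most recent available record per ticker.
        available = (
            available.sort_values(period_date_col)
            .groupby(ticker_col)
            .tail(1)
            .set_index(ticker_col)
        )
    return available

# Quarterly fundamentals with a 45-day publication lag (illustrative data):
fundamentals = pd.DataFrame(
    {
        "ticker": ["AAPL", "AAPL", "MSFT"],
        "period_end": pd.to_datetime(["2023-12-31", "2024-03-31", "2024-03-31"]),
        "eps": [2.18, 1.53, 2.94],
    }
)
# On 2024-05-01 the Q1 records (available from 2024-05-15) are not yet visible.
view = align_to_pit(fundamentals, "period_end", "2024-05-01", lag_days=45)
```

Only the 2023-12-31 AAPL record survives the cutoff; MSFT has no available record yet and is absent from the cross-sectional view.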
compute_all_factors(fundamentals, price_history, volume_history=None, analyst_data=None, insider_data=None, config=None)
¶
Compute all configured factors.
Parameters¶
fundamentals : pd.DataFrame
Cross-sectional data indexed by ticker.
price_history : pd.DataFrame
Price matrix (dates x tickers).
volume_history : pd.DataFrame or None
Volume matrix.
analyst_data : pd.DataFrame or None
Analyst recommendation data.
insider_data : pd.DataFrame or None
Insider transaction data.
config : FactorConstructionConfig or None
Construction parameters.
Returns¶
pd.DataFrame
Tickers x factors matrix.
compute_factor(factor_type, fundamentals, price_history, volume_history=None, analyst_data=None, insider_data=None, config=None)
¶
Compute a single factor.
Parameters¶
factor_type : FactorType
Which factor to compute.
fundamentals : pd.DataFrame
Cross-sectional data indexed by ticker.
price_history : pd.DataFrame
Price matrix (dates x tickers).
volume_history : pd.DataFrame or None
Volume matrix (dates x tickers).
analyst_data : pd.DataFrame or None
Analyst recommendation data.
insider_data : pd.DataFrame or None
Insider transaction data.
config : FactorConstructionConfig or None
Construction parameters.
Returns¶
pd.Series
Factor values indexed by ticker.
check_survivorship_bias(returns, final_periods=12, zero_threshold=1e-10)
¶
Check for potential survivorship bias in a return panel.
Survivorship bias occurs when delisted or failed assets are excluded
from the sample. A simple heuristic: if no asset has near-zero
returns in the final final_periods rows (i.e., no asset appears
to have stopped trading), the panel may suffer from survivorship
bias.
Parameters¶
returns : pd.DataFrame
Dates × assets return matrix.
final_periods : int
Number of trailing periods to inspect.
zero_threshold : float
Absolute threshold below which a return is considered "zero".
Returns¶
bool
True if survivorship bias is suspected, False otherwise.
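The heuristic can be sketched as follows, reading "stopped trading" as an all-zero return run over the trailing window (one plausible interpretation, not the library implementation):

```python
import numpy as np
import pandas as pd

def check_survivorship_bias(returns: pd.DataFrame, final_periods: int = 12,
                            zero_threshold: float = 1e-10) -> bool:
    tail = returns.tail(final_periods)
    # An asset that stopped trading shows a run of (near-)zero returns at the end.
    any_dead = (tail.abs() < zero_threshold).all(axis=0).any()
    # If no asset looks dead, delistings may have been filtered out of the panel.
    return not any_dead

rng = np.random.default_rng(0)
live = pd.DataFrame(rng.normal(0.0, 0.01, size=(24, 3)), columns=["A", "B", "C"])
dead = live.copy()
dead.loc[dead.index[-12:], "C"] = 0.0  # asset C stops trading
suspected_live = check_survivorship_bias(live)  # all assets trade to the end
suspected_dead = check_survivorship_bias(dead)  # a delisted asset is present
```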
compute_factor_pca(scores, n_components=None)
¶
Compute PCA on a cross-sectional factor score matrix.
Rows with any NaN are dropped before fitting. Scores are standardised (zero mean, unit variance per factor) so that PCA operates on the correlation structure rather than the covariance structure.
Parameters¶
scores : pd.DataFrame
Tickers × factors matrix of factor scores. Columns are factor
names; rows are asset observations.
n_components : int or None, default None
Number of principal components to retain. None keeps all
components (min(n_samples, n_features)).
Returns¶
FactorPCAResult
See :class:FactorPCAResult for field descriptions.
Raises¶
ValueError
If fewer than 2 factors or fewer than 2 observations are available
after dropping NaN rows.
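A sketch of the procedure using scikit-learn, returning a plain tuple rather than the FactorPCAResult dataclass (the synthetic data builds one highly correlated factor pair plus one independent factor):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

def factor_pca(scores: pd.DataFrame, n_components=None):
    clean = scores.dropna()
    if clean.shape[0] < 2 or clean.shape[1] < 2:
        raise ValueError("need at least 2 observations and 2 factors after dropping NaNs")
    # Standardise each factor so PCA operates on the correlation structure.
    z = (clean - clean.mean()) / clean.std(ddof=0)
    pca = PCA(n_components=n_components).fit(z.values)
    evr = pca.explained_variance_ratio_
    # Smallest number of components with cumulative explained variance >= 0.95.
    n95 = int(np.searchsorted(np.cumsum(evr), 0.95) + 1)
    loadings = pd.DataFrame(
        pca.components_.T,
        index=clean.columns,
        columns=[f"PC{i + 1}" for i in range(len(evr))],
    )
    return evr, loadings, n95

rng = np.random.default_rng(42)
base = rng.normal(size=200)
scores = pd.DataFrame(
    {
        "value": base + 0.1 * rng.normal(size=200),    # highly correlated pair
        "quality": base + 0.1 * rng.normal(size=200),
        "momentum": rng.normal(size=200),              # independent factor
    }
)
evr, loadings, n95 = factor_pca(scores)
```

With one correlated pair absorbed by PC1, two components suffice to pass the 95% threshold.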
flag_redundant_factors(scores, vif_threshold=10.0)
¶
Return factor names whose VIF exceeds vif_threshold.
A VIF above the threshold indicates that the factor's variance is largely explained by the remaining factors, making it a candidate for merging or removal from the composite score.
Parameters¶
scores : pd.DataFrame
Tickers × factors matrix of factor scores. Must contain at
least 2 factor columns.
vif_threshold : float, default 10.0
VIF cutoff above which a factor is considered redundant.
Commonly used values: 5 (conservative) or 10 (standard).
Returns¶
list[str]
Factor names with VIF > vif_threshold, in the order they
appear in scores.columns. Empty list if none exceed the
threshold.
Raises¶
ValueError
Propagated from :func:compute_vif if fewer than 2 factors
are provided.
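The VIF screen can be sketched as follows, where each factor's VIF is 1 / (1 − R²) from regressing it on the remaining factors (an illustrative re-implementation of both functions):

```python
import numpy as np
import pandas as pd

def compute_vif(scores: pd.DataFrame) -> pd.Series:
    if scores.shape[1] < 2:
        raise ValueError("VIF needs at least 2 factor columns")
    X = scores.dropna().values
    vifs = {}
    for j, name in enumerate(scores.columns):
        y = X[:, j]
        # Regress factor j on all remaining factors (with intercept).
        A = np.column_stack([np.ones(X.shape[0]), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        r2 = 1.0 - (y - A @ beta).var() / y.var()
        vifs[name] = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return pd.Series(vifs)

def flag_redundant_factors(scores: pd.DataFrame, vif_threshold: float = 10.0) -> list:
    vif = compute_vif(scores)
    return [name for name in scores.columns if vif[name] > vif_threshold]

rng = np.random.default_rng(1)
a, b = rng.normal(size=300), rng.normal(size=300)
scores = pd.DataFrame(
    {"value": a, "quality": a + 0.05 * rng.normal(size=300), "momentum": b}
)
flagged = flag_redundant_factors(scores)  # the near-duplicate pair is flagged
```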
build_factor_bl_views(factor_scores, factor_premia, selected_tickers)
¶
Generate Black-Litterman views from factor scores.
Creates relative views: top-scored assets outperform bottom-scored by the factor premium.
Parameters¶
factor_scores : pd.DataFrame
Tickers x factors matrix of standardized scores.
factor_premia : dict[str, float]
Expected premium per factor.
selected_tickers : pd.Index
Tickers in the portfolio.
Returns¶
tuple[list[tuple[str, ...]], list[float]]
(views, confidences) for Black-Litterman.
build_factor_exposure_constraints(factor_scores, bounds)
¶
Build enforceable linear factor exposure constraints.
For each factor g, the constraint enforces::
lb_g <= sum_i w_i * z_{i,g} <= ub_g
The result is expressed as left_inequality @ w <= right_inequality
(two rows per factor) and can be passed directly to
:class:skfolio.optimization.MeanRisk via its
left_inequality / right_inequality constructor arguments.
Parameters¶
factor_scores : pd.DataFrame
Tickers x factors matrix of standardised factor scores.
The tickers must match the assets used in the optimizer fit.
bounds : tuple[float, float] or dict[str, tuple[float, float]]
Exposure bounds applied to every factor (uniform) when given as a
single (lower, upper) tuple, or per-factor bounds when given as
a dict mapping factor name → (lower, upper).
Returns¶
FactorExposureConstraints
Dataclass holding left_inequality, right_inequality, and
metadata. Pass left_inequality and right_inequality as
keyword arguments to the optimizer.
Warns¶
UserWarning
If the equal-weight portfolio exposure lies outside [lb, ub]
for any factor (i.e. the constraint may be infeasible or very
tight under a balanced allocation).
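The two-rows-per-factor encoding can be sketched directly with NumPy (illustrative: the library returns a FactorExposureConstraints dataclass rather than raw arrays, and also warns on likely-infeasible bounds):

```python
import numpy as np
import pandas as pd

def exposure_constraints(factor_scores: pd.DataFrame, bounds):
    """Two rows per factor: -z @ w <= -lb and z @ w <= ub."""
    names = list(factor_scores.columns)
    if isinstance(bounds, tuple):
        bounds = {name: bounds for name in names}  # uniform bounds for all factors
    rows, rhs = [], []
    for name in names:
        z = factor_scores[name].to_numpy(dtype=float)
        lb, ub = bounds[name]
        rows.append(-z)  # -z @ w <= -lb  encodes  z @ w >= lb
        rhs.append(-lb)
        rows.append(z)   #  z @ w <= ub
        rhs.append(ub)
    return np.vstack(rows), np.asarray(rhs)

scores = pd.DataFrame(
    {"value": [1.0, 0.0, -1.0], "momentum": [0.5, -0.5, 0.0]},
    index=["AAA", "BBB", "CCC"],
)
A, b = exposure_constraints(scores, (-0.5, 0.5))
w = np.full(3, 1 / 3)                      # equal-weight portfolio
feasible = bool(np.all(A @ w <= b + 1e-12))
```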
compute_net_alpha(ic_series, weights_history, cost_bps=10.0, annualisation=252)
¶
Compute factor net alpha after deducting turnover-based transaction costs.
Combines IC-based gross alpha with the turnover cost from a weights history to produce a single net performance metric::
gross_alpha = mean(IC) * sqrt(annualisation)
avg_turnover = mean one-way turnover across rebalancing dates
total_cost = avg_turnover * cost_bps / 10_000
net_alpha = gross_alpha - total_cost
net_icir = net_alpha / (std(IC) * sqrt(annualisation))
Parameters¶
ic_series : pd.Series
Time series of period information coefficients (Spearman or
Pearson rank correlation between factor scores and forward
returns), one value per rebalancing period.
weights_history : pd.DataFrame
Portfolio weights at each rebalancing date: rows = dates,
columns = assets. Turnover is computed between every pair of
consecutive rows.
cost_bps : float, default=10.0
Round-trip transaction cost in basis points.
annualisation : int, default=252
Number of periods per year (252 for daily, 12 for monthly).
Returns¶
NetAlphaResult
Dataclass with gross_alpha, avg_turnover, total_cost,
net_alpha, and net_icir.
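The four formulas above can be sketched end to end; note the sum |Δw| / 2 one-way turnover convention below is an assumption standing in for compute_turnover:

```python
import numpy as np
import pandas as pd

def net_alpha_from_ic(ic_series: pd.Series, weights_history: pd.DataFrame,
                      cost_bps: float = 10.0, annualisation: int = 252):
    gross_alpha = ic_series.mean() * np.sqrt(annualisation)
    # One-way turnover between consecutive rebalances (sum |Δw| / 2 convention).
    turnover = weights_history.diff().abs().sum(axis=1).iloc[1:] / 2.0
    avg_turnover = float(turnover.mean())
    total_cost = avg_turnover * cost_bps / 10_000
    net_alpha = gross_alpha - total_cost
    ic_vol = ic_series.std()
    net_icir = 0.0 if ic_vol == 0 else net_alpha / (ic_vol * np.sqrt(annualisation))
    return gross_alpha, avg_turnover, total_cost, net_alpha, net_icir

ic = pd.Series([0.05, 0.03, 0.04, 0.04])
weights = pd.DataFrame(
    [[0.5, 0.5], [0.6, 0.4], [0.5, 0.5]], columns=["AAA", "BBB"]
)
gross, avg_to, cost, net, icir = net_alpha_from_ic(ic, weights)
```

Each rebalance moves 10% of the book one way, so the cost drag is 0.1 × 10 bps = 1 bp per year under these inputs.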
estimate_factor_premia(factor_mimicking_returns)
¶
Estimate factor premia from factor-mimicking portfolio return series.
build_factor_mimicking_portfolios(scores, returns, quantile=0.3, weighting='equal', beta_neutral=False, market_returns=None)
¶
Build long-short factor-mimicking portfolio return time series.
For each date the top quantile fraction of assets (by factor score) are held long and the bottom quantile fraction are held short. The long-short return is the equal- or value-weighted long leg minus the corresponding short leg.
The function handles one factor at a time: scores is a dates × assets DataFrame encoding cross-sectional scores for a single factor. For multiple factors, call once per factor and concatenate the results::
factor_returns = pd.concat(
[
build_factor_mimicking_portfolios(scores_value, returns)
.rename(columns={"factor_return": "value"}),
build_factor_mimicking_portfolios(scores_mom, returns)
.rename(columns={"factor_return": "momentum"}),
],
axis=1,
)
Parameters¶
scores : pd.DataFrame
Dates × assets matrix of cross-sectional factor scores.
Index = dates; columns = asset tickers.
returns : pd.DataFrame
Dates × assets matrix of asset returns, aligned with scores
on the date index. Columns may be a superset or subset of
scores columns; the intersection is used.
quantile : float, default 0.30
Fraction of the asset universe assigned to each leg. Must be
in (0, 0.5].
weighting : {"equal", "value"}, default "equal"
Weighting scheme within each leg.
"equal" — every asset in the leg receives the same weight.
"value" — assets are weighted by the absolute value of
their factor score.
beta_neutral : bool, default False
When True, hedge the long-short portfolio against market
beta exposure. The hedge ratio adjusts the short-leg weight
so that the portfolio beta is approximately zero.
market_returns : pd.Series or None
Market return series, required when beta_neutral=True.
Returns¶
pd.DataFrame
Dates × 1 DataFrame of long-short portfolio returns. Column
name is "factor_return". Index is the intersection of
scores and returns dates. Missing periods (fewer than
2 * k valid observations) are filled with NaN.
Raises¶
ValueError
If quantile is outside (0, 0.5] or weighting is unknown.
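A simplified equal-weight sketch of the long-short construction (no value weighting or beta hedging; the library's full version adds both):

```python
import numpy as np
import pandas as pd

def factor_mimicking_returns(scores: pd.DataFrame, returns: pd.DataFrame,
                             quantile: float = 0.3) -> pd.DataFrame:
    if not 0.0 < quantile <= 0.5:
        raise ValueError("quantile must be in (0, 0.5]")
    dates = scores.index.intersection(returns.index)
    cols = scores.columns.intersection(returns.columns)
    out = {}
    for date in dates:
        s = scores.loc[date, cols].dropna()
        k = max(1, int(len(s) * quantile))
        if len(s) < 2 * k:
            out[date] = np.nan  # too few names to fill both legs
            continue
        ranked = s.sort_values()
        long_leg = returns.loc[date, ranked.index[-k:]].mean()   # top scores
        short_leg = returns.loc[date, ranked.index[:k]].mean()   # bottom scores
        out[date] = long_leg - short_leg
    return pd.DataFrame({"factor_return": pd.Series(out)})

dates = pd.to_datetime(["2024-01-31", "2024-02-29"])
scores = pd.DataFrame([[2.0, 1.0, -1.0, -2.0]] * 2, index=dates, columns=list("ABCD"))
returns = pd.DataFrame(
    [[0.04, 0.01, 0.00, -0.02], [0.03, 0.00, 0.01, -0.01]],
    index=dates, columns=list("ABCD"),
)
fmp = factor_mimicking_returns(scores, returns)
```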
compute_cross_factor_correlation(factor_returns)
¶
Compute the Pearson correlation matrix across factor-mimicking portfolios.
Parameters¶
factor_returns : pd.DataFrame
Dates × factors DataFrame of long-short factor returns, as
returned by build_factor_mimicking_portfolios (possibly
concatenated across multiple factors).
Returns¶
pd.DataFrame
Factors × factors symmetric correlation matrix. Diagonal entries
are exactly 1.0. Computed on the rows where all factors have
non-NaN returns (pairwise-complete otherwise).
compute_quintile_spread(scores, returns, n_quantiles=5)
¶
Compute quintile portfolio returns and spread for a single factor.
At each date assets are ranked by factor score and split into n_quantiles equal-count buckets (Q1 = lowest scores, Qn = highest). Each bucket return is the equal-weight average of its members. The long-short spread is Qn − Q1.
Ties in scores are broken by rank order (method="first"), ensuring
every bucket is populated at every date.
Parameters¶
scores : pd.DataFrame
Dates × assets matrix of cross-sectional factor scores.
returns : pd.DataFrame
Dates × assets matrix of asset returns, aligned with scores.
n_quantiles : int, default 5
Number of equal-count buckets. 5 = quintiles, 10 = deciles.
Must be ≥ 2.
Returns¶
QuintileSpreadResult
See :class:QuintileSpreadResult for field descriptions.
Raises¶
ValueError
If n_quantiles < 2.
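The per-date bucketing can be sketched with rank(method="first") and qcut; this is a single-date helper illustrating the mechanics, not the full time-series function:

```python
import pandas as pd

def quantile_bucket_returns(scores_row: pd.Series, returns_row: pd.Series,
                            n_quantiles: int = 5) -> pd.Series:
    if n_quantiles < 2:
        raise ValueError("n_quantiles must be >= 2")
    # method="first" breaks ties by position, so every bucket stays populated.
    ranks = scores_row.rank(method="first")
    buckets = pd.qcut(ranks, n_quantiles, labels=False)  # 0 = lowest scores
    return returns_row.groupby(buckets).mean()           # equal-weight bucket means

scores = pd.Series([float(i) for i in range(1, 11)], index=[f"T{i}" for i in range(1, 11)])
returns = scores / 100.0
buckets = quantile_bucket_returns(scores, returns)
spread = buckets.iloc[-1] - buckets.iloc[0]  # Qn - Q1 long-short spread
```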
fit_gbt_composite(scores, forward_returns, max_depth=3, n_estimators=50)
¶
Fit a gradient-boosted tree model mapping factor scores to forward returns.
Parameters¶
scores : pd.DataFrame
Historical tickers x factors matrix (training observations).
forward_returns : pd.Series
Forward return per ticker for the training period.
max_depth : int
Maximum depth of individual regression trees (3–5 recommended
to limit extrapolation and retain interpretability).
n_estimators : int
Number of boosting rounds.
Returns¶
GradientBoostingRegressor
Fitted GBT model.
fit_ridge_composite(scores, forward_returns, alpha=1.0)
¶
Fit a ridge regression model mapping factor scores to forward returns.
Parameters¶
scores : pd.DataFrame
Historical tickers x factors matrix (training observations).
Must be aligned with forward_returns on the index.
forward_returns : pd.Series
Forward return per ticker for the training period.
alpha : float
L2 regularisation strength. A single-element array is passed to
RidgeCV so cross-validation still runs internally if multiple
alphas are desired; here we keep one alpha for determinism.
Returns¶
RidgeCV
Fitted ridge model. Call predict(scores) for new data.
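The ridge pipeline can be sketched as follows, pairing the single-alpha RidgeCV fit with z-score normalisation of the predictions; the training data below is synthetic and the functions are illustrative re-implementations:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import RidgeCV

def fit_ridge_composite(scores: pd.DataFrame, forward_returns: pd.Series, alpha: float = 1.0):
    # A single-element alpha grid keeps the fit deterministic.
    model = RidgeCV(alphas=[alpha])
    model.fit(scores.values, forward_returns.loc[scores.index].values)
    return model

def predict_composite_scores(model, scores: pd.DataFrame) -> pd.Series:
    raw = model.predict(scores.values)
    # Standardise so the output matches the z-score scale of the factor inputs.
    return pd.Series((raw - raw.mean()) / raw.std(), index=scores.index)

rng = np.random.default_rng(7)
train = pd.DataFrame(rng.normal(size=(100, 2)), columns=["value", "momentum"],
                     index=[f"T{i}" for i in range(100)])
fwd = pd.Series(0.02 * train["value"] + 0.01 * train["momentum"], index=train.index)
model = fit_ridge_composite(train, fwd)

current = pd.DataFrame(rng.normal(size=(20, 2)), columns=["value", "momentum"],
                       index=[f"N{i}" for i in range(20)])
composite = predict_composite_scores(model, current)
```

Since the synthetic target loads twice as heavily on value as on momentum, the fitted coefficients preserve that ordering.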
predict_composite_scores(model, scores)
¶
Apply a fitted ridge or GBT model to produce normalised composite scores.
The raw predictions are standardised to zero mean and unit variance so the output is on the same scale as z-score factor inputs.
Parameters¶
model : RidgeCV or GradientBoostingRegressor
A model returned by :func:fit_ridge_composite or
:func:fit_gbt_composite.
scores : pd.DataFrame
Current-period tickers x factors matrix.
Returns¶
pd.Series
Normalised composite score per ticker (zero mean, unit variance).
Tickers with all-NaN factor rows receive NaN.
run_factor_oos_validation(scores, returns, config=None, cpcv_config=None)
¶
Rolling block or CPCV out-of-sample validation of factor IC and spreads.
Parameters¶
scores : pd.DataFrame
Panel of standardised factor scores with a two-level row MultiIndex
(date, ticker) and one column per factor.
returns : pd.DataFrame
Forward returns panel with the same (date, ticker) MultiIndex
and a single return column.
config : FactorOOSConfig or None
Rolling window parameters. Defaults to FactorOOSConfig().
Ignored when cpcv_config is provided.
cpcv_config : CPCVConfig or None
When provided, uses combinatorial purged cross-validation
instead of rolling blocks. Overrides config.
Returns¶
FactorOOSResult
Per-fold and aggregate IC and quintile spread statistics.
Notes¶
The validation window computation uses only val-window dates; no
training-window data is used. Fold count equals
floor((total_months - train_months) / step_months) for rolling,
or C(n_folds, n_test_folds) for CPCV.
apply_regime_tilts(group_weights, regime, config=None)
¶
Apply regime-conditional multiplicative tilts to group weights.
Parameters¶
group_weights : dict[FactorGroupType, float]
Base group weights.
regime : MacroRegime
Current macro regime.
config : RegimeTiltConfig or None
Tilt configuration.
Returns¶
dict[FactorGroupType, float]
Tilted group weights (re-normalized to sum to original total).
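The tilt-and-renormalise step can be sketched with plain dicts. This uses a simplified signature (tilt factors passed directly rather than looked up from a RegimeTiltConfig), and the tilt values are illustrative:

```python
def apply_regime_tilts(group_weights: dict, tilts: dict) -> dict:
    # Multiply each group weight by its regime tilt factor (default 1.0),
    # then re-scale so the total weight mass is unchanged.
    tilted = {g: w * tilts.get(g, 1.0) for g, w in group_weights.items()}
    scale = sum(group_weights.values()) / sum(tilted.values())
    return {g: w * scale for g, w in tilted.items()}

base = {"value": 0.5, "momentum": 0.5}
recession_tilts = {"value": 0.8, "momentum": 1.2}  # illustrative tilt factors
tilted = apply_regime_tilts(base, recession_tilts)
```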
classify_regime(macro_data)
¶
Classify the current macro-economic regime.
Uses a simple heuristic based on GDP growth and leading indicators. The regime is determined by the latest observation's position relative to trend.
Parameters¶
macro_data : pd.DataFrame
Macro indicators with columns that may include
gdp_growth, leading_indicator, yield_spread,
unemployment_rate. Index is date.
Returns¶
MacroRegime
Current regime classification.
get_regime_tilts(regime, config=None)
¶
Return the group tilt mapping configured for the given macro regime.
compute_composite_score(standardized_factors, coverage, config=None, ic_history=None, training_scores=None, training_returns=None, group_weights=None)
¶
Compute composite score from standardized factors.
Parameters¶
standardized_factors : pd.DataFrame
Tickers x factors matrix.
coverage : pd.DataFrame
Boolean coverage matrix.
config : CompositeScoringConfig or None
Scoring configuration.
ic_history : pd.DataFrame or None
Required when config.method is IC_WEIGHTED or
ICIR_WEIGHTED. Columns must match group names; each column
is treated as the IC time series for that group.
training_scores : pd.DataFrame or None
Required when config.method is RIDGE_WEIGHTED or
GBT_WEIGHTED. Historical tickers x factors matrix used to
train the ML model (must not overlap with current-period data).
training_returns : pd.Series or None
Required when config.method is RIDGE_WEIGHTED or
GBT_WEIGHTED. Forward returns aligned with training_scores.
group_weights : dict[str, float] or None
Pre-computed group weights (e.g. from regime tilts). Threaded
through to the inner scoring functions.
Returns¶
pd.Series
Composite score per ticker.
compute_equal_weight_composite(group_scores, config=None, group_weights=None)
¶
Equal-weight composite with core/supplementary tiering.
Parameters¶
group_scores : pd.DataFrame
Tickers x groups matrix.
config : CompositeScoringConfig or None
Scoring configuration.
group_weights : dict[str, float] or None
Pre-computed group weights (e.g. from regime tilts). When
provided, skip tier-based derivation and use these weights
directly.
Returns¶
pd.Series
Composite score per ticker.
compute_group_scores(standardized_factors, coverage)
¶
Aggregate standardized factor scores into factor-group scores.
compute_ic_weighted_composite(group_scores, ic_history, config=None, group_weights=None)
¶
IC-weighted composite score.
Uses trailing information coefficient history to weight groups.
Parameters¶
group_scores : pd.DataFrame
Tickers x groups matrix.
ic_history : pd.DataFrame
Periods x groups matrix of IC values.
config : CompositeScoringConfig or None
Scoring configuration.
group_weights : dict[str, float] or None
Pre-computed group weights (e.g. from regime tilts). When
provided, use as tier multipliers instead of config
core/supplementary weights.
Returns¶
pd.Series
Composite score per ticker.
compute_icir_weighted_composite(group_scores, ic_series_per_group, config=None, group_weights=None)
¶
ICIR-weighted composite score.
Weights each group by |ICIR| = |mean(IC) / std(IC)|, normalised
to sum to 1. Groups with zero or undefined ICIR receive zero weight.
Falls back to equal-weight when all groups have ICIR = 0.
Parameters¶
group_scores : pd.DataFrame
Tickers x groups matrix.
ic_series_per_group : dict[str, pd.Series]
Per-group IC time series. Keys must match group_scores columns.
config : CompositeScoringConfig or None
Scoring configuration.
group_weights : dict[str, float] or None
Pre-computed group weights (e.g. from regime tilts). When provided,
use as tier multipliers instead of config core/supplementary weights.
Returns¶
pd.Series
Composite score per ticker.
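The weighting scheme can be sketched as follows (a simplified version that ignores the config/group_weights tier multipliers):

```python
import pandas as pd

def icir_weighted_composite(group_scores: pd.DataFrame, ic_series_per_group: dict) -> pd.Series:
    icirs = {}
    for name in group_scores.columns:
        ic = ic_series_per_group[name]
        std = ic.std()
        # Zero or undefined ICIR (std == 0 or NaN) receives zero weight.
        icirs[name] = 0.0 if not std > 0 else abs(ic.mean() / std)
    total = sum(icirs.values())
    if total == 0:
        weights = {name: 1.0 / len(icirs) for name in icirs}  # equal-weight fallback
    else:
        weights = {name: v / total for name, v in icirs.items()}
    return sum(group_scores[name] * w for name, w in weights.items())

idx = ["AAPL", "MSFT", "GOOG"]
groups = pd.DataFrame({"value": [1.0, 0.0, -1.0], "momentum": [0.0, 1.0, -1.0]}, index=idx)
ic_hist = {
    "value": pd.Series([0.05, 0.05, 0.05, 0.06]),       # stable IC -> high ICIR
    "momentum": pd.Series([0.10, -0.08, 0.09, -0.07]),  # volatile IC -> low ICIR
}
composite = icir_weighted_composite(groups, ic_hist)
```

The stable value IC dominates the weights, so the composite is driven almost entirely by the value scores.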
compute_ml_composite(standardized_factors, training_scores, training_returns, config)
¶
ML composite score using ridge regression or gradient-boosted trees.
Trains the model on historical (training_scores, training_returns)
and predicts on the current-period standardized_factors. The
prediction is normalised to zero mean and unit variance.
The training window must end strictly before the prediction date to avoid look-ahead bias; callers are responsible for this temporal split.
Parameters¶
standardized_factors : pd.DataFrame
Current-period tickers x factors matrix (prediction target).
training_scores : pd.DataFrame
Historical tickers x factors matrix aligned with
training_returns.
training_returns : pd.Series
Forward return per ticker for the training period.
config : CompositeScoringConfig
Must have method set to RIDGE_WEIGHTED or GBT_WEIGHTED.
Returns¶
pd.Series
Normalised composite score per ticker (zero mean, unit variance).
apply_sector_balance(selected, scores, sector_labels, parent_universe, tolerance=0.05)
¶
Adjust selection for sector-proportional representation.
Ensures no sector is over- or under-represented relative to
the parent universe by more than tolerance.
Parameters¶
selected : pd.Index
Initially selected tickers.
scores : pd.Series
Composite scores for all candidates.
sector_labels : pd.Series
Sector label per ticker.
parent_universe : pd.Index
Full universe for computing target sector weights.
tolerance : float
Maximum deviation from parent sector weights.
Returns¶
pd.Index
Sector-balanced selection.
compute_selection_turnover(current, new, universe)
¶
Compute selection turnover between the current and new member sets.
select_fixed_count(scores, target_count, buffer_fraction=0.1, current_members=None)
¶
Select top N stocks by composite score with buffer.
Parameters¶
scores : pd.Series Composite scores indexed by ticker. target_count : int Target number of stocks. buffer_fraction : float Buffer as a fraction of target_count. Current members within the buffer zone are retained. current_members : pd.Index or None Tickers currently selected.
Returns¶
pd.Index
Selected tickers.
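A minimal sketch of the buffered rule, assuming the buffer is applied as a rank cutoff at `target_count * (1 + buffer_fraction)` (the module's exact tie-breaking may differ): incumbents ranked inside the buffer zone keep their seats, and the remaining slots go to the highest-scoring names.

```python
import pandas as pd

def select_with_buffer(scores, target_count, buffer_fraction=0.1, current_members=None):
    """Top-N selection with a retention buffer to reduce turnover."""
    ranked = scores.sort_values(ascending=False).index
    if current_members is None:
        return pd.Index(ranked[:target_count])
    members = set(current_members)
    buffer_cutoff = int(target_count * (1 + buffer_fraction))
    # Incumbents inside the buffer zone are retained (best-ranked first).
    keep = [t for t in ranked[:buffer_cutoff] if t in members][:target_count]
    kept = set(keep)
    # Remaining slots go to the highest-scoring names not already kept.
    fill = [t for t in ranked if t not in kept][: target_count - len(keep)]
    return pd.Index(keep + fill)
```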
select_quantile(scores, target_quantile=0.8, exit_quantile=None, current_members=None)
¶
Select stocks above a quantile threshold.
Parameters¶
scores : pd.Series
Composite scores indexed by ticker.
target_quantile : float
Quantile threshold for entry (0-1).
exit_quantile : float or None
Quantile threshold for exit (hysteresis). If None,
uses target_quantile.
current_members : pd.Index or None
Currently selected tickers.
Returns¶
pd.Index
Selected tickers.
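The hysteresis mechanism can be sketched as two thresholds (hypothetical helper, not the module's implementation): non-members must clear the entry quantile, while incumbents only drop out once they fall below the lower exit quantile.

```python
import pandas as pd

def select_above_quantile(scores, target_quantile=0.8, exit_quantile=None,
                          current_members=None):
    """Quantile selection with entry/exit hysteresis."""
    entry = scores.quantile(target_quantile)
    exit_ = scores.quantile(exit_quantile if exit_quantile is not None
                            else target_quantile)
    members = set(current_members) if current_members is not None else set()
    # Enter above the entry threshold; incumbents survive above the exit threshold.
    selected = [t for t, s in scores.items()
                if s >= entry or (t in members and s >= exit_)]
    return pd.Index(selected)
```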
select_stocks(scores, config=None, current_members=None, sector_labels=None, parent_universe=None, return_turnover=False)
¶
Select stocks from scored universe.
Parameters¶
scores : pd.Series
Composite scores indexed by ticker.
config : SelectionConfig or None
Selection configuration.
current_members : pd.Index or None
Currently selected tickers for buffer/hysteresis.
sector_labels : pd.Series or None
Sector labels for sector balancing.
parent_universe : pd.Index or None
Full universe for sector weight targets.
return_turnover : bool
When True, return (selected, turnover) tuple.
Returns¶
pd.Index or tuple[pd.Index, float]
Selected tickers, optionally with turnover.
neutralize_sector(scores, sector_labels, country_labels=None)
¶
Demean scores within each sector (and optionally country).
Parameters¶
scores : pd.Series
Standardized factor scores.
sector_labels : pd.Series
Sector label per ticker.
country_labels : pd.Series or None
Country label per ticker for country neutralization.
Returns¶
pd.Series
Sector-neutralized scores.
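Sector demeaning is a one-line groupby transform; the sketch below covers the sector-only case (country neutralization would add a second grouping level).

```python
import pandas as pd

def demean_by_sector(scores, sector_labels):
    """Subtract each sector's mean so scores compare stocks within,
    not across, sectors."""
    return scores - scores.groupby(sector_labels).transform("mean")
```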
orthogonalize_factors(factor_scores, method='pca', min_variance_explained=0.95)
¶
Project factor scores onto orthogonal principal components.
Eliminates multicollinearity among factor scores by projecting
them into a lower-dimensional PCA space. Retains the minimum
number of components that explain at least min_variance_explained
of the total variance.
Parameters¶
factor_scores : pd.DataFrame
Tickers × factors matrix of factor scores.
method : str
Projection method. Only "pca" is supported.
min_variance_explained : float
Minimum cumulative explained variance ratio for retained
components. Must be in (0, 1].
Returns¶
pd.DataFrame
Tickers × PCs matrix with columns named PC1, PC2, ....
Rows containing NaN in the input are NaN in the output;
the original index is preserved.
Raises¶
ConfigurationError
If method is not "pca".
DataError
If fewer than 2 factors or fewer than 2 non-NaN observations.
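The PCA projection can be sketched via the SVD of the mean-centred factor matrix, assuming components are ranked by explained variance and NaN rows are passed through (details such as scaling conventions may differ from the module's implementation).

```python
import numpy as np
import pandas as pd

def pca_project(factor_scores, min_variance_explained=0.95):
    """Project factors onto the fewest principal components whose
    cumulative explained variance reaches the threshold."""
    X = factor_scores.dropna()
    centred = X - X.mean()
    U, s, _ = np.linalg.svd(centred.to_numpy(), full_matrices=False)
    var_ratio = (s ** 2) / (s ** 2).sum()
    # Smallest k with cumulative explained variance >= threshold
    k = int(np.searchsorted(np.cumsum(var_ratio), min_variance_explained) + 1)
    pcs = U[:, :k] * s[:k]
    cols = [f"PC{i + 1}" for i in range(k)]
    # Reindex restores NaN rows while preserving the original index
    return pd.DataFrame(pcs, index=X.index, columns=cols).reindex(factor_scores.index)
```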
rank_normal_standardize(scores)
¶
standardize_all_factors(raw_factors, config=None, sector_labels=None, country_labels=None)
¶
Standardize all factors and compute coverage.
Parameters¶
raw_factors : pd.DataFrame
Tickers x factors matrix of raw values.
config : StandardizationConfig or None
Standardization parameters.
sector_labels : pd.Series or None
Sector labels for neutralization.
country_labels : pd.Series or None
Country labels for neutralization.
Returns¶
tuple[pd.DataFrame, pd.DataFrame]
(standardized_scores, coverage) where coverage is a boolean DataFrame indicating non-NaN values.
standardize_factor(raw_scores, config=None, sector_labels=None, country_labels=None)
¶
Full standardization pipeline for a single factor.
Parameters¶
raw_scores : pd.Series
Raw factor values.
config : StandardizationConfig or None
Standardization parameters.
sector_labels : pd.Series or None
Sector labels for neutralization.
country_labels : pd.Series or None
Country labels for neutralization.
Returns¶
pd.Series
Standardized factor scores.
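A minimal sketch of such a pipeline, assuming it composes `winsorize_cross_section`-style clipping with a z-score (the module's full pipeline also supports neutralization and other standardization methods):

```python
import pandas as pd

def standardize(raw_scores, lower_pct=0.01, upper_pct=0.99):
    """Winsorize cross-sectional outliers, then z-score."""
    lo, hi = raw_scores.quantile([lower_pct, upper_pct])
    clipped = raw_scores.clip(lo, hi)
    return (clipped - clipped.mean()) / clipped.std(ddof=0)
```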
winsorize_cross_section(scores, lower_pct=0.01, upper_pct=0.99)
¶
z_score_standardize(scores)
¶
benjamini_hochberg(p_values, alpha=0.05)
¶
compute_ic_series(factor_scores_history, returns_history, factor_name)
¶
compute_ic_stats(ic_series, lags=5)
¶
Compute full IC statistics including Newey-West t-stat and ICIR.
Parameters¶
ic_series : pd.Series
Time series of IC values (one per cross-section date).
lags : int
Number of lags for Newey-West HAC standard errors.
Returns¶
ICStats
Dataclass containing mean, variance_nw, t_stat_nw,
p_value, and icir.
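The Newey-West t-statistic for mean(IC) can be sketched with Bartlett-kernel weights; this is a standard HAC construction and is assumed here (the module's lag convention or small-sample corrections may differ).

```python
import numpy as np

def newey_west_tstat(ic, n_lags=5):
    """t-statistic for mean(IC) with Newey-West HAC standard errors."""
    ic = np.asarray(ic, dtype=float)
    n = ic.size
    demeaned = ic - ic.mean()
    gamma0 = demeaned @ demeaned / n
    lrv = gamma0  # long-run variance estimate
    for lag in range(1, n_lags + 1):
        gamma = demeaned[lag:] @ demeaned[:-lag] / n
        # Bartlett kernel weight: 1 - lag / (n_lags + 1)
        lrv += 2.0 * (1.0 - lag / (n_lags + 1.0)) * gamma
    se = np.sqrt(lrv / n)  # HAC standard error of the mean
    return ic.mean() / se
```

With `n_lags=0` this reduces to the plain t-statistic; positive autocorrelation in the IC series inflates `lrv` and shrinks the t-stat accordingly.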
compute_icir(ic_series)
¶
Compute the IC Information Ratio (mean IC / std IC).
ICIR penalises factors with high average IC but also high IC volatility (inconsistent predictors). Use this as the weighting signal in ICIR-weighted composite scoring.
Parameters¶
ic_series : pd.Series
Time series of IC values (one per cross-section date).
Returns¶
float
ICIR value, or 0.0 if std(IC) == 0 or fewer than
2 non-NaN observations.
compute_monthly_ic(factor_scores, forward_returns)
¶
compute_newey_west_tstat(ic_series, n_lags=6)
¶
compute_quantile_spread(factor_scores, forward_returns, n_quantiles=5)
¶
compute_vif(factor_matrix)
¶
correct_pvalues(p_values, alpha=0.05)
¶
Apply Holm-Bonferroni and Benjamini-Hochberg multiple testing corrections.
Parameters¶
p_values : ndarray, shape (m,)
Raw p-values in any order.
alpha : float
Significance level used to compute the adjustments (does not filter
here; callers compare adjusted p-values against alpha).
Returns¶
CorrectedPValues
holm — FWER-controlling Holm-Bonferroni adjusted p-values.
bh — FDR-controlling Benjamini-Hochberg adjusted p-values.
Both arrays are returned in the same order as the input.
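Both adjustments are standard and can be sketched directly: Holm is a step-down pass enforcing monotone non-decreasing adjusted p-values, BH a step-up pass enforcing the reverse.

```python
import numpy as np

def holm_adjust(p):
    """Holm-Bonferroni step-down adjusted p-values (FWER control)."""
    p = np.asarray(p, dtype=float)
    m = p.size
    order = np.argsort(p)
    adj = np.empty(m)
    running_max = 0.0
    for rank, idx in enumerate(order):
        # Scale by the number of remaining hypotheses, keep monotone
        running_max = max(running_max, (m - rank) * p[idx])
        adj[idx] = min(running_max, 1.0)
    return adj

def bh_adjust(p):
    """Benjamini-Hochberg step-up adjusted p-values (FDR control)."""
    p = np.asarray(p, dtype=float)
    m = p.size
    order = np.argsort(p)
    adj = np.empty(m)
    running_min = 1.0
    for rank in range(m - 1, -1, -1):
        idx = order[rank]
        # Scale by m / rank, enforce monotonicity from the top down
        running_min = min(running_min, p[idx] * m / (rank + 1))
        adj[idx] = running_min
    return adj
```

Both return arrays in input order, matching the behaviour documented above.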
run_factor_validation(factor_scores_history, returns_history, config=None)
¶
Run complete factor validation suite.
Parameters¶
factor_scores_history : dict[str, pd.DataFrame]
Factor name -> (dates x tickers) score history.
returns_history : pd.DataFrame
Dates x tickers forward return matrix.
config : FactorValidationConfig or None
Validation parameters.
Returns¶
FactorValidationReport
Complete validation results.
validate_factor_universe(ic_matrix, lags=5, alpha=0.05)
¶
Validate all factors simultaneously with multiple testing correction.
Parameters¶
ic_matrix : pd.DataFrame
Dates × factors matrix of IC values (one IC per period per factor).
lags : int
Number of Newey-West HAC lags.
alpha : float
Significance level for both FWER and FDR rejection decisions.
Returns¶
pd.DataFrame
Factor × statistic summary with columns:
ic_mean, icir, t_stat_nw, p_value_raw,
p_value_holm, p_value_bh, significant_holm,
significant_bh.