factors¶
optimizer.factors
¶
Factor construction, scoring, and selection for stock pre-selection.
CompositeMethod
¶
Bases: str, Enum
Composite scoring method.
CompositeScoringConfig
dataclass
¶
Configuration for composite score construction.
Parameters¶
method : CompositeMethod
Equal-weight, IC-weighted, ICIR-weighted, ridge, or GBT composite.
ic_lookback : int
Number of periods for IC estimation when using IC weighting.
core_weight : float
Relative weight for core factor groups.
supplementary_weight : float
Relative weight for supplementary factor groups.
ridge_alpha : float
L2 regularisation strength for RIDGE_WEIGHTED. Passed as the
single candidate to RidgeCV; increase for more shrinkage.
gbt_max_depth : int
Maximum tree depth for GBT_WEIGHTED.
gbt_n_estimators : int
Number of boosting rounds for GBT_WEIGHTED.
gbt_random_state : int
Random state for GBT_WEIGHTED GradientBoostingRegressor.
Change for sensitivity analysis or ensemble diversity.
min_coverage_groups : int
Minimum number of non-NaN group scores required. Tickers with
fewer available groups receive NaN composite and are excluded from
selection. 0 disables the threshold (default).
return_coverage : bool
When True, compute_composite_score returns a DataFrame with
columns ["composite", "coverage_ratio"] instead of a Series.
ic_fallback_strategy : ICFallbackStrategy
Strategy when all IC/ICIR weights resolve to zero (all groups have
non-positive IC or ICIR). EQUAL_WEIGHT preserves the current
behavior. NAN returns all-NaN scores to suppress trading.
RAISE raises ConfigurationError. Default is EQUAL_WEIGHT.
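The IC-weighting and fallback behaviour described above can be sketched in a few lines. This is an illustrative stand-in, not the library's implementation: `ic_weighted_composite` is a hypothetical helper that zeroes non-positive group ICs and falls back to equal weights when the total weight collapses to zero (the `EQUAL_WEIGHT` strategy).

```python
import pandas as pd

def ic_weighted_composite(group_scores: pd.DataFrame, group_ic: pd.Series) -> pd.Series:
    """Illustrative IC-weighted composite with an equal-weight fallback.

    group_scores: tickers x groups matrix of standardised scores.
    group_ic: mean IC per group; non-positive ICs receive zero weight.
    """
    weights = group_ic.clip(lower=0.0)
    if weights.sum() == 0.0:
        # ICFallbackStrategy.EQUAL_WEIGHT: all ICs non-positive, use equal weights
        weights = pd.Series(1.0, index=group_ic.index)
    weights = weights / weights.sum()
    return group_scores.mul(weights, axis=1).sum(axis=1)

scores = pd.DataFrame(
    {"value": [1.0, -1.0], "momentum": [0.5, -0.5]},
    index=["AAPL", "MSFT"],
)
composite = ic_weighted_composite(scores, pd.Series({"value": 0.03, "momentum": 0.01}))
```

With ICs of 0.03 and 0.01 the normalised weights are 0.75/0.25; with all-negative ICs the same call degrades gracefully to an equal-weight average.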
for_equal_weight()
classmethod
¶
Equal-weight composite scoring.
for_ic_weighted()
classmethod
¶
IC-weighted composite scoring (raw IC magnitude).
for_icir_weighted()
classmethod
¶
ICIR-weighted composite scoring (mean IC / std IC).
Penalises inconsistent predictors by dividing mean IC by IC volatility before normalising weights.
for_ridge_weighted()
classmethod
¶
Ridge regression composite scoring.
Learns optimal linear factor weights from historical data with L2 regularisation, avoiding the need for IC proxies.
for_gbt_weighted()
classmethod
¶
Gradient-boosted tree composite scoring.
Captures non-linear factor interactions (e.g. high value + improving momentum = stronger combined signal).
for_ic_weighted_robust()
classmethod
¶
IC-weighted scoring with minimum coverage of 3 groups.
for_sparse_universe()
classmethod
¶
Equal-weight scoring with minimum coverage of 2 groups.
for_coverage_diagnostics()
classmethod
¶
Equal-weight scoring returning coverage_ratio alongside composite.
for_ic_weighted_raise_on_fallback()
classmethod
¶
IC-weighted scoring that raises if all groups have negative IC.
FactorBuildHealth
dataclass
¶
Diagnostic report from build_factor_scores_history().
Parameters¶
total_dates : int
Number of rebalancing dates attempted.
succeeded_dates : int
Number of dates for which factor computation succeeded.
failed_dates : int
Number of dates skipped due to errors.
failures : dict[str, str]
Mapping of ISO-date string to exception message for each failure.
min_success_fraction : float
Minimum fraction of succeeded/total required before FactorCoverageError is raised.
Reformatting note applied in place below.
FactorConstructionConfig
dataclass
¶
Configuration for factor computation.
Parameters¶
factors : tuple[FactorType, ...]
Which factors to compute.
momentum_lookback : int
Lookback window for momentum in trading days.
momentum_skip : int
Recent days to skip for momentum (reversal avoidance).
volatility_lookback : int
Lookback window for volatility in trading days.
beta_lookback : int
Lookback window for beta estimation in trading days.
amihud_lookback : int
Lookback window for Amihud illiquidity in trading days.
publication_lag : PublicationLagConfig
Per-source publication lags for point-in-time correctness.
Pass a plain int for a uniform lag across all sources
(backward-compatible; converted to :class:PublicationLagConfig
automatically).
FactorGroupType
¶
Bases: str, Enum
Factor group taxonomy.
FactorIntegrationConfig
dataclass
¶
Configuration for bridging factor scores to optimization.
Parameters¶
risk_free_rate : float
Annual risk-free rate for expected return mapping.
market_risk_premium : float
Annual equity risk premium.
score_premium : float
Annualized premium per unit of composite z-score.
use_black_litterman : bool
Whether to generate Black-Litterman views from factor scores.
view_confidence_cap : float
Maximum Idzorek confidence for BL views (0–1). At 1.0 the posterior
equals the view exactly, causing extreme concentration. Values
0.25–0.50 blend the view with the equilibrium prior.
max_weight : float
Maximum per-asset weight enforced on the optimizer when the
integration injects a BL prior. 0.0 disables the constraint.
exposure_lower_bound : float
Lower bound for factor exposure constraints.
exposure_upper_bound : float
Upper bound for factor exposure constraints.
FactorType
¶
Bases: str, Enum
Individual factor identifiers.
FactorValidationConfig
dataclass
¶
Configuration for factor validation and statistical testing.
Parameters¶
newey_west_lags : int
Number of lags for Newey-West t-statistic.
t_stat_threshold : float
Minimum absolute t-statistic for significance.
fdr_alpha : float
False discovery rate alpha level.
n_quantiles : int
Number of quantiles for spread analysis.
fmp_top_pct : float
Top percentile for factor-mimicking portfolios.
fmp_bottom_pct : float
Bottom percentile for factor-mimicking portfolios.
composite_min_observations : int
Minimum non-NaN observations per cross-section for composite IC.
Default: 24. Newey-West with 6 lags requires at least 13 observations
(2*lags+1); 24 provides two years of monthly IC for reliable
Spearman rank correlations.
min_ic_observations : int
Minimum non-NaN observations per cross-section date for per-factor
IC computation in run_factor_validation. Default: 24, matching
composite_min_observations so both paths apply consistent
minimum-data guards.
GroupICAggregationConfig
dataclass
¶
Configuration for group-level IC aggregation.
Controls how per-factor ICs are combined within each factor group.
Parameters¶
weighting : ICWeightingMethod
Method for weighting per-factor ICs within a group.
negative_filter : ICNegativeFilterPolicy
Policy for handling factors with consistently negative IC.
min_observations_tstat : int
Minimum IC observations to compute a valid t-stat.
Factors below this threshold fall back to equal weight
when weighting=TSTAT_WEIGHTED. Default: 24.
newey_west_lags : int
Number of lags for Newey-West HAC standard errors when
computing t-stat weights.
for_simple_mean()
classmethod
¶
Default: simple arithmetic mean, no filtering.
for_tstat_weighted()
classmethod
¶
Weight factor ICs by absolute Newey-West t-stat.
for_excluding_negative()
classmethod
¶
Exclude factors with consistently negative IC.
for_robust()
classmethod
¶
T-stat weighted with negative IC exclusion.
GroupWeight
¶
Bases: str, Enum
Weight tier for factor groups.
ICFallbackStrategy
¶
Bases: str, Enum
Strategy when all IC/ICIR weights are zero or negative.
Applied by IC-weighted and ICIR-weighted composite scoring when every factor group has non-positive IC or ICIR and the total weight sums to zero.
ICNegativeFilterPolicy
¶
Bases: str, Enum
Policy for handling factors with consistently negative IC.
INCLUDE keeps all factors regardless of IC sign (current default). EXCLUDE removes negative-IC factors before computing the group average (denominator shrinks). SOFT zeros the contribution of negative-IC factors but keeps them in the denominator (dampening).
ICWeightingMethod
¶
Bases: str, Enum
Method for aggregating per-factor ICs within a group.
SIMPLE_MEAN averages ICs equally (current default behaviour). TSTAT_WEIGHTED uses absolute Newey-West t-stat as weight so that factors with more statistically significant ICs dominate the group average.
MacroRegime
¶
Bases: str, Enum
Macro-economic regime classification.
PublicationLagConfig
dataclass
¶
Differentiated publication lags by data source type.
Each source has an independent delay between the period end date and the date the data is reliably available for use in factor construction. Using source-specific lags avoids look-ahead bias when aligning fundamental data to price dates.
Parameters¶
annual_days : int
Lag for annual financial statements (days after fiscal year end).
Default: 90 days (~3 months for 10-K filing).
quarterly_days : int
Lag for quarterly financial statements (days after quarter end).
Default: 45 days (~6 weeks for 10-Q filing).
analyst_days : int
Lag for analyst estimates and recommendations. Default: 5 days
(short dissemination buffer).
macro_days : int
Lag for macroeconomic indicators (release lag + revision lag).
Default: 63 days (~2 months).
uniform(days)
classmethod
¶
Create a config with the same lag applied to all sources.
RegimeThresholdConfig
dataclass
¶
Classification thresholds for the composite macro regime scorer.
All eight thresholds drive the {-1, 0, +1} component scores used by
:func:~optimizer.factors._regime.classify_regime_composite and the
research-layer scoring functions in research/_macro.py.
Parameters¶
hy_oas_risk_on : float
HY OAS level (bps) below which credit conditions are benign (+1).
Empirical basis: ~40th pctl of ICE BofA HY OAS 1997-2023.
hy_oas_risk_off : float
HY OAS level (bps) above which credit stress is elevated (-1).
Empirical basis: ~75th pctl of ICE BofA HY OAS historically.
pmi_expansion : float
ISM Manufacturing PMI above which growth is accelerating (+1).
2-point buffer above the 50 neutral line (Koenig 2002).
pmi_contraction : float
ISM Manufacturing PMI below which growth is contracting (-1).
Symmetric 2-point band around 50.
spread_2s10s_steep : float
10Y-2Y spread (percentage points) above which the curve is steep
(+1). 100 bps historically associated with early-cycle acceleration.
spread_2s10s_inversion : float
10Y-2Y spread (percentage points) at/below which the curve is
inverted (-1). Conventional inversion definition (Estrella & Mishkin 1998).
sentiment_positive : float
Normalized NLP sentiment score above which sentiment is positive (+1).
sentiment_negative : float
Normalized NLP sentiment score below which sentiment is negative (-1).
for_empirical()
classmethod
¶
Canonical empirical thresholds (Chapter 7 calibration).
for_rolling_percentile(hy_series=None, spread_series=None, pmi_series=None, sentiment_series=None, hy_risk_on_pct=0.4, hy_risk_off_pct=0.75, pmi_expansion_pct=0.6, pmi_contraction_pct=0.4, spread_steep_pct=0.65, sentiment_positive_pct=0.7)
classmethod
¶
Compute thresholds from trailing empirical distributions.
Pass historical Series for each indicator; thresholds are set at
the specified percentiles. Any None series falls back to the
hard-coded empirical default for that indicator.
RegimeTiltConfig
dataclass
¶
Configuration for macro regime factor tilts.
Per-regime multiplicative tilts stored as tuples of
(group_name, tilt_factor) for frozen-dataclass compatibility.
Parameters¶
enable : bool
Whether to apply regime tilts.
expansion_tilts : tuple[tuple[str, float], ...]
Group tilts during expansion.
slowdown_tilts : tuple[tuple[str, float], ...]
Group tilts during slowdown.
recession_tilts : tuple[tuple[str, float], ...]
Group tilts during recession.
recovery_tilts : tuple[tuple[str, float], ...]
Group tilts during recovery.
unknown_tilts : tuple[tuple[str, float], ...]
Group tilts when regime is unknown (neutral — all multipliers
default to 1.0 via empty tuple).
max_tilt_multiplier : float
Upper bound on any single raw tilt multiplier (default 2.0).
Multipliers exceeding this value are clamped before application.
Must be >= 1.0.
min_post_tilt_weight : float
Minimum weight any group may hold after tilting, expressed as a
fraction of the original total weight (default 0.05). Groups
suppressed below this floor are raised to it before
renormalization. Must be in [0.0, 1.0).
for_moderate_tilts()
classmethod
¶
Enable moderate regime-conditional tilts.
for_no_tilts()
classmethod
¶
Disable regime tilts (default).
for_strict_bounds()
classmethod
¶
Enable tilts with tight multiplier cap and weight floor.
Caps each raw tilt multiplier at 1.5x and prevents any group from falling below 10% of the total weight. Suitable for mandates requiring diversification guarantees.
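The clamp-then-floor-then-renormalize pipeline that `max_tilt_multiplier` and `min_post_tilt_weight` describe can be sketched as below. `apply_regime_tilts` is a hypothetical helper illustrating the documented bounds, not the module's own function:

```python
def apply_regime_tilts(weights: dict, tilts: dict,
                       max_tilt_multiplier: float = 2.0,
                       min_post_tilt_weight: float = 0.05) -> dict:
    """Sketch: clamp raw multipliers, floor suppressed groups, renormalize."""
    total = sum(weights.values())
    # 1. Apply tilts, clamping each raw multiplier at max_tilt_multiplier
    tilted = {
        g: w * min(tilts.get(g, 1.0), max_tilt_multiplier)
        for g, w in weights.items()
    }
    # 2. Floor each group at a fraction of the original total weight
    floor = min_post_tilt_weight * total
    tilted = {g: max(w, floor) for g, w in tilted.items()}
    # 3. Renormalize back to the original total
    scale = total / sum(tilted.values())
    return {g: w * scale for g, w in tilted.items()}

out = apply_regime_tilts({"value": 0.5, "momentum": 0.5},
                         {"value": 3.0, "momentum": 0.02})
```

Here the 3.0x tilt is clamped to 2.0x and the nearly-suppressed momentum group is lifted to the 5% floor before the weights are rescaled to sum to the original total.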
SelectionConfig
dataclass
¶
Configuration for stock selection from scored universe.
Parameters¶
method : SelectionMethod
Fixed-count or quantile-based selection.
target_count : int
Number of stocks to select (for FIXED_COUNT).
target_quantile : float
Quantile threshold for selection (for QUANTILE, 0-1).
exit_quantile : float
Exit quantile for hysteresis (for QUANTILE).
buffer_fraction : float
Buffer zone fraction around selection boundary.
sector_balance : bool
Whether to enforce sector-proportional representation.
sector_tolerance : float
Maximum deviation from parent universe sector weights (fraction,
0–1). Default 0.05 (5 pp) matches MSCI, S&P DJI, and FTSE Russell
factor-index methodology. Use for_low_tracking_error() for a
tighter 3% band suited to institutional low-active-risk mandates.
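The entry/exit hysteresis described by target_quantile and exit_quantile can be sketched as follows. This is an illustrative helper (not the library's selector), and it assumes pandas' default linear quantile interpolation; the buffer-fraction logic is omitted for brevity:

```python
import pandas as pd

def select_with_hysteresis(scores: pd.Series, held: set,
                           target_quantile: float = 0.8,
                           exit_quantile: float = 0.6) -> set:
    """Sketch: new names must clear the entry cut; incumbents stay
    until they fall below the looser exit cut (reduces turnover)."""
    entry_cut = scores.quantile(target_quantile)
    exit_cut = scores.quantile(exit_quantile)
    selected = set()
    for ticker, score in scores.items():
        if score >= entry_cut or (ticker in held and score >= exit_cut):
            selected.add(ticker)
    return selected

scores = pd.Series({"A": 0.9, "B": 0.7, "C": 0.5, "D": 0.3, "E": 0.1})
```

An incumbent such as "B" survives a score slightly below the entry threshold, while a new name with the same score would not be admitted.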
for_top_100()
classmethod
¶
Select top 100 stocks by composite score.
for_top_quintile()
classmethod
¶
Select top quintile by composite score.
for_top_20()
classmethod
¶
Select top 20 stocks — concentrated diversified portfolio.
Uses relaxed sector tolerance (10%) because at 20 stocks each addition/removal changes sector weight by ~5%. Buffer of 3 stocks (15%) reduces unnecessary turnover.
for_concentrated()
classmethod
¶
Concentrated portfolio of top 30 stocks.
for_low_tracking_error()
classmethod
¶
Top 100 stocks with tighter sector tolerance for low tracking error.
Uses a 3% sector deviation cap (vs. the standard 5%) to more closely replicate the sector composition of the parent benchmark, matching the tighter band used by institutional index providers (e.g., MSCI Minimum Volatility) when minimising active sector bets is a mandate.
SelectionMethod
¶
Bases: str, Enum
Stock selection method.
StandardizationConfig
dataclass
¶
Configuration for cross-sectional factor standardization.
Parameters¶
method : StandardizationMethod
Z-score or rank-normal standardization. Default is RANK_NORMAL
following MSCI Barra USE4 and Gu/Kelly/Xiu (2020) best practice for
heavy-tailed financial factor distributions.
winsorize_method : WinsorizeMethod
Outlier treatment method. PERCENTILE clips at fixed quantiles;
MAD clips at median +/- k * 1.4826 * MAD.
winsorize_lower : float
Lower percentile for winsorization (0-1, used with PERCENTILE).
winsorize_upper : float
Upper percentile for winsorization (0-1, used with PERCENTILE).
neutralize_sector : bool
Whether to sector-neutralize scores.
neutralize_country : bool
Whether to country-neutralize scores.
factor_method_overrides : tuple[tuple[str, str], ...]
Per-factor standardization method overrides as
(factor_name, method_value) pairs. When non-empty, each factor
is standardized with its assigned method; factors not in the map
fall back to method.
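A minimal sketch of the two standardization methods and the per-factor override fallback, using only the standard library's `NormalDist` for the rank-to-normal mapping. The helper names and the `(rank - 0.5) / n` offset are illustrative assumptions, not the module's exact recipe:

```python
import pandas as pd
from statistics import NormalDist

def rank_normal(series: pd.Series) -> pd.Series:
    """Map cross-sectional ranks to standard-normal quantiles."""
    n = series.notna().sum()
    ranks = series.rank(method="average")
    probs = (ranks - 0.5) / n  # offset keeps probabilities strictly in (0, 1)
    return probs.map(lambda p: NormalDist().inv_cdf(p) if pd.notna(p) else float("nan"))

def z_score(series: pd.Series) -> pd.Series:
    return (series - series.mean()) / series.std()

def standardize(scores: pd.DataFrame, method: str = "rank_normal",
                overrides=()) -> pd.DataFrame:
    """Factors in the override map use their assigned method;
    the rest fall back to the default, mirroring factor_method_overrides."""
    override_map = dict(overrides)
    funcs = {"rank_normal": rank_normal, "z_score": z_score}
    return pd.DataFrame({
        col: funcs[override_map.get(col, method)](scores[col])
        for col in scores.columns
    })

scores = pd.DataFrame({"value": [10.0, 1.0, 100.0], "momentum": [0.1, 0.2, 0.3]})
out = standardize(scores, overrides=[("momentum", "z_score")])
```

The heavy-tailed value column is rank-normalised (its outlier maps to a bounded quantile) while the approximately normal momentum column keeps its z-scores.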
for_heavy_tailed()
classmethod
¶
Rank-normal for heavy-tailed distributions (e.g. value ratios).
for_normal()
classmethod
¶
Z-score for approximately normal factors (e.g. momentum).
for_z_score()
classmethod
¶
Z-score standardization (backward-compatibility alias).
for_per_factor()
classmethod
¶
Per-factor method: RANK_NORMAL for heavy-tailed, Z_SCORE for normal.
Based on MSCI Barra USE4 and Gu/Kelly/Xiu (2020) classification. Heavy-tailed: value ratios, illiquidity, dividend yield, accruals, asset growth. Approximately normal: momentum, volatility, beta.
for_mad_winsorize()
classmethod
¶
MAD-based winsorization (MSCI Barra +/-3 MAD convention).
StandardizationMethod
¶
Bases: str, Enum
Cross-sectional standardization method.
WinsorizeMethod
¶
Bases: str, Enum
Winsorization method for outlier treatment.
FactorPCAResult
dataclass
¶
Principal component analysis result for a factor score matrix.
Attributes¶
explained_variance_ratio : ndarray, shape (n_components,)
Fraction of variance explained by each principal component,
sorted in descending order.
loadings : pd.DataFrame, shape (n_factors, n_components)
PCA loading matrix. Rows are factor names; columns are
PC1, PC2, ... . Each column is a unit eigenvector of
the correlation matrix of the factor scores.
n_components_95pct : int
Smallest number of components whose cumulative explained
variance ratio is ≥ 0.95.
FactorExposureConstraints
dataclass
¶
Enforceable linear constraints on portfolio factor exposure.
Encodes the set of per-factor inequalities::
lb_g <= sum_i w_i * z_{i,g} <= ub_g
as a pair of matrices ready to be passed directly to
:class:skfolio.optimization.MeanRisk (or any optimizer that
accepts left_inequality / right_inequality).
Parameters¶
left_inequality : np.ndarray of shape (2 * n_factors, n_assets)
Inequality matrix A in the constraint A @ w <= b.
Two rows per factor: -z (lower bound) and +z (upper bound).
right_inequality : np.ndarray of shape (2 * n_factors,)
Bound vector b in the constraint A @ w <= b.
factor_names : list[str]
Names of the constrained factors (in the same order as the row
pairs in left_inequality).
lower_bounds : np.ndarray of shape (n_factors,)
Lower exposure bound per factor.
upper_bounds : np.ndarray of shape (n_factors,)
Upper exposure bound per factor.
NetAlphaResult
dataclass
¶
Result of net alpha calculation after transaction cost deduction.
Attributes¶
gross_alpha : float
Annualised IC-based alpha proxy: mean(IC) * sqrt(annualisation).
avg_turnover : float
Mean one-way turnover across consecutive rebalancing dates, computed
via :func:~optimizer.rebalancing._rebalancer.compute_turnover.
total_cost : float
Cost deduction: avg_turnover * cost_bps / 10_000.
net_alpha : float
Net annualised alpha after cost deduction:
gross_alpha - total_cost.
net_icir : float
Net information coefficient information ratio:
net_alpha / (std(IC) * sqrt(annualisation)).
0.0 when the IC series has zero variance.
QuintileSpreadResult
dataclass
¶
Quintile spread analysis result for a single factor.
Attributes¶
quintile_returns : pd.DataFrame
Dates × Q1..Qn equal-weight portfolio returns per quantile bucket.
Q1 = bottom (lowest scores), Qn = top (highest scores).
spread_returns : pd.Series
Qn − Q1 long-short spread return series indexed by date.
Equals quintile_returns.iloc[:, -1] - quintile_returns.iloc[:, 0]
element-wise.
annualised_mean : float
spread_returns.mean() * 252.
t_stat : float
Two-tailed t-statistic: mean / (std / sqrt(T)).
sharpe : float
Annualised Sharpe ratio: mean * sqrt(252) / std.
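The per-date bucketing behind these attributes can be sketched as follows; `quintile_spread` is a simplified stand-in (equal-weight buckets, plain t-statistic) rather than the library's implementation:

```python
import numpy as np
import pandas as pd

def quintile_spread(scores: pd.DataFrame, fwd_returns: pd.DataFrame,
                    n_quantiles: int = 5):
    """Sketch: per date, bucket assets by score; equal-weight mean return
    per bucket; spread = top bucket minus bottom bucket."""
    rows = {}
    for date in scores.index:
        s = scores.loc[date].dropna()
        r = fwd_returns.loc[date].reindex(s.index)
        # rank(method="first") breaks ties so qcut always yields n buckets
        buckets = pd.qcut(s.rank(method="first"), n_quantiles, labels=False)
        rows[date] = r.groupby(buckets).mean()
    q = pd.DataFrame(rows).T            # dates x buckets
    spread = q.iloc[:, -1] - q.iloc[:, 0]  # Qn - Q1
    t_stat = spread.mean() / (spread.std() / np.sqrt(len(spread)))
    return q, spread, t_stat

scores = pd.DataFrame([[1.0, 2, 3, 4, 5], [5, 4, 3, 2, 1]], columns=list("ABCDE"))
fwd = pd.DataFrame([scores.iloc[0] * 0.01, scores.iloc[1] * 0.02])
q, spread, t = quintile_spread(scores, fwd)
```

In this toy panel forward returns are proportional to scores, so the top-minus-bottom spread is positive on both dates.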
FactorOOSConfig
dataclass
¶
Configuration for rolling block OOS validation.
Parameters¶
train_periods : int
Length of the training window in index periods. Default: 36.
val_periods : int
Length of the validation window in index periods. Default: 12.
step_periods : int
Number of index periods to roll forward between folds. Default: 6.
FactorOOSResult
dataclass
¶
Results from rolling block OOS factor validation.
Attributes¶
per_fold_ic : pd.DataFrame
n_folds × factors matrix of mean IC per fold per factor.
per_fold_spread : pd.DataFrame
n_folds × factors matrix of mean quintile spread per fold.
mean_oos_ic : pd.Series
Mean OOS IC aggregated across folds (one value per factor).
mean_oos_icir : pd.Series
OOS ICIR (mean IC / std IC across folds) per factor.
n_folds : int
Number of folds generated.
CompositeICResult
dataclass
¶
IC analysis results for the composite score signal.
Attributes¶
mean_ic : float
Mean IC of the composite score over the evaluation period.
ic_std : float
Standard deviation of the IC series.
t_stat : float
Newey-West adjusted t-statistic.
p_value : float
Two-tailed p-value from the Newey-West t-statistic.
icir : float
IC Information Ratio: mean(IC) / std(IC).
significant : bool
True when abs(t_stat) >= t_stat_threshold.
best_individual_ic : float
Highest mean IC among individual factors. NaN when
no individual factors were validated alongside.
outperforms_best_individual : bool
True when mean_ic > best_individual_ic.
CorrectedPValues
dataclass
¶
Multiple-testing corrected p-values.
Attributes¶
holm : ndarray
Holm-Bonferroni adjusted p-values (controls FWER).
bh : ndarray
Benjamini-Hochberg adjusted p-values (controls FDR).
FactorValidationReport
dataclass
¶
Complete validation report for all factors.
GroupICResult
dataclass
¶
Result of group-level IC aggregation with per-factor breakdown.
Attributes¶
group_ic : pd.DataFrame
(dates x groups) group-level IC history. Identical in shape to
the legacy build_group_ic_history return value.
factor_ic : pd.DataFrame
(dates x factors) per-factor IC time series.
excluded_factors : dict[str, list[str]]
Group name → list of factor names excluded by the negative-IC
filter policy. Empty when ICNegativeFilterPolicy.INCLUDE.
ICResult
dataclass
¶
Information coefficient analysis results for a single factor.
ICStats
dataclass
¶
Full IC statistics for a single factor including Newey-West inference.
Attributes¶
mean : float
Mean IC over the evaluation period.
variance_nw : float
Newey-West HAC variance of the IC series.
t_stat_nw : float
Newey-West adjusted t-statistic: IC_mean / sqrt(Var_NW / T).
p_value : float
Two-tailed p-value derived from the Newey-West t-statistic.
icir : float
Information Coefficient Information Ratio: mean(IC) / std(IC).
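The Newey-West statistics above can be sketched with a Bartlett-kernel HAC variance. This is a minimal illustrative implementation (biased autocovariance estimator, no small-sample correction), not the module's own routine:

```python
import numpy as np

def newey_west_t_stat(ic, lags: int = 6):
    """Sketch: HAC variance of an IC series with Bartlett weights,
    then t = mean / sqrt(Var_NW / T)."""
    ic = np.asarray(ic, dtype=float)
    T = len(ic)
    e = ic - ic.mean()
    var = e @ e / T  # gamma_0
    for lag in range(1, lags + 1):
        w = 1.0 - lag / (lags + 1)        # Bartlett kernel weight
        gamma = e[lag:] @ e[:-lag] / T    # lag-th autocovariance
        var += 2.0 * w * gamma
    t = ic.mean() / np.sqrt(var / T)
    return t, var

t, var = newey_west_t_stat([0.2, 0.0, 0.2, 0.0], lags=1)
```

Strong negative autocorrelation in the IC series (as in this alternating example) shrinks the HAC variance below the naive one, raising the t-statistic; positive autocorrelation has the opposite effect.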
QuantileSpreadResult
dataclass
¶
Quantile spread analysis results for a single factor.
compute_gross_alpha(net_alpha, avg_turnover, cost_bps=10.0)
¶
Compute gross alpha by adding back estimated transaction costs.
Formula::
gross = net_alpha + avg_turnover * cost_bps / 10_000
Parameters¶
net_alpha : float
Net alpha after transaction costs (annualised).
avg_turnover : float
Average one-way turnover (e.g. 0.5 means 50% of portfolio traded per period).
cost_bps : float
One-way transaction cost in basis points.
Returns¶
float
Gross alpha before transaction costs.
factor_scores_to_expected_returns(scores, betas, factor_premiums, risk_free_rate=0.0)
¶
Convert factor Z-scores to expected returns via linear model.
Implements the formula::
E[r_i] = r_f + λ_mkt · β_i + Σ_g λ_g · z_{i,g}
where λ_mkt is read from factor_premiums["market"] and each
λ_g is read from factor_premiums[g] for factor group g.
Parameters¶
scores : pd.DataFrame
Assets × factor-groups matrix of standardised Z-scores. Rows are
ticker symbols; columns are factor group names (e.g. "value",
"momentum").
betas : pd.Series
Market (CAPM) beta per asset, indexed by ticker. Assets missing
from this Series are treated as having a beta of 1.0 (market
neutral assumption).
factor_premiums : dict[str, float]
Mapping of premium label → annualised premium (e.g.
{"market": 0.05, "value": 0.03, "momentum": 0.04}). The
reserved "market" key provides λ_mkt; all other keys are
matched against columns in scores.
risk_free_rate : float, default 0.0
Annualised risk-free rate r_f.
Returns¶
pd.Series
Annualised expected return per ticker, indexed by scores.index.
Examples¶
>>> import pandas as pd
>>> scores = pd.DataFrame(
...     {"value": [1.0, -1.0], "momentum": [0.5, 0.0]},
...     index=["AAPL", "MSFT"],
... )
>>> betas = pd.Series({"AAPL": 1.2, "MSFT": 0.8})
>>> factor_premiums = {"market": 0.05, "value": 0.03, "momentum": 0.04}
>>> factor_scores_to_expected_returns(scores, betas, factor_premiums, 0.02)
AAPL    0.132
MSFT    0.018
dtype: float64
align_to_pit(data, period_date_col, as_of_date, lag_days, ticker_col='ticker')
¶
Filter time-series data to records published before as_of_date.
A record with period end date D is considered published
lag_days calendar days after D. A record is available as of
as_of_date only when D + lag_days <= as_of_date, equivalently
when D <= as_of_date - lag_days.
For each ticker, the most recent record satisfying the availability
constraint is returned so that callers receive a cross-sectional view
as of as_of_date.
Parameters¶
data : pd.DataFrame
Time-series data containing period_date_col and optionally
ticker_col.
period_date_col : str
Name of the column holding the period end date.
as_of_date : pd.Timestamp or str
The computation date. Only records available on or before this
date (after the lag has elapsed) are returned.
lag_days : int
Calendar days between period end and data availability.
ticker_col : str
Column holding the ticker identifier. Defaults to "ticker".
Returns¶
pd.DataFrame
Cross-sectional view: one row per ticker (the most recent
available record), indexed by ticker_col when present.
Returns an empty DataFrame with the same columns if no records
pass the cutoff.
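The availability rule D + lag_days <= as_of_date can be re-implemented in a few lines of pandas. This sketch mirrors the documented behaviour but is not the module's own function:

```python
import pandas as pd

def align_to_pit_sketch(data: pd.DataFrame, period_date_col: str,
                        as_of_date, lag_days: int,
                        ticker_col: str = "ticker") -> pd.DataFrame:
    """Sketch: keep records whose period end + lag has elapsed,
    then take the most recent available record per ticker."""
    cutoff = pd.Timestamp(as_of_date) - pd.Timedelta(days=lag_days)
    available = data[pd.to_datetime(data[period_date_col]) <= cutoff]
    if available.empty:
        return available
    return (available.sort_values(period_date_col)
                     .groupby(ticker_col).tail(1)   # latest record per ticker
                     .set_index(ticker_col))

data = pd.DataFrame({
    "ticker": ["AAPL", "AAPL", "MSFT"],
    "period_end": ["2023-12-31", "2024-03-31", "2023-12-31"],
    "eps": [1.0, 1.5, 2.0],
})
pit = align_to_pit_sketch(data, "period_end", "2024-04-15", lag_days=45)
```

With a 45-day lag and an as-of date of 2024-04-15, the Q1-2024 record (period end 2024-03-31) is not yet published, so the prior quarter is returned for AAPL; using the newer record would introduce look-ahead bias.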
compute_all_factors(fundamentals, price_history, volume_history=None, analyst_data=None, insider_data=None, config=None, market_returns=None)
¶
Compute all configured factors.
Parameters¶
fundamentals : pd.DataFrame
Cross-sectional data indexed by ticker.
price_history : pd.DataFrame
Price matrix (dates x tickers).
volume_history : pd.DataFrame or None
Volume matrix.
analyst_data : pd.DataFrame or None
Analyst recommendation data.
insider_data : pd.DataFrame or None
Insider transaction data.
config : FactorConstructionConfig or None
Construction parameters.
market_returns : pd.Series or None
Pre-computed market return series for beta estimation.
See :func:compute_factor for details.
Returns¶
pd.DataFrame
Tickers x factors matrix.
compute_factor(factor_type, fundamentals, price_history, volume_history=None, analyst_data=None, insider_data=None, config=None, market_returns=None)
¶
Compute a single factor.
Parameters¶
factor_type : FactorType
Which factor to compute.
fundamentals : pd.DataFrame
Cross-sectional data indexed by ticker.
price_history : pd.DataFrame
Price matrix (dates x tickers).
volume_history : pd.DataFrame or None
Volume matrix (dates x tickers).
analyst_data : pd.DataFrame or None
Analyst recommendation data.
insider_data : pd.DataFrame or None
Insider transaction data.
config : FactorConstructionConfig or None
Construction parameters.
market_returns : pd.Series or None
Pre-computed market return series for beta estimation.
When provided, used as the benchmark instead of the
equal-weight cross-sectional mean. Pass a currency-
consistent broad index (e.g. SPY daily returns) when
price_history spans multiple currency zones.
Returns¶
pd.Series
Factor values indexed by ticker.
check_survivorship_bias(returns, final_periods=12, zero_threshold=1e-10)
¶
Check for potential survivorship bias in a return panel.
Survivorship bias occurs when delisted or failed assets are excluded
from the sample. A simple heuristic: if no asset has near-zero
returns in the final final_periods rows (i.e., no asset appears
to have stopped trading), the panel may suffer from survivorship
bias.
Parameters¶
returns : pd.DataFrame
Dates × assets return matrix.
final_periods : int
Number of trailing periods to inspect.
zero_threshold : float
Absolute threshold below which a return is considered "zero".
Returns¶
bool
True if survivorship bias is suspected, False otherwise.
compute_factor_pca(scores, n_components=None)
¶
Compute PCA on a cross-sectional factor score matrix.
Rows with any NaN are dropped before fitting. Scores are standardised (zero mean, unit variance per factor) so that PCA operates on the correlation structure rather than the covariance structure.
Parameters¶
scores : pd.DataFrame
Tickers × factors matrix of factor scores. Columns are factor
names; rows are asset observations.
n_components : int or None, default None
Number of principal components to retain. None keeps all
components (min(n_samples, n_features)).
Returns¶
FactorPCAResult
See :class:FactorPCAResult for field descriptions.
Raises¶
ValueError
If fewer than 2 factors or fewer than 2 observations are available after dropping NaN rows.
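Because the scores are standardised first, the PCA reduces to an eigendecomposition of the correlation matrix. A self-contained numpy sketch of that pipeline (illustrative only, not the module's implementation):

```python
import numpy as np
import pandas as pd

def factor_pca_sketch(scores: pd.DataFrame):
    """Sketch: drop NaN rows, standardise per factor, eigendecompose
    the correlation matrix, sort components by explained variance."""
    clean = scores.dropna()
    z = (clean - clean.mean()) / clean.std()
    corr = np.corrcoef(z.T)
    eigvals, eigvecs = np.linalg.eigh(corr)
    order = np.argsort(eigvals)[::-1]          # descending variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    evr = eigvals / eigvals.sum()              # explained_variance_ratio
    n_95 = int(np.searchsorted(np.cumsum(evr), 0.95) + 1)
    loadings = pd.DataFrame(eigvecs, index=scores.columns,
                            columns=[f"PC{i + 1}" for i in range(len(eigvals))])
    return evr, loadings, n_95

# Two perfectly collinear factors collapse onto a single component
scores = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [2.0, 4.0, 6.0, 8.0]})
evr, loadings, n_95 = factor_pca_sketch(scores)
```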
flag_redundant_factors(scores, vif_threshold=10.0)
¶
Return factor names whose VIF exceeds vif_threshold.
A VIF above the threshold indicates that the factor's variance is largely explained by the remaining factors, making it a candidate for merging or removal from the composite score.
Parameters¶
scores : pd.DataFrame
Tickers × factors matrix of factor scores. Must contain at least
2 factor columns.
vif_threshold : float, default 10.0
VIF cutoff above which a factor is considered redundant. Commonly
used values: 5 (conservative) or 10 (standard).
Returns¶
list[str]
Factor names with VIF > vif_threshold, in the order they
appear in scores.columns. Empty list if none exceed the
threshold.
Raises¶
ValueError
Propagated from :func:compute_vif if fewer than 2 factors
are provided.
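The VIF behind this flagging is 1 / (1 - R²) from regressing each factor on the others. A hedged standalone sketch (the real `compute_vif` may differ in details such as intercept handling):

```python
import numpy as np
import pandas as pd

def compute_vif_sketch(scores: pd.DataFrame) -> pd.Series:
    """Sketch: VIF_j = 1 / (1 - R^2_j), regressing factor j on the rest."""
    X = scores.dropna()
    vifs = {}
    for col in X.columns:
        y = X[col].to_numpy()
        others = X.drop(columns=col).to_numpy()
        A = np.column_stack([np.ones(len(others)), others])  # intercept + others
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        sst = (y - y.mean()) @ (y - y.mean())
        r2 = 1.0 - resid @ resid / sst
        vifs[col] = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return pd.Series(vifs)

def flag_redundant(scores: pd.DataFrame, vif_threshold: float = 10.0):
    vif = compute_vif_sketch(scores)
    return [c for c in scores.columns if vif[c] > vif_threshold]

scores = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [1.0, -1.0, 1.0, -1.0]})
```

Adding a column that is an exact linear combination of the others (e.g. `c = a + b`) drives every VIF toward infinity, so all three factors get flagged.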
build_factor_bl_views(composite_scores, selected_tickers, config)
¶
Generate Black-Litterman absolute views from composite factor scores.
For each selected ticker with composite score z_i, generates a view::
E[r_i] = (rf + market_premium + z_i * score_premium) / 252
Parameters¶
composite_scores : pd.Series
Composite factor scores indexed by ticker.
selected_tickers : pd.Index
Tickers in the portfolio.
config : FactorIntegrationConfig
Integration configuration with rf, market premium, and score premium.
Returns¶
tuple[tuple[str, ...], tuple[float, ...]]
(views, confidences) where views are BL-compatible strings
like "AAPL == 0.00045" and confidences are in [0, 1].
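The view formula maps directly to string views of the documented shape. A sketch under stated assumptions: the parameter defaults and the confidence rule (|z| / 3 capped at view_confidence_cap) are invented for illustration; the source only specifies that confidences lie in [0, 1]:

```python
import pandas as pd

def build_views_sketch(composite_scores: pd.Series, selected,
                       rf: float = 0.04, market_premium: float = 0.05,
                       score_premium: float = 0.02,
                       confidence_cap: float = 0.5):
    """Sketch: daily absolute BL view per ticker,
    E[r_i] = (rf + market_premium + z_i * score_premium) / 252."""
    views, confidences = [], []
    for ticker in selected:
        z = composite_scores[ticker]
        daily = (rf + market_premium + z * score_premium) / 252
        views.append(f"{ticker} == {daily:.6f}")
        # Hypothetical confidence mapping, clipped at the cap
        confidences.append(min(abs(z) / 3.0, confidence_cap))
    return tuple(views), tuple(confidences)

views, conf = build_views_sketch(pd.Series({"AAPL": 1.0}), ["AAPL"])
```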
build_factor_exposure_constraints(factor_scores, bounds)
¶
Build enforceable linear factor exposure constraints.
For each factor g, the constraint enforces::
lb_g <= sum_i w_i * z_{i,g} <= ub_g
The result is expressed as left_inequality @ w <= right_inequality
(two rows per factor) and can be passed directly to
:class:skfolio.optimization.MeanRisk via its
left_inequality / right_inequality constructor arguments.
Parameters¶
factor_scores : pd.DataFrame
Tickers x factors matrix of standardised factor scores.
The tickers must match the assets used in the optimizer fit.
bounds : tuple[float, float] or dict[str, tuple[float, float]]
Exposure bounds applied to every factor (uniform) when given as a
single (lower, upper) tuple, or per-factor bounds when given as
a dict mapping factor name → (lower, upper).
Returns¶
FactorExposureConstraints
Dataclass holding left_inequality, right_inequality, and
metadata. Pass left_inequality and right_inequality as
keyword arguments to the optimizer.
Warns¶
UserWarning
If the equal-weight portfolio exposure lies outside [lb, ub]
for any factor (i.e. the constraint may be infeasible or very
tight under a balanced allocation).
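The two-rows-per-factor encoding is mechanical: the lower bound lb <= z @ w becomes -z @ w <= -lb. A minimal sketch of the matrix construction (metadata and the feasibility warning omitted):

```python
import numpy as np
import pandas as pd

def exposure_constraints_sketch(factor_scores: pd.DataFrame, bounds):
    """Sketch: per factor g, emit rows -z_g (<= -lb_g) and +z_g (<= ub_g)
    so that A @ w <= b encodes lb_g <= sum_i w_i * z_{i,g} <= ub_g."""
    rows, rhs = [], []
    for g in factor_scores.columns:
        lb, ub = bounds[g] if isinstance(bounds, dict) else bounds
        z = factor_scores[g].to_numpy()
        rows.append(-z)   # lower bound row
        rhs.append(-lb)
        rows.append(z)    # upper bound row
        rhs.append(ub)
    return np.vstack(rows), np.asarray(rhs)

fs = pd.DataFrame({"value": [1.0, -1.0]}, index=["A", "B"])
A_mat, b = exposure_constraints_sketch(fs, (-0.5, 0.5))
```

With symmetric scores, the equal-weight portfolio has zero value exposure and satisfies both rows, which is exactly the feasibility check the documented UserWarning performs.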
build_factor_integration(config, composite_scores, standardized_factors, selected_tickers)
¶
Build factor-to-optimizer integration objects.
Depending on config.use_black_litterman, either creates a
Black-Litterman prior from composite scores or builds linear
factor exposure constraints.
Parameters¶
config : FactorIntegrationConfig
Integration configuration.
composite_scores : pd.Series
Composite factor scores indexed by ticker.
standardized_factors : pd.DataFrame
Standardized factor scores (tickers x factors).
selected_tickers : pd.Index
Tickers selected for the portfolio.
Returns¶
tuple[BasePrior | None, FactorExposureConstraints | None]
(prior, constraints) — one of the two will be set,
the other None.
compute_net_alpha(ic_series, weights_history, cost_bps=10.0, annualisation=252)
¶
Compute factor net alpha after deducting turnover-based transaction costs.
Combines IC-based gross alpha with the turnover cost from a weights history to produce a single net performance metric::
gross_alpha = mean(IC) * sqrt(annualisation)
avg_turnover = mean one-way turnover across rebalancing dates
total_cost = avg_turnover * cost_bps / 10_000
net_alpha = gross_alpha - total_cost
net_icir = net_alpha / (std(IC) * sqrt(annualisation))
Parameters¶
ic_series : pd.Series
Time series of period information coefficients (Spearman or Pearson
rank correlation between factor scores and forward returns), one
value per rebalancing period.
weights_history : pd.DataFrame
Portfolio weights at each rebalancing date: rows = dates,
columns = assets. Turnover is computed between every pair of
consecutive rows.
cost_bps : float, default=10.0
Round-trip transaction cost in basis points.
annualisation : int, default=252
Number of periods per year (252 for daily, 12 for monthly).
Returns¶
NetAlphaResult
Dataclass with gross_alpha, avg_turnover, total_cost,
net_alpha, and net_icir.
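The formulas above can be reproduced in a few lines of pandas. The tickers, weights, and IC values below are hypothetical, and the sketch assumes one-way turnover is half the summed absolute weight change per rebalance:

```python
import numpy as np
import pandas as pd

# Hypothetical inputs: four monthly ICs and three rebalance snapshots.
ic_series = pd.Series([0.05, 0.03, 0.06, 0.04])
weights_history = pd.DataFrame(
    {"AAA": [0.5, 0.3, 0.4], "BBB": [0.5, 0.7, 0.6]},
    index=pd.to_datetime(["2024-01-31", "2024-02-29", "2024-03-31"]),
)
annualisation = 12   # monthly data
cost_bps = 10.0

gross_alpha = ic_series.mean() * np.sqrt(annualisation)
# One-way turnover per rebalance: half the summed absolute weight changes.
turnover = weights_history.diff().dropna().abs().sum(axis=1) / 2
avg_turnover = turnover.mean()
total_cost = avg_turnover * cost_bps / 10_000
net_alpha = gross_alpha - total_cost
net_icir = net_alpha / (ic_series.std() * np.sqrt(annualisation))
```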
estimate_factor_premia(factor_mimicking_returns)
¶
build_factor_mimicking_portfolios(scores, returns, quantile=0.3, weighting='equal', beta_neutral=False, market_returns=None)
¶
Build long-short factor-mimicking portfolio return time series.
For each date the top quantile fraction of assets (by factor score) are held long and the bottom quantile fraction are held short. The long-short return is the equal- or value-weighted long leg minus the corresponding short leg.
The function handles one factor at a time: scores is a dates × assets DataFrame encoding cross-sectional scores for a single factor. For multiple factors, call once per factor and concatenate the results::
factor_returns = pd.concat(
[
build_factor_mimicking_portfolios(scores_value, returns)
.rename(columns={"factor_return": "value"}),
build_factor_mimicking_portfolios(scores_mom, returns)
.rename(columns={"factor_return": "momentum"}),
],
axis=1,
)
Parameters¶
scores : pd.DataFrame
Dates × assets matrix of cross-sectional factor scores.
Index = dates; columns = asset tickers.
returns : pd.DataFrame
Dates × assets matrix of asset returns, aligned with scores
on the date index. Columns may be a superset or subset of
scores columns; the intersection is used.
quantile : float, default 0.30
Fraction of the asset universe assigned to each leg. Must be
in (0, 0.5].
weighting : {"equal", "value"}, default "equal"
Weighting scheme within each leg.
"equal" — every asset in the leg receives the same weight.
"value" — assets are weighted by the absolute value of
their factor score.
beta_neutral : bool, default False
When True, hedge the long-short portfolio against market
beta exposure. The hedge ratio adjusts the short-leg weight
so that the portfolio beta is approximately zero.
market_returns : pd.Series or None
Market return series, required when beta_neutral=True.
Returns¶
pd.DataFrame
Dates × 1 DataFrame of long-short portfolio returns. Column
name is "factor_return". Index is the intersection of
scores and returns dates. Missing periods (fewer than
2 * k valid observations) are filled with NaN.
Raises¶
ValueError
If quantile is outside (0, 0.5] or weighting is unknown.
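A minimal single-date sketch of the equal-weight leg construction. The assets and returns are made up, and the real function iterates over all dates and additionally handles value weighting and beta hedging:

```python
import pandas as pd

# One cross-section of ten assets; quantile=0.3 puts k=3 names in each leg.
scores = pd.Series(range(10), index=[f"T{i}" for i in range(10)], dtype=float)
returns = pd.Series([0.01 * i for i in range(10)], index=scores.index)

quantile = 0.3
k = int(len(scores) * quantile)
ranked = scores.sort_values().index
short_leg = returns[ranked[:k]].mean()    # bottom 30% by score, equal weight
long_leg = returns[ranked[-k:]].mean()    # top 30%
factor_return = long_leg - short_leg
```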
compute_cross_factor_correlation(factor_returns)
¶
Compute the Pearson correlation matrix across factor-mimicking portfolios.
Parameters¶
factor_returns : pd.DataFrame
Dates × factors DataFrame of long-short factor returns, as
returned by build_factor_mimicking_portfolios (possibly
concatenated across multiple factors).
Returns¶
pd.DataFrame
Factors × factors symmetric correlation matrix. Diagonal
entries are exactly 1.0. Computed on the rows where all
factors have non-NaN returns (pairwise-complete otherwise).
compute_quintile_spread(scores, returns, n_quantiles=5)
¶
Compute quintile portfolio returns and spread for a single factor.
At each date assets are ranked by factor score and split into n_quantiles equal-count buckets (Q1 = lowest scores, Qn = highest). Each bucket return is the equal-weight average of its members. The long-short spread is Qn − Q1.
Ties in scores are broken by rank order (method="first"), ensuring
every bucket is populated at every date.
Parameters¶
scores : pd.DataFrame
Dates × assets matrix of cross-sectional factor scores.
returns : pd.DataFrame
Dates × assets matrix of asset returns, aligned with scores.
n_quantiles : int, default 5
Number of equal-count buckets. 5 = quintiles, 10 = deciles.
Must be ≥ 2.
Returns¶
QuintileSpreadResult
See :class:QuintileSpreadResult for field descriptions.
Raises¶
ValueError
If n_quantiles < 2.
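The bucketing step can be sketched for a single date with pandas; the asset universe below is made up:

```python
import pandas as pd

# Single date, ten assets, quintiles (n_quantiles=5, two names per bucket).
scores = pd.Series(range(10), index=[f"T{i}" for i in range(10)], dtype=float)
returns = pd.Series([0.01 * i for i in range(10)], index=scores.index)

n_quantiles = 5
# rank(method="first") breaks ties so every bucket is populated.
buckets = pd.qcut(scores.rank(method="first"), n_quantiles, labels=False)
bucket_returns = returns.groupby(buckets).mean()
spread = bucket_returns.iloc[-1] - bucket_returns.iloc[0]   # Qn - Q1
```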
fit_gbt_composite(scores, forward_returns, max_depth=3, n_estimators=50, random_state=0)
¶
Fit a gradient-boosted tree model mapping factor scores to forward returns.
Parameters¶
scores : pd.DataFrame
Historical tickers x factors matrix (training observations).
forward_returns : pd.Series
Forward return per ticker for the training period.
max_depth : int
Maximum depth of individual regression trees (3-5 recommended
to limit extrapolation and retain interpretability).
n_estimators : int
Number of boosting rounds.
random_state : int
Random state for reproducibility.
Returns¶
GradientBoostingRegressor
Fitted GBT model.
fit_ridge_composite(scores, forward_returns, alpha=1.0)
¶
Fit a ridge regression model mapping factor scores to forward returns.
Parameters¶
scores : pd.DataFrame
Historical tickers x factors matrix (training observations).
Must be aligned with forward_returns on the index.
forward_returns : pd.Series
Forward return per ticker for the training period.
alpha : float
L2 regularisation strength. The value is wrapped in a
single-element array for RidgeCV, so internal cross-validation
can be enabled by supplying multiple alphas; a single alpha is
kept here for determinism.
Returns¶
RidgeCV
Fitted ridge model. Call predict(scores) for new data.
predict_composite_scores(model, scores)
¶
Apply a fitted ridge or GBT model to produce normalised composite scores.
The raw predictions are standardised to zero mean and unit variance so the output is on the same scale as z-score factor inputs.
Parameters¶
model : RidgeCV or GradientBoostingRegressor
A model returned by :func:fit_ridge_composite or
:func:fit_gbt_composite.
scores : pd.DataFrame
Current-period tickers x factors matrix.
Returns¶
pd.Series
Normalised composite score per ticker (zero mean, unit variance).
Tickers with all-NaN factor rows receive NaN.
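The fit-then-normalise flow can be sketched in plain numpy (the library itself wraps scikit-learn's RidgeCV; the closed-form solve below is an equivalent stand-in, and all data is synthetic):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
X_train = pd.DataFrame(rng.normal(size=(40, 3)),
                       columns=["value", "momentum", "quality"])
y_train = X_train["value"] * 0.02 + rng.normal(scale=0.01, size=40)

# Closed-form ridge: beta = (X'X + alpha * I)^-1 X'y.
alpha = 1.0
A = X_train.values
beta = np.linalg.solve(A.T @ A + alpha * np.eye(A.shape[1]), A.T @ y_train.values)

# Predict on a new cross-section and standardise, matching the
# zero-mean / unit-variance output of predict_composite_scores.
X_now = pd.DataFrame(rng.normal(size=(10, 3)), columns=X_train.columns,
                     index=[f"T{i}" for i in range(10)])
raw = pd.Series(X_now.values @ beta, index=X_now.index)
composite = (raw - raw.mean()) / raw.std()
```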
run_factor_oos_validation(scores, returns, config=None, cpcv_config=None)
¶
Rolling block or CPCV out-of-sample validation of factor IC and spreads.
Parameters¶
scores : pd.DataFrame
Panel of standardised factor scores with a two-level row MultiIndex
(date, ticker) and one column per factor.
returns : pd.DataFrame
Forward returns panel with the same (date, ticker) MultiIndex
and a single return column.
config : FactorOOSConfig or None
Rolling window parameters. Defaults to FactorOOSConfig().
Ignored when cpcv_config is provided.
cpcv_config : CPCVConfig or None
When provided, uses combinatorial purged cross-validation
instead of rolling blocks. Overrides config.
Returns¶
FactorOOSResult
Per-fold and aggregate IC and quintile spread statistics.
Notes¶
The validation window computation uses only val-window dates; no
training-window data is used. Fold count equals
floor((total_periods - train_periods) / step_periods) for rolling,
or C(n_folds, n_test_folds) for CPCV.
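For example, with 120 monthly periods, a 60-period training window, a 12-period step, and a CPCV scheme holding out 2 of 6 folds (all numbers illustrative):

```python
from math import comb, floor

# Rolling: floor((total_periods - train_periods) / step_periods).
rolling_folds = floor((120 - 60) / 12)
# CPCV: C(n_folds, n_test_folds).
cpcv_folds = comb(6, 2)
```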
apply_regime_tilts(group_weights, regime, config=None)
¶
Apply regime-conditional multiplicative tilts to group weights.
Parameters¶
group_weights : dict[FactorGroupType, float]
Base group weights.
regime : MacroRegime
Current macro regime.
config : RegimeTiltConfig or None
Tilt configuration.
Returns¶
dict[FactorGroupType, float]
Tilted group weights (re-normalized to sum to original total).
Notes¶
The bounding sequence is:
- Look up raw tilts from get_regime_tilts.
- Clamp each multiplier to [0, config.max_tilt_multiplier].
- Apply the clamped multiplier to each group weight.
- Floor each result to config.min_post_tilt_weight * total so no
group is compressed to near-zero.
- Re-normalize to preserve the original total weight.
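The bounding sequence can be walked through with hypothetical weights and multipliers (none of these values are the library's calibrated defaults):

```python
# Hypothetical base weights and raw regime tilts.
weights = {"value": 0.4, "momentum": 0.4, "quality": 0.2}
raw_tilts = {"value": 1.5, "momentum": 0.2, "quality": 1.0}
max_tilt_multiplier, min_post_tilt_weight = 1.3, 0.05

total = sum(weights.values())
# Clamp multipliers to [0, max_tilt_multiplier], then apply.
tilted = {g: w * min(max(raw_tilts[g], 0.0), max_tilt_multiplier)
          for g, w in weights.items()}
# Floor so no group is compressed to near-zero.
tilted = {g: max(w, min_post_tilt_weight * total) for g, w in tilted.items()}
# Re-normalize to preserve the original total weight.
scale = total / sum(tilted.values())
tilted = {g: w * scale for g, w in tilted.items()}
```

Here the value tilt is clamped from 1.5 to 1.3 before application, and the final weights again sum to 1.0.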
check_regime_disagreement(regime_a, regime_b, label_a='composite', label_b='hmm')
¶
Check whether two regime classifications disagree.
When the macro-indicator and HMM-based (or any two) regime systems
produce different classifications, this function logs a WARNING
and returns True. Returns False when they agree.
Parameters¶
regime_a, regime_b : MacroRegime
Regime classifications from two different subsystems.
label_a, label_b : str
Human-readable labels for the two sources (used in the log
message).
Returns¶
bool
True if the regimes disagree, False otherwise.
classify_regime(macro_data, thresholds=None)
¶
Classify the current macro-economic regime.
Uses a simple heuristic based on GDP growth and leading indicators. The regime is determined by the latest observation's position relative to trend.
When richer indicators (pmi, spread_2s10s, hy_oas)
are present, delegates to :func:classify_regime_composite.
Parameters¶
macro_data : pd.DataFrame
Macro indicators with columns that may include
gdp_growth, leading_indicator, yield_spread,
unemployment_rate. Index is date.
thresholds : RegimeThresholdConfig or None
Scoring thresholds forwarded to the composite classifier.
Returns¶
MacroRegime
Current regime classification.
classify_regime_composite(macro_data, thresholds=None)
¶
Classify macro regime using the multi-indicator composite score.
Uses ISM PMI, 2s10s yield curve spread, and HY credit spread to compute a composite score S_t as defined in the macroeconomic analysis framework (Chapter 7).
The input DataFrame should contain any of these columns:
pmi (Manufacturing PMI), spread_2s10s (10Y-2Y spread in %),
hy_oas (HY OAS in basis points), sentiment (news score).
Parameters¶
macro_data : pd.DataFrame
Macro indicators indexed by date.
thresholds : RegimeThresholdConfig or None
Scoring thresholds. Defaults to the empirical calibration.
Returns¶
MacroRegime
Regime classification based on composite score.
get_regime_tilts(regime, config=None)
¶
compute_composite_score(standardized_factors, coverage, config=None, ic_history=None, training_scores=None, training_returns=None, group_weights=None)
¶
Compute composite score from standardized factors.
Parameters¶
standardized_factors : pd.DataFrame
Tickers x factors matrix.
coverage : pd.DataFrame
Boolean coverage matrix.
config : CompositeScoringConfig or None
Scoring configuration.
ic_history : pd.DataFrame or None
Required when config.method is IC_WEIGHTED or
ICIR_WEIGHTED. Columns must match group names; each column
is treated as the IC time series for that group.
training_scores : pd.DataFrame or None
Required when config.method is RIDGE_WEIGHTED or
GBT_WEIGHTED. Historical tickers x factors matrix used to
train the ML model (must not overlap with current-period data).
training_returns : pd.Series or None
Required when config.method is RIDGE_WEIGHTED or
GBT_WEIGHTED. Forward returns aligned with training_scores.
group_weights : dict[str, float] or None
Pre-computed group weights (e.g. from regime tilts). Threaded
through to the inner scoring functions.
Returns¶
pd.Series or pd.DataFrame
Composite score per ticker. When config.return_coverage is
True, returns a DataFrame with composite and coverage_ratio
columns.
compute_equal_weight_composite(group_scores, config=None, group_weights=None)
¶
Equal-weight composite with core/supplementary tiering.
Parameters¶
group_scores : pd.DataFrame
Tickers x groups matrix.
config : CompositeScoringConfig or None
Scoring configuration.
group_weights : dict[str, float] or None
Pre-computed group weights (e.g. from regime tilts). When
provided, skip tier-based derivation and use these weights
directly.
Returns¶
pd.Series
Composite score per ticker.
compute_group_scores(standardized_factors, coverage)
¶
compute_ic_weighted_composite(group_scores, ic_history, config=None, group_weights=None)
¶
IC-weighted composite score.
Uses trailing information coefficient history to weight groups.
Parameters¶
group_scores : pd.DataFrame
Tickers x groups matrix.
ic_history : pd.DataFrame
Periods x groups matrix of IC values.
config : CompositeScoringConfig or None
Scoring configuration.
group_weights : dict[str, float] or None
Pre-computed group weights (e.g. from regime tilts). When
provided, use as tier multipliers instead of config
core/supplementary weights.
Returns¶
pd.Series
Composite score per ticker.
compute_icir_weighted_composite(group_scores, ic_series_per_group, config=None, group_weights=None)
¶
ICIR-weighted composite score.
Weights each group by max(ICIR, 0) = max(mean(IC) / std(IC), 0),
normalised to sum to 1. Groups with zero, negative, or undefined ICIR
receive zero weight. Falls back to equal-weight when all groups have
ICIR <= 0.
Parameters¶
group_scores : pd.DataFrame
Tickers x groups matrix.
ic_series_per_group : dict[str, pd.Series]
Per-group IC time series. Keys must match group_scores columns.
config : CompositeScoringConfig or None
Scoring configuration.
group_weights : dict[str, float] or None
Pre-computed group weights (e.g. from regime tilts). When provided,
use as tier multipliers instead of config core/supplementary weights.
Returns¶
pd.Series
Composite score per ticker.
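The max(ICIR, 0) weighting can be sketched directly from the definition above; the group names, scores, and IC series are made up, and the exact normalisation details of the library may differ:

```python
import pandas as pd

group_scores = pd.DataFrame(
    {"value": [1.0, -0.5, 0.2], "momentum": [0.3, 0.8, -0.4]},
    index=["A", "B", "C"],
)
ic_series_per_group = {
    "value": pd.Series([0.05, 0.04, 0.06]),      # steady IC -> high ICIR
    "momentum": pd.Series([0.02, -0.02, 0.03]),  # noisy IC -> low ICIR
}

icir = {g: s.mean() / s.std() for g, s in ic_series_per_group.items()}
raw = {g: max(v, 0.0) for g, v in icir.items()}   # negative ICIR -> zero weight
total = sum(raw.values())
weights = ({g: v / total for g, v in raw.items()} if total > 0
           else {g: 1.0 / len(raw) for g in raw})  # equal-weight fallback
composite = sum(group_scores[g] * w for g, w in weights.items())
```

The consistent value group dominates the weighting despite momentum having a positive mean IC.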
compute_ml_composite(standardized_factors, training_scores, training_returns, config)
¶
ML composite score using ridge regression or gradient-boosted trees.
Trains the model on historical (training_scores, training_returns)
and predicts on the current-period standardized_factors. The
prediction is normalised to zero mean and unit variance.
The training window must end strictly before the prediction date to avoid look-ahead bias; callers are responsible for this temporal split.
Parameters¶
standardized_factors : pd.DataFrame
Current-period tickers x factors matrix (prediction target).
training_scores : pd.DataFrame
Historical tickers x factors matrix aligned with
training_returns.
training_returns : pd.Series
Forward return per ticker for the training period.
config : CompositeScoringConfig
Must have method set to RIDGE_WEIGHTED or GBT_WEIGHTED.
Returns¶
pd.Series
Normalised composite score per ticker (zero mean, unit variance).
apply_sector_balance(selected, scores, sector_labels, parent_universe, tolerance=0.05)
¶
Adjust selection for sector-proportional representation.
Iterates the balance pass until convergence (no further adds or
removes are needed) or until _MAX_BALANCE_ITERATIONS is reached.
A warning is logged if the cap is hit before convergence.
Parameters¶
selected : pd.Index
Initially selected tickers.
scores : pd.Series
Composite scores for all candidates.
sector_labels : pd.Series
Sector label per ticker.
parent_universe : pd.Index
Full universe for computing target sector weights.
tolerance : float
Maximum deviation from parent sector weights.
Returns¶
pd.Index
Sector-balanced selection.
compute_selection_turnover(current, new, universe)
¶
select_fixed_count(scores, target_count, buffer_fraction=0.1, current_members=None)
¶
Select top N stocks by composite score with buffer.
Parameters¶
scores : pd.Series
Composite scores indexed by ticker.
target_count : int
Target number of stocks.
buffer_fraction : float
Buffer as a fraction of target_count. Current members within
the buffer zone are retained in preference to the lowest-ranked
direct entrants, but the returned index always contains exactly
min(len(valid_scores), target_count) tickers.
current_members : pd.Index or None
Tickers currently selected.
Returns¶
pd.Index
Selected tickers. Length is always
min(len(scores.dropna()), target_count).
select_quantile(scores, target_quantile=0.8, exit_quantile=None, current_members=None)
¶
Select stocks above a quantile threshold.
Parameters¶
scores : pd.Series
Composite scores indexed by ticker.
target_quantile : float
Quantile threshold for entry (0-1).
exit_quantile : float or None
Quantile threshold for exit (hysteresis). If None,
uses target_quantile.
current_members : pd.Index or None
Currently selected tickers.
Returns¶
pd.Index
Selected tickers.
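A sketch of the entry/exit hysteresis, assuming incumbents are retained as long as they stay above the (lower) exit quantile while new names must clear the entry quantile; the universe is made up:

```python
import pandas as pd

scores = pd.Series({"A": 0.95, "B": 0.85, "C": 0.75, "D": 0.60, "E": 0.40})
current_members = pd.Index(["C", "E"])

entry = scores.quantile(0.8)   # new names must clear this bar
exit_ = scores.quantile(0.5)   # incumbents leave only below this

incumbent = scores.index.isin(current_members)
mask = (scores >= entry).values | (incumbent & (scores >= exit_).values)
selected = scores.index[mask]
```

C survives via hysteresis even though it misses the entry bar; E falls below the exit bar and is dropped.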
select_stocks(scores, config=None, current_members=None, sector_labels=None, parent_universe=None, return_turnover=False)
¶
Select stocks from scored universe.
Parameters¶
scores : pd.Series
Composite scores indexed by ticker.
config : SelectionConfig or None
Selection configuration.
current_members : pd.Index or None
Currently selected tickers for buffer/hysteresis.
sector_labels : pd.Series or None
Sector labels for sector balancing.
parent_universe : pd.Index or None
Full universe for sector weight targets.
return_turnover : bool
When True, return (selected, turnover) tuple.
Returns¶
pd.Index or tuple[pd.Index, float]
Selected tickers, optionally with turnover.
neutralize_sector(scores, sector_labels, country_labels=None)
¶
Demean scores within each sector (and optionally country).
Parameters¶
scores : pd.Series
Standardized factor scores.
sector_labels : pd.Series
Sector label per ticker.
country_labels : pd.Series or None
Country label per ticker for country neutralization.
Returns¶
pd.Series
Sector-neutralized scores.
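The sector demeaning is a one-liner in pandas; tickers and sectors below are made up:

```python
import pandas as pd

scores = pd.Series({"A": 1.0, "B": 0.5, "C": -0.2, "D": 0.3})
sector_labels = pd.Series({"A": "Tech", "B": "Tech", "C": "Fin", "D": "Fin"})

# Demean within each sector: every sector mean becomes zero.
neutral = scores - scores.groupby(sector_labels).transform("mean")
```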
orthogonalize_factors(factor_scores, method='pca', min_variance_explained=0.95)
¶
Project factor scores onto orthogonal principal components.
Eliminates multicollinearity among factor scores by projecting
them into a lower-dimensional PCA space. Retains the minimum
number of components that explain at least min_variance_explained
of the total variance.
Parameters¶
factor_scores : pd.DataFrame
Tickers × factors matrix of factor scores.
method : str
Projection method. Only "pca" is supported.
min_variance_explained : float
Minimum cumulative explained variance ratio for retained
components. Must be in (0, 1].
Returns¶
pd.DataFrame
Tickers × PCs matrix with columns named PC1, PC2, ....
Rows with NaN in the input are filled with NaN in the output
but otherwise preserve the original index.
Raises¶
ConfigurationError
If method is not "pca".
DataError
If fewer than 2 factors or fewer than 2 non-NaN observations.
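A minimal SVD-based sketch of the projection and the variance cut-off. The three-factor matrix below is constructed so one factor is a linear blend of the other two, giving effective rank 2:

```python
import numpy as np
import pandas as pd

# Two independent drivers plus one factor that is a blend of both.
value    = np.array([1., -1,  1, -1,  1, -1,  1, -1])
momentum = np.array([1.,  1, -1, -1,  1,  1, -1, -1])
factors = pd.DataFrame({
    "value": value,
    "quality": 0.9 * value + 0.1 * momentum,
    "momentum": momentum,
})

X = factors - factors.mean()                      # centre each column
U, s, Vt = np.linalg.svd(X.values, full_matrices=False)
explained = s**2 / (s**2).sum()
# Smallest k with cumulative explained variance >= 0.95.
k = int(np.searchsorted(np.cumsum(explained), 0.95) + 1)
pcs = pd.DataFrame(
    X.values @ Vt[:k].T, index=factors.index,
    columns=[f"PC{i + 1}" for i in range(k)],
)
```

Two components capture all the variance, and the retained columns are mutually orthogonal, removing the value/quality collinearity.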
rank_normal_standardize(scores)
¶
standardize_all_factors(raw_factors, config=None, sector_labels=None, country_labels=None)
¶
Standardize all factors and compute coverage.
Parameters¶
raw_factors : pd.DataFrame
Tickers x factors matrix of raw values.
config : StandardizationConfig or None
Standardization parameters.
sector_labels : pd.Series or None
Sector labels for neutralization.
country_labels : pd.Series or None
Country labels for neutralization.
Returns¶
tuple[pd.DataFrame, pd.DataFrame]
(standardized_scores, coverage) where coverage is a boolean
DataFrame indicating non-NaN values.
standardize_factor(raw_scores, config=None, sector_labels=None, country_labels=None, *, factor_name='')
¶
Full standardization pipeline for a single factor.
Parameters¶
raw_scores : pd.Series
Raw factor values.
config : StandardizationConfig or None
Standardization parameters.
sector_labels : pd.Series or None
Sector labels for neutralization.
country_labels : pd.Series or None
Country labels for neutralization.
factor_name : str
Column name of the factor, used to look up per-factor method
overrides in config.factor_method_overrides and the
FACTOR_DIRECTION sign convention.
Returns¶
pd.Series
Standardized factor scores.
winsorize_cross_section(scores, lower_pct=0.01, upper_pct=0.99)
¶
winsorize_cross_section_mad(scores, mad_multiplier=3.0)
¶
Clip scores using Median Absolute Deviation (MAD).
Uses the normal-consistent scale factor 1.4826 * MAD to set clip
boundaries at median +/- mad_multiplier * scale, following the
MSCI Barra USE4 convention (+/-3 MAD).
Parameters¶
scores : pd.Series
Raw factor scores (may contain NaN).
mad_multiplier : float
Number of scaled-MAD units for clip boundaries.
Returns¶
pd.Series
Winsorized scores.
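The MAD clipping rule can be reproduced directly from the convention stated above; the scores are made up:

```python
import pandas as pd

scores = pd.Series([0.10, 0.20, 0.15, 0.18, 5.0])   # one obvious outlier
mad_multiplier = 3.0

median = scores.median()
mad = (scores - median).abs().median()
scale = 1.4826 * mad                                 # normal-consistent scale
lower = median - mad_multiplier * scale
upper = median + mad_multiplier * scale
winsorized = scores.clip(lower, upper)
```

The 5.0 outlier is pulled in to the upper boundary while the inlying values pass through unchanged.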
z_score_standardize(scores)
¶
benjamini_hochberg(p_values, alpha=0.05)
¶
compute_composite_ic(composite_scores_history, returns_history, newey_west_lags=6, t_stat_threshold=2.0, min_observations=3)
¶
Compute IC statistics for the composite score signal.
Parameters¶
composite_scores_history : pd.DataFrame
Dates x tickers matrix of composite scores.
returns_history : pd.DataFrame
Dates x tickers matrix of forward returns.
newey_west_lags : int, default 6
Number of lags for HAC standard errors.
t_stat_threshold : float, default 2.0
Threshold for significance decision.
min_observations : int, default 3
Minimum non-NaN observations per cross-section date.
Returns¶
CompositeICResult
IC statistics for the composite score. The
best_individual_ic and outperforms_best_individual
fields are populated by run_factor_validation when
individual factor results are available.
compute_ic_series(factor_scores_history, returns_history, factor_name, min_observations=3)
¶
Compute IC time series for a factor.
Parameters¶
factor_scores_history : pd.DataFrame
Dates x tickers matrix of factor scores.
returns_history : pd.DataFrame
Dates x tickers matrix of forward returns.
factor_name : str
Used only for labeling.
min_observations : int, default 3
Minimum number of common non-NaN observations per date.
Passed through to compute_monthly_ic.
Returns¶
pd.Series
IC values indexed by date.
compute_ic_stats(ic_series, lags=5)
¶
Compute full IC statistics including Newey-West t-stat and ICIR.
Parameters¶
ic_series : pd.Series
Time series of IC values (one per cross-section date).
lags : int
Number of lags for Newey-West HAC standard errors.
Returns¶
ICStats
Dataclass containing mean, variance_nw, t_stat_nw,
p_value, and icir.
compute_icir(ic_series)
¶
Compute the IC Information Ratio (mean IC / std IC).
ICIR penalises factors with high average IC but also high IC volatility (inconsistent predictors). Use this as the weighting signal in ICIR-weighted composite scoring.
Parameters¶
ic_series : pd.Series
Time series of IC values (one per cross-section date).
Returns¶
float
ICIR value, or 0.0 if std(IC) == 0 or fewer than
2 non-NaN observations.
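The definition and its guard conditions fit in a small sketch (`compute_icir_sketch` is an illustrative stand-in, not the library function):

```python
import pandas as pd

def compute_icir_sketch(ic_series: pd.Series) -> float:
    # Mean IC divided by IC volatility; 0.0 when undefined.
    s = ic_series.dropna()
    if len(s) < 2 or s.std() == 0:
        return 0.0
    return s.mean() / s.std()
```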
compute_monthly_ic(factor_scores, forward_returns, min_observations=3)
¶
Compute rank information coefficient (Spearman correlation).
Parameters¶
factor_scores : pd.Series
Cross-sectional factor scores.
forward_returns : pd.Series
Forward returns for the same tickers.
min_observations : int, default 3
Minimum number of common non-NaN observations required.
Returns NaN if fewer are available.
Returns¶
float
Rank IC (Spearman correlation).
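The rank IC for one cross-section reduces to a Spearman correlation over the common tickers; the scores and returns below are made up (and perfectly rank-aligned, so the IC is 1.0):

```python
import pandas as pd

factor_scores = pd.Series({"A": 1.2, "B": 0.5, "C": -0.3, "D": 0.9})
forward_returns = pd.Series({"A": 0.04, "B": 0.01, "C": -0.02, "D": 0.03})

common = factor_scores.dropna().index.intersection(forward_returns.dropna().index)
if len(common) < 3:    # min_observations guard
    ic = float("nan")
else:
    ic = factor_scores[common].corr(forward_returns[common], method="spearman")
```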
compute_newey_west_tstat(ic_series, n_lags=6)
¶
compute_quantile_spread(factor_scores, forward_returns, n_quantiles=5)
¶
compute_vif(factor_matrix)
¶
correct_pvalues(p_values, alpha=0.05)
¶
Apply Holm-Bonferroni and Benjamini-Hochberg multiple testing corrections.
Parameters¶
p_values : ndarray, shape (m,)
Raw p-values in any order.
alpha : float
Significance level used to compute the adjustments (does not filter
here; callers compare adjusted p-values against alpha).
Returns¶
CorrectedPValues
holm — FWER-controlling Holm-Bonferroni adjusted p-values.
bh — FDR-controlling Benjamini-Hochberg adjusted p-values.
Both arrays are returned in the same order as the input.
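A plain-numpy sketch of the Benjamini-Hochberg half of the adjustment (the Holm branch follows the same pattern with a forward cumulative maximum of p_i * (m - i + 1)); `bh_adjust` is an illustrative stand-in, not the library function:

```python
import numpy as np

def bh_adjust(p_values: np.ndarray) -> np.ndarray:
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)
    # Scale sorted p-values by m / rank.
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity from the largest p downwards, cap at 1.
    adjusted = np.minimum.accumulate(ranked[::-1])[::-1].clip(max=1.0)
    out = np.empty(m)
    out[order] = adjusted
    return out
```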
run_factor_validation(factor_scores_history, returns_history, config=None, composite_scores_history=None)
¶
Run complete factor validation suite.
Parameters¶
factor_scores_history : dict[str, pd.DataFrame]
Factor name -> (dates x tickers) score history.
returns_history : pd.DataFrame
Dates x tickers forward return matrix.
config : FactorValidationConfig or None
Validation parameters.
composite_scores_history : pd.DataFrame or None
Dates x tickers matrix of composite scores. When provided, IC
analysis is run on the composite signal and compared against
the best individual factor IC.
Returns¶
FactorValidationReport
Complete validation results.
validate_factor_universe(ic_matrix, lags=5, alpha=0.05)
¶
Validate all factors simultaneously with multiple testing correction.
Parameters¶
ic_matrix : pd.DataFrame
Dates × factors matrix of IC values (one IC per period per
factor).
lags : int
Number of Newey-West HAC lags.
alpha : float
Significance level for both FWER and FDR rejection decisions.
Returns¶
pd.DataFrame
Factor × statistic summary with columns:
ic_mean, icir, t_stat_nw, p_value_raw,
p_value_holm, p_value_bh, significant_holm,
significant_bh.