Factor Research¶
Comprehensive guide to the factors module. This module provides a complete factor research pipeline from raw fundamentals to optimization-ready inputs, covering 17 individual factors across 9 factor groups. Every component follows the same pattern: a frozen `@dataclass` config, factory functions, and `(str, Enum)` types.
Pipeline Overview¶
The factor pipeline is a sequential workflow where each stage transforms the output of the previous one:
fundamentals --> construction --> standardization --> scoring -->
selection --> regime tilts --> validation --> integration
| Stage | Input | Output | Key Function |
|---|---|---|---|
| Construction | Fundamentals, prices, volume | Raw factor scores (`pd.DataFrame`) | `compute_all_factors()` |
| Standardization | Raw scores, sector labels | Standardized scores + coverage | `standardize_all_factors()` |
| Scoring | Standardized scores, IC history | Composite score per ticker (`pd.Series`) | `compute_composite_score()` |
| Selection | Composite scores | Selected tickers (`pd.Index`) | `select_stocks()` |
| Regime Tilts | Group weights, macro data | Tilted group weights | `apply_regime_tilts()` |
| Validation | Score history, return history | `FactorValidationReport` | `run_factor_validation()` |
| Integration | Scores, premia, weights | Constraints, views, net alpha | `build_factor_exposure_constraints()` |
Factor Taxonomy¶
FactorType (17 factors)¶
Each factor is computed from one of four data sources: fundamental data, price history, volume history, or alternative data (analyst/insider).
| Factor | Enum Value | Group | Data Source | Formula |
|---|---|---|---|---|
| Book-to-Price | `BOOK_TO_PRICE` | Value | Fundamentals | `book_value / market_cap` |
| Earnings Yield | `EARNINGS_YIELD` | Value | Fundamentals | `net_income / market_cap` |
| Cash Flow Yield | `CASH_FLOW_YIELD` | Value | Fundamentals | `operating_cashflow / market_cap` |
| Sales-to-Price | `SALES_TO_PRICE` | Value | Fundamentals | `total_revenue / market_cap` |
| EBITDA-to-EV | `EBITDA_TO_EV` | Value | Fundamentals | `ebitda / enterprise_value` |
| Gross Profitability | `GROSS_PROFITABILITY` | Profitability | Fundamentals | `gross_profit / total_assets` (Novy-Marx) |
| ROE | `ROE` | Profitability | Fundamentals | `net_income / total_equity` |
| Operating Margin | `OPERATING_MARGIN` | Profitability | Fundamentals | `operating_income / total_revenue` |
| Profit Margin | `PROFIT_MARGIN` | Profitability | Fundamentals | `net_income / total_revenue` |
| Asset Growth | `ASSET_GROWTH` | Investment | Fundamentals | -YoY total asset growth (sign-flipped) |
| Momentum (12-1) | `MOMENTUM_12_1` | Momentum | Prices | 12-month return skipping most recent month |
| Volatility | `VOLATILITY` | Low Risk | Prices | -annualized std (sign-flipped, lower = better) |
| Beta | `BETA` | Low Risk | Prices | -market beta (sign-flipped, lower = better) |
| Amihud Illiquidity | `AMIHUD_ILLIQUIDITY` | Liquidity | Prices + Volume | avg(\|return\| / dollar_volume) |
| Dividend Yield | `DIVIDEND_YIELD` | Dividend | Fundamentals | trailing annual dividend yield |
| Recommendation Change | `RECOMMENDATION_CHANGE` | Sentiment | Analyst data | net upgrades - downgrades |
| Net Insider Buying | `NET_INSIDER_BUYING` | Ownership | Insider data | purchases - sales (shares) |
Sign Conventions
Volatility, beta, and asset growth are sign-flipped so that higher values always indicate a more favorable factor exposure. For volatility and beta, lower raw values are better (less risk), so the sign is negated. For asset growth, conservative investment (lower growth) is favorable per the Hou-Xue-Zhang investment factor, so the sign is negated.
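The sign-flip convention can be illustrated with a small standalone sketch (hypothetical tickers and simulated returns, plain pandas rather than the module's own construction code):

```python
import numpy as np
import pandas as pd

# Hypothetical daily returns for two tickers; AAA is the riskier one
rng = np.random.default_rng(0)
returns = pd.DataFrame({
    "AAA": rng.normal(0, 0.03, 252),  # high volatility
    "BBB": rng.normal(0, 0.01, 252),  # low volatility
})

# Raw volatility: annualized standard deviation of daily returns
raw_vol = returns.std() * np.sqrt(252)

# Sign-flipped factor: higher score = lower risk = more favorable exposure
volatility_factor = -raw_vol

# The low-volatility stock now carries the higher factor score
assert volatility_factor["BBB"] > volatility_factor["AAA"]
```

The same negation applies to beta and asset growth, so that "higher is better" holds uniformly across all 17 factors.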
FactorGroupType (9 groups)¶
Factors are organized into groups for hierarchical aggregation during composite scoring.
| Group | Enum Value | Weight Tier | Member Factors |
|---|---|---|---|
| Value | `VALUE` | CORE | `BOOK_TO_PRICE`, `EARNINGS_YIELD`, `CASH_FLOW_YIELD`, `SALES_TO_PRICE`, `EBITDA_TO_EV` |
| Profitability | `PROFITABILITY` | CORE | `GROSS_PROFITABILITY`, `ROE`, `OPERATING_MARGIN`, `PROFIT_MARGIN` |
| Momentum | `MOMENTUM` | CORE | `MOMENTUM_12_1` |
| Low Risk | `LOW_RISK` | CORE | `VOLATILITY`, `BETA` |
| Investment | `INVESTMENT` | SUPPLEMENTARY | `ASSET_GROWTH` |
| Liquidity | `LIQUIDITY` | SUPPLEMENTARY | `AMIHUD_ILLIQUIDITY` |
| Dividend | `DIVIDEND` | SUPPLEMENTARY | `DIVIDEND_YIELD` |
| Sentiment | `SENTIMENT` | SUPPLEMENTARY | `RECOMMENDATION_CHANGE` |
| Ownership | `OWNERSHIP` | SUPPLEMENTARY | `NET_INSIDER_BUYING` |
The GROUP_WEIGHT_TIER mapping assigns each group to either CORE or SUPPLEMENTARY. Core groups receive core_weight (default 1.0) and supplementary groups receive supplementary_weight (default 0.5) during composite scoring, reflecting the stronger empirical evidence behind core factors.
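The effect of the tier weights can be sketched in a few lines (illustrative only, with hypothetical lowercase group names; not the library's internals):

```python
# Core groups get core_weight (1.0), supplementary groups get
# supplementary_weight (0.5); weights are then normalized to sum to 1.
CORE = {"value", "profitability", "momentum", "low_risk"}

def tiered_weights(groups, core_weight=1.0, supplementary_weight=0.5):
    raw = {g: (core_weight if g in CORE else supplementary_weight) for g in groups}
    total = sum(raw.values())
    return {g: w / total for g, w in raw.items()}

weights = tiered_weights(["value", "momentum", "dividend"])
# Two core groups at 1.0 and one supplementary at 0.5 -> 0.4, 0.4, 0.2
```

A supplementary group therefore contributes half as much as a core group before normalization, regardless of how many groups are present.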
1. Construction¶
Factor construction computes raw factor scores from fundamentals, prices, volume, analyst data, and insider data. All construction respects point-in-time alignment to prevent look-ahead bias.
FactorConstructionConfig¶
| Field | Type | Default | Description |
|---|---|---|---|
| `factors` | `tuple[FactorType, ...]` | 8 core factors | Which factors to compute |
| `momentum_lookback` | `int` | `252` | Lookback window for momentum (trading days) |
| `momentum_skip` | `int` | `21` | Recent days to skip for momentum (reversal avoidance) |
| `volatility_lookback` | `int` | `252` | Lookback window for volatility (trading days) |
| `beta_lookback` | `int` | `252` | Lookback window for beta estimation (trading days) |
| `amihud_lookback` | `int` | `252` | Lookback window for Amihud illiquidity (trading days) |
| `publication_lag` | `PublicationLagConfig` | Default lags | Per-source publication lags for PIT correctness |
The default factors tuple includes: BOOK_TO_PRICE, EARNINGS_YIELD, GROSS_PROFITABILITY, ROE, ASSET_GROWTH, MOMENTUM_12_1, VOLATILITY, DIVIDEND_YIELD.
Presets¶
from optimizer.factors import FactorConstructionConfig
# Core factors with strongest empirical support (8 factors, default)
config = FactorConstructionConfig.for_core_factors()
# All 17 factors
config = FactorConstructionConfig.for_all_factors()
PublicationLagConfig¶
Differentiated publication lags prevent look-ahead bias by ensuring that data is only used after it would realistically have been available.
| Field | Type | Default | Description |
|---|---|---|---|
| `annual_days` | `int` | `90` | Lag for annual financial statements (10-K filing) |
| `quarterly_days` | `int` | `45` | Lag for quarterly financial statements (10-Q filing) |
| `analyst_days` | `int` | `5` | Lag for analyst estimates and recommendations |
| `macro_days` | `int` | `63` | Lag for macroeconomic indicators (release + revision lag) |
from optimizer.factors import PublicationLagConfig
# Uniform lag across all sources
lag = PublicationLagConfig.uniform(days=60)
# Custom per-source lags
lag = PublicationLagConfig(
annual_days=120,
quarterly_days=60,
analyst_days=2,
macro_days=45,
)
Backward Compatibility
FactorConstructionConfig accepts a plain int for publication_lag, which is automatically converted to PublicationLagConfig.uniform(int_value).
Point-in-Time Alignment¶
The align_to_pit() function filters time-series data to records that would have been published on or before a given computation date. For each ticker, it returns the most recent available record.
from optimizer.factors import align_to_pit
# Get the most recent fundamentals available as of 2024-06-30,
# accounting for a 90-day publication lag
pit_data = align_to_pit(
data=fundamentals_df,
period_date_col="fiscal_period_end",
as_of_date="2024-06-30",
lag_days=90,
ticker_col="ticker",
)
A record with period end date D is considered published lag_days calendar days after D. The function returns a cross-sectional view (one row per ticker) containing only the latest record for which D + lag_days <= as_of_date.
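The rule can be reproduced with a minimal pandas sketch (illustrative only; `align_to_pit_sketch` is a hypothetical stand-in, not the library implementation):

```python
import pandas as pd

# Keep, per ticker, the latest record whose publication date
# (period end + lag_days) falls on or before as_of_date.
def align_to_pit_sketch(data, period_date_col, as_of_date, lag_days, ticker_col):
    published = pd.to_datetime(data[period_date_col]) + pd.Timedelta(days=lag_days)
    eligible = data[published <= pd.Timestamp(as_of_date)]
    latest = eligible.sort_values(period_date_col).groupby(ticker_col).tail(1)
    return latest.set_index(ticker_col)

fundamentals = pd.DataFrame({
    "ticker": ["AAA", "AAA", "BBB"],
    "fiscal_period_end": ["2023-12-31", "2024-03-31", "2024-03-31"],
    "net_income": [10.0, 12.0, 5.0],
})

# 2024-03-31 + 90 days = 2024-06-29, so Q1 records are visible on 2024-06-30
pit = align_to_pit_sketch(fundamentals, "fiscal_period_end", "2024-06-30", 90, "ticker")
# AAA resolves to the 2024-03-31 record (net_income 12.0), not the stale Q4 one
```

Shortening the lag or moving `as_of_date` earlier would instead surface the 2023-12-31 record for AAA, which is exactly the look-ahead protection the lag provides.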
Computing Factors¶
from optimizer.factors import compute_all_factors, compute_factor, FactorConstructionConfig, FactorType
# Compute all configured factors at once
config = FactorConstructionConfig.for_all_factors()
raw_factors = compute_all_factors(
fundamentals=fundamentals_df, # Cross-sectional, indexed by ticker
price_history=price_df, # Dates x tickers matrix
volume_history=volume_df, # Dates x tickers matrix
analyst_data=analyst_df, # Optional
insider_data=insider_df, # Optional
config=config,
)
# raw_factors: pd.DataFrame with tickers as rows, factor names as columns
# Compute a single factor
momentum = compute_factor(
factor_type=FactorType.MOMENTUM_12_1,
fundamentals=fundamentals_df,
price_history=price_df,
config=config,
)
Data Requirements
- `fundamentals` must be a cross-sectional DataFrame indexed by ticker with columns matching the factor formulas (e.g., `market_cap`, `book_value`, `net_income`).
- `price_history` must be a dates x tickers DataFrame. Momentum requires at least `momentum_lookback` rows of data.
- `volume_history` is only required for `AMIHUD_ILLIQUIDITY`. If `None`, that factor returns an empty Series.
- `analyst_data` is only required for `RECOMMENDATION_CHANGE`. It must contain either a `recommendation_change` column or `strong_buy`/`buy`/`sell`/`strong_sell` counts.
- `insider_data` is only required for `NET_INSIDER_BUYING`. It must contain `shares`, `ticker`, and optionally `transaction_type` columns.
2. Standardization¶
Cross-sectional standardization transforms raw factor scores into comparable, well-behaved distributions suitable for aggregation. The pipeline is: winsorize --> z-score or rank-normal --> sector neutralize --> optional re-standardization.
StandardizationConfig¶
| Field | Type | Default | Description |
|---|---|---|---|
| `method` | `StandardizationMethod` | `Z_SCORE` | Z-score or rank-normal standardization |
| `winsorize_lower` | `float` | `0.01` | Lower percentile for winsorization (0-1) |
| `winsorize_upper` | `float` | `0.99` | Upper percentile for winsorization (0-1) |
| `neutralize_sector` | `bool` | `True` | Whether to sector-neutralize scores |
| `neutralize_country` | `bool` | `False` | Whether to country-neutralize scores |
| `re_standardize_after_neutralization` | `bool` | `False` | Re-apply z-score after neutralization |
StandardizationMethod¶
| Value | Description | Best For |
|---|---|---|
| `Z_SCORE` | `(x - mean) / std` | Approximately normal factors (e.g., momentum) |
| `RANK_NORMAL` | `Phi^-1((rank - 0.5) / N)` inverse normal transform | Heavy-tailed distributions (e.g., value ratios) |
Presets¶
from optimizer.factors import StandardizationConfig
# Rank-normal for heavy-tailed distributions (value ratios, illiquidity)
config = StandardizationConfig.for_heavy_tailed()
# Z-score for approximately normal factors (momentum, profitability)
config = StandardizationConfig.for_normal()
Standardization Pipeline Steps¶
Step 1: Winsorize¶
from optimizer.factors import winsorize_cross_section
# Clip extremes at the 1st and 99th percentiles
clipped = winsorize_cross_section(raw_scores, lower_pct=0.01, upper_pct=0.99)
Step 2: Z-Score or Rank-Normal¶
from optimizer.factors import z_score_standardize, rank_normal_standardize
# Z-score: mean 0, std 1
z_scored = z_score_standardize(clipped)
# Rank-normal: maps ranks to normal distribution, robust to outliers
rank_normed = rank_normal_standardize(clipped)
Step 3: Sector Neutralize¶
from optimizer.factors import neutralize_sector
# Demean scores within each sector
neutral = neutralize_sector(
scores=z_scored,
sector_labels=sector_series, # pd.Series: ticker -> sector
country_labels=country_series, # Optional: ticker -> country
)
Sector neutralization removes sector-level biases so that the factor captures stock-level characteristics rather than sector membership. When both neutralize_sector and neutralize_country are enabled, the function creates sector-country interaction groups (e.g., "Technology_US") and demeans within each.
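The demeaning itself is a one-liner in pandas; this sketch mirrors the behavior described above (illustrative, not necessarily the library's exact implementation):

```python
import pandas as pd

# Hypothetical standardized scores and sector labels
scores = pd.Series({"AAA": 1.0, "BBB": 0.0, "CCC": 2.0, "DDD": 1.0})
sectors = pd.Series({"AAA": "Tech", "BBB": "Tech", "CCC": "Energy", "DDD": "Energy"})

# Subtract each ticker's sector mean from its score
neutral = scores - scores.groupby(sectors).transform("mean")
# Tech mean 0.5, Energy mean 1.5 -> AAA: 0.5, BBB: -0.5, CCC: 0.5, DDD: -0.5
```

After neutralization every sector averages to zero, so a cheap sector can no longer dominate a value factor through sector membership alone.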
Full Standardization¶
from optimizer.factors import standardize_all_factors, StandardizationConfig
config = StandardizationConfig(
method=StandardizationMethod.RANK_NORMAL,
neutralize_sector=True,
)
standardized, coverage = standardize_all_factors(
raw_factors=raw_factors, # Tickers x factors DataFrame
config=config,
sector_labels=sector_series, # pd.Series: ticker -> sector
)
# standardized: pd.DataFrame of standardized scores
# coverage: pd.DataFrame (boolean) indicating non-NaN values
PCA Orthogonalization¶
To eliminate multicollinearity among factor scores, orthogonalize_factors() projects the scores onto their principal components:
from optimizer.factors import orthogonalize_factors
# Retain components explaining >= 95% of variance
orthogonal = orthogonalize_factors(
factor_scores=standardized,
method="pca",
min_variance_explained=0.95,
)
# orthogonal: pd.DataFrame with columns PC1, PC2, ...
Orthogonalization Limitations
- Only `"pca"` is supported as the method. Other values raise `ConfigurationError`.
- Requires at least 2 factors and 2 non-NaN observations.
- Rows with NaN in the input produce NaN in the output but preserve the index.
- After orthogonalization, factor scores lose their economic interpretation (they become statistical principal components).
3. Composite Scoring¶
Composite scoring aggregates standardized factor scores into a single composite score per ticker. The process is hierarchical: factors are first averaged within their group, then group scores are combined using configurable weighting schemes.
CompositeScoringConfig¶
| Field | Type | Default | Description |
|---|---|---|---|
| `method` | `CompositeMethod` | `EQUAL_WEIGHT` | Scoring method |
| `ic_lookback` | `int` | `36` | Number of periods for IC estimation (IC/ICIR methods) |
| `core_weight` | `float` | `1.0` | Relative weight for CORE factor groups |
| `supplementary_weight` | `float` | `0.5` | Relative weight for SUPPLEMENTARY factor groups |
| `ridge_alpha` | `float` | `1.0` | L2 regularization strength for RIDGE_WEIGHTED |
| `gbt_max_depth` | `int` | `3` | Maximum tree depth for GBT_WEIGHTED |
| `gbt_n_estimators` | `int` | `50` | Number of boosting rounds for GBT_WEIGHTED |
CompositeMethod¶
| Method | Description | Requirements | Strengths |
|---|---|---|---|
| `EQUAL_WEIGHT` | Core/supplementary tiered equal weighting | None | Robust, no estimation error |
| `IC_WEIGHTED` | Trailing IC magnitude as weights | `ic_history` | Adapts to recent predictive power |
| `ICIR_WEIGHTED` | `\|mean(IC) / std(IC)\|` as weights | `ic_history` | Penalizes inconsistent predictors |
| `RIDGE_WEIGHTED` | Ridge regression on historical returns | `training_scores`, `training_returns` | Captures linear factor interactions |
| `GBT_WEIGHTED` | Gradient-boosted trees on historical returns | `training_scores`, `training_returns` | Captures non-linear interactions |
Presets¶
from optimizer.factors import CompositeScoringConfig
config = CompositeScoringConfig.for_equal_weight()
config = CompositeScoringConfig.for_ic_weighted()
config = CompositeScoringConfig.for_icir_weighted()
config = CompositeScoringConfig.for_ridge_weighted()
config = CompositeScoringConfig.for_gbt_weighted()
Scoring Workflow¶
Step 1: Compute Group Scores¶
Group scores are the coverage-weighted mean of factor scores within each group:
from optimizer.factors import compute_group_scores
group_scores = compute_group_scores(standardized, coverage)
# group_scores: pd.DataFrame with tickers as rows, group names as columns
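One plausible reading of the coverage-weighted mean -- averaging only over the member factors that are actually covered (non-NaN) for each ticker -- can be sketched as follows (illustrative, not the library internals):

```python
import numpy as np
import pandas as pd

# Hypothetical standardized scores for two Value-group factors
value_factors = pd.DataFrame({
    "book_to_price": [0.5, np.nan],
    "earnings_yield": [1.0, 2.0],
}, index=["AAA", "BBB"])

# Mean over available factors only: AAA averages two scores, BBB uses one
group_score = value_factors.mean(axis=1, skipna=True)
# AAA -> 0.75, BBB -> 2.0
```

This keeps a ticker in the composite even when some of its member factors are missing, rather than dropping it or treating missing values as zero.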
Step 2: Compute Composite Score¶
from optimizer.factors import compute_composite_score, CompositeScoringConfig
# Equal-weight composite (simplest)
composite = compute_composite_score(
standardized_factors=standardized,
coverage=coverage,
)
# IC-weighted composite (requires IC history)
config = CompositeScoringConfig.for_ic_weighted()
composite = compute_composite_score(
standardized_factors=standardized,
coverage=coverage,
config=config,
ic_history=ic_df, # Periods x groups DataFrame of IC values
)
# ML composite (requires training data)
config = CompositeScoringConfig.for_ridge_weighted()
composite = compute_composite_score(
standardized_factors=standardized,
coverage=coverage,
config=config,
training_scores=historical_scores, # Historical tickers x factors
training_returns=forward_returns, # Forward return per ticker
)
Look-Ahead Bias in ML Scoring
For RIDGE_WEIGHTED and GBT_WEIGHTED, the training window must end strictly before the prediction date. The caller is responsible for ensuring temporal separation between training_scores and the current-period standardized_factors.
IC-Weighted Scoring Details¶
The IC-weighted method uses trailing Information Coefficient (Spearman rank correlation between factor scores and forward returns) to dynamically weight factor groups:
- Compute the mean IC over the trailing `ic_lookback` periods for each group
- Clamp negative ICs to zero (negative-IC groups should not contribute positively)
- Multiply by the core/supplementary tier weight
- Normalize to sum to 1

If all groups have negative or zero IC, the method falls back to equal-weight scoring.
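The four steps above can be sketched directly (illustrative, with hypothetical trailing mean ICs per group; not the library's internals):

```python
# Hypothetical trailing mean ICs and tier weights (core 1.0, supplementary 0.5)
mean_ic = {"value": 0.03, "momentum": 0.05, "dividend": -0.01}
tier_weight = {"value": 1.0, "momentum": 1.0, "dividend": 0.5}

# Clamp negative ICs to zero, multiply by tier weight, then normalize
raw = {g: max(ic, 0.0) * tier_weight[g] for g, ic in mean_ic.items()}
total = sum(raw.values())
if total > 0:
    weights = {g: w / total for g, w in raw.items()}
else:
    weights = {g: 1 / len(raw) for g in raw}  # fallback: equal weight

# dividend gets zero weight; value and momentum split in ratio 0.03 : 0.05
```

A group with a negative trailing IC is thus excluded for that period rather than shorted, which keeps the composite long-only in factor space.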
ICIR-Weighted Scoring Details¶
ICIR (Information Coefficient Information Ratio) penalizes factors that are inconsistent predictors:
A factor with high mean IC but also high IC volatility receives a lower weight than a factor with moderate but stable IC. Falls back to equal-weight when all groups have ICIR = 0.
ML Scoring Details¶
Both ML methods train a model on historical (factor_scores, forward_returns) pairs and predict on the current period. The raw predictions are standardized to zero mean and unit variance.
from optimizer.factors import fit_ridge_composite, fit_gbt_composite, predict_composite_scores
# Fit ridge regression
model = fit_ridge_composite(
scores=historical_scores,
forward_returns=forward_returns,
alpha=1.0,
)
# Or fit gradient-boosted trees
model = fit_gbt_composite(
scores=historical_scores,
forward_returns=forward_returns,
max_depth=3,
n_estimators=50,
)
# Predict on current-period scores
composite = predict_composite_scores(model, current_scores)
The FittedMLModel type alias covers both RidgeCV and GradientBoostingRegressor.
Regime-Tilted Scoring¶
When regime tilts are applied, group weights can be passed through to the scoring functions:
from optimizer.factors import (
classify_regime,
apply_regime_tilts,
compute_composite_score,
RegimeTiltConfig,
FactorGroupType,
)
# Classify regime
regime = classify_regime(macro_data)
# Compute tilted weights
base_weights = {
FactorGroupType.VALUE: 1.0,
FactorGroupType.MOMENTUM: 1.0,
FactorGroupType.LOW_RISK: 1.0,
FactorGroupType.PROFITABILITY: 1.0,
}
tilted = apply_regime_tilts(
base_weights, regime, RegimeTiltConfig.for_moderate_tilts()
)
# Convert to string keys for compute_composite_score
group_weights = {g.value: w for g, w in tilted.items()}
composite = compute_composite_score(
standardized, coverage, group_weights=group_weights,
)
4. Stock Selection¶
Stock selection filters the scored universe down to a target number of stocks, with mechanisms to reduce unnecessary turnover.
SelectionConfig¶
| Field | Type | Default | Description |
|---|---|---|---|
| `method` | `SelectionMethod` | `FIXED_COUNT` | Fixed-count or quantile-based selection |
| `target_count` | `int` | `100` | Number of stocks to select (for FIXED_COUNT) |
| `target_quantile` | `float` | `0.8` | Quantile threshold for entry (for QUANTILE, 0-1) |
| `exit_quantile` | `float` | `0.7` | Exit quantile for hysteresis (for QUANTILE) |
| `buffer_fraction` | `float` | `0.1` | Buffer zone fraction around selection boundary |
| `sector_balance` | `bool` | `True` | Whether to enforce sector-proportional representation |
| `sector_tolerance` | `float` | `0.03` | Maximum deviation from parent universe sector weights |
SelectionMethod¶
| Method | Description |
|---|---|
| `FIXED_COUNT` | Select top N stocks by composite score |
| `QUANTILE` | Select all stocks above a quantile threshold |
Presets¶
from optimizer.factors import SelectionConfig
# Top 100 stocks (default)
config = SelectionConfig.for_top_100()
# Top quintile (top 20%)
config = SelectionConfig.for_top_quintile()
# Concentrated portfolio of top 30
config = SelectionConfig.for_concentrated()
Buffer-Zone Hysteresis¶
Hysteresis prevents excessive turnover by creating a buffer zone around the selection boundary. Current members within the buffer are retained even if they would not qualify as new entrants.
Fixed-Count hysteresis: The top target_count stocks are always included. Current members ranking between target_count and target_count + buffer_fraction * target_count are retained.
from optimizer.factors import select_fixed_count
selected = select_fixed_count(
scores=composite_scores,
target_count=100,
buffer_fraction=0.1, # Buffer of 10 stocks
current_members=previous_selection, # pd.Index of previously selected tickers
)
Quantile hysteresis: New stocks must score above target_quantile (e.g., 80th percentile). Existing members survive as long as they stay above exit_quantile (e.g., 70th percentile).
from optimizer.factors import select_quantile
selected = select_quantile(
scores=composite_scores,
target_quantile=0.8, # Entry threshold
exit_quantile=0.7, # Exit threshold (lower = more sticky)
current_members=previous_selection,
)
Sector Balancing¶
When sector_balance=True, the selection is adjusted so that no sector is over- or under-represented relative to the parent universe by more than sector_tolerance:
from optimizer.factors import apply_sector_balance
balanced = apply_sector_balance(
selected=initial_selection,
scores=composite_scores,
sector_labels=sector_series,
parent_universe=full_universe,
tolerance=0.03,
)
Under-represented sectors gain their highest-scoring non-selected stocks. Over-represented sectors lose their lowest-scoring selected stocks.
Full Selection Pipeline¶
from optimizer.factors import select_stocks, SelectionConfig
config = SelectionConfig(
method=SelectionMethod.FIXED_COUNT,
target_count=100,
buffer_fraction=0.1,
sector_balance=True,
sector_tolerance=0.03,
)
# Without turnover tracking
selected = select_stocks(
scores=composite_scores,
config=config,
current_members=previous_selection,
sector_labels=sector_series,
parent_universe=full_universe,
)
# With turnover tracking
selected, turnover = select_stocks(
scores=composite_scores,
config=config,
current_members=previous_selection,
sector_labels=sector_series,
parent_universe=full_universe,
return_turnover=True,
)
Selection Turnover¶
from optimizer.factors import compute_selection_turnover
turnover = compute_selection_turnover(
current=previous_selection,
new=new_selection,
universe=full_universe,
)
# turnover = len(added | removed) / len(universe)
5. Regime Tilts¶
Regime tilts apply macro-economic regime-conditional adjustments to factor group weights. The system classifies the current macro environment and applies multiplicative tilts to emphasize factors with stronger expected performance in that regime.
MacroRegime¶
| Regime | Description | Factor Emphasis |
|---|---|---|
| `EXPANSION` | GDP above trend, accelerating | Momentum (1.2x), reduce Value/Low Risk |
| `SLOWDOWN` | GDP above trend, decelerating | Low Risk (1.3x), Dividend (1.2x), reduce Momentum |
| `RECESSION` | GDP below trend, decelerating | Low Risk (1.5x), Profitability (1.3x), Value (1.2x), reduce Momentum |
| `RECOVERY` | GDP below trend, accelerating | Value (1.3x), Momentum (1.2x), reduce Low Risk |
RegimeTiltConfig¶
| Field | Type | Default | Description |
|---|---|---|---|
| `enable` | `bool` | `False` | Whether to apply regime tilts |
| `expansion_tilts` | `tuple[tuple[str, float], ...]` | See defaults | Group tilts during expansion |
| `slowdown_tilts` | `tuple[tuple[str, float], ...]` | See defaults | Group tilts during slowdown |
| `recession_tilts` | `tuple[tuple[str, float], ...]` | See defaults | Group tilts during recession |
| `recovery_tilts` | `tuple[tuple[str, float], ...]` | See defaults | Group tilts during recovery |
Tilts are stored as tuples of (group_name, tilt_factor) for frozen-dataclass compatibility.
Presets¶
from optimizer.factors import RegimeTiltConfig
# Enable moderate tilts (uses the built-in tilt tables)
config = RegimeTiltConfig.for_moderate_tilts()
# Disable tilts (default)
config = RegimeTiltConfig.for_no_tilts()
Regime Classification¶
from optimizer.factors import classify_regime
regime = classify_regime(macro_data)
# macro_data: pd.DataFrame with date index and columns like
# 'gdp_growth', 'yield_spread', 'unemployment_rate'
The classification heuristic uses GDP growth as the primary signal:
- If `gdp_growth` is available with 2+ observations:
    - Rising unemployment with positive GDP overrides to `SLOWDOWN`
    - Current > trend and current > previous --> `EXPANSION`
    - Current > trend and current <= previous --> `SLOWDOWN`
    - Current <= trend and current <= previous --> `RECESSION`
    - Current <= trend and current > previous --> `RECOVERY`
- Fallback: `yield_spread` (10Y-2Y Treasury spread):
    - > 1.0 --> `EXPANSION`
    - > 0.0 --> `SLOWDOWN`
    - > -0.5 --> `RECOVERY`
    - <= -0.5 --> `RECESSION`
- Default: `EXPANSION`
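The decision tree above can be written out as a compact sketch (illustrative only; `classify_regime_sketch` is a hypothetical stand-in for the library's classifier):

```python
def classify_regime_sketch(gdp_growth, unemployment=None, yield_spread=None):
    # Primary signal: GDP growth vs. its trend (here, the simple mean)
    if gdp_growth is not None and len(gdp_growth) >= 2:
        current, previous = gdp_growth[-1], gdp_growth[-2]
        trend = sum(gdp_growth) / len(gdp_growth)
        # Rising unemployment with positive GDP overrides to SLOWDOWN
        if (unemployment is not None and len(unemployment) >= 2
                and unemployment[-1] > unemployment[-2] and current > 0):
            return "SLOWDOWN"
        if current > trend:
            return "EXPANSION" if current > previous else "SLOWDOWN"
        return "RECOVERY" if current > previous else "RECESSION"
    # Fallback: 10Y-2Y Treasury spread thresholds
    if yield_spread is not None:
        if yield_spread > 1.0:
            return "EXPANSION"
        if yield_spread > 0.0:
            return "SLOWDOWN"
        if yield_spread > -0.5:
            return "RECOVERY"
        return "RECESSION"
    return "EXPANSION"  # default

# GDP below trend but accelerating -> RECOVERY
regime = classify_regime_sketch([3.0, 2.5, 0.5, 1.0])
```

Note the ordering: the unemployment override is checked before the trend comparisons, and the yield-spread fallback only fires when GDP data is unavailable.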
Applying Tilts¶
from optimizer.factors import apply_regime_tilts, get_regime_tilts, FactorGroupType, MacroRegime
# Get the raw tilt dictionary for a regime
tilts = get_regime_tilts(MacroRegime.RECESSION)
# {FactorGroupType.LOW_RISK: 1.5, FactorGroupType.PROFITABILITY: 1.3, ...}
# Groups not listed receive a default tilt of 1.0
# Apply tilts to base group weights (with re-normalization)
base_weights = {
FactorGroupType.VALUE: 1.0,
FactorGroupType.PROFITABILITY: 1.0,
FactorGroupType.MOMENTUM: 1.0,
FactorGroupType.LOW_RISK: 1.0,
}
tilted = apply_regime_tilts(
group_weights=base_weights,
regime=MacroRegime.RECESSION,
config=RegimeTiltConfig.for_moderate_tilts(),
)
Re-Normalization
After applying multiplicative tilts, the total weight is re-normalized to preserve the original total. This ensures that tilts only change the relative allocation between groups, not the overall magnitude.
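The re-normalization is a single rescaling step; this sketch uses hypothetical recession tilt values (the reduce-Momentum multiplier is invented for illustration):

```python
# Equal base weights across four core groups
base = {"value": 1.0, "profitability": 1.0, "momentum": 1.0, "low_risk": 1.0}
# Hypothetical recession tilts; groups not listed would default to 1.0
tilts = {"low_risk": 1.5, "profitability": 1.3, "value": 1.2, "momentum": 0.7}

# Apply multiplicative tilts, then rescale so the total weight is unchanged
tilted = {g: w * tilts.get(g, 1.0) for g, w in base.items()}
scale = sum(base.values()) / sum(tilted.values())
tilted = {g: w * scale for g, w in tilted.items()}

# Total preserved; only the relative allocation between groups changed
assert abs(sum(tilted.values()) - sum(base.values())) < 1e-12
```

Without the rescaling step, a regime whose tilts are mostly above 1.0 would silently inflate every group's contribution to the composite.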
Disabled by Default
RegimeTiltConfig.enable defaults to False. When enable=False, apply_regime_tilts() returns a copy of the original weights unchanged. You must explicitly use RegimeTiltConfig.for_moderate_tilts() or set enable=True.
6. Validation¶
Factor validation assesses the statistical significance and economic value of factors before deploying them in production.
FactorValidationConfig¶
| Field | Type | Default | Description |
|---|---|---|---|
| `newey_west_lags` | `int` | `6` | Number of lags for Newey-West HAC standard errors |
| `t_stat_threshold` | `float` | `2.0` | Minimum absolute t-statistic for significance |
| `fdr_alpha` | `float` | `0.05` | False discovery rate alpha level |
| `n_quantiles` | `int` | `5` | Number of quantiles for spread analysis |
| `fmp_top_pct` | `float` | `0.2` | Top percentile for factor-mimicking portfolios |
| `fmp_bottom_pct` | `float` | `0.2` | Bottom percentile for factor-mimicking portfolios |
Presets¶
from optimizer.factors import FactorValidationConfig
# Standard validation
config = FactorValidationConfig.for_standard()
# Strict validation (t > 3.0, FDR alpha = 1%)
config = FactorValidationConfig.for_strict()
Information Coefficient (IC) Analysis¶
The Information Coefficient is the Spearman rank correlation between factor scores and subsequent forward returns. A positive IC indicates that higher factor scores predict higher returns.
from optimizer.factors import compute_monthly_ic, compute_ic_series, compute_icir, compute_ic_stats
# Single-period IC
ic = compute_monthly_ic(factor_scores, forward_returns)
# IC time series (one IC per date)
ic_series = compute_ic_series(
factor_scores_history=scores_df, # Dates x tickers matrix
returns_history=returns_df, # Dates x tickers matrix
factor_name="book_to_price",
)
# ICIR: mean(IC) / std(IC)
icir = compute_icir(ic_series)
# Full IC statistics with Newey-West inference
stats = compute_ic_stats(ic_series, lags=5)
# stats.mean, stats.variance_nw, stats.t_stat_nw, stats.p_value, stats.icir
Newey-West t-Statistic¶
The Newey-West HAC (heteroscedasticity and autocorrelation consistent) estimator provides robust standard errors for IC significance testing, accounting for the serial correlation inherent in overlapping IC measurements.
from optimizer.factors import compute_newey_west_tstat
t_stat, p_value = compute_newey_west_tstat(ic_series, n_lags=6)
The variance estimator uses Bartlett kernel weights: sigma^2_NW = gamma_0 + 2 * sum_{j=1}^{L} (1 - j / (L + 1)) * gamma_j, where gamma_j = E[(IC_t - mean)(IC_{t-j} - mean)] and L is the number of lags. The t-statistic is then mean(IC) / sqrt(sigma^2_NW / T) for T IC observations.
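A direct numpy sketch of the Bartlett-weighted variance of the mean IC (illustrative; the library's estimator may differ in small-sample scaling details):

```python
import numpy as np

def newey_west_variance(x, n_lags):
    # Long-run variance with Bartlett weights, divided by T to get
    # the variance of the sample mean
    x = np.asarray(x, dtype=float)
    t = len(x)
    demeaned = x - x.mean()
    gamma0 = np.dot(demeaned, demeaned) / t  # lag-0 autocovariance
    var = gamma0
    for j in range(1, n_lags + 1):
        gamma_j = np.dot(demeaned[j:], demeaned[:-j]) / t
        var += 2.0 * (1.0 - j / (n_lags + 1)) * gamma_j  # Bartlett weight
    return var / t

rng = np.random.default_rng(0)
ic = rng.normal(0.03, 0.1, 120)  # simulated monthly IC series
var_nw = newey_west_variance(ic, n_lags=6)
t_stat = ic.mean() / np.sqrt(var_nw)
```

With serially correlated ICs the Bartlett-weighted variance exceeds the naive `var(x)/T`, which is exactly why plain t-statistics overstate factor significance.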
Multiple Testing Correction¶
When testing multiple factors simultaneously, p-values must be corrected for multiple comparisons.
from optimizer.factors import correct_pvalues, benjamini_hochberg
import numpy as np
# Holm-Bonferroni (FWER) + Benjamini-Hochberg (FDR)
raw_pvalues = np.array([0.01, 0.04, 0.03, 0.15, 0.02])
corrected = correct_pvalues(raw_pvalues, alpha=0.05)
# corrected.holm: Holm-Bonferroni adjusted p-values (controls family-wise error rate)
# corrected.bh: Benjamini-Hochberg adjusted p-values (controls false discovery rate)
# Standalone BH correction (returns boolean series)
significant = benjamini_hochberg(p_values_series, alpha=0.05)
Variance Inflation Factor (VIF)¶
VIF detects multicollinearity among factors. A VIF above 10 indicates that the factor's variance is largely explained by other factors.
from optimizer.factors import compute_vif
vif = compute_vif(standardized_factors)
# pd.Series: VIF per factor (>= 1.0 by construction)
high_vif = vif[vif > 10] # Candidates for removal or merging
Quantile Spread Analysis¶
Quantile spreads measure the economic value of a factor by comparing returns across factor-sorted portfolios.
from optimizer.factors import compute_quantile_spread
# Single-period spread: top quantile return - bottom quantile return
spread = compute_quantile_spread(
factor_scores=scores_series,
forward_returns=returns_series,
n_quantiles=5,
)
Factor Spread Benchmarks¶
The module includes annualized long-short quintile spread benchmarks derived from academic literature (Fama-French, AQR, Novy-Marx):
| Group | Low | High |
|---|---|---|
| value | 2% | 6% |
| profitability | 2% | 5% |
| investment | 1% | 4% |
| momentum | 4% | 10% |
| low_risk | 1% | 4% |
| liquidity | 1% | 3% |
| dividend | 1% | 3% |
| sentiment | 0.5% | 2% |
| ownership | 0.5% | 2% |
Universe-Level Validation¶
validate_factor_universe() validates all factors simultaneously with Newey-West inference and multiple testing correction:
from optimizer.factors import validate_factor_universe
summary = validate_factor_universe(
ic_matrix=ic_matrix, # Dates x factors matrix of IC values
lags=5,
alpha=0.05,
)
# Returns pd.DataFrame with columns:
# ic_mean, icir, t_stat_nw, p_value_raw, p_value_holm, p_value_bh,
# significant_holm, significant_bh
Full Validation Report¶
from optimizer.factors import run_factor_validation, FactorValidationConfig
report = run_factor_validation(
factor_scores_history={
"book_to_price": scores_bp_df, # Dates x tickers per factor
"momentum_12_1": scores_mom_df,
},
returns_history=returns_df, # Dates x tickers forward returns
config=FactorValidationConfig.for_standard(),
)
# report.ic_results: list[ICResult] with per-factor IC, t-stat, p-value
# report.quantile_spreads: list[QuantileSpreadResult] with per-factor spreads
# report.significant_factors: list[str] (BH FDR-significant factors)
# report.significant_factors_holm: list[str] (Holm FWER-significant factors)
Out-of-Sample Validation¶
Rolling block or combinatorial purged cross-validation (CPCV) for out-of-sample factor assessment:
from optimizer.factors import run_factor_oos_validation, FactorOOSConfig
# Rolling block OOS
config = FactorOOSConfig(
train_months=36, # 3-year training window
val_months=12, # 1-year validation window
step_months=6, # Roll forward 6 months per fold
)
result = run_factor_oos_validation(
scores=panel_scores, # MultiIndex (date, ticker) x factors
returns=panel_returns, # MultiIndex (date, ticker) x return column
config=config,
)
# result.per_fold_ic: n_folds x factors DataFrame of mean IC per fold
# result.per_fold_spread: n_folds x factors DataFrame of mean spread per fold
# result.mean_oos_ic: pd.Series of mean OOS IC per factor
# result.mean_oos_icir: pd.Series of OOS ICIR per factor
# result.n_folds: int
FactorOOSConfig¶
| Field | Type | Default | Description |
|---|---|---|---|
| `train_months` | `int` | `36` | Length of the training window in months |
| `val_months` | `int` | `12` | Length of the validation window in months |
| `step_months` | `int` | `6` | Number of months to roll forward between folds |
CPCV Mode¶
When a CPCVConfig is provided, CPCV is used instead of rolling blocks. CPCV generates all C(n_folds, n_test_folds) combinations with purging and embargo at train-test boundaries:
from optimizer.validation import CPCVConfig
cpcv = CPCVConfig(
n_folds=10,
n_test_folds=2,
purged_size=3,
embargo_size=5,
)
result = run_factor_oos_validation(
scores=panel_scores,
returns=panel_returns,
cpcv_config=cpcv, # Overrides config when provided
)
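The number of train/test combinations grows quickly with `n_folds`. A quick check of the C(n_folds, n_test_folds) count for the config above:

```python
from math import comb

n_folds, n_test_folds = 10, 2
n_splits = comb(n_folds, n_test_folds)
print(n_splits)  # 45 combinations, each purged and embargoed at the boundaries
```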
Input Format for OOS Validation
scores must have a two-level row MultiIndex (date, ticker) with one column per factor. returns must have the same MultiIndex with a single return column.
7. Diagnostics¶
Diagnostic tools for assessing factor quality, redundancy, and data integrity.
PCA Analysis¶
from optimizer.factors import compute_factor_pca
pca_result = compute_factor_pca(
scores=standardized_factors,
n_components=None, # Keep all components
)
# pca_result.explained_variance_ratio: ndarray of variance per component
# pca_result.loadings: pd.DataFrame (factors x PCs) -- PCA loading matrix
# pca_result.n_components_95pct: smallest number of components capturing >= 95% of variance
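The reported quantities can be reproduced from first principles with an SVD of the centered score matrix. A self-contained sketch (not the library's implementation) using a synthetic panel with one nearly redundant factor:

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.standard_normal((500, 6))  # 500 tickers x 6 factors
scores[:, 5] = 0.9 * scores[:, 0] + 0.1 * rng.standard_normal(500)

# SVD of the centered matrix yields the same quantities PCA reports
X = scores - scores.mean(axis=0)
_, s, _ = np.linalg.svd(X, full_matrices=False)
explained_variance_ratio = s**2 / (s**2).sum()
n_components_95pct = int(np.searchsorted(
    np.cumsum(explained_variance_ratio), 0.95) + 1)
# The near-duplicate factor collapses the effective dimensionality below 6
```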
Redundant Factor Detection¶
from optimizer.factors import flag_redundant_factors
redundant = flag_redundant_factors(
scores=standardized_factors,
vif_threshold=10.0, # VIF cutoff (5 = conservative, 10 = standard)
)
# redundant: list[str] of factor names with VIF > threshold
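For intuition, VIF_j is 1 / (1 - R²_j), where R²_j comes from regressing factor j on all the others. A standalone sketch (`variance_inflation_factors` is a hypothetical helper, not part of the module):

```python
import numpy as np
import pandas as pd

def variance_inflation_factors(scores: pd.DataFrame) -> pd.Series:
    """VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing factor j
    on all other factors (plus an intercept)."""
    vifs = {}
    for col in scores.columns:
        y = scores[col].to_numpy()
        X = np.column_stack([np.ones(len(scores)),
                             scores.drop(columns=col).to_numpy()])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        r_squared = 1.0 - resid.var() / y.var()
        vifs[col] = 1.0 / max(1.0 - r_squared, 1e-12)
    return pd.Series(vifs)

rng = np.random.default_rng(1)
scores = pd.DataFrame(rng.standard_normal((400, 3)), columns=["a", "b", "c"])
scores["d"] = 0.95 * scores["a"] + 0.05 * rng.standard_normal(400)
vifs = variance_inflation_factors(scores)
redundant = sorted(vifs[vifs > 10.0].index)  # the collinear pair a, d
```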
Survivorship Bias Check¶
from optimizer.factors import check_survivorship_bias
has_bias = check_survivorship_bias(
returns=returns_df,
final_periods=12, # Inspect last 12 periods
zero_threshold=1e-10,
)
# True if no assets have near-zero returns in the tail (potential survivorship bias)
The heuristic is simple: if no asset appears to have stopped trading (near-zero returns in the final periods), the dataset may exclude delisted or failed companies. A UserWarning is emitted when survivorship bias is suspected.
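The heuristic can be sketched in a few lines (`tail_is_all_active` is a hypothetical re-implementation of the described check, not the module's code):

```python
import numpy as np
import pandas as pd

def tail_is_all_active(returns: pd.DataFrame,
                       final_periods: int = 12,
                       zero_threshold: float = 1e-10) -> bool:
    """True when every asset still shows non-zero returns in the tail,
    i.e. no asset appears delisted -- a possible survivorship-bias sign."""
    tail = returns.tail(final_periods)
    stopped = (tail.abs() < zero_threshold).all(axis=0)  # per-asset flag
    return not stopped.any()

rng = np.random.default_rng(2)
rets = pd.DataFrame(0.01 * rng.standard_normal((100, 4)), columns=list("ABCD"))
before = tail_is_all_active(rets)        # True: everyone trades to the end
rets.loc[rets.index[-12:], "D"] = 0.0    # simulate a delisted asset
after = tail_is_all_active(rets)         # False: D stopped trading
```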
8. Mimicking Portfolios¶
Factor-mimicking portfolios are long-short portfolios designed to isolate pure factor exposure. They are used for factor premium estimation, validation, and cross-factor correlation analysis.
Building Mimicking Portfolios¶
from optimizer.factors import build_factor_mimicking_portfolios
fmp_returns = build_factor_mimicking_portfolios(
scores=scores_df, # Dates x assets matrix for one factor
returns=returns_df, # Dates x assets return matrix
quantile=0.30, # 30% in each leg
weighting="equal", # "equal" or "value"
)
# fmp_returns: pd.DataFrame with column "factor_return"
For each date, the top quantile fraction of assets (by factor score) are held long and the bottom quantile fraction are held short. The function processes one factor at a time. For multiple factors, call once per factor and concatenate:
import pandas as pd
from optimizer.factors import build_factor_mimicking_portfolios
factor_returns = pd.concat([
build_factor_mimicking_portfolios(scores_value, returns)
.rename(columns={"factor_return": "value"}),
build_factor_mimicking_portfolios(scores_mom, returns)
.rename(columns={"factor_return": "momentum"}),
], axis=1)
Beta-Neutral Mimicking Portfolios¶
When beta_neutral=True, the hedge ratio adjusts the short-leg weight to approximate zero market beta exposure:
fmp_returns = build_factor_mimicking_portfolios(
scores=scores_df,
returns=returns_df,
quantile=0.30,
beta_neutral=True,
market_returns=market_series, # Required when beta_neutral=True
)
The hedge ratio is computed as beta_long / beta_short, where each beta is the OLS regression coefficient of the leg returns against market returns.
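A standalone sketch of that computation (`hedge_ratio` is a hypothetical helper; the module computes this internally):

```python
import numpy as np
import pandas as pd

def hedge_ratio(long_leg: pd.Series, short_leg: pd.Series,
                market: pd.Series) -> float:
    """beta_long / beta_short, each beta an OLS slope vs the market."""
    def beta(leg: pd.Series) -> float:
        cov = np.cov(leg, market)  # 2x2 covariance matrix
        return cov[0, 1] / cov[1, 1]
    return beta(long_leg) / beta(short_leg)

rng = np.random.default_rng(3)
market = pd.Series(0.01 * rng.standard_normal(250))
long_leg = 1.2 * market + 0.002 * rng.standard_normal(250)   # beta ~ 1.2
short_leg = 0.8 * market + 0.002 * rng.standard_normal(250)  # beta ~ 0.8
h = hedge_ratio(long_leg, short_leg, market)
# h ~ 1.5: scale the short leg up to offset the long leg's market beta
```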
Quintile Spread Analysis¶
from optimizer.factors import compute_quintile_spread
result = compute_quintile_spread(
scores=scores_df, # Dates x assets factor scores
returns=returns_df, # Dates x assets returns
n_quantiles=5,
)
# result.quintile_returns: pd.DataFrame (Dates x Q1..Q5) -- per-bucket returns
# result.spread_returns: pd.Series (Q5 - Q1) -- long-short spread
# result.annualised_mean: mean daily spread * 252
# result.t_stat: mean / (std / sqrt(T))
# result.sharpe: mean * sqrt(252) / std
Assets are ranked by factor score at each date and split into n_quantiles equal-count buckets. Q1 = lowest scores (short), Qn = highest scores (long).
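The bucketing step can be reproduced with `pd.qcut`. A self-contained sketch of the described procedure (not the module's implementation):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
dates = pd.date_range("2024-01-01", periods=3, freq="B")
assets = [f"T{i}" for i in range(10)]
scores = pd.DataFrame(rng.standard_normal((3, 10)), index=dates, columns=assets)
returns = pd.DataFrame(0.01 * rng.standard_normal((3, 10)),
                       index=dates, columns=assets)

labels = [f"Q{i}" for i in range(1, 6)]
rows = []
for date in dates:
    # Equal-count buckets per date; Q1 = lowest scores, Q5 = highest
    buckets = pd.qcut(scores.loc[date], q=5, labels=labels)
    rows.append(returns.loc[date].groupby(buckets, observed=True).mean())
quintile_returns = pd.DataFrame(rows, index=dates)
spread = quintile_returns["Q5"] - quintile_returns["Q1"]  # long-short spread
```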
Cross-Factor Correlation¶
from optimizer.factors import compute_cross_factor_correlation
corr_matrix = compute_cross_factor_correlation(factor_returns)
# pd.DataFrame: factors x factors Pearson correlation matrix
9. Integration with Optimization¶
The integration layer bridges factor scores and analytics to portfolio optimization inputs: expected returns, exposure constraints, Black-Litterman views, and net alpha.
FactorIntegrationConfig¶
| Field | Type | Default | Description |
|---|---|---|---|
| `risk_free_rate` | `float` | `0.04` | Annual risk-free rate |
| `market_risk_premium` | `float` | `0.05` | Annual equity risk premium |
| `use_black_litterman` | `bool` | `False` | Whether to generate BL views from factor scores |
| `exposure_lower_bound` | `float` | `-0.5` | Lower bound for factor exposure constraints |
| `exposure_upper_bound` | `float` | `0.5` | Upper bound for factor exposure constraints |
Presets¶
from optimizer.factors import FactorIntegrationConfig
# Direct factor score to expected return mapping
config = FactorIntegrationConfig.for_linear_mapping()
# Factor-based Black-Litterman views
config = FactorIntegrationConfig.for_black_litterman()
Factor Scores to Expected Returns¶
Convert factor Z-scores to expected returns via a linear model:
from optimizer.factors import factor_scores_to_expected_returns
expected_returns = factor_scores_to_expected_returns(
scores=group_scores, # Assets x factor-groups DataFrame
betas=market_betas, # pd.Series of CAPM beta per asset
factor_premiums={
"market": 0.05,
"value": 0.03,
"momentum": 0.04,
"profitability": 0.02,
},
risk_free_rate=0.02,
)
Assets missing from betas are treated as having a beta of 1.0. The "market" key provides the market premium; all other keys are matched against columns in scores.
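Under that description, the mapping is E[r_i] = rf + beta_i * premium_market + sum_g z_{i,g} * premium_g. A sketch consistent with it (`linear_expected_returns` is a hypothetical stand-in for the library function):

```python
import pandas as pd

def linear_expected_returns(scores: pd.DataFrame, betas: pd.Series,
                            factor_premiums: dict,
                            risk_free_rate: float = 0.02) -> pd.Series:
    market_premium = factor_premiums.get("market", 0.0)
    betas = betas.reindex(scores.index).fillna(1.0)  # missing beta -> 1.0
    expected = risk_free_rate + betas * market_premium
    for group, premium in factor_premiums.items():
        if group != "market" and group in scores.columns:
            expected = expected + scores[group] * premium
    return expected

scores = pd.DataFrame({"value": [1.0, -0.5], "momentum": [0.2, 0.8]},
                      index=["AAA", "BBB"])
betas = pd.Series({"AAA": 1.1})  # BBB absent, so it defaults to beta 1.0
er = linear_expected_returns(scores, betas,
                             {"market": 0.05, "value": 0.03, "momentum": 0.04})
# AAA: 0.02 + 1.1 * 0.05 + 1.0 * 0.03 + 0.2 * 0.04 = 0.113
```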
Factor Exposure Constraints¶
Build linear inequality constraints that limit portfolio factor exposure, ready for MeanRisk:
from optimizer.factors import build_factor_exposure_constraints
# Uniform bounds: all factors constrained to [-0.5, 0.5]
constraints = build_factor_exposure_constraints(
factor_scores=standardized,
bounds=(-0.5, 0.5),
)
# Per-factor bounds
constraints = build_factor_exposure_constraints(
factor_scores=standardized,
bounds={
"book_to_price": (-0.3, 0.3),
"momentum_12_1": (-0.5, 0.5),
"volatility": (-0.2, 0.2),
},
)
# Use with MeanRisk optimizer
from optimizer.optimization import MeanRiskConfig, build_mean_risk
model = build_mean_risk(
MeanRiskConfig.for_max_sharpe(),
factor_exposure_constraints=constraints,
)
The constraint encodes lb_g <= sum_i w_i * z_{i,g} <= ub_g as the pair left_inequality @ w <= right_inequality (two rows per factor: one for the lower bound, one for the upper bound).
Feasibility Warning
build_factor_exposure_constraints() checks whether the equal-weight portfolio exposure falls within the bounds for each factor. If not, a UserWarning is emitted indicating the constraint may be infeasible. Overly tight bounds are the usual cause; widen them if the warning appears.
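The two-rows-per-factor encoding and the equal-weight feasibility check can both be sketched directly (`exposure_inequalities` is a hypothetical helper mirroring the described output, not the module's API):

```python
import numpy as np
import pandas as pd

def exposure_inequalities(factor_scores: pd.DataFrame,
                          bounds: tuple):
    """Two rows per factor: z_g' w <= ub_g and -z_g' w <= -lb_g."""
    lb, ub = bounds
    Z = factor_scores.to_numpy().T                 # factors x assets
    left = np.vstack([Z, -Z])                      # 2 * factors rows
    right = np.concatenate([np.full(Z.shape[0], ub),
                            np.full(Z.shape[0], -lb)])
    return left, right

scores = pd.DataFrame({"value": [0.5, -0.5], "momentum": [1.0, 0.0]},
                      index=["AAA", "BBB"])
left, right = exposure_inequalities(scores, (-0.5, 0.5))

# Equal-weight feasibility check mirroring the warning described above
w_eq = np.full(len(scores), 1.0 / len(scores))
feasible = bool(np.all(left @ w_eq <= right + 1e-12))
```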
Black-Litterman Views from Factors¶
Generate relative views for Black-Litterman based on factor scores and factor premia:
from optimizer.factors import build_factor_bl_views
views, confidences = build_factor_bl_views(
factor_scores=standardized,
factor_premia={"book_to_price": 0.03, "momentum_12_1": 0.06},
selected_tickers=selected,
)
# views: list[tuple[str, ...]] -- top-quartile vs bottom-quartile tickers
# confidences: list[float] -- |premium| as confidence
For each factor, the function identifies top-quartile and bottom-quartile assets and generates a relative view that the top outperforms the bottom by the factor premium.
Factor Premia Estimation¶
Estimate annualized factor premia from long-short factor-mimicking portfolio returns:
from optimizer.factors import estimate_factor_premia
premia = estimate_factor_premia(factor_mimicking_returns)
# dict[str, float]: annualized premium per factor (mean_daily * 252)
Net Alpha¶
Compute factor alpha after deducting turnover-based transaction costs:
from optimizer.factors import compute_net_alpha
result = compute_net_alpha(
ic_series=ic_series, # Time series of IC values
weights_history=weights_df, # Dates x assets weight matrix
cost_bps=10.0, # Round-trip cost in basis points
annualisation=252,
)
# result.gross_alpha: mean(IC) * sqrt(252)
# result.avg_turnover: mean one-way turnover across rebalancing dates
# result.total_cost: avg_turnover * cost_bps / 10_000
# result.net_alpha: gross_alpha - total_cost
# result.net_icir: net_alpha / (std(IC) * sqrt(252))
Net ICIR
net_icir divides the net alpha by the annualized IC volatility. A net ICIR above 0.5 is generally considered attractive for a factor strategy; above 1.0 is exceptional.
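The arithmetic above can be sketched end-to-end (the divide-by-two one-way turnover convention is an assumption; `net_alpha_report` is a hypothetical stand-in for compute_net_alpha()):

```python
import numpy as np
import pandas as pd

def net_alpha_report(ic_series: pd.Series, weights_history: pd.DataFrame,
                     cost_bps: float = 10.0, annualisation: int = 252):
    """Net alpha per the formulas documented above."""
    gross = ic_series.mean() * np.sqrt(annualisation)
    # One-way turnover: half the summed absolute weight change per date
    turnover = weights_history.diff().abs().sum(axis=1).iloc[1:] / 2.0
    avg_turnover = turnover.mean()
    total_cost = avg_turnover * cost_bps / 10_000
    net = gross - total_cost
    net_icir = net / (ic_series.std() * np.sqrt(annualisation))
    return net, avg_turnover, net_icir

ic = pd.Series([0.05, 0.03, 0.04, 0.06])
weights = pd.DataFrame({"A": [0.5, 0.6, 0.5], "B": [0.5, 0.4, 0.5]})
net, avg_turnover, net_icir = net_alpha_report(ic, weights)
# avg_turnover = 0.1, cost = 0.0001, net = 0.045 * sqrt(252) - 0.0001
```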
Gross Alpha Recovery¶
from optimizer.factors import compute_gross_alpha
gross = compute_gross_alpha(
net_alpha=0.03,
avg_turnover=0.50,
cost_bps=10.0,
)
# gross = net_alpha + avg_turnover * cost_bps / 10_000
End-to-End Example¶
A complete workflow from raw data to optimized portfolio:
import pandas as pd
from optimizer.factors import (
FactorConstructionConfig,
StandardizationConfig,
CompositeScoringConfig,
SelectionConfig,
RegimeTiltConfig,
FactorValidationConfig,
FactorIntegrationConfig,
compute_all_factors,
standardize_all_factors,
compute_composite_score,
select_stocks,
classify_regime,
apply_regime_tilts,
run_factor_validation,
build_factor_exposure_constraints,
FactorGroupType,
)
# 1. Construction: compute raw factor scores
construction_config = FactorConstructionConfig.for_all_factors()
raw_factors = compute_all_factors(
fundamentals=fundamentals_df,
price_history=price_df,
volume_history=volume_df,
analyst_data=analyst_df,
config=construction_config,
)
# 2. Standardization: winsorize, z-score, sector-neutralize
std_config = StandardizationConfig(neutralize_sector=True)
standardized, coverage = standardize_all_factors(
raw_factors, config=std_config, sector_labels=sectors,
)
# 3. Regime tilts (optional)
regime = classify_regime(macro_data)
base_weights = {g: 1.0 for g in FactorGroupType}
tilted = apply_regime_tilts(
base_weights, regime, RegimeTiltConfig.for_moderate_tilts(),
)
group_weights = {g.value: w for g, w in tilted.items()}
# 4. Composite scoring
scoring_config = CompositeScoringConfig.for_equal_weight()
composite = compute_composite_score(
standardized, coverage, config=scoring_config,
group_weights=group_weights,
)
# 5. Stock selection
selection_config = SelectionConfig.for_top_100()
selected = select_stocks(
scores=composite,
config=selection_config,
sector_labels=sectors,
parent_universe=standardized.index,
)
# 6. Validation (on historical data)
report = run_factor_validation(
factor_scores_history=historical_scores,
returns_history=historical_returns,
config=FactorValidationConfig.for_standard(),
)
print(f"Significant factors (BH): {report.significant_factors}")
# 7. Integration: build constraints for optimizer
constraints = build_factor_exposure_constraints(
factor_scores=standardized.loc[selected],
bounds=(-0.5, 0.5),
)
# 8. Pass to optimizer
from optimizer.optimization import MeanRiskConfig, build_mean_risk
model = build_mean_risk(
MeanRiskConfig.for_max_sharpe(),
factor_exposure_constraints=constraints,
)
# model.fit(returns_selected) ...
Gotchas and Tips¶
- Sign conventions matter. Volatility, beta, and asset growth are sign-flipped internally so that higher values always indicate a more favorable exposure. Do not negate these yourself before passing them to the pipeline.
- Point-in-time alignment is critical. Always use `align_to_pit()` with appropriate publication lags when constructing factors from fundamental data. Using `PublicationLagConfig` with source-specific lags is more accurate than a single uniform lag.
- Coverage-weighted group aggregation. `compute_group_scores()` uses a coverage-weighted mean, not a simple mean. Factors with NaN scores do not drag down the group score for tickers where they are missing -- they are simply excluded from the average.
- IC-weighted fallback. When all factor groups have negative or zero IC, both `compute_ic_weighted_composite()` and `compute_icir_weighted_composite()` fall back to equal-weight scoring rather than producing degenerate weights.
- ML scoring requires temporal separation. The `training_scores` and `training_returns` for `RIDGE_WEIGHTED` and `GBT_WEIGHTED` must not overlap with the current prediction period. The caller is responsible for this split.
- Hysteresis reduces turnover. Both `select_fixed_count()` and `select_quantile()` accept `current_members` to implement buffer-zone hysteresis. Without the previous members, every rebalancing produces a fresh selection from scratch, potentially causing excessive turnover.
- Sector balance adjustments are post-hoc. `apply_sector_balance()` runs after the initial selection and may add or remove stocks to meet tolerance constraints. The final count may differ slightly from `target_count`.
- Regime tilts are disabled by default. `RegimeTiltConfig.enable` is `False`. When disabled, `apply_regime_tilts()` returns the original weights unchanged, even if tilt tables are defined in the config.
- OOS validation input format. `run_factor_oos_validation()` expects a two-level MultiIndex `(date, ticker)` on both `scores` and `returns`, unlike other functions that use separate dates-x-tickers DataFrames.
- Factor exposure constraints require matching tickers. The tickers in `factor_scores` passed to `build_factor_exposure_constraints()` must match the assets used in the optimizer `fit()` call. Mismatches produce incorrect constraint matrices.
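The buffer-zone hysteresis mentioned above can be illustrated with a toy selector (`select_with_buffer` is purely illustrative; the module's `select_fixed_count()` handles this internally):

```python
import pandas as pd

def select_with_buffer(scores: pd.Series, target: int,
                       current_members: set = None,
                       buffer: int = 20) -> list:
    """Keep incumbents ranked within target + buffer; fill the
    remaining slots with the best-ranked outsiders."""
    ranked = scores.sort_values(ascending=False).index.tolist()
    if not current_members:
        return ranked[:target]
    keep = [t for t in ranked[:target + buffer] if t in current_members][:target]
    new = [t for t in ranked if t not in keep][:target - len(keep)]
    return sorted(keep + new, key=ranked.index)

scores = pd.Series({"A": 5.0, "B": 4.0, "C": 3.0, "D": 2.0, "E": 1.0})
fresh = select_with_buffer(scores, target=2)                  # A, B
sticky = select_with_buffer(scores, target=2,
                            current_members={"C"}, buffer=1)
# C ranks 3rd, inside target + buffer, so it survives over B: A, C
```

Incumbents only drop out once they fall below rank `target + buffer`, which is what dampens turnover at each rebalancing.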