Systematic Equity Trading: Exploiting Price Reversals through Quantitative Analysis
A deep dive into building an algorithmic trading system, from defining the investable universe to visualizing the equity curve.
The difference between a hypothesis and a trading strategy is rigorous testing. While the concept of buying low and selling high is timeless, executing it systematically across a basket of hundreds of stocks requires a robust technological framework. This guide details the architecture of a quantitative mean-reversion strategy designed to trade against short-term market dislocations. We will explore the mechanics of “cleaning” market data, ensuring trade execution matches liquidity constraints, and utilizing the Efficient Frontier to dynamically adjust position sizes. By the end, we will have a clear view of how algorithmic precision can turn historical market noise into a measurable edge.
import warnings
warnings.filterwarnings('ignore')
When the interpreter reaches these two lines, the program immediately alters the global behavior of Python’s warnings subsystem: it tells the runtime to suppress all warnings from that point forward. Practically, any code that runs later — data loaders, numerical libraries, backtesting routines, or metric calculators — will not emit warning messages that would otherwise surface potential problems such as invalid numerical operations, deprecated API usage, silently coerced dtypes, or potential divide-by-zero events. This is a blunt, process-wide decision that prevents warning text from appearing in logs, notebooks, or consoles, so the “story” of the data pipeline is one of visual silence rather than diagnostic verbosity.
The likely reason someone adds this is pragmatic: during exploratory analysis or when third-party libraries produce a lot of noisy warnings that the developer believes are benign, silencing them makes output easier to read. In the context of quantitative strategy evaluation and performance metrics, that can temporarily improve focus on charts and summary tables by removing nuisance messages. However, that “why” comes with significant trade-offs. Warnings often act as early indicators of subtle issues that directly affect metric correctness — for example, a NumPy invalid-value warning can correspond to NaNs propagating into returns, Sharpe, or drawdown calculations; a Pandas dtype or alignment warning can change grouping or resampling behavior; a deprecation warning may foreshadow future breakage that will alter backtest reproducibility. Because the filter is global and permanent for the process, it can mask problems that should be investigated, making debugging and code auditability much harder.
A safer pattern is to make any suppression intentional and scoped. Prefer filtering specific warning categories and modules rather than silencing everything, or use a temporary context so only a narrow block of known-noisy calls is muted. During development and in continuous integration, convert warnings into errors so issues surface early; in production, record warnings to structured logs rather than hiding them. Always accompany any blanket suppression with a comment explaining why it’s safe here and include tests that assert no unexpected NaNs or loss of data fidelity in your metric outputs. That approach preserves the immediate cleanliness of output when needed, while retaining the diagnostic signal that protects the integrity and reproducibility of your quantitative strategy evaluation.
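As a minimal sketch of that scoped alternative, using the warnings module imported above (load_noisy_data() is a hypothetical stand-in for any call known to emit benign warnings):

# Scope the suppression to a narrow block and a specific category/module,
# rather than silencing the whole process.
with warnings.catch_warnings():
    warnings.filterwarnings('ignore', category=FutureWarning, module='zipline')
    result = load_noisy_data()  # hypothetical noisy call

# During development or CI, escalate warnings to errors so issues surface early:
# warnings.simplefilter('error', RuntimeWarning)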
import sys
import numpy as np
import pandas as pd
from pytz import UTC
from logbook import (NestedSetup, NullHandler, Logger, StreamHandler, StderrHandler,
INFO, WARNING, DEBUG, ERROR)
from zipline import run_algorithm
from zipline.api import (attach_pipeline,
date_rules,
time_rules,
get_datetime,
order_target_percent,
pipeline_output,
record,
schedule_function,
get_open_orders,
calendars,
set_commission,
set_slippage)
from zipline.finance import commission, slippage
from zipline.pipeline import Pipeline, CustomFactor
from zipline.pipeline.factors import Returns, AverageDollarVolume
from pyfolio.utils import extract_rets_pos_txn_from_zipline
import matplotlib.pyplot as plt
import seaborn as sns
This import block sets up the toolchain for building, running, and analyzing a Zipline backtest with realistic execution assumptions and post-run performance attribution. At the highest level, run_algorithm is the entry point that orchestrates the simulation; everything else supplies the pieces it needs: data-processing primitives, scheduling and execution APIs, pipeline-based factor computation, transaction-cost models, logging, and post-run performance extraction and visualization.
We start with the numerical and time foundations: numpy and pandas are used for efficient array and time-series manipulation inside factors and algorithm logic, and UTC from pytz ensures all timestamps are normalized to a single timezone. Normalizing times to UTC avoids subtle bugs and mismatches when comparing traded timestamps, computing intraday scheduling, or exporting results to Pyfolio and plotting libraries, which expect consistent, timezone-aware datetimes.
The logging imports (logbook) are there to control runtime diagnostics. As you develop or debug strategies you’ll toggle log levels and handlers to capture pipeline attach/detach messages, scheduled-function executions, and order activity. Good logging is essential for reproducible quantitative experiments and for diagnosing source of unusual P&L or performance spikes.
Zipline’s core imports are next. run_algorithm drives the full backtest lifecycle, but the meat of a quantitative strategy lives in the API imports: attach_pipeline and Pipeline/CustomFactor let you define vectorized factor computations and a universe; pipeline_output fetches factor results for a given date; Returns and AverageDollarVolume are prebuilt factors used commonly to capture recent price performance and liquidity respectively; CustomFactor lets you implement bespoke signals that operate across columns and lookback windows. Using AverageDollarVolume to screen for liquidity is an explicit guard against building signals that are not tradable in realistic size — it’s a deliberate design choice to prevent overstating achievable returns.
date_rules and time_rules provide a concise, declarative way to schedule strategy logic (rebalance or risk checks) on particular days or times; schedule_function, combined with get_datetime inside the scheduled function, lets the algorithm make time-aware decisions (for example, rebalancing at market open or closing). order_target_percent is the primary execution primitive used here to express portfolio intents (what percent of portfolio to hold in each asset), and get_open_orders is used defensively to avoid duplicate or conflicting orders. calendars gives access to exchange calendars so scheduling and date arithmetic align with actual trading sessions.
To make backtests realistic, we wire up commission and slippage models (set_commission, set_slippage using zipline.finance). These models impose explicit transaction costs and price impact approximations; we do this because omitting them typically overestimates strategy returns and understates turnover-related degradation. Choosing appropriate models and parameters is a deliberate trade-off between conservatism and fidelity to expected live costs.
During the run, record is used to capture custom time-series metrics (risk exposures, cash, leverage, or any diagnostic) for later visualization; pipeline outputs combined with record let you trace how factor signals map to trades and eventual returns. After the backtest finishes, pyfolio.utils.extract_rets_pos_txn_from_zipline converts Zipline’s raw results into the canonical rets/positions/transactions trio that Pyfolio expects; this transformation is the bridge to rich performance attribution: drawdowns, factor exposures, turnover, trade-level P&L, and standard risk metrics like Sharpe and Sortino. Finally, matplotlib and seaborn are imported to visualize those results — cumulative returns, rolling statistics, heatmaps of factor exposures, and distributional plots — making it easier to interpret and communicate the strategy’s behavior.
In short, this collection of imports is purposeful: pipelines and factors produce tradable signals (with liquidity screens), scheduled functions and order primitives implement the trading policy, commission/slippage enforce realistic cost assumptions, logging and get_open_orders increase operational robustness, and Pyfolio + plotting convert simulation outputs into interpretable performance metrics. Each piece exists to reduce blind spots between an idealized alpha and what can actually be traded and measured in a real quantitative strategy evaluation.
Logging setup
# setup stdout logging
format_string = '[{record.time: %H:%M:%S.%f}]: {record.level_name}: {record.message}'
zipline_logging = NestedSetup([NullHandler(level=DEBUG),
                               StreamHandler(sys.stdout, format_string=format_string, level=INFO),
                               StreamHandler(sys.stderr, level=ERROR)])
zipline_logging.push_application()
log = Logger('Algorithm')
This block initializes the logging surface that the backtest and strategy code will use to report runtime events, and it does so in a way that keeps normal output readable while still surfacing errors and preserving the ability to enable more verbose debug output if needed. The key goals are: (1) timestamped, leveled messages so trade/metric events can be correlated, (2) clear separation of informational output from error output so monitoring tools and users can react appropriately, and (3) a controlled, application-level handler stack that can be pushed and popped (useful for tests and embedding).
Concretely, the format string sets a compact but precise record format: a timestamp with microsecond resolution, the log level, and the message. Microsecond precision matters for quantitative strategy evaluation because many events (order placements, fills, metric calculations) can occur in rapid succession or need to be correlated with tick-level data and latency measurements; having that precision in every log line makes post‑run analysis and debugging far more reliable.
The NestedSetup is constructed with three handlers to enforce the output policy. The NullHandler at DEBUG level prevents low‑level debug lines from leaking into the default logging stream while still allowing a consumer to enable DEBUG explicitly later; it’s a way of owning log propagation so library-level debug noise doesn’t swamp the console. The StreamHandler pointed to stdout emits INFO and above with the human‑readable timestamp format; INFO is appropriate for routine operational messages such as periodic performance summaries, endpoint events, or start/stop markers that an analyst will want to see during a run or in CI. The StreamHandler to stderr is restricted to ERROR and above so runtime failures and exceptions are routed separately; this makes it straightforward for tooling to capture and alert on errors without parsing informational logs.
Calling push_application() activates that handler stack for the running application, which means subsequent loggers will inherit these rules until the setup is popped. Finally, creating Logger(‘Algorithm’) produces a namespaced logger that the algorithm and surrounding infrastructure should use. Using a named logger makes it possible to filter or route messages from the strategy, execution engine, or performance recorder independently during later analysis or in production (for example when you want verbose instrumentation from the execution subsystem but only INFO from the strategy logic).
In practice this pattern gives you deterministic, high‑resolution logs useful for reconstructing trade timelines and validating performance metrics, while keeping the console output focused and easy to parse. For production runs you might extend this by adding rotating file handlers or structured (JSON) outputs for automated metrics ingestion, but the current setup strikes a good balance between visibility and noise control for backtests and ad‑hoc evaluations.
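As a hedged sketch of that extension, assuming logbook’s FileHandler (the filename is arbitrary):

from logbook import FileHandler

# Sketch: also persist INFO+ records to a file for later ingestion; bubble=True
# lets records continue on to the console handlers defined above.
file_setup = NestedSetup([NullHandler(level=DEBUG),
                          StreamHandler(sys.stdout, format_string=format_string, level=INFO),
                          StreamHandler(sys.stderr, level=ERROR),
                          FileHandler('backtest.log', level=INFO,
                                      format_string=format_string, bubble=True)])
file_setup.push_application()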
Algorithm Settings
Configuration options for the algorithm.
# Settings
MONTH = 21
YEAR = 12 * MONTH
N_LONGS = 50
N_SHORTS = 50
VOL_SCREEN = 500
These constants form a compact configuration block that sets the time base, portfolio construction size, and a liquidity filter — each choice directly shapes how the strategy is backtested and how you compute and interpret performance metrics.
MONTH = 21 and YEAR = 12 * MONTH define the time units used for annualization and frequency assumptions. By setting MONTH to 21 you are adopting the common industry convention that a trading month has roughly 21 trading days; multiplying by 12 yields YEAR = 252 trading days. That YEAR constant is what you will use when converting per-period statistics into annualized numbers (for example, annualized return = mean(daily_returns) * YEAR, and annualized volatility = std(daily_returns) * sqrt(YEAR)). Making MONTH a named constant rather than hard-coding 252 makes the assumption explicit and easy to change if you want a different calendar (e.g., calendar days, longer/shorter lookbacks, or lower-frequency rebalancing).
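For example, a minimal sketch of how these constants feed annualization, assuming daily_returns is a pandas Series of daily simple returns (a hypothetical name, not defined in this script):

# Annualize daily statistics using the trading-day convention above
ann_return = daily_returns.mean() * YEAR              # 252-day annualized mean return
ann_vol = daily_returns.std() * np.sqrt(YEAR)         # annualized volatility
sharpe = ann_return / ann_vol                         # ignores the risk-free rate for simplicity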
N_LONGS and N_SHORTS (both 50) set the size of the long and short legs of the portfolio. These determine how many securities you will hold on each side when you construct the long-short portfolio (common in factor/backtest frameworks). Choosing symmetric N_LONGS and N_SHORTS supports a balanced portfolio construction and simplifies analysis of gross exposure and market neutrality; equal counts reduce gross directional bias if weights are symmetric, although true market-beta neutrality depends on weighting and selection rules as well. The choice of 50 versus, say, 10 or 200 is a risk/return tradeoff: smaller N increases concentration and idiosyncratic risk (and potentially higher per-name alpha), while larger N increases diversification, reduces idiosyncratic noise, and typically lowers turnover per instrument. These choices affect realized Sharpe, turnover, capacity, and the statistical significance of alpha estimates — so you should treat N_LONGS/N_SHORTS as hyperparameters to sweep during evaluation.
VOL_SCREEN = 500 is a liquidity screen applied to the investable universe so that the strategy ignores securities below that threshold. The intent is to ensure tradability and reduce unrealistic backtest results caused by illiquid names (which would amplify transaction costs, market impact, and execution slippage). The numeric meaning of 500 depends on the conventions in your data pipeline (e.g., 500 might mean 500k average daily shares, average daily dollar volume of $500k, or simply a raw count in the dataset), so confirm units before changing it. Setting VOL_SCREEN too high will shrink the universe and can bias results toward large-cap, highly liquid names; setting it too low will include illiquid names and artificially inflate returns if you ignore realistic costs. In practice this filter should be calibrated with expected position sizes and modeled transaction costs so the backtest remains realistic.
Operationally these constants feed three important places in the evaluation flow: (1) the time constants control how you aggregate and annualize returns and volatilities, (2) the N_LONGS/N_SHORTS control portfolio construction and concentration which influences P&L, turnover, and risk exposures, and (3) the VOL_SCREEN constrains the investable universe and therefore impacts capacity, realized turnover, and transaction-cost-adjusted performance. When you run sensitivity analysis or report performance metrics (annualized return, volatility, Sharpe, max drawdown, turnover, information ratio), treat these values as explicit hyperparameters: document them, sweep them where appropriate, and report how changes affect your conclusions so that the evaluation is robust and reproducible.
start = pd.Timestamp('2013-01-01', tz=UTC)
end = pd.Timestamp('2017-01-01', tz=UTC)
capital_base = 1e7
These three constants define the temporal and monetary frame for the backtest and therefore determine how all downstream signals, orders and performance statistics are interpreted. The two Timestamp values set the evaluation window: everything — price series, trade events, and computed indicators — will be filtered or aligned to timestamps between start and end. Using an explicit timezone (UTC) is important because it prevents subtle misalignments when merging feeds or comparing timestamps from exchanges that report in different zones or switch during daylight saving transitions; that consistency avoids off-by-one-day errors in returns, incorrect intraday bar matching, and flaky lookbacks for time-based indicators. You should also confirm how the backtest engine treats the end boundary (inclusive vs exclusive) because that choice affects whether the last day’s activity is considered.
Choosing a multi-year window (2013–01–01 through 2017–01–01) gives a contiguous sample long enough to compute annualized statistics and stress across several market regimes. In practical terms this period sets the sample size used to annualize volatility, compute rolling drawdowns, and estimate statistical significance of alpha; it also determines whether your indicators have enough warm-up history. If your strategy uses long lookbacks or needs crisis-period data, you should revisit the window — both the start date for sufficient warm-up and whether the end date reflects the most recent market structure you care about.
capital_base = 1e7 is the notional starting bankroll for the simulated portfolio (10,000,000 in whatever currency your price data is denominated). That number is the anchor for converting percentage exposures and target weights into absolute position sizes (e.g., number of shares = capital_base * weight / price), and it scales all dollar P&L, margin usage and commission/fee computations. As a consequence, absolute metrics — total P&L, transaction costs, margin requirements — scale linearly with capital_base, whereas pure percentage metrics (percentage returns, Sharpe ratio computed with percent returns) remain invariant. This distinction matters for decision-making: if you’re optimizing for dollar profits or evaluating capacity, capital_base matters; if you’re comparing strategy skill across universes, normalized returns are what you look at.
Finally, pick these constants deliberately to match the rest of your pipeline: ensure the currency and conventions (e.g., fractional shares allowed or integer rounding applied) are consistent with capital_base, and check that downstream performance functions know the timezone semantics and inclusive/exclusive handling of end. Small mismatches here create hard-to-find discrepancies in drawdown timing, turnover estimates, and transaction-cost modelling that directly affect the quantitative evaluation and the business conclusions drawn from the backtest.
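A small sketch of the conversion mentioned above, where weight and price are hypothetical placeholders for a single asset’s target portfolio weight and current price:

# Convert a target weight into an integer share count anchored on capital_base
target_notional = capital_base * weight
shares = int(target_notional // price)   # integer rounding; fractional shares not assumed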
Mean-Reversion Factor
class MeanReversion(CustomFactor):
    """Compute ratio of latest monthly return to 12m average,
    normalized by std dev of monthly returns"""
    inputs = [Returns(window_length=MONTH)]
    window_length = YEAR

    def compute(self, today, assets, out, monthly_returns):
        df = pd.DataFrame(monthly_returns)
        out[:] = df.iloc[-1].sub(df.mean()).div(df.std())
This CustomFactor implements a simple, cross-sectional mean-reversion signal by measuring how far an asset’s most recent monthly return deviates from its own 12‑month history in units of that asset’s monthly volatility. The factor pulls in one-month (21-day) returns via the Returns input across a YEAR (252-day) window, so the compute method receives a 2D array of trailing monthly returns spanning the past 12 months, with one row per trading day in the window and one column per asset. Converting that array to a DataFrame lets the code express the operations cleanly: df.iloc[-1] picks the latest monthly return for every asset, df.mean() computes the trailing 12‑month average per asset, and df.std() computes the trailing 12‑month standard deviation per asset (the default pandas behavior is a column-wise mean/std when the rows are time).
The core algebra is (latest_return — mean) / std, i.e., a z‑score of the most recent monthly return relative to its trailing 12‑month distribution. Writing out the numerator and denominator explicitly makes clear the intended interpretation: positive values mean the asset has recently outperformed its own 12‑month average, negative values mean it has underperformed, and the division by std scales those deviations so assets with different return volatilities are comparable. The result is written into out[:] as the per-asset factor value that downstream algorithms (ranking, portfolio construction, or backtests) will consume.
Why this design for quantitative evaluation? Normalizing by the asset’s own volatility ensures the factor provides a risk-adjusted, cross-sectional signal rather than raw return magnitudes that would otherwise let high-volatility names dominate. Using monthly returns and a 12‑month window suppresses short-term noise and captures medium-term mean-reversion tendencies that many equity strategies target; it also aligns naturally with common performance metrics evaluated on monthly horizons (IC, hit-rate, monthly Sharpe contributions, etc.). The z-score form is convenient for ranking, thresholding (e.g., select extremes), and aggregating signals across assets.
A couple of practical caveats and improvements to consider: if an asset’s 12‑month std is zero (or effectively zero because of a short or flat history), the division will produce NaN or inf, so production code should guard against that (small epsilon, winsorization, or masking). The pandas std uses sample standard deviation (ddof=1) by default; depending on the sample size and bias concerns you might choose a different ddof or a robust scale estimator (MAD) if outliers are a concern. Finally, if you want more responsiveness to recent behavior, consider exponentially weighting the mean and std instead of equally weighting the last 12 months.
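A minimal sketch of a more defensive compute method along those lines (a variation on the factor above, not the original code):

    def compute(self, today, assets, out, monthly_returns):
        df = pd.DataFrame(monthly_returns)
        std = df.std().replace(0, np.nan)            # guard against zero volatility
        score = df.iloc[-1].sub(df.mean()).div(std)
        out[:] = score.clip(-3, 3).fillna(0)         # winsorize extremes, neutralize missing values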
In short, this factor turns historical monthly returns into a standardized mean‑reversion score: it summarizes how extreme an asset’s latest return is relative to its own trailing 12‑month behavior, making the signal comparable across the cross-section and ready for quantitative evaluation and portfolio decision-making.
Create a pipeline
The Pipeline created by compute_factors() returns a table with columns “longs” and “shorts”, each flagging 50 stocks, plus a “ranking” column. Stocks are ranked by the deviation of their most recent monthly return from its annual average, normalized by the standard deviation; the long set comprises the 50 largest negative deviations and the short set the 50 largest positive deviations. The universe is restricted to the 500 stocks with the highest average dollar volume over the last 30 trading days.
def compute_factors():
    """Create factor pipeline incl. mean reversion,
    filtered by 30d Dollar Volume; capture factor ranks"""
    mean_reversion = MeanReversion()
    dollar_volume = AverageDollarVolume(window_length=30)
    return Pipeline(columns={'longs': mean_reversion.bottom(N_LONGS),
                             'shorts': mean_reversion.top(N_SHORTS),
                             'ranking': mean_reversion.rank(ascending=False)},
                    screen=dollar_volume.top(VOL_SCREEN))
This small function builds a single, reproducible factor pipeline whose outputs feed directly into our portfolio construction and the downstream performance metrics we use to evaluate the strategy. At a high level it does three things: compute a mean-reversion signal, restrict the universe to liquid names, and surface both top/bottom selections and a continuous rank for evaluation and weighting.
First, the mean-reversion factor is created and treated as the predictive signal: assets that look like recent “winners” versus “losers” under the factor are identified so we can form a classic mean-reversion long-short book. The code then takes the bottom N_LONGS of the factor as the long side and the top N_SHORTS as the short side. In a mean-reversion framework this convention typically means we are long recent underperformers (expecting them to bounce back) and short recent overperformers (expecting them to revert). The explicit selection of fixed counts (N_LONGS/N_SHORTS) enforces a controlled, symmetric portfolio construction which simplifies attribution and risk comparison across periods.
Second, the pipeline screens the universe using a 30-day AverageDollarVolume metric and keeps only the top names by that liquidity measure (VOL_SCREEN). The 30-day lookback is deliberate: it yields a stable, backward-looking liquidity estimate that reduces noise from one- or two-day spikes and avoids look-ahead bias because it uses historical data only. Filtering by dollar volume protects the strategy from microcap and illiquid securities that would distort realized performance once transaction costs and capacity constraints are considered.
Third, the code also exposes a ranking column from the same mean-reversion factor (rank(ascending=False)). Capturing ranks — rather than raw scores alone — is important for several reasons: ranks normalize different factor scales across time and cross-sections, enable computation of rank-based performance metrics (e.g., rank Information Coefficient), and support alternative weighting schemes (e.g., rank-weighted portfolios). The use of descending ranking (ascending=False) reflects the assumed score convention where larger factor values indicate stronger signals; if your factor sign convention differs, you’d flip this to keep the interpretation consistent.
Putting it together, the Pipeline call returns three named columns (longs, shorts, ranking) restricted to a liquid universe. That vectorized, time-consistent output is what we feed into backtests and metric calculations: the longs/shorts masks define discrete position sets for calculating realized long/short returns, turnover, and concentration, while the ranking column is used for continuous analyses like IC, rank decay, or alternate weighting schemes. Designing the pipeline this way keeps the signal generation, tradability screen, and evaluation hooks co-located and consistent, which is critical for reliable quantitative strategy evaluation and comparable performance metrics.
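As an example of the kind of evaluation the ranking column enables, here is a hedged sketch of a per-date rank information coefficient; factor_data (one day’s pipeline output) and forward_returns (a Series of next-period returns indexed by the same assets) are assumed to exist and are hypothetical names:

from scipy.stats import spearmanr

# Rank IC for a single date: Spearman correlation of factor rank vs realized forward return
common = factor_data.index.intersection(forward_returns.index)
ic, p_value = spearmanr(factor_data.loc[common, 'ranking'],
                        forward_returns.loc[common])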
before_trading_start() ensures the pipeline is executed daily and that its results, including the current prices, are recorded.
def before_trading_start(context, data):
    """Run factor pipeline"""
    context.factor_data = pipeline_output('factor_pipeline')
    record(factor_data=context.factor_data.ranking)
    assets = context.factor_data.index
    record(prices=data.current(assets, 'price'))
This function is the pre-market entry point that refreshes and snapshots the factor signals the strategy will use for the trading day. First it calls the pipeline engine to materialize the latest outputs of the named pipeline; that result (typically a table keyed by asset) is stored on context so downstream intraday logic (rebalance handlers, risk checks, order sizing) can access the exact same signals without re-running the pipeline. Storing the pipeline output in context also preserves the canonical input used for debugging and offline evaluation.
Immediately after obtaining the pipeline output the code records the factor ranking timeseries into the backtest/performance system. Recording the ranking here gives you a persistent, timestamped trace of the signal (not just the orders or returns) which is essential for quantitative evaluation: you can compute information coefficients, rank decay, factor stability, and signal-to-noise metrics from this logged series. The code then extracts the set of assets present in the pipeline output (the assets universe for today) and requests current prices for that universe. Capturing the prices at this same snapshot is important because it aligns the signal and valuation: you need the price used to mark positions (entry price, market-to-market) to compute P&L, turnover, realized vs. unrealized returns, and to back out transaction costs and slippage for performance attribution.
Overall, the block ensures you have a consistent, auditable snapshot — factor values/ranks plus contemporaneous prices — immediately before trading begins. That consistency is why the pipeline is run here (fresh signals) and why both signals and prices are recorded: they are the raw inputs to portfolio construction, risk checks, execution decisions, and all downstream performance metrics and diagnostics.
Set Up Rebalancing
The new rebalance() method submits trade orders to exec_trades() for assets that the pipeline has flagged for long or short positions, assigning equal positive weights to longs and equal negative weights to shorts. It also divests any current holdings that are no longer present in the factor signals.
def rebalance(context, data):
    """Compute long, short and obsolete holdings; place trade orders"""
    factor_data = context.factor_data
    assets = factor_data.index

    longs = assets[factor_data.longs]
    shorts = assets[factor_data.shorts]
    divest = context.portfolio.positions.keys() - longs.union(shorts)

    log.info('{} | Longs: {:2.0f} | Shorts: {:2.0f} | {:,.2f}'.format(get_datetime().date(),
                                                                      len(longs),
                                                                      len(shorts),
                                                                      context.portfolio.portfolio_value))

    exec_trades(data, assets=divest, target_percent=0)
    exec_trades(data, assets=longs, target_percent=1 / N_LONGS if N_LONGS else 0)
    exec_trades(data, assets=shorts, target_percent=-1 / N_SHORTS if N_SHORTS else 0)
This function is the rebalancing routine that translates the current factor signals into concrete portfolio adjustments. The story begins with factor_data — a table of per-asset signals kept on the context. assets is the universe under review (the index of that table). The code then uses the boolean signal columns (.longs and .shorts) to partition the universe into the sets we want to hold long and short. Those boolean masks let us pick exactly which tickers are currently being signaled for each side, so the strategy’s signal generation is cleanly separated from execution.
Next we compute divest, the set of currently held positions that are no longer signaled either long or short. This is done by taking the current portfolio holdings (context.portfolio.positions.keys()) and subtracting the union of the current long and short sets. The intent is explicit: any instrument we own but no longer want should be closed out. We log the date, the counts of longs and shorts, and the current portfolio value so that, during backtests or live runs, you have a consistent snapshot for performance monitoring and turnover analysis.
Execution is performed via exec_trades, and the order of those calls is deliberate. First we call exec_trades for divest with target_percent=0 to close obsolete positions and free capital (and reduce unwanted exposure) before sizing new positions. Then we up-weight the current long set to an equal-weight target of 1 / N_LONGS per asset (and symmetrically set shorts to -1 / N_SHORTS). Equal weighting simplifies risk control and attribution: each signal contributes the same portfolio weight, making contribution-to-performance calculations straightforward and keeping per-name risk predictable. The inline guards (if N_LONGS else 0) prevent division-by-zero and effectively disable that side if no slots are configured.
A few important operational notes: exec_trades is responsible for converting target_percent into orders (market or otherwise), handling partial fills, and dealing with existing positions that may already be near their target; calling it for longs/shorts after closing obsolete positions ensures we don’t over-allocate capital. This routine does not perform liquidity checks, slippage modeling, or advanced risk limits — those responsibilities should be inside exec_trades or upstream risk controls. Finally, the counts logged here and the fixed target weights are key levers for quantitative strategy evaluation: they determine turnover, gross/net exposure, and how easily you can attribute returns to the factor signals when computing performance metrics.
def exec_trades(data, assets, target_percent):
    """Place orders for assets using target portfolio percentage"""
    for asset in assets:
        if data.can_trade(asset) and not get_open_orders(asset):
            order_target_percent(asset, target_percent)
This small routine is the trade-execution gatekeeper for a portfolio rebalancing step: its role is to walk the list of candidate assets and, where appropriate, instruct the trading engine to move the portfolio toward a desired weight for each asset. The loop is the narrative flow — for each asset we first check tradability via data.can_trade(asset) and then make sure there are no pending orders for that asset via get_open_orders(asset). Only when both conditions pass do we call order_target_percent(asset, target_percent), which is a high-level instruction to the execution layer to adjust that asset’s position to the specified fraction of current portfolio value.
Those two preconditions are important for correctness and for clean measurement. data.can_trade typically encodes market-state constraints (exchange open/closed, halts, delisted status) so we avoid creating orders that will be rejected or sit inactive; this reduces false positives in trade counts and prevents spurious fills that would distort realized returns and slippage statistics. get_open_orders prevents overlapping instructions for the same asset — without it you can create conflicting or duplicate orders in-flight, which increases turnover, complicates attribution, and biases performance metrics (for example by inflating trade counts and temporary inventory effects).
Using order_target_percent is a design choice that simplifies expressing portfolio intent: you declare the desired weight and let the engine compute the required quantity and sizing, so rebalancing is idempotent (if you’re already at the target weight nothing new should execute) and the code stays compact. That convenience, however, has downstream measurement consequences: the execution engine’s choices about market vs. limit orders, order chopping, fill timing and slippage directly affect realized P&L, transaction cost and turnover metrics, and should therefore be included when you evaluate strategy performance. Also note that this function applies the same target_percent to every asset in the provided list; if you need differentiated weights you must feed per-asset targets instead.
Operationally, this code is intentionally simple but also limited: it does not enforce risk limits (position caps, cash/margin checks), does not batch or stagger orders to manage market impact, and contains no retry or error handling for failed executions — all issues that matter when you translate a backtest into live trading and when you want clean, realistic performance measurement. For quantitative evaluation, consider augmenting the routine to log attempted and executed orders, record slippage and fees at the moment of fill, and apply sequencing or pacing logic so that turnover, trade frequency, and execution costs are tracked and controlled rather than implicitly delegated to the execution layer.
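A hedged sketch of the per-asset-weight variant mentioned above (exec_trades_weighted is a hypothetical helper, not part of the original strategy):

def exec_trades_weighted(data, target_weights):
    """Variant: place orders from a dict or Series of per-asset target weights."""
    for asset, weight in target_weights.items():
        if data.can_trade(asset) and not get_open_orders(asset):
            order_target_percent(asset, weight)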
Initialize the Backtest
The rebalance() method executes according to the date_rules and time_rules configured via schedule_function(). By default it runs at the beginning of the week, immediately after market_open, as defined by the built-in US_EQUITIES calendar (see the docs for details on the rules).
You can specify a trade commission either in relative terms or as a minimum amount. You can also define slippage — the cost of an adverse price change between the trade decision and execution.
def initialize(context):
    """Setup: register pipeline, schedule rebalancing,
    and set trading params"""
    attach_pipeline(compute_factors(), 'factor_pipeline')
    schedule_function(rebalance,
                      date_rules.week_start(),
                      time_rules.market_open(),
                      calendar=calendars.US_EQUITIES)
    set_commission(us_equities=commission.PerShare(cost=0.00075,
                                                   min_trade_cost=.01))
    set_slippage(us_equities=slippage.VolumeShareSlippage(volume_limit=0.0025,
                                                          price_impact=0.01))
This initialize block wires together three responsibilities necessary for realistic, repeatable backtests: (1) where your cross-sectional signals come from, (2) when you act on those signals, and (3) how executions are modeled. Together these pieces ensure the strategy’s simulated P&L and derived metrics (turnover, net returns, Sharpe, drawdown, etc.) reflect realistic operational constraints so you can evaluate and compare quantitative approaches meaningfully.
First, attach_pipeline(compute_factors(), ‘factor_pipeline’) registers a centralized Pipeline that computes the factors and filtering logic used to rank and select stocks. By building and attaching the pipeline once here you guarantee that factor computations are performed consistently in the backtest engine’s cross-sectional engine (same universe, same masks and transforms) and are available for the scheduled rebalance to consume each run. The pipeline is the canonical source of your predictive signals and exposure information, which is essential for attribution and for reproducing factor returns across dates.
Second, schedule_function(rebalance, date_rules.week_start(), time_rules.market_open(), calendar=calendars.US_EQUITIES) determines when the algorithm converts pipeline output into trading actions. Scheduling a weekly rebalance at the market open on the first trading day of each week intentionally trades off responsiveness against transaction costs and noise: less frequent rebalancing reduces turnover and transaction cost drag, which typically leads to more realistic net performance for medium-frequency factor strategies; it also aligns decision cadence with how many fundamental or aggregated signals naturally refresh. Using time_rules.market_open and the US_EQUITIES calendar ensures the order placement is tied to actual trading sessions and holiday schedules, so execution assumptions and slippage modeling line up with market microstructure expectations.
Finally, set_commission and set_slippage install the cost and market-impact models that the simulator will apply to every trade. Commission.PerShare(cost=0.00075, min_trade_cost=.01) charges a small fixed per-share fee with a minimum per-trade floor — that prevents tiny theoretical trades from appearing “free” and inflating net returns. VolumeShareSlippage(volume_limit=0.0025, price_impact=0.01) models market impact by limiting the fraction of market volume you can realistically execute in a time slice (here, ~0.25% of available volume) and scaling price movement with participation (price_impact governs how aggressively price moves against you as participation rises). These parameters matter because they directly reduce realized P&L and change turnover optimization: strategies that look attractive on gross returns can become unattractive once slippage and commissions are applied. They also give you levers for sensitivity testing — calibrating volume_limit and price_impact to real execution data tightens the fidelity of your performance estimates.
Putting it all together: on each scheduled rebalance the algorithm will read the latest pipeline outputs, compute target weights, and submit orders. The backtester will then apply the slippage and commission models to those orders so the realized fills and resulting portfolio paths reflect trading frictions. That executed P&L is what feeds your performance metrics and informs strategy evaluation and parameter tuning. In practice you should validate the pipeline outputs, choose rebalance cadence with an eye toward signal decay and turnover, and calibrate execution parameters against historical trade data to ensure the backtest’s net performance is a credible proxy for live trading.
Run the Algorithm
The algorithm runs when run_algorithm() is invoked and returns a DataFrame containing the backtest performance.
backtest = run_algorithm(start=start,
end=end,
initialize=initialize,
before_trading_start=before_trading_start,
bundle='quandl',
capital_base=capital_base)
This single call is the orchestration point for running a historical simulation: it instantiates the backtest environment, feeds market data into your algorithm, executes the algorithm’s decisions against a simulated market microstructure, and returns the complete performance record for evaluation. First, run_algorithm boots the engine and calls your initialize function exactly once to set up persistent algorithm state (variables, commission/slippage models, scheduled tasks, the pipeline you plan to use, etc.). After initialization, the engine walks forward through each trading session in the date range you supplied (start → end) using the market calendar associated with the chosen data bundle. For each session it prepares the data slice for that day, invokes before_trading_start so you can compute pre-market signals and refresh any daily-only state, and then advances through the simulated intraday or daily bars, delivering market data to whatever trading entry points you have registered (scheduled functions, handle_data, pipeline outputs, etc.).
When your algorithm issues orders, the backtest engine applies the configured execution rules (fill rules, slippage model, commissions, and exchange/risk checks) and updates the simulated portfolio and cash balances by marking positions to market after fills. All orders, fills, position histories and cash movements are recorded as transactions and position snapshots; these are the raw events from which the performance engine computes daily returns, cumulative returns, drawdown, turnover, P&L attribution, Sharpe-like ratios and any other post-trade metrics you will use to evaluate strategy quality. The bundle argument selects the historical dataset and its associated adjustments (corporate actions, price adjustments, calendar), so it directly affects the prices and corporate-action treatment used to mark-to-market — critical for avoiding spurious alpha from unadjusted splits/dividends or survivorship bias. capital_base sets the starting cash value for the simulated portfolio and therefore scales leverage and absolute P&L; it’s important because many risk and return statistics are sensitive to the initial capital assumption.
In short, this line wires your algorithm into a full event-driven backtest: it prepares environment and data (initialize and bundle), runs daily pre-trade logic (before_trading_start), steps through market events applying execution models and risk checks, and finally returns a structured performance object (portfolio history, transactions, metrics) that you then use for quantitative strategy evaluation and comparative performance analysis. Be mindful that the chosen start/end window, the ingested bundle, and capital_base all materially change the realized performance numbers — so these are deliberate knobs you should set to ensure apples-to-apples strategy comparisons.
Extract inputs for pyfolio
The extract_rets_pos_txn_from_zipline utility in pyfolio extracts the data required to compute performance metrics.
returns, positions, transactions = extract_rets_pos_txn_from_zipline(backtest)
This single extraction call is the handoff point between Zipline’s internal backtest state and the evaluation layer that computes all of our performance and trade-quality metrics. The backtest object contains the engine’s complete bookkeeping — time-indexed portfolio snapshots, executed trade events, cash ledger and the engine’s own return calculation — but it is not in the flattened, analysis-ready shape we want. extract_rets_pos_txn_from_zipline(backtest) standardizes and returns three orthogonal views of that bookkeeping: a returns series/table, a positions history, and a transactions log, each shaped for vectorized, reproducible metrics computation.
The returns output is what we use to measure realized performance over time: typically a time-indexed Series or DataFrame of portfolio (and possibly per-asset) simple returns or log returns. We extract and normalize these so downstream metrics (annualized return, volatility, Sharpe ratio, drawdown, rolling statistics) operate on consistent frequency, timezone, and return-convention. That normalization step is important because slight differences (daily vs. minute, simple vs. log) materially change annualization and aggregation behavior; by forcing a single convention here we avoid silent, hard-to-detect errors in risk/return calculations.
The positions output captures the holdings history — the quantity, maybe notional or market value, and any holding-level metadata at each timestamp. We extract positions to evaluate exposures, concentration, leverage, and turnover. Positions let us compute time-weighted exposures to sectors or factors, measure average holding periods, and reconcile trade-level events into cumulative exposures. A key reason to isolate positions is to validate consistency: the changes in positions should match the cumulative signed transactions. Any discrepancy here points to modeling issues like corporate-action adjustments or partial fills that must be handled before trusting performance attributions.
The transactions output is the granular trade log: timestamps, asset identifiers, signed quantities, executed prices, trade costs/fees and possibly execution flags. We need the transaction-level view to measure implementation quality (slippage vs. mid-price, fee impact), to compute realized vs. unrealized P&L, and to estimate turnover and transaction cost drag. Keeping raw transactions separate allows us to simulate alternative cost models, detect outlier trades, and compute trade-level statistics (average trade size, execution latency impact) that inform capacity and operational risk decisions.
Practically, pulling these three standardized artifacts out of the backtest decouples evaluation from the backtest engine’s internal API. That makes the evaluation code deterministic and vectorized: returns power the portfolio-level metrics, positions power exposure and attribution calculations, and transactions power cost, turnover, and execution-quality analyses. It also provides natural consistency checks (e.g., returns should equal P&L implied by positions and transactions) that surface bookkeeping bugs or corporate-action mismatches early. Finally, be mindful of common pitfalls when using the outputs: frequency and timezone alignment, whether returns are pre- or post-cost, whether positions are reported in units vs. notional, and how corporate actions are recorded — all of which must be understood and documented before you trust the downstream quantitative strategy evaluation.
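One such consistency check, sketched under the assumption that the positions table includes a cash column so its row sums approximate portfolio value:

# Returns implied by the positions table should roughly track the extracted returns
portfolio_value = positions.sum(axis=1)
implied_returns = portfolio_value.pct_change().dropna()
max_gap = (implied_returns - returns.reindex(implied_returns.index)).abs().max()
print(f'max daily discrepancy: {max_gap:.6f}')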
Persisting Results for Use with pyfolio
with pd.HDFStore('backtests.h5') as store:
    store.put('backtest/equal_weight', backtest)
    store.put('returns/equal_weight', returns)
    store.put('positions/equal_weight', positions)
    store.put('transactions/equal_weight', transactions)
This block opens a persistent HDF5 container and writes four distinct artifacts from the equal-weight strategy run into it, then closes the file. Using the context manager ensures the HDFStore is opened and closed cleanly (so buffers are flushed and the file descriptor released), which is important for data integrity and predictable downstream reads. Each store.put call writes a separate object under a hierarchical key: backtest/equal_weight, returns/equal_weight, positions/equal_weight, and transactions/equal_weight. The hierarchical naming is intentional — it gives a consistent namespace that makes it easy to find and compare the same kind of data across different strategies (for example backtest/market_cap vs backtest/equal_weight) without mixing types.
We store these four artifacts separately because each one serves a different role in performance evaluation. The “backtest” entry typically contains run-level metadata and summary statistics (configuration, parameters, total P&L, runtime diagnostics) and is useful for reproducibility and auditing which inputs produced the reported metrics. The “returns” entry is the time series of strategy P&L or periodic returns; it’s the primary input to standard performance metrics (Sharpe, Sortino, volatility, CAGR, drawdown analysis) and is often accessed frequently by analytics pipelines. The “positions” entry captures per-period exposures across instruments and is necessary to compute risk exposures, sector/asset concentration, and to reconcile position-level attribution. The “transactions” entry records trades executed (size, price, timestamp) and is required to estimate realized slippage, commissions, turnover, and to validate execution logic. Persisting them separately allows selective loading (load only returns when computing risk metrics, or only transactions when analyzing cost), which limits memory use and speeds analytical iterations.
There are also operational considerations tied to the choice of HDF5 and the usage pattern here. HDFStore is efficient for numeric, tabular time-series data and supports fast reads of subsets via keys; however, by default put writes in a fixed format which is fast for writes but not appendable — if you plan to stream incremental results into the same table, prefer format=’table’ or append=True. HDF5 is not ideal for concurrent multi-process writers, so coordinate access or use a different backend (database or object store) if you need parallel writes. You may also want to enable compression or chunking if these DataFrames are large. Finally, because these artifacts are the canonical record of the backtest run, keeping them consistent and versioned (consistent key naming and perhaps an index of runs) significantly improves reproducibility, makes comparisons across strategies straightforward, and directly supports downstream metric calculations and reporting.
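A sketch of the compressed, appendable variant discussed above:

# format='table' makes the stored frames queryable and appendable; blosc compression shrinks the file
with pd.HDFStore('backtests.h5', complevel=9, complib='blosc') as store:
    store.put('returns/equal_weight', returns, format='table')
    store.put('transactions/equal_weight', transactions, format='table')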
Plot results
fig, axes = plt.subplots(nrows=2, figsize=(14, 6))
returns.add(1).cumprod().sub(1).plot(ax=axes[0], title='Cumulative Returns')
transactions.groupby(transactions.dt.dt.day).txn_dollars.sum().cumsum().plot(ax=axes[1], title='Cumulative Transactions')
fig.tight_layout()
sns.despine();
This block builds a two-panel diagnostic plot that juxtaposes how the strategy’s equity curve evolves (top) with the cumulative cash flow from transactions (bottom), giving a quick visual summary of both performance and activity over time. It starts by allocating a single column with two stacked subplots, sized for readability; the explicit axes handles are passed into each plotting call so the two series render into their intended panels.
For the top panel the code computes a compounded equity curve from a series of periodic returns. The expression returns.add(1).cumprod().sub(1) implements the correct multiplicative compounding: add(1) converts each period return r into a growth factor (1 + r), cumprod chains those factors to reflect reinvestment and geometric compounding across periods, and sub(1) converts the compounded factor back into a net cumulative return. We use multiplicative aggregation because performance across periods compounds — summing period returns would misstate growth, especially over volatile or long horizons.
The bottom panel summarizes transaction-level cash flow by day. transactions.groupby(transactions.dt.dt.day).txn_dollars.sum().cumsum() first groups transaction rows by the day extracted from their datetime (transactions.dt.dt.day), sums dollar amounts per day to get daily net activity, and then cumulatively sums those daily totals to show how transaction dollars accumulate over time. Plotting this cumulative transactions series beside the equity curve helps separate P&L generation from capital deployment or withdrawals, which is useful for understanding turnover, funding patterns, and whether realized cash flow aligns with the performance curve.
A couple of practical notes that explain why certain choices matter for quantitative evaluation: using cumprod on (1 + returns) preserves accurate compound returns and is the correct basis for metrics like drawdown or annualized return; by contrast, simple sums would bias those metrics. Grouping by transactions.dt.dt.day, however, groups by day-of-month (1–31) rather than chronological date, so if your transactions span multiple months this will mix different months’ same-day activity — use transactions.dt.date or a resample/floor operation if you intend an ordered daily timeline. Finally, tight_layout reduces label overlap for presentation, and sns.despine removes the top/right axes lines to produce a cleaner, publication-ready look. Together these plots give a compact view of strategy performance (compounded returns) and operational footprint (cumulative transactions), which are core inputs to capacity, turnover, and transaction-cost analysis.
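A sketch of the chronological alternative, grouping by calendar date rather than day-of-month:

# Group transactions by calendar date so the cumulative series is strictly chronological
daily_txn = transactions.groupby(transactions.dt.dt.date).txn_dollars.sum()
daily_txn.cumsum().plot(ax=axes[1], title='Cumulative Transactions')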
positions.index = positions.index.date
This single line is converting the DataFrame/Series index from timestamped datetimes to plain calendar dates: positions.index.date returns an array of python datetime.date objects (one per original timestamp) and assigning that back replaces the index with those date values. The practical effect is that the index no longer contains time-of-day or timezone information — only the trading day — so subsequent joins, groupings or aggregations will operate at the daily level rather than at intraday timestamps.
Why we do this in a performance-evaluation context: most P&L, exposure and turnover metrics are computed on a daily cadence, and we typically want the position snapshot that held during a trading day to align directly with that day’s returns series (which is indexed by calendar date). Converting the index to dates simplifies alignment with daily returns, portfolio-level aggregation, and daily reporting because it removes intra-day granularity that would otherwise cause mismatches when merging or grouping by day.
Important caveats and how they affect downstream logic: this assignment converts the index into an Index of python date objects (object dtype) rather than a Pandas DatetimeIndex, so you lose DatetimeIndex conveniences (time-based slicing, .tz_* methods, frequency inference, and efficient vectorized datetime ops). It also does not collapse multiple intraday snapshots into a single daily value — if you had several position records per day they will now share the same index value (duplicates) and you must explicitly aggregate (e.g., last/mean) before computing daily metrics. Finally, timezone and time-of-day information are discarded, which is fine if you intentionally evaluate at the calendar-day granularity but destructive if you later need intraday sequencing.
If the goal is to preserve a DatetimeIndex while zeroing the time component (which keeps datetime functionality and performance benefits), prefer positions.index = positions.index.normalize() or positions.index = positions.index.floor(‘D’) or convert to midnight UTC. Those alternatives give you a true DatetimeIndex at 00:00 for each day while achieving the same alignment intent but without losing datetime methods or performance characteristics.
fig, ax = plt.subplots(figsize=(15, 8))
sns.heatmap(positions.replace(0, np.nan).dropna(how='all', axis=1).T,
            cmap=sns.diverging_palette(h_neg=20, h_pos=200), ax=ax, center=0);
First you create the plotting canvas sized to be wide and readable so the time-axis and many instruments can be displayed without crowding; a large figure helps when you have high-frequency timestamps or dozens of assets. The next step is a small but important data-cleaning pipeline applied to the positions matrix before plotting: zeros are replaced with NaN and then any column that is entirely NaN is dropped. Replacing zeros with NaN intentionally removes neutral/no-position cells from the visualization so the plot highlights only times when the strategy actually held a long or short exposure. Dropping columns that end up all-NaN removes assets that were never traded, preventing unnecessary clutter and focusing the viewer on instruments that contributed to behavior and P&L.
The data is then transposed so that each asset becomes a row and time becomes the horizontal axis; this layout is more natural for inspection because it lets you scan across time for a single instrument and compare position patterns vertically across instruments. The heatmap itself uses a diverging color palette and is centered at zero, which is the key interpretive choice: centering ensures that long and short positions are colored symmetrically around the neutral point, so you can immediately distinguish sign (long vs short) and see relative magnitudes by color intensity. Because NaN values are rendered as empty, you get a sparse, focused visual that emphasizes position changes and holding patterns rather than constant flats.
In the context of quantitative strategy evaluation, this plot is a quick diagnostic for several performance-related questions: it shows where exposures concentrate across assets and time (position concentration risk), how often and when the strategy flips sign (turnover and directional changes), and typical holding durations for signals. Interpreting these patterns alongside return and risk metrics helps validate whether observed P&L drivers align with the intended allocation logic and whether position sizing or rebalancing frequency needs tuning. If you need to compare magnitudes across experiments, consider fixing the color scale (vmin/vmax) so different runs are directly comparable; otherwise the automatic scaling will emphasize internal variation within the plotted window.
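If you do want run-to-run comparability, here is a hedged sketch of the same plot with a fixed, symmetric color scale; the ±0.05 bound is an arbitrary illustration, not a value taken from the strategy above:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

BOUND = 0.05  # fixed weight bound shared across experiments (illustrative choice)

fig, ax = plt.subplots(figsize=(15, 8))
data = positions.replace(0, np.nan).dropna(how='all', axis=1).T
sns.heatmap(data,
            cmap=sns.diverging_palette(h_neg=20, h_pos=200, as_cmap=True),
            center=0, vmin=-BOUND, vmax=BOUND, ax=ax)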
positions.head()
Calling positions.head() is a quick sanity-check step in the backtest pipeline: it prints the first few rows of the positions table so we can inspect the initial state the strategy produced before we compute any performance metrics. Conceptually, positions is the time series that maps each timestamp to the exposure the strategy held (could be one column for quantity, or multiple columns for per-asset sizes, cost basis, realized/unrealized PnL, etc.). By looking at the head we confirm the schema and initial conditions — column names, index type and ordering, units (shares vs. notional), and whether the first rows are zeros, NaNs, or nonzero holdings created by an opening trade.
Why this matters: the first rows determine how we treat warm-up periods and how we align positions with price/return series for PnL calculations. If the head shows NaNs for derived columns (e.g., rolling statistics or indicator-aligned positions) we know we must trim or fill them before computing returns to avoid skewing metrics. If the index is unsorted, has duplicate timestamps, or uses an unexpected timezone/frequency, downstream aggregations (cumulative returns, drawdown, daily Sharpe) will be wrong — head() makes those problems obvious early. If sign conventions or units are inconsistent (longs positive vs negative), catching that here prevents inverted PnL and incorrect risk exposures in metric calculations.
Operationally, after inspecting head() we typically follow with a few checks: assert the index is monotonically increasing and aligned with the price series, ensure numeric dtypes for position and PnL columns, and decide whether to forward-fill or drop initial NaNs. Those choices influence how we compute realized vs. unrealized PnL, turnover, and exposure-weighted metrics. In short, positions.head() is not about learning Python syntax — it’s about validating the shape and initial state of the position time series so the subsequent quantitative strategy evaluation (PnL attribution, Sharpe, drawdown, exposure and turnover analysis) is built on clean, correctly-aligned data.
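Those checks can be made explicit. A small sketch of the kind of assertions you might run right after head(); the thresholds and fill policy are illustrative choices, not part of the original pipeline:
import numpy as np

# Structural checks before any metric computation
assert positions.index.is_monotonic_increasing, 'positions index is not sorted'
assert not positions.index.duplicated().any(), 'duplicate timestamps in positions'
assert positions.select_dtypes(include=np.number).shape[1] == positions.shape[1], \
    'non-numeric position columns present'

# Make the warm-up treatment an explicit decision rather than an accident
positions = positions.ffill().fillna(0)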
transactions.info()
Calling transactions.info() is the quick, diagnostic step we run first to understand the structure and health of the transactions DataFrame before any quantitative calculations. Internally pandas inspects the DataFrame’s index and each column and prints a compact summary: the index dtype and range, the total number of rows, for every column the non‑null count and the column dtype, and an overall memory usage estimate. This is purely observational — it doesn’t change the data — but it tells you at a glance whether columns that must be numeric, datetime, or categorical are typed correctly and whether there are missing values that need handling.
We rely on that summary because the accuracy and performance of downstream performance metrics hinge on those details. For example, if trade amounts or prices are object/string dtypes instead of numeric, aggregations (PnL, turnover, realized/unrealized calculations) will fail or produce incorrect results; if timestamps are not datetime types, time‑based resampling and rolling windows for returns and volatility will be impossible or wrong. Non‑null counts reveal partial fills, canceled trades, or missing metadata that could bias metrics like trade frequency, average holding time, or Sharpe ratio if not addressed. The memory usage hint is also important for backtests and vectorized calculations: large, inefficient dtypes (object for symbols, 64‑bit floats where 32‑bit suffice) can bloat memory and slow groupby/resample operations.
Practically, you use info() to decide concrete preprocessing steps: convert columns reported as object to numeric or datetime with to_numeric / to_datetime, cast high-cardinality strings to category where appropriate, and impute, filter, or otherwise address columns with unexpected null counts. For large datasets, call info(memory_usage='deep') to get an accurate footprint before deciding on dtype downsizing. Also remember info() prints to stdout and returns None, so for programmatic checks complement it with transactions.dtypes, transactions.isnull().sum(), .describe(), and selective value_counts to quantify the issues you observed.
In short: transactions.info() is the low‑cost reconnaissance that guides your cleaning and type‑casting decisions so subsequent aggregation, resampling, and metric computations are correct, efficient, and reproducible.
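A sketch of the follow-up preprocessing this reconnaissance typically triggers; the column names follow pyfolio's usual transactions layout (dt, amount, price, symbol, txn_dollars) and should be adjusted to your actual schema:
import pandas as pd

print(transactions.isnull().sum())        # quantify the gaps info() hinted at

transactions['dt'] = pd.to_datetime(transactions['dt'], utc=True)
transactions['amount'] = pd.to_numeric(transactions['amount'], errors='coerce')
transactions['price'] = pd.to_numeric(transactions['price'], errors='coerce')
transactions['symbol'] = transactions['symbol'].astype('category')

transactions.info(memory_usage='deep')    # re-check dtypes and true footprint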
Mean-Reversion Backtest with Portfolio Optimization
# setup stdout logging
format_string = '[{record.time: %H:%M:%S.%f}]: {record.level_name}: {record.message}'
zipline_logging = NestedSetup([NullHandler(level=DEBUG),
StreamHandler(sys.stdout, format_string=format_string, level=INFO),
StreamHandler(sys.stdout, format_string=format_string, level=WARNING),
StreamHandler(sys.stderr, level=ERROR)])
zipline_logging.push_application()
log = Logger('Algorithm')
This block configures application-wide logging for the algorithm runner so that runtime events, trading decisions, and errors are captured with the right level of detail and routed to the appropriate output streams. The overall intent is to make backtests and live runs auditable and debuggable while avoiding an avalanche of low-level noise that would obscure important signals used when evaluating quantitative strategy performance.
The format_string defines a compact, high-resolution timestamped line format: hours, minutes, seconds and microseconds, followed by the log level and message. The microsecond precision is deliberate — in quantitative evaluation you often need to order events precisely (e.g., order submissions vs. fills, market ticks, execution latencies) to compute correct PnL, slippage and other timing-sensitive metrics.
NestedSetup is being used to assemble multiple handlers and then activate them as the application-level logging configuration. The first handler is a NullHandler at DEBUG: this effectively absorbs DEBUG-level records so they don’t accidentally print to the console. Keeping DEBUG messages suppressed by default reduces noise during normal runs, yet having the handler present makes it straightforward to reconfigure or capture debug output when you need a deeper investigation.
Two StreamHandlers write to stdout: one configured at INFO and one at WARNING, both using the format_string. Routing INFO and WARNING to stdout keeps ordinary operational messages (strategy decisions, periodic performance summaries) together. There is also a StreamHandler to stderr at ERROR: sending errors to stderr separates critical failures from informational output, which is useful for monitoring, alerting and piping outputs in production environments. Note that having two handlers both sending to stdout at adjacent levels is usually unnecessary unless you intend different formatting or separate processing for warnings; if you don’t need that distinction, a single INFO-level handler would cover INFO and WARNING.
push_application() activates this composed setup so the handlers apply globally to loggers created afterward. Finally, Logger('Algorithm') creates the algorithm’s logger instance which the strategy code will use to emit messages under this configuration. In practice this gives you consistent, timestamped traces of algorithm behavior that are essential for reconstructing trades, measuring latency and aggregating performance metrics while keeping non-essential debug chatter out of normal outputs.
Algorithm Settings
# Settings
MONTH = 21
YEAR = 12 * MONTH
N_LONGS = 50
N_SHORTS = 50
MIN_POS = 5
VOL_SCREEN = 1000
These lines are pure configuration values that define the time scale, portfolio construction rules, and a liquidity filter for the strategy and for any downstream performance calculations. The most important thing to understand up front is that MONTH is set to 21 because we treat a “month” as 21 trading days; YEAR is derived from that as 12 * MONTH = 252 trading days, which is the conventional trading-day count used to annualize returns and volatilities. Any rolling windows, vol estimators, or annualization factors in the backtest should use YEAR (252) for converting per-period statistics into yearly metrics (e.g., multiply daily mean returns by YEAR for annualized return, multiply daily volatility by sqrt(YEAR) for annualized volatility).
N_LONGS and N_SHORTS (both 50) determine how many names the strategy will hold on each side when forming a long/short portfolio. In practice these are the “top-N” and “bottom-N” selection parameters: you rank the universe by the signal, take the top 50 as longs and the bottom 50 as shorts. This explicitly encodes your diversification and concentration trade-off: increasing N improves diversification and statistical stability of metrics, but tends to dilute per-name alpha and can increase turnover; decreasing N concentrates exposure and can exaggerate idiosyncratic noise in both performance and risk estimates. These counts also feed directly into P&L attribution and risk decomposition: position-level returns will be averaged or aggregated across N_LONGS/N_SHORTS when computing portfolio-level return, active exposure, and sector concentration metrics.
MIN_POS = 5 is a guardrail that prevents pathological small-n portfolios. As implemented in rebalance() below, it requires strictly more than MIN_POS candidates on each side (so at least 6) before the algorithm commits capital. The practical reason is twofold: statistical significance and risk control. With very few positions your sample of returns is too small for meaningful Sharpe/t-stat calculations and you expose the portfolio to outsized idiosyncratic risk or single-name failures. MIN_POS also interacts with your selection logic: if there aren’t enough names that pass any liquidity or signal thresholds, the engine should either scale down leverage, skip the rebalance, or fall back to alternate logic — this needs to be implemented consistently so backtest metrics don’t get biased by intermittent exposures.
VOL_SCREEN = 1000 is a liquidity (or volume) threshold used to screen the trading universe. Typically this represents a minimum average daily traded volume (or dollar volume) and ensures you only consider sufficiently liquid instruments to avoid unrealistic capacity and market-impact assumptions. The key point here is to confirm the unit (shares per day, thousands of shares, or dollars) and the averaging window used to compute it; whichever choice you make will materially affect universe size and thus return, turnover, and slippage estimates. Filtering by VOL_SCREEN improves the realism of performance metrics and stabilizes execution cost models; the higher the threshold, the more capacity you are implicitly targeting, but the smaller and more concentrated the opportunity set becomes.
Finally, think of these values as tightly coupled hyperparameters that shape both the signal deployment and how you evaluate results. MONTH/YEAR control time-aggregation and annualization (and any rolling lookbacks for vol/alpha smoothing). N_LONGS/N_SHORTS and MIN_POS determine the portfolio’s breadth and the reliability of statistical measurements like mean returns, Sharpe, and information ratio. VOL_SCREEN constrains capacity and influences realized turnover and slippage in backtests. When you interpret performance metrics, always report the configuration (these settings) and be prepared to stress-test them: change N_LONGS/N_SHORTS to see how concentrated bets drive performance, vary VOL_SCREEN to test capacity sensitivity, and confirm that annualization uses YEAR=252 so reported Sharpe and volatility numbers are comparable to industry conventions.
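As a concrete illustration of the annualization convention, a minimal sketch using YEAR; the daily return series here is synthetic and purely for demonstration:
import numpy as np
import pandas as pd

MONTH = 21
YEAR = 12 * MONTH  # 252 trading days

# Synthetic stand-in for the strategy's daily returns
daily_returns = pd.Series(np.random.normal(5e-4, 1e-2, size=3 * YEAR))

ann_return = daily_returns.mean() * YEAR          # annualized (arithmetic) return
ann_vol = daily_returns.std() * np.sqrt(YEAR)     # annualized volatility
sharpe = ann_return / ann_vol                     # Sharpe with a risk-free rate of zero
print(f'return {ann_return:.2%}, vol {ann_vol:.2%}, Sharpe {sharpe:.2f}')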
start = pd.Timestamp('2013-01-01', tz=UTC)
end = pd.Timestamp('2017-01-01', tz=UTC)
capital_base = 1e7
These three assignments declare the configuration that anchors the entire backtest/analysis: the historical time window to evaluate the strategy, and the initial cash that defines portfolio scale. The start and end timestamps set the evaluation horizon used to fetch market data, generate signals, and compute performance metrics. Specifying a timezone (UTC) is intentional: it forces a single, unambiguous time coordinate for all time-series operations so that joins, resampling, and event alignment don’t suffer from DST shifts or mismatched local times from multiple data providers. Practically, this window determines which trades are possible, which corporate actions and market regimes are included, and the sample size used to estimate statistics such as annualized return, Sharpe, and drawdown. (Note: different backtest engines treat endpoints differently — confirm whether end is inclusive or exclusive for your framework.)
capital_base = 1e7 sets the portfolio’s starting cash to ten million units (e.g., USD). This is a scale parameter with important implications: percentage-based metrics (returns, Sharpe, annualized volatility) are invariant to linear scaling of all positions, but absolute metrics — dollar P&L, maximum drawdown in currency terms, margin usage, and realized fees — depend directly on it. Choosing a large capital like 10M is often done to emulate institutional sizing, to avoid excessive discretization from minimum lot sizes or per-trade fixed costs, and to reveal constraints (e.g., position limits, borrowing/margin) that only appear at scale. Conversely, your transaction-cost model matters: if costs are per-share or per-order fixed fees, scaling capital will change cost as a fraction of portfolio and therefore can change net performance. Similarly, capital interacts with any position-sizing logic (fixed-dollar vs fixed-fraction): fixed-dollar sizing will produce different position profiles than fractional sizing when you change capital.
Operationally, treat these values as tunable inputs to experiments. Always verify the data feed completely covers [start, end] in UTC and that price series are adjusted consistently (splits/dividends) before computing metrics. When interpreting results, separate scale-dependent outcomes (absolute P&L, dollar drawdown, margin calls) from scale-invariant risk/return ratios, and run sensitivity checks on capital_base and the date window to ensure your conclusions about strategy quality are robust across different sample periods and portfolio sizes.
Mean-Reversion Factor
class MeanReversion(CustomFactor):
"""Compute deviation of latest monthly return from 12m average,
normalized by std dev of monthly returns"""
inputs = [Returns(window_length=MONTH)]
window_length = YEAR
def compute(self, today, assets, out, monthly_returns):
df = pd.DataFrame(monthly_returns)
factor = df.iloc[-1].sub(df.mean()).div(df.std())
out[:] = factor
This factor is producing a single cross-sectional score per asset that quantifies how extreme the most recent monthly return is relative to that asset’s own trailing 12-month distribution. We request Returns with window_length=MONTH and set the factor window_length to YEAR, so the engine hands compute a 2-D array containing 12 monthly returns per asset (time axis × asset axis). Converting that ndarray to a pandas DataFrame is just a convenience: columns correspond to assets and each row is one historical month, which makes columnwise statistics (mean, std, and the last row) straightforward and explicit.
The core calculation reads the latest month’s return (df.iloc[-1]) and subtracts the 12‑month mean for each asset, then divides by the 12‑month standard deviation: (latest − mean) / std. That produces a z‑score of the most recent monthly return relative to the asset’s own historical mean and volatility. Dividing by the standard deviation normalizes for volatility so that signals are comparable across assets with different return dispersion; without that step, high‑volatility names would dominate any cross-sectional ranking even if their latest return is not particularly extreme in standardized terms.
From a strategy perspective this is a mean‑reversion signal: large positive z‑scores mean the asset’s latest return is unusually high relative to its recent history (a candidate for short or reduced weight under mean reversion), while large negative z‑scores indicate an unusually low recent return (a candidate for long). Because the factor is standardized per asset, it is appropriate for cross‑sectional ranking, aggregation with other factors, and calculation of performance metrics like Information Coefficient (IC), turnover, or return attribution without being dominated by volatility differences.
A few practical caveats and potential improvements tied to quantitative evaluation: pandas’ std uses sample std (ddof=1) by default, which matters with only 12 observations; if you want population normalization use ddof=0 or an EWMA volatility estimator to downweight old data. The code does not guard against zero or near‑zero std, so assets with constant returns can produce inf/NaN — add a small epsilon or mask those cases. You may also want winsorization or robust scaling (median/MAD) to reduce sensitivity to outliers before feeding this into a portfolio construction routine. Finally, the computed Series is assigned into out[:] to satisfy the factor API’s expected 1‑D output aligned with the engine’s asset ordering, so ensure there’s a consistent mapping between DataFrame columns and the engine’s asset list.
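A hedged variant of the compute method incorporating those guards; the 1e-8 threshold, ddof choice, and zero-fill are illustrative decisions, not part of the original factor:
def compute(self, today, assets, out, monthly_returns):
    df = pd.DataFrame(monthly_returns)
    std = df.std(ddof=0)                                        # population std over 12 obs
    z = df.iloc[-1].sub(df.mean()).div(std.where(std > 1e-8))   # NaN where std ~ 0
    out[:] = z.fillna(0)                                        # neutralize degenerate assets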
Create a pipeline
The Pipeline created by compute_factors() returns a table with longs and shorts columns flagging, respectively, the 50 stocks (N_LONGS) with the largest negative deviations and the 50 stocks (N_SHORTS) with the largest positive deviations of their most recent monthly return from its annual average, with deviations normalized by the standard deviation, plus a ranking column containing the full cross-sectional factor rank. The Pipeline also limits the universe to the 1,000 stocks (VOL_SCREEN) with the highest average dollar volume over the previous 30 trading days.
def compute_factors():
"""Create factor pipeline incl. mean reversion,
filtered by 30d Dollar Volume; capture factor ranks"""
mean_reversion = MeanReversion()
dollar_volume = AverageDollarVolume(window_length=30)
return Pipeline(columns={'longs' : mean_reversion.bottom(N_LONGS),
'shorts' : mean_reversion.top(N_SHORTS),
'ranking': mean_reversion.rank(ascending=False)},
screen=dollar_volume.top(VOL_SCREEN))
This small function builds a self-contained factor pipeline that outputs both the discrete trading signals used to construct a long/short portfolio and a continuous ranking of securities for analysis. We start by creating a MeanReversion object that encapsulates our mean-reversion score for each asset (the signal that says which names are unusually high or low relative to their recent history). Separately we compute a 30-day Average Dollar Volume and use it as a screen: the pipeline will only consider the top VOL_SCREEN names by 30-day dollar volume. That screening step is critical up front because it constrains the universe to liquid names, which reduces execution costs, avoids spurious signals from illiquid issues, and produces more stable factor statistics for evaluation.
Within the screened universe we derive three outputs. The .bottom(N_LONGS) call on the mean‑reversion factor selects the N_LONGS lowest‑scoring securities (the ones we expect to mean‑revert upward) to be candidate longs; .top(N_SHORTS) selects the N_SHORTS highest‑scoring names as candidate shorts. Producing explicit long and short columns like this maps directly to tradeable, backtestable positions and makes it straightforward to compute realized P&L, turnover, transaction costs, and portfolio-level performance metrics (Sharpe, drawdowns, etc.) for a long‑short construct. The third column, mean_reversion.rank(ascending=False), captures a continuous ranking of the factor across the universe. Keeping the rank as a separate column is important for quantitative evaluation: it lets you compute rank‑based diagnostics such as rank Information Coefficient (Spearman IC), rank decay/turnover, factor concentration, and to experiment with alternative portfolio constructions (e.g., weights proportional to rank rather than equal‑weighted top/bottom buckets). The ascending=False argument is deliberately set so the orientation of the rank matches the sign convention used downstream in our performance calculations (i.e., it aligns the direction of “higher rank” with how we interpret expected returns in the rest of the system).
In short, this pipeline encodes the business decision to evaluate a mean‑reversion factor on a liquidity‑screened universe and to produce both actionable long/short selections for backtesting and a ranked signal for diagnostic metrics. That combination makes it easy to measure not only raw returns but also factor quality (IC, hit rates), stability (turnover), and implementation friction (impact and slippage) when performing quantitative strategy evaluation.
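As one example of how the recorded ranking column feeds diagnostics, here is a sketch of a daily rank IC computation; the forward_returns input and its alignment are assumptions — in this workflow you would build it from the recorded prices:
import pandas as pd
from scipy.stats import spearmanr

def daily_rank_ic(ranking: pd.Series, forward_returns: pd.Series) -> float:
    """Spearman correlation between one day's factor ranks and the
    subsequent-period returns, both indexed by asset."""
    aligned = pd.concat([ranking, forward_returns], axis=1, join='inner').dropna()
    ic, _ = spearmanr(aligned.iloc[:, 0], aligned.iloc[:, 1])
    return ic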
The before_trading_start() function ensures the pipeline executes daily and that its results — including current prices — are recorded.
def before_trading_start(context, data):
"""Run factor pipeline"""
context.factor_data = pipeline_output('factor_pipeline')
record(factor_data=context.factor_data.ranking)
assets = context.factor_data.index
record(prices=data.current(assets, 'price'))
This function is the daily setup step that pulls the pre-computed factor results into the algorithm, snapshots a couple of key fields, and makes them available for the rest of the trading day. The first line calls pipeline_output('factor_pipeline') to retrieve the DataFrame that the pipeline produced for this run; that DataFrame is expected to have one row per asset (index) and columns for whatever factor outputs the pipeline emits — typically raw factor values and a derived ranking. We assign that DataFrame to context.factor_data so downstream logic (rebalance handlers, risk checks, order sizing) can read a stable, single-day snapshot rather than re-running or re-fetching during market hours.
Immediately after fetching the pipeline, the code records the ranking column via record(factor_data=context.factor_data.ranking). This is not a debugging print but a persistence hook into the backtest/telemetry system: recording the rankings each day creates a time series you can use later to evaluate factor behaviour (e.g., Information Coefficient, rank decay, cross-sectional performance by quantile). Recording daily snapshots of the factor exposure is essential for quantitative strategy evaluation because it ties model signals to realized performance and lets you compute the usual performance metrics (IC, factor returns, turnover, etc.) consistently across days.
Next we build the universe for price lookup by taking assets = context.factor_data.index. Using the pipeline’s index ensures price lookups and subsequent position decisions only involve securities the pipeline actually returned (and implicitly respects the pipeline’s filters and masks). The final call, record(prices=data.current(assets, 'price')), grabs a synchronized price vector for those same securities at the market-open snapshot and records it. Persisting these prices alongside the recorded rankings lets you compute forward returns (price changes over your target horizon) aligned to the signal timestamp — that alignment is the core requirement for unbiased IC and factor return calculations.
Why this structure matters: pipelines are designed to run once per day and provide a clean, consistent signal snapshot; storing that snapshot in context avoids inconsistent reads during the trading day and ensures portfolio construction uses the exact pipeline output. Recording both ranks and prices at this point gives you the minimal dataset to later compute performance metrics (rank-to-return correlations, long-short factor returns, turnover from rebalancing, etc.) and to debug signal/price mismatches. A couple practical considerations: pipeline_output can include NaNs or removed (delisted) assets, so downstream code should handle missing values before placing orders; data.current will error if passed unknown instruments, so ensure the pipeline’s index matches tradable assets; and if you need forward returns beyond the next close, persist the appropriate price snapshots and schedule return computations consistently.
Setting Up Rebalancing
The new rebalance() method submits trade orders to exec_trades() for assets flagged by the pipeline for long and short positions, using equal but opposite weights. It also divests any current holdings that are no longer included in the factor signals:
def exec_trades(data, positions):
"""Place orders for assets using target portfolio percentage"""
for asset, target_percent in positions.items():
if data.can_trade(asset) and not get_open_orders(asset):
order_target_percent(asset, target_percent)
This small function is the execution bridge between the portfolio optimizer (which produces target percentages) and the trading engine that actually moves capital. Its role is to iterate the desired allocation map and, for each asset, attempt to convert the target percent into real orders only when it makes sense to do so. Effectively it asks, “is this asset tradable right now?” and “do we already have pending activity for this asset?” before instructing the broker to move toward the target allocation.
Concretely, the first check (can_trade) filters out assets that are currently impossible or undesirable to touch — for example, because of halts, lack of liquidity, or trading restrictions in the backtest/market-session model. This is important for realistic strategy evaluation: trying to place orders for non-tradable assets would either fail or produce biased performance numbers, so the check keeps execution realistic and avoids artificially optimistic fills. The second check (not get_open_orders) prevents you from submitting duplicate or conflicting orders while a previous order for the same asset is still outstanding; that avoids accidental double-sizing, flip-flopping allocations, and complicated partial-fill interactions that would distort turnover and transaction-cost metrics.
When both checks pass the code calls order_target_percent with the requested target. That call is where the target percent is translated into an actual order size — taking into account current portfolio value, cash, current position in the asset, and rounding to tradable lot sizes — so the portfolio moves toward the specified weight. Using target-percent style orders simplifies rebalancing logic and performance attribution because you express intent in relative terms (risk or capital exposure) rather than absolute share counts, which makes P&L and turnover calculations cleaner across changing portfolio value.
From the perspective of quantitative strategy evaluation and reporting, this execution policy has direct effects on the performance metrics you will compute. It constrains what trades are attempted (affecting realized/unrealized P&L), limits duplicate submissions (reducing artificial turnover and unnecessary transaction costs), and enforces market realism through tradability checks (reducing look-ahead or execution bias). However, it also leaves several practical details implicit — partial fills, slippage, rounding, margin/leverage handling, and how simultaneous orders are prioritized — which will all affect metrics like effective capacity, slippage-adjusted returns, turnover, and trade-level attribution. For faithful evaluation you should ensure the environment implementing can_trade, get_open_orders, and order_target_percent models these execution frictions and logs sufficient information (timestamps, intended vs filled sizes, fees) so performance metrics reflect true implementable results.
def rebalance(context, data):
"""Compute long, short and obsolete holdings; place orders"""
factor_data = context.factor_data
assets = factor_data.index
longs = assets[factor_data.longs]
shorts = assets[factor_data.shorts]
divest = context.portfolio.positions.keys() - longs.union(shorts)
exec_trades(data, positions={asset: 0 for asset in divest})
log.info('{} | {:11,.0f}'.format(get_datetime().date(),
context.portfolio.portfolio_value))
# get price history
prices = data.history(assets, fields='price',
bar_count=252+1, # for 1 year of returns
frequency='1d')
# get optimal weights if sufficient candidates
if len(longs) > MIN_POS and len(shorts) > MIN_POS:
try:
long_weights = optimize_weights(prices.loc[:, longs])
short_weights = optimize_weights(prices.loc[:, shorts], short=True)
exec_trades(data, positions=long_weights)
exec_trades(data, positions=short_weights)
except Exception as e:
log.warn('{} {}'.format(get_datetime().date(), e))
# exit remaining positions
divest_pf = {asset: 0 for asset in context.portfolio.positions.keys()}
exec_trades(data, positions=divest_pf)
This function is the periodic rebalancer: it turns the current factor signals into concrete trades so the live portfolio reflects the strategy’s intended long and short exposures, and it guards the portfolio against stale or unmodeled positions that would distort P&L and performance metrics.
We begin by reading the latest factor signals from context.factor_data and deriving the universe of assets under consideration. The code uses boolean masks on that index to split the universe into the current long candidates and the current short candidates. Immediately after determining the new signal set, we compute “divest” — the set difference of the existing portfolio positions and the union of the new longs and shorts. That tells us which assets are no longer signaled and therefore should be closed out. We call exec_trades with positions set to zero for each of those assets to remove obsolete holdings right away. This step prevents old, unintentional exposures from contaminating subsequent optimization and performance reporting.
Next we record a snapshot of portfolio value for monitoring/logging, then fetch a one-year price history for all assets in the universe. The 252+1 day window is deliberate: one trading year of daily bars provides a stable sample for return and volatility estimation, and the +1 is commonly used to allow calculation of returns or to align lagged windows when needed by the optimizer. This historical series is the raw input for the weight-optimization step that follows.
Before running an optimizer we enforce a practical stability guard: only run optimization if there are more than MIN_POS candidates on each side. Requiring a minimum number of longs and shorts avoids degenerate solutions and overfitting (an optimizer with too few instruments can produce unstable weights that blow up risk and make performance metrics meaningless). When the guard passes, the code calls optimize_weights separately for longs and shorts — passing the long-price history to produce long-side target positions and the short-price history with short=True to produce short-side targets (short=True likely alters constraints or signs so these weights represent short exposure). The optimizer’s job (not shown) is to convert return/volatility/correlation information from the price history into portfolio weights that satisfy the strategy’s objectives and constraints (risk targets, leverage limits, neutrality constraints, etc.), which is why we supply a long stable history rather than only recent ticks.
We then place the computed target trades by calling exec_trades for the long and short weight dictionaries. All of this is wrapped in a try/except so that an optimization failure (numerical issues, solver errors, missing data) does not cause the entire scheduler to crash. On exception we log a warning with the date and the error. Importantly, after the try/except block the function constructs a final divest_pf from the current positions and calls exec_trades to set those to zero. That final exit ensures the portfolio is not left in a partially updated or inconsistent state after a failed optimization or other runtime error; in effect, if we can’t compute reliable new targets we fall back to a clean slate to avoid unintended exposures that would corrupt subsequent performance and risk evaluation.
In terms of the overall goal — quantitative strategy evaluation and performance metrics — this flow ensures the live portfolio always reflects valid factor signals or else is safely de-risked. Closing obsolete positions reduces drift between signals and holdings, the one-year history gives the optimizer a robust statistical base for risk-aware weight construction, the MIN_POS check prevents unstable, low-degree-of-freedom fits, and the conservative exception handling plus final divestment prevents silent contamination of P&L that would bias backtests and live metrics. If you want to tighten things further for evaluation purposes, consider (a) making the optimizer return diagnostics so we can record objective values and constraint slacks into the performance database, (b) differentiating partial rollbacks (e.g., keep long side if short-side optimization fails) versus full divestment, and (c) explicitly preserving execution timestamps so realized turnover and slippage can be included in the strategy’s performance calculations.
Portfolio Weight Optimization
def optimize_weights(prices, short=False):
returns = expected_returns.mean_historical_return(
prices=prices, frequency=252)
cov = risk_models.sample_cov(prices=prices, frequency=252)
# get weights that maximize the Sharpe ratio
ef = EfficientFrontier(expected_returns=returns,
cov_matrix=cov,
weight_bounds=(0, 1),
solver='SCS')
ef.max_sharpe()
if short:
return {asset: -weight for asset, weight in ef.clean_weights().items()}
else:
return ef.clean_weights()
This function’s job is to produce a portfolio weight vector that (in the normal case) maximizes the Sharpe ratio using historical price data; the returned weights are what you’d plug into downstream performance calculations (expected portfolio return, volatility, Sharpe, drawdown, etc.). The data flow is straightforward: we first convert raw prices into the two inputs the optimizer needs — an annualized expected return vector and an annualized return covariance matrix — then hand those to a mean-variance optimizer that finds the Sharpe-maximizing allocation under long-only constraints, and finally we clean and return the resulting weights.
Concretely, expected_returns.mean_historical_return(prices, frequency=252) computes an annualized expected return for each asset from historical price series (frequency=252 uses trading days to annualize). We do this because the optimizer’s objective and the common performance metrics (annualized return, annualized volatility, Sharpe) all work on the same annualized scale; choosing 252 ensures consistency with standard reporting. The covariance is estimated with risk_models.sample_cov(prices, frequency=252), producing an annualized sample covariance matrix of asset returns. This covariance captures pairwise risk and is the key input determining portfolio volatility and diversification effects. Note that both are simple historical estimators — they’re quick and reasonable as a baseline, but they are subject to estimation noise; in production you may want shrinkage estimators (Ledoit–Wolf), robust methods, or out-of-sample validation to reduce overfitting.
We then construct an EfficientFrontier object with those return and covariance inputs and with weight_bounds=(0, 1). That bound enforces long-only positions and typically a sum-to-one constraint (i.e., a fully invested portfolio), which ensures the solution is interpretable as a conventional long-only portfolio with no leverage. The solver='SCS' argument selects a convex solver that handles the quadratic programming problem reliably for the chosen constraints; different solvers have different numerical behavior, so SCS can be a pragmatic choice for stability with these settings. Calling ef.max_sharpe() instructs the optimizer to maximize (E[R] - r_f) / sigma — the Sharpe ratio — under the provided constraints. The risk-free rate used is whatever the library assumes unless you pass one explicitly (PyPortfolioOpt, for example, defaults to an annualized 2%), so state it when reporting Sharpe figures.
ef.clean_weights() converts the raw optimizer output into a neat mapping of asset → weight by zeroing tiny numerical noise and rounding for readability; this is the weight dictionary returned to callers in the normal (short=False) path. If short=True the function returns the negative of those cleaned weights. That negation is a simple shortcut to produce a short portfolio with the same relative exposures but inverted signs: a long-only solution summing to +1 becomes a net short portfolio summing to -1. Important caveats here: flipping signs is not the same as re-optimizing under true short or market-neutral constraints. The original optimization assumed long-only bounds and no borrowing/leverage costs; simply negating weights will invert expected returns and volatilities and may produce a portfolio that violates your actual trading constraints (margin, shorting limits, or a desire for market-neutral exposures). If you need a genuine short-capable optimization, it’s better to set appropriate weight_bounds (e.g., negative lower bound) and re-run the optimizer so the objective and constraints reflect shorting costs and limits.
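If genuine short exposure is required, a hedged sketch of what re-optimizing (rather than negating) might look like with PyPortfolioOpt-style calls; the shrinkage estimator and symmetric bounds are illustrative choices, and real constraints (net exposure, leverage, borrow costs) would still need to be encoded explicitly:
from pypfopt import expected_returns, risk_models
from pypfopt.efficient_frontier import EfficientFrontier

def optimize_long_short_weights(prices):
    """Sketch: allow negative weights and shrink the covariance estimate,
    instead of flipping the sign of a long-only solution."""
    mu = expected_returns.mean_historical_return(prices, frequency=252)
    cov = risk_models.CovarianceShrinkage(prices, frequency=252).ledoit_wolf()
    ef = EfficientFrontier(mu, cov, weight_bounds=(-1, 1), solver='SCS')
    ef.max_sharpe()                 # weights may now be negative; they still sum to 1
    return ef.clean_weights()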
From the perspective of Quantitative Strategy Evaluation and Performance Metrics, this function is a building block: it yields the candidate allocation whose in-sample Sharpe the optimizer maximized, and those weights feed directly into calculations of portfolio expected return (weights · expected_returns), portfolio variance (weightsᵀ Σ weights), and realized P&L tracking. Because it uses plain historical means and sample covariance, you should evaluate the resulting strategy out-of-sample, consider regularization/shrinkage and transaction cost constraints, and be explicit about the risk-free rate and leverage assumptions when reporting Sharpe or other metrics.
Initialize the Backtest
The rebalance() method runs according to the date_rules and time_rules provided to the schedule_function() utility. It executes at the beginning of the week, immediately after market_open, as defined by the built-in US_EQUITIES calendar (see docs for details on rules).
You can specify a trade commission both in relative terms and as a minimum amount. You can also define slippage — the cost incurred by an adverse price change between the trade decision and its execution.
def initialize(context):
"""Setup: register pipeline, schedule rebalancing,
and set trading params"""
attach_pipeline(compute_factors(), 'factor_pipeline')
schedule_function(rebalance,
date_rules.week_start(),
time_rules.market_open(),
calendar=calendars.US_EQUITIES)
set_commission(us_equities=commission.PerShare(cost=0.00075, min_trade_cost=.01))
set_slippage(us_equities=slippage.VolumeShareSlippage(volume_limit=0.0025, price_impact=0.01))
initialize is configuring the backtest’s engine so the strategy’s signals, execution timing, and realism of trading costs are all defined up front — which is critical for reliable quantitative strategy evaluation and for obtaining performance metrics you can trust.
First, the code attaches a factor pipeline under the name 'factor_pipeline'. compute_factors() builds the cross-sectional computations (factor transformations, filters, and any universe selection) that you will use to score and rank securities. Attaching the pipeline registers those computations with the engine so they are executed automatically each market day and their outputs are available (e.g., via pipeline_output) before the market opens. This separation ensures your signal-generation is reproducible and avoids look-ahead: factors are computed once per day, then used by your trading logic to form orders for that day.
Next, the schedule_function call wires a weekly rebalancing job that runs at market open on the first trading day of each week according to the US equities calendar. The choice of week_start and market_open encodes an execution cadence and latency assumption: you want decisions updated weekly, using the freshest pre-market factor outputs, and then turned into orders immediately at the open. Picking a weekly cadence is a balance between responsiveness to new information and controlling turnover/costs; doing it at market open ensures the scheduled function sees the pipeline outputs computed earlier that morning and that fills happen at a consistent time for performance attribution.
Finally, the two setters define the simulated transaction cost model, which directly affects realized P&L and all downstream performance metrics (net returns, Sharpe, information ratio, turnover, and drawdowns). set_commission applies a per-share commission with a minimum per trade; this prevents tiny micro-trades from appearing artificially cheap and forces the strategy to internalize fixed per-trade friction. set_slippage attaches a VolumeShareSlippage model that constrains execution relative to market liquidity (volume_limit) and scales price impact with order size (price_impact). Together these parameters impose realistic liquidity constraints and market impact on each fill, which penalizes high turnover and large relative-position changes and therefore produces more conservative and credible estimates of net alpha and risk-adjusted performance.
In short: attach_pipeline produces daily factor signals, schedule_function determines when those signals are converted into trades, and the commission/slippage settings make the execution realistic. These pieces together control signal timing, execution assumptions, and cost realism — all of which are essential to evaluating whether a factor or portfolio truly produces repeatable, robust performance after trading frictions.
Run the algorithm
Calling run_algorithm() executes the algorithm and returns a DataFrame containing the backtest performance.
backtest = run_algorithm(start=start,
end=end,
initialize=initialize,
before_trading_start=before_trading_start,
bundle='quandl',
capital_base=capital_base)
This single call kicks off a full simulated backtest run and returns the execution artifacts you need for quantitative strategy evaluation. run_algorithm orchestrates the algorithm lifecycle over the historical window defined by start and end: it first invokes initialize once to let your algorithm set up state (e.g., register scheduled functions, create pipelines, set commission/slippage models or initial parameters), then steps through each trading day in the selected data bundle, calling before_trading_start at the top of every trading day to refresh daily signals and pipeline results before any orders are placed for that day. If your algorithm publishes intraday logic (handle_data or minute callbacks), the engine will continue into the appropriate intraday event loop for each date; otherwise it will proceed with the end-of-day ordering flow.
Specifying bundle='quandl' selects the historical pricing and corporate-action dataset the backtest will read. That choice matters because the bundle determines which assets exist on which dates, how prices are adjusted for splits/dividends, and whether delisted assets are present — factors that directly affect realized returns, turnover analysis, and survivorship-bias considerations when you evaluate strategy performance. capital_base sets the simulated starting cash/equity for the portfolio; this normalizes absolute P&L and is the denominator when computing portfolio-level metrics such as returns, drawdown, and leverage. In short, capital_base affects scaling of dollars and therefore how you interpret percentage returns and risk statistics.
As run_algorithm advances through the date range it executes your order logic against a simulated market: it applies the configured execution model (slippage, commission, fill rules), updates positions and cash, and records every trade, position snapshot, and portfolio-level timeseries. The function returns a backtest object (timeseries of portfolio value, returns, P&L and typically supplemental artifacts such as transactions and positions) that you then use to compute performance metrics — cumulative returns, annualized return, volatility, Sharpe, maximum drawdown, turnover, and any custom diagnostics. Because initialize and before_trading_start are the only guaranteed hooks before trading each day, design them to prepare the state you need for robust signal generation and avoid look-ahead; likewise choose your bundle and capital_base deliberately since they materially affect metric interpretation and the validity of conclusions you draw from the backtest.
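A sketch of the kind of headline metrics you can compute directly from the result, assuming (as with zipline's performance DataFrame) it carries a daily 'returns' column:
import numpy as np

rets = backtest['returns']                         # daily strategy returns
equity = (1 + rets).cumprod()                      # growth of $1
cum_return = equity.iloc[-1] - 1
ann_vol = rets.std() * np.sqrt(252)
sharpe = rets.mean() / rets.std() * np.sqrt(252)   # risk-free rate of zero assumed
max_drawdown = (equity / equity.cummax() - 1).min()
print(f'cum {cum_return:.1%}, vol {ann_vol:.1%}, Sharpe {sharpe:.2f}, maxDD {max_drawdown:.1%}')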
Extract inputs for pyfolio
The extract_rets_pos_txn_from_zipline utility in pyfolio extracts the data required to compute performance metrics.
returns, positions, transactions = extract_rets_pos_txn_from_zipline(backtest)
This single call is the hand-off that turns a Zipline backtest object into the three core datasets you need for quantitative evaluation: time-series returns, the position history, and the trade/transaction log. Conceptually the function reads the internal backtest state (portfolio accounting, fill events, fees, and performance trackers) and materializes those internal structures into analyzable pandas objects so you can compute performance and risk metrics off-platform. We do this extraction because the backtest itself is an execution engine; for robust strategy evaluation you want independent, well-shaped representations of realized returns, exposures over time, and the atomic trade events that caused changes in those exposures.
Returns: this output is the market and/or portfolio return series (often a Series of portfolio-level returns and sometimes a DataFrame of per-asset returns). Its purpose is to represent mark-to-market P&L per unit time (daily by default in Zipline). When extracting returns the function typically normalizes their alignment to the trading calendar, handles NaNs, and clarifies whether these are simple or log returns and whether transaction costs have been applied. Why that matters: the types and alignment of returns determine every downstream metric (cumulative return, annualized return, Sharpe, drawdowns), so the extractor must preserve the correct sign, frequency, and cost adjustment so your performance calculations aren’t biased.
Positions: this is a time-indexed snapshot of holdings — quantities, notional exposures, and commonly derived fields such as market value or weight in the portfolio for every timestamp. Positions are crucial for exposure and risk analysis (sector/asset class exposure, factor exposure, concentration) and for computing turnover and unrealized P&L. The extraction step typically fills forward positions across market hours/days so you can align position snapshots with return timestamps, and may convert raw share counts into notional values in the base currency. The extractor may also annotate positions with instrument identifiers and cost basis so you can reconcile realized vs. unrealized P&L.
Transactions: this is the event-level ledger of executed orders — timestamps, asset identifiers, executed amount (shares), executed price, commissions/fees, and any slippage or order metadata (order id, fill status). Transactions let you compute realized P&L, total transaction costs, per-trade cost statistics, and execution quality. They are also the primary input for turnover and for causally linking position changes to specific trades. The function often normalizes timestamp timezones, aggregates partial fills, and ensures fees are correctly attributed to the trade that incurred them.
How the pieces work together: transactions are the discrete actions that change positions; positions are the state that gets marked-to-market each period; returns are the period-to-period P&L resulting from mark-to-market changes plus realized effects of transactions and costs. For rigorous evaluation you use returns for portfolio-level metrics (cumulative return, volatility, Sharpe, max drawdown), positions for exposure and risk decomposition (factor attribution, concentration, average leverage), and transactions for cost analysis and turnover. The extraction deliberately separates these concerns so you can (a) validate accounting by reconciling cumulative P&L from transactions + unrealized changes against the returns series, (b) run risk models on positions independent of trade noise, and (c) compute trade-level statistics and slippage models.
Practical checks and pitfalls to watch for after extraction: confirm index alignment and frequency between returns and positions; verify whether returns are net-of-costs or gross (and whether you need to add fees back in for some analyses); inspect transaction timestamps for timezone or calendar mismatches; look for stale positions or canceled orders that may appear in the transaction log; and ensure the cost basis and commission fields match your commission model. Immediately validate by summing realized P&L from transactions + end-of-day unrealized P&L from positions to match cumulative portfolio equity from the returns series — if those don’t reconcile, you’ve likely got an alignment or fee attribution issue to fix before computing metrics.
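A few of those checks written out as a sketch; column names follow the pyfolio conventions used above, and the prints are illustrative rather than a full reconciliation:
# Illustrative consistency checks across the three extracted objects
assert returns.index.is_monotonic_increasing, 'returns index is not sorted'

missing_days = returns.index.difference(positions.index)
print('return dates without a positions snapshot:', len(missing_days))

gross_traded = transactions['txn_dollars'].abs().sum()    # total traded notional
has_fee_column = 'commission' in transactions.columns     # verify fee attribution is possible
print(f'gross traded: {gross_traded:,.0f}; commission column present: {has_fee_column}')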
In short: this extraction is the canonical demarcation between the backtest engine and the analysis layer. It provides the normalized returns, position history, and transaction ledger you need to calculate accurate performance metrics, attribute P&L, quantify trading costs, and run risk/turnover analyses.
Persist Results for Use with pyfolio
with pd.HDFStore('backtests.h5') as store:
store.put('returns/pf_opt', returns)
store.put('transactions/pf_opt', transactions)
This block opens an on-disk HDF5 container (backtests.h5) and writes two pandas objects into it: one under the hierarchical key returns/pf_opt and the other under transactions/pf_opt. Conceptually this is taking the two primary outputs of a backtest — the time series of portfolio returns and the transaction log — and persisting them in a central, indexed store so downstream evaluation, reporting and reproducibility can use the exact same inputs.
The context manager pattern ensures the file is flushed and closed when the writes complete, reducing risk of partial writes or file corruption. Using HDF5 here is a deliberate choice for time-series and tabular backtest outputs because pandas’ HDFStore preserves indices and dtypes, is space-efficient for large numeric arrays, and provides much faster read/write than e.g. CSV for bulk pandas objects. The hierarchical keys (returns/… and transactions/…) encode both the data type and the strategy identifier (pf_opt), which makes it straightforward to store multiple strategies or backtest runs in one file and to retrieve them by key when computing performance metrics or doing cross-strategy comparisons.
A few operational implications that explain why this exact call pattern matters: store.put will write (and overwrite if present) the object in one shot — that’s appropriate when you have a finished, deterministic output to persist for later metric calculations. If you were incrementally generating rows (e.g., streaming transactions) you’d instead use append/format='table', but append is slower and not necessary for finalized backtest artifacts. Also note HDF5 files are not designed for concurrent multi-writer access; writes should be serialized by the process that owns the backtest run. Finally, because downstream routines (cumulative return, annualized volatility, Sharpe, drawdown, trade-level P&L attribution) rely on precise indices and dtypes, storing returns and transactions as pandas objects ensures those routines operate on stable, reproducible inputs rather than re-running the backtest or re-parsing raw logs.
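For contrast, a sketch of that incremental pattern; new_fills is a hypothetical chunk of newly executed trades, and the table format is what makes later appends and queries possible:
import pandas as pd

with pd.HDFStore('backtests.h5') as store:
    # Appending requires the queryable 'table' format rather than the default fixed format
    store.append('transactions/pf_opt_live', new_fills,   # new_fills: hypothetical DataFrame of trades
                 format='table', data_columns=['symbol'])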
with pd.HDFStore('backtests.h5') as store:
returns_pf = store['returns/pf_opt']
tx_pf = store['transactions/pf_opt']
returns_ew = store['returns/equal_weight']
tx_ew = store['transactions/equal_weight']
The with-block opens an HDF5-backed pandas store and, in one safe operation, loads four data objects into memory: the return series and the transaction log for the optimized portfolio (pf_opt), and the return series and transaction log for the equal-weight benchmark (equal_weight). Using the context manager ensures the file handle is closed automatically when the reads complete, which prevents file corruption and resource leaks when you work with large backtest archives.
We read both “returns/…” and “transactions/…” because they serve distinct but complementary roles in performance evaluation. The returns objects give the period-by-period P&L time series you’ll use to compute primary metrics — cumulative return, annualized return, volatility, Sharpe, max drawdown, and time-series diagnostics. The transactions objects contain the discrete trade records (size, side, price, timestamp, possibly fees or slippage fields) needed to compute turnover, realized transaction costs, market impact, and to reconcile gross versus net returns. Combining these two types of data is how you move from raw backtest outputs to credible, economically adjusted performance metrics.
Practically, loading both strategies’ returns and trades side-by-side supports apples-to-apples comparisons. You’ll typically verify their indices align (same date/frequency/timezone), confirm there are no duplicate timestamps, and ensure transactions reference the same universe and units as the returns series. From there you can produce adjusted returns by subtracting per-trade costs derived from the transactions log, calculate turnover and cost drag statistics from tx_pf and tx_ew, and compute benchmark-relative metrics (excess returns, active risk) using the equal-weight series as the baseline. Reading the data from hierarchical keys like returns/pf_opt and transactions/equal_weight also reflects a reproducible storage layout: returns and transaction logs are kept separately but linked by strategy name, which simplifies batch comparisons across many backtests.
A couple of practical cautions tied to quantitative evaluation: because HDF5 can store very large objects, consider selecting date ranges or columns if memory is constrained (store.select or chunked reads) rather than loading everything at once. Also validate field semantics — e.g., whether the returns are arithmetic or log returns and whether transaction records already include fees — so your downstream metric computations (net returns, turnover-normalization, execution cost attribution) are correct. In short, this block is the ingestion step that brings both the realized performance series and the trade-level evidence into memory so you can perform rigorous, auditable strategy evaluation and construct economically meaningful performance metrics.
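A sketch of such a bounded read; this only works if the object was written in the queryable 'table' format (e.g. via put(..., format='table') or append):
import pandas as pd

with pd.HDFStore('backtests.h5') as store:
    # Pull only the evaluation window instead of the full history
    recent_returns = store.select('returns/pf_opt', where='index >= "2016-01-01"')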
Plot results
fig, axes = plt.subplots(nrows=2, figsize=(14, 6))
returns.add(1).cumprod().sub(1).plot(ax=axes[0], title='Cumulative Returns')
transactions.groupby(transactions.dt.dt.day).txn_dollars.sum().cumsum().plot(ax=axes[1], title='Cumulative Transactions')
sns.despine()
fig.tight_layout();
We first create a two-row figure so we can view the strategy’s realized performance and its trading activity side-by-side. The top axis is dedicated to cumulative returns: the code takes the periodic returns series (or DataFrame), converts each period’s return into a growth factor with add(1), multiplies those factors across time with cumprod to get the running portfolio growth factor, and then subtracts 1 to convert back to a cumulative return (i.e., total percent change from the start). This sequence — (1 + r_t) → cumulative product → −1 — is the canonical way to accumulate compounding returns and is used so the plotted curve represents how $1 invested at the start would have grown over time. If returns is a multi-column DataFrame, these operations vectorize across columns so you can plot multiple strategies on the same axis.
The bottom axis visualizes the cumulative dollar volume of transactions by day. Here we extract the day component from each transaction timestamp (transactions.dt.dt.day), group transactions by that day-of-month, sum the txn_dollars within each group, and then take a cumulative sum across the grouped results. The cumsum shows how trading dollars have accumulated over the month — useful for spotting concentrated execution days, drift in cash flow, or whether trading intensity is front- or back-loaded. Note an important caveat: grouping by transactions.dt.dt.day collapses across months (all “15th”s get aggregated together) and may also produce out-of-order indices; if the goal is a chronological cumulative curve, prefer grouping by the full date (e.g., transactions.dt.dt.date or resampling by day) and ensure the index is sorted before cumsum, as sketched below. Depending on what you want to measure, you might also take the cumulative absolute transaction dollars to measure turnover magnitude rather than net cash flow.
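A sketch of that chronological alternative, grouping on the full (normalized) timestamp so the cumulative curve runs in calendar order:
# Cumulative traded dollars by calendar date rather than day-of-month
(transactions
 .groupby(transactions.dt.dt.normalize())   # full date, keeps months separate
 .txn_dollars.sum()
 .sort_index()
 .cumsum()
 .plot(title='Cumulative Transactions (chronological)'))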
Finally, sns.despine() cleans up the plot aesthetics by removing top/right spines for a cleaner visualization, and fig.tight_layout() adjusts spacing so titles and axes don’t overlap. Together, these plots pair a clear cumulative performance metric (compounded returns) with an execution/flow metric (cumulative transactions), which is a common and useful juxtaposition when evaluating strategy efficacy, cost impacts, and execution behavior during Quantitative Strategy Evaluation.
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(16, 8), sharey='col')
returns_ew.add(1).cumprod().sub(1).plot(ax=axes[0][0],
                                        title='Cumulative Returns - Equal Weight')
returns_pf.add(1).cumprod().sub(1).plot(ax=axes[1][0],
                                        title='Cumulative Returns - Mean-Variance Optimization')
tx_ew.groupby(tx_ew.dt.dt.day).txn_dollars.sum().cumsum().plot(ax=axes[0][1],
                                                               title='Cumulative Transactions - Equal Weight')
tx_pf.groupby(tx_pf.dt.dt.day).txn_dollars.sum().cumsum().plot(ax=axes[1][1],
                                                               title='Cumulative Transactions - Mean-Variance Optimization')
fig.suptitle('Equal Weight vs Mean-Variance Optimization', fontsize=16)
sns.despine()
fig.tight_layout()
fig.subplots_adjust(top=.9)
This block builds a 2x2 comparison figure that juxtaposes the economic outcomes (left column) and trading activity (right column) of two portfolio construction approaches: Equal Weight (top row) and Mean-Variance Optimization (bottom row). The layout is deliberate: each column aggregates the same metric for both strategies so you can directly compare how the strategies trade off returns and transaction volume, and sharey='col' ensures the two return plots use a common vertical scale while the two transaction plots likewise share a scale, which makes visual comparisons meaningful.
For the performance plots, the code starts from series of periodic returns and converts them into a cumulative performance curve by doing add(1).cumprod().sub(1). That sequence is important: adding 1 turns each period return r into a growth factor (1+r), cumulative product compounds those factors multiplicatively (correctly modeling reinvestment and geometric return behavior), and subtracting 1 converts the resulting cumulative growth factor back to an excess return relative to the starting capital. We use multiplicative compounding because financial returns accumulate multiplicatively over time; using a simple cumulative sum would misrepresent compounded performance and mislead comparisons between strategies.
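A two-period toy example makes the difference concrete (illustrative numbers only):
import pandas as pd
r = pd.Series([0.10, -0.10])       # +10% followed by -10%
compounded = r.add(1).prod() - 1   # (1.10 * 0.90) - 1 = -0.01, i.e. a 1% loss
summed = r.sum()                   #  0.10 - 0.10     =  0.00, misleadingly flat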
On the transaction side, the code aggregates trade dollar amounts to show cumulative transaction exposure over time. For each strategy it groups transactions by day, sums txn_dollars within that day, and then takes a cumulative sum to produce a running total of traded dollars. That cumulative transaction series is a proxy for turnover and realized trading volume, a direct input to estimating the transaction costs and slippage that will erode gross returns. One operational caveat to flag: grouping by tx.dt.dt.day as written groups by day-of-month only, so the same calendar day across different months (e.g., every "15th") gets aggregated together, which is usually not what you want. If the intent is to group by calendar date, use the full date (e.g., dt.date, dt.normalize(), or dt.floor('D')) to avoid aggregating across months.
Finally, the code adds descriptive titles for each subplot, a supertitle to frame the comparison, calls seaborn.despine() to remove the top/right axes for cleaner visuals, and uses tight_layout plus a top adjustment to ensure the supertitle does not overlap the subplots. In the context of quantitative strategy evaluation, these four panels together let you assess both gross performance and the cost/turnover profile: you can see whether an optimizer improves compounded returns relative to an equal-weight baseline and whether any improvement is offset by higher cumulative trading (and hence higher transaction costs), which is critical for judging net, implementable performance.


