Mastering Algo Trading with Python: From LSTMs to Bayesian Neural Networks

A comprehensive guide to building, backtesting, and quantifying risk in financial time-series models using Keras and Pyro.

Onepagecode
Dec 16, 2025

Download source code using the button at the end of this article!

Algorithmic trading is often portrayed as a “get rich quick” scheme, but in reality, it is a rigorous discipline of signal processing, risk management, and statistical validation. Many developers start by training a simple neural network to predict price movements, only to find that high accuracy on a training set rarely translates to profitable trades in a live, noisy market. The missing link is often not the complexity of the model, but the inability to quantify uncertainty. In this article, we will go beyond simple point-predictions. We will build a complete end-to-end pipeline: starting with a standard Deep Learning LSTM model for direction prediction, moving into a robust backtesting framework to calculate Sharpe ratios, and finally, advancing to Bayesian Neural Networks using Pyro. By the end, you won’t just have a model that tells you what to buy, but a system that tells you how confident it is — allowing you to build algo trading strategies that survive the chaos of real-world markets.


from backtest import *
import numpy as np
import pandas as pd
import matplotlib.pylab as plt

from keras.models import Sequential
from keras.models import Model
from keras.layers.core import Dense, Dropout, Activation, Flatten, Permute, Reshape
from keras.layers import Merge, Input, concatenate, GaussianNoise
from keras.layers.recurrent import LSTM, GRU, SimpleRNN
from keras.layers import Convolution1D, MaxPooling1D, GlobalAveragePooling1D, GlobalMaxPooling1D, RepeatVector, AveragePooling1D
from keras.callbacks import ModelCheckpoint, ReduceLROnPlateau, CSVLogger, EarlyStopping
from keras.layers.wrappers import Bidirectional, TimeDistributed
from keras import regularizers
from keras.layers.normalization import BatchNormalization
from keras.layers.advanced_activations import *
from keras.optimizers import RMSprop, Adam, SGD, Nadam
from keras.initializers import *
from keras.constraints import *
from keras import regularizers
from keras import losses

from sklearn.metrics import r2_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import matthews_corrcoef

This import block is the scaffolding for building, training, and evaluating deep-learning models for time-series backtesting using pandas dataframes. The intended data flow and decisions are: load your price/features into pandas, convert to numpy windows shaped (samples, timesteps, features), feed those into Keras models that extract temporal patterns, produce one-step or sequence predictions, and finally evaluate those predictions against the truth and translate them into backtest P&L. Each group of imports supports one stage of that flow and encodes specific design choices to control learning dynamics, prevent overfitting, and produce robust evaluation.

First, the backtest and pandas/numpy imports imply the entry point: you’ll read/prepare data with pandas and numeric operations with numpy, then pass sliding windows or sequences into the models. The common pattern is: compute feature columns (returns, indicators) in pandas, standardize/normalize them (important to prevent exploding gradients or poor convergence), then create overlapping windows (e.g., lookback T -> X shape (N, T, F)) and corresponding targets y aligned carefully to avoid lookahead (critical for valid backtests).
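
As a minimal sketch of that windowing pattern (assuming a DataFrame df with precomputed feature columns and a Close column; the names and the label convention here are illustrative, not the article's exact code, which appears further below):

import numpy as np
import pandas as pd

def make_windows(df, feature_cols, lookback=7, forecast=1):
    """Turn a feature DataFrame into (N, T, F) windows plus next-step direction labels."""
    feats = df[feature_cols].to_numpy()      # shape (n_rows, n_features)
    close = df['Close'].to_numpy()
    X, y = [], []
    # conservative bound: leave room for the window plus the forecast horizon
    for start in range(len(df) - lookback - forecast):
        X.append(feats[start:start + lookback])   # strictly past observations only
        future_ret = close[start + lookback + forecast - 1] / close[start + lookback - 1] - 1
        y.append(1 if future_ret > 0 else 0)      # direction of the move after the window
    return np.array(X), np.array(y)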

The Keras layer imports show the architectural choices available and why you might pick each. Recurrent layers (LSTM, GRU, SimpleRNN) capture temporal dependencies — LSTM/GRU for longer-term patterns and vanishing-gradient mitigation. Convolution1D and pooling are present to capture local temporal motifs (short-term patterns) and to downsample features before feeding an RNN or dense head. TimeDistributed and RepeatVector support sequence-to-sequence or encoder-decoder mappings (for example, forecasting multiple future steps). Bidirectional wrappers are available but must be used with caution in backtesting: they learn from both past and future in a window, which can introduce lookahead bias unless your windowing guarantees causality (typically avoid Bidirectional for forward-testing/live trading). Flatten, GlobalAveragePooling1D, and GlobalMaxPooling1D provide different ways to collapse time dimensions: pooling can be more robust to temporal shifts, whereas Flatten preserves positional information — choose based on whether absolute alignment matters for your signal.

A variety of regularization and stability tools are imported because overfitting is a big risk with financial data. Dropout, GaussianNoise, and keras regularizers help reduce overfitting and make models more robust to noise. BatchNormalization stabilizes and speeds up training by reducing internal covariate shift. Advanced activations (PReLU, LeakyReLU, ELU, etc.) and different initializers can improve gradient flow and convergence. Constraints (e.g., max-norm) can keep weights within sensible ranges. These are the “why”: they allow you to extract signal without letting the model latch onto spurious, non-repeatable patterns that will blow up during live trading.

The imports for optimizers (RMSprop, Adam, SGD, Nadam) and losses indicate you’ll experiment with optimization dynamics; Adam/Nadam are commonly effective defaults for noisy, non-convex, small-batch training typical of time-series, while SGD with momentum can be useful when you want more controlled convergence. ReduceLROnPlateau, EarlyStopping, and ModelCheckpoint are the training management tools: reduce LR when the validation metric plateaus to escape shallow minima, stop early to avoid overfitting, and checkpoint the best weights to ensure you can backtest the best-performing model, not just the last epoch. CSVLogger gives you reproducible training logs for later analysis and plotting.

Finally, the sklearn metrics (r2_score, confusion_matrix, classification_report, matthews_corrcoef) reflect two common evaluation modes in backtesting: regression-style targets (price/return prediction) and classification-style outputs (directional up/down signals). R2 is a general regression goodness-of-fit; confusion matrices and classification reports summarize directional hit rates, precision, recall and support; Matthews correlation is particularly useful for imbalanced classes (common when signals are sparse). In a backtest, translate these predictive metrics into economic metrics: align model predictions back to timestamps, simulate fills/slippage/transaction costs, and compute returns, Sharpe, drawdown, and other portfolio-level statistics. Use train/validation splits that respect time order (no random shuffles) and test on out-of-sample contiguous time blocks to avoid lookahead.

Practical cautions tied to these imports: always standardize using only training-window statistics to avoid leakage, prefer causal model constructions over bidirectional ones unless you are explicitly modeling centered windows for non-causal analysis, choose pooling vs. flattening based on whether temporal alignment is meaningful, and use callbacks to avoid excessive training cycles on noisy financial targets. Together, these imports give you flexible building blocks to construct CNN/RNN hybrids, regularize them, control training, and evaluate both statistical and economic performance in a pandas-centered backtesting workflow.

symbol = 'LTC1518'
bars = pd.read_csv('./data/%s.csv' % symbol, header=0, parse_dates=['Date'])

This small block parameterizes which instrument’s historical file to load and then brings that CSV into a DataFrame as a time-aware table. The first line chooses the symbol so the rest of the code can be generic — we build the file path from the symbol so the same loading logic can be reused for different instruments during a backtest. That indirection matters in backtesting because you’ll often iterate the same pipeline across many symbols or switch datasets without changing the file-reading logic.

The read_csv call does two important things for downstream backtesting work. By specifying header=0 we ensure the first row of the file is interpreted as column names; that makes subsequent column lookups predictable (for example, locating an explicit “Date”, “Open”, “Close”, etc.). More crucially, parse_dates=[‘Date’] converts the “Date” column from plain strings into pandas datetime objects. Having a true datetime dtype is necessary for correct time-series behavior — it enables fast, vectorized operations like resampling, rolling-window calculations, time-based indexing and slicing, alignment between price data and trade signals, and reliable calculation of returns or time deltas. Without parsing dates up front you risk subtle bugs and performance issues when you later try to group by day, align multiple series, or compute elapsed time between events.

Practically, this approach assumes the CSV contains a “Date” column in a parseable format and that the header row is present and accurate; if those assumptions don’t hold you’ll need to adjust parsing options (for example to handle different date formats, missing headers, or timezone information). Also consider the next usual step in a backtest: setting the parsed Date column as the DataFrame index and ensuring it’s sorted and unique, which makes lookups and resampling more efficient and avoids logic errors when constructing time-based signals or matching trades to bars.
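
The usual follow-up looks something like this (a sketch; bars_idx is a hypothetical name so the original bars frame, which still carries Date as a column, stays untouched for the slicing code below):

# Promote the parsed Date column to a sorted DatetimeIndex for fast time-based slicing
bars_idx = bars.set_index('Date').sort_index()

# Duplicate timestamps silently break alignment and resampling, so fail fast on them
assert bars_idx.index.is_unique, 'duplicate timestamps in the price file'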

START_TRAIN_DATE = '2015-01-01'
END_TRAIN_DATE = '2017-12-31'
START_TEST_DATE = '2018-01-01'
END_TEST_DATE = '2018-03-09'
LOOKBACK = 7
STEP = 1
FORECAST = 1
INIT_CAPITAL = 10000
STAKE = 10

These constants establish the experiment’s temporal boundaries, the way we turn raw price series into supervised training examples, and the simple money-management rules we’ll use in the backtest. START_TRAIN_DATE and END_TRAIN_DATE define the contiguous interval we treat as the training set; START_TEST_DATE and END_TEST_DATE define the later interval used for out-of-sample evaluation. Choosing a non‑overlapping test window that begins after the training window (as done here) is intentional: it prevents look‑ahead leakage and mimics a realistic deployment scenario where models trained on historical data are applied to future data.

LOOKBACK, STEP and FORECAST control how we convert the time series into X (features) and y (targets). LOOKBACK = 7 means each feature vector will summarize the previous seven time steps — this is our receptive field for the model, chosen to capture one week of history in a daily series. STEP = 1 controls the sampling stride inside that lookback: with STEP = 1 we use every intervening observation, producing overlapping windows and maximizing training samples; increasing STEP would subsample and reduce sample correlation and compute cost. FORECAST = 1 sets the prediction horizon to one time step ahead, so targets are derived by shifting the series forward by one index relative to the last lookback observation. Together these parameters determine alignment (how X rows map to the corresponding y values) and therefore how you construct rolling windows, avoid label leakage, and schedule when predicted trades would be executed in the backtest.

INIT_CAPITAL and STAKE define the simple portfolio rules the backtester will apply. INIT_CAPITAL = 10000 seeds the simulated account balance and is used for metric normalization (e.g., percent return) and constraints (you cannot buy beyond available cash unless leveraging is allowed). STAKE = 10 is the fixed position size per buy signal — we’re using a flat sizing rule (N shares per trade) rather than fractional or risk‑based sizing. Using a fixed stake simplifies comparison between strategies but has clear risk implications: larger stakes increase exposure and drawdown, and stake should be tuned relative to INIT_CAPITAL and the asset’s volatility.

Operationally, the data flow is: load the full price series, slice it by the defined train/test date ranges to produce two DataFrames, then for each slice create supervised examples by rolling a window of length LOOKBACK with stride STEP and assigning targets by shifting by FORECAST. Those predictions are then turned into trade signals which the backtester simulates against the portfolio that starts with INIT_CAPITAL and applies trades of size STAKE. Keep in mind practical details that affect correctness: ensure your DataFrame index timezone and frequency align with these date strings (inclusive/exclusive behavior when slicing), account for market holidays/illiquid days when forming windows, and remember that STEP/LOOKBACK choices interact with autocorrelation and sample size — tuning them affects model bias/variance and the statistical validity of backtest results.

train_set = bars[(bars['Date'] > START_TRAIN_DATE) & (bars['Date'] < END_TRAIN_DATE)]

This line selects the rows from the historical bars table that fall strictly between two date boundaries and assigns that slice to train_set so subsequent training logic only sees past data. Internally, pandas evaluates two vectorized boolean expressions against the Date column — one testing whether each timestamp is greater than START_TRAIN_DATE and the other whether it is less than END_TRAIN_DATE — and combines them with an elementwise AND to produce a boolean mask. Applying that mask to bars returns a DataFrame containing only the rows where both conditions are true; that subset is what we treat as the training period for backtesting.

The reason we extract a contiguous date range rather than, say, randomly sampling rows is fundamental to backtesting: models must be trained on data that would have been available historically, with no information leakage from the future. The choice of strict inequalities here (“>” and “<”) is intentional: it excludes rows exactly equal to the boundary timestamps, which is useful when you want a clean temporal separation between train and validation/test periods. Be explicit about whether boundaries should be inclusive or exclusive in your experiment design — commonly we use start inclusive and end exclusive to make it easy to chain consecutive, non-overlapping windows.

A few practical considerations to keep this selection robust. Ensure bars[‘Date’] is a true datetime dtype (and consistently timezone-aware or naive) so the comparisons behave as expected; if it’s strings you’ll get lexicographic surprises. Make sure the DataFrame is sorted by date and that you understand how duplicate timestamps or different intraday time components affect inclusion. For large datasets, this boolean masking is vectorized and efficient, but if you frequently slice by date it can be more convenient and often faster to use a DatetimeIndex with .loc-based slicing. Finally, remember that boolean indexing returns a copy; if you plan to mutate train_set afterwards, either work on that copy intentionally or use an explicit .copy() to avoid SettingWithCopyWarning and unintended side effects.
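
Concretely, the same selection with an explicit copy, plus the DatetimeIndex alternative mentioned above, might look like this (a sketch; note that .loc label slicing on a DatetimeIndex is inclusive of both endpoints, unlike the strict inequalities used here):

# Same rows as above, but returned as an explicit copy so later mutations are safe
train_set = bars.loc[
    (bars['Date'] > START_TRAIN_DATE) & (bars['Date'] < END_TRAIN_DATE)
].copy()

# Alternative once Date is a sorted index: label-based slicing (inclusive on both ends)
# train_set = bars.set_index('Date').sort_index().loc[START_TRAIN_DATE:END_TRAIN_DATE]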

test_set = bars[(bars['Date'] > START_TEST_DATE) & (bars['Date'] < END_TEST_DATE)]

This line builds a test subset by applying a boolean mask to the original bars DataFrame: it keeps only the rows whose Date falls strictly after START_TEST_DATE and strictly before END_TEST_DATE. Concretely, pandas evaluates the two vectorized comparisons (bars[‘Date’] > START_TEST_DATE) and (bars[‘Date’] < END_TEST_DATE), combines them element-wise with & to form a boolean Series, and uses that Series to slice bars into test_set. The intent in the backtest workflow is to extract an out-of-sample window for performance evaluation — isolating data that the model or strategy did not see during training or calibration to avoid lookahead or leakage.

A few practical reasons and caveats behind this choice: using strict inequalities (> and <) intentionally excludes the boundary timestamps, which is useful when you want to ensure no overlap with a training/validation endpoint or when boundary bars are handled separately (e.g., to avoid double-counting a bar at the split time). Because these comparisons are vectorized, the operation is efficient for large time series, but it relies on the Date column being comparable to START_TEST_DATE and END_TEST_DATE (typically both as pandas Timestamps or timezone-normalized datetimes). If Date is not datetime64 or has a different timezone/format, the comparison semantics can be wrong, so normalizing with pd.to_datetime and aligning timezones before slicing is important. Finally, for clarity and alternate semantics you could use .between(start, end, inclusive=…) when you want an inclusive or explicit boundary policy; also consider sorting or setting Date as the index if you perform many time-based slices to make subsequent code clearer and faster.

def create_dataset(data):
    
    highp = pd.to_numeric(data.ix[:, 'High'])
    lowp = pd.to_numeric(data.ix[:, 'Low'])
    openp = pd.to_numeric(data.ix[:, 'Open'])
    closep = pd.to_numeric(data.ix[:, 'Close'])
    tradesp = pd.to_numeric(data.ix[:, 'Trades'])
    volumep = pd.to_numeric(data.ix[:, 'Volume'])

    normal_close = closep.values.tolist()
    normal_open = openp.values.tolist()

    highp = highp.pct_change().replace(np.nan, 0).replace(np.inf, 0).values.tolist()
    lowp = lowp.pct_change().replace(np.nan, 0).replace(np.inf, 0).values.tolist()
    openp = openp.pct_change().replace(np.nan, 0).replace(np.inf, 0).values.tolist()
    closep = closep.pct_change().replace(np.nan, 0).replace(np.inf, 0).values.tolist()
    tradesp = tradesp.pct_change().replace(np.nan, 0).replace(np.inf, 0).values.tolist()
    volumep = volumep.pct_change().replace(np.nan, 0).replace(np.inf, 0).values.tolist()

    X, Y = [], []
    
    for i in range(0, len(data), STEP): 
        try:
            o = openp[i:i+LOOKBACK]
            h = highp[i:i+LOOKBACK]
            l = lowp[i:i+LOOKBACK]
            c = closep[i:i+LOOKBACK]
            v = volumep[i:i+LOOKBACK]
            t = tradesp[i:i+LOOKBACK]

            y_i = (normal_close[i+LOOKBACK+FORECAST] - normal_open[i+LOOKBACK]) / normal_open[i+LOOKBACK]
            y_i = 1 if y_i > 0 else 0
            
            x_i = np.column_stack((o, h, l, c, v, t))
    
        except Exception as e:
            break

        X.append(x_i)
        Y.append(y_i)

    X, Y = np.array(X), np.array(Y)
    return X, Y

This function transforms a historical price-and-volume DataFrame into a set of supervised samples for backtesting a binary classifier: each X is a multivariate time window of recent percent-change features and each Y indicates whether a subsequent open→close return is positive. The high-level flow is: coerce columns to numeric, derive two parallel representations (raw prices for label calculation and percent-change series for features), slide a LOOKBACK-length window through the percent-change series in steps of STEP, compute a binary label using a future open and close separated by FORECAST, assemble the feature matrix for each window by stacking the six percent-change series, and finally return X and Y as numpy arrays.

The initial numeric coercion ensures downstream arithmetic won’t break on non-numeric entries. The code keeps two views of the data intentionally: normal_open/normal_close retain raw price levels (lists of absolute prices) because the label is a level-based return between a future open and a future close, while the other series are converted to percent-change (pct_change) to produce stationarized, scale-invariant features. Using percent changes for open/high/low/close/volume/trades makes the model focus on relative movements rather than absolute price scale — this is important in backtesting across instruments or long histories because it reduces heteroskedasticity and prevents the model from relying on trivial scale differences.

After pct_change, NaN and infinite values are replaced with zero. The pragmatic reason is to avoid NaNs propagating into feature windows or causing exceptions; however, note this replaces missing or extreme jump-induced values with neutral signals, which can bias samples if those cases are frequent. Converting the Series to plain Python lists is a design choice that simplifies slicing semantics in the loop but sacrifices some vectorized/numpy performance.

The loop is the sequential sampler: for each starting index i (stepped by STEP), it slices LOOKBACK consecutive percent-change values for each feature (open, high, low, close, volume, trades) to form the input window. Those slices represent the recent history that the model will see. The label y_i is computed using indices relative to the end of that window: normal_open[i+LOOKBACK] is the open price immediately after the lookback window, and normal_close[i+LOOKBACK+FORECAST] is the close price after advancing FORECAST days from that open. The return (close — open) / open is thresholded to 1 if positive and 0 otherwise, so the model is trained to predict whether the post-window open-to-future-close return will be positive. That particular choice of reference points means you are predicting an intra-day or multi-day movement anchored at the first day after the lookback, so be deliberate about how LOOKBACK and FORECAST are set relative to your trade entry/exit logic.

Feature assembly uses numpy.column_stack to produce a 2D array per sample with shape (LOOKBACK, 6): each row is a time step and each column corresponds to one of the six percent-change series, preserving temporal ordering. Samples and labels are collected into lists and converted to numpy arrays at the end so the returned X has shape (n_samples, LOOKBACK, 6) and Y has shape (n_samples,) — a format convenient for most ML pipelines and backtest simulators that consume sequential inputs.

There are a few operational caveats to watch for in a backtesting context. The try/except simply breaks when any indexing goes out of range; this prevents explicit boundary checks but also masks other exceptions and may hide logic errors — prefer explicit bounds computation or narrower exception handling. The use of .ix is deprecated and should be replaced with .loc/.iloc to make indexing semantics explicit. Replacing NaN/inf with 0 is an expedient but blunt imputation strategy; consider forward/backward filling or dropping samples with missing data if those occur non-randomly. Converting to Python lists and slicing in Python slows large backtests; keeping arrays/Series and using vectorized windowing or optimized rolling/window generators will be much faster for large datasets. Finally, ensure your STEP, LOOKBACK, and FORECAST constants align with your intended trade timing so the generated labels correctly represent the trades your strategy would take.
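
To address the performance caveat, the same sampling can be done without Python-level loops; a rough sketch under the same LOOKBACK/STEP/FORECAST constants (requires NumPy 1.20+ for sliding_window_view and assumes FORECAST >= 1; this is an illustration, not a drop-in replacement reconciled with the loop's break-based boundary handling):

from numpy.lib.stride_tricks import sliding_window_view

def create_dataset_vectorized(data, lookback=LOOKBACK, forecast=FORECAST, step=STEP):
    cols = ['Open', 'High', 'Low', 'Close', 'Volume', 'Trades']
    raw = data[cols].apply(pd.to_numeric)
    # percent-change features with the same NaN/inf-to-zero policy as above
    pct = raw.pct_change().replace([np.inf, -np.inf], np.nan).fillna(0.0).to_numpy()
    opens, closes = raw['Open'].to_numpy(), raw['Close'].to_numpy()

    n_samples = len(raw) - lookback - forecast
    # every lookback window at once: shape (n_windows, lookback, n_features)
    windows = sliding_window_view(pct, lookback, axis=0).transpose(0, 2, 1)
    X = windows[:n_samples:step]

    # label i: sign of the open->close move anchored just after window i
    future_ret = (closes[lookback + forecast:] - opens[lookback:-forecast]) / opens[lookback:-forecast]
    Y = (future_ret > 0).astype(int)[::step]
    return X, Y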

X_train, Y_train = create_dataset(train_set)
X_test, Y_test = create_dataset(test_set)

These two lines are the point where your raw, chronological market data (train_set and test_set) gets converted into the supervised learning examples the model actually consumes. create_dataset is performing the typical time-series-to-supervised transformation: it slides a fixed-length window over the input series and emits X rows that contain the lagged observations (the predictors) and Y rows that contain the corresponding future target(s) (the label or labels at a forecast horizon). In other words, for each training example the function packages the recent history into X_i and the next observation(s) you want to predict into Y_i. That windowing enforces the temporal ordering that backtests require — predictors always come from strictly earlier timestamps than the target — which prevents lookahead bias when you later train or evaluate a model.

We call create_dataset separately on train_set and test_set to preserve strict separation between fitting and evaluation. Doing the transform independently ensures that each partition is windowed only from its own data and that no overlapping windows or index leakage accidentally include future information from the training period into the test examples. It’s also the practical place to reconcile missing initial windows (drop or pad the first few rows that don’t have a full history), and to produce X and Y shapes that match what the learning algorithm expects (e.g., 2D arrays for classical models or 3D tensors for sequence models). A crucial operational point is that any normalization/feature-scaling must be fit on X_train only and then applied to X_test; create_dataset should either return unscaled windows or be documented to accept an externally fitted scaler so you don’t leak distributional information from the test set into training.

When you run model.fit(X_train, Y_train) and model.predict(X_test) during your backtest, the X/Y pairs created here determine the time resolution, forecast horizon, and sample alignment of every prediction you make. For reproducible and correct backtesting you must ensure create_dataset uses identical window size, target horizon, feature ordering and index-handling on both splits, and that it preserves or returns timestamp alignment (or indices) so you can map predicted Y back to the original test timestamps for metric computation. Common pitfalls to watch for are off-by-one windowing errors that shift targets, silently dropped rows that misalign timestamps, and accidental scaling or feature-engineering done differently between train and test — all of which would invalidate the backtest results.
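
A leakage-safe scaling step fitted on the training windows only might look like this (a sketch; StandardScaler works on 2-D arrays, so the (N, T, F) windows are flattened around the transform, and the scaled arrays would then be what you pass to fit/predict):

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
n_train, T, F = X_train.shape

# Fit scaling statistics on the training windows only, then reuse them on the test windows
X_train_scaled = scaler.fit_transform(X_train.reshape(-1, F)).reshape(n_train, T, F)
X_test_scaled = scaler.transform(X_test.reshape(-1, F)).reshape(X_test.shape)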

def plot_history(history):
    # summarize history for accuracy
    plt.subplot(2, 1, 1)
    plt.plot(history.history['acc'])
    plt.plot(history.history['val_acc'])
    plt.axhline(y=0.5, color='grey', linestyle='--')
    plt.title('model accuracy')
    plt.ylabel('accuracy')
    plt.xlabel('epoch')
    plt.legend(['train', 'test'], loc='upper left')
    plt.show()
    # summarize history for loss
    plt.subplot(2, 1, 2)
    plt.plot(history.history['loss'])
    plt.plot(history.history['val_loss'])
    plt.axhline(y=0.693, color='grey', linestyle='--')
    plt.title('model loss')
    plt.ylabel('loss')
    plt.xlabel('epoch')
    plt.legend(['train', 'test'], loc='upper left')
    plt.show()

This small utility takes the training History object returned by Keras and turns its stored epoch-by-epoch metrics into two simple diagnostic charts: one for accuracy and one for loss. The function reads the sequences in history.history (the lists of values recorded every epoch) and plots them so you can visually compare training vs validation behavior across epochs — a core step before trusting any model you’ll later evaluate in a backtest.

First it draws the accuracy subplot. It plots the training accuracy series and the validation accuracy series on the same axes so you can immediately see whether the model is improving on held-out data or just memorizing the training set. A horizontal reference line is drawn at 0.5; for a binary classification problem this is the random-guess accuracy baseline, so the line helps you judge whether the model is doing meaningfully better than chance. Axis labels, a title and a legend are added to make the plot self-explanatory; showing the plot at this point displays the accuracy trend to the user.

Next it draws the loss subplot in the same manner: training loss and validation loss are plotted together, with a horizontal reference at ~0.693 (which is ln(2) and corresponds to the binary-crossentropy loss you’d expect from a 50/50 random prediction). That baseline is useful because loss is less intuitively interpretable than accuracy — seeing validation loss near or below this baseline confirms the model is learning a signal rather than predicting uniformly. The titles, labels and legend again make it clear which curve is which, and the plot is shown.

A few practical notes tied to backtesting: these plots are the primary tools for spotting overfitting before you commit a model to a backtest. If training accuracy improves while validation accuracy stalls or validation loss diverges upward, your model is likely learning noise from historical windows and will produce optimistic backtest results. Conversely, flat curves on both train and validation suggest underfitting and that you should increase model capacity or features. Also be aware of implementation details that affect robustness: the History keys can be named ‘acc’/’val_acc’ or ‘accuracy’/’val_accuracy’ depending on Keras/TensorFlow versions, and calling plt.show() after each subplot splits them into separate displays (you may prefer drawing both subplots and calling show() once). Finally, for noisy time-series/backtesting setups you may want to add smoothing, annotate the epoch with best validation metric, or persist figures so you can compare learning curves across different backtest folds.
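
A version that tolerates both key-name conventions and renders the two panels in one figure might look like this (a sketch, not the article's original helper):

def plot_history_robust(history):
    hist = history.history
    acc_key = 'accuracy' if 'accuracy' in hist else 'acc'   # key name depends on the Keras version
    fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(8, 8))

    ax1.plot(hist[acc_key], label='train')
    ax1.plot(hist['val_' + acc_key], label='validation')
    ax1.axhline(y=0.5, color='grey', linestyle='--')        # random-guess accuracy baseline
    ax1.set_title('model accuracy')
    ax1.set_ylabel('accuracy')
    ax1.legend()

    ax2.plot(hist['loss'], label='train')
    ax2.plot(hist['val_loss'], label='validation')
    ax2.axhline(y=np.log(2), color='grey', linestyle='--')  # ln(2): binary-crossentropy of a 50/50 guess
    ax2.set_title('model loss')
    ax2.set_xlabel('epoch')
    ax2.set_ylabel('loss')
    ax2.legend()

    plt.tight_layout()
    plt.show()  # one call renders both panels together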

def get_lr_model(x1, x2):
    main_input = Input(shape=(x1, x2, ), name='main_input')
    x = GaussianNoise(0.01)(main_input)
    x = Flatten()(x)
    x = BatchNormalization()(x)
    x = Dropout(0.5)(x)
    output = Dense(1, activation='sigmoid', name='out')(x)
    final_model = Model(inputs=[main_input], outputs=[output])
    final_model.compile(optimizer=Adam(lr=0.001, amsgrad=True), loss='binary_crossentropy', metrics=['accuracy'])
    return final_model

This function builds and returns a compact neural classifier intended to produce a probability-like signal (binary decision) from a 2‑D input window. Conceptually, you pass a matrix (for example, a sliding window of time steps by features) into the network; the model flattens that window into a single feature vector and maps it to a single sigmoid output that you can interpret as the probability of a positive class (e.g., a buy signal) during backtests.

Flow and layer-by-layer rationale: the Input layer expects a shape (x1, x2) — typically that will be a window length and the number of features per time step. Immediately applying GaussianNoise(0.01) injects small random perturbations during training; in the context of financial backtesting this helps the model learn to ignore tiny, likely-irrelevant fluctuations and improves robustness to noisy live data, which reduces overfitting to idiosyncratic historical ticks. Flatten takes the time-by-feature grid and converts it into a single vector — this deliberately discards explicit sequential structure so the downstream Dense layer treats each time/feature cell as an independent input, which simplifies learning and keeps the model lightweight.

Next, BatchNormalization standardizes the flattened activations across the batch. This stabilizes and speeds up training by reducing internal covariate shift, which is especially useful when batch statistics can vary a lot in financial data. Dropout(0.5) follows as a strong regularizer: dropping half the units during training forces the model not to rely on any particular pattern in the historical window, again mitigating overfitting to historical quirks that won’t generalize in forward tests.

The final Dense(1, activation=’sigmoid’) produces a single output between 0 and 1, suited for binary classification or thresholded trading signals. Using binary_crossentropy as the loss aligns naturally with that output and encourages calibrated probabilities rather than raw scores. The optimizer is Adam with a conservative learning rate of 0.001 and amsgrad enabled; Adam handles noisy, sparse gradients well (typical in small financial datasets) and amsgrad can provide extra stability on non-stationary loss landscapes that are common when inputs change over time.

Practical considerations for backtesting: this is essentially a regularized, logistic‑regression–style neural model — simple and fast to train, but it loses temporal order information by flattening the window. If your strategy depends on patterns across time, consider sequence-aware layers (Conv1D/LSTM) instead. Also, accuracy as a metric can be misleading in imbalanced trade-label problems; prefer precision/recall, AUC, or — better yet — economic metrics (cumulative returns, Sharpe, drawdown) calculated on holdout walk-forward or out‑of‑sample periods. Finally, because the model contains internal normalization and noise, you should still ensure consistent input scaling in your pandas pipeline (e.g., standardization) so GaussianNoise and BatchNormalization operate at the expected scale.

In short: this function constructs a small, heavily regularized neural binary classifier that is tuned for robustness in noisy, low-sample financial settings common in backtesting. It trades modeling of temporal structure for simplicity and generalization, and its hyperparameters (dropout rate, noise level, optimizer settings, and the choice of output metric) are important levers to adjust when validating via walk‑forward/backtest experiments.
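
If you do want to keep the temporal structure, a sequence-aware variant in the same spirit could look like this (a sketch reusing the Keras imports at the top of the article; the layer size is an arbitrary choice, and you could swap it in where the commented-out alternative appears below):

def get_lstm_model(timesteps, features):
    main_input = Input(shape=(timesteps, features), name='main_input')
    x = GaussianNoise(0.01)(main_input)
    x = LSTM(32)(x)                       # consumes the window in temporal order instead of flattening it
    x = BatchNormalization()(x)
    x = Dropout(0.5)(x)
    output = Dense(1, activation='sigmoid', name='out')(x)
    model = Model(inputs=[main_input], outputs=[output])
    model.compile(optimizer=Adam(lr=0.001, amsgrad=True),
                  loss='binary_crossentropy', metrics=['accuracy'])
    return model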

reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.9, patience=50, min_lr=0.000001, verbose=0)
checkpointer = ModelCheckpoint(filepath="testtest.hdf5", verbose=0, save_best_only=True)
es = EarlyStopping(patience=100)

model = get_lr_model(X_train.shape[1], X_train.shape[-1])
# model = get_model(X_train.shape[1], X_train.shape[-1])
model.summary()

history = model.fit(X_train, Y_train, 
              epochs = 1000, 
              batch_size = 64, 
              verbose=0, 
              validation_data=(X_test, Y_test),
              callbacks=[reduce_lr, checkpointer, es],
              shuffle=True)

model.load_weights('testtest.hdf5')
pred = model.predict(X_test)
plot_history(history)

pred = [1 if p > 0.5 else 0 for p in pred]
C = confusion_matrix(Y_test, pred)

print('MATTHEWS CORRELATION')
print(matthews_corrcoef(Y_test, pred))
print('CONFUSION MATRIX')
print(C / C.astype(np.float).sum(axis=1))
print('CLASSIFICATION REPORT')
print(classification_report(Y_test, pred))
print('-' * 20)

This block sets up a supervised training loop with conservative safeguards, fits the model, then evaluates it in ways that are useful for a backtest-oriented workflow. It begins by constructing three Keras callbacks whose behavior shapes training and model selection: ReduceLROnPlateau watches val_loss and reduces the optimizer learning rate by a factor of 0.9 when the validation loss stalls for 50 epochs (this gentle decay helps the optimizer take smaller steps when progress slows, rather than overshooting a narrow minimum). ModelCheckpoint writes the best-performing weights to testtest.hdf5 and save_best_only=True ensures we keep the single best snapshot by monitored metric; EarlyStopping with patience=100 halts training after the validation metric stops improving for that many epochs, preventing long overfitting runs while letting the model explore enough epochs (the high patience values imply you expect noisy, slow-to-converge validation curves).

Next you instantiate the model via get_lr_model(X_train.shape[1], X_train.shape[-1]). Passing those shape elements expresses the model’s expected input dimensions (the first dimension after batch typically represents sequence length or number of timesteps, and the last represents features per timestep). Calling model.summary() is a quick check to confirm the architecture and parameter counts so you can validate the design relative to the data shape before training.

The core training call is model.fit(…). It requests up to 1000 epochs with batch_size 64 but relies on the callbacks to stop earlier or adjust learning rate; that combination gives a high upper bound while letting ReduceLROnPlateau and EarlyStopping handle convergence and overfitting. validation_data=(X_test, Y_test) provides an immediate out-of-sample signal to the callbacks and to the logged history — important for model selection, but also a potential source of leakage if X_test/Y_test are intended as a final holdout for the backtest; in time-series/backtesting contexts you usually want a separate validation split or a walk-forward scheme rather than using the test fold as the monitored validation set. shuffle=True randomizes mini-batches each epoch, which helps general SGD convergence for iid data but is dangerous for time-series features (it destroys temporal order); if these features are sequential, consider disabling shuffle or using sequence-aware sampling.

After fit completes (either via epochs or early stop), model.load_weights(‘testtest.hdf5’) restores the checkpointed best weights so evaluation uses the best epoch rather than the final one. pred = model.predict(X_test) yields probabilities or scores; plot_history(history) visualizes loss/metric trajectories across epochs so you can inspect overfitting, learning-rate drops, and whether early stopping triggered appropriately. The code then thresholds probabilities at 0.5 to produce hard class predictions — this is fine for a balanced, well-calibrated output, but in trading/backtesting you often want to tune that threshold against a business metric (expected return, Sharpe, or cost-weighted error) rather than defaulting to 0.5.

Finally, the evaluation block computes confusion_matrix, Matthews correlation coefficient (MCC), and sklearn’s classification_report. MCC is a particularly useful single-number summary for imbalanced classes common in trading signals because it balances true/false positives and negatives. The printed confusion matrix is normalized per true class (C / C.astype(np.float).sum(axis=1)), which yields row-wise recall rates — that immediately tells you how well each true class is being recovered. Be careful with division-by-zero if a class has zero true samples; prefer dividing by row sums with keepdims or adding a tiny epsilon. classification_report gives precision/recall/f1 per class which helps diagnose whether the model is precise (few false signals) or has good recall (captures most events) — both are critical in backtesting because false positives cost execution and false negatives miss opportunities.

A couple of practical suggestions tied to backtesting: do not use your final test set as the validation set driving callbacks (use a separate validation set or time-series cross-validation/walk-forward splits), avoid shuffle=True for sequence-dependent features, and consider threshold search calibrated to P&L rather than F1. Also consider saving the model architecture and optimizer state and logging epoch-level metrics (loss, val_loss, and any financial metrics) so you can reproduce the best checkpoint and understand how metric trajectories map to out-of-sample trading performance.
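
A simple threshold sweep against a single score, rather than defaulting to 0.5, might look like this (a sketch; MCC stands in here for a P&L- or Sharpe-based objective, and the probabilities should ideally come from a separate validation fold rather than the final test set):

probs = model.predict(X_test).ravel()   # raw probabilities, before any hard 0/1 cut

best_thr, best_score = 0.5, -1.0
for thr in np.arange(0.3, 0.71, 0.01):
    hard = (probs > thr).astype(int)
    score = matthews_corrcoef(Y_test, hard)   # swap in an economic metric for a trading-aware choice
    if score > best_score:
        best_thr, best_score = thr, score

print('best threshold %.2f with score %.3f' % (best_thr, best_score))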

pred = [1 if p == 1 else -1 for p in pred]  # we need to change NN's 0 output to -1 for our strategy
pred = [p if i % FORECAST == 0 else 0 for i, p in enumerate(pred)]
pred = [0.] * LOOKBACK + pred + [0] * FORECAST  # first LOOKBACK items needed to make first forecast + items we shifted

This block is taking raw model outputs and turning them into a time-aligned sequence of tradable signals for the backtest, while preventing lookahead and limiting trades to the model’s forecast cadence. Start to finish: the first line maps the neural network’s binary output into the strategy’s action space. Your NN apparently emits 1 for “long” and 0 for “short/other”, but the trading logic expects symmetric signed positions (1 for long, -1 for short). Converting 0 -> -1 makes the signal directly usable as a position multiplier when you later compute PnL (e.g., position * returns), and ensures the signal encodes direction rather than a presence flag.

The second line enforces the forecast cadence: it keeps predictions only at indices where i % FORECAST == 0 and zeroes out everything else. This is important because the model either produces meaningful forecasts only every FORECAST timesteps (e.g., you trained it to predict every Nth step) or you only want to act when a new forecast is issued to avoid overtrading on intermediate outputs. Setting the intermediate entries to 0 means “no new trade/signal” at those timestamps, so downstream logic can distinguish between a deliberate short/long stance (-1/1) and a flat/no-action moment (0).

The third line aligns the prediction vector with the original price/feature timeline used in the backtest by padding. Prepending LOOKBACK zeros accounts for the fact that the model needed LOOKBACK prior samples to produce its first forecast — you cannot have a forecast for rows that provided the input. Appending FORECAST zeros at the end prevents the strategy from acting on forecasts that would require price data beyond the available horizon (i.e., future-targeted forecasts); it also keeps the prediction vector the same length as your price series so you can safely element-wise combine them in pandas without index mismatches. Using explicit zeros here prevents unintended trades before the model had sufficient history or after the data ends and makes the resulting array ready for conversion into a pandas Series/column for further steps like shifting, forward-filling, or multiplication by return series in the backtest.

class MachineLearningForecastingStrategy(Strategy):   
    
    def __init__(self, symbol, bars, pred):
        self.symbol = symbol
        self.bars = bars

    def generate_signals(self):
        signals = pd.DataFrame(index=self.bars.index)
        signals['signal'] = pred
        return signals

This class is meant to encapsulate a model-driven trading strategy for use in a pandas-based backtester: you hand it the market bars (time-indexed price data) and a sequence of model predictions, and generate_signals should return a time-aligned DataFrame of trading signals that the backtest engine can consume. The constructor currently accepts three inputs — symbol, bars, and pred — and stores the symbol and bars so the object knows which market and which time index it will operate over. The intent is that pred contains the model’s forecasts (for example probabilities, expected returns, or categorical labels) that will be converted into actionable signals.

When generate_signals runs, it builds a new DataFrame using the same index as the bars. Using the bars’ index is deliberate: backtesting frameworks rely on identical time indices to align signals with the price series (so each signal corresponds to the correct bar). The method then writes the predictions into a column named ‘signal’ and returns that DataFrame. Naming the column ‘signal’ is also intentional — most backtesting harnesses expect a specific column name to find trade entry/exit instructions.

There are a couple of important behavioral and correctness considerations tied to this implementation. First, the current code references pred directly inside generate_signals rather than using an instance attribute (self.pred), so unless pred is a global variable or captured from an outer scope this will raise a NameError; the correct pattern is to store pred on the instance in __init__ so generate_signals can reliably access it. Second, you must ensure the predictions are aligned and shaped to match bars.index: the backtester will join on the index, so a length mismatch or differing timestamps results in misaligned signals, NaNs, or unintended forward/backward propagation. Therefore validate that pred either has the same index as bars or reindex/resample it deterministically before assigning into the signals DataFrame.

Beyond these mechanical fixes, there are practical signal-preparation steps you should consider as part of generate_signals so the downstream backtest behaves predictably. Convert continuous model outputs into the expected signal format (e.g., threshold probabilities into discrete -1/0/1 positions or scale predicted returns into position sizes), coerce types to numeric/integer if required, and decide how to handle NaNs at the start or inside the series (drop, fill, or keep as neutral). Finally, keep this method focused on alignment and simple mapping from predictions to signals — position sizing, risk management, and trade execution logic should remain separate so the strategy stays composable and easy to test.
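
A corrected version that stores the predictions on the instance and checks alignment before assigning them might look like this (a sketch; Strategy comes from the backtest module imported at the top):

class MachineLearningForecastingStrategy(Strategy):

    def __init__(self, symbol, bars, pred):
        self.symbol = symbol
        self.bars = bars
        self.pred = pred                  # keep the predictions on the instance, not in global scope

    def generate_signals(self):
        if len(self.pred) != len(self.bars):
            raise ValueError('predictions and bars are misaligned')
        signals = pd.DataFrame(index=self.bars.index)
        signals['signal'] = self.pred     # one signed position (-1/0/1) per bar
        return signals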

# preparing for forecasting for tomorrow!
test_set['Close'] = test_set['Close'].shift(-FORECAST)

rfs = MachineLearningForecastingStrategy('LTC', test_set, pred)
signals = rfs.generate_signals()
portfolio = MarketIntradayPortfolio('LTC', test_set, signals, INIT_CAPITAL, STAKE)
returns = portfolio.backtest_portfolio()

This small block is preparing the test data so that your model’s predictions can be translated into tradable signals and then evaluated through a simulated intraday portfolio. The first line is the critical alignment step: test_set[‘Close’] = test_set[‘Close’].shift(-FORECAST). By shifting Close up by FORECAST periods you are creating a target column that represents the price FORECAST periods into the future (for example, tomorrow if FORECAST == 1). That re-labeling is what lets you compare features observed at time t with the price at time t + FORECAST — in other words, it creates the supervised learning target you need for forecasting and avoids accidentally using future price information as a feature. Be aware this produces NaNs at the bottom of the dataset (because there is no future Close for the last rows), so downstream code must drop or otherwise handle those rows to avoid misalignment or look-ahead bias.

Next, the code instantiates a MachineLearningForecastingStrategy with the instrument id, the shifted test_set, and pred (the model outputs). Conceptually this object is the translator between model outputs and trading intent: it should align the model predictions with the test_set timestamps and apply whatever decision logic you have (thresholds, probability cutoffs, mapping predicted price deltas into discrete actions). The call to generate_signals() is where that logic runs — it produces a time-indexed series or frame of trading signals (buy/sell/hold). At this stage you must ensure the predictions and the shifted Close share the same index and length; any mismatch is a common source of silent look-ahead or dropped rows during backtesting.

Those signals are passed into MarketIntradayPortfolio, together with the same test_set, an initial capital value, and a stake size. This portfolio wrapper simulates how the signals would be executed in the market (intraday implies entries and exits happen within the same trading day or within the bars covered by test_set). The portfolio object is responsible for applying execution rules (e.g., whether trades are placed at close, open, next-tick), updating cash and position sizes, and applying constraints like position sizing (STAKE) and portfolio-level capital tracking (INIT_CAPITAL). The final call backtest_portfolio() runs the simulation over the time series, executing the signals according to those rules and producing the realized P&L, returns time series, and any performance metrics.

Three practical points to keep in mind while you run or extend this flow: (1) avoid in-place surprises — shifting test_set mutates the dataframe, so consider working on a copy if you need the original Close series elsewhere; (2) explicitly handle the NaNs produced by the shift and confirm prediction alignment before generating signals, and (3) verify the portfolio’s execution assumptions (entry price, slippage, transaction costs, intraday timing) so the backtest reflects the real trading constraints you want to measure. Together, these steps move you from model outputs to a market-aware performance estimate using pandas-friendly time-aligned data.
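
A slightly more defensive version of the preparation step, avoiding the in-place mutation and the trailing NaNs, could look like this (a sketch; eval_set is a hypothetical name, and if you trim rows here you must trim the signal vector to the same length before backtesting):

# Work on a copy so the original test_set keeps its unshifted Close series
eval_set = test_set.copy()
eval_set['Close'] = eval_set['Close'].shift(-FORECAST)

# The last FORECAST rows have no future Close; drop them so prices and labels stay aligned
eval_set = eval_set.iloc[:-FORECAST]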

returns['signal'] = signals
our_pct_growth = returns['total'].pct_change().cumsum()
benchmark_ptc_growth = test_set['Close'].pct_change().cumsum()

First, we persist the trading decisions into the returns DataFrame by assigning signals to a new column named “signal”. This step is about bookkeeping: earlier logic would have produced a sequence of entry/exit or position-size signals (the variable signals) and we attach them to the same index that holds per-period P&L and portfolio state so downstream calculations and diagnostics can easily filter, group, or plot performance by signal state. Having the signals in the returns frame also ensures alignment on timestamps and avoids later mistakes where the signals and returns are offset by one step.

Next, the code computes our_pct_growth from the portfolio’s “total” series by calling pct_change().cumsum(). pct_change() produces period-to-period simple returns (relative change), and cumsum() then accumulates those period returns additively to produce a running total of percentage change. The intent here is to produce a time series that shows how much the portfolio value has moved over the test period, which is useful for quick visual comparisons and for measuring drift or trend in performance during backtests.

The benchmark_ptc_growth line applies the same operation to the benchmark price series (Close) from the test_set, so you end up with a comparable cumulative percentage-change series for the benchmark. By computing both series with the same method and then plotting or comparing them, you can quickly assess whether the strategy outperformed or underperformed the market during the test interval.

A couple of important behavioral notes and a recommended improvement: pct_change() produces NaN for the first row, so the first value of the cumulative series will be NaN unless you explicitly fill it (e.g., fillna(0)), which can affect visualizations and comparisons. More critically, summing simple returns with cumsum is an approximation that is only strictly valid for additive returns (or very small returns); for typical periodic returns you should compute cumulative growth multiplicatively using (1 + r).cumprod() — 1 to get correct compounded returns. Alternatively, if you prefer additive math, convert to log returns with np.log(series).diff().cumsum() which correctly accumulates continuous compounded returns. Finally, ensure the index alignment and frequency of returns and benchmark are consistent (same timestamps, no lookahead) before comparison — misaligned or mixed-frequency data is a common source of misleading backtest results.
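
The compounded alternatives mentioned above look like this (a sketch on the same series):

# Compounded growth: multiply (1 + r) across periods instead of summing simple returns
our_growth_compounded = (1 + returns['total'].pct_change().fillna(0)).cumprod() - 1
benchmark_growth_compounded = (1 + test_set['Close'].pct_change().fillna(0)).cumprod() - 1

# Equivalent additive form using log returns
our_log_growth = np.log(returns['total']).diff().fillna(0).cumsum()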

plt.figure()
plt.plot(returns['total'])
plt.show()

This three-line block creates a standalone Matplotlib figure, draws the time series held in returns[‘total’], and renders it to the screen. Conceptually, you’re taking the “total” column from a pandas DataFrame or Series (which in a backtest typically represents either a portfolio value series or the aggregated return series) and turning it into a visual inspection of performance over the backtest horizon. Matplotlib.plot consumes the Series values for the y-axis and, importantly, uses the Series index for the x-axis — so if your DataFrame is indexed by timestamps you’ll get a properly scaled time axis without extra work; if the index is numeric, the x-axis will be numeric instead. This is why the type and continuity of the index matter: a datetime index yields readable date ticks, while missing values in the Series produce gaps in the line, and non-unique or multi-index keys can change the plotted result or raise errors.

Calling plt.figure() up front ensures this plot is drawn on a fresh figure rather than being overlaid on any previously-created plots; that’s useful in backtests where you may loop through strategies or plots and want each visualization isolated. Finally, plt.show() hands control to the renderer — it blocks in a script until the window is closed and triggers inline rendering in many notebook environments — which is the practical step that makes the plot visible so you can inspect runs, regime shifts, or anomalous days. From a backtesting perspective this quick visualization is primarily about sanity-checking and exploratory analysis: confirm cumulative behavior, spot large drawdowns, check for flatlines (stale data), and validate that timestamps and frequencies are what you expect.

If you need more actionable backtest visuals, consider plotting cumulative returns (e.g., (1 + returns[‘total’]).cumprod()), adding axis labels and a title, enabling a grid, and annotating peak-to-trough drawdowns. Also prefer pandas’ Series.plot or explicitly set figure size/dpi when plotting very long series to avoid performance or readability issues. These additions don’t change the fundamental flow above, but they make the output much more useful for diagnosing strategy performance.

plt.figure()
plt.plot(our_pct_growth, label='ML long/short strategy', linewidth=2)
plt.plot(benchmark_ptc_growth, linestyle='--', label='Buy and hold strategy', linewidth=2)
plt.legend()
plt.show()

This block builds a simple visual comparison between the cumulative performance of your machine-learning-driven long/short strategy and a buy-and-hold benchmark. It starts by explicitly creating a fresh plotting canvas so the two lines won’t be drawn on any preexisting figure. The first plot call draws our_pct_growth — typically a pandas Series indexed by date representing cumulative percentage or point growth — so the x-axis comes from the Series index and the y-values are its cumulative returns; the thicker linewidth improves clarity when examining overlapping traces. Immediately after, the benchmark series is drawn on the same axes using a dashed line style to make it visually distinct; both series are labeled so the legend can map colors/linestyles back to strategy names.

Calling plt.legend() places the legend on the plot so viewers can quickly identify which line corresponds to which strategy, and plt.show() flushes the figure to the screen (or notebook output) so you can inspect it. In the backtesting context this overlay is intentionally simple: seeing both cumulative-return curves together lets you assess relative performance, timing of out- or under-performance, differences in realized volatility (wiggliness of the lines), and periods of drawdown or recovery. Those visual cues are useful for rapid sanity checks (e.g., detecting lookahead bias, data misalignment, or unexpected spikes) before you move on to formal metrics.

A couple of practical reasons behind these choices: using the Series’ index for the x-axis preserves time alignment (critical when comparing strategy vs benchmark), the dashed benchmark helps reduce perceptual clutter, and the thicker lines make small but persistent performance differences easier to spot. If you need deeper diagnostic power, augment this basic plot with annotated drawdowns, axis labels/titles, log-scaling for long horizons, or subplots for rolling statistics; but as written it’s a concise, readable first-pass visualization for comparing backtest outcomes.

def sharpe(returns):
    return np.sqrt(len(returns)) * returns.mean() / returns.std()

This small function computes a single-number Sharpe-style statistic from a sequence of period returns by taking the sample mean of the returns, dividing by their sample volatility, and scaling that ratio by the square root of the number of observations. Conceptually this follows the usual Sharpe logic: the numerator (mean of returns) is the reward per period, the denominator (standard deviation) is the risk per period, and their ratio is a per-period signal-to-noise measure. Multiplying by sqrt(len(returns)) applies the square-root-of-time scaling that assumes independent, identically distributed returns — under that assumption the expected cumulative mean grows linearly with time while variance grows linearly too, so the ratio for the whole sample scales as sqrt(T). In short, the code produces a risk‑adjusted return for the sample as a whole rather than an unscaled per-period metric.

There are several important practical reasons and caveats to keep in mind when using this in a pandas backtest. First, the function does not subtract a risk‑free rate, so it computes a “raw” Sharpe; for meaningful strategy comparisons you should pass excess returns (strategy returns minus risk‑free) instead of raw returns. Second, the sqrt(len(…)) factor only gives an annualized Sharpe if len(returns) equals the number of periods per year (for example, 252 for daily returns); otherwise it yields the Sharpe for the entire sample length, not an annualized value. Third, the calculation relies on mean() and std() semantics of whatever array-like you pass in — pandas.Series.std uses a different degrees-of-freedom default (ddof=1) than numpy.std (ddof=0), so results can differ subtly depending on the type of input. You also need to guard against NaNs and zero volatility (std == 0) which would produce misleading or infinite values.

For robust backtesting practice, normalize your input first (drop NaNs, ensure you feed excess returns), choose the correct annualization factor or use sqrt(T) intentionally when you want the sample aggregate Sharpe, and be aware of distributional violations: serial correlation, nonstationarity and fat tails all bias the naive Sharpe. In those cases consider correcting for autocorrelation (Newey‑West), using return aggregation matched to desired annualization (e.g., multiply by sqrt(252) for daily), or using alternative performance metrics or hypothesis tests that account for small‑sample bias and non‑IID behavior.
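
A more defensive variant might look like the sketch below; the risk_free argument, the 252-day annualization default, and the ddof choice are illustrative assumptions rather than part of the original code:

def sharpe_annualized(returns, risk_free=0.0, periods_per_year=252, ddof=1):
    # work on excess returns and drop missing values before computing moments
    r = pd.Series(returns).dropna() - risk_free
    vol = r.std(ddof=ddof)  # sample standard deviation by default
    if len(r) < 2 or vol == 0 or np.isnan(vol):
        return np.nan  # avoid divide-by-zero and meaningless values
    return np.sqrt(periods_per_year) * r.mean() / vol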

print(sharpe(our_pct_growth))
print(sharpe(benchmark_ptc_growth))

Those two lines compute and display a risk‑adjusted performance metric — the Sharpe ratio — for your strategy and for the benchmark, letting you compare how much excess return each produces per unit of volatility. Conceptually the data flow is simple: a pandas Series of per‑period returns (our_pct_growth and benchmark_ptc_growth) is passed into the sharpe(…) function, the function reduces the series to a single Sharpe value and that scalar is printed. Practically, a more complete sharpe implementation would perform the following sequence of operations (the minimal version defined above skips most of them, which is why the caveats matter): drop or ignore NaNs so missing data doesn't bias the moments; ensure the series represents period returns in the expected units (decimal returns vs. percentages) because a factor‑of‑100 error will ruin interpretation; compute the mean of the returns and subtract a risk‑free rate (excess returns measure compensation for risk rather than raw returns); compute the return volatility (standard deviation) to measure risk; and finally form the ratio of mean excess return to volatility, annualizing it by multiplying by sqrt(N_per_year) or scaling the mean and std appropriately so the Sharpe is on an annualized basis. Those steps answer the "how" (vectorized pandas reductions and simple algebra) and the "why" (we summarize returns into a single, comparable risk‑adjusted metric that compensates for differing volatilities and sampling frequencies).

A few practical caveats and reasons to inspect the implementation: confirm the convention for the risk‑free rate and whether it is subtracted before or after annualization, verify the annualization factor matches your data frequency (e.g., 252 trading days vs 365 days vs 12 months), and check whether the function uses population or sample standard deviation (ddof) — these choices change the numeric value. Also validate the input semantics: if your series are cumulative percent growth rather than period returns, the Sharpe calculation will be wrong; likewise if they are percent values (0–100) instead of fractional returns (0–1). From a backtesting perspective, printing both Sharpe ratios is useful for an immediate side‑by‑side comparison, but for robust analysis you should also ensure the two series are aligned in time, have identical sampling and lookahead treatment, and preferably record the values (and confidence or bootstrap intervals) rather than just printing them. In short: these prints give a quick, interpretable snapshot of risk‑adjusted performance, but the correctness and usefulness depend on the sharpe implementation and on consistent, correctly formatted input series.

returns.tail()

Here you’re invoking pandas’ tail() on the returns object to get a quick, focused view of the most recent rows. In the backtesting context returns is usually a Series or DataFrame of per-period returns derived from price data (pct_change, log returns, or aggregated/resampled returns), and tail() — by default the last five rows — lets you inspect the end-of-sample behavior without scrolling through the whole dataset.

We do this because the tail of the series is where a lot of subtle issues appear that can break a backtest: off-by-one indexing from pct_change, NaNs created by differencing or resampling, an unexpected extra row from alignment or timezone conversions, or mismatched frequency that would change how signals execute at the end of the sample. Looking at the final rows helps verify that the last return corresponds to the expected timestamp, that there are no spurious NaNs, and that any rolling/aggregation windows produced values as intended. This check is particularly important before computing performance metrics or executing the final trades of a simulated strategy, because errors at the end of the series can create lookahead bias or cause the backtest to drop or misattribute the last trade.

A couple of practical notes: tail(n) accepts an integer if you want a different number of rows (tail(1) to get just the last return), and calling tail() is non‑mutating — it returns a view/slice for inspection without altering the original returns object. If tail reveals problems, the next steps are to trace upstream transformations (resampling, fill/shift operations, alignment with signal series) and fix the source so the series’ tail looks consistent with the trading logic.


Bayesian Neural Networks

import pandas as pd
import numpy as np
import matplotlib.pylab as plt

These three import lines set up the basic data, numeric, and visualization tools that form the I/O and diagnostics backbone for building a Bayesian neural network (BNN). In a typical BNN workflow you start by ingesting real-world datasets: pandas gives you rich, table-oriented primitives for reading files, handling missing values, grouping and merging, and performing quick exploratory transforms. You use it here to load and inspect datasets, to perform initial feature engineering (categorical encoding, joins, filtering) and to keep metadata (column names, dtypes) that are useful for reproducibility and analysis. The reason we do this in pandas first is convenience and clarity — it keeps the “data cleaning and feature design” stage expressive and auditable before we move into the numeric-heavy parts of inference.

Once the data is cleaned and the relevant feature/target columns are selected, you move to numpy for the numerics-heavy part of the story. Converting pandas frames to numpy arrays (e.g., DataFrame.values or .to_numpy()) is the deliberate transition from a schema-rich representation to a dense numeric representation suitable for vectorized operations, linear algebra, and probability calculations. In BNNs we repeatedly evaluate likelihoods and priors, compute gradients or forward passes, and sample or manipulate large arrays of parameter draws — numpy’s broadcasting and BLAS-backed array operations let us do that efficiently in CPU memory. Importantly, numpy’s dtype choices (float32 vs float64) and RNG semantics matter for numerical stability and reproducibility in Bayesian computations, so this is where we standardize types, set seeds, and ensure shapes are consistent for downstream inference engines.

The narrative continues with inference and diagnostics: after we implement priors, likelihoods and the forward model (often in pure numpy for toy experiments, or as a bridge to a probabilistic framework), we use numpy to compute vectorized log-probabilities, batch predictions for many posterior samples, and summaries such as means, variances, and credible intervals. Doing those calculations in numpy lets you cheaply evaluate posterior predictive distributions across the dataset (e.g., computing predictive means and quantiles for every input) and to assemble arrays of samples that will be visualized and audited.

Visualization is the final, essential piece: matplotlib.pylab provides the plotting primitives used to perform exploratory data analysis and — and crucially — posterior predictive checks and diagnostics. For BNNs you’ll typically plot trace plots of parameter chains, histograms/density plots of posterior marginals, predictive scatter plots with shaded credible intervals (fill_between), calibration curves, and learning curves (ELBO or negative log likelihood over iterations). Visual diagnostics are how you detect mis-specified priors, non-convergence, multimodality, or over/under-confidence in posterior predictions, so having plotting integrated into the pipeline is not just cosmetic — it’s part of model validation.

A few practical considerations tie these libraries together for robust BNN work. Keep the heavy numeric work in contiguous numpy arrays (avoid per-row pandas operations in tight loops), be explicit about dtypes (float64 is safer for precise MCMC; float32 may be necessary for GPU-backed training), and convert back to pandas only for human-facing summaries or to preserve column metadata. For reproducibility, set numpy’s RNG seed early and propagate any seed into downstream libraries. Finally, for production-scale or GPU-accelerated Bayesian inference you may replace or augment numpy with frameworks like JAX/NumPyro, PyTorch/pyro, or TensorFlow Probability, but pandas and matplotlib remain useful for data handling and diagnostics even in those cases.

data = pd.read_csv('ethusd_tweets.csv')[::-1]

This line reads the CSV into a DataFrame and then reverses the row order. Practically, pd.read_csv parses the file into a table where each row is one observation; the [::-1] slice flips that table end-to-end so the last row becomes the first and vice versa. The explicit intent is almost always chronological: many exported time-series CSVs are written newest-first, and reversing them yields oldest→newest order which is the natural ordering for building sequences, rolling windows, and forward-looking train/validation splits.

Why that chronological order matters for Bayesian Neural Networks: BNNs used for forecasting or uncertainty-aware time-series modeling must be trained without seeing future information. If the DataFrame remained newest-first you would risk creating training examples that implicitly contain future observations or make splits that leak information forward in time. Reversing here enforces a causal timeline so that when you generate input–target pairs, do walk-forward validation, or compute rolling statistics you are always using past data to predict the future — a prerequisite for valid posterior/uncertainty estimates.

A couple of practical notes about how this operates and pitfalls to avoid: slicing with [::-1] reverses rows but preserves the original index labels, so you’ll end up with a non-monotonic index (e.g., original timestamps or integer indices in descending order). Downstream operations that expect a monotonic index (resampling, rolling, some time-based split utilities) may behave oddly, so it’s common to follow this with .reset_index(drop=True) or to use df.iloc[::-1] for clarity. Also be mindful of data-leakage-sensitive preprocessing: compute scalers/normalization parameters on the training portion after you’ve fixed chronological order and splits so the BNN’s priors and posterior inferences aren’t contaminated by future data. Finally, while reversing is cheap for moderate datasets, if you care about explicit copies or memory behavior use an explicit copy to avoid ambiguous view/copy semantics.
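
A minimal sketch of the more explicit variant described above (same file name as in the code that follows):

data = pd.read_csv('ethusd_tweets.csv').iloc[::-1].reset_index(drop=True)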

N = 360

highp = pd.to_numeric(data.ix[:, 'High'][-N:])
lowp = pd.to_numeric(data.ix[:, 'Low'][-N:])
openp = pd.to_numeric(data.ix[:, 'Open'][-N:])
closep = pd.to_numeric(data.ix[:, 'Close'][-N:])
tweets = pd.to_numeric(data.ix[:, 'Tweets'].replace('null', 0)[-N:])
volume = pd.to_numeric(data.ix[:, 'Volume'][-N:])
marketcap = pd.to_numeric(data.ix[:, 'Market Cap'][-N:])

normal_close = closep

highp = highp.pct_change().replace(np.nan, 0).replace(np.inf, 0)
lowp = lowp.pct_change().replace(np.nan, 0).replace(np.inf, 0)
openp = openp.pct_change().replace(np.nan, 0).replace(np.inf, 0)
closep = closep.pct_change().replace(np.nan, 0).replace(np.inf, 0)
tweets = tweets.pct_change().replace(np.nan, 0).replace(np.inf, 0)
volume = volume.pct_change().replace(np.nan, 0).replace(np.inf, 0)
marketcap = marketcap.pct_change().replace(np.nan, 0).replace(np.inf, 0)

normal_close = np.array(normal_close)
highp = np.array(highp)
lowp = np.array(lowp)
openp = np.array(openp)
closep = np.array(closep)
tweets = np.array(tweets)
volume = np.array(volume)
marketcap = np.array(marketcap)

This block is preparing the recent history of market and social signals for input into a Bayesian Neural Network by extracting the last N observations, turning raw series into relative changes, and producing numeric arrays the model can consume. It begins by slicing the last 360 rows for each relevant column (High, Low, Open, Close, Tweets, Volume, Market Cap) and coercing them to numeric values so downstream operations won’t fail on string-likes; for the Tweets column it converts any explicit ‘null’ strings to 0 first, which is a simple missing-value strategy so the percent-change calculation can run without exceptions. The original close prices are also saved in normal_close before transformation because you typically need the raw price series later — for example to convert model outputs on returns back into price-space for evaluation or trading logic.

Next, the code converts the raw series into period-over-period percentage changes. Using pct_change transforms absolute values into relative returns (x_t / x_{t-1} - 1), which is important for model stability and interpretability: returns are closer to stationarity, have smaller dynamic range across instruments, and make priors and likelihood assumptions in a Bayesian model more meaningful. From an optimization and Bayesian inference perspective, working with returns reduces the chance of exploding gradients and makes Gaussian-like priors more appropriate because the numerical scales are similar across features (prices, volume, tweets, market cap).

The code then defensively replaces NaNs and infinite values with zeros. The first element of a pct_change series is NaN by construction, and divisions by zero (e.g., prior value = 0) can produce infinities; replacing these with 0 is a pragmatic choice that treats unknown or undefined relative change as “no change.” This avoids propagating NaN/inf into the BNN, which would break inference and gradient calculations. Be aware this is an imputation choice — it’s simple and robust, but it can introduce bias if missingness is informative; alternatives include forward/backward filling, small epsilon adjustments, masking, or dropping the first row after alignment.

Finally, each pandas Series is converted to a NumPy array because most ML and probabilistic libraries expect contiguous numeric arrays rather than pandas objects. This ensures consistent dtypes and shape behavior when you assemble feature matrices and feed them into the Bayesian model. Two practical notes: (1) .ix is deprecated — prefer .loc or direct column access to avoid surprises when refactoring — and (2) you may want to further standardize or normalize these return arrays (z-score, robust scaling, or convert to log-returns) and explicitly check alignment of timestamps/indices before modeling. Those additional preprocessing steps typically improve convergence and the quality of posterior uncertainty estimates in Bayesian Neural Networks.
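
As a rough sketch of those suggestions, the hypothetical helper below uses .loc instead of .ix and z-scores the returns; the column names mirror the CSV used here, and in a real backtest you would fit the mean/std on the training slice only to avoid leakage:

def prepare_features(df, n=360):
    recent = df.iloc[-n:].copy()
    recent['Tweets'] = recent['Tweets'].replace('null', 0)
    cols = ['High', 'Low', 'Open', 'Close', 'Tweets', 'Volume', 'Market Cap']
    numeric = recent.loc[:, cols].apply(pd.to_numeric, errors='coerce')
    returns = (numeric.pct_change()
                      .replace([np.inf, -np.inf], np.nan)
                      .fillna(0))
    # z-score each column so all features sit on comparable scales
    # (fit these statistics on the training window only in a real pipeline)
    standardized = (returns - returns.mean()) / (returns.std() + 1e-8)
    return numeric['Close'].values, standardized.values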

WINDOW = 7
STEP = 1
FORECAST = 1

X, Y = [], []
for i in range(0, len(openp), STEP): 
    try:
        o = openp[i:i+WINDOW]
        h = highp[i:i+WINDOW]
        l = lowp[i:i+WINDOW]
        c = closep[i:i+WINDOW]
        v = volume[i:i+WINDOW]
        t = tweets[i:i+WINDOW]
        m = marketcap[i:i+WINDOW]
        
#         y_i = (normal_close[i+WINDOW+FORECAST] - normal_close[i+WINDOW]) / normal_close[i+WINDOW]
        y_i = closep[i+WINDOW+FORECAST]
        x_i = np.column_stack((o, h, l, c, v, t, m))
        x_i = x_i.flatten()

    except Exception as e:
        break

    X.append(x_i)
    Y.append(y_i)

X, Y = np.array(X), np.array(Y)

This block is building the supervised dataset that you’ll feed into a Bayesian Neural Network: for each time step it collects a fixed-length history across several market signals, flattens that history into a single feature vector, and pairs it with a future target value. The outer loop walks through the time series in increments of STEP to create overlapping training examples; with STEP=1 you produce a sliding window at every time index, which increases sample count and preserves temporal continuity between adjacent examples. For each starting index i the code slices WINDOW consecutive observations for each signal (open, high, low, close, volume, tweets, marketcap). Those per-feature slices are column-stacked to form a (WINDOW × num_features) matrix that preserves the time ordering of features, then flattened into a 1-D vector x_i so each sample is a single dense input row suitable for a fully connected (Bayesian) network.

The target Y is taken from closep (which at this point holds per-period close returns, not raw prices) at index i + WINDOW + FORECAST. Because the window covers indices i..i+WINDOW-1, this places the target beyond the first out-of-window observation; with the current constants (WINDOW=7, FORECAST=1) you are grabbing closep[i+8], so double-check whether you intended the one-step-ahead return immediately after the window (that would be i+WINDOW) or a horizon one step further out. The commented-out line shows an alternative target built from the raw prices saved in normal_close (the relative change from the close at i+WINDOW to a later index), which makes the forecast horizon explicit; either way, targeting returns rather than raw prices is usually preferable because returns are more stationary and typically better suited for learning and uncertainty estimation in BNNs.

A few implementation details matter for model behavior. Flattening the time × feature matrix produces input vectors of length WINDOW * num_features (7*7 = 49 here); flattening in the current np.column_stack + flatten ordering produces blocks like [o0,h0,l0,c0,v0,t0,m0,o1,h1,…], so temporal adjacency is preserved within the flattened vector but you lose an explicit 3D time dimension that recurrent or convolutional architectures would exploit. Also, the code does no scaling or normalization: for Bayesian models this is important because unscaled inputs and targets interact poorly with priors and posterior inference — you should standardize or normalize continuous features (and consider log transforms for volume/marketcap) and preferably use returns or standardized price deltas for Y to keep the predictive scale reasonable.

Finally, be careful with robustness and indexing. The try/except around the slicing will break out of the loop on any exception and silently stop creating examples; that effectively truncates your dataset once you reach indices where i+WINDOW+FORECAST is out of bounds, but it also masks other bugs. Prefer computing an explicit loop bound (e.g., range(0, len(openp) - WINDOW - FORECAST + 1, STEP)) and catching only IndexError if you must, as sketched below. Also consider whether you want overlapping windows (STEP=1), whether to shuffle samples before training, and whether to keep the time dimension (n_samples, WINDOW, n_features) instead of flattening if you plan to use temporal BNN layers — all of these choices affect how the Bayesian model represents uncertainty and how well it generalizes.
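
Here is a sketch of that safer windowing, assuming the per-column arrays have already been stacked into a single (T, n_features) matrix; make_windows is a hypothetical helper, and it keeps the 3D shape rather than flattening:

def make_windows(features, target, window=7, step=1, forecast=1):
    # features: (T, n_features) array; target: (T,) array aligned with it
    X, Y = [], []
    last_start = len(target) - window - forecast
    for i in range(0, last_start + 1, step):
        X.append(features[i:i + window])              # shape (window, n_features)
        Y.append(target[i + window + forecast - 1])   # value `forecast` steps past the window
    return np.array(X), np.array(Y)

# usage with the arrays built above:
# features = np.column_stack((openp, highp, lowp, closep, volume, tweets, marketcap))
# X, Y = make_windows(features, closep, window=WINDOW, step=STEP, forecast=FORECAST)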

plt.figure()
plt.plot(Y)
plt.show()

This block is a very small visualization step: it allocates a fresh plotting canvas (plt.figure()), renders the array or sequence Y as a line plot (plt.plot(Y)), and then forces the figure to be displayed (plt.show()). The implicit data flow is simple — whatever numerical sequence Y contains is interpreted as y-values indexed by their position (i.e., x axis = 0..N-1 unless Y is paired with explicit x values) and turned into a connected line. The explicit creation of a new figure before plotting ensures you don’t accidentally overdraw or mix this plot with any previously-created figures, and plt.show() makes the rendering visible in non-interactive scripts (in interactive notebooks it is often optional).

In the context of Bayesian Neural Networks, this quick plot is most commonly used as an exploratory check: you might be plotting the posterior predictive mean across inputs, a time series of losses, a single chain of sampled parameter values, or one realization from the posterior predictive distribution. The reason for visual inspection here is diagnostic — visualization quickly reveals whether predictions follow expected trends, whether there are obvious biases, mode collapse, oscillations, or other pathologies that numeric summaries might hide. Because BNNs emphasize uncertainty, a plain line plot is often only the first step; if Y represents mean predictions you should also visualize uncertainty (for example, many posterior predictive samples, shaded credible intervals, or error bars) so you can assess calibration and heteroscedastic behavior rather than just point estimates.

A few practical caveats and improvements tied to BNN workflows: be explicit about axes and legends (labels and a title) so the plot is interpretable in reports; if Y is 2D (e.g., multiple chains, multiple posterior predictive samples, or multiple outputs) decide whether to plot aggregated statistics (mean ± std) or multiple thin traces with transparency to show sample spread; use fill_between or errorbar to communicate uncertainty bands; and remember plt.show() blocks execution in scripts until the window is closed, whereas in notebooks the figure usually renders inline. This minimal plotting step is useful for fast iteration and debugging, but for rigorous evaluation of a Bayesian model you’ll typically augment it with visualizations that directly communicate posterior uncertainty and model fit.

print(X.shape, Y.shape)

This single inspection is doing more than just showing two tuples: it’s a lightweight sanity check that anchors the rest of the Bayesian neural network pipeline to concrete data dimensions. In a BNN workflow X is the design matrix (rows = examples, columns = features) and Y is the target array (rows = examples, columns = targets or classes). Knowing X.shape and Y.shape tells you the dataset size (N = number of observations) and the input/output dimensionalities that determine how you instantiate the model — e.g., the input layer width must match X.shape[1], and the output layer or likelihood parameterization must match Y.shape[1] (or be treated as a scalar if Y is shape (N,)). These dimensions directly affect prior sizes, the parameterization of the predictive distribution, and how you compute the log-likelihood (you sum/average across the same N and must align feature/label axes when broadcasting).

Beyond initialization, checking shapes early prevents subtle, costly mistakes: transposed data (D×N vs N×D) will silently break assumptions about feature order and weight shapes; a label vector shaped (N,) vs (N,1) can change broadcasting behavior in loss calculations; and multiclass versus scalar regression targets require different likelihoods (one-hot (N,K) for categorical cross-entropy, (N,1) or (N,) for Gaussian regression). In variational inference and minibatch stochastic gradient setups, N and the batch dimension govern how you scale ELBO terms and construct minibatch masks — a wrong shape can make your Monte Carlo estimates incorrect or raise runtime shape-errors when summing over observations. Because of that, printing shapes at the start is a practical step to confirm that the data flow matches the probabilistic model’s expectations; in production or tests you’d replace this ad‑hoc print with assertions or structured logging that fail fast and document the expected shapes.
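
A minimal sketch of such checks, assuming the X and Y built above (WINDOW * 7 columns from the seven stacked signals):

assert X.ndim == 2 and Y.ndim == 1, 'expected X as (N, D) and Y as (N,)'
assert X.shape[0] == Y.shape[0], 'X and Y must have the same number of samples'
assert X.shape[1] == WINDOW * 7, 'feature width should equal WINDOW * n_features'
assert not np.isnan(X).any() and not np.isnan(Y).any(), 'NaNs will break training'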

from keras.models import Sequential
from keras.models import Model
from keras.layers.core import Dense, Dropout, Activation, Flatten, Permute, Reshape
from keras.layers import Merge, Input, concatenate
from keras.layers.recurrent import LSTM, GRU, SimpleRNN
from keras.layers import Convolution1D, MaxPooling1D, GlobalAveragePooling1D, GlobalMaxPooling1D, RepeatVector, AveragePooling1D
from keras.callbacks import ModelCheckpoint, ReduceLROnPlateau, CSVLogger
from keras.layers.wrappers import Bidirectional, TimeDistributed
from keras import regularizers
from keras.layers.normalization import BatchNormalization
from keras.layers.advanced_activations import *
from keras.optimizers import RMSprop, Adam, SGD, Nadam
from keras.initializers import *
from keras.constraints import *
from keras import losses

This import block is the toolkit for assembling and training a range of neural architectures you might use as components in a Bayesian Neural Network (BNN) experiment, so the first thing to understand is the mental model it enables: you can build convolutional, recurrent, and dense subnetworks, combine them flexibly with the Functional API, regularize them with explicit penalties or stochastic approximations, and run controlled training procedures that log and checkpoint models. The presence of both Sequential and Model (Functional API) plus Merge/concatenate indicates the codebase expects to mix modular blocks — e.g., separate encoders for different inputs or feature streams that are later concatenated — which is important for BNNs because we often want to compose probabilistic submodels and treat their outputs as random variables that are fused downstream.

The layer imports show the kinds of representational primitives available: Dense, Convolution1D, and LSTM/GRU/SimpleRNN give you fully-connected, sequence-convolutional, and sequence-recurrent building blocks respectively; TimeDistributed, RepeatVector, Permute and Reshape are there to align temporal and channel dimensions when connecting these blocks. Use 1D convolutions + pooling (MaxPooling1D, GlobalAveragePooling1D/GlobalMaxPooling1D) when you want local feature extractors that produce invariances and compress temporal structure before you place a probabilistic readout. For sequence modeling and encoder/decoder patterns you’ll rely on Bidirectional and the recurrent layers; TimeDistributed is essential when applying the same dense head to each timestep of a sequence output. In a BNN context these choices affect where and how uncertainty is modeled: convolutional and recurrent blocks produce structured, correlated features, and your uncertainty propagation decisions (e.g., keep dropout inside recurrent cells, apply dropout to conv channels, or place uncertainty only at the output head) will change the posterior behavior.

Regularization, initialization and activation choices are where the “Bayesian” behavior is most concretely expressed. Dropout is present because Monte Carlo Dropout is a practical variational approximation: by leaving dropout active at inference and averaging multiple forward passes you approximate a posterior predictive distribution. The regularizers import (L1/L2) is your place to encode simple Gaussian or Laplace priors on weights (weight decay as a MAP prior), and constraints let you restrict parameter support if you need to implement, for example, bounded priors or clipping to stabilize training. Initializers matter because prior mean and scale of weights influence the initial posterior; thoughtful initialization reduces mode collapse and enables better posterior exploration. BatchNormalization and advanced activations (LeakyReLU, PReLU, ELU, etc.) accelerate and stabilize optimization but in Bayesian settings you must be cautious: BatchNorm’s running statistics and test-time behavior interact badly with stochastic inference (it can leak dataset-wide estimates into predictions), so either re-fit BN for stochastic passes or avoid it if you rely on MC dropout for uncertainty.

Training utilities and loss/optimizer choices control how you approach the posterior approximation. The callbacks (ModelCheckpoint, ReduceLROnPlateau, CSVLogger) are about reproducibility and training control: checkpointing preserves snapshots (useful if you later want to ensemble different checkpoints as part of a posterior multimodal approximation), ReduceLROnPlateau helps escape noisy plateaus and improves final likelihood, and CSVLogger tracks metrics for diagnostics. The imported optimizers (RMSprop, Adam, SGD, Nadam) provide different dynamics for exploring weight space — adaptive methods like Adam can converge faster but sometimes underestimate uncertainty compared to SGD; choose with awareness of how optimizer noise and step sizes affect posterior dispersion. The losses module is where you’ll either use standard negative log-likelihood terms or implement custom objectives that add an explicit KL term (NLL + λ * KL) to form a variational lower bound — Keras’ loss plumbing is flexible enough to add per-batch or per-sample KL penalties, which is essential for principled variational BNN training.

Finally, a few practical cautions and how-tos for turning these pieces into a working Bayesian model: Keras doesn't provide native Bayesian layers here, so you implement approximate inference by either (a) using dropout and Monte Carlo averaging for predictive uncertainty, (b) adding weight regularization terms that approximate priors, or (c) creating custom layers/losses that compute weight-space KL divergences and add them to the training objective. Keep dropout active at inference if you're using MC Dropout; be careful with BatchNorm (prefer using layer normalization or remove it in parts of the network involved in stochastic inference), and use the Functional API (Model) rather than Sequential when you need multi-input or multi-headed probabilistic structures. Track training with checkpoints and CSV logs so you can analyze predictive distributions over checkpoints, and tune optimizer and learning-rate schedules to avoid overconfident collapse or underfitting — these practical choices have a direct impact on whether your final predictive distribution is well-calibrated as an approximation to the Bayesian posterior.
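
As a sketch of the MC Dropout route, assuming mc_model is a compiled Keras model that actually contains Dropout layers (the get_model defined next has none, so you would add one), repeated stochastic forward passes give you a predictive mean and spread:

from keras import backend as K

def mc_dropout_predict(mc_model, x, n_samples=100):
    # a backend function that runs the model with the learning phase set to 1,
    # so Dropout stays active at prediction time
    stochastic_forward = K.function([mc_model.input, K.learning_phase()],
                                    [mc_model.output])
    draws = np.stack([stochastic_forward([x, 1])[0] for _ in range(n_samples)])
    return draws.mean(axis=0), draws.std(axis=0)  # predictive mean and spread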

def get_model(input_size):
    main_input = Input(shape=(input_size, ), name='main_input')
    x = Dense(25, activation='linear')(main_input)
    output = Dense(1, activation='linear', name='out')(x)
    final_model = Model(inputs=[main_input], outputs=[output])
    final_model.compile(optimizer='adam', loss='mse')
    return final_model

This small factory function builds and compiles a plain Keras feed-forward regressor, and it’s useful to read it as a pipeline: a feature vector of length input_size enters the model, is linearly projected into a 25-dimensional hidden representation, and that representation is linearly reduced to a single scalar prediction which the model is trained to match using mean-squared error optimized by Adam.

Breaking that down: Input(shape=(input_size,)) establishes the expected feature vector shape so downstream layers get consistent tensors. The first Dense(25, activation=’linear’) applies an affine transform to the input and purposely does not apply a nonlinearity; the next Dense(1, activation=’linear’) applies another affine transform to produce the scalar output. Because both layers use linear activations, the entire two-layer network collapses to a single linear mapping (technically a low-rank linear map with rank ≤ 25): there is no representational nonlinearity introduced between input and output. That is important to notice — if your intent is to capture nonlinear relationships in data you should switch the hidden layer’s activation to a nonlinear function (ReLU, tanh, etc.). If, on the other hand, this model is intended only as a linear mean function or as a simple baseline inside a larger Bayesian construction, the linear choice can be deliberate.

The compile step (optimizer='adam', loss='mse') sets up standard deterministic regression training: Adam provides adaptive first/second moment updates for faster convergence, and MSE corresponds to minimizing squared error, which is equivalent to maximizing a Gaussian likelihood with fixed variance. In the Bayesian neural network (BNN) context this is a key point: a pure MSE objective ignores priors and posterior uncertainty. For a true BNN you normally replace Dense with probabilistic (Bayesian) layers or augment the loss with a KL term (ELBO) and/or optimize a negative log-likelihood that includes predictive variance. In other words, this function returns a deterministic backbone that can be used as-is for point-estimate regression, or adapted into a BNN by (a) swapping Dense for BayesianDense/variational layers, (b) outputting both mean and log-variance for heteroscedastic likelihoods, and (c) changing the training loss to an ELBO or NLL with KL regularization.

Finally, the naming (main_input, out) and the single-output design reflect the goal of scalar regression predictions — a common choice for many likelihood-based Bayesian regression setups where the network models the predictive mean. If you intend to model predictive uncertainty within the network itself, plan to (1) introduce nonlinearity where appropriate, (2) expose variance from the network output or use Bayesian layers, and (3) change the loss/optimization to incorporate priors and likelihood explicitly.
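
A sketch combining points (1) and (2) with a Gaussian negative log-likelihood loss is shown below; it is not part of the original code. One quirk of training a two-unit output with a plain compile/fit loop is that the target array must match the output width, so you would fit with a padded target such as np.column_stack([Y_train, np.zeros_like(Y_train)]) and let the loss read only the first column.

from keras import backend as K

def get_heteroscedastic_model(input_size):
    inp = Input(shape=(input_size,), name='main_input')
    h = Dense(25, activation='relu')(inp)   # nonlinearity, unlike the linear baseline
    mean = Dense(1, name='mean')(h)
    log_var = Dense(1, name='log_var')(h)   # unconstrained log-variance
    out = concatenate([mean, log_var], name='out')

    def gaussian_nll(y_true, y_pred):
        mu, lv = y_pred[:, :1], y_pred[:, 1:]
        target = y_true[:, :1]               # second target column is a dummy
        return K.mean(0.5 * lv + 0.5 * K.square(target - mu) * K.exp(-lv))

    final_model = Model(inputs=[inp], outputs=[out])
    final_model.compile(optimizer='adam', loss=gaussian_nll)
    return final_model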

from sklearn.model_selection import TimeSeriesSplit

reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.9, patience=10, min_lr=0.000001, verbose=0)
checkpointer = ModelCheckpoint(filepath="testtest.hdf5", verbose=0, save_best_only=True)
    
X_train, Y_train = X[:-30], Y[:-30]
X_test, Y_test = X[-30:], Y[-30:]
    
model = get_model(len(X_train[0]))
history = model.fit(X_train, Y_train, 
              epochs = 100, 
              batch_size = 64, 
              verbose=1, 
              validation_data=(X_test, Y_test),
              callbacks=[reduce_lr, checkpointer],
              shuffle=True)

This block is the training orchestration for a neural network whose ultimate purpose is to behave as a Bayesian model (i.e., to produce well-calibrated uncertainty and a stable posterior-like solution). The first two lines set up two training-time behaviors that are wired into the fit loop: ReduceLROnPlateau watches validation loss and decays the optimizer learning rate by a factor of 0.9 whenever the validation loss does not improve for 10 epochs, down to a floor of 1e-6; this is intended to let the optimizer take progressively smaller steps as improvement stalls so the model can converge to a tighter mode of the posterior (or a better MAP estimate) rather than oscillating or overshooting. ModelCheckpoint is configured to persist the best-performing weights (save_best_only=True) so you can restore the single best model seen by validation loss — important for later inference or for using the saved weights as the starting point for any posterior-sampling or ensembling stage.

Next, the code performs a simple chronological holdout: the last 30 samples are reserved as a test/validation set and everything before that is training data. Using a contiguous tail as validation is a common pattern for time series because it preserves temporal ordering and tests generalization to future data; that tail is then passed explicitly to the fit call as validation_data so that callbacks and metrics are evaluated against future points. Note that TimeSeriesSplit is imported but not used here — TimeSeriesSplit would be the next logical improvement if you want multiple temporal folds instead of a single holdout for more robust assessment of predictive uncertainty across time.

The model is constructed with get_model(len(X_train[0])) so the architecture is parameterized by the input dimensionality (for example, number of features or window size), and then model.fit drives training for up to 100 epochs with mini-batches of 64. Training uses the explicit validation_data tuple (X_test, Y_test) so that validation loss drives both the learning-rate scheduler and checkpointing. The callbacks list wires the two behaviors described above into the loop. Returning the history object captures training and validation loss traces which are essential diagnostics for Bayesian setups — you’ll want to inspect these to judge whether the posterior approximation has stabilized or whether more epochs / different priors / different optimizer tuning are needed.

A couple of important behavioral notes tied to the Bayesian goal: shuffle=True is active here, which randomizes training sample order each epoch; for a feedforward model trained on pre-built windows this does not by itself leak future information into the held-out tail, but heavily overlapping windows adjacent to that boundary already share data with the holdout, and for stateful RNNs or sequence-batched training you would want shuffle=False or time-aware batching. Also, saving only the single best checkpoint is fine if you want the MAP weights, but if your downstream Bayesian procedure relies on weight ensembles or multiple samples (e.g., snapshot ensembles, SWAG, or checkpoints for MCMC initialization), you should save multiple checkpoints or store weight samples during training. Finally, the ReduceLROnPlateau schedule helps with stable convergence of the optimizer around a posterior mode, but if you are using a full Bayesian inference method (variational inference, MCMC, etc.), consider how optimizer annealing interacts with your approximate posterior objective — aggressive LR changes can bias the optimization path and therefore the learned uncertainty.
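
If you do switch to TimeSeriesSplit, a walk-forward sketch (reusing get_model and the X, Y arrays built above; the fold count and epochs are arbitrary choices) could look like this:

fold_mse = []
for train_idx, val_idx in TimeSeriesSplit(n_splits=5).split(X):
    fold_model = get_model(X.shape[1])
    fold_model.fit(X[train_idx], Y[train_idx],
                   epochs=100, batch_size=64, verbose=0,
                   validation_data=(X[val_idx], Y[val_idx]),
                   shuffle=False)  # keep chronological order within each fold
    fold_pred = fold_model.predict(X[val_idx]).ravel()
    fold_mse.append(np.mean(np.square(fold_pred - Y[val_idx])))
print(fold_mse)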

model.load_weights('testtest.hdf5')
pred = model.predict(X_test)
predicted = pred.ravel()  # flatten (n, 1) predictions to (n,) so they align with Y_test
original = Y_test
plt.plot(original)
plt.plot(predicted, linestyle='--')
plt.show()
print(X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)
print(np.mean(np.square(predicted - original)))
print(np.mean(np.abs(predicted - original)))

This block is the evaluation/visualization portion of the pipeline: you restore a previously trained model, run it on the test inputs to produce point predictions, visually compare those predictions to the true targets, and compute two simple error statistics. Concretely, load_weights replaces the model’s current parameters with the saved parameters in ‘testtest.hdf5’ so the network is put into the exact trained state used during persistence; we do this because we want to evaluate the learned mapping rather than re-train. model.predict(X_test) performs a forward pass over the test set and returns the network’s outputs for each test datum — note that predict is a deterministic call for the current parameterization, so the single output you get is a point estimate of the predictive mean under the loaded weights.

The plot calls overlay the true series (original) and the model’s predictions (predicted) so you can visually inspect fit quality and systematic deviations. Using a dashed line for predictions helps distinguish them from the ground truth. Be aware of shapes here: predict typically returns an array of shape (n_samples, output_dim) while your Y_test might be (n_samples,) or (n_samples, 1); mismatched shapes can lead to broadcasting or plotting artifacts, so it’s good practice to explicitly flatten or reshape outputs for plotting and metric computation to ensure elementwise correspondence.

The two printed shape values are a quick sanity check — they confirm that the training and test splits have the expected dimensions and that X/Y pairings are aligned. The printed mean squared error (np.mean(np.square(predicted - original))) and mean absolute error (np.mean(np.abs(predicted - original))) are simple, interpretable summaries of point-prediction accuracy: MSE penalizes larger errors more strongly and corresponds to the squared-loss objective often used in regression, while MAE gives a robust average absolute deviation. Both are computed elementwise and averaged across all test examples (and across output dimensions if the outputs are multidimensional), so when you have multiple output variables or different scales you should either report per-dimension metrics or normalize appropriately.

From a Bayesian neural network perspective, this code only gives a single point estimate and therefore does not characterize predictive uncertainty. In a proper BNN workflow you’d want to perform multiple stochastic forward passes (for example, enabling dropout at inference or sampling weights from the learned posterior) to approximate the posterior predictive distribution, then compute the predictive mean and variance and visualize credible intervals rather than just one deterministic line. Also consider using negative log-likelihood or proper scoring rules (e.g., continuous ranked probability score) in addition to MSE/MAE to evaluate probabilistic predictions. Finally, add explicit shape checks or flattening (e.g., predicted.ravel()) before plotting/metric computation to avoid silent broadcasting bugs.
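
A sketch of that uncertainty-aware evaluation, assuming draws is an array of stochastic predictions with shape (n_samples, n_test) (for example from repeated MC-dropout passes like the helper sketched earlier):

pred_mean, pred_std = draws.mean(axis=0), draws.std(axis=0)
y_true = np.asarray(Y_test).ravel()

# Gaussian negative log-likelihood of the observations under the predictive
# distribution (lower is better); a small epsilon guards against zero spread
eps = 1e-8
nll = np.mean(0.5 * np.log(2 * np.pi * (pred_std ** 2 + eps))
              + 0.5 * (y_true - pred_mean) ** 2 / (pred_std ** 2 + eps))

plt.figure()
plt.plot(y_true, label='observed')
plt.plot(pred_mean, linestyle='--', label='predictive mean')
plt.fill_between(np.arange(len(pred_mean)),
                 pred_mean - 2 * pred_std, pred_mean + 2 * pred_std,
                 alpha=0.3, label='~95% band')
plt.legend()
plt.show()
print(nll)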

import tensorflow as tf
sess = tf.Session()
with sess.as_default():
    tf.global_variables_initializer().run()

dense_weights, out_weights = None, None
with sess.as_default():
    for layer in model.layers:
        if len(layer.weights) > 0:
            weights = layer.get_weights()
            if 'dense' in layer.name:
                dense_weights = layer.weights[0].eval()
            if 'out' in layer.name:
                out_weights = layer.weights[0].eval()

This code’s purpose is to pull the current numeric values of two specific weight tensors out of a Keras/TensorFlow model so they can be used outside the graph — something you commonly do in Bayesian Neural Network (BNN) workflows when you want the current point estimate of weights (for example to initialize a variational posterior mean, compute a Laplace/Fisher approximation, or inspect learned parameters). The execution is driven by a TensorFlow 1.x Session and proceeds in two stages: initialize the graph variables, then iterate the model’s layers and evaluate the relevant weight variables.

First, a tf.Session is created and made the default; immediately after, global_variables_initializer().run() is executed so every Variable in the graph has a concrete value. That initialization is necessary because the subsequent weight evaluations call .eval() on TensorFlow Variables — .eval() requires a running session and initialized variables, otherwise you’ll get an error or uninitialized values. This initialization step therefore ensures the graph is in a runnable state and that any weight tensors have defined numeric contents to fetch.

Next, the code walks the model’s layers sequentially. For each layer that actually has weights (len(layer.weights) > 0), it first calls layer.get_weights() (which returns NumPy arrays) but then chooses to extract the TensorFlow Variable and call .eval() on it via layer.weights[0].eval(). The practical result is that dense_weights and out_weights become NumPy arrays containing the current kernel (the first weight tensor) of the last layer whose name matches the string checks. The string checks (‘dense’ in layer.name and ‘out’ in layer.name) are a simple naming heuristic to pick which weight tensors to capture: typically ‘dense’ matches intermediate fully-connected layers and ‘out’ matches the final output layer. Note the code only grabs the first weight tensor for each layer — that is usually the kernel matrix and not the bias vector — so it intentionally extracts kernels only.

Why this pattern matters in a BNN context: we often need the concrete, current-weight arrays to form priors/initial posteriors, compute per-parameter statistics (variances, Fisher diagonals), or to seed MCMC/variational inference. Converting graph Variables to NumPy arrays via .eval() gives you those point estimates so you can perform non-graph computations (numpy/scipy) or serialize them. Using the session and eval is the TF1 way to bridge between graph-held Variables and outside-the-graph code that implements Bayesian routines.

A few practical notes and cautions: the call to layer.get_weights() is redundant here because its return value is not used; you can either use layer.get_weights()[0] (which is already a NumPy array) or drop that call and rely solely on layer.weights[0].eval(). The current name-based selection is brittle — better to fetch layers by explicit name or type or to document naming conventions — and you’ll miss biases because only weights[0] is fetched. Also, this pattern is specific to TF1 graph mode; in TensorFlow 2 / Keras’ eager mode you’d simply call model.get_layer(…).get_weights() without a session. Finally, wrapping sess.as_default() twice is unnecessary; one consistent session context is sufficient and clearer.
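
For reference, a session-free sketch of the same extraction (this works because get_weights already returns NumPy arrays; the 'dense'/'out' name heuristic is carried over from the code above):

for layer in model.layers:
    if layer.get_weights() and 'dense' in layer.name:
        dense_weights = layer.get_weights()[0]   # kernel only, as above
    if layer.get_weights() and 'out' in layer.name:
        out_weights = layer.get_weights()[0]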

print(dense_weights.shape)
print(out_weights.shape)

print(type(dense_weights))

These three prints are a small, purposeful runtime sanity check that tells you whether the two weight tensors you’re about to use in the forward and Bayesian bookkeeping paths actually match the dimensional and framework expectations. First you print dense_weights.shape and out_weights.shape: that gives you the dimensional layout of the hidden-layer weight matrix and the output-layer weight matrix so you can immediately verify they will multiply with activations in the forward pass and align with any variational parameter tensors (means, log-variances, posterior samples). In a Bayesian neural network we frequently perform elementwise operations (e.g., adding sampled noise of the same shape, computing per-weight KL contributions, or broadcasting variational parameters), so a shape mismatch is a common root cause of incorrect ELBO values or silent broadcasting bugs; printing shapes early helps catch that before you run a full training loop.

The exact meaning of each shape depends on the framework: for Keras/TF a Dense kernel is (input_dim, units), while PyTorch’s Linear.weight is (out_features, in_features). Because that convention differs, these prints are an explicit check that you’ve either constructed your layers with the intended shapes or that you remembered to transpose when converting between conventions. For Bayesian layers in particular, you must ensure the shapes of any auxiliary tensors — posterior mean, log-variance, samples, and prior parameters — match the weight tensors so KL terms reduce correctly and reparameterization (rsample) produces correctly shaped noise for the forward computation.

Printing type(dense_weights) is the second, complementary check: it verifies whether the object is a framework tensor (torch.Tensor, tf.Variable, np.ndarray) or a higher-level variational object (a custom Distribution/Layer/Parameter container). The type determines what operations are available and how gradients or sampling are handled. For example, a torch.nn.Parameter participates in autograd and will be optimized by optimizers, a torch.Tensor on the wrong device will cause runtime errors, and a numpy array won’t support backprop at all. In a BNN we sometimes expect objects that implement .rsample() or .sample() and hold both mean and scale parameters; seeing the concrete type tells you whether you should call .rsample(), access .mean/.logvar, or wrap/convert the object before using it in loss/ELBO calculations.

In practice these prints are used as quick diagnostics during development: confirm dimensional alignment for matrix multiplies and KL reductions, confirm framework and device compatibility for gradient flow and optimizer updates, and detect if you accidentally replaced a parameter with a distribution container or a plain NumPy array. If the shapes or types are unexpected, the next steps are deterministic: assert/reshape or transpose to the correct convention, convert or move tensors to the expected backend/device, or adapt the Bayesian bookkeeping code (sampling/kl) to the object’s API so the variational inference computations remain consistent.

import torch
import torch.nn as nn
from torch.nn.functional import normalize  # noqa: F401
import torch.nn.functional as F

from torch.autograd import Variable

import pyro
from pyro.distributions import Normal, Bernoulli  # noqa: F401
from pyro.infer import SVI
from pyro.optim import Adam

pyro.get_param_store().clear()

This small block is the scaffolding for training a Bayesian neural network (BNN) with Pyro on top of PyTorch. At a high level it pulls in the pieces you need to (1) build neural-network components, (2) express probabilistic models and variational approximations, and (3) run stochastic variational inference (SVI) with a gradient optimizer. The narrative below follows the way training ordinarily flows once these pieces are used together.

First, the PyTorch pieces (torch, torch.nn as nn, and torch.nn.functional as F) give you the deterministic neural network building blocks: modules, layers, and low-level ops you’ll use to construct the mean function or the neural parameterization inside your probabilistic model and guide. The separate import of normalize from torch.nn.functional suggests you might normalize activations or input features at some point to stabilize training (normalizing inputs/activations prevents extreme scales that can hurt gradient-based optimization), while keeping both nn and F lets you mix module-based layers with stateless functional calls as appropriate.

Next come the Pyro imports. Normal and Bernoulli are the primitive distributions you’ll typically use to express your prior over parameters and your data likelihood: for example, Normal priors on weights (to express uncertainty about parameter values) and Bernoulli likelihoods for binary classification targets. Pyro’s SVI is the core inference engine: you provide a model (describing priors and the generative process), a guide (a learnable variational approximation to the posterior, usually parameterized by neural nets or pyro.param locations and scales), and SVI will optimize the evidence lower bound (ELBO) to bring the guide close to the true posterior. The Adam optimizer class from pyro.optim is the optimization wrapper used to update the guide parameters; it maps directly to efficient, adaptive gradient updates (Adam is chosen because it works well in practice on the noisy gradients that SVI produces).

Putting those pieces together in a typical training loop: the model samples latent variables (weights) from priors (Normal), computes predictions and the data likelihood (e.g., Bernoulli for binary labels or Normal for continuous targets), while the guide provides parameterized variational distributions (often Normal with learnable means and log-scales). SVI computes stochastic estimates of the ELBO (or another chosen loss), differentiates through the guide and model, and Adam updates the guide’s parameters to reduce the ELBO — i.e., to improve the posterior approximation. This is why we import SVI and an optimizer here: they are the mechanism that turns your probabilistic model and guide into a trainable BNN.

Finally, pyro.get_param_store().clear() resets Pyro’s global parameter store. That’s important because Pyro persists parameters (pyro.param entries) across runs; clearing ensures you start training with a clean slate rather than accidentally reusing parameters or learned state from a previous experiment. One practical note: the code still imports torch.autograd.Variable, which is legacy — modern PyTorch uses tensors with requires_grad=True directly, so Variable is unnecessary and can be removed in contemporary code. Overall, this block doesn’t perform modeling itself but prepares the exact libraries and runtime state you’ll use to define priors/likelihoods, build a variational guide, and run SVI to learn a posterior over network weights for a Bayesian neural network.

X_train, Y_train = Variable(torch.Tensor(X[:-30])), Variable(torch.Tensor(Y[:-30]))
X_test, Y_test = Variable(torch.Tensor(X[-30:])), Variable(torch.Tensor(Y[-30:]))
data = torch.cat((X_train, Y_train.view(-1, 1)), 1)

The first two lines perform a simple train/test split and convert the raw NumPy-like arrays X and Y into PyTorch tensors that participate in autograd. Concretely, the code takes all but the last 30 examples as the training set and the final 30 as the test/holdout set — a deterministic holdout that’s useful for evaluating predictive uncertainty later on (in BNNs you typically want a fixed set to compare posterior predictive distributions). Wrapping with torch.Tensor ensures the values become 32-bit floating tensors compatible with PyTorch operations, and wrapping those tensors in Variable makes them nodes in the autograd graph so downstream computations (e.g., ELBO gradients during variational inference) can propagate gradients where needed. Note: Variables are an older PyTorch API; in modern code you’d use torch.tensor(…) and control requires_grad explicitly.

The third line concatenates X_train and Y_train column-wise (dim=1) to produce a single 2‑D tensor where each row contains the input features followed by the corresponding target. This is a functional decision: many training utilities, batching routines, or custom likelihood/ELBO implementations expect a single “data” tensor where inputs and outputs are paired, because it simplifies shuffling, minibatch slicing, and passing a pair through a loss that computes both likelihood and priors. The concat requires that X_train and Y_train have matching first-dimension lengths (samples) and that both are 2‑D; because Y here is a 1‑D vector of scalar targets, it is reshaped to a column with view(-1, 1) before concatenating. Also be mindful of the semantics of the split — using the last 30 examples as a holdout is appropriate for time-ordered data but for i.i.d. datasets you might prefer a random split to ensure representativeness.

From the Bayesian Neural Network perspective, these steps set up the inputs that will be used to evaluate the posterior predictive distribution and the ELBO on minibatches: X_train/Y_train feed the likelihood term during training, while X_test/Y_test remain untouched until evaluation to measure calibration and predictive uncertainty. Finally, consider updating this pattern for modern PyTorch (avoid Variable) and explicitly placing tensors on the correct device and dtype to avoid subtle bugs during training.
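
A minimal modern-PyTorch sketch of the same split (no Variable wrapper, explicit dtype, and the target reshaped to a column before concatenation):

X_train = torch.tensor(X[:-30], dtype=torch.float32)
Y_train = torch.tensor(Y[:-30], dtype=torch.float32).reshape(-1, 1)
X_test = torch.tensor(X[-30:], dtype=torch.float32)
Y_test = torch.tensor(Y[-30:], dtype=torch.float32).reshape(-1, 1)
data = torch.cat((X_train, Y_train), dim=1)  # each row: features followed by target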

# get array of batch indices
def get_batch_indices(N, batch_size):
    all_batches = np.arange(0, N, batch_size)
    if all_batches[-1] != N:
        all_batches = list(all_batches) + [N]
    return all_batches

class Net(torch.nn.Module):
    def __init__(self, n_feature, n_hidden):
        super(Net, self).__init__()
        self.hidden = torch.nn.Linear(n_feature, n_hidden)   # hidden layer
        self.predict = torch.nn.Linear(n_hidden, 1)   # output layer

    def forward(self, x):
        x = self.hidden(x)
        x = self.predict(x)
        return x

first_layer = len(X_train.data.numpy()[0])
second_layer = 25   
    
softplus = nn.Softplus()
regression_model = Net(first_layer, second_layer)

The small helper get_batch_indices(N, batch_size) produces a sequence of boundary indices you can use to slice your dataset into minibatches for training. It constructs an array starting at 0 and stepping by batch_size up to (but not including) N, and then appends N as a final endpoint if it isn’t already present; the intended usage is to iterate over adjacent pairs of these boundaries to obtain slices like X[batches[i]:batches[i+1]]. This approach gives you explicit start/end indices rather than fixed-size slices, which makes the last minibatch correctly smaller when N is not an exact multiple of batch_size. One practical caveat: the function assumes N>0 (np.arange would be empty for N==0 and indexing all_batches[-1] would fail); otherwise it robustly covers cases where batch_size > N or when the dataset size isn’t divisible by the batch size.
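
A quick usage sketch: adjacent boundary pairs become minibatch slices of the combined data tensor, which you would later hand to the SVI training step.

batch_indices = get_batch_indices(data.size(0), 64)
for start, end in zip(batch_indices[:-1], batch_indices[1:]):
    minibatch = data[start:end]   # the last slice may be smaller than 64
    # ... pass `minibatch` into the SVI step inside the training loop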

The Net class defines a simple two-layer feedforward model: a Linear mapping from the input feature space to a hidden representation, and a second Linear mapping from that hidden representation to a single scalar output. In forward it applies the first linear layer then immediately the second, returning the raw linear output. Because there is no nonlinearity between the two Linear layers, this specific forward implementation is functionally equivalent to a single linear transform (the two linear operators can be composed into one), so in practice you either want an activation between them or a clear reason to keep them separate (e.g., to attach different priors in a Bayesian formulation). The construction of first_layer = len(X_train.data.numpy()[0]) is simply extracting the input dimensionality from your training tensor; second_layer = 25 sets the hidden dimensionality to 25 units, which is a model capacity choice that trades expressiveness against overfitting and computational cost.

You’ve also defined softplus = nn.Softplus(), but it isn’t applied in forward. In a Bayesian Neural Network context Softplus commonly serves two distinct roles: as a smooth activation function that prevents sharp saturation (an alternative to ReLU/tanh), and as a convenient monotonic transform to enforce positivity when parameterizing scale parameters (for example converting an unconstrained variational parameter into a positive standard deviation). If your goal is to build a BNN, you’ll typically want either to insert a nonlinearity (softplus or another) between hidden and output to gain representational power, or to use Softplus to parameterize the positive variance of weight/posterior distributions.

Putting this back into the Bayesian-NN story: the minibatch indices support stochastic optimization or stochastic variational inference by letting you compute a minibatch estimate of the ELBO (evidence lower bound) instead of using the full dataset each step. The deterministic Net here is a starting point; to make it Bayesian you would replace or extend its parameters with distributions (e.g., learn variational means and log-variances for weights, or use probabilistic/Flipout-style layers), compute the KL between variational and prior for each parameter set, and optimize the ELBO computed over minibatches. Also consider small practical improvements: move away from repeated .numpy() conversions (keep tensors on the correct device), add a nonlinearity if you want a true multi-layer model, and use Softplus when you need positive-constrained parameters (variances, noise scales) in your probabilistic parameterization.
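
If you want the deterministic starting point to be genuinely nonlinear before lifting it into a Bayesian model, one option is to place the Softplus between the two Linear layers. This is a sketch under that assumption, not the article's final architecture:

import torch
import torch.nn as nn

class NonlinearNet(nn.Module):
    # same two-layer shape as Net above, plus a Softplus activation between the layers
    def __init__(self, n_feature, n_hidden):
        super().__init__()
        self.hidden = nn.Linear(n_feature, n_hidden)
        self.act = nn.Softplus()
        self.predict = nn.Linear(n_hidden, 1)

    def forward(self, x):
        return self.predict(self.act(self.hidden(x)))

net = NonlinearNet(n_feature=8, n_hidden=25)   # toy dimensions for a shape check
print(net(torch.randn(4, 8)).shape)            # torch.Size([4, 1])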

def model(data):

    mu = Variable(torch.zeros(second_layer, first_layer)).type_as(data)
    sigma = Variable(torch.ones(second_layer, first_layer)).type_as(data)
    bias_mu = Variable(torch.zeros(second_layer)).type_as(data)
    bias_sigma = Variable(torch.ones(second_layer)).type_as(data)
    w_prior, b_prior = Normal(mu, sigma), Normal(bias_mu, bias_sigma)
    
    mu2 = Variable(torch.zeros(1, second_layer)).type_as(data)
    sigma2 = Variable(torch.ones(1, second_layer)).type_as(data)
    bias_mu2 = Variable(torch.zeros(1)).type_as(data)
    bias_sigma2 = Variable(torch.ones(1)).type_as(data)
    w_prior2, b_prior2 = Normal(mu2, sigma2), Normal(bias_mu2, bias_sigma2)    
    
    priors = {'hidden.weight': w_prior,
              'hidden.bias': b_prior,
              'predict.weight': w_prior2,
              'predict.bias': b_prior2}
    
    # lift module parameters to random variables sampled from the priors
    lifted_module = pyro.random_module("module", regression_model, priors)
    # sample a regressor (which also samples w and b)
    lifted_reg_model = lifted_module()

    with pyro.iarange("map", N, subsample=data):
        x_data = data[:, :-1]
        y_data = data[:, -1]
        # run the regressor forward conditioned on inputs
        prediction_mean = lifted_reg_model(x_data).squeeze()
        pyro.sample("obs",
                    Normal(prediction_mean, Variable(torch.ones(data.size(0))).type_as(data)),
                    obs=y_data.squeeze())

def guide(data):
    
    w_mu = Variable(torch.randn(second_layer, first_layer).type_as(data.data), requires_grad=True)
    w_log_sig = Variable(0.1 * torch.ones(second_layer, first_layer).type_as(data.data), requires_grad=True)
    b_mu = Variable(torch.randn(second_layer).type_as(data.data), requires_grad=True)
    b_log_sig = Variable(0.1 * torch.ones(second_layer).type_as(data.data), requires_grad=True)
    
    # register learnable params in the param store
    mw_param = pyro.param("guide_mean_weight", w_mu)
    sw_param = softplus(pyro.param("guide_log_sigma_weight", w_log_sig))
    mb_param = pyro.param("guide_mean_bias", b_mu)
    sb_param = softplus(pyro.param("guide_log_sigma_bias", b_log_sig))
    
    # gaussian guide distributions for w and b
    w_dist = Normal(mw_param, sw_param)
    b_dist = Normal(mb_param, sb_param)
    
    w_mu2 = Variable(torch.randn(1, second_layer).type_as(data.data), requires_grad=True)
    w_log_sig2 = Variable(0.1 * torch.randn(1, second_layer).type_as(data.data), requires_grad=True)
    b_mu2 = Variable(torch.randn(1).type_as(data.data), requires_grad=True)
    b_log_sig2 = Variable(0.1 * torch.ones(1).type_as(data.data), requires_grad=True)
    
    # register learnable params in the param store
    mw_param2 = pyro.param("guide_mean_weight2", w_mu2)
    sw_param2 = softplus(pyro.param("guide_log_sigma_weight2", w_log_sig2))
    mb_param2 = pyro.param("guide_mean_bias2", b_mu2)
    sb_param2 = softplus(pyro.param("guide_log_sigma_bias2", b_log_sig2))
    
    # gaussian guide distributions for w and b
    w_dist2 = Normal(mw_param2, sw_param2)
    b_dist2 = Normal(mb_param2, sb_param2)
      
    dists = {'hidden.weight': w_dist,
             'hidden.bias': b_dist,
             'predict.weight': w_dist2,
             'predict.bias': b_dist2}
    
    # overloading the parameters in the module with random samples from the guide distributions
    lifted_module = pyro.random_module("module", regression_model, dists)
    # sample a regressor
    return lifted_module()

This code implements a simple Bayesian neural network in Pyro by specifying a generative model (model) and a variational guide (guide). Conceptually, the model declares priors over the network weights and biases, samples a concrete neural network from those priors, runs that sampled network on the input data to produce a predictive mean, and then evaluates the likelihood of the observed targets given that predictive mean. The guide specifies a factorized Gaussian (mean-field) variational family over the same weights and biases, registers learnable variational parameters in Pyro’s parameter store, and returns a sampled network from those variational distributions. During SVI the optimizer updates the guide parameters so that the guide’s sampled networks approximate the posterior over weights implied by the model and the observed data.

Looking at model() in more detail: it constructs two sets of Normal priors — one for the hidden layer (weights shaped [second_layer, first_layer] and biases shaped [second_layer]) and one for the output/predict layer (weights shaped [1, second_layer] and bias shaped [1]). These priors are zero-mean, unit-variance by default (mu = 0, sigma = 1), which is a common weakly-informative choice that regularizes the weights toward zero. The code then bundles those priors into a dictionary keyed by the module parameter names used in regression_model (hidden.weight, hidden.bias, predict.weight, predict.bias) and calls pyro.random_module(“module”, regression_model, priors) to “lift” the deterministic PyTorch module into a stochastic module: random_module returns a factory that, when called, samples concrete parameter tensors from the specified priors and returns a version of regression_model that uses those sampled parameters. This is how the model encodes a distribution over functions rather than a single point estimate.

Once the sampled regressor is obtained (lifted_reg_model = lifted_module()), the code enters a plate/iarange context to express that data points are i.i.d. and to permit minibatching and correct log-probability scaling. The plate is declared with the dataset size N and uses subsample=data to indicate this call may receive a minibatch; this lets Pyro scale ELBO terms correctly when training with subsampled batches. Inside the plate, inputs and targets are separated from the batch tensor, the sampled network is executed to produce a predictive mean for each input, and finally the observation is conditioned via pyro.sample(“obs”, Normal(prediction_mean, 1), obs=y). Using a fixed observation noise of 1 corresponds to assuming homoscedastic Gaussian measurement noise; this simplifies the model but also fixes the likelihood variance (you could instead place a prior on this noise and learn it if needed). The obs registration is what ties the sampled weights to the data — it provides the likelihood term that, combined with the priors, defines the posterior over weights.

The guide() implements the variational approximation to that posterior. It creates learnable mean and (unconstrained) log-scale parameters for each weight and bias in both layers, registers them in Pyro’s parameter store via pyro.param, and transforms the unconstrained scale parameters through softplus to ensure a strictly positive standard deviation. Parametrizing the scale this way (optimize an unconstrained quantity and transform to positive with softplus) is standard because it lets the optimizer search over R while ensuring a valid Normal std. The guide then constructs Normal distributions for each parameter using the registered means and positive stds; because each parameter is represented by an independent Normal, this is a mean-field Gaussian posterior approximation over the network weights and biases.

To produce a sampled network from the guide, the same random_module mechanism is used but now with the guide distributions instead of priors. Returning lifted_module() yields a concrete PyTorch module whose parameters are samples from the variational distributions. During SVI, Pyro samples such guide networks and evaluates the log-density of those sampled parameters under both the guide and the prior (through the model’s pyro.sample statements), which yields the ELBO gradients that update the variational parameters registered by pyro.param.

A few practical notes tied to the implementation choices: the priors are isotropic Gaussians with unit variance, which is a simple regularizer but may be too weak or too strong depending on the problem scale; the observation noise is fixed to one, so if measurement noise is unknown you would want to put a prior on it and include it in the guide. The guide uses softplus on the “log_sigma” parameters rather than exponentiating them, which provides numerical stability and avoids very small stds early in training. The random_module abstraction keeps the model and guide working with the same regression_model code, so the probabilistic machinery is isolated from forward-pass implementation details — this makes it easy to swap in a different deterministic architecture while reusing the same Bayesian wrapping. Finally, note that the code uses Variables and type_as to match device/dtype; in a modern PyTorch codebase you would use tensors with requires_grad and ensure they are created on the correct device, but the probabilistic structure and the variational parameterization remain the same.
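
If the observation noise should be learned rather than fixed at 1, one option is to give it its own prior. A minimal sketch in current Pyro syntax (the names model_with_noise and net are illustrative, and the corresponding guide would need a positive-valued variational distribution for sigma, e.g. via pyro.infer.autoguide.AutoNormal):

import torch
import pyro
import pyro.distributions as dist

def model_with_noise(x_data, y_data, net):
    # prior over the observation noise scale instead of fixing it to 1
    sigma = pyro.sample("sigma", dist.HalfNormal(torch.tensor(1.0)))
    prediction_mean = net(x_data).squeeze(-1)
    with pyro.plate("data", x_data.shape[0]):
        pyro.sample("obs", dist.Normal(prediction_mean, sigma), obs=y_data)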

# instantiate optim and inference objects
optim = Adam({"lr": 0.001})
svi = SVI(model, guide, optim, loss="ELBO")

N = len(X_train)

for j in range(3000):
    epoch_loss = 0.0
    perm = torch.randperm(N)
    # shuffle data; `data` is assumed to be the training tensor with features and
    # target concatenated column-wise (e.g. torch.cat([X_train, Y_train.view(-1, 1)], dim=1)),
    # matching the data[:, :-1] / data[:, -1] split inside model()
    data = data[perm]
    # get indices of each batch
    all_batches = get_batch_indices(N, 64)
    for ix, batch_start in enumerate(all_batches[:-1]):
        batch_end = all_batches[ix + 1]
        batch_data = data[batch_start: batch_end]        
        epoch_loss += svi.step(batch_data)
    if j % 100 == 0:
        print(j, "avg loss {}".format(epoch_loss/float(N)))

You first create an Adam optimizer and an SVI object that ties together your model (the generative/Bayesian neural net with priors and likelihood), the guide (the variational family that approximates the posterior), the optimizer, and the ELBO loss. That combination means each call to svi.step will compute a stochastic estimate of the ELBO for the provided minibatch, backpropagate gradients through both model and guide, and apply an Adam parameter update to the variational parameters (and any learnable model parameters). Using ELBO as the loss forces us to directly optimize the variational lower bound, which is the standard objective when fitting Bayesian neural networks via variational inference.

N is set to the number of training examples so we can reason in per-example terms. The outer loop runs for 3000 full passes (epochs) over the data. At the start of each epoch we zero an epoch-level loss accumulator and create a random permutation of indices; applying that permutation to data shuffles the dataset so that successive minibatches are different each epoch. Shuffling is important for stochastic optimization because it breaks unwanted ordering or correlations in the dataset that would otherwise bias minibatch gradients and can slow or destabilize convergence.

You then compute batch boundaries with get_batch_indices(N, 64), which is expected to return an ordered list of start indices (e.g., [0, 64, 128, …, N]). Iterating over all_batches[:-1] and using the next boundary as batch_end is a safe way to extract contiguous slices without indexing errors; it also cleanly handles a final smaller batch when N is not an exact multiple of 64. For each minibatch you slice out batch_data and call svi.step(batch_data). That call does two things: it evaluates the ELBO (on that minibatch, typically yielding a scalar loss), and it runs the stochastic gradient update that moves the variational parameters (and any model parameters) in the direction that improves the ELBO. Accumulating epoch_loss across all batches gives you a total ELBO-based loss for the epoch.

Finally, printing epoch_loss divided by float(N) every 100 epochs reports an average per-data-point loss for monitoring. Dividing by N makes the metric easier to interpret across different dataset sizes and batch configurations, but be mindful: whether this average exactly equals the true per-example ELBO depends on how svi.step scales its returned loss (some implementations return a sum over the batch, others return a scaled estimate), so interpret it as an approximate per-example diagnostic. Overall, the loop implements minibatch stochastic variational inference: the minibatches give scalability and noisy but useful gradient estimates, Adam provides robust adaptive learning rates for the variational parameters, and optimizing the ELBO trains the guide to approximate the posterior over the Bayesian neural network’s weights.

preds = []
for i in range(100):
    sampled_reg_model = guide(X_test)
    pred = sampled_reg_model(X_test).data.numpy().flatten()
    preds.append(pred)

preds = np.array(preds)
mean = np.mean(preds, axis=0)
std = np.std(preds, axis=0) / 10
y_test = Y_test.data.numpy()
x = np.arange(len(y_test))

plt.figure()
plt.plot(original)
plt.plot(predicted, linestyle = '--')
plt.show()

plt.figure()
plt.plot(x, y_test)
plt.plot(x, mean, linestyle = '--')
plt.fill_between(x, mean-std, mean+std, alpha = 0.3, color = 'orange')
plt.show()

This block implements a simple posterior‑predictive sampling loop for a Bayesian neural network and then visualizes the point estimate and uncertainty. The high‑level goal is to approximate the posterior predictive distribution for each test input by drawing many sets of weights from the variational posterior (“guide”), using those weights to produce predictions, and then summarizing those predictions with a mean and an uncertainty band.

First, the loop collects 100 posterior predictive samples. On each iteration guide(X_test) is called to produce a sampled model — i.e., the guide draws a set of network parameters from the variational posterior and returns a callable model instance configured with those sampled parameters. Calling sampled_reg_model(X_test) runs a forward pass under that particular weight sample and yields predictions for all test points. Those predictions are converted to a flat NumPy array and appended to preds. Repeating this process many times yields an empirical sample from the posterior predictive distribution across inputs; this ensemble captures epistemic uncertainty coming from weight uncertainty in the Bayesian model.

After the loop preds is shaped into an array of size (n_samples, n_test_points). Taking np.mean along axis=0 produces the posterior predictive mean for each test point — this is the conventional point estimate you would report. np.std along axis=0 computes the sample standard deviation across predictive draws, which quantifies the spread of the posterior predictive distribution at each input. The code then divides that std by 10; because n_samples is 100, dividing by 10 = sqrt(100) yields the standard error of the predictive mean (std(sample)/sqrt(n_samples)). That is an important distinction: the raw std reflects the predictive uncertainty for a single draw (the uncertainty you would use for a credible interval of future observations), whereas std/sqrt(n) reflects uncertainty about the estimated mean itself. The code uses the latter for the plotted band, so the shaded area represents confidence around the mean estimate, not the full predictive interval.

Next, y_test is extracted to NumPy and x is set to an index range for plotting. The first figure simply overlays two series called original and predicted (these appear to be external variables), while the second figure is the key Bayesian visualization: it plots the true test targets, the posterior predictive mean (dashed), and a shaded region using plt.fill_between from mean — std to mean + std (where std here is the standard error). Visually, that shaded band communicates how certain we are about the mean prediction at each input; a wider band means the variational posterior produces more spread in predictions (or we would get a wider band if fewer samples had been used). The alpha and color parameters control the band appearance.

A couple of practical notes tied to intent: if you want to show a full predictive credible interval for future observations, use the raw sample std (or compute percentiles like 2.5% and 97.5%) rather than dividing by sqrt(n). If you want a tighter estimate of the mean you can increase n_samples; the standard error shrinks as 1/sqrt(n_samples). Also, to avoid autograd/device issues prefer sampled_reg_model(X_test).detach().cpu().numpy() instead of .data.numpy(). Finally, label axes and include a legend so the plotted mean and bands are clearly interpretable to consumers of the visualization.
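
As a sketch of that alternative, reusing the preds and y_test arrays constructed above, a 95% credible band from the raw posterior-predictive draws would look roughly like this:

import numpy as np
import matplotlib.pyplot as plt

# preds has shape (n_samples, n_test_points); take percentiles of the raw draws
lower = np.percentile(preds, 2.5, axis=0)
upper = np.percentile(preds, 97.5, axis=0)
pred_mean = preds.mean(axis=0)

x = np.arange(preds.shape[1])
plt.figure()
plt.plot(x, y_test, label="actual")
plt.plot(x, pred_mean, linestyle="--", label="posterior predictive mean")
plt.fill_between(x, lower, upper, alpha=0.3, color="orange", label="95% credible band")
plt.legend()
plt.show()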

for name in pyro.get_param_store().get_all_param_names():
    print(name, pyro.param(name).data.numpy())

This small loop is an inspection step against Pyro’s global parameter store. Pyro’s param store is the central registry that holds all named tensors you’ve registered with pyro.param during model or guide construction — in a Bayesian neural network those names usually correspond to variational parameters (for example layer weight means and log-scales) or any global model hyperparameters. The code first asks the store for every parameter name it currently knows about (get_all_param_names()), then for each name it fetches the registered tensor (pyro.param(name)), extracts the raw tensor data and converts it to a NumPy array for human-readable printing.

Mechanically, the data flow is: name list → lookup of the Parameter/Tensor by name → access to the underlying storage (.data) → conversion to a NumPy array (.numpy()) → print. The intent is to produce a snapshot of the current numerical values of all parameters so you can inspect shapes, magnitudes, signs, and whether any values have become NaN/Inf. In the context of a BNN this is how you peek at the current variational posterior parameters (e.g., the means and scales that define the approximate posterior over weights) to judge whether learning is progressing, whether variances are collapsing, or whether some parameters are exploding.

There are a few practical “why” points behind doing this: monitoring parameters helps detect common training pathologies early (diverging scales, stagnation at init values, collapsed variance), it verifies that parameters have been registered under the names you expect (useful when wiring complex guides), and it gives you concrete numbers to compare across runs or to log. Converting to NumPy makes the values easy to display and to feed into external diagnostic tools (matplotlib, tensorboard summaries, unit tests), which is why the conversion step is present.

A couple of cautionary implementation notes: using .data bypasses PyTorch’s autograd machinery and can have unintended side effects; prefer .detach() to safely break the tensor from the computation graph. Also, if parameters may live on GPU, you should move them to CPU before converting to NumPy (e.g., .detach().cpu().numpy()) — otherwise .numpy() will raise. Finally, depending on whether you registered a constraint/transform for a parameter, pyro.param(name) will usually present the constrained (transformed) value that the model uses; if you need the unconstrained internal representation you must explicitly access it through the param store APIs.

In short, this loop is a lightweight diagnostic that walks the parameter registry, fetches each parameter’s current numeric values and prints them so you can monitor and debug the variational parameters that define your Bayesian neural network’s posterior approximation. Use it for quick checks, but prefer safer detach/cpu conversion and limit frequency in production logging.
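
A safer variant of the same diagnostic, written for Python 3 with detach/cpu extraction and summary statistics instead of full tensors (a sketch):

import pyro

store = pyro.get_param_store()
for name in store.get_all_param_names():
    value = pyro.param(name).detach().cpu().numpy()  # safe, device-agnostic extraction
    print(name, value.shape, value.mean(), value.std())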

plt.figure()
plt.plot(out_weights)
plt.plot(pyro.param('guide_mean_weight2').data.numpy()[0])
plt.show()

This block is producing a quick visual check that compares the empirical/ sampled behavior of one weight with the variational guide’s learned mean for that same weight. It starts by creating a fresh figure so the comparison isn’t drawn on top of some previous plot. The first plotted object, out_weights, is expected to be the sequence of weight values you obtained from your Bayesian procedure — for example a trace of samples from an MCMC run or a sequence of samples obtained from the variational posterior during/after training. Plotting that sequence shows the distribution over time or sample index: you can see its spread, any drift or burn‑in behavior, and possible multimodality.

The second plot overlays pyro.param(‘guide_mean_weight2’).data.numpy()[0], the guide’s registered variational means for the output (predict) layer. Because that parameter is stored with shape (1, second_layer), indexing [0] extracts the row of learned means, so the overlay shows where the guide places the center of the approximate posterior relative to the sampled values. Conceptually this answers the question “where does the guide put the mass of the posterior for these weights?”, a quick diagnostic of whether the guide’s means are aligned with the samples (and therefore whether the mean-field/parametric approximation captured the posterior location).

Calling plt.show() renders the figure so you can inspect it. Interpreting the result: if the guide-mean overlay sits near the center of the cloud of out_weights, the guide’s mean is consistent with the empirical samples; if it’s off-center or outside the main mass, that indicates bias in the approximation (or problems with sampling/convergence). A tight cloud around the mean suggests low posterior uncertainty, while wide spread or multiple clusters suggests high uncertainty or multimodality that a simple mean cannot capture. Two practical notes: (1) retrieving pyro.param with .data is a direct tensor access pattern — in modern PyTorch/Pyro prefer .detach().cpu().numpy() or .item() for scalars to avoid inplace/grad issues; (2) this overlay is a lightweight diagnostic — for a fuller view consider trace plots, histograms/KDEs, or credible-interval bands rather than the means alone, since those show uncertainty and shape of the posterior which are central to evaluating a Bayesian neural network.

plt.figure()
plt.plot(dense_weights.T[14])
plt.plot(pyro.param('guide_mean_weight').data.numpy()[10])
plt.show()

This small block is a visualization / diagnostic step that compares a sampled posterior trace for a single network weight to the corresponding variational mean learned by the guide. The first plot call is drawing one of the sampled-weight traces: dense_weights is being treated as a matrix of weight draws (either shape [num_samples, num_weights] or [num_weights, num_samples]), and by transposing and selecting index 14 you extract the sequence of sampled values for a particular scalar weight across draws. That sequence is plotted to show the empirical spread and structure of the posterior for that weight (e.g., variance, drift, or multimodality).

The second plot overlays the guide’s mean parameter for (presumably) the same or another weight: pyro.param(‘guide_mean_weight’) returns the variational-mean tensor that the guide is learning (shape [second_layer, first_layer]), .data.numpy() converts it to a NumPy array, and indexing [10] selects one row of that matrix, i.e., the mean vector for the weights feeding hidden unit 10, rather than a single scalar. Plotting those values alongside the sampled trace is intended to show where the guide’s means sit relative to the empirical samples — a quick way to check whether the learned means are representative of the posterior samples or whether there’s a mismatch indicating poor approximation or a bug.

Because this is a manual, visual check, two practical caveats matter. First, ensure you’re actually comparing the same weight index in both arrays — the code uses index 14 for dense_weights and index 10 for the guide mean, which will not be informative unless those positions correspond to the same scalar parameter; verify shapes (and whether the guide stores weights flattened or structured) to avoid off-by-one or ordering errors. Second, prefer extracting parameters with .detach().cpu().numpy() rather than .data to avoid silently bypassing autograd semantics. Also confirm dense_weights and the guide mean have compatible lengths for plotting a trace vs. a line.

In the context of a Bayesian neural network, this plot is a simple but powerful diagnostic: it helps you validate that the variational posterior (as sampled) and the guide’s summary statistic (the mean) are consistent, reveals whether uncertainty is being learned, and can expose problems like mode collapse, excessive variance, or indexing/parameterization mistakes that would undermine posterior estimates.

print(out_weights.mean(), out_weights.std())
# note: the second expression below is repeated verbatim, so the same mean is printed twice (see discussion below)
print(pyro.param('guide_mean_weight2').data.numpy()[0].mean(), pyro.param('guide_mean_weight2').data.numpy()[0].mean())

The first print is taking the empirical mean and standard deviation of out_weights, which I read as a tensor of posterior samples for a particular weight (or a collection of weights) produced by the model/guide. In the context of a Bayesian neural network, out_weights typically comes from repeatedly sampling the network weights under the variational posterior (or from a Monte Carlo posterior sample). Reporting out_weights.mean() and out_weights.std() is a quick, runtime diagnostic: the mean summarizes the central (point) estimate the posterior places on that parameter, and the standard deviation quantifies the inferred uncertainty. You do this to monitor whether the variational posterior is collapsing (std ≈ 0) or remaining appropriately uncertain, and to track whether the learned mean is sensible given the data and prior.

The second print pulls a stored variational parameter from Pyro’s parameter store — pyro.param(‘guide_mean_weight2’) — converts it to a NumPy array and then takes the mean of its first slice (.data.numpy()[0].mean()). That parameter name implies it’s the guide’s variational mean for the second layer’s weights, so this expression is trying to report the guide’s internal mean estimate for comparison with the empirical mean from the posterior samples. Comparing the guide parameter to the sample moments helps you verify that the guide’s parameterization (the variational mean) matches the actual samples being drawn or that the optimization is moving the variational parameters in the expected direction.

Two practical issues to note. First, the code uses .data.numpy() which bypasses PyTorch’s autograd machinery and can have unexpected side effects; prefer .detach().cpu().numpy() to safely extract tensor values for inspection without interfering with gradients or failing when tensors live on the GPU. Second, the second print repeats the same expression twice, so it prints the same mean value twice — this is almost certainly a copy/paste bug where the author intended to print mean and std (or compare the guide mean to the sample mean). If your goal is to compare moments, print the guide mean and its corresponding guide scale (or compute the std of out_weights) side by side.

In short: the first line reports posterior-sample moments to monitor learned uncertainty and central tendency; the second line inspects the guide’s stored variational mean for the same parameter so you can compare the guide’s internal estimate to the empirical samples. For robust diagnostics in a BNN training loop, extract values with .detach().cpu().numpy(), print both mean and scale (or std) for both the samples and the guide parameters, and ensure you are indexing the correct tensor dimensions instead of duplicating the same statistic.
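
For reference, a sketch of what the comparison was presumably meant to print, assuming out_weights and the guide parameters registered above (guide_mean_weight2, guide_log_sigma_weight2) are available in the session:

# empirical moments of the sampled output-layer weights
print(out_weights.mean(), out_weights.std())

# guide's learned mean and softplus-transformed scale for the output layer
guide_mean2 = pyro.param("guide_mean_weight2").detach().cpu().numpy()[0]
guide_scale2 = softplus(pyro.param("guide_log_sigma_weight2")).detach().cpu().numpy()[0]
print(guide_mean2.mean(), guide_scale2.mean())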


Neural Network Algo Trading

import numpy as np
from sklearn import preprocessing
from sklearn.metrics import mean_squared_error, classification_report
import matplotlib.pylab as plt
import datetime as dt
import time

from keras.models import Sequential, Graph
from keras.layers.core import Dense, Dropout, Activation, Flatten
from keras.layers.recurrent import LSTM, GRU
from keras.layers import Convolution1D, MaxPooling1D

%matplotlib inline  

This import block sets up the pieces you’ll use to turn raw timestamped values into a simple, trainable time-series forecasting pipeline and to inspect the results. Numpy provides the core array and numerical operations you’ll use for reshaping and manipulating sequences; sklearn.preprocessing is there for scaling (e.g., MinMaxScaler or StandardScaler) because we normalize input sequences to keep the model training stable and to avoid exploding/vanishing gradients in recurrent layers. sklearn.metrics.mean_squared_error gives a standard regression metric for forecasting evaluation; classification_report is also imported but is only relevant if you convert the forecasting problem into a classification task (for example, predicting up/down movement or thresholded classes) — otherwise you can ignore it. Matplotlib and the notebook magic ensure predictions and diagnostic plots render inline so you can visually compare forecasts against ground truth. datetime and time utilities are used to parse, align, and possibly generate the time indices needed to create sliding windows or resample the series.

On the modeling side, Keras Sequential is the straightforward API for stacking layers in a linear flow; Graph is an older Keras construct for non-sequential topologies and is generally superseded by the Functional API (Model) in current Keras — so prefer Sequential or the Functional API for simple pipelines. The core building blocks you’ve imported reflect a typical small forecasting network: Convolution1D and MaxPooling1D can act as learned, local feature extractors across the temporal dimension (they capture short-term patterns and reduce sequence length), LSTM and GRU are recurrent layers that capture longer-term dependencies and temporal context, and Dense/Flatten/Activation/Dropout are used to turn the recurrent output into the desired forecast(s) while controlling overfitting. In practice a common, effective flow is: raw series → parse/align timestamps → form supervised examples with fixed lookback windows → scale features → split into train/validation/test → reshape into (samples, timesteps, features) → optionally apply Conv1D (+ pooling) to extract local motifs → pass through one or more recurrent layers (use return_sequences=True when stacking) → Flatten or take the final timestep followed by Dense layers to predict one-step or multi-step outputs → train with an appropriate regression loss and monitor mean squared error.

A few practical why/how notes: normalize before feeding the network and inverse-transform predictions for interpretable metrics; use Dropout on dense/recurrent parts to reduce overfitting on small series; choose LSTM vs GRU based on complexity vs speed (GRU is simpler/faster, LSTM sometimes models longer dependencies better); make sure to set return_sequences correctly when stacking recurrent layers; and prefer the current Keras Functional API over Graph. Finally, after training compute MSE (and classification_report only if you transformed targets to classes) and plot predicted vs actual values to validate temporal alignment and error patterns — that visual check is often as informative as numeric metrics when developing simple time-series forecasts.

def load_snp_returns():
    # open in text mode so line.split(',') works under Python 3
    f = open('table.csv', 'r').readlines()[1:]
    raw_data = []
    raw_dates = []
    for line in f:
        try:
            open_price = float(line.split(',')[1])
            close_price = float(line.split(',')[4])
            raw_data.append(close_price - open_price)
            raw_dates.append(line.split(',')[0])
        except:
            continue

    return raw_data[::-1], raw_dates[::-1]


def load_snp_close():
    # open in text mode so line.split(',') works under Python 3
    f = open('table.csv', 'r').readlines()[1:]
    raw_data = []
    raw_dates = []
    for line in f:
        try:
            close_price = float(line.split(',')[4])
            raw_data.append(close_price)
            raw_dates.append(line.split(',')[0])
        except:
            continue

    return raw_data, raw_dates


def split_into_chunks(data, train, predict, step, binary=True, scale=True):
    X, Y = [], []
    for i in range(0, len(data), step):
        try:
            x_i = data[i:i+train]
            y_i = data[i+train+predict]
            
            # Use it only for daily return time series
            if binary:
                if y_i > 0.:
                    y_i = [1., 0.]
                else:
                    y_i = [0., 1.]

                if scale: x_i = preprocessing.scale(x_i)
                
            else:
                timeseries = np.array(data[i:i+train+predict])
                if scale: timeseries = preprocessing.scale(timeseries)
                x_i = timeseries[:-1]
                y_i = timeseries[-1]
            
        except:
            break

        X.append(x_i)
        Y.append(y_i)

    return X, Y


def shuffle_in_unison(a, b):
    # courtsey http://stackoverflow.com/users/190280/josh-bleecher-snyder
    assert len(a) == len(b)
    shuffled_a = np.empty(a.shape, dtype=a.dtype)
    shuffled_b = np.empty(b.shape, dtype=b.dtype)
    permutation = np.random.permutation(len(a))
    for old_index, new_index in enumerate(permutation):
        shuffled_a[new_index] = a[old_index]
        shuffled_b[new_index] = b[old_index]
    return shuffled_a, shuffled_b


def create_Xt_Yt(X, y, percentage=0.8):
    X_train = X[0:len(X) * percentage]
    Y_train = y[0:len(y) * percentage]
    
    X_train, Y_train = shuffle_in_unison(X_train, Y_train)

    X_test = X[len(X) * percentage:]
    Y_test = y[len(X) * percentage:]

    return X_train, X_test, Y_train, Y_test

This module is a small data-preparation pipeline for a simple time-series forecasting workflow. The overall goal is to transform raw CSV rows (S&P data) into model-ready input windows X and targets Y, optionally preparing classification labels (directional returns) or regression targets (future price). I’ll walk through the logic and explain why each decision is made, point out subtle behaviors, and call out a few places to fix or improve.

load_snp_returns

- Purpose and data flow: This function reads a CSV (skipping the header) and builds two parallel lists: raw_data containing daily returns computed as (close — open), and raw_dates containing the corresponding dates. It skips rows that fail parsing.

- Why: Many forecasting tasks prefer working with returns (price differences) rather than raw prices because returns are often more stationary and appropriate for directional classification tasks.

- Important details and rationale: After collecting values it returns raw_data[::-1] and raw_dates[::-1], i.e., reversed lists. This reversal is deliberate: CSV exports frequently have newest rows first, but time-series models expect oldest-to-newest order. Reversing ensures chronological ordering for downstream windowing and training.

- Potential problem: The function uses a blanket try/except around parsing, which hides parsing errors. Also it extracts columns by index (open at index 1, close at index 4) which is fine but brittle if the CSV schema changes.

load_snp_close

- Purpose and data flow: Very similar to load_snp_returns, but it only collects close prices (no differencing) and their dates. This is used when you want to forecast absolute price values (regression) instead of returns (classification).

- Important inconsistency: Unlike load_snp_returns, this function returns raw_data and raw_dates without reversing them. That means if your CSV is newest-first, the returned series will be reversed relative to the returns loader. That inconsistency is likely a bug — for time-series work you want a consistent chronological ordering across loaders.

- Same caution about broad try/except and brittle column indexing applies.

split_into_chunks

- Purpose and data flow: This is the windowing function: it walks through the time series in steps of size step and extracts an input window x_i of length train and a target y_i located predict steps after the end of that window. The outputs X and Y are lists of input windows and targets suitable for model training.

- How the indexing works: For a given start index i it sets x_i = data[i:i+train] and y_i = data[i+train+predict]. So if i=0 and train=10, predict=1, it selects input indices 0..9 and target at index 11 (note: this is an off-by-one relative to the common convention where predict=1 would target index 10). Verify your definition of predict; you probably want y at i + train + predict - 1 if predict counts steps ahead starting at the immediate next element.

- Two modes: binary=True and binary=False

- binary=True (classification, used for daily returns): The target y_i is converted into a one-hot directional label: [1., 0.] for positive return and [0., 1.] otherwise. This makes the task a two-class classification problem (up vs non-up). Optionally, x_i is scaled (preprocessing.scale) to zero-mean/unit-variance — this normalization stabilizes training and puts features on comparable scales for many classifiers.

- binary=False (regression, used for prices): The function builds a contiguous timeseries slice of length train+predict, scales the whole slice if requested, then sets x_i to everything except the last point and y_i to the last point. Scaling the whole slice together preserves the relationship between the input window and target (important for regression), whereas scaling only the input (as done in the binary branch) is acceptable when targets are converted to classes.

- Iteration and error handling: The for loop advances by step rather than sliding by 1; this down-samples the number of training windows and can be used to control overlap. The code uses try/except to break when indices exceed bounds rather than checking ranges explicitly.

- Things to watch and improve:

- The difference in scaling strategy between binary and regression modes is intentional but should be documented in code comments. For regression you must scale input and target consistently; for classification you may scale only inputs.

- The off-by-one in the y index should be verified against your desired forecast horizon.

- Using a broad except hides real bugs (e.g., bad indices or data types).

- Input x_i is left as Python lists in the binary branch unless you convert to numpy before modeling — ensure consistent array types downstream.

shuffle_in_unison

- Purpose and behavior: When you have parallel arrays (features and labels) you must permute them together so that correspondence is preserved. This function asserts equal length and then builds new arrays shuffled according to a random permutation.

- Implementation notes and why: It creates empty numpy arrays and assigns values according to a permutation vector; this avoids changing the original arrays and ensures X and Y remain aligned after shuffling. Preserving alignment is critical before training because a random shuffle must not break the mapping between each input and its label.

- Performance and alternatives: The explicit for-loop assignment works but is less concise and potentially slower than permutation indexing (e.g., a_shuffled = a[permutation]; b_shuffled = b[permutation]). Also the function requires numpy arrays with .shape — if you pass lists it will fail. Make sure you convert your X/Y to numpy arrays first.

create_Xt_Yt

- Purpose and data flow: This function attempts to split X and y into train and test sets according to percentage, then shuffles the training set in unison.

- Issues to address:

- Index calculation bug: It uses len(X) * percentage directly for slicing. In Python 3 this produces a float and will raise a TypeError when used as a slice index. Convert to an integer explicitly (e.g., split_idx = int(len(X) * percentage)) or use integer division appropriately.

- Time-series semantics: For time-series forecasting you generally must split chronologically (earlier data for training, later for testing). This function slices temporally (good) but then calls shuffle_in_unison on X_train/Y_train. Shuffling destroys temporal order inside the training set — that may be acceptable if your model does not exploit sequence order (e.g., a feed-forward classifier on handcrafted features), but it defeats recurrent or any autoregressive models and invalidates assumptions about the training set representing earlier time. Consider not shuffling, or use walk-forward validation / time-series cross-validation instead.

- Return ordering: The function returns (X_train, X_test, Y_train, Y_test). That’s fine but non-standard (most APIs return X_train, Y_train, X_test, Y_test). Be careful to document and use consistently.

- Data types: If X and y are Python lists, slicing returns lists; shuffle_in_unison expects numpy arrays. Convert to numpy arrays before calling shuffle_in_unison.

Summary of recommended fixes and clarifications

- Make ordering consistent: decide whether your loaders return chronological (oldest->newest) sequences and apply that rule uniformly (likely reverse both loaders).

- Fix the off-by-one target indexing in split_into_chunks or clearly document how predict is defined.

- Replace broad try/except blocks with explicit checks or at least log exceptions so data problems don’t silently pass.

- Ensure slices use integer indices (cast split index in create_Xt_Yt).

- Be deliberate about shuffling for time-series: usually keep chronological order for splits and use time-aware CV when appropriate; only shuffle if your modeling assumptions require i.i.d. samples.

- Normalize types: convert lists -> numpy arrays early and consistently so shuffle_in_unison and scaling behave predictably.

- Consider simpler and faster shuffling using numpy.take or direct permutation indexing to avoid manual loops.

Taken together, these functions form the data-prep story: read raw CSV rows, transform to either returns (for directional classification) or prices (for regression), tile the series into supervised windows with a controllable horizon and stride, optionally normalize inputs, and split into train/test while preserving label alignment. Addressing the points above will make the pipeline robust and consistent for simple time-series forecasting experiments.
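
A compact sketch of those recommendations applied to the split helper (the name create_Xt_Yt_fixed is illustrative; shuffling stays optional and off by default to preserve chronological order):

import numpy as np

def create_Xt_Yt_fixed(X, y, percentage=0.8, shuffle_train=False):
    X, y = np.asarray(X), np.asarray(y)        # normalize types early
    split_idx = int(len(X) * percentage)       # integer slice index
    X_train, Y_train = X[:split_idx], y[:split_idx]
    X_test, Y_test = X[split_idx:], y[split_idx:]
    if shuffle_train:                          # only when samples can be treated as i.i.d.
        perm = np.random.permutation(len(X_train))
        X_train, Y_train = X_train[perm], Y_train[perm]
    return X_train, X_test, Y_train, Y_test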

timeseries, dates = load_snp_close()
dates = [dt.datetime.strptime(d, '%Y-%m-%d').date() for d in dates]
plt.plot(dates, timeseries)

The first line calls load_snp_close() and assigns two things: a sequence of closing prices (timeseries) and a parallel sequence of date stamps (dates). Conceptually this is the raw input for the forecasting workflow — a univariate observation stream paired with the times at which those observations were recorded. At this stage we want to visually inspect the data to understand trend, seasonality, outliers, gaps and any obvious structural breaks before committing to a particular forecasting approach.

The second line converts the date strings into datetime.date objects by parsing each string with the specified format and then dropping the time component. The explicit parsing ensures we have true date/time objects rather than opaque strings, which matters for plotting, for frequency inference and for downstream time-aware operations. Using .date() (instead of keeping full datetime) deliberately normalizes the timestamps to day resolution and avoids introducing unnecessary time-of-day granularity or timezone complexity; this is appropriate when the underlying data are daily closes and you want a clean, calendar-day index.

The final line plots the numeric series against those parsed dates. Plotting mapped against actual date objects lets matplotlib produce a readable time axis (ticks, spacing proportional to calendar intervals) so you can immediately see long-term trend, volatility regimes, seasonality (weekly/monthly/yearly patterns), and any missing periods or abnormal spikes. This visual check informs choices that follow in the forecasting pipeline: whether to difference for stationarity, whether to model seasonality explicitly, whether to impute gaps or resample to a regular frequency, and how to split train/test (e.g., contiguous time-based split).

A few practical caveats and next steps: ensure the date list is sorted and corresponds element-wise to the timeseries (unsorted or misaligned timestamps will corrupt modeling), check for missing dates or irregular spacing (which may require resampling or explicit handling), and consider converting the pair into a pandas Series with a DatetimeIndex for easier resampling, rolling statistics, and integration with most forecasting libraries. This plotting step is deliberately exploratory — it doesn’t prepare model inputs yet, but it provides the visual evidence needed to choose appropriate preprocessing and model structure for simple time series forecasting.
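
For example, a minimal sketch of that conversion, assuming the timeseries and dates produced by the loader above:

import pandas as pd

# a date-indexed Series makes sorting, resampling and rolling statistics easy
s = pd.Series(timeseries, index=pd.to_datetime(dates)).sort_index()

print(s.index.is_monotonic_increasing)    # confirm chronological order
print(s.asfreq("B").isna().sum())         # count missing business days, if any
monthly = s.resample("M").last()          # example: month-end closes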

TRAIN_SIZE = 20
TARGET_TIME = 1
LAG_SIZE = 1
EMB_SIZE = 1

X, Y = split_into_chunks(timeseries, TRAIN_SIZE, TARGET_TIME, LAG_SIZE, binary=False, scale=False)
X, Y = np.array(X), np.array(Y)
X_train, X_test, Y_train, Y_test = create_Xt_Yt(X, Y, percentage=0.9)

Xp, Yp = split_into_chunks(timeseries, TRAIN_SIZE, TARGET_TIME, LAG_SIZE, binary=False, scale=False)
Xp, Yp = np.array(Xp), np.array(Yp)
X_trainp, X_testp, Y_trainp, Y_testp = create_Xt_Yt(Xp, Yp, percentage=0.9)

This block is preparing sliding-window training examples for a very simple one-step-ahead time-series forecast. At the top you set the key framing parameters: TRAIN_SIZE = 20 means each input example will contain 20 consecutive past timesteps; TARGET_TIME = 1 means the label is the value one step after the end of that window (a one-step forecast); LAG_SIZE = 1 is the window stride/lag between consecutive chunks (so windows will typically overlap by TRAIN_SIZE - 1 points). EMB_SIZE is present but unused here — in other contexts it would control embedding dimensionality for categorical or time-feature embeddings, but it plays no role in the following calls.

split_into_chunks(timeseries, TRAIN_SIZE, TARGET_TIME, LAG_SIZE, binary=False, scale=False) is doing the heavy lifting: it walks the raw timeseries and emits paired examples (X, Y) where each X is a length-TRAIN_SIZE sequence of past values and each Y is the target value TARGET_TIME steps after the window. Passing binary=False indicates we want regression targets (continuous values) rather than binary labels, and scale=False explicitly avoids any normalization or scaling inside that routine — the code keeps raw values, leaving scaling to a later step (or intentionally not scaling if the model expects raw units). Converting the lists to numpy arrays ensures downstream compatibility with numeric ML libraries and with vectorized batching.

create_Xt_Yt(X, Y, percentage=0.9) then splits those arrays into training and test subsets using 90% of the examples for training and 10% for testing. That produces X_train, X_test, Y_train, Y_test ready to feed a simple forecasting model (for example a small neural net, linear model, or an autoregressive baseline) that learns to map a 20-step input window to the single next value.

The second block repeats the same chunking + split into Xp, Yp and X_trainp, X_testp, etc. As written it produces identical data to the first set; common reasons to do this are to keep a pristine copy for a different preprocessing branch (e.g., one copy to be scaled/embedded and another preserved in raw form), to run parallel experiments, or simply because the code was duplicated during iteration. If the duplication is intentional, add a clarifying comment or rename the variables to reflect their purpose; if it’s unintentional, remove the redundant call. One additional practical note: leaving scale=False means your model will train on raw magnitudes — for many models and loss functions it’s better to normalize or standardize inputs and/or targets to improve convergence and numerical stability.

#X_train = np.reshape(X_train, (X_train.shape[0], X_train.shape[1], EMB_SIZE))
#X_test = np.reshape(X_test, (X_test.shape[0], X_test.shape[1], EMB_SIZE))

model = Sequential()
model.add(Dense(500, input_shape = (TRAIN_SIZE, )))
model.add(Activation('relu'))
model.add(Dropout(0.25))
model.add(Dense(250))
model.add(Activation('relu'))
model.add(Dense(1))
model.add(Activation('linear'))
model.compile(optimizer='adam', 
              loss='mse')

model.fit(X_train, 
          Y_train, 
          nb_epoch=5, 
          batch_size = 128, 
          verbose=1, 
          validation_split=0.1)
score = model.evaluate(X_test, Y_test, batch_size=128)
print(score)

This block builds and trains a straightforward feed‑forward neural network to map a fixed-length vector of past time steps to a single future value — a simple approach to time series forecasting that treats each training example as a flattened window of prior observations rather than an explicit sequence.

At the input stage the model expects arrays shaped as (n_samples, TRAIN_SIZE) — each sample is a TRAIN_SIZE‑dimensional vector containing lagged features (e.g., the last TRAIN_SIZE observations). The commented reshape lines hint that an alternate design using a 3‑D sequence/embedding shape (n_samples, timesteps, EMB_SIZE) was considered, but they are disabled because this implementation uses Dense layers that require 2‑D inputs. In short: we convert each sliding window into a feature vector and feed that into the fully connected network.

The architecture itself is a classic small MLP for regression: a first hidden Dense layer with 500 units and ReLU activation, then dropout(0.25) to regularize, a second Dense layer with 250 units and ReLU, and a final Dense(1) with a linear activation to produce the continuous forecast. ReLU is used to introduce nonlinearity while keeping gradients healthy; the relatively large first layer gives capacity to learn complex mappings from the lagged inputs to the target, and the smaller second layer pushes the network toward a more compact representation before producing the single scalar output. Dropout is applied after the first activation to reduce overfitting by randomly zeroing units during training, which is especially helpful when windows are small or the dataset is limited.

The model is compiled with the Adam optimizer and mean squared error (MSE) loss. Adam is chosen because it adapts learning rates per parameter and generally converges quickly without much manual tuning; MSE is the natural loss for point forecasting because it penalizes large errors more strongly, aligning the optimization objective with minimizing squared forecast error. Training uses batch_size=128 and runs for 5 epochs with validation_split=0.1, which reserves 10% of the provided training data for monitoring generalization during training. Note that for time series forecasting a random validation split can leak future information into training if the data are not shuffled appropriately; a contiguous holdout (or time‑aware cross‑validation) is usually a better choice to assess true out‑of‑sample performance.

Finally, the code evaluates the trained model on X_test / Y_test with the same batch size and prints the returned score, which is the MSE on the test set. Practical caveats: five epochs may be insufficient to converge; inputs should typically be normalized/scaled before training to improve optimization stability; and using a Dense MLP discards explicit temporal dynamics — it works when meaningful forecasting features are captured by the lagged vector, but for longer-range dependencies or when temporal order matters more, sequence models (RNN/LSTM/Conv1D) or engineered features may yield better results.
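
A sketch of that contiguous, time-aware holdout, which assumes the training windows are kept in chronological order (i.e., without the shuffle applied inside create_Xt_Yt); Keras accepts the block directly via validation_data:

# hold out the most recent 10% of the training windows as a contiguous validation block
val_size = int(len(X_train) * 0.1)
X_tr, Y_tr = X_train[:-val_size], Y_train[:-val_size]
X_val, Y_val = X_train[-val_size:], Y_train[-val_size:]

model.fit(X_tr, Y_tr,
          nb_epoch=5,
          batch_size=128,
          verbose=1,
          validation_data=(X_val, Y_val))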

params = []
for xt in X_testp:
    xt = np.array(xt)
    mean_ = xt.mean()
    scale_ = xt.std()
    params.append([mean_, scale_])

predicted = model.predict(X_test)
new_predicted = []

for pred, par in zip(predicted, params):
    a = pred*par[1]
    a += par[0]
    new_predicted.append(a)

mse = mean_squared_error(predicted, new_predicted)
print(mse)

try:
    # width/height are assumed to be defined earlier in the notebook (e.g., 15 and 5)
    fig = plt.figure(figsize=(width, height))
    plt.plot(Y_test[:150], color='black')        # black - test targets
    plt.plot(predicted[:150], color='blue')      # blue - model prediction (normalized scale)
    #plt.plot(Y_testp[:150], color='green')      # green - original-scale targets
    #plt.plot(new_predicted[:150], color='red')  # red - restored prediction
    plt.show()
except Exception as e:
    print(str(e))

This block is performing a per-series de-normalization of model outputs so the predictions can be interpreted on the original time-series scale, and then it attempts to visualize and score those results. The overall goal is simple time-series forecasting: the model was trained on normalized series and now we want to convert the model’s normalized outputs back to the original magnitudes before computing error metrics and plotting.

First, the code iterates over X_testp to collect normalization parameters for each test example. For every test sequence xt it computes the mean and standard deviation and appends [mean, std] to params. The implication is that X_testp contains the same sequences the model saw when normalization was applied (or at least it contains the original-scale sequences corresponding to X_test). Storing mean and std per-example indicates a per-series normalization strategy was used during preprocessing (as opposed to a single global mean/std). This per-series approach preserves local scale and level information and is useful when each time series has different baseline and volatility.

Next, the model produces predictions on X_test (presumably the normalized inputs) via model.predict(X_test). The following loop restores each prediction to the original scale using the paired normalization parameters: for each predicted array and its corresponding [mean, std] pair, the code multiplies the normalized prediction by the stored std and then adds the stored mean. That is the inverse of the usual (x - mean) / std transform, so the logic correctly maps model outputs back to original units. Note that this relies on the prediction array and the stored scalars being shape-compatible; standard numpy broadcasting applies when multiplying a 1-D prediction vector by a scalar std. Also be mindful that if any stored std equals zero (a constant series), the division used in the original normalization would have been problematic and you should handle or avoid zero stds to prevent degenerate scaling.

After restoration the code computes an MSE but compares the original model outputs (still in normalized space) to the de-normalized outputs. That produces a meaningless loss value because the two arrays are on different scales. The intended comparison is almost certainly between the de-normalized predictions and the true target series on the original scale (for example Y_test or Y_testp in original units). So the correct metric call should compare new_predicted to the corresponding unnormalized ground truth. Confirm which ground-truth array holds original-scale targets and use that for any evaluation metrics.

Finally, the plotting block shows the true and predicted series but, as written, plots Y_test (black) and predicted (blue). If predicted is still the normalized output, the plot will not align with the original-scale Y_test. The commented lines hint at alternatives: plotting new_predicted (restored predictions) against the actuals would be the accurate visualization. The try/except wrapper simply prevents crashes during plotting and prints the error string if one occurs; in a production setting you may prefer to log the exception with context. Also consider adding axis labels, a legend, and handling series lengths (the code plots only the first 150 points) so plots compare equal-length arrays.

Recommendations to make this robust: ensure X_testp and X_test are aligned one-to-one so each prediction gets the correct mean/std; use a vectorized inverse-transform rather than Python loops for speed; guard against std==0 when collecting parameters; compute errors between de-normalized predictions and de-normalized targets; and include clear plot labels/legends so it’s obvious which line is which. These changes will ensure the restored predictions are correctly evaluated and visualized in the context of simple time-series forecasting.
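
A sketch of those recommendations, assuming X_testp holds the original-scale windows aligned one-to-one with X_test and Y_testp holds the original-scale targets:

import numpy as np
from sklearn.metrics import mean_squared_error

Xp = np.asarray(X_testp, dtype=float)
means = Xp.mean(axis=1)
stds = Xp.std(axis=1)
stds[stds == 0] = 1.0                          # guard against constant windows

pred_norm = model.predict(X_test).flatten()
pred_restored = pred_norm * stds + means       # vectorized inverse transform

# compare restored predictions against original-scale targets
print(mean_squared_error(np.asarray(Y_testp, dtype=float), pred_restored))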

import datetime as dt

timeseries, dates = load_snp_returns()
dates = [dt.datetime.strptime(d, '%Y-%m-%d').date() for d in dates]
plt.plot(dates, timeseries)

This small block is the initial data-ingest and quick-visualization step for a simple time-series forecasting workflow. First, load_snp_returns() hands back two parallel sequences: the numeric return series and the corresponding date strings. Converting the date strings into datetime.date objects (via strptime and .date()) is a deliberate preprocessing step so the time coordinate is a true date type rather than plain text. Using actual date objects lets plotting libraries (and later time-aware libraries) treat the x-axis as a temporal axis, apply sensible tick formatting, and perform time-based operations (sorting, slicing, resampling) reliably instead of treating dates as arbitrary labels.

The conversion to .date() specifically strips off any time-of-day information and yields daily granularity, which is appropriate for S&P daily returns and reduces complexity from timezones or intra-day timestamps. This choice also signals an assumption about the data frequency (daily); if you were working with intraday data you would want to keep datetime objects with time components. It’s important that the two sequences remain aligned and sorted by date before plotting — otherwise the visual line will connect points in input order and could misrepresent trends or gaps.

The final plt.plot(dates, timeseries) is a lightweight diagnostic: it gives a rapid visual check for trends, seasonality, abrupt regime changes, missing data, outliers, or heteroskedastic volatility — all of which directly inform preprocessing and model choice (e.g., differencing for nonstationarity, volatility models for changing variance, imputation/resampling for irregular timestamps). For downstream modeling and more sophisticated time-series operations you’ll often convert these into a pandas Series or DataFrame with a DatetimeIndex, but for quick inspection this explicit parsing + plot is a clear, robust first step in the forecasting pipeline.
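If you want time-aware slicing and resampling later, a convenient next step is to wrap the parsed values in a pandas Series with a DatetimeIndex. A minimal sketch, assuming the daily data loaded above:

import pandas as pd

returns = pd.Series(timeseries, index=pd.DatetimeIndex(dates)).sort_index()
print(returns.head())
returns.rolling(21).mean().plot()   # e.g. a 21-day rolling mean for a smoother view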

TRAIN_SIZE = 20
TARGET_TIME = 1
LAG_SIZE = 1
EMB_SIZE = 1

X, Y = split_into_chunks(timeseries, TRAIN_SIZE, TARGET_TIME, LAG_SIZE, binary=True, scale=True)

print(len(X), len(Y))

X, Y = np.array(X), np.array(Y)
X_train, X_test, Y_train, Y_test = create_Xt_Yt(X, Y, percentage=0.9)

This block is taking a raw univariate time series and turning it into supervised training and test sets for a very simple forecasting task. TRAIN_SIZE = 20 means each input example will summarize the preceding 20 time steps; TARGET_TIME = 1 means the label is the value one step after that input window (a one-step-ahead forecast). LAG_SIZE = 1 indicates the sliding window moves forward by one time step between successive examples, so windows will overlap heavily; that increases sample count but also correlation between adjacent examples. EMB_SIZE is declared here but not used in this snippet; it reappears below as the per-timestep feature dimension when the windows are reshaped into the 3-D tensor the LSTM expects (1, because the series is univariate), so it can be ignored for the current conversion step.

split_into_chunks(…) is the key preprocessing operation: it walks the timeseries with the sliding window described above and emits X (the windows) and Y (the corresponding targets). The flags binary=True and scale=True are important for the learning problem and optimization. scale=True normalizes the inputs (and often the targets) so that feature magnitudes are comparable and the optimizer behaves stably — this helps convergence and prevents exploding or vanishing gradients in models that follow. binary=True converts the raw prediction target into a binary signal (for example “up/down” or “anomaly/normal”), changing the problem from regression to classification; doing this is a design decision tied to the business goal (e.g., predicting direction rather than exact value).

The printed len(X), len(Y) is a quick sanity check: it confirms that the chunking produced the same number of inputs and labels and gives a sense of your effective dataset size after windowing. After that, X and Y are converted to NumPy arrays so they can be consumed efficiently by model training code and vectorized libraries. Finally, create_Xt_Yt(X, Y, percentage=0.9) performs the train/test split; percentage=0.9 allocates 90% of the samples to training and 10% to testing. For time series you must be careful about how this split is implemented — you typically want to preserve chronological order to avoid lookahead bias (i.e., the test set should consist of later time steps), so if create_Xt_Yt does random shuffling by default you should disable it or replace it with a time-ordered split.

In short, the flow is: raw timeseries → overlapping fixed-length windows (TRAIN_SIZE) → normalized, possibly binarized input/output pairs → numpy arrays → training and test partitions. The main choices to watch are the window length (TRAIN_SIZE) relative to signal dynamics, the prediction horizon (TARGET_TIME), the overlap (LAG_SIZE), and whether you treat the problem as regression or binary classification (binary flag). Ensure your split preserves temporal ordering and that your scaling is reversible (store the scaler) if you need to interpret model outputs back in the original units.
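For reference, the kind of windowing split_into_chunks performs can be sketched as follows. This is illustrative only, not the author's implementation (which ships with the article's source code), and the helper name make_windows is made up for the example:

def make_windows(series, window=20, horizon=1, step=1, binary=True, scale=True):
    # Slide a fixed-length window over the series and emit (window, label) pairs.
    X, Y = [], []
    for start in range(0, len(series) - window - horizon + 1, step):
        x = np.asarray(series[start:start + window], dtype=float)
        target = series[start + window + horizon - 1]        # value `horizon` steps ahead
        if scale:
            std = x.std()
            x = (x - x.mean()) / (std if std > 0 else 1.0)   # per-window normalization
        if binary:
            # one-hot direction label relative to the last observed value
            y = [1, 0] if target > series[start + window - 1] else [0, 1]
        else:
            y = [target]
        X.append(x)
        Y.append(y)
    return np.array(X), np.array(Y)

# X_demo, Y_demo = make_windows(timeseries, TRAIN_SIZE, TARGET_TIME, LAG_SIZE)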

X_train = np.reshape(X_train, (X_train.shape[0], X_train.shape[1], EMB_SIZE))
X_test = np.reshape(X_test, (X_test.shape[0], X_test.shape[1], EMB_SIZE))

model = Sequential()

model.add(LSTM(input_shape=(EMB_SIZE,), input_dim=EMB_SIZE, output_dim=HIDDEN_RNN, return_sequences=True))
model.add(LSTM(input_shape=(EMB_SIZE,), input_dim=EMB_SIZE, output_dim=HIDDEN_RNN, return_sequences=False))

model.add(Dense(2))
model.add(Activation('softmax'))
model.compile(optimizer='adam',
              loss='mse',
              metrics=['accuracy'])

This block is doing three things in sequence: preparing the input tensors for an RNN, assembling a stacked LSTM that reduces a sequence to a prediction vector, and compiling the model for training. The reshape at the top converts X_train and X_test into the 3‑D shape an LSTM expects: (samples, timesteps, features). Here EMB_SIZE is being used as the feature dimension per timestep, and X_train.shape[1] is the number of timesteps. That reshape is necessary because Keras LSTM layers require a timeseries axis and a feature axis so the recurrent kernel can scan across time and learn temporal dependencies.

The model is built as a Sequential stack of two LSTM layers followed by a dense output. The first LSTM is configured with return_sequences=True so it emits a hidden vector at every timestep; this is deliberate because the next LSTM in the stack needs a full sequence as input rather than a single summary. The second LSTM uses return_sequences=False (the default) so it reduces the incoming sequence into a single fixed‑length vector summarizing the learned temporal pattern across the entire input sequence. That final recurrent state is what the Dense layer uses to produce the model’s predictions. The HIDDEN_RNN parameter controls the size of the recurrent hidden state (how much capacity the RNN has to model temporal structure).

There are a few practical and semantic mismatches to call out and correct for clarity and correctness. First, the call signatures shown (input_shape=(EMB_SIZE,), input_dim=…, output_dim=…) are legacy/deprecated and also likely incorrect: the LSTM input_shape should be (timesteps, features) (for example input_shape=(X_train.shape[1], EMB_SIZE)), and you normally pass units=HIDDEN_RNN as the first positional/keyword argument rather than output_dim. Second, the Dense(2) + Activation(‘softmax’) combination produces a two‑element probability distribution (appropriate for a 2‑class classification task with one‑hot labels). That conflicts with the choice of loss=’mse’ which is a regression loss — softmax + MSE is rarely the intended pairing. For time series forecasting, if you are predicting continuous values (e.g., next one or two real‑valued steps), use Dense(n_outputs) with a linear activation and a regression loss like MSE or MAE, and evaluate with RMSE/MAE. If instead you’re doing classification (two classes), keep softmax (or a single sigmoid output) but switch the loss to categorical_crossentropy (or binary_crossentropy for a single sigmoid) and ensure labels are encoded appropriately. Also, using accuracy as a metric only makes sense for classification; for regression use MAE/RMSE.

Finally, compiling with optimizer=’adam’ is a sensible default for sequence models because Adam adapts learning rates per-parameter and tends to converge reliably on time series tasks, but you should align the loss/activation/metrics to the task objective. In short: confirm that EMB_SIZE is your feature dimension and timesteps are in X_train.shape[1], fix the LSTM input_shape/units signature, decide whether the problem is regression (use linear + MSE/MAE) or classification (use softmax/sigmoid + appropriate cross‑entropy), and pick metrics that reflect forecasting quality.
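A sketch of the same two-layer architecture written against the current Keras API, assuming the two-class, one-hot targets produced by binary=True (HIDDEN_RNN is set to an illustrative value here; use whatever the source code defines):

from keras.models import Sequential
from keras.layers import LSTM, Dense

HIDDEN_RNN = 64  # illustrative; match the value used elsewhere in the source

model = Sequential([
    LSTM(HIDDEN_RNN, return_sequences=True,
         input_shape=(X_train.shape[1], EMB_SIZE)),   # (timesteps, features)
    LSTM(HIDDEN_RNN),                                 # return_sequences=False by default
    Dense(2, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='categorical_crossentropy',        # matches softmax + one-hot labels
              metrics=['accuracy'])

If you would rather predict the next return as a real value, swap the head for Dense(1) with a linear activation and compile with loss='mae' or 'mse'.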

model.fit(X_train, 
          Y_train, 
          nb_epoch=5, 
          batch_size = 128, 
          verbose=1, 
          validation_split=0.1)


score = model.evaluate(X_test, Y_test, batch_size=128)
print(score)

This block is the train-and-evaluate step for a simple time‑series forecasting model. When model.fit(…) is called, the model begins an iterative optimization loop: for each of the nb_epoch (5) passes over the training data it processes the training set in mini‑batches of size 128, performing a forward pass to compute predictions, computing the loss against Y_train, and applying backpropagation and the optimizer to update weights. Using minibatches instead of full‑batch updates trades off noisier gradient estimates (which can help escape shallow minima) against throughput and memory use; 128 is a common mid‑range choice but should be tuned against dataset size and GPU/CPU memory. verbose=1 just enables per‑epoch logging so you can monitor progress.

validation_split=0.1 instructs the fit call to hold out 10% of the provided X_train/Y_train for validation at the end of each epoch; that validation set is not used to update weights but is used to compute validation loss/metrics so you can detect overfitting and monitor generalization during training. Note an important time‑series consideration here: Keras takes the last fraction of the (optionally shuffled) data as validation, and by default fit shuffles the data before splitting. For forecasting, you often want chronological splits (most recent data for validation/test) and to avoid shuffle=True, otherwise you risk leaking future information into training. If you need early stopping or checkpoints, you would normally attach callbacks to fit so training can stop automatically when validation metrics stop improving.

After training finishes, model.evaluate(X_test, Y_test, batch_size=128) runs the trained model on the independent test set to estimate final generalization performance. evaluate computes the loss (and any additional metrics you configured on model.compile) over the entire test set and returns them in the same order as model.metrics_names; using a batch_size here affects memory usage and throughput but not the numeric result. The printed score is therefore the test loss (and possibly metric values) and is your unbiased estimate of how well the model will forecast unseen time steps — assuming the test set was constructed without leakage.

A few practical tips tied to the “why” behind these choices: five epochs is a very small number usually used for a quick smoke test; for production forecasting you should train longer, inspect the training/validation curves (fit returns a History object), and use callbacks such as EarlyStopping/ModelCheckpoint to avoid overfitting. Also ensure your data splitting strategy preserves temporal order (no random shuffle) unless you intentionally want random subsampling, and scale/normalize inputs consistently between train/val/test to stabilize training and improve convergence.
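A sketch of how those tips might look in practice, with an explicit chronological validation split and callbacks, reusing the arrays defined earlier (the checkpoint filename is illustrative):

from keras.callbacks import EarlyStopping, ModelCheckpoint

split = int(len(X_train) * 0.9)                 # hold out the most recent 10% for validation
X_tr, X_val = X_train[:split], X_train[split:]
Y_tr, Y_val = Y_train[:split], Y_train[split:]

callbacks = [
    EarlyStopping(monitor='val_loss', patience=10),
    ModelCheckpoint('lstm_direction.h5', monitor='val_loss', save_best_only=True),
]

history = model.fit(X_tr, Y_tr,
                    epochs=100, batch_size=128,
                    validation_data=(X_val, Y_val),
                    shuffle=False,              # process training samples in chronological order
                    callbacks=callbacks,
                    verbose=1)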

timeseries, dates = load_snp_returns()
X, Y = split_into_chunks(timeseries, TRAIN_SIZE, TARGET_TIME, LAG_SIZE, binary=True)
X, Y = np.array(X), np.array(Y)
X_train, X_test, Y_train, Y_test = create_Xt_Yt(X, Y, percentage=0.9)

Y_train, Y_test = np.array([y.argmax() for y in Y_train]), np.array([y.argmax() for y in Y_test])


from sklearn.ensemble import RandomForestClassifier 
from sklearn import metrics, svm
from sklearn import linear_model

print('Training...')

#classifier = RandomForestClassifier(n_estimators = 100,
#                               n_jobs=4,
#                               verbose=1)

#classifier = linear_model.LogisticRegression(C=1e-5)

classifier = svm.SVC()
classifier.fit(X_train, Y_train)

print('Prediction...')
predicted = classifier.predict(X_test)
print("Classification report for classifier %s:\n%s\n"
      % (classifier, metrics.classification_report(Y_test, predicted)))

This block implements a simple classification-based time-series forecasting pipeline: it turns historical S&P returns into supervised examples, trains a classifier to predict the direction/class of a future return, and prints standard classification metrics. The data flow and decisions are as follows.

First we load the raw series and timestamps. The call to split_into_chunks is the crucial transformation from raw sequence to supervised learning examples: it creates sliding windows of past observations (the features) and associates each window with a target label at a fixed horizon. TRAIN_SIZE controls how many past time steps become the feature vector (so each X row is a recent 20-step history), TARGET_TIME is how far ahead we want to predict, and LAG_SIZE is the step between successive windows. binary=True indicates we are converting continuous returns into discrete classes (up/down) rather than predicting a numeric return, because predicting direction is often more robust and more directly actionable for many trading tasks. Note also that, unlike the LSTM run earlier, this call omits scale=True, so the classifiers see unscaled windows.

After chunking, X and Y are converted to numpy arrays so they can be consumed by scikit-learn. create_Xt_Yt(X, Y, percentage=0.9) splits the data into training and test sets with a 90/10 split. The code assumes a straightforward split (the function name suggests a deterministic partition rather than cross-validation); in time series contexts you typically want to ensure the split respects chronology (train on earlier data, test on later data) to avoid leaking future information into training — check the implementation of create_Xt_Yt to confirm that behavior.

Y is then transformed from one-hot or probabilistic encodings into integer class labels using argmax. This step is necessary because scikit-learn classifiers expect discrete class labels (integers) rather than one-hot vectors. Converting labels here makes the subsequent fit/predict workflow compatible with standard classifiers.

The training block shows three classifier options (RandomForest, LogisticRegression — commented out — and the chosen svm.SVC). The chosen SVM is instantiated with defaults (RBF kernel, etc.) and trained with classifier.fit(X_train, Y_train). SVMs can work well for moderate-dimensional data and non-linear decision boundaries, but they are sensitive to feature scaling and the C/gamma hyperparameters. The code currently omits any preprocessing like scaling or hyperparameter tuning, so results may be suboptimal or slow; RandomForest (commented) is a reasonable alternative that’s less sensitive to scaling and easier to interpret feature importances, while LogisticRegression is a simpler linear baseline.

Finally, the model predicts on X_test and the classification_report prints precision, recall, f1 and support per class. That report is the primary diagnostic here: it tells you how well the model predicts each class (e.g., up vs down), which is more informative for a trading decision than raw accuracy alone.

A few practical notes tied to the overall goal of simple time-series forecasting: because this pipeline casts forecasting as classification of direction, you’ve traded magnitude accuracy for robustness of directional signal. Ensure your chunking function preserves temporal ordering when splitting, scale features before SVM training (StandardScaler or similar), and add a validation step (time-series cross-validation or a separate validation period) and hyperparameter search to avoid overfitting. Also check label balance — if classes are skewed, consider class weighting or resampling.
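A sketch that bundles the scaling and tuning advice above into one pipeline, using a time-ordered cross-validation so no fold trains on data from its own future (the parameter grid is illustrative):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import TimeSeriesSplit, GridSearchCV
from sklearn import metrics

pipe = Pipeline([('scale', StandardScaler()), ('svc', SVC())])
grid = GridSearchCV(pipe,
                    param_grid={'svc__C': [0.1, 1, 10],
                                'svc__gamma': ['scale', 0.01, 0.001]},
                    cv=TimeSeriesSplit(n_splits=5),
                    scoring='f1_macro',
                    n_jobs=-1)
grid.fit(X_train, Y_train)

print(grid.best_params_)
print(metrics.classification_report(Y_test, grid.predict(X_test)))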

Download source code using the button below:
