Optimizing Financial Strategies: Harnessing Machine Learning for Enhanced Trading Performance
Leveraging Alpaca API and Advanced Analytics to Navigate Market Volatility and Maximize Returns
In this exploration of cutting-edge trading technology, we introduce a sophisticated trading bot engineered to leverage a decade of financial data via the Alpaca API. The bot's foundation is rooted in machine learning, using the alpaca.get_bars() function for data access and focusing on a moving average crossover strategy. This strategy, pivotal to its operation, hinges on the interaction between a short 4-day and a long 400-day Simple Moving Average (SMA), a technique aimed at capturing market trends and volatility.
The setup involves critical libraries like Pandas for data processing, Matplotlib for visualization, and SKLearn for machine learning model implementation. The article outlines the configuration of Alpaca API keys, data retrieval, preprocessing, and the application of machine learning models, including Support Vector Machines (SVM) and Logistic Regression. It delves into model training, testing on historical data, and evaluation using classification reports and return analyses, emphasizing the significance of feature scaling and selection. The culmination of this technical journey is the analysis of trade signals and the financial efficacy of the strategy, measured by profit/loss and ROI metrics, presenting a nuanced blend of algorithmic trading and machine learning.
Download the source code from the link in the comments section.
This bot is a sophisticated algorithm that draws on roughly ten years of financial data obtained from the Alpaca API. It employs the alpaca.get_bars() function, and the retrieved history is capped at 1,000 trading days of data.
For training, the bot uses a one-year period. This period is divided such that 75% covers the time leading up to the pandemic-induced market crash, and the remaining 25% includes the crash period and the initial phase of the market recovery.
The trading strategy of the bot is based on moving average crossovers. It executes trades when the 4-day Simple Moving Average (SMA) crosses the 400-day SMA.
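Before walking through the notebook, here is a minimal sketch of what such a crossover rule can look like in pandas. The DataFrame name, helper function, and Position/Crossover columns below are illustrative assumptions rather than code from the notebook, which (as shown later) labels its training signals from daily return direction instead.
# Minimal crossover sketch (illustrative; `prices` is assumed to be a DataFrame with a "close" column)
import numpy as np
import pandas as pd

def crossover_signals(prices, short_window=4, long_window=400):
    signals = prices[["close"]].copy()
    # Fast and slow simple moving averages of the close price
    signals["SMA_Fast"] = signals["close"].rolling(window=short_window).mean()
    signals["SMA_Slow"] = signals["close"].rolling(window=long_window).mean()
    signals = signals.dropna()
    # Hold long (1) while the fast SMA is above the slow SMA, short (-1) otherwise
    signals["Position"] = np.where(signals["SMA_Fast"] > signals["SMA_Slow"], 1, -1)
    # A nonzero difference marks the bar on which a crossover (trade) occurs
    signals["Crossover"] = signals["Position"].diff().fillna(0)
    return signals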
# Import the required libraries and dependencies
import os
import requests
import pandas as pd
from dotenv import load_dotenv
import alpaca_trade_api as tradeapi
%matplotlib inline
from alpaca_trade_api.rest import TimeFrame
import numpy as np
from pathlib import Path
import hvplot.pandas
import matplotlib.pyplot as plt
from sklearn import svm
from sklearn.preprocessing import StandardScaler
from pandas.tseries.offsets import DateOffset
from sklearn.metrics import classification_report
This code imports the necessary libraries and dependencies, such as os, requests, pandas, dotenv, alpaca_trade_api, matplotlib, numpy, hvplot, and pathlib, to support the code that follows. The %matplotlib inline magic tells the notebook to display plots inline. The code also imports TimeFrame from alpaca_trade_api.rest for specifying bar intervals, the svm module from sklearn for the support vector machine classifier, the classification_report function for evaluating classifier performance, StandardScaler from sklearn for standardizing datasets, DateOffset from pandas.tseries.offsets for date arithmetic on time-series data, and Path from pathlib for file and directory path handling.
Step 1: In the root directory of the `Unsolved` folder, generate a `.env` file. This file is designated for storing your Alpaca API keys and secret keys.
In Step 2, you will need to integrate the Alpaca API and secret keys into the decisive_probability_distributions.ipynb file. Start by assigning the values of these keys to variables with corresponding names. To achieve this, begin by invoking the load_dotenv() function to load the environment variables. Then, assign the values of the environment variables to alpaca_api_key and alpaca_secret_key. Finally, ensure that these variables are correctly set up and accessible by verifying the type of each variable.
# Load the environment variables from the .env file
load_dotenv()

# Set Alpaca API key and secret by calling the os.getenv function and referencing the environment variable names
# Set each environment variable to a notebook variable of the same name
alpaca_api_key = os.getenv("ALPACA_API_KEY")
alpaca_secret_key = os.getenv("ALPACA_SECRET_KEY")

# Check the values were imported correctly by evaluating the type of each
print(type(alpaca_api_key))
print(type(alpaca_secret_key))
This Python code loads the environment variables from the .env file with load_dotenv(), then calls os.getenv to read the Alpaca API key and secret, assigning each to a notebook variable of the same name so they are easy to use in the code that follows. Checking the type of each variable confirms the values were imported correctly: if a key is missing, os.getenv returns None rather than a str. Keeping the keys in environment variables rather than hard-coding them allows sensitive credentials to be accessed securely from the notebook.
For Step 3, you will establish the Alpaca API REST object. This is done by utilizing the Alpaca tradeapi.REST function. During this process, you will need to configure the function by setting the parameters alpaca_api_key, alpaca_secret_key, and api_version. This step is essential for initializing the REST object with the correct credentials and settings.
# Create your Alpaca API REST object by calling Alpaca's tradeapi.REST function
# Set the parameters to your alpaca_api_key, alpaca_secret_key and api_version="v2"
alpaca = tradeapi.REST(
alpaca_api_key,
alpaca_secret_key,
api_version="v2")
This code creates an API object using Alpaca's tradeapi.REST function. It sets the parameters alpaca_api_key, alpaca_secret_key, and api_version="v2", allowing the API object to access data and perform trading actions on behalf of the user. This object will be used to make requests to Alpaca's trading API, letting the user manage their account and retrieve market data through the Alpaca platform.
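If you want to confirm that the credentials are valid before requesting any data, one simple check (not part of the notebook, and assuming valid paper-trading keys are loaded) is to fetch the account object:
# Optional sanity check: a successful call confirms the REST object is authenticated
account = alpaca.get_account()
print(account.status)        # for example, "ACTIVE"
print(account.buying_power)  # buying power available to the account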
In Step 4, you will leverage the Alpaca SDK to perform an API call that retrieves daily stock data for the selected stock tickers. Begin by defining the required tickers. Next, set start_date and end_date using the pd.Timestamp function; this notebook pulls roughly a decade of history, from April 12, 2012, to April 12, 2022. You should then specify the timeframe value as one day. Finally, create the prices_df DataFrame by assigning it the result of the alpaca.get_bars function called with the previously set parameters.
# Create the list for the required tickers
tickers = ["SPY"]
The code creates a list named tickers containing a single element, SPY, the ticker symbol of the SPDR S&P 500 ETF. This list is used later in the code to tell the Alpaca API which instrument's price data to retrieve.
# Set the values for start_date and end_date using the pd.Timestamp function
# The start and end dates should be 2012-04-12 to 2022-04-12
# Set the parameter tz to "America/New_York",
# Set this all to the ISO format by calling the isoformat function
start_date = pd.Timestamp("2012-04-12", tz="America/New_York").isoformat()
end_date = pd.Timestamp("2022-04-12", tz="America/New_York").isoformat()
This code sets the start and end dates to be used in the analysis. It does this by calling the pd.Timestamp function and passing in the dates 2012-04-12 and 2022-04-12. The tz parameter is set to America/New_York, which specifies the time zone for these dates, and the isoformat function converts them into the ISO format commonly used for date and time representations. Setting the start and end dates this way ensures the API request uses the correct time zone and format.
# Use the Alpaca get_bars function to gather the price information for each ticker
# Include the function parameters: tickers, timeframe, start, and end
# Be sure to call the df property to ensure that the returned information is set as a DataFrame,
# then keep only the first 1000 rows of daily bars
prices_df = alpaca.get_bars(
    tickers,
    TimeFrame.Day,
    start=start_date,
    end=end_date
).df.iloc[:1000]
# Review the first five rows of the resulting DataFrame
prices_df.head()
This code uses the Alpaca get_bars function to retrieve price information for the list of tickers within the specified time frame. The call takes the list of tickers, the timeframe, and the start and end dates. The df property converts the returned bars into a DataFrame, and the .iloc[:1000] slice keeps only the first 1,000 rows of daily data. Finally, the .head method displays the first five rows of the resulting DataFrame.
Create a Single DataFrame with Returns from Close Prices
# Filter the date index and close columns
signals_df = prices_df.loc[:, ["close"]]
# Use the pct_change function to generate returns from close prices
signals_df["Actual Returns"] = signals_df["close"].pct_change()
# Drop all NaN values from the DataFrame
signals_df = signals_df.dropna()
# Review the DataFrame
display(signals_df.head())
display(signals_df.tail())
First, it creates a new DataFrame called signals_df that only includes the close column from the existing prices_df DataFrame, allowing for a more focused analysis of the closing price data. Then, the pct_change function is applied to the close column to calculate the percentage change in price between each day, which generates a new Actual Returns column in signals_df. Next, the code drops any rows in signals_df that contain NaN (not a number) values, ensuring the data is clean before analysis. Finally, the code displays the first and last five rows of signals_df to show the processed data. This is helpful for analyzing the performance of a stock over time and identifying potential trends or patterns in its price movements.
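As a quick illustration of what pct_change computes, here is a tiny made-up price series (the numbers are illustrative only, not data from the notebook):
# pct_change returns (price_t - price_{t-1}) / price_{t-1} for each row
example_closes = pd.Series([100.0, 102.0, 99.96])
print(example_closes.pct_change())
# NaN, (102 - 100) / 100 = 0.02, (99.96 - 102) / 102 = -0.02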
Generate Trading Signals Using Short And Long Window SMA Values
# Set the short window and long window
short_window = 4
long_window = 400
# Generate the fast and slow simple moving averages (4 and 400 days, respectively)
signals_df['SMA_Fast'] = signals_df['close'].rolling(window=short_window).mean()
signals_df['SMA_Slow'] = signals_df['close'].rolling(window=long_window).mean()
signals_df = signals_df.dropna()
# Review the DataFrame
display(signals_df.head())
display(signals_df.tail())
This code generates the simple moving average (SMA) values used for the trading signals. The first two lines set the short and long windows, which are the periods over which the moving averages are calculated. The next two lines use the rolling method to calculate the fast and slow SMAs from the closing prices, and the resulting SMAs are added as new columns to the dataset. The code then drops any rows with null values (the early rows that do not yet have a full window of data), and the last two lines display the top and bottom rows of the updated dataset. This allows for a visual review of the SMA values alongside the corresponding price movements.
# Initialize the new Signal column
signals_df['Signal'] = 0.0
# When Actual Returns are greater than or equal to 0, generate signal to buy stock long
signals_df.loc[(signals_df['Actual Returns'] >= 0), 'Signal'] = 1
# When Actual Returns are less than 0, generate signal to sell stock short
signals_df.loc[(signals_df['Actual Returns'] < 0), 'Signal'] = -1
# Review the DataFrame
display(signals_df.head())
display(signals_df.tail())
The first line of code initializes a new column called Signal in the signals_df DataFrame and sets all of its values to 0.0. The next line uses a conditional statement to check whether the actual return for each day is greater than or equal to 0; if it is, a signal to buy the stock long is generated and the corresponding value in the Signal column is set to 1. Similarly, the following line generates a signal to sell the stock short when the actual return is less than 0, setting the corresponding Signal values to -1. The last two lines display the first and last five rows of the DataFrame so the generated signals can be reviewed.
signals_df['Signal'].value_counts()
This code returns a count of each unique value in the Signal column of the signals_df DataFrame. The result is a table displaying the unique values in the Signal column as well as the number of times each value appears in the column. This can be useful for understanding the distribution of data in the Signal column and identifying any dominant or rare values. It also allows for easy comparison between the different signal values to see which may be more prevalent.
# Calculate the strategy returns and add them to the signals_df DataFrame
signals_df['Strategy Returns'] = signals_df['Actual Returns'] * signals_df['Signal'].shift()
# Review the DataFrame
display(signals_df.head())
display(signals_df.tail())
First, the values in the Signal column are shifted down by one row, so that each day's return is paired with the signal generated on the previous day; a signal can only be acted on in the following period. The shifted signals are then multiplied by the Actual Returns column, and the results are stored in a new Strategy Returns column in the signals_df DataFrame. Finally, the first and last few rows of the DataFrame are displayed to show how the strategy returns vary over time.
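A tiny made-up example (the values are illustrative, not from the notebook) shows why the shift matters: yesterday's signal decides whether today's return is earned long or short.
# Illustrative alignment of prior-day signals with current-day returns
demo = pd.DataFrame({
    "Actual Returns": [0.010, -0.020, 0.015],
    "Signal": [1, -1, 1],
})
demo["Strategy Returns"] = demo["Actual Returns"] * demo["Signal"].shift()
print(demo)
# Row 1: prior signal  1 * -0.020 = -0.020
# Row 2: prior signal -1 *  0.015 = -0.015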
# Plot Strategy Returns to examine performance
(1 + signals_df['Strategy Returns']).cumprod().plot()
The code plots the performance of the strategy by compounding its returns over time. Adding 1 to each value in the Strategy Returns column converts each percentage return into a growth factor (a return of -2%, for example, becomes 0.98). The .cumprod function then multiplies these factors cumulatively, producing the growth of one unit of capital invested in the strategy, and the plot function draws the cumulative curve over time. This makes it easy to evaluate the effectiveness of the trading strategy visually and decide whether it should be continued or modified.
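A small numeric example (illustrative values) makes the compounding explicit:
# Adding 1 turns each return into a growth factor; cumprod chains the factors together
example_returns = pd.Series([0.10, -0.05, 0.02])
print((1 + example_returns).cumprod())
# 1.10, then 1.10 * 0.95 = 1.045, then 1.045 * 1.02 = 1.0659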
Split The Data Into Training And Testing Datasets
# Assign a copy of the sma_fast and sma_slow columns to a features DataFrame called X
X = signals_df[['SMA_Fast', 'SMA_Slow']].shift().dropna()
# Review the DataFrame
X.head()
This code creates a new DataFrame called X by selecting the SMA_Fast and SMA_Slow columns from the signals_df DataFrame. These columns are shifted down by one row so that each row's features are the previous day's SMA values, the information that would actually be available when predicting that day's signal, and any rows with missing data created by the shift are removed. Finally, the first few rows of the new DataFrame are displayed using the .head function.
# Create the target set by selecting the Signal column and assigning it to y
y = signals_df['Signal']
# Review the value counts
y.value_counts()
This code creates the target set by selecting the Signal column from the signals_df DataFrame and assigning it to the variable y. This column holds the labels the model will learn to predict. The code then reviews the value counts of y, that is, the number of times each signal value (1 or -1) appears. Knowing this class distribution is useful before training a classifier, since heavily imbalanced classes affect how the model should be evaluated.
# Select the ending period for the training data with an offset of 3 months
training_end = X.index.min() + DateOffset(months=3)
# Display the training end date
print(training_end)
This code selects the ending period for the training data by using an offset of 3 months, meaning the training window ends 3 months after the earliest index value of X. The DateOffset function makes this kind of date arithmetic easy, in this case adding 3 months to the minimum index date, so that the model is trained on the first three months of data and later evaluated on data it has never seen. The final line prints the selected training end date for confirmation.
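For reference, DateOffset performs calendar-aware arithmetic, so adding three months lands on the same calendar day three months later (the timestamp below is illustrative, not taken from the notebook's data):
# Example of DateOffset arithmetic on a timezone-aware timestamp
example_start = pd.Timestamp("2012-04-12", tz="America/New_York")
print(example_start + DateOffset(months=3))  # 2012-07-12 00:00:00-04:00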
# Select the start of the training period as the earliest date in the index
training_begin = X.index.min()

# Generate the X_train and y_train DataFrames
X_train = X.loc[training_begin : training_end]
y_train = y.loc[training_begin : training_end]

# Review the X_train DataFrame
X_train.head()
This code builds the training set by taking subsets of the original data and assigning them to the X_train and y_train DataFrames. The X_train DataFrame is created by selecting the rows of X between training_begin (the earliest date in the index) and training_end, and y_train is created from the same slice of y. Finally, the code displays the first few rows of X_train for review. Splitting a dataset chronologically like this is standard practice in time-series machine learning, so the model is trained on one period and evaluated on a later one.
# Generate the X_test and y_test DataFrames
X_test = X.loc[training_end+DateOffset(hours=1):]
y_test = y.loc[training_end+DateOffset(hours=1):]
# Review the X_test DataFrame
X_test.head()
This code generates two DataFrames, X_test and y_test, from the previously defined X and y. The DateOffset function adds an hour to training_end, so the test set begins just after the final timestamp of the training window and the two sets do not overlap. X_test therefore contains all rows of X after training_end, and y_test contains the corresponding labels. Finally, the code displays the first few rows of the X_test DataFrame for review. Overall, this prepares the data for a time-series evaluation by creating a training set and a test set split at a fixed date.
# Scale the features DataFrames
# Create a StandardScaler instance
scaler = StandardScaler()
# Apply the scaler model to fit the X-train data
X_scaler = scaler.fit(X_train)
# Transform the X_train and X_test DataFrames using the X_scaler
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)
This code standardizes the features to prepare them for the model. First, a StandardScaler instance is created. The scaler is then fitted to the X_train data, which calculates the mean and standard deviation of each feature from the training data only; fitting on the training set alone keeps information about the test set from leaking into the scaling. The fitted scaler is then used to transform both X_train and X_test, producing X_train_scaled and X_test_scaled (NumPy arrays in which each feature has been centered and rescaled). Putting the features on a comparable scale generally improves the accuracy and convergence of models such as support vector machines.
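As a quick check of what the scaler did (this verification step is not in the notebook, but it only uses the variables defined above), the scaled training features should have approximately zero mean and unit standard deviation:
# Verify the effect of standardization on the training features
print(np.round(X_train_scaled.mean(axis=0), 6))  # approximately [0. 0.]
print(np.round(X_train_scaled.std(axis=0), 6))   # approximately [1. 1.]
# The test set is transformed with the training mean and standard deviation,
# so its scaled columns will be close to, but not exactly, 0 and 1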
In Step 5, you are to utilize the SVC classifier model, which is part of SKLearn's support vector machine (SVM) learning methodology. Your task involves fitting this model to the training data and then using it to make predictions on the testing data. Once the predictions are generated, review and analyze them; this step is crucial for understanding how well the model performs and for gaining insights from the predictions.
# From SVM, instantiate SVC classifier model instance
svm_model = svm.SVC()
# Fit the model to the data using the training data
svm_model = svm_model.fit(X_train_scaled, y_train)
# Use the testing data to make the model predictions
svm_pred = svm_model.predict(X_test_scaled)
# Review the model's predicted values
svm_pred[:10]
This code uses a support vector machine (SVM) to create a classification model. First, the SVC classifier model is instantiated and assigned to the variable svm_model. The model is then trained on the scaled training data X_train_scaled and the corresponding labels y_train by calling the fit method. Next, the model makes predictions on the testing data X_test_scaled, and the predicted values are assigned to svm_pred using the predict method. Finally, the first 10 predicted values are shown using the slicing syntax svm_pred[:10] for a quick look at the model's output on unseen data.
In Step 6, your focus will be on evaluating the classification report associated with the predictions made by the SVC model. This review is an important part of the process because it provides detailed insight into the performance and accuracy of the model's predictions. Analyzing the report will help you understand key metrics such as precision, recall, and F1-score, and assess how effectively the model classifies the data.
# Use a classification report to evaluate the model using the predictions and testing data
svm_testing_report = classification_report(y_test, svm_pred)
# Print the classification report
print(svm_testing_report)
This code evaluates the model by comparing its predictions to the actual testing data. The classification_report function generates a report of metrics such as precision, recall, and F1-score for each class, which can be used to assess the performance of the model. The function takes the true labels (y_test) first and the predicted values (svm_pred) second, and the resulting report is stored in the variable svm_testing_report. Finally, the report is printed, letting the developer view and analyze the model's performance and decide where improvements are needed.
In Step 7, the task involves constructing a DataFrame for the predictions. This DataFrame should include columns for the Predicted values, the Actual Returns, and the Strategy Returns. Creating this structure allows for an organized, clear presentation of the model's predictions, the actual market returns, and the returns generated by the trading strategy, which is essential for comparing the model's predictions against actual market performance.
# Create a predictions DataFrame
predictions_df = pd.DataFrame(index=X_test.index)
# Add the SVM model predictions to the DataFrame
predictions_df['Predicted'] = svm_pred
# Add the actual returns to the DataFrame
predictions_df['Actual Returns'] = signals_df['Actual Returns']
# Add the strategy returns to the DataFrame
predictions_df['Strategy Returns'] = predictions_df['Actual Returns'] * svm_pred
# Review the DataFrame
display(predictions_df.head())
display(predictions_df.tail())
This code first creates a DataFrame called predictions_df using the pandas library, setting its index to match that of the test dataset, X_test. Next, it adds the predictions made by the support vector machine (SVM) model to the DataFrame under the column name Predicted. The actual returns from the signals_df DataFrame are then added under the column name Actual Returns. Finally, the strategy returns are computed by multiplying the actual returns by the SVM predictions and added under the column name Strategy Returns. The head and tail methods display the first and last five rows of the DataFrame, respectively.
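A natural follow-up, not shown in the code above, is to compound and plot the actual and strategy returns over the test window so the two equity curves can be compared visually:
# Compare cumulative actual vs. strategy returns for the SVM model
(1 + predictions_df[["Actual Returns", "Strategy Returns"]]).cumprod().plot(
    title="Actual vs. Strategy Cumulative Returns (SVM)"
)
plt.show()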