How LSTM, ARIMA, and MCMC Can Be Integrated for Superior Stock Price Predictions
An In-Depth Look at LSTM, ARIMA, and MCMC for Enhanced Prediction Accuracy
A link to download the Jupyter notebook is at the end of this article.
The stock market comprises the markets and exchanges where shares of publicly held companies are issued, bought, and sold. A stock, or share, also referred to as a company’s equity, is a financial instrument that signifies ownership in a corporation, granting the holder a proportional claim on the company’s assets and earnings.
The prediction of future stock prices is a subject of substantial interest across numerous disciplines such as trading, finance, statistics, and computer science. The primary aim of this research is to forecast price movements effectively, allowing investors to make informed decisions regarding the purchase and sale of stocks to maximize their profits.
I use the yfinance package to download the data; it is straightforward to install and use.
yfinance is a widely used open-source library created by Ran Aroussi that provides access to financial data from Yahoo Finance. Yahoo Finance offers a broad range of market information, including data on stocks, bonds, currencies, and cryptocurrencies, as well as market news, reports, analysis, options data, and fundamental data.
For further information, the documentation can be accessed at the Yfinance website.
!pip install yfinance --quiet
!pip install pmdarima --quiet
These are shell commands run from a Jupyter notebook cell; the leading ! passes the line to the shell instead of the Python interpreter.
Each command calls pip to install a library from the Python Package Index (PyPI): yfinance in the first, pmdarima in the second. The --quiet flag suppresses most of the installation output.
yfinance downloads historical prices and other market data from Yahoo Finance. pmdarima simplifies ARIMA (AutoRegressive Integrated Moving Average) modeling for time series forecasting by providing automatic order selection and diagnostics.
PuLP is a Python library for formulating and solving linear programming problems. It is usually applied to relatively simple optimization tasks, but it is also useful for blending the predictions of several models.
The library utilizes the CBC solver by default; however, there are several alternative solvers available, including CPLEX, Gurobi, and GLPK.
For additional information, you may consult the official documentation available at the following link: [Documentation Source](http://coin-or.github.io/pulp/).
!pip install statsmodels==0.11.0rc1 --quiet
!pip install -Iv pulp==1.6.8 --quiet
These commands use pip to install specific versions of two libraries: statsmodels and pulp.
The first installs statsmodels 0.11.0rc1, a release candidate, meaning a pre-release build that may still receive last-minute fixes before the final version. The second installs pulp 1.6.8. The combined -Iv flag is short for two options: -I (--ignore-installed) reinstalls the package even if a version is already present, and -v (--verbose) requests more detailed output, which the trailing --quiet flag counteracts.
Pinning exact versions keeps the environment reproducible and compatible with the rest of the code, since newer releases can rename or remove functions the notebook relies on. This matters most in research and production settings, where an unplanned upgrade can silently change results.
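As a small illustration of the pinning idea, the sketch below compares the installed versions against the pins from the install commands. The helper name check_pins and the PINNED dictionary are hypothetical and only serve this example; the lookup itself uses the standard library's importlib.metadata.

```python
from importlib import metadata

# Versions pinned by the install commands above.
PINNED = {"statsmodels": "0.11.0rc1", "pulp": "1.6.8"}


def check_pins(pins):
    """Return {package: (expected, installed)} for every package whose
    installed version differs from its pin (installed is None if missing)."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


# Report any package that does not match its pin.
for name, (expected, installed) in check_pins(PINNED).items():
    print(f"{name}: expected {expected}, found {installed}")
```

Running a check like this at the top of a notebook makes a version drift visible before it shows up as a confusing error further down.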
Getting the training data
The models are trained on roughly one year of daily price history for a single ticker, downloaded from Yahoo Finance with yfinance. AMD is used as the example here, but any other ticker symbol can be substituted.
import yfinance as yf
# getting data from Yahoo Finance
stock_name = 'AMD' # here you can change the name of stock ticker, for example we will take AMD ticker
data = yf.download(stock_name, start="2020-03-26", end="2021-03-29")
This snippet downloads historical price data for one company from Yahoo Finance using the yfinance library.
It imports yfinance under the alias yf and stores the ticker symbol AMD (Advanced Micro Devices, Inc.) in stock_name; change this value to analyze a different stock.
The call to yf.download() then retrieves daily data for that ticker from March 26, 2020 up to, but not including, March 29, 2021, because the end date is exclusive. The result is a pandas DataFrame of daily prices and trading volume indexed by date.
Downloading the data programmatically makes the analysis repeatable and avoids collecting prices by hand.
import numpy as np
import yfinance as yf
import plotly.graph_objects as go  # used for the charts inside lstm()
import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, LSTM
from sklearn.preprocessing import MinMaxScaler
def lstm(stock_name, data):
    # Use only the closing price of the stock
    data = data.filter(['Close'])
    dataset = data.values
    # Chronological split: first 80% for training, last 20% for testing
    training_data_len = int(np.ceil(len(dataset) * .80))
    # Scale the prices to the range [0, 1]
    scaler = MinMaxScaler(feature_range=(0, 1))
    scaled_data = scaler.fit_transform(dataset)
    # Build training samples: 60 past prices -> the next price
    train_data = scaled_data[0:training_data_len, :]
    x_train = []
    y_train = []
    for i in range(60, len(train_data)):
        x_train.append(train_data[i-60:i, 0])
        y_train.append(train_data[i, 0])
        # Print the first two windows as a sanity check
        if i <= 61:
            print(x_train)
            print(y_train)
            print()
    x_train, y_train = np.array(x_train), np.array(y_train)
    # LSTM layers expect input shaped (samples, time steps, features)
    x_train = np.reshape(x_train, (x_train.shape[0], x_train.shape[1], 1))
    # Build the LSTM model; the time dimension is left as None because
    # shorter (20-step) windows are passed at prediction time further down
    model = Sequential()
    model.add(LSTM(128, return_sequences=True, input_shape=(None, 1)))
    model.add(Dropout(0.35))
    model.add(LSTM(64, return_sequences=False))
    model.add(Dropout(0.3))
    model.add(Dense(25, activation='relu'))
    model.add(Dense(1))
    # Compile the model (accuracy is not meaningful for regression, so only the loss is tracked)
    model.compile(optimizer='adam', loss='mean_squared_error')
    # Train the model
    model.fit(x_train, y_train, batch_size=1, epochs=21)
    # Save a diagram of the model structure
    keras.utils.plot_model(model, 'multi_input_and_output_model.png', show_shapes=True)
    # Create the test dataset; it starts 60 points before the split so the
    # first test day has a full window of history
    test_data = scaled_data[training_data_len - 60:, :]
    x_test = []
    y_test = dataset[training_data_len:, :]
    for i in range(60, len(test_data)):
        x_test.append(test_data[i-60:i, 0])
    x_test = np.array(x_test)
    x_test = np.reshape(x_test, (x_test.shape[0], x_test.shape[1], 1))
    # Predict on the test data and convert back to prices
    predictions = model.predict(x_test)
    predictions = scaler.inverse_transform(predictions)
    # The error is measured with RMSE, but MSE can be used too
    rmse = np.sqrt(np.mean((predictions - y_test) ** 2))
    print(f'RMSE LSTM: {rmse}')
    # Graphs: flatten to 1-D and derive the x-axes from the data lengths
    train_gr = dataset[:training_data_len].ravel()
    valid_gr = dataset[training_data_len:].ravel()
    preds_gr = predictions.ravel()
    x_train = list(range(0, training_data_len))
    x_valid = list(range(training_data_len, len(dataset)))
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=x_train, y=train_gr, mode='lines+markers', marker=dict(size=4), name='train', marker_color='#39304A'))
    fig.add_trace(go.Scatter(x=x_valid, y=valid_gr, mode='lines+markers', marker=dict(size=4), name='valid', marker_color='#A98D75'))
    fig.add_trace(go.Scatter(x=x_valid, y=preds_gr, mode='lines+markers', marker=dict(size=4), name='predictions', marker_color='#FFAA00'))
    fig.update_layout(legend_orientation="h",
                      legend=dict(x=.5, xanchor="center"),
                      plot_bgcolor='#FFFFFF',
                      xaxis=dict(gridcolor='lightgrey'),
                      yaxis=dict(gridcolor='lightgrey'),
                      title_text=f'{stock_name} LSTM data', title_x=0.5,
                      xaxis_title="Timestep",
                      yaxis_title="Stock price",
                      margin=dict(l=0, r=0, t=30, b=0))
    fig.show()
    # Predict stock prices for the next month
    data_new = yf.download(stock_name, start="2021-03-01", end="2021-04-30")
    data_new = data_new.filter(['Close'])
    dataset = data_new.values
    scaler = MinMaxScaler(feature_range=(0, 1))
    scaled_data = scaler.fit_transform(dataset)
    # Each 20-day window of actual prices yields one one-step-ahead prediction
    x_test = []
    for i in range(20, len(scaled_data)):
        x_test.append(scaled_data[i-20:i, 0])
    x_test = np.array(x_test)
    x_test = np.reshape(x_test, (x_test.shape[0], x_test.shape[1], 1))
    # Actual closing prices for the comparison period
    hist_data_new = yf.download(stock_name, start="2021-04-01", end="2021-05-04")
    hist_data_new = np.array(hist_data_new['Close']).ravel()
    pred_lstm = model.predict(x_test)
    pred_lstm = pred_lstm[:-1]
    pred_lstm = scaler.inverse_transform(pred_lstm)
    # Build graphs
    preds_gr = pred_lstm.ravel()
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=list(range(len(hist_data_new))), y=hist_data_new, mode='lines+markers', name='historical', marker_color='#39304A'))
    fig.add_trace(go.Scatter(x=list(range(len(preds_gr))), y=preds_gr, mode='lines+markers', name='predictions', marker_color='#FFAA00'))
    fig.update_layout(legend_orientation="h",
                      legend=dict(x=.5, xanchor="center"),
                      plot_bgcolor='#FFFFFF',
                      xaxis=dict(gridcolor='lightgrey'),
                      yaxis=dict(gridcolor='lightgrey'),
                      title_text=f'{stock_name} LSTM prediction', title_x=0.5,
                      xaxis_title="Timestep",
                      yaxis_title="Stock price",
                      margin=dict(l=0, r=0, t=30, b=0))
    fig.show()
    return pred_lstm, rmse
The function above uses a Long Short-Term Memory (LSTM) neural network to forecast stock prices from historical data. The steps are as follows.
First, the code keeps only the Close column and splits it chronologically: the first 80% of the observations are used for training and the remaining 20% for testing.
Next, a MinMaxScaler rescales the Close prices to the range 0 to 1. Neural networks generally train faster and more stably on inputs of similar magnitude. Note that the scaler is fitted on the whole series, including the test period.
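The scaling step can be sketched in a few lines of plain Python. This is an illustrative variant, not the notebook's code: it fits the minimum and maximum on the training slice only, so that no information from the test period influences the scaling. The helper names and the price list are made up for the example.

```python
import math


def fit_min_max(values):
    """Learn the scaling range (lowest and highest value) from the data."""
    return min(values), max(values)


def scale(values, lo, hi):
    """Map values to [0, 1] relative to the fitted range: (v - lo) / (hi - lo)."""
    return [(v - lo) / (hi - lo) for v in values]


def unscale(scaled, lo, hi):
    """Invert the scaling to get back to the original price scale."""
    return [s * (hi - lo) + lo for s in scaled]


# Made-up prices standing in for the Close column.
prices = [10.0, 12.0, 11.0, 15.0, 14.0, 18.0, 17.0, 20.0, 19.0, 22.0]
split = math.ceil(len(prices) * 0.8)   # first 80% is the training slice

lo, hi = fit_min_max(prices[:split])   # range learned from training data only
train_scaled = scale(prices[:split], lo, hi)
test_scaled = scale(prices[split:], lo, hi)

print(train_scaled)
print(test_scaled)   # values above 1.0 mean the test price exceeds the training maximum
```

With this approach, test values can fall outside the 0 to 1 range when prices move beyond anything seen in training, which is expected and does not need to be clipped for the inverse transform to work.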
The model itself consists of two LSTM layers (128 and 64 units), each followed by a dropout layer for regularization, then a dense layer of 25 units and a single-unit output layer that produces the predicted price.
During training, each sample is a window of the preceding 60 scaled closing prices, and the target is the price that immediately follows the window.
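The windowing logic is easier to see on a toy series. The sketch below uses a hypothetical helper, make_windows, and a lookback of 3 instead of 60; the integers stand in for scaled prices. It also shows why the test slice starts lookback points before the split: the first test target then has a full window of history.

```python
def make_windows(series, lookback):
    """Turn a sequence into (window, next value) pairs.

    Each window holds `lookback` consecutive values and the target is the
    value that immediately follows it.
    """
    windows, targets = [], []
    for i in range(lookback, len(series)):
        windows.append(series[i - lookback:i])
        targets.append(series[i])
    return windows, targets


series = list(range(10))   # 0, 1, ..., 9 as a stand-in for scaled prices
split = 8                  # first 8 points are training data
lookback = 3               # the article uses 60

x_train, y_train = make_windows(series[:split], lookback)
# Start the test slice `lookback` points early so every test target has history.
x_test, y_test = make_windows(series[split - lookback:], lookback)

print(x_train[0], "->", y_train[0])
print(x_test, "->", y_test)
```

The test targets produced this way are exactly the points after the split, so they line up one-to-one with the held-out prices used for the RMSE.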
After training, the model is evaluated on the test data: it predicts the held-out prices, the predictions are converted back to the original price scale, and the Root Mean Square Error (RMSE) is computed as a measure of the model’s accuracy.
The code then plots the training data, the validation data, and the predictions on one chart. Finally, it downloads more recent prices to predict the following month. This step fits a new scaler to the new data and feeds the trained model windows of 20 actual closing prices rather than 60, so each value is a one-step-ahead prediction; the results are plotted alongside the actual prices.
LSTM networks are designed for sequential data. Their memory cells retain information over long sequences, which lets the model pick up patterns that unfold over time. The dropout layers reduce overfitting by randomly deactivating units during training. The input is reshaped into the three-dimensional format (samples, time steps, features) that LSTM layers require.
Forecasts like these can support investment decisions, but their usefulness depends on the measured error on unseen data, which is why the RMSE is reported alongside the charts.
lstm_pred, lstm_rmse = lstm(stock_name, data)
This line calls the lstm function defined above with two arguments: stock_name, the ticker symbol to forecast, and data, the historical price DataFrame downloaded earlier. Only the Close column of data is used.
The call builds, trains, and evaluates the model and returns two values. lstm_pred holds the model’s predicted prices for the following month. lstm_rmse is the root mean square error on the test set; a lower RMSE means the predictions are closer to the actual prices.
from pmdarima.arima import ADFTest
# Keep only the closing prices for the stationarity test
data_adf = data['Close']
adf_test = ADFTest(alpha=0.05)
# Returns the p-value and whether differencing is recommended
adf_test.should_diff(data_adf)
This code runs a statistical test to check whether the series of Close prices is stationary.
First, it builds data_adf by keeping only the Close column of data, since the Augmented Dickey-Fuller (ADF) test is applied to a single series.
Next, it imports ADFTest from pmdarima. This class runs the ADF test, which checks for the presence of a unit root in a univariate time series.
An ADFTest instance is created with a significance level (alpha) of 0.05, and should_diff() is applied to the Close prices. The method returns the p-value together with a flag indicating whether differencing is needed to make the series stationary.
Many forecasting techniques, ARIMA models in particular, assume a stationary series, meaning that its statistical properties, such as mean and variance, do not change over time. Knowing whether differencing is required lets the data be prepared correctly before modeling.
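Differencing, the transformation the test decides on, is simple to illustrate. The sketch below uses a hypothetical helper, difference, and a made-up series with a steady upward trend; it is only meant to show what the "I" in ARIMA does, not how pmdarima implements it.

```python
def difference(series, lag=1):
    """Return the lagged differences series[i] - series[i - lag]."""
    return [series[i] - series[i - lag] for i in range(lag, len(series))]


# A made-up price series that rises by 2 every step: its mean keeps drifting,
# so it is not stationary.
trend = [100 + 2 * t for t in range(8)]

# After one round of differencing the trend is gone and the values are constant.
diffed = difference(trend)

print(trend)
print(diffed)
```

Each round of differencing shortens the series by the lag, and the number of rounds applied corresponds to the d parameter of an ARIMA model.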
According to the test, the series can be treated as stationary, so we can proceed to fit an ARIMA model. We use pmdarima’s auto_arima, which selects the model orders automatically.
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pmdarima as pm
import plotly.graph_objects as go  # used for the charts inside arima()
import yfinance as yf
def arima(stock_name, data):
    df_close = data['Close']
    # Chronological split (90% train, 10% test), used for the overview chart
    split = int(len(df_close) * 0.9)
    train_data, test_data = df_close[3:split], df_close[split:]
    # Derive the x-axes from the data lengths instead of hard-coding them
    x_train = list(range(len(train_data)))
    x_test = list(range(len(train_data), len(train_data) + len(test_data)))
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=x_train, y=train_data, mode='lines+markers', marker=dict(size=4), name='train', marker_color='#39304A'))
    fig.add_trace(go.Scatter(x=x_test, y=test_data, mode='lines+markers', marker=dict(size=4), name='test', marker_color='#A98D75'))
    fig.update_layout(legend_orientation="h",
                      legend=dict(x=.5, xanchor="center"),
                      plot_bgcolor='#FFFFFF',
                      xaxis=dict(gridcolor='lightgrey'),
                      yaxis=dict(gridcolor='lightgrey'),
                      title_text=f'{stock_name} ARIMA data', title_x=0.5,
                      xaxis_title="Timestep",
                      yaxis_title="Stock price",
                      margin=dict(l=0, r=0, t=30, b=0))
    fig.show()
    # Fit on the full series: the forecast covers the month after the data ends
    model = pm.auto_arima(df_close, start_p=0, d=None, start_q=0,
                          max_p=5, max_d=5, max_q=5, start_P=0,
                          D=1, start_Q=0, max_P=5, max_D=5,
                          max_Q=5, m=7, seasonal=True,
                          error_action='warn', trace=True,
                          suppress_warnings=True, stepwise=True,
                          random_state=20, n_fits=50)
    print(model.summary())
    # Forecast the next 22 periods; the model was fitted without exogenous
    # regressors, so none are passed here
    preds = np.asarray(model.predict(n_periods=22)).ravel()
    # Actual closing prices for the forecast period
    hist_data = yf.download(stock_name, start="2021-04-01", end="2021-05-04")
    hist_data = np.array(hist_data['Close']).ravel()
    # Compare 1-D arrays over their common length so values align one-to-one
    n = min(len(preds), len(hist_data))
    rmse = np.sqrt(np.mean((preds[:n] - hist_data[:n]) ** 2))
    print(f'RMSE ARIMA: {rmse}')
    # Build graphs
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=list(range(len(hist_data))), y=hist_data, mode='lines+markers', name='historical', marker_color='#39304A'))
    fig.add_trace(go.Scatter(x=list(range(len(preds))), y=preds, mode='lines+markers', name='predictions', marker_color='#FFAA00'))
    fig.update_layout(legend_orientation="h",
                      legend=dict(x=.5, xanchor="center"),
                      plot_bgcolor='#FFFFFF',
                      xaxis=dict(gridcolor='lightgrey'),
                      yaxis=dict(gridcolor='lightgrey'),
                      title_text=f'{stock_name} ARIMA prediction', title_x=0.5,
                      xaxis_title="Timestep",
                      yaxis_title="Stock price",
                      margin=dict(l=0, r=0, t=30, b=0))
    fig.show()
    # Return the forecast as a column vector, as the original code did
    return preds.reshape(-1, 1), rmse
The arima function forecasts a stock’s closing prices with an ARIMA (AutoRegressive Integrated Moving Average) model.
It starts by selecting the closing prices and splitting them chronologically into 90% training and 10% test data. In this function the split is used only for the overview chart; the model itself is fitted on the full series, because the forecast targets the month after the downloaded data ends.
The split is plotted with Plotly. The function then calls pmdarima’s auto_arima, which searches stepwise over the non-seasonal orders (p, d, q) and the seasonal orders (P, D, Q), here with a seasonal period of m=7, and selects the model with the lowest information criterion (AIC by default).
The fitted model forecasts 22 periods, that is, the next 22 trading days. Trading volume does not act as an exogenous regressor here: the model is fitted on closing prices alone, and using volume would require supplying it at fit time as well as providing its future values over the forecast horizon. Forecast quality is measured by the Root Mean Squared Error (RMSE) between the predicted and actual prices; lower values indicate a closer fit.
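One detail worth checking when computing the RMSE with NumPy is array shape. Forecasts are often held as a column of shape (n, 1) while the actual prices are a flat array of shape (n,); subtracting the two broadcasts to an (n, n) matrix, and the resulting error compares every prediction with every actual value. The sketch below, with made-up numbers and a hypothetical rmse helper, flattens both inputs first and rejects mismatched lengths.

```python
import numpy as np


def rmse(predicted, actual):
    """Root mean squared error between two equally long sequences."""
    predicted = np.asarray(predicted, dtype=float).ravel()
    actual = np.asarray(actual, dtype=float).ravel()
    if predicted.shape != actual.shape:
        raise ValueError(
            f"length mismatch: {predicted.shape[0]} predictions "
            f"vs {actual.shape[0]} actual values"
        )
    return float(np.sqrt(np.mean((predicted - actual) ** 2)))


actual = np.array([10.0, 12.0, 14.0])            # shape (3,)
preds_column = np.array([[11.0], [12.0], [13.0]])  # shape (3, 1)

# Subtracting a column from a flat array silently broadcasts to a matrix.
broadcast_shape = (preds_column - actual).shape

# Flattening first compares each prediction with its own actual value.
score = rmse(preds_column, actual)

print(broadcast_shape)
print(score)
```

Raising an error on a length mismatch is a design choice for this sketch; it surfaces problems such as a forecast horizon that does not match the number of downloaded trading days.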
Finally, the function plots the forecast against the actual prices so the two can be compared directly.
Reporting the RMSE alongside the chart helps users judge how far the forecasts can be relied on when making buying and selling decisions.
arima_pred, arima_rmse = arima(stock_name, data)
print(arima_pred.shape)
These two lines run the ARIMA forecast and inspect its output.
The call to arima(stock_name, data) fits the model to the downloaded price history and returns two values: arima_pred, the predicted prices, and arima_rmse, the Root Mean Squared Error between the predictions and the actual prices, where a lower value means better accuracy.
Printing arima_pred.shape shows how many predictions were produced and how the array is laid out. Checking this before plotting or comparing models confirms that the output has the expected format.
Given the significant disparity between the actual and predicted data, it may be advisable to explore alternative regression methods such as SARIMAX or VAR. I conducted tests with these methods; however, the outcomes were nearly identical to those obtained previously.
One possible explanation for this phenomenon could be the heightened market volatility during the prediction period. This volatility may hinder the ability of these models to adjust their forecasting trends on a daily basis. Additionally, if we apply these regression methods to older data, we may observe a reduction in error.