Enhancing Financial Analysis and Prediction with Python
A Comprehensive Guide to Building Machine Learning Models for Stock Market Data
In the realm of financial markets, the ability to accurately analyze and predict stock movements is invaluable. With the advent of machine learning and data science, traders, analysts, and investors have at their disposal a powerful toolkit for deciphering market trends and making informed decisions. This guide delves into the practical application of Python in constructing and evaluating machine learning models tailored for the stock market. Through a series of functions, we explore the process of preparing stock data, implementing prediction algorithms, and simulating trading strategies. By leveraging Python’s robust libraries such as NumPy, pandas, and scikit-learn, alongside custom statistical analysis techniques, we aim to provide a solid foundation for anyone looking to harness the predictive power of machine learning in financial markets.
Intraday-240,3-LSTM.py
from keras.layers import Input, Dense, Dropout, CuDNNLSTM
from keras.models import Model
from keras import optimizers

def makeLSTM():
    # Sequences of 240 time steps with 3 features per step
    inputs = Input(shape=(240, 3))
    # GPU-only cuDNN LSTM kernel; keeps just the last time step's output
    x = CuDNNLSTM(25, return_sequences=False)(inputs)
    x = Dropout(0.1)(x)
    outputs = Dense(2, activation='softmax')(x)  # two-class probabilities
    model = Model(inputs=inputs, outputs=outputs)
    model.compile(loss='categorical_crossentropy', optimizer=optimizers.RMSprop(), metrics=['accuracy'])
    model.summary()
    return model
This function constructs and compiles a Keras model using the LSTM architecture, optimized for GPUs via CuDNNLSTM. The model is designed for sequence processing, as indicated by the input shape (240, 3), where 240 is the sequence length and 3 is the feature dimension of each time step. Inside the function, an input layer is instantiated to receive the input sequences. A single CuDNNLSTM layer with 25 units then processes the sequences; its return_sequences parameter is set to False, meaning it outputs only the result of the last time step. A subsequent dropout layer with a rate of 0.1 helps prevent overfitting by randomly setting 10% of its inputs to 0 during training. The processed representation then passes through a single dense layer with a softmax activation, indicative of a binary classification approach: the softmax output is a probability distribution over two classes. Next, the model is compiled with the categorical cross-entropy loss function, the RMSprop optimizer, and accuracy tracking. Finally, the function prints the model summary and returns the compiled model. CuDNNLSTM is a cuDNN-backed implementation that runs faster than the standard LSTM layer on Nvidia GPUs, which makes this model well suited to GPU acceleration.
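As a quick sanity check, here is a minimal sketch (not part of the original source) that pushes a random batch through the model to confirm the tensor shapes; it assumes a CUDA-capable GPU is available, since CuDNNLSTM has no CPU fallback.

import numpy as np

model = makeLSTM()

# Hypothetical batch: 8 samples, each 240 time steps with 3 features.
dummy_x = np.random.rand(8, 240, 3).astype('float32')
probs = model.predict(dummy_x)
print(probs.shape)  # (8, 2): one softmax distribution per sample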
from keras.callbacks import CSVLogger, ModelCheckpoint, EarlyStopping

def callbacks_req(model_type='LSTM'):
    # model_folder and test_year are assumed to be defined globally
    csv_logger = CSVLogger(model_folder + '/training-log-' + model_type + '-' + str(test_year) + '.csv')
    filepath = model_folder + '/model-' + model_type + '-' + str(test_year) + '-E{epoch:02d}.h5'
    model_checkpoint = ModelCheckpoint(filepath, monitor='val_loss', save_best_only=False, period=1)
    earlyStopping = EarlyStopping(monitor='val_loss', mode='min', patience=10, restore_best_weights=True)
    return [csv_logger, earlyStopping, model_checkpoint]
The callbacks_req function creates and returns a list of callbacks used when training Keras/TensorFlow models. Its optional model_type parameter defaults to 'LSTM'; besides indicating the type of model being trained, it is mainly used to personalize file names. Three callbacks are instantiated:
1. CSVLogger logs the results of each epoch. Its file path is constructed from the model_folder variable, the string 'training-log-' followed by the model type and the test_year variable, with the .csv extension appended.
2. ModelCheckpoint saves the model at every epoch. Unlike the CSVLogger path, this file path uses the string 'model-' and inserts the epoch number into the filename. The period=1 argument ensures the model is saved every epoch, and save_best_only=False means every epoch is saved, not just the best one.
3. EarlyStopping stops training when a monitored metric (val_loss, with mode='min') no longer improves: the run ends if val_loss fails to improve over a patience period of 10 epochs. With restore_best_weights=True, the model weights are restored to those of the epoch with the best val_loss.
Finally, the function returns a list of these three callbacks, which can be passed to a training method (e.g., fit or fit_generator). Note that model_folder and test_year are not defined inside the function; they are presumably set elsewhere as globals.
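For completeness, here is a hedged sketch of how those assumed globals might be set and the callbacks wired into training; the folder name and year below are illustrative, not taken from the source.

import os

model_folder = 'models'   # assumed global: where logs and checkpoints go
test_year = 2020          # assumed global: tags the output file names
os.makedirs(model_folder, exist_ok=True)

callbacks = callbacks_req(model_type='LSTM')
# model.fit(train_x, enc_y, epochs=1000, validation_split=0.2,
#           callbacks=callbacks, batch_size=512)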
import numpy as np

def reshaper(arr):
    # Split the columns into 3 feature blocks: (N, 720) -> (3, N, 240)
    arr = np.array(np.split(arr, 3, axis=1))
    arr = np.swapaxes(arr, 0, 1)  # (3, N, 240) -> (N, 3, 240)
    arr = np.swapaxes(arr, 1, 2)  # (N, 3, 240) -> (N, 240, 3)
    return arr
The reshaper function transforms an array arr using NumPy. The array is first split into 3 equal subarrays along the second dimension (axis=1), and np.array stacks them into an array with one more dimension than the original; for an input of shape (N, 720), this produces (3, N, 240). Two swapaxes operations follow: the first swaps the leading two dimensions, giving (N, 3, 240), and the second swaps the second and third dimensions, giving (N, 240, 3), which matches the model's expected input shape. The function assumes the input array can be split evenly into 3 parts along its second axis; otherwise, the np.split operation raises an error.
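A quick, self-contained shape check (synthetic data, not from the source) makes the transformation concrete:

import numpy as np

flat = np.arange(2 * 720).reshape(2, 720)  # 2 samples, 720 flat columns
seq = reshaper(flat)
print(seq.shape)                           # (2, 240, 3)
# Each feature column comes from one contiguous 240-column block:
assert seq[0, 0, 0] == flat[0, 0]
assert seq[0, 0, 1] == flat[0, 240]
assert seq[0, 0, 2] == flat[0, 480]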
from sklearn.preprocessing import OneHotEncoder

def trainer(train_data, test_data):
    np.random.shuffle(train_data)
    # Columns: 0 = date, 2:-2 = features, -2 = return, -1 = label (col 1 unused here)
    (train_x, train_y, train_ret) = (train_data[:, 2:-2], train_data[:, -1], train_data[:, -2])
    train_x = reshaper(train_x)
    train_y = np.reshape(train_y, (-1, 1))
    train_ret = np.reshape(train_ret, (-1, 1))
    # One-hot encode the binary labels for the two-unit softmax output
    enc = OneHotEncoder(handle_unknown='ignore')
    enc.fit(train_y)
    enc_y = enc.transform(train_y).toarray()
    train_ret = np.hstack((np.zeros((len(train_data), 1)), train_ret))
    model = makeLSTM()
    callbacks = callbacks_req()
    model.fit(train_x, enc_y, epochs=1000, validation_split=0.2, callbacks=callbacks, batch_size=512)
    # Predict day by day on the test set
    dates = list(set(test_data[:, 0]))
    predictions = {}
    for day in dates:
        test_d = test_data[test_data[:, 0] == day]
        test_d = reshaper(test_d[:, 2:-2])
        predictions[day] = model.predict(test_d)[:, 1]
    return (model, predictions)
This code defines the trainer function, which trains a machine learning model and then makes predictions on test data. The function takes two inputs: train_data and test_data. To begin, the training data is shuffled in place. It is then split by array slicing into inputs (train_x), target labels (train_y), and returns (train_ret). The inputs are reshaped with the reshaper function defined earlier. The target labels are one-hot encoded using scikit-learn's OneHotEncoder so they match the two-unit softmax output, and the returns are reshaped and stacked horizontally with a zero column, although the stacked result is not used again in this snippet. The model is created by calling makeLSTM(), the LSTM architecture defined above, and trained for up to 1,000 epochs with a 20% validation split, a batch size of 512, and the callbacks from callbacks_req(); in practice, early stopping usually ends training well before the epoch limit. Once the model has been trained, the function iterates over the unique dates in test_data, filters the rows for each day, reshapes them with reshaper, and stores the second column of the model's predictions (the probability of class 1) in a dictionary keyed by date. Finally, the function returns both the trained model and the dictionary of predictions.
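The exact column layout of train_data and test_data is not documented; the sketch below infers it from the slicing above (date in column 0, 720 feature columns, return and label in the last two) and builds a synthetic array of that shape purely for illustration.

import numpy as np

n_days, n_stocks = 10, 10
rows = n_days * n_stocks
dummy = np.zeros((rows, 724))                         # 2 + 720 + 2 columns
dummy[:, 0] = np.repeat(np.arange(n_days), n_stocks)  # pseudo-dates
dummy[:, 2:-2] = np.random.rand(rows, 720)            # flattened features
dummy[:, -2] = np.random.randn(rows) * 0.01           # next-period returns
dummy[:, -1] = (dummy[:, -2] > 0).astype(float)       # binary labels
# model, predictions = trainer(dummy, dummy)          # needs a GPU for CuDNNLSTM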
from keras.models import load_model

def trained(filename, train_data, test_data):
    # Load a previously saved checkpoint instead of retraining
    model = load_model(filename)
    dates = list(set(test_data[:, 0]))
    predictions = {}
    for day in dates:
        test_d = test_data[test_data[:, 0] == day]
        # One feature per time step here, unlike reshaper's (240, 3)
        test_d = np.reshape(test_d[:, 2:-2], (len(test_d), 240, 1))
        predictions[day] = model.predict(test_d)[:, 1]
    return (model, predictions)
The trained function takes three arguments: filename, the expected file path of a saved model, plus train_data and test_data, the datasets for training and testing (the training set is accepted but not used here). The function loads a pre-trained model from the given filename by calling Keras's load_model. The unique dates are extracted from the first column of test_data, and an empty dictionary called predictions is created to store the results. The function then iterates over each unique date, filters the test data to the entries for the current day, and reshapes them so that each sequence has 240 time steps with a single feature. Note that this (240, 1) shape differs from the (240, 3) shape produced by reshaper, so this helper fits a checkpoint that was trained on one feature per time step. The model predicts on the reshaped data, producing a two-dimensional array from which the second element of every row (index 1) is stored in the predictions dictionary with the date as the key. Finally, the function returns a tuple containing the loaded model and the dictionary of predictions.
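A hypothetical call, assuming a checkpoint written by the ModelCheckpoint callback above (the epoch tag in the name will vary from run to run):

# Filename pattern follows callbacks_req; 'E10' is an illustrative epoch tag.
filename = model_folder + '/model-LSTM-' + str(test_year) + '-E10.h5'
# model, predictions = trained(filename, train_data, test_data)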
import pandas as pd

def simulate(test_data, predictions):
    rets = pd.DataFrame([], columns=['Long', 'Short'])
    k = 10  # number of stocks held on each side
    for day in sorted(predictions.keys()):
        preds = predictions[day]
        test_returns = test_data[test_data[:, 0] == day][:, -2]
        # Long the k stocks with the highest predicted scores
        top_preds = preds.argsort()[-k:][::-1]
        trans_long = test_returns[top_preds]
        # Short the k stocks with the lowest predicted scores
        worst_preds = preds.argsort()[:k][::-1]
        trans_short = -test_returns[worst_preds]
        rets.loc[day] = [np.mean(trans_long), np.mean(trans_short)]
    print('Result : ', rets.mean())
    return rets
The code defines a function called simulate, which takes two arguments: test_data and predictions. For long (buy) and short (sell) positions, the function calculates the mean daily return obtained by taking the top (buy) and bottom (sell) predicted values for each day, using pandas and NumPy for data manipulation and statistics. Specifically, test_data is assumed to be a NumPy array with dates in the first column and returns in the penultimate column, while predictions is a dictionary whose keys are days and whose values are arrays of predicted scores for the assets traded that day. The function initializes an empty DataFrame rets with columns Long and Short. For each day, it sorts the predicted scores and takes the k (here 10) highest predictions for long positions and the k lowest for short positions, retrieves the corresponding returns from the test data, negates the returns for short positions, and stores the mean long and short returns in rets. Once all days have been processed, the function prints the overall means and returns the DataFrame of daily mean long and short returns.
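To exercise this last stage of the pipeline in isolation, here is a toy run of simulate on synthetic numbers (random scores and returns, no real market data); the 724-column layout is the same assumption as in the trainer sketch above.

import numpy as np

rng = np.random.default_rng(0)
n_stocks = 50
test_data = np.zeros((2 * n_stocks, 724))
test_data[:, 0] = np.repeat([0, 1], n_stocks)          # two trading days
test_data[:, -2] = rng.normal(0, 0.01, 2 * n_stocks)   # per-stock daily returns

predictions = {d: rng.random(n_stocks) for d in (0, 1)}  # fake model scores
rets = simulate(test_data, predictions)
print(rets)  # mean return of the 10 best longs / 10 worst shorts per day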