A Complete Guide to Stock Market Analysis and Forecasting with Python
Explore stock market trends, risk, and correlation, and learn to build an LSTM forecasting model from scratch
Link to download the source code at the end of the article!
Time series data is just a list of measurements taken over time, like daily stock prices. Working with time series is everywhere in data work, so learning to shape and read them is a basic and useful skill. It helps you spot trends, seasonality, and sudden changes that matter for forecasting.
In this notebook we’ll explore tech stocks — Apple, Amazon, Google, and Microsoft — to learn practical tricks. We’ll fetch their histories, make clear visuals with Seaborn and Matplotlib (these are plotting libraries that help you draw charts), and study risk from past behavior. We’ll also try to predict future prices using an LSTM — a kind of neural network that remembers patterns over time. Doing this gives you a full pipeline: from raw data to visuals to models, which is exactly what real forecasting projects need.
Along the way we’ll answer questions like: how did prices change over time; what was the average daily return; what are moving averages; how correlated are the stocks; how much value is at risk if you invest; and how we can attempt to predict future behavior (for example, predicting Apple’s closing price with an LSTM).
The first practical step is getting the data into memory. We’ll pull stock data from Yahoo Finance, a free and rich market-data source, using the yfinance library — it’s threaded so it can download multiple symbols efficiently, and Pythonic so it’s easy to use. See the article “Reliably download historical market data with Python” for more details.
First, we want to know how the stock price changed over time — that means looking at its history to see rises, falls, and patterns. Understanding these changes helps us spot trends and volatility, which are the signals our forecasting models will try to learn from. This step gives a clear picture of past behavior so we can build better predictions.
In this section we’ll show how to request stock information using *pandas*, a Python library for working with tables (a DataFrame is just a smart table that makes rows and columns easy to use). You’ll fetch price and volume data and look at basic attributes like open, high, low, close, and volume — the daily numbers that describe trading. We’ll also cover simple checks and cleaning, because fetching accurate, well-formed data now prevents frustrating bugs later and prepares the dataset for feature creation and modeling.
!pip install -q yfinance

We’re building toward a machine learning pipeline that needs historical stock data to learn patterns and make forecasts; the single line here is about fetching the tool that fills our pantry with that market data. The exclamation mark at the start tells a Jupyter notebook that we’re stepping out of Python and into the system shell to run a command, like leaving the kitchen to fetch an ingredient from the store. pip is the package manager, the delivery service that brings external libraries into our environment, and the -q flag asks it to be quiet so the notebook output stays tidy. Key concept: dependency management is about making sure your work environment has the right external libraries so others (and future you) can reproduce the results. yfinance is the specific library we request — think of it as a curated aisle in the market that hands us historical stock prices and related fields from Yahoo Finance, which is the raw data our feature engineering and models will consume. Running that line doesn’t yet fetch any prices or train models; it simply ensures the tool that will fetch historical quotes is available, a small but essential first step in the larger forecast project where clean, reliable data is the foundation for meaningful predictions.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
plt.style.use("fivethirtyeight")
%matplotlib inline
# For reading stock data from yahoo
from pandas_datareader.data import DataReader
import yfinance as yf
from pandas_datareader import data as pdr
yf.pdr_override()
# For time stamps
from datetime import datetime
# The tech stocks we’ll use for this analysis
tech_list = ['AAPL', 'GOOG', 'MSFT', 'AMZN']
# Set up End and Start times for data grab
end = datetime.now()
start = datetime(end.year - 1, end.month, end.day)
for stock in tech_list:
    globals()[stock] = yf.download(stock, start, end)
company_list = [AAPL, GOOG, MSFT, AMZN]
company_name = ["APPLE", "GOOGLE", "MICROSOFT", "AMAZON"]
for company, com_name in zip(company_list, company_name):
    company["company_name"] = com_name
df = pd.concat(company_list, axis=0)
df.tail(10)

We start by bringing the tools we’ll need: pandas and numpy for data handling, matplotlib and seaborn for pretty charts — seaborn.set_style and plt.style.use are like choosing a theme for our notebook so plots look consistent. The line that begins with % makes plots show up right inside a Jupyter notebook; it’s a little notebook magic to embed visuals.
Next we import helpers to fetch stock prices from the web and call yf.pdr_override() so the pandas_datareader functions use yfinance under the hood. Calling a function is like pulling out a reusable recipe card that does a specific job, and yf.download is one of those recipe cards for grabbing price histories.
We record timestamps with datetime so we can ask for exactly one year of data: end is “now” and start is the same day a year earlier. The tech_list holds the ticker symbols we want.
Then we loop over each ticker and call yf.download to fetch its historical prices and store the result in a variable named after the ticker via globals(); a loop is like repeating a recipe step for each item, and globals()[stock] is like labeling jars on a shelf with the ticker so you can grab them later. Each fetched object is a pandas DataFrame, which is like a spreadsheet with rows and columns holding the dates and price data.
We assemble those labeled DataFrames into company_list and create a parallel list of human-friendly company_name labels. The zip loop walks both lists together and writes a new column “company_name” into each DataFrame so every row knows which company it belongs to. Finally, pd.concat stacks all company tables vertically into one big DataFrame df, and df.tail(10) peeks at the last ten rows to verify everything looks right.
With a clean, combined dataset of past prices and company labels, we’re ready to move on to feature engineering and the machine learning models that will attempt to forecast future stock behavior.
The numbers in our file are all numeric and the date acts as the index, which means each row is labeled by a date. A DataFrame is just a smart table that holds those rows and columns, so having the date as the index makes it easy to pick ranges of time. You’ll also notice weekends are missing from the records — that’s normal for stock data because markets are closed on weekends. It’s useful to know this up front because forecasting models need consistent time steps, so you may later resample or fill those gaps.
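Because the date is the index, picking a time window is a one-line slice, and you can check or repair the business-day spacing before modeling. A minimal sketch, assuming the AAPL DataFrame downloaded above (the dates here are purely illustrative):

# Slice a date range directly off the DatetimeIndex (illustrative dates)
window = AAPL.loc['2023-01-01':'2023-03-31']
# Reindex to a business-day calendar; market holidays become NaN rows
business_days = AAPL.asfreq('B')
# Forward-fill the holiday gaps if your model needs evenly spaced steps
filled_close = business_days['Adj Close'].ffill()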
Quick note about using globals() to name DataFrames: it’s simple and quick for small, throwaway scripts, but it’s a bit sloppy. It dumps variables into the global space, which can clutter things and lead to hard-to-find bugs. For experiments it’s fine, but for clearer, safer code you’ll usually prefer a dictionary or explicit variable names.
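A minimal sketch of that cleaner approach, assuming the same tech_list and the yf.download call used above:

# Keep every ticker's DataFrame in one dictionary instead of polluting globals()
stock_frames = {ticker: yf.download(ticker, start, end) for ticker in tech_list}
# Access any company by its ticker
stock_frames['AAPL'].head()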
Now that the data is loaded, we should do some basic analysis and checks. This helps catch missing or weird values and reveals trends or patterns you’ll want your model to learn. Running these checks now saves time later when you start building the forecast.
The .describe() method gives descriptive statistics — short summaries that show central tendency (a typical value like the mean or median), dispersion (how spread out the numbers are, like the standard deviation), and shape (whether the distribution is lopsided or skewed). It skips NaN values, which means missing data points don’t change the summaries. This quick snapshot helps you spot outliers, missing data, or wildly different scales before you train a stock‑price model.
You can run .describe() on a Series (a single column) or on a DataFrame (a smart table with many columns). It works for numeric columns and for object columns like text, and when a table mixes types the output adapts to what you give it. Because the results vary with the input, check the output for each feature so you know whether to clean, scale, or transform that column for better forecasting.
# Summary Stats
AAPL.describe()

We’re trying to take a quick look at the AAPL data to understand its basic shape before we feed anything into a model. The comment line is just a human note saying “Summary Stats” and doesn’t affect execution; think of it as a sticky label reminding you what you’re about to read. The call AAPL.describe() asks the DataFrame to produce a one‑page report of descriptive statistics — count, mean, standard deviation, min, 25th/50th/75th percentiles, and max — for each numeric column. A key concept: summary statistics capture central tendency and spread — things like the average (mean) and quartiles — so you can quickly gauge typical values and variability.
That report is like a quick health check for your dataset: the count helps you spot missing values, the standard deviation and percentiles reveal how noisy or skewed a series is, and the min/max can highlight obvious outliers that might confuse a model. In the context of forecasting stock prices, this snapshot tells you whether you need to clean data, transform scales, or treat outliers before building features and training models, so your forecasting recipe has the best ingredients.
We have only 255 records in one year because weekends are not included in the data. By “records” I mean rows or daily price points, and weekends are missing because the stock market is closed on Saturday and Sunday — there are simply no trades to record.
This matters because you’ll have roughly 110 fewer samples than calendar days, which affects model training and creates gaps like the price jump from Friday to Monday that your model needs to handle.
In pandas, the .info() method prints a quick summary of a DataFrame. A DataFrame is just a smart table that holds your rows and columns. .info() shows the index *dtype* (the type of row labels, like integers or dates), the column names, how many non-null values each column has (that means how many entries are not missing), and the DataFrame’s memory usage (how much RAM it takes).
Running .info() early helps you spot missing data and weird column types before you start cleaning or training models. It also tells you if you should change data types to save memory, which can speed up training and avoid crashes on large datasets.
# General info
AAPL.info()

We’re at the gentle first step of the data journey: the comment tells us “General info,” like a signpost that says “let’s take a quick look around before we start cooking.” Calling AAPL.info() is like flipping to the table of contents and skimming the ingredient list for a recipe — it prints the index type, column names, how many entries each column actually has, the data type of each column, and roughly how much memory the table is using. A key concept here is exploratory data analysis: it’s the practice of summarizing a dataset’s main characteristics to guide next steps. Seeing non‑null counts helps you spot missing values that would need filling or dropping; seeing dtypes tells you whether dates are true timestamps or mere strings and whether numbers are stored as floats or objects that need conversion. Memory usage gives a heads-up about whether you’ll need to optimize storage for large time series. In a forecasting project, this quick inspection prevents surprises later — if Volume is missing half the time or Date isn’t a datetime index, your modeling pipeline would fail or mislead. After this peek, you’ll know whether to clean, convert types, set a proper time index, or engineer features, which leads directly into the preprocessing and modeling steps of the stock price forecast.
The closing price is the last price a stock traded at during the regular trading day, which means the hours when the market is officially open. Think of it as the day’s final snapshot of what buyers and sellers agreed the stock was worth.
Investors use the closing price as the standard benchmark — a common point of comparison — to track a stock’s performance over time. For forecasting, this is handy because it gives a consistent, easy-to-compare target across days and helps smooth out noisy intraday swings, making model results easier to interpret.
# Let’s see a historical view of the closing price
plt.figure(figsize=(15, 10))
plt.subplots_adjust(top=1.25, bottom=1.2)
for i, company in enumerate(company_list, 1):
    plt.subplot(2, 2, i)
    company['Adj Close'].plot()
    plt.ylabel('Adj Close')
    plt.xlabel(None)
    plt.title(f"Closing Price of {tech_list[i - 1]}")
plt.tight_layout()

We’re trying to give ourselves a historical view of adjusted closing prices for a few tech companies, laying them out like pictures on a wall so we can compare shapes and trends. The first line creates the canvas and sets its size to 15 by 10 inches so the plots won’t be cramped. The next line nudges the top and bottom margins so titles and labels have breathing room.
Then we start a loop that repeats a plotting recipe for each company; a loop is like following the same cooking step for each ingredient. enumerate(company_list, 1) hands us both the company data and a 1-based index so we can place plots in a human-friendly grid; using 1 as the start means the first index is 1 instead of 0, which is why we later subtract one when matching names. Inside each repetition, plt.subplot(2, 2, i) selects the i-th cell of a 2-by-2 grid so each company gets its own panel, like four frames on a gallery wall.
company['Adj Close'].plot() draws the time series of adjusted closing prices; adjusted close accounts for dividends and splits, so it’s the right series to compare true investor returns. The next two lines label the vertical axis and clear the horizontal label to keep things tidy, while plt.title(f"Closing Price of {tech_list[i - 1]}") names the panel using the corresponding tech_list entry (note the i - 1 to convert the 1-based loop index back to the list index). Finally, tight_layout compresses spacing so nothing overlaps and the figure looks clean.
Seeing these historical patterns helps us choose features and sanity-check our forecasting models as we move into machine learning.
Volume is how much of something changes hands over a set time. An asset or security is just something you can buy or sell, like a company’s stock. For a stock, trading volume means the number of shares that changed owners between the market open and close. That daily count shows how active trading was that day.
Trading volume, and how it changes over time, matters to *technical traders* — people who study price charts and indicators to make decisions. We use volume because it tells us whether many participants back a price move, which helps confirm trends and shows liquidity, or how easy it is to buy or sell without shifting the price. For forecasting with machine learning, volume is a useful feature that adds context about market interest and the strength behind price changes.
# Now let’s plot the total volume of stock being traded each day
plt.figure(figsize=(15, 10))
plt.subplots_adjust(top=1.25, bottom=1.2)
for i, company in enumerate(company_list, 1):
    plt.subplot(2, 2, i)
    company['Volume'].plot()
    plt.ylabel('Volume')
    plt.xlabel(None)
    plt.title(f"Sales Volume for {tech_list[i - 1]}")
plt.tight_layout()

We’re trying to show how much stock changed hands each day for a handful of companies so we can visually compare trading activity. The first line calls plt.figure(figsize=(15, 10)), which takes out a wide canvas to draw on; a function is a key concept: it’s a reusable recipe card you call to perform a specific action like creating a figure. The next line, plt.subplots_adjust(top=1.25, bottom=1.2), nudges the canvas margins so titles and axes have room — think of stretching the paper margins before you start sketching.
Then we enter a for loop: for i, company in enumerate(company_list, 1):. A loop is a key concept: it repeats a recipe step for each company in the list. enumerate hands us both the company and its position, starting at 1 so the subplot indexing lines up with Matplotlib’s 1-based subplot numbers. Inside the loop plt.subplot(2, 2, i) picks one quadrant of a 2×2 grid as our current drawing area. company['Volume'].plot() sketches the time series of traded volume from that company’s data frame, like tracing daily bars on the chosen quadrant. plt.ylabel('Volume') labels the vertical axis, plt.xlabel(None) clears a redundant x-label for visual clarity, and plt.title(f"Sales Volume for {tech_list[i - 1]}") writes a descriptive title using the matching company name (note the i - 1 adjustment because enumerate started at 1). Finally, plt.tight_layout() tidies spacing so the panels don’t overlap.
Seeing these volume patterns helps inform feature engineering and diagnostics for the stock price forecasting models.
Now that we’ve looked at the charts for the stock’s closing price and the number of shares traded each day (that’s the *volume*, or how many shares changed hands), we can move on to a simple next step.
Let’s calculate the *moving average* for the stock. A moving average is just the average of the last few days, so it smooths out the daily ups and downs and makes the underlying trend easier to see. This smoothing helps reveal whether the price is generally rising or falling, and it prepares the data for the forecasting models we’ll build next.
A *moving average* is a simple tool from technical analysis (that’s just a way to study past prices to spot patterns). It smooths out the ups and downs by taking an average of recent prices and updating that average every time a new price comes in — like averaging the last 10 days, or the last 20 minutes, or the last 30 weeks. Think of it like a running score that makes the overall direction easier to see.
You compute a moving average separately for each stock you’re watching, so you can compare their trends side by side. Traders pick the time period based on how fast they trade — short periods catch quick moves, long periods show long-term trends. We use moving averages because they cut the noise and give forecasting models a cleaner signal to learn from, which helps when predicting future prices.
ma_day = [10, 20, 50]
for ma in ma_day:
    for company in company_list:
        column_name = f"MA for {ma} days"
        company[column_name] = company['Adj Close'].rolling(ma).mean()
fig, axes = plt.subplots(nrows=2, ncols=2)
fig.set_figheight(10)
fig.set_figwidth(15)
AAPL[['Adj Close', 'MA for 10 days', 'MA for 20 days', 'MA for 50 days']].plot(ax=axes[0,0])
axes[0,0].set_title('APPLE')
GOOG[['Adj Close', 'MA for 10 days', 'MA for 20 days', 'MA for 50 days']].plot(ax=axes[0,1])
axes[0,1].set_title('GOOGLE')
MSFT[['Adj Close', 'MA for 10 days', 'MA for 20 days', 'MA for 50 days']].plot(ax=axes[1,0])
axes[1,0].set_title('MICROSOFT')
AMZN[['Adj Close', 'MA for 10 days', 'MA for 20 days', 'MA for 50 days']].plot(ax=axes[1,1])
axes[1,1].set_title('AMAZON')
fig.tight_layout()

We’re trying to add simple trend lines to each stock and then show them in a four-panel picture so we can see how smoothed prices behave over time. First, a small list of window sizes (10, 20, 50) is created as the set of recipe lengths we’ll use for moving averages. Then a nested pair of loops repeats a step for each window and for each company: the outer loop picks a window size and the inner loop picks a company, like applying the same seasoning to every pan of vegetables. For each pair we build a label string such as “MA for 10 days” so the new column has a readable name, and then we compute company['Adj Close'].rolling(ma).mean() and store it under that label. A rolling mean computes the average of the last ma points at each time step, smoothing short-term fluctuations so longer trends stand out — that smoothing is often useful as a feature in forecasting models.
Next we create a 2×2 grid of plotting frames and set the figure height and width so the panels aren’t cramped, like laying out four photos on a poster and choosing the poster size. For each company (AAPL, GOOG, MSFT, AMZN) we select the adjusted close and the three moving-average columns and draw them into the appropriate subplot, then give that subplot a title with the company name so each panel is labeled. Finally we call tight_layout so labels and axes don’t overlap. The result is four clear trend-comparison charts that you can use to inspect patterns or feed smoothed features into your machine learning forecasting pipeline.
The graph shows that the best values to measure the moving average are 10 and 20 days. A moving average is just a smoothed version of the price that averages the last N days so you can see the trend more clearly, instead of every tiny up-and-down. Using 10 and 20 days smooths out random daily jumps but still follows the real trend.
For forecasting stock prices with machine learning, choosing 10- and 20-day moving averages gives your model cleaner signals and fewer distractions from noise. That makes it easier for the model to learn meaningful patterns and avoid overfitting to everyday randomness. This step prepares the data so later features and indicators will be more reliable.
We want to know the stock’s average daily return — in other words, how much the stock changed from one day to the next on average. A *daily return* is just the percent (or fraction) change from yesterday’s close to today’s close, so a positive number means the price usually went up and a negative number means it usually went down.
To find the average you calculate every day’s return and then take their mean (add them up and divide by the number of days). This gives a simple baseline for how the stock drifts over time, which helps when you build forecasting models because many of them expect to know the typical daily move. If you care about long-term growth instead of day-to-day behavior, you’d look at compounded (geometric) returns rather than the simple average.
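As a rough sketch of the difference, assuming the AAPL DataFrame downloaded above:

# Simple (arithmetic) average of daily returns
daily_returns = AAPL['Adj Close'].pct_change().dropna()
avg_daily_return = daily_returns.mean()
# Compounded (geometric) per-day growth rate over the same period
geometric_daily_return = (1 + daily_returns).prod() ** (1 / len(daily_returns)) - 1
print(avg_daily_return, geometric_daily_return)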
Now that we’ve done some baseline analysis, let’s dive a bit deeper and focus on risk. We’ll look at daily returns — the percentage change in price from one day to the next — instead of raw prices. Returns show how jumpy the stock is and are what most risk tools expect, so they give a clearer picture of volatility.
We’ll use pandas, a Python library for working with tables (a DataFrame is just a smart table), to pull the daily returns for Apple stock. Getting these returns now prepares the data you’ll use to estimate volatility and feed into forecasting models.
# We’ll use pct_change to find the percent change for each day
for company in company_list:
    company['Daily Return'] = company['Adj Close'].pct_change()
# Then we’ll plot the daily return percentage
fig, axes = plt.subplots(nrows=2, ncols=2)
fig.set_figheight(10)
fig.set_figwidth(15)
AAPL['Daily Return'].plot(ax=axes[0,0], legend=True, linestyle='--', marker='o')
axes[0,0].set_title('APPLE')
GOOG['Daily Return'].plot(ax=axes[0,1], legend=True, linestyle='--', marker='o')
axes[0,1].set_title('GOOGLE')
MSFT['Daily Return'].plot(ax=axes[1,0], legend=True, linestyle='--', marker='o')
axes[1,0].set_title('MICROSOFT')
AMZN['Daily Return'].plot(ax=axes[1,1], legend=True, linestyle='--', marker='o')
axes[1,1].set_title('AMAZON')
fig.tight_layout()

Think of our little program as first measuring how each stock moves day to day, then laying those movements out on a four-picture wall so we can compare them. The for loop is like repeating a simple recipe for every company in company_list: company['Daily Return'] = company['Adj Close'].pct_change() takes the adjusted closing prices and computes the percent change from one day to the next; percent change is simply the relative jump from one value to the next and is a standard way to express returns. That assignment stores a new column called 'Daily Return' in each company’s table so we can work with those percentage changes as a time series.
Next we build a 2-by-2 canvas with fig, axes = plt.subplots(nrows=2, ncols=2), which creates a figure and a grid of plotting panels, and then set the overall size with fig.set_figheight(10) and fig.set_figwidth(15) so the pictures are readable. For each named company (AAPL, GOOG, MSFT, AMZN) we draw its Daily Return onto the appropriate panel: AAPL['Daily Return'].plot(ax=axes[0,0], legend=True, linestyle='--', marker='o') tells pandas/matplotlib to plot that series on the top-left axes, show a legend, use a dashed line and mark points with circles. The axes[row,col].set_title calls simply label each subplot so we know which company we’re looking at. Finally fig.tight_layout() is like straightening the frames on the wall so titles and axes don’t overlap.
These plots give a quick visual sense of volatility and patterns in returns, which is exactly the kind of exploratory step you need before building forecasting models.
Great — let’s take a quick look at the average daily return with a histogram. A histogram is just a set of bars that shows how often different return values occur, so you can see if most days have small gains or losses or if a few days drive big moves. This step helps you spot outliers and the overall shape of the returns, which matters for choosing and testing forecasting models.
We’ll use seaborn to draw both a histogram and a KDE on the same figure. Seaborn is a friendly plotting library built on top of matplotlib that makes clean charts quickly. A KDE (kernel density estimate) is a smooth curve that approximates the distribution — like a smoothed histogram — so layering both gives you raw counts and the overall trend at once. That makes it easier to decide if the returns look normal, skewed, or heavy-tailed, which affects how you prepare data and pick algorithms.
plt.figure(figsize=(12, 9))
for i, company in enumerate(company_list, 1):
    plt.subplot(2, 2, i)
    company['Daily Return'].hist(bins=50)
    plt.xlabel('Daily Return')
    plt.ylabel('Counts')
    plt.title(f'{company_name[i - 1]}')
plt.tight_layout()

We want to make a small gallery that shows the distribution of daily returns for several companies so we can eyeball volatility and outliers before feeding features into our forecasting model. The first line sets up a canvas with a comfortable size so plots aren’t cramped — think of it as choosing a dinner table big enough to place four plates.
Next we start a loop that walks through each company one by one, like repeating a recipe step for every ingredient; using enumerate with a start of 1 gives us both the item and a 1-based index to place each plot correctly. Enumerate is a handy tool that pairs each element with its position in the list so you can use that position directly for layout.
Inside the loop we select a subplot position within a 2-by-2 grid, specifying the slot number so each company gets its own panel; a subplot is like assigning each dish to a specific plate on the table. Then we draw a histogram of the ‘Daily Return’ series with fifty bins, which aggregates returns into bars so you can see where values concentrate — histograms reveal the shape of the distribution and help spot skew and fat tails.
We label the x- and y-axes and set a title using the corresponding company name (note the index arithmetic to align the name list with the plot number), and after the loop we call a layout adjustment to tidy spacing so labels don’t overlap. Seeing these distributions grounds our feature choices and risk assumptions for the forecasting project.
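If you prefer the seaborn histogram-plus-KDE view mentioned above to plain bars, a hedged alternative (assuming seaborn 0.11 or newer, which provides histplot) is:

# Same gallery, but with a smooth KDE curve drawn over each histogram
plt.figure(figsize=(12, 9))
for i, company in enumerate(company_list, 1):
    plt.subplot(2, 2, i)
    sns.histplot(company['Daily Return'].dropna(), bins=50, kde=True)
    plt.title(company_name[i - 1])
plt.tight_layout()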
We checked how different stocks’ closing prices moved together — that is, their correlation, which is just a number that tells you if two series rise and fall together (close to +1), move opposite each other (close to -1), or show no clear linear link (around 0). In practice we usually compute correlation on returns (day-to-day percent changes) rather than raw prices, because returns remove overall trends and give a clearer picture of how stocks co-move.
Knowing these correlations helps with forecasting because very similar predictors can confuse a model — a problem called multicollinearity, which means multiple inputs give the same information and can make estimates unstable. A quick visual like a correlation matrix or heatmap makes clusters of closely related stocks easy to spot, so you can decide whether to drop one, combine them, or use techniques like regularization or PCA to simplify inputs.
So the correlation check is a simple diagnostic step that prepares the data for better, more reliable machine-learning forecasts. It tells you which stocks carry similar signals and which add new information to the model.
Correlation is a statistic that shows how much two things move together. Its value runs from -1.0 (they move exactly opposite) to +1.0 (they move exactly the same way), with 0 meaning no clear linear relationship. Correlation measures association, but it doesn’t prove that one thing causes the other, and the link could be driven by a third factor. We look at correlation to spot which stocks tend to move together or opposite, which helps when choosing predictors for a forecast or managing risk.
If we want to compare many stocks, we usually work with *returns*, which are the percentage changes in price from one period to the next. Returns remove differences in scale between cheap and expensive stocks and make trends easier to compare. Using returns prepares the data for fair correlation calculations.
To analyze all the stocks, we’ll build a *DataFrame* — a smart table that holds each stock’s ‘Close’ prices in its own column. Putting every Close column together makes it easy to compute returns across all stocks and then a correlation matrix that summarizes how each pair moves together. This setup speeds up the calculations and keeps the data tidy for modeling.
# Grab all the closing prices for the tech stock list into one DataFrame
closing_df = pdr.get_data_yahoo(tech_list, start=start, end=end)['Adj Close']
# Make a new tech returns DataFrame
tech_rets = closing_df.pct_change()
tech_rets.head()

We want to collect the historical prices for a list of tech stocks and turn those prices into simple daily returns we can work with. The first line reaches out to Yahoo Finance via pdr.get_data_yahoo, passing the list of tickers and the start/end dates, and then selects the 'Adj Close' column so closing_df becomes a table where each column is a stock and each row is a date. Adjusted close is the closing price corrected for corporate actions like splits and dividends, so it gives a consistent series for true investor returns.
Next, tech_rets = closing_df.pct_change() turns those adjusted prices into percentage changes from one day to the next; percent change computes (today - yesterday) / yesterday and that single sentence captures the transformation into daily returns, which are often more stable for modeling than raw prices. Finally, tech_rets.head() peeks at the first few rows so you can eyeball the results and confirm the returns look sensible.
Together, these steps prepare the core numerical input — aligned, cleaned returns — for the forecasting pipeline, so we can feed consistent features into our machine learning models and start predicting future moves.
Now we can compare the daily percentage return of two stocks to see how correlated they are. Daily percentage return is just how much a stock’s price changed from one day to the next, shown as a percent. Being correlated means the two stocks tend to move together — up and down at the same time.
First, let’s compare a stock to itself. That should give a perfect correlation, so it’s a handy sanity check: if you don’t see a perfect match, something in the data or the code needs fixing. Starting here makes it easier to trust later comparisons between different stocks.
# Comparing Google to itself should show a perfectly linear relationship
sns.jointplot(x='GOOG', y='GOOG', data=tech_rets, kind='scatter', color='seagreen')

Here we ask the plotting library to compare Google to itself so we can visually confirm a simple truth: when you plot a series against itself you should see a perfect diagonal, which acts like a sanity check on our data. The function call comes from Seaborn (think of a function as a reusable recipe card) named jointplot, which draws a central scatter of x versus y plus the marginal distributions along the top and right — those marginals give you a quick sense of each variable’s spread. The x and y arguments point to the 'GOOG' column inside the DataFrame tech_rets, so every plotted point uses the same return value for both axes, and kind='scatter' means we place one marker per observation. Setting color='seagreen' just gives the markers a pleasant, consistent hue for visibility. Correlation is the key idea here: correlation measures how two series move together, and because each x equals its y counterpart you’ll get a perfectly linear relationship (points lying on a 45-degree line). If you saw anything else, it would flag data mismatches or indexing problems. This little visual check ties back to our larger forecasting work by confirming that our inputs are sane before we feed them into machine learning models to predict stock behavior.
# We'll use jointplot to compare the daily returns of Google and Microsoft
sns.jointplot(x='GOOG', y='MSFT', data=tech_rets, kind='scatter')

You start with a human-readable note that says we’ll use jointplot to compare Google and Microsoft daily returns; comments are like sticky notes on a recipe card, there to remind you and your classmates of the intention. The actual instruction calls a Seaborn recipe card, accessed by the alias sns, named jointplot — think of a function as a reusable recipe card that, when followed, produces a visualization for you. By passing x='GOOG' and y='MSFT' you tell the recipe which two ingredients to compare, and by giving data=tech_rets you point it at the pantry shelf where those ingredients (columns of a DataFrame) live.
The kind='scatter' argument says “prepare a scatter plot” so each point represents one trading day with Google’s return on the horizontal axis and Microsoft’s on the vertical; a key concept: correlation measures whether two variables move together, and the scatter makes that relationship visible — points clustered along a diagonal suggest co-movement, while a cloud suggests independence. jointplot also supplies the marginal distributions along the axes, like tasting each ingredient on its own to understand its spread and outliers. Altogether, this visualization helps you spot linear relationships, outliers, and distributional quirks that inform feature choices and model assumptions for the larger stock-forecasting project.
If two stocks are perfectly and positively correlated, it means they move together in the same direction at a constant rate. On a scatter plot of their daily returns — daily returns are the day-to-day percentage changes in price — the points will line up along a straight line, showing that linear relationship clearly.
Seaborn and pandas make it easy to repeat that comparison for every pair in our technology stock ticker list, so you don’t have to plot each pair by hand. The function sns.pairplot() automatically builds a grid of pairwise scatter plots (and simple histograms on the diagonal), letting you spot which stocks are tightly linked. This quick view helps you decide which relationships matter for forecasting or for balancing risk in your model.
# We can simply call pairplot on our DataFrame for an automatic visual analysis
# of all the comparisons
sns.pairplot(tech_rets, kind='reg')

Imagine you have a table where each column is the daily return of a different tech stock and you want a quick, visual tour of how each pair of stocks move together. The single commented line is a friendly note saying we can call a plotting function for an automatic visual analysis of all the comparisons, and the next line is the actual instruction to do that. A function call is like grabbing a reusable recipe card: you name the recipe (pairplot), hand it the ingredients (tech_rets, a DataFrame — a structured table of data), and tweak a setting (kind='reg') to change the flavor.
Pairwise comparison means looking at every possible two-column combination to see how they relate, and pairplot lays out a grid of small plots that pair each column with every other. With kind=’reg’ each scatter gets a fitted straight line so you can immediately see any linear trend; a regression line is a simple summary of the relationship that helps highlight correlation. Along the diagonal you typically get distributions that tell you how each return behaves on its own, while off-diagonal plots show whether two stocks move together, inversely, or independently.
Running this produces an exploratory visual map that helps you spot correlations, potential multicollinearity, or promising predictor relationships to feed into your forecasting models.
The chart above shows the daily-return relationships among all the stocks. A quick look reveals a notable correlation between Google and Amazon returns. It’s worth digging into that pair — understanding such links helps you choose features for models or spot stocks that move together, which matters for forecasting and managing risk.
Seaborn, a plotting library, makes this easy with sns.pairplot(), a single call that draws scatterplots for every pair of variables and simple histograms or density plots on the diagonal so you can see each stock’s distribution. That simplicity is great when you want a fast overview.
If you need more control, use sns.PairGrid(). It lets you pick exactly what plot goes on the diagonal, the upper triangle, and the lower triangle — the diagonal shows single-variable distributions, and the triangles show pairwise comparisons. That extra control is useful when you want regression lines on one side, density estimates on the other, or a cleaner figure for a presentation. Below is an example showing how to use PairGrid to get that polished result.
# Set up our figure by naming it return_fig, call PairGrid on the DataFrame
return_fig = sns.PairGrid(tech_rets.dropna())
# Using map_upper we can specify what the upper triangle will look like.
return_fig.map_upper(plt.scatter, color='purple')
# We can also define the lower triangle in the figure, including the plot type (kde)
# or the color map (cool_d)
return_fig.map_lower(sns.kdeplot, cmap='cool_d')
# Finally we'll define the diagonal as a series of histogram plots of the daily return
return_fig.map_diag(plt.hist, bins=30)

We want a little visual photo album that shows how each stock’s daily returns relate to every other stock, so the first line creates that album by assigning a PairGrid to return_fig using the tech_rets DataFrame with missing rows removed; dropna is like wiping wet spots off the photos so nothing distorts the picture. The PairGrid is a reusable layout (a recipe card for plots) that lays out pairwise plots in a matrix so you can inspect every two-series relationship at once.
Next, return_fig.map_upper(plt.scatter, color=’purple’) tells the album how to paint the upper triangle: scatter points in purple, like plotting stars that show raw point-by-point relationships between pairs. Then return_fig.map_lower(sns.kdeplot, cmap=’cool_d’) instructs the lower triangle to use a kernel density estimate, which is a smoothing technique that draws contour-like regions of concentration — imagine a topographic map showing where the returns cluster — and the cmap argument picks the color theme. Finally, return_fig.map_diag(plt.hist, bins=30) fills the diagonal with histograms of each series’ daily returns, and the bins=30 choice controls how finely those distributions are chopped up so you can see skew, spread, and outliers.
A key concept: pairwise plots combine pointwise relationships and marginal distributions to reveal correlations and distributional quirks that guide feature selection. Seeing these relationships helps you decide which signals and transformations will be useful for the forecasting models you’ll build.
# Set up our figure by naming it returns_fig, call PairGrid on the DataFrame
returns_fig = sns.PairGrid(closing_df)
# Using map_upper we can specify what the upper triangle will look like.
returns_fig.map_upper(plt.scatter, color='purple')
# We can also define the lower triangle in the figure, including the plot type (kde) or the color map (cool_d)
returns_fig.map_lower(sns.kdeplot, cmap='cool_d')
# Finally we'll define the diagonal as a series of histogram plots of the closing price
returns_fig.map_diag(plt.hist, bins=30)

We start by laying out the canvas: the first line creates returns_fig by calling a PairGrid on closing_df, which arranges a matrix of small plots so each column is compared pairwise with every other. Key concept: pairwise plots let you inspect relationships between every pair of variables at once, which is great for spotting correlations or odd patterns before modeling.
Next we paint the upper triangle of that matrix with scatter plots colored purple; a scatter is like plotting two lists of stock returns on X and Y axes to see whether points cluster along a line, hinting at linear relationships. The map_upper call assigns the same scatter recipe to every upper cell so the comparisons are consistent and easy to read.
Then we fill the lower triangle with KDE plots using a cool color map: a KDE is a smoothed estimate of where data points concentrate (key concept: kernel density estimation smooths discrete points into a continuous density curve so you can see the shape of the joint distribution). By mapping kdeplot to the lower cells you get contour-like views of density that complement the point clouds above.
Finally we define the diagonal as histograms with 30 bins, where each diagonal panel shows the distribution of a single series; histograms are like counting how many returns fall into each bucket, giving a sense of skew, spread, and outliers. Altogether, this visual table helps you discover which return relationships and distributions matter most as you design features and choose models for forecasting stock prices.
Finally, you can make a correlation plot to get actual numbers for how the stocks’ daily return values move together. A correlation plot is just a chart that gives a number showing how closely two things trend in sync, and a daily return is the percent change from one day’s closing price to the next. This is useful because numbers cut through the guesses — they tell you which stocks tend to rise and fall together, which helps when you build or test a forecasting model.
When we compare the closing prices — the price at the end of each trading day — Microsoft and Apple show an interesting relationship. That usually means they share similar trends or react to the same market forces, which matters for predictions and for avoiding models that get fooled by linked movements. Noticing this early helps you pick better inputs for forecasting and manage risk more thoughtfully.
plt.figure(figsize=(12, 10))
plt.subplot(2, 2, 1)
sns.heatmap(tech_rets.corr(), annot=True, cmap='summer')
plt.title('Correlation of stock return')
plt.subplot(2, 2, 2)
sns.heatmap(closing_df.corr(), annot=True, cmap='summer')
plt.title('Correlation of stock closing price')

Imagine we’re laying out a canvas for a bit of visual detective work: the first line creates that canvas and sets its physical size so the pictures will be large and readable (figsize controls width and height in inches). Calling subplot is like placing a framed photo into a 2×2 album slot; subplot(2, 2, 1) picks the top-left frame where the first image will go. The heatmap call paints a colored map of the pairwise correlations of tech_rets: correlation is a one-sentence key concept meaning how strongly two variables move together in a straight-line relationship. Annot=True writes the numeric correlation inside each square so you can read exact values, and cmap='summer' chooses a pleasant color palette so strong and weak relationships are easy to spot at a glance. The title labels this frame as the correlation of stock returns so your audience knows what they’re looking at.
As we saw in the *PairPlot* — a chart that shows relationships between many pairs of variables at once — Microsoft and Amazon have the strongest correlation in their daily stock returns, meaning their day-to-day percent changes in price tend to move together. Seeing this both visually and in the numbers gives us confidence the plot matched our hunch. This matters for modeling because when two stocks carry the same signal, a model can end up counting the same information twice unless we adjust our features.
It’s also notable that all the technology companies are positively correlated, which means when one goes up, the others tend to go up too, and vice versa. Knowing this helps with forecasting and risk management: grouped moves can amplify gains or losses, so we should be careful about redundancy in inputs and about how diversified our predictions really are.
When we ask, “How much value do we put at risk by investing in a particular stock?” we mean how much money we might lose if that stock falls. That risk depends on how much of the stock you buy and how wildly its price swings — volatility is just a word for those swings. Measuring this helps you choose how big a position to take and how to protect yourself, like using limits on losses or diversifying into other investments.
You can express risk in simple ways, like a likely percent loss over a week, or in a formal way such as Value at Risk (VaR), which estimates the maximum loss you might expect over a set time with a given probability (that probability is just how confident you want to be). Forecasting models feed predicted price moves into these calculations so you get numbers instead of guesses. Remember, no measure is a guarantee; these tools simply make the uncertainty clearer so you can make better decisions.
There are many ways to measure risk, but a simple place to start is with the daily percentage returns we’ve collected. Daily percentage returns are just the percent change in price from one day to the next. We compare the expected return— the average of those daily percent changes — with the standard deviation — how much those daily returns swing around that average (that swing is what people mean by volatility).
Putting these two numbers side by side gives a basic sense of risk: a high average with small swings is healthier than the same average with wild swings. This quick check is useful before you build fancy models because it helps you balance reward versus volatility and decide which features or stocks deserve closer attention.
rets = tech_rets.dropna()
area = np.pi * 20
plt.figure(figsize=(10, 8))
plt.scatter(rets.mean(), rets.std(), s=area)
plt.xlabel('Expected return')
plt.ylabel('Risk')
for label, x, y in zip(rets.columns, rets.mean(), rets.std()):
    plt.annotate(label, xy=(x, y), xytext=(50, 50), textcoords='offset points', ha='right', va='bottom',
                 arrowprops=dict(arrowstyle='-', color='blue', connectionstyle='arc3,rad=-0.3'))

We want to make a simple picture that shows each technology stock’s typical reward and how choppy its movements are, so you can quickly see the risk/return trade-off. First we clean the data with rets = tech_rets.dropna(): dropna removes missing values so our averages and spreads aren’t skewed by gaps — a tiny, important data-cleaning step. Then area = np.pi * 20 computes a constant for the marker size so all dots share a pleasant visual scale; think of it as choosing the size of stickers to mark each stock on the chart.
plt.figure(figsize=(10, 8)) opens a drawing canvas of a comfortable size. The scatter call plots each stock’s mean return on the x-axis and its standard deviation on the y-axis; mean is the average expected return and standard deviation is volatility or risk — key concept: standard deviation measures how spread out returns are, so higher means more unpredictable swings. The s=area argument tells matplotlib how big to draw each dot, and the xlabel/ylabel lines name the axes so the picture tells a clear story.
Next we loop like repeating a recipe step for every stock: for label, x, y in zip(rets.columns, rets.mean(), rets.std()): uses zip to pair each stock name with its mean and std — zip aligns multiple sequences element-wise. Inside the loop, plt.annotate writes the stock label near its dot, offset by (50,50) points and connected back with a small blue curved arrow, with ha and va controlling text alignment so labels sit neatly.
That plotted view gives an intuitive map of which stocks are high-return or high-risk, a useful visual check as you build and evaluate forecasting models.
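One simple way to put a number on the “value at risk” question raised earlier is the empirical quantile of those same daily returns (the historical-simulation approach); a minimal sketch using the rets DataFrame defined above:

# Empirical one-day VaR at 95% confidence: the 5th percentile of historical daily returns
var_95 = rets['AAPL'].quantile(0.05)
print(f"With 95% confidence, the worst expected one-day loss is about {abs(var_95):.2%} of the position")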
We are predicting the *closing price* of Apple Inc. — the price of its stock at the end of a trading day. The closing price matters because traders and analysts use it as a stable reference point for valuing the stock. Saying this upfront helps you focus on one clear target for the model.
We’ll use machine learning to make those predictions. Machine learning means we teach a computer to spot patterns in past prices and other data so it can guess future prices. This step prepares us for testing and refining models, since a good prediction pipeline makes comparisons fair and repeatable.
Along the way we’ll pick which inputs to use, like past closing prices, volume, or simple indicators. Those inputs are important because better data usually leads to better predictions. Finally, we’ll evaluate the model on unseen days to see how well it would have worked in the real world, which helps decide if it’s useful for planning trades or research.
# Get the stock quote
df = pdr.get_data_yahoo('AAPL', start='2012-01-01', end=datetime.now())
# Show the data
df

We’re trying to fetch historical Apple stock prices so we have the raw ingredients for a forecasting model. The first line asks a recipe card (a function) named pdr.get_data_yahoo to go out to Yahoo Finance and bring back every daily record for the ticker 'AAPL' between January 1, 2012 and the present; datetime.now() simply provides today’s date so the range is up to the moment. That call returns a DataFrame and we stash it in df — a DataFrame is like a spreadsheet where each row is a date and each column holds a different measurement such as Open, High, Low, Close, Volume, and Adjusted Close. Think of the network request as shopping for ingredients: the function fetches OHLCV data and delivers a neat table. The second line simply places that table on the workbench so you can look it over; in an interactive environment like a notebook, writing df displays the full table or a nicely formatted preview so you can inspect dates, missing values, and column names. Seeing the raw table is an important early check before you start cleaning, feature engineering, and feeding the series into your machine learning forecast.
plt.figure(figsize=(16,6))
plt.title('Close Price History')
plt.plot(df['Close'])
plt.xlabel('Date', fontsize=18)
plt.ylabel('Close Price USD ($)', fontsize=18)
plt.show()

We’re trying to draw a clear picture of how the stock’s closing price has moved over time so we can spot trends before we build forecasting models. The first line, plt.figure(figsize=(16,6)), creates a plotting canvas and sets its size so the line won’t be cramped; a figure is the canvas on which all plot elements are drawn. Next, plt.title('Close Price History') writes a headline across the top so anyone glancing at the chart immediately knows what they’re looking at, like a caption on a photograph. Then plt.plot(df['Close']) traces the closing prices as a continuous line across the chart, using the DataFrame’s index (typically dates) for the horizontal axis; selecting df['Close'] pulls out that column as a pandas Series so the plotting function has the y-values to draw. After that, plt.xlabel('Date', fontsize=18) and plt.ylabel('Close Price USD ($)', fontsize=18) label the horizontal and vertical axes with larger text so the axes are self-explanatory, much like adding units to a graph. Finally, plt.show() renders and displays the finished image in your notebook or window so you can inspect it; show is the command that tells the plotting library to present the assembled figure. Seeing the closing price history this way helps you spot trends, volatility, and anomalies that will guide feature choices and model selection in the forecasting project.
# Create a new dataframe with only the 'Close' column
data = df.filter(['Close'])
# Convert the dataframe to a numpy array
dataset = data.values
# Get the number of rows to train the model on
training_data_len = int(np.ceil( len(dataset) * .95 ))
training_data_len

We’re getting the dataset ready so the model can learn how closing prices move over time. The first line selects only the 'Close' column from the larger table of stock information, like taking just the price column off a full spreadsheet because our recipe only needs that ingredient. The next line turns that pared-down table into a plain numpy array — think of converting neatly labeled jars into a simple row of numbers the oven can work with. Key concept: numpy arrays are homogeneous, efficient containers for numerical computation, which most machine learning tools expect.
Then we compute how many rows will be used for teaching the model by multiplying the total number of rows by 0.95 and applying a ceiling, with int() making sure we have a whole-number count; this is like deciding that 95% of your practice problems will be used for study and rounding up to have a complete practice set. Key concept: separating data into training and testing sets prevents overfitting and helps evaluate model performance.
Finally, the bare name at the end prints or returns that training set length so you can confirm how many examples the model will learn from, analogous to checking the size of your prepared practice set before you begin. All together, these steps prepare the raw closing-price series and define the slice that the forecasting model will train on.
# Scale the data
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0,1))
scaled_data = scaler.fit_transform(dataset)
scaled_data

We want the numbers describing past prices to play nicely with our learning algorithm, so the first line tells us we’re going to scale the data — think of it as resizing all ingredients so they fit a standard measuring cup. The import line pulls in MinMaxScaler from scikit-learn, a ready-made tool that will rescale each feature into a desired interval. Creating scaler = MinMaxScaler(feature_range=(0,1)) constructs the scaler object and sets the recipe: every feature will be squeezed into the range 0 to 1; that single sentence captures an important concept, feature scaling, which makes different-sized inputs comparable and helps optimization algorithms converge faster. When we call scaled_data = scaler.fit_transform(dataset) we’re doing two steps in one: fit measures the minimums and maximums in the dataset (like noting the smallest and largest ingredient amounts), and transform applies the linear rescaling so each value is mapped into 0–1. The result is a NumPy array of normalized values. Finally, simply writing scaled_data at the end will display those normalized numbers in an interactive environment, and we can later use scaler.inverse_transform to turn model outputs back into real price units. Normalizing the inputs this way is a small but essential step before feeding data to your forecasting model so training is stable and meaningful.
# Create the training data set
# Create the scaled training data set
train_data = scaled_data[0:int(training_data_len), :]
# Split the data into x_train and y_train data sets
x_train = []
y_train = []
for i in range(60, len(train_data)):
    x_train.append(train_data[i-60:i, 0])
    y_train.append(train_data[i, 0])
    if i <= 61:
        print(x_train)
        print(y_train)
        print()
# Convert the x_train and y_train to numpy arrays
x_train, y_train = np.array(x_train), np.array(y_train)
# Reshape the data
x_train = np.reshape(x_train, (x_train.shape[0], x_train.shape[1], 1))
# x_train.shape
We want to turn a long series of scaled prices into a set of short stories the model can learn from: each story uses the previous 60 days to predict the next day. The line that slices scaled_data into train_data simply picks the first chunk of the scaled series up to training_data_len so we only train on historical examples. Creating x_train and y_train as empty lists is like laying out two empty recipe boxes: one for the inputs and one for the labels.
The for loop that runs from 60 to the end repeats a short recipe step for each day: take the 60-day window immediately before day i and add it as an input, then take day i’s value and add it as the label; a sliding-window is a way to turn a time series into many supervised examples. Printing when i <= 61 is a quick peek at the first one or two recipes to confirm their shape — simple debugging to ensure the windows look right.
Converting the lists to numpy arrays gives the model a fast, numerical representation it can work with. The reshape rearranges the inputs into (samples, timesteps, features), the 3D shape recurrent networks expect; key concept: LSTM-style models need a three-dimensional input so they can see sequence length and feature channels. With these sequences prepared, they’re now ready to be fed into the forecasting model that will learn to predict stock prices from recent history.
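As a rough shape check, you can print the arrays; the values shown in the comments assume the hypothetical 1,903 training rows from earlier, where each sample is a 60-day window:
print(x_train.shape)  # e.g. (1843, 60, 1) -> (samples, timesteps, features), since 1903 - 60 = 1843
print(y_train.shape)  # e.g. (1843,) -> one target price per window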
from keras.models import Sequential
from keras.layers import Dense, LSTM
# Build the LSTM model
model = Sequential()
model.add(LSTM(128, return_sequences=True, input_shape=(x_train.shape[1], 1)))
model.add(LSTM(64, return_sequences=False))
model.add(Dense(25))
model.add(Dense(1))
# Compile the model
model.compile(optimizer='adam', loss='mean_squared_error')
# Train the model
model.fit(x_train, y_train, batch_size=1, epochs=1)
We're building a small neural network that can learn patterns in past prices and predict the next value. The first two lines bring in Sequential, which is a simple way to stack layers like laying out recipe cards in order, and the Dense and LSTM layer types that will do the heavy lifting.
Creating model = Sequential() starts us with an empty stack to which we add layers. The first model.add(LSTM(128, return_sequences=True, input_shape=(x_train.shape[1], 1))) adds an LSTM layer with 128 memory units — imagine a little librarian that remembers recent events; return_sequences=True means it hands over its memory at every time step, not just the final summary (a key concept: return_sequences=True produces an output for each time step). input_shape=(x_train.shape[1], 1) tells the layer how many time steps and features each sequence has, so it knows the shape of the incoming data. The next model.add(LSTM(64, return_sequences=False)) stacks another LSTM with 64 units that summarizes the sequence into a single state to pass onward.
model.add(Dense(25)) is a dense layer that mixes those features like combining ingredients, and model.add(Dense(1)) produces the final single-number prediction — the forecasted price. model.compile(optimizer=’adam’, loss=’mean_squared_error’) chooses Adam to update weights efficiently and mean squared error to measure how far predictions are from targets (MSE penalizes larger errors more). Finally, model.fit(x_train, y_train, batch_size=1, epochs=1) trains the network by showing it the data one sample at a time (batch_size=1) for one full pass (epoch).
Together, these steps form a temporal learner that models price dynamics; in production you’d usually tune sizes and train longer to improve forecasts.
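For example, a more typical training setup might look like the sketch below; the batch size, epoch count, and validation split are illustrative values, not tuned for this data.
model.summary()  # inspect layer shapes and parameter counts before training
model.fit(x_train, y_train, batch_size=32, epochs=20, validation_split=0.1)  # illustrative settings only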
# Create the testing data set
# Create a new array containing the scaled values from 60 days before the end of the training set onward
test_data = scaled_data[training_data_len - 60:, :]
# Create the data sets x_test and y_test
x_test = []
y_test = dataset[training_data_len:, :]
for i in range(60, len(test_data)):
    x_test.append(test_data[i-60:i, 0])
# Convert the data to a numpy array
x_test = np.array(x_test)
# Reshape the data
x_test = np.reshape(x_test, (x_test.shape[0], x_test.shape[1], 1))
# Get the model's predicted price values
predictions = model.predict(x_test)
predictions = scaler.inverse_transform(predictions)
# Get the root mean squared error (RMSE)
rmse = np.sqrt(np.mean(((predictions - y_test) ** 2)))
rmse
We want to build a test set and see how well our trained model forecasts prices, so first we grab the scaled data starting a bit before the test period: taking scaled_data from training_data_len - 60 onward ensures each test example has the 60 prior days it needs. y_test is set to the original (unscaled) dataset from training_data_len to the end so we have the real prices to compare against.
x_test starts empty and the for loop walks from 60 to the length of test_data, appending slices of 60 rows at a time; think of the loop as repeating a recipe step to assemble many 60-day ingredient lists, where each appended slice is a sliding window of the most recent 60 values we feed the model. Converting x_test to a numpy array packs those windows into a single efficient structure, and reshaping to (samples, timesteps, 1) gives the model the 3D input it expects — key concept: LSTM layers want input shaped as [samples, timesteps, features], so we add the feature dimension of 1.
We call model.predict on x_test to produce scaled predictions, then use scaler.inverse_transform to convert those predictions back into dollar units the same way we originally scaled prices. Finally we compute rmse by taking the square root of the mean squared differences between predictions and y_test; key concept: RMSE is a single-number measure of average prediction error in the same units as the target, so it tells you how far off forecasts are on average.
This whole block is the test-time evaluation that tells you how well the forecasting model performs and guides the next improvements in the larger stock-price forecasting project.
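If you prefer a library call, the same RMSE can be computed with scikit-learn; a minimal equivalent, assuming predictions and y_test line up row for row:
from sklearn.metrics import mean_squared_error
rmse = np.sqrt(mean_squared_error(y_test, predictions))  # same value as the manual formula above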
# Plot the data
train = data[:training_data_len]
valid = data[training_data_len:].copy()  # copy so adding a column doesn't trigger a pandas SettingWithCopyWarning
valid['Predictions'] = predictions
# Visualize the data
plt.figure(figsize=(16,6))
plt.title('Model')
plt.xlabel('Date', fontsize=18)
plt.ylabel('Close Price USD ($)', fontsize=18)
plt.plot(train['Close'])
plt.plot(valid[['Close', 'Predictions']])
plt.legend(['Train', 'Val', 'Predictions'], loc='lower right')
plt.show()
Think of our goal as laying out a before-and-after picture so we can judge how well the model learned to follow the market's footsteps. The line that takes the first slice of rows and calls it train is like cutting the first part of a loaf to practice recipes on; here we're reserving historical examples the model saw during training. Splitting off the remainder into valid is like saving the rest of the loaf for a taste test later; a train/validation split is a key concept used to evaluate how well a model generalizes to unseen data. When we attach predictions to the valid set by creating a new Predictions column, it's like pinning forecast sticky notes to the calendar so each predicted price lines up with its actual date — aligning predictions with real outcomes is essential for meaningful comparison. Creating a figure with a specific size is preparing a canvas so our lines will be clearly visible, and setting the title and axis labels is like naming the painting and labeling its axes so viewers know what they're seeing. Plotting train['Close'] draws the historical prices the model learned from, while plotting valid[['Close', 'Predictions']] overlays the actual future prices and the model's guesses like two paths on the same map, letting us spot divergences. Adding a legend provides a key so we can tell which line is which, and show() finally hangs the picture on the gallery wall. Seeing these plots helps decide how to refine the forecasting approach.
# Show the valid and predicted prices
valid
We're trying to pause and hold up the model's report so we can inspect how well our predictions matched reality. The line beginning with # is a sticky note for humans — it's ignored by the interpreter and simply tells anyone reading the file that the next thing should show the validation results. The lone name valid that follows acts like holding a finished page up to the class: in an interactive environment (like a notebook) evaluating that name prints whatever object it refers to, such as a table with actual prices and predicted prices side by side.
The object valid likely contains the validation set — actual stock prices and the model’s predicted prices — which is your quick way to eyeball performance; a validation set is data you set aside to evaluate how well the model generalizes to new examples, not used to train it, which is a key concept in honest model assessment. Seeing valid lets you scan rows, check discrepancies, and decide if you need further tuning or different features. If you were running a script instead of an interactive session, you’d explicitly print or log valid to achieve the same effect.
By holding up this comparison, you connect the model’s abstract numbers to real market movements, letting you judge whether the forecasting approach is genuinely learning stock behavior or just memorizing noise.
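In a plain script, where a bare name prints nothing, an equivalent peek might look like this small sketch (the column names follow the valid DataFrame built above):
print(valid.tail(10))  # last ten rows of actual vs. predicted closing prices
print('Mean absolute error:', (valid['Close'] - valid['Predictions']).abs().mean())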
In this notebook you explored stock data to get ready for forecasting prices with machine learning. That groundwork helps your model understand what real market data looks like before you try to predict anything.
You loaded stock prices from Yahoo Finance using *yfinance*, a small tool that grabs historical market data for you. Fetching clean, reliable data first saves time later and makes your forecasts possible.
You looked at time-series data, which just means numbers arranged by date, using *Pandas* (a smart table tool), *Matplotlib* (basic plotting), and *Seaborn* (nicer statistical plots). Visualizing trends and outliers helps you choose the right features and spot problems before modeling.
You measured correlation, which tells you how two stocks move together — up, down, or independently. Knowing correlations helps when you pick inputs for models or when you care about diversifying investments.
You measured risk, meaning how wildly a stock’s price bounces around (volatility). Understanding risk helps set expectations for model performance and decide which stocks are suitable for forecasting or trading.
Download source code using link below!