techTreksBooks

Linear regression

Linear regression is one of the simplest predictive models in machine learning, yet it remains surprisingly powerful. At its core, it attempts to capture a relationship between an input variable and a numerical output. If you can draw a reasonably straight line through a cloud of data points, linear regression will happily oblige by finding the line that “best fits” the trend.

Syllabus alignment

This lesson supports the NSW Software Engineering Stage 6 syllabus.

In machine learning, linear regression plays a foundational role. Before we venture into neural networks, random forests or anything that sparkles with complexity, we must first understand how models learn relationships from data. Linear regression gives us a clean, interpretable example: a model with parameters (slope and intercept) that we can calculate directly, inspect openly, and evaluate without mystery.
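To make "calculate directly" concrete, here is a minimal sketch of the closed-form least-squares calculation in Python with NumPy. The floor-area and price figures are invented for illustration (they are not drawn from the interactive's Sydney datasets):

```python
import numpy as np

# Hypothetical example data: floor area (m^2) vs. price ($'000s).
x = np.array([50.0, 70.0, 90.0, 110.0, 130.0])
y = np.array([420.0, 510.0, 590.0, 700.0, 780.0])

# Closed-form least-squares estimates:
#   slope m = cov(x, y) / var(x),  intercept b = mean(y) - m * mean(x)
m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - m * x.mean()


def predict(x_new):
    """Map an input value onto the fitted regression line."""
    return m * x_new + b
```

Because the two parameters are computed in a single closed-form step, they can be printed and inspected openly, which is exactly the transparency the lesson highlights.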

Once fitted, the model can make predictions for new data points by mapping an input value onto the regression line. Of course, no model is perfect. Real-world data is messy and stubborn, and the difference between the predicted and actual values—the residual—is where accuracy is ultimately judged. Measures such as Mean Squared Error (MSE) and R² help quantify how well the model explains the variation in the data.
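As a rough illustration of residuals, MSE, and R², the sketch below scores an assumed fitted line against some observations. The slope, intercept, and data values are all invented for the example:

```python
import numpy as np

# Hypothetical fitted line (slope 4.5, intercept 195) and actual observations.
x = np.array([50.0, 70.0, 90.0, 110.0, 130.0])
actual = np.array([430.0, 500.0, 600.0, 690.0, 790.0])
predicted = 4.5 * x + 195.0

residuals = actual - predicted                    # prediction errors
mse = np.mean(residuals ** 2)                     # Mean Squared Error
ss_res = np.sum(residuals ** 2)                   # residual sum of squares
ss_tot = np.sum((actual - actual.mean()) ** 2)    # total sum of squares
r2 = 1 - ss_res / ss_tot                          # share of variation explained
```

Note how squaring the residuals makes MSE punish large misses disproportionately, while R² rescales the same error information into a 0-to-1 "variation explained" score.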

Despite its simplicity, linear regression teaches the essential vocabulary and logic of machine learning: patterns, error, optimisation, and model performance.

How linear regression works

The concepts above are best understood by seeing how a linear regression model behaves with real data rather than merely reading about it. The interactive below guides you through the modelling process step by step, using two contrasting housing datasets drawn from different parts of Sydney. As you scroll, you’ll explore how a scatter plot suggests a possible relationship, how the regression line is fitted, and how predictions are made from it.

You will then examine the model’s accuracy by analysing residuals, observing where the line succeeds and where it fails. Finally, you will compare two suburbs with very different data patterns to understand why some models perform well while others struggle.

Take your time with each step. The questions embedded throughout the interactive are designed to test your understanding as you go, not catch you out. By the end, you should have a clear sense not only of what linear regression does, but also why it matters in machine learning.

Key ideas

  • A linear regression model identifies a trend between numerical variables and represents it as a best-fit line.
  • Predictions are generated by projecting input values onto this line, giving an estimated output.
  • Residuals (prediction errors) reveal how closely the model follows reality; large residuals indicate poor fit.
  • MSE measures average squared error and is sensitive to large mistakes; lower values mean better accuracy.
  • R² expresses how much of the variation in the data the model explains; higher values indicate a stronger relationship.
  • Comparing different datasets or suburbs (as in the interactive) shows how context and variance influence model quality.
  • Linear regression forms the conceptual stepping-stone to more sophisticated machine-learning techniques.
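The last two ideas can be sketched numerically: fitting the same model to two invented "suburb" datasets that share a trend but differ in spread shows how variance drives R². Both datasets and the helper `fit_and_r2` are hypothetical, not taken from the interactive:

```python
import numpy as np

def fit_and_r2(x, y):
    """Least-squares fit, returning the R² score for that dataset."""
    m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b = y.mean() - m * x.mean()
    residuals = y - (m * x + b)
    return 1 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)

x = np.array([60.0, 80.0, 100.0, 120.0, 140.0])

# "Suburb A": prices hug the trend closely, so R² is high.
y_a = np.array([500.0, 610.0, 695.0, 805.0, 900.0])
# "Suburb B": same overall trend, but much noisier prices, so R² is lower.
y_b = np.array([560.0, 540.0, 760.0, 700.0, 940.0])

print(fit_and_r2(x, y_a), fit_and_r2(x, y_b))
```

The same fitting procedure produces a strong model in one context and a weak one in the other; the difference lies in the data, not the algorithm.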

Practice questions

Question 1

In one sentence, what is the main goal of linear regression in machine learning?

2 marks
Question 2

In the housing price interactive, identify one reasonable target variable and one feature that could be used to train a linear regression model.

2 marks
Question 3

Look at Practice chart A (scatter plot with regression line). What does each dot represent, and what does the straight line represent?

2 marks
Question 4

Using Practice chart B (residual plot), explain what a residual is and how large residuals appear on this chart.

2 marks
Question 5

Why is a model that perfectly passes through every training point not always desirable, even if its training error is almost zero?

2 marks
Question 6

Imagine you train two linear regression models on the same housing dataset. Model A has a higher mean squared error (MSE) but its residuals are evenly scattered around zero across all feature values. Model B has a slightly lower MSE, but its residuals are mostly positive for small houses and mostly negative for large houses. Which model would you trust more in practice, and why?

3 marks
Total: 13 marks