K-nearest neighbour classification
Syllabus alignment
This lesson supports the NSW Software Engineering Stage 6 syllabus:
- Software automation / Algorithms in machine learning
- Explore models of training ML, including supervised learning.
- Describe types of algorithms associated with ML, including k-nearest neighbour.
K-nearest neighbour
The k-nearest neighbour (kNN) algorithm is one of the simplest classification models in machine learning. Unlike regression, which predicts a continuous value, kNN decides which category a new data point belongs to by looking at the data around it.
This chapter introduces kNN through an interactive tool that lets you explore how the number of neighbours (k), feature selection and dataset geometry influence classification boundaries.
What kNN does
kNN is an instance-based learner. Instead of building an explicit mathematical model, it stores the dataset and makes decisions by comparing new points with known examples.
To classify a new point, the algorithm:
- Measures the distance between the new point and all points in the dataset
- Selects the k closest points
- Uses the majority class of those neighbours as the prediction
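As a concrete illustration, here is a minimal sketch of those three steps in plain Python. The dataset, feature values and labels are invented for the example.

```python
from collections import Counter
import math

def knn_classify(new_point, dataset, k=3):
    """Classify new_point by majority vote among its k nearest neighbours.

    dataset is a list of (features, label) pairs; features are tuples of numbers.
    """
    # 1. Measure the distance from the new point to every stored point
    distances = [(math.dist(new_point, features), label)
                 for features, label in dataset]

    # 2. Select the k closest points
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]

    # 3. Use the majority class of those neighbours as the prediction
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical two-feature dataset: (height_cm, mass_kg) -> species label
training_data = [
    ((20, 1.2), "cat"), ((22, 1.5), "cat"), ((25, 2.0), "cat"),
    ((40, 8.0), "dog"), ((45, 9.5), "dog"), ((50, 11.0), "dog"),
]

print(knn_classify((23, 1.8), training_data, k=3))  # expected: "cat"
```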
Why this matters
kNN classification supports the Software Automation outcomes by showing how ML systems make decisions using patterns in datasets. It exposes the limits of simple classifiers and highlights the role of data quality, dimensionality and bias.
Connections to the Software Engineering syllabus include:
- Testing and evaluating ML solutions
- Understanding how algorithms support efficient and accurate software
- Investigating human and dataset-source bias in ML systems
- Applying OOP principles to implement ML program structures
The interactive below unpacks the key concepts with examples.
Key concepts
Instance-based learning
kNN is a lazy learner. It does not build a model during training. Instead, it stores all training examples and defers computation until a prediction is requested. This makes training instant but prediction slower, especially with large datasets.
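Framed as a class with fit and predict methods (linking back to the OOP connection above), the lazy-learning idea becomes visible in the code: fit does no computation at all, while predict does all the work. This is a sketch only; the class name and interface are chosen for illustration.

```python
import math
from collections import Counter

class KNNClassifier:
    """Minimal lazy learner: 'training' only stores the examples."""

    def __init__(self, k=3):
        self.k = k
        self.points = []
        self.labels = []

    def fit(self, points, labels):
        # No model is built here; the data is simply remembered.
        self.points = list(points)
        self.labels = list(labels)

    def predict(self, new_point):
        # All the real work is deferred to prediction time.
        distances = [(math.dist(new_point, p), label)
                     for p, label in zip(self.points, self.labels)]
        nearest = sorted(distances, key=lambda pair: pair[0])[:self.k]
        return Counter(label for _, label in nearest).most_common(1)[0][0]
```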
The role of k
The parameter k controls how many neighbours vote on the classification:
- Small k (e.g., k = 1): Highly sensitive to noise and outliers; produces complex, irregular boundaries
- Moderate k: Balances local patterns with stability
- Large k: Produces smooth boundaries but may ignore local structure and misclassify edge cases
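One way to see this trade-off is a small experiment with scikit-learn (assumed installed) on a noisy synthetic dataset. The dataset, split and k values below are arbitrary choices for illustration; typically k = 1 scores lower on unseen data because it fits the noise, while very large k can also drop as the boundary becomes too smooth.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Noisy, non-linear synthetic dataset
X, y = make_moons(n_samples=400, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for k in (1, 5, 25, 101):
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)
    print(f"k={k:>3}  test accuracy = {model.score(X_test, y_test):.2f}")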
Distance metrics
kNN relies on measuring similarity between points. Common metrics include:
- Euclidean distance: Straight-line distance; sensitive to scale differences between features
- Manhattan distance: Sum of absolute differences; follows grid-like paths
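A minimal sketch of both metrics, using made-up points, shows how the two measures differ on the same pair of points.

```python
import math

def euclidean(a, b):
    # Straight-line distance: square root of the summed squared differences
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # Grid-like distance: sum of absolute differences along each axis
    return sum(abs(x - y) for x, y in zip(a, b))

p, q = (1, 2), (4, 6)
print(euclidean(p, q))  # 5.0
print(manhattan(p, q))  # 7
```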
Feature normalisation is critical. Without it, features with larger numeric ranges dominate the distance calculation and distort predictions.
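A common fix is min-max scaling, which rescales each feature to the range [0, 1]. The feature names and values below are hypothetical, chosen to show two features on very different scales.

```python
def min_max_normalise(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical features: without rescaling, income (tens of thousands)
# would dominate age (tens) in any distance calculation.
ages = [18, 25, 40, 60]
incomes = [20_000, 35_000, 80_000, 120_000]

print(min_max_normalise(ages))     # approximately [0.0, 0.17, 0.52, 1.0]
print(min_max_normalise(incomes))  # [0.0, 0.15, 0.6, 1.0]
```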
Decision boundaries
Unlike logistic regression, which fits a single linear boundary, kNN adapts to the local structure of the data. This allows it to model non-linear, irregularly shaped class regions. However, this flexibility also makes it vulnerable to overfitting when k is too small.
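A rough comparison (again assuming scikit-learn, with a synthetic dataset whose classes form interleaved crescents) illustrates the point: a linear classifier cannot separate the two regions well, while kNN adapts to their shape.

```python
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=400, noise=0.2, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

linear = LogisticRegression().fit(X_train, y_train)
knn = KNeighborsClassifier(n_neighbors=7).fit(X_train, y_train)

print("logistic regression:", round(linear.score(X_test, y_test), 2))
print("kNN (k=7):          ", round(knn.score(X_test, y_test), 2))
```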
Computational cost
kNN has no training phase, but prediction requires calculating distances to all training points, which becomes expensive with large datasets. In high-dimensional feature spaces, distances also become less meaningful, a problem known as the curse of dimensionality.