K-nearest neighbour classification
Syllabus alignment
This lesson supports the NSW Software Engineering Stage 6 syllabus:
- Software automation / Algorithms in machine learning
- Explore models of training ML, including supervised learning.
- Describe types of algorithms associated with ML, including k-nearest neighbour.
K-nearest neighbour
The k-nearest neighbour (kNN) algorithm is one of the simplest classification models in machine learning. Unlike regression, which predicts a continuous value, kNN decides which category a new data point belongs to by looking at the data around it.
This chapter introduces kNN through an interactive tool that lets you explore how the number of neighbours (k), feature selection and dataset geometry influence classification boundaries.
What kNN does
kNN is an instance-based learner. Instead of building an explicit mathematical model, it stores the dataset and makes decisions by comparing new points with known examples.
To classify a new point, the algorithm:
- Measures the distance between the new point and all points in the dataset
- Selects the k closest points
- Uses the majority class of those neighbours as the prediction
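As a concrete illustration, here is a minimal sketch of those three steps in plain Python. The dataset, feature values and labels are invented for the example.

```python
from collections import Counter
import math

def knn_classify(new_point, dataset, k=3):
    """Classify new_point by majority vote among its k nearest neighbours.

    dataset is a list of (features, label) pairs; features are tuples of numbers.
    """
    # 1. Measure the distance from the new point to every stored point
    distances = [(math.dist(new_point, features), label)
                 for features, label in dataset]

    # 2. Select the k closest points
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]

    # 3. Use the majority class of those neighbours as the prediction
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical two-feature dataset: (height_cm, mass_kg) -> species label
training_data = [
    ((20, 1.2), "cat"), ((22, 1.5), "cat"), ((25, 2.0), "cat"),
    ((40, 8.0), "dog"), ((45, 9.5), "dog"), ((50, 11.0), "dog"),
]

print(knn_classify((23, 1.8), training_data, k=3))  # expected: "cat"
```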
Why this matters
kNN classification supports the Software Automation outcomes by showing how ML systems make decisions using patterns in datasets. It exposes the limits of simple classifiers and highlights the role of data quality, dimensionality and bias.
Connections to the Software Engineering syllabus include:
- Testing and evaluating ML solutions
- Understanding how algorithms support efficient and accurate software
- Investigating human and dataset-source bias in ML systems
- Applying OOP principles to implement ML program structures
The interactive below unpacks the key concepts with examples.
Key concepts
Instance-based learning
kNN is a lazy learner. It does not build a model during training. Instead, it stores all training examples and defers computation until a prediction is requested. This makes training instant but prediction slower, especially with large datasets.
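Framed as a class with fit and predict methods (linking back to the OOP connection above), the lazy-learning idea becomes visible in the code: fit does no computation at all, while predict does all the work. This is a sketch only; the class name and interface are chosen for illustration.

```python
import math
from collections import Counter

class KNNClassifier:
    """Minimal lazy learner: 'training' only stores the examples."""

    def __init__(self, k=3):
        self.k = k
        self.points = []
        self.labels = []

    def fit(self, points, labels):
        # No model is built here; the data is simply remembered.
        self.points = list(points)
        self.labels = list(labels)

    def predict(self, new_point):
        # All the real work is deferred to prediction time.
        distances = [(math.dist(new_point, p), label)
                     for p, label in zip(self.points, self.labels)]
        nearest = sorted(distances, key=lambda pair: pair[0])[:self.k]
        return Counter(label for _, label in nearest).most_common(1)[0][0]
```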
The role of k
The parameter k controls how many neighbours vote on the classification:
- Small k (e.g., k = 1): Highly sensitive to noise and outliers; produces complex, irregular boundaries
- Moderate k: Balances local patterns with stability
- Large k: Produces smooth boundaries but may ignore local structure and misclassify edge cases
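One way to see this trade-off is a small experiment with scikit-learn (assumed installed) on a noisy synthetic dataset. The dataset, split and k values below are arbitrary choices for illustration; typically k = 1 scores lower on unseen data because it fits the noise, while very large k can also drop as the boundary becomes too smooth.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Noisy, non-linear synthetic dataset
X, y = make_moons(n_samples=400, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for k in (1, 5, 25, 101):
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)
    print(f"k={k:>3}  test accuracy = {model.score(X_test, y_test):.2f}")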
Distance metrics
kNN relies on measuring similarity between points. Common metrics include:
- Euclidean distance: Straight-line distance; sensitive to scale differences between features
- Manhattan distance: Sum of absolute differences; follows grid-like paths
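A minimal sketch of both metrics, using made-up points, shows how the two measures differ on the same pair of points.

```python
import math

def euclidean(a, b):
    # Straight-line distance: square root of the summed squared differences
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # Grid-like distance: sum of absolute differences along each axis
    return sum(abs(x - y) for x, y in zip(a, b))

p, q = (1, 2), (4, 6)
print(euclidean(p, q))  # 5.0
print(manhattan(p, q))  # 7
```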
Feature normalisation is critical. Without it, features with larger numeric ranges dominate the distance calculation and distort predictions.
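A common fix is min-max scaling, which rescales each feature to the range [0, 1]. The feature names and values below are hypothetical, chosen to show two features on very different scales.

```python
def min_max_normalise(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical features: without rescaling, income (tens of thousands)
# would dominate age (tens) in any distance calculation.
ages = [18, 25, 40, 60]
incomes = [20_000, 35_000, 80_000, 120_000]

print(min_max_normalise(ages))     # approximately [0.0, 0.17, 0.52, 1.0]
print(min_max_normalise(incomes))  # [0.0, 0.15, 0.6, 1.0]
```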
Decision boundaries
Unlike logistic regression, which fits a single linear boundary, kNN adapts to the local structure of the data. This allows it to model non-linear, irregularly shaped class regions. However, this flexibility also makes it vulnerable to overfitting when k is too small.
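A rough comparison (again assuming scikit-learn, with a synthetic dataset whose classes form interleaved crescents) illustrates the point: a linear classifier cannot separate the two regions well, while kNN adapts to their shape.

```python
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=400, noise=0.2, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

linear = LogisticRegression().fit(X_train, y_train)
knn = KNeighborsClassifier(n_neighbors=7).fit(X_train, y_train)

print("logistic regression:", round(linear.score(X_test, y_test), 2))
print("kNN (k=7):          ", round(knn.score(X_test, y_test), 2))
```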
Computational cost
kNN has no training phase, but prediction requires calculating distances to all training points, which becomes expensive with large datasets. In high-dimensional feature spaces, distances also become less meaningful, a problem known as the curse of dimensionality.