
Data features

Before a machine learning algorithm can make predictions or reveal patterns, it must be fed meaningful data. Yet data in its raw state is rarely suitable for analysis. It needs to be cleaned, transformed, and organised into a form a model can learn from. This preparation process revolves around features: the measurable properties or attributes that algorithms use to identify relationships and make predictions.

Syllabus alignment

This lesson supports the NSW Software Engineering Stage 6 syllabus.

A motivating scenario: preparing data for smart maintenance

Imagine you work for a company that manages a fleet of self-driving delivery vans. Each van streams sensor readings – such as battery temperature, motor current, GPS location, and vibration levels – at one-minute intervals. The maintenance team wants to predict which vehicles are likely to need service in the next week so they can schedule repairs before breakdowns happen. You receive a CSV exported from the sensor platform, only to discover missing timestamps, inconsistent units (some temperatures are in °C, others °F), and categorical status flags stored as free-text notes. Before any machine learning model can make sense of this messy data, you must decide what counts as a useful input feature and transform the raw measurements into a clean, consistent dataset.
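A minimal sketch of the kind of clean-up this CSV needs, using plain Python dicts as rows. The column names (`temp_unit`, `status_note`) and the keyword list are illustrative assumptions, not the real sensor platform's schema:

```python
def fahrenheit_to_celsius(f):
    """Convert a Fahrenheit reading to Celsius."""
    return (f - 32) * 5 / 9

def clean_row(row):
    """Unify temperature units and collapse free-text notes to a flag."""
    # Some rows arrive in °F, others in °C: convert everything to °C.
    if row.get("temp_unit") == "F":
        row["battery_temp_c"] = round(fahrenheit_to_celsius(row["battery_temp"]), 1)
    else:
        row["battery_temp_c"] = row["battery_temp"]
    # Turn free-text status notes into a binary fault indicator
    # (keyword list is an assumption for illustration).
    note = (row.get("status_note") or "").lower()
    row["has_fault_flag"] = any(word in note for word in ("fault", "error", "warning"))
    return row

rows = [
    {"van_id": "V-001", "battery_temp": 108.1, "temp_unit": "F",
     "status_note": "minor fault logged"},
    {"van_id": "V-002", "battery_temp": 35.6, "temp_unit": "C",
     "status_note": "all normal"},
]
cleaned = [clean_row(r) for r in rows]
```

In practice this logic would sit in a preprocessing script run once over the whole export, with the keyword list refined against real technician notes.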

What is a feature?

A feature is a variable or measurable input that helps describe an observation. In a dataset of houses, for example, each record might include numerical features such as floor area and the number of bedrooms, as well as categorical features such as location or material. Each feature provides information the algorithm can use to detect patterns, such as how price tends to rise with floor area or fall with distance from the city.

In a complex dataset like that generated by self-driving vans, features serve as the bridge between raw sensor data and meaningful insights. Each stream of measurements – battery temperature, vibration amplitude, or motor current – can be engineered into features that capture something predictive about the vehicle’s condition. For instance, rather than using every single vibration reading, you might calculate summary statistics such as the mean and standard deviation over an hour, or derive a new feature representing the rate of increase in battery temperature during long trips. Even categorical data, such as maintenance status notes, can be encoded as binary indicators showing whether a fault code has recently appeared. The process of identifying, selecting, and transforming these features enables the algorithm to recognise early warning patterns and anticipate failures before they occur.
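The summary statistics described above can be sketched in a few lines. The window size of 60 readings and the variable names are assumptions for illustration:

```python
from statistics import mean, stdev

def window_features(readings, window=60):
    """Summarise the most recent `window` readings as mean and std."""
    recent = readings[-window:]
    return {"vibration_mean": mean(recent), "vibration_std": stdev(recent)}

def temp_rate(temps, minutes):
    """Average rate of battery-temperature increase (°C per minute)."""
    return (temps[-1] - temps[0]) / minutes

# One spike in an otherwise quiet hour of vibration readings:
hourly = window_features([0.2] * 59 + [0.8])
# Battery warmed from 40 °C to 46 °C over a 30-minute trip segment:
rate = temp_rate([40.0, 46.0], minutes=30)
```

Two derived numbers per hour replace sixty raw readings, which is exactly the compression that makes feature engineering valuable.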

Key points when defining features

When constructing a dataset, every feature you include must earn its place. Each one should be chosen deliberately, guided by the question you want the model to answer. A feature is not merely a column in a spreadsheet; it is a piece of evidence you present to the algorithm in support of a hypothesis. Poorly defined features can lead to misleading results, while well-defined ones sharpen the model’s ability to discern meaningful relationships.

Relevance refers to the degree to which a feature contributes to the predictive goal. A model that predicts vehicle maintenance needs, for example, may benefit from sensor-derived features such as average motor current or the rate of change in battery temperature, but including irrelevant data, such as the driver’s favourite playlist, would add only noise. Selecting relevant features requires domain knowledge and critical analysis — the closer a feature aligns with the problem’s underlying mechanism, the stronger its contribution.

Measurability concerns the reliability and consistency of data collection. Features must be recorded consistently across all observations. If vibration levels are measured in different units or sampled at irregular intervals, the resulting inconsistencies will undermine the model’s learning process. Establishing clear measurement protocols or automated validation checks helps maintain data integrity.

Interpretability ensures that features are understandable to both humans and machines. A well-chosen feature should make sense to a domain expert — such as a mechanical engineer reviewing the “average daily vibration variance” — rather than existing as an abstract value with no intuitive meaning. Interpretability is particularly important when results must be explained to stakeholders, or when models are used to justify maintenance or safety decisions.

Granularity refers to the level of detail captured by each feature. Too much detail (such as per-second readings) may create unnecessary noise and computational load; too little (such as daily averages) can obscure meaningful patterns. The appropriate granularity depends on the nature of the process being modelled — predicting a motor failure may require minute-by-minute temperature trends, while forecasting service demand across a fleet might only require daily summaries.

Taken together, these principles form the backbone of effective feature definition. A feature that is relevant, measurable, interpretable and appropriately granular gives the algorithm a clearer view of the problem it is meant to solve — and positions your model to generate insights that are both accurate and actionable.
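The granularity trade-off can be made concrete: aggregating per-minute readings into hourly averages discards per-second noise while keeping the hourly trend. This is a minimal stdlib sketch; a real pipeline would more likely use timestamped data and a library resampler:

```python
from statistics import mean

def hourly_means(minute_readings):
    """Aggregate (minute_index, value) pairs into hourly averages."""
    buckets = {}
    for minute, value in minute_readings:
        # Integer-divide the minute index to find its hour bucket.
        buckets.setdefault(minute // 60, []).append(value)
    return {hour: mean(vals) for hour, vals in sorted(buckets.items())}

# Minutes 0 and 59 fall in hour 0; minute 60 starts hour 1.
summary = hourly_means([(0, 1.0), (59, 3.0), (60, 5.0)])
```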

Types of features

Features can take several forms, each requiring different handling during preprocessing:

  • Numerical (continuous) features represent real-valued measurements such as battery temperature or motor current. These can take any value within a range.
  • Numerical (discrete) features represent countable values such as the number of fault codes recorded in the past week.
  • Categorical (nominal) features describe labels without natural order, such as error code category or route ID. Algorithms require these to be converted to numeric form, typically via one-hot or label encoding.
  • Categorical (ordinal) features have a meaningful order but no consistent differences between categories, such as severity ratings: low, medium, and high.
  • Binary features represent two-state indicators, such as is_raining or requires_software_update.
  • Textual features consist of unstructured strings, such as technician notes, which may require natural language processing techniques.
  • Time-based features capture temporal information, such as timestamps or hours since last service.

Understanding the type of each feature guides the preprocessing steps you choose. An algorithm can only recognise structure in data if these features are represented consistently and on compatible scales.
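The two categorical encodings mentioned above can be sketched directly. The category list and severity mapping are drawn from the sample data below, but treat them as illustrative assumptions:

```python
ERROR_CATEGORIES = ["None", "Motor", "Sensor"]        # assumed fixed category set
SEVERITY_ORDER = {"Low": 1, "Medium": 2, "High": 3}   # ordinal mapping

def one_hot(value, categories):
    """One-hot encode a nominal value against a fixed category list."""
    return [1 if value == c else 0 for c in categories]

encoded_error = one_hot("Motor", ERROR_CATEGORIES)     # nominal: no order implied
encoded_severity = SEVERITY_ORDER["High"]              # ordinal: order preserved
```

Note the distinction: one-hot encoding avoids implying an order between nominal categories, while the ordinal mapping deliberately preserves one.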

Sample maintenance data

The following table shows a snapshot of features from the van fleet dataset, illustrating the different feature types in practice:

Van ID | Battery Temp (°C) | Motor Current (A) | Fault Codes | Error Category | Severity | Is Raining | Route ID | Technician Notes | Hours Since Service
V-001 | 42.3 | 18.7 | 0 | None | Low | False | R-12 | Normal operation | 168.5
V-002 | 78.9 | 24.1 | 2 | Motor | High | True | R-05 | Unusual vibration detected | 520.2
V-003 | 35.6 | 16.2 | 0 | None | Low | False | R-08 | Battery voltage stable | 95.0
V-004 | 65.4 | 22.8 | 1 | Sensor | Medium | False | R-12 | GPS intermittent | 340.8
V-005 | 81.2 | 19.3 | 3 | Motor | High | True | R-03 | Requires immediate inspection | 612.5

This dataset demonstrates numerical features (battery temperature, motor current, hours since service), discrete features (fault codes), categorical features (error category, route ID), ordinal features (severity), binary features (is raining), and textual features (technician notes).

Feature scaling

Many algorithms, especially those that rely on distance measures or gradient descent, are highly sensitive to the scale of their input data. When features vary widely in magnitude, those with larger numeric ranges can dominate the model’s behaviour and distort its understanding of what truly matters. A dataset that includes “battery temperature in degrees” (perhaps between 20 and 80) and “distance travelled in metres” (possibly in the tens or hundreds of thousands) will lead the model to treat distance as far more significant simply because its numbers are larger. This can mislead algorithms such as k-nearest neighbour, linear regression, and neural networks.

To overcome this imbalance, features are rescaled using mathematical transformations that bring them to comparable ranges without changing their inherent relationships. Two common approaches are normalisation and standardisation.

Normalisation rescales feature values to a fixed range, usually between 0 and 1. This is useful when you know the bounds of your data or when features have a natural limit. For example, if the van’s battery charge level ranges from 0 to 100 per cent, each reading can be divided by 100 to express it as a proportion of full capacity. Normalisation is particularly effective when features represent rates, percentages, or probabilities.

Standardisation, by contrast, transforms values so that the feature has a mean of 0 and a standard deviation of 1. This method is better suited when features are unbounded or contain outliers, such as motor current or vibration amplitude. In this case, each reading is converted into the number of standard deviations it lies from the mean. A standardised vibration value of +2, for example, means that the van’s vibration level is two standard deviations above normal — a potential early sign of mechanical wear.
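Both transformations are simple enough to sketch with the standard library. The example values are invented; `pstdev` (population standard deviation) is used here, though some libraries default to the sample standard deviation:

```python
from statistics import mean, pstdev

def normalise(values, lo, hi):
    """Min-max scale to [0, 1] using known physical bounds."""
    return [(v - lo) / (hi - lo) for v in values]

def standardise(values):
    """Rescale to mean 0 and (population) standard deviation 1."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

# Battery charge has known bounds (0-100%), so normalise:
charge = normalise([0, 50, 100], lo=0, hi=100)
# Vibration is unbounded, so standardise:
vibration = standardise([10.0, 20.0, 30.0])
```

After standardising, a value of +2 reads directly as "two standard deviations above the fleet's normal", which is what makes it useful as an early-warning signal.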

Scaling strategy for maintenance features

The choice between normalisation, standardisation, or no scaling depends on each feature’s characteristics:

Feature | Type | Range/Bounds | Recommended Scaling | Rationale
Battery Temp (°C) | Continuous | Known bounds (0-100 °C typical) | Normalisation | Temperature has physical limits; normalising to [0, 1] preserves the bounded nature
Motor Current (A) | Continuous | Unbounded, may have outliers | Standardisation | Current can spike unpredictably; standardisation handles outliers better
Fault Codes | Discrete count | Lower bound (0), unbounded upper | Standardisation | Count data without a natural maximum; may contain outliers during failures
Error Category | Categorical | N/A | Encoding only | Requires one-hot or label encoding; no scaling needed after encoding
Severity | Ordinal | Fixed scale (Low/Medium/High) | Ordinal encoding | Encode as 1/2/3; already on a consistent scale, normalisation optional
Is Raining | Binary | {True, False} | None | Already binary (0/1); no scaling required
Route ID | Categorical | N/A | Encoding only | Nominal category; one-hot encoding recommended, no scaling
Technician Notes | Text | N/A | NLP processing | Requires text vectorisation (e.g., TF-IDF) rather than numeric scaling
Hours Since Service | Continuous | Lower bound (0), unbounded upper | Standardisation | Can grow indefinitely; standardisation accounts for variability

This analysis shows that battery temperature would benefit from normalisation due to its known physical bounds, while features like motor current and hours since service require standardisation to handle their unbounded nature and potential outliers.
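Applying this per-feature plan to a single record might look as follows. The fleet statistics (`current_mean`, `current_std`) and the 0-100 °C temperature bound are assumed values for illustration; in practice they would be computed from the training data:

```python
SEVERITY_ORDER = {"Low": 1, "Medium": 2, "High": 3}

def scale_record(record, current_mean=20.0, current_std=3.0):
    """Apply the per-feature scaling choices to one maintenance record."""
    return {
        # Normalise: temperature has known physical bounds (assumed 0-100 °C).
        "battery_temp": record["battery_temp_c"] / 100,
        # Standardise: current is unbounded, use fleet mean and std.
        "motor_current": (record["motor_current_a"] - current_mean) / current_std,
        # Ordinal encode: severity has a meaningful order.
        "severity": SEVERITY_ORDER[record["severity"]],
        # Binary: already two-state, just cast to 0/1.
        "is_raining": int(record["is_raining"]),
    }

scaled = scale_record({"battery_temp_c": 50.0, "motor_current_a": 26.0,
                       "severity": "Medium", "is_raining": True})
```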

The purpose of scaling is not to alter the physical meaning of features but to place them on an even footing. When all features share a common scale, the model’s optimisation algorithms can move efficiently through the data space, and each feature’s contribution reflects its actual predictive power rather than its numerical size.

In the smart maintenance scenario, properly scaled data allows the model to give balanced attention to battery temperature, vibration, speed, and usage duration, rather than fixating on whichever variable happens to be measured in the largest units. The result is a model that learns patterns rooted in reality, not in the quirks of measurement.

Feature selection and dimensionality

Not every feature improves a model’s performance; some add noise or redundancy. Feature selection identifies the most informative variables and discards those that confuse the algorithm. Too many features relative to the number of samples – a phenomenon known as the curse of dimensionality – can lead to overfitting, where the model learns the training data perfectly but performs poorly on new data. Simpler models with fewer, well-chosen features often generalise better.
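One simple (though far from the only) selection heuristic is to keep features whose correlation with the target clears a threshold. This sketch uses invented data and an arbitrary threshold of 0.3; real feature selection would also consider redundancy between features:

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

def select_features(columns, target, threshold=0.3):
    """Keep features whose |correlation| with the target clears the threshold."""
    return [name for name, col in columns.items()
            if abs(pearson(col, target)) >= threshold]

kept = select_features(
    {"avg_motor_current": [18.0, 20.0, 22.0, 24.0],   # tracks the target
     "favourite_playlist_id": [1, 2, 2, 1]},          # pure noise
    target=[0.1, 0.2, 0.3, 0.4],
)
```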

From features to regression

In a regression problem, features serve as independent variables to predict a dependent variable, often called the target. In the self-driving van scenario, each feature, such as average battery temperature, motor current, total driving time, or recent vibration variance, provides a measurable input that may influence the target variable: the likelihood of a maintenance event in the coming week. The algorithm analyses how fluctuations in each of these features correlate with the probability of a fault, gradually learning a mathematical relationship between sensor readings and maintenance outcomes.
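The feature-to-target relationship can be illustrated with the simplest possible case: ordinary least squares on one engineered feature. The data points are invented, and a real model would combine many features using a library rather than this hand-rolled fit:

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one feature)."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

hours_since_service = [100.0, 200.0, 300.0, 400.0]   # engineered feature
maintenance_risk = [0.1, 0.2, 0.3, 0.4]              # invented target values
slope, intercept = fit_line(hours_since_service, maintenance_risk)
predicted_risk = slope * 500 + intercept             # estimate for 500 hours
```

The fitted slope quantifies exactly the kind of learned relationship the paragraph describes: how much the predicted risk rises per additional hour since service.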

Understanding data features – types, scales, and interactions – is crucial to constructing reliable predictive models. In the next module, we’ll explore how these features can be combined using linear regression to model relationships between variables, quantify their influence, and make data-driven predictions about the future performance of complex systems, such as autonomous vehicles.

Practice questions

Question 1

In one sentence, define a data feature.

2 marks
Question 2

List two data quality issues uncovered in the smart maintenance CSV export.

2 marks
Question 3

Why must every feature in a dataset 'earn its place'?

2 marks
Question 4

Name the four principles highlighted for defining useful features.

2 marks
Question 5

Which feature in the sample maintenance table is ordinal, and how is it ordered?

1 mark
Question 6

Why do categorical features like Error Category or Route ID require encoding before modelling?

2 marks
Question 7

When is normalisation preferred over standardisation in this lesson's scaling guide?

2 marks
Question 8

Explain why hours since service should be standardised rather than normalised.

2 marks
Question 9

What is the curse of dimensionality, and how does feature selection help?

3 marks
Question 10

How do engineered features feed into the regression target in the maintenance example?

2 marks
Total: 20 marks