Data features
Before a machine learning algorithm can make predictions or reveal patterns, it must be fed meaningful data. Yet data in its raw state is rarely suitable for analysis. It needs to be cleaned, transformed, and organised into a form a model can learn from. This preparation process revolves around features: the measurable properties or attributes that algorithms use to identify relationships and make predictions.
Syllabus alignment
This lesson supports the NSW Software Engineering Stage 6 syllabus:
- Software automation / Programming for automation
  - Design, develop and apply ML regression models using an OOP to predict numeric values.
A motivating scenario: preparing data for smart maintenance
Imagine you work for a company that manages a fleet of self-driving delivery vans. Each van streams sensor readings – such as battery temperature, motor current, GPS location, and vibration levels – at one-minute intervals. The maintenance team wants to predict which vehicles are likely to need service in the next week so they can schedule repairs before breakdowns happen. You receive a CSV exported from the sensor platform, only to discover missing timestamps, inconsistent units (some temperatures are in °C, others °F), and categorical status flags stored as free-text notes. Before any machine learning model can make sense of this messy data, you must decide what counts as a useful input feature and transform the raw measurements into a clean, consistent dataset.
What is a feature?
A feature is a variable or measurable input that helps describe an observation. In a dataset of houses, for example, each record might include numerical features such as floor area and the number of bedrooms, as well as categorical features such as location or material. Each feature provides information the algorithm can use to detect patterns, such as how price tends to rise with floor area or fall with distance from the city.
In a complex dataset like the one generated by the self-driving vans, features serve as the bridge between raw sensor data and meaningful insights. Each stream of measurements – battery temperature, vibration amplitude, or motor current – can be engineered into features that capture something predictive about the vehicle’s condition. For instance, rather than using every single vibration reading, you might calculate summary statistics such as the mean and standard deviation over an hour, or derive a new feature representing the rate of increase in battery temperature during long trips. Even categorical data, such as maintenance status notes, can be encoded as binary indicators showing whether a fault code has recently appeared. Identifying, selecting, and transforming these features is what enables the algorithm to recognise early warning patterns and anticipate failures before they occur.
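The sketch below shows one way this feature engineering might look in code. It is a minimal example, assuming a single van's readings arrive as a pandas DataFrame with hypothetical columns timestamp, battery_temp_c and vibration:

```python
import pandas as pd

def engineer_features(readings: pd.DataFrame) -> pd.DataFrame:
    """Summarise one-minute sensor readings into hourly features.

    Assumes columns: timestamp, battery_temp_c, vibration (one van's stream).
    """
    # Ensure timestamps are parsed so the stream can be resampled by time
    readings["timestamp"] = pd.to_datetime(readings["timestamp"])
    grouped = readings.set_index("timestamp").resample("1h")

    hourly = pd.DataFrame({
        # Hourly summary statistics instead of raw per-minute readings
        "vibration_mean": grouped["vibration"].mean(),
        "vibration_std": grouped["vibration"].std(),
        "battery_temp_mean": grouped["battery_temp_c"].mean(),
    })

    # Derived feature: rate of change in battery temperature, degrees per hour
    hourly["battery_temp_rate"] = hourly["battery_temp_mean"].diff()
    return hourly.reset_index()
```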
Key points when defining features
When constructing a dataset, every feature you include must earn its place. Each one should be chosen deliberately, guided by the question you want the model to answer. A feature is not merely a column in a spreadsheet; it is a piece of evidence you present to the algorithm in support of a hypothesis. Poorly defined features can lead to misleading results, while well-defined ones sharpen the model’s ability to discern meaningful relationships.
Relevance refers to the degree to which a feature contributes to the predictive goal. A model that predicts vehicle maintenance needs, for example, may benefit from sensor-derived features such as average motor current or the rate of change in battery temperature, but including irrelevant data, such as the driver’s favourite playlist, would add only noise. Selecting relevant features requires domain knowledge and critical analysis — the closer a feature aligns with the problem’s underlying mechanism, the stronger its contribution.
Measurability concerns the reliability and consistency of data collection. Features must be recorded consistently across all observations. If vibration levels are measured in different units or sampled at irregular intervals, the resulting inconsistencies will undermine the model’s learning process. Establishing clear measurement protocols or automated validation checks helps maintain data integrity.
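As a small illustration of an automated validation check, the snippet below flags battery temperature readings that fall outside a plausible Celsius range, which in this scenario usually means the value was logged in Fahrenheit. The column name and thresholds are assumptions, not part of the original dataset:

```python
import pandas as pd

def flag_implausible_battery_temps(readings: pd.DataFrame) -> pd.Series:
    """Return a boolean mask marking battery temperatures outside a plausible °C range."""
    # Values above 100 most likely arrived in °F and should be converted,
    # e.g. celsius = (fahrenheit - 32) * 5 / 9, before modelling
    return (readings["battery_temp"] < -20) | (readings["battery_temp"] > 100)
```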
Interpretability ensures that features are understandable to both humans and machines. A well-chosen feature should make sense to a domain expert — such as a mechanical engineer reviewing the “average daily vibration variance” — rather than existing as an abstract value with no intuitive meaning. Interpretability is particularly important when results must be explained to stakeholders, or when models are used to justify maintenance or safety decisions.
Granularity refers to the level of detail captured by each feature. Too much detail (such as per-second readings) may create unnecessary noise and computational load; too little (such as daily averages) can obscure meaningful patterns. The appropriate granularity depends on the nature of the process being modelled — predicting a motor failure may require minute-by-minute temperature trends, while forecasting service demand across a fleet might only require daily summaries.
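To make the granularity trade-off concrete, the short sketch below aggregates the same synthetic temperature stream at two different levels; which level is appropriate depends on the question being asked. The data here is invented purely for illustration:

```python
import numpy as np
import pandas as pd

# Synthetic per-second battery temperatures for one hour (illustrative only)
index = pd.date_range("2024-01-01 09:00", periods=3600, freq="s")
per_second = pd.Series(
    40 + np.random.default_rng(0).normal(0, 0.5, size=3600), index=index
)

# The same stream at two granularities
minute_temps = per_second.resample("1min").mean()  # minute-by-minute trends for failure prediction
daily_means = per_second.resample("1D").mean()     # daily summaries for fleet-level forecasting
```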
Taken together, these principles form the backbone of effective feature definition. A feature that is relevant, measurable, interpretable and appropriately granular gives the algorithm a clearer view of the problem it is meant to solve — and positions your model to generate insights that are both accurate and actionable.
Types of features
Features can take several forms, each requiring different handling during preprocessing:
- Numerical (continuous) features represent real-valued measurements such as battery temperature or motor current. These can take any value within a range.
- Numerical (discrete) features represent countable values such as the number of fault codes recorded in the past week.
- Categorical (nominal) features describe labels without natural order, such as error code category or route ID. Algorithms require these to be converted to numeric form, typically via one-hot or label encoding.
- Categorical (ordinal) features have a meaningful order but no consistent differences between categories, such as severity ratings: low, medium, and high.
- Binary features represent two-state indicators, such as `is_raining` or `requires_software_update`.
- Textual features consist of unstructured strings, such as technician notes, which may require natural language processing techniques.
- Time-based features capture temporal information, such as timestamps or hours since last service.
Understanding the type of each feature guides the preprocessing steps you choose. An algorithm can only recognise structure in data if these features are represented consistently and on compatible scales.
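For instance, the sketch below converts nominal, ordinal and binary columns from this scenario into numeric form; the column values are invented for illustration:

```python
import pandas as pd

# Hypothetical snapshot of categorical, ordinal and binary columns
df = pd.DataFrame({
    "error_category": ["None", "Motor", "Sensor"],
    "severity": ["Low", "High", "Medium"],
    "is_raining": [False, True, False],
})

# Nominal category: one-hot encoding creates one binary column per label
encoded = pd.get_dummies(df, columns=["error_category"], prefix="error")

# Ordinal category: map to integers that preserve the ordering
encoded["severity"] = df["severity"].map({"Low": 1, "Medium": 2, "High": 3})

# Binary flag: cast True/False to 1/0
encoded["is_raining"] = df["is_raining"].astype(int)

print(encoded)
```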
Sample maintenance data
The following table shows a snapshot of features from the van fleet dataset, illustrating the different feature types in practice:
| Van ID | Battery Temp (°C) | Motor Current (A) | Fault Codes | Error Category | Severity | Is Raining | Route ID | Technician Notes | Hours Since Service |
|---|---|---|---|---|---|---|---|---|---|
| V-001 | 42.3 | 18.7 | 0 | None | Low | False | R-12 | Normal operation | 168.5 |
| V-002 | 78.9 | 24.1 | 2 | Motor | High | True | R-05 | Unusual vibration detected | 520.2 |
| V-003 | 35.6 | 16.2 | 0 | None | Low | False | R-08 | Battery voltage stable | 95.0 |
| V-004 | 65.4 | 22.8 | 1 | Sensor | Medium | False | R-12 | GPS intermittent | 340.8 |
| V-005 | 81.2 | 19.3 | 3 | Motor | High | True | R-03 | Requires immediate inspection | 612.5 |
This dataset demonstrates continuous numerical features (battery temperature, motor current, hours since service), discrete numerical features (fault codes), nominal categorical features (error category, route ID), ordinal features (severity), binary features (is raining), and textual features (technician notes).
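If this snapshot were exported to a CSV file, a quick first check is to confirm how each feature type was parsed after loading; the file name below is a placeholder:

```python
import pandas as pd

# Load the fleet snapshot (file name is hypothetical)
vans = pd.read_csv("van_fleet_snapshot.csv")

# Confirm that numeric, boolean and text columns were read as expected
# before deciding on encoding and scaling steps
print(vans.dtypes)
```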
Feature scaling
Many algorithms, especially those that rely on distance measures or gradient descent, are highly sensitive to the scale of their input data. When features vary widely in magnitude, those with larger numeric ranges can dominate the model’s behaviour and distort its understanding of what truly matters. A dataset that includes “battery temperature in degrees” (perhaps between 20 and 80) and “distance travelled in metres” (possibly in the tens or hundreds of thousands) will lead the model to treat distance as far more significant simply because its numbers are larger. This can mislead algorithms such as k-nearest neighbour, linear regression, and neural networks.
To overcome this imbalance, features are rescaled using mathematical transformations that bring them to comparable ranges without changing their inherent relationships. Two common approaches are normalisation and standardisation.
Normalisation rescales feature values to a fixed range, usually between 0 and 1. This is useful when you know the bounds of your data or when features have a natural limit. For example, if the van’s battery charge level ranges from 0 to 100 per cent, each reading can be divided by 100 to express it as a proportion of full capacity. Normalisation is particularly effective when features represent rates, percentages, or probabilities.
Standardisation, by contrast, transforms values so that the feature has a mean of 0 and a standard deviation of 1. This method is better suited when features are unbounded or contain outliers, such as motor current or vibration amplitude. In this case, each reading is converted into the number of standard deviations it lies from the mean. A standardised vibration value of +2, for example, means that the van’s vibration level is two standard deviations above normal — a potential early sign of mechanical wear.
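A minimal sketch of both transformations, applying the formulas directly to a few invented readings:

```python
import numpy as np

battery_charge = np.array([35.0, 80.0, 100.0, 55.0])  # bounded: 0 to 100 per cent
motor_current = np.array([18.7, 24.1, 16.2, 22.8])    # unbounded, in amps

# Normalisation (min-max): x' = (x - min) / (max - min), mapping values into [0, 1]
charge_norm = (battery_charge - battery_charge.min()) / (
    battery_charge.max() - battery_charge.min()
)

# With a known physical bound (0 to 100 per cent), dividing by the bound works too
charge_prop = battery_charge / 100.0

# Standardisation (z-score): x' = (x - mean) / std, centring values on 0
current_std = (motor_current - motor_current.mean()) / motor_current.std()
```

In practice, libraries such as scikit-learn provide MinMaxScaler and StandardScaler, which learn these parameters from the training data and apply them consistently to new observations.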
Scaling strategy for maintenance features
The choice between normalisation, standardisation, or no scaling depends on each feature’s characteristics:
| Feature | Type | Range/Bounds | Recommended Scaling | Rationale |
|---|---|---|---|---|
| Battery Temp (°C) | Continuous | Known bounds (0-100°C typical) | Normalisation | Temperature has physical limits; normalising to [0,1] preserves the bounded nature |
| Motor Current (A) | Continuous | Unbounded, may have outliers | Standardisation | Current can spike unpredictably; standardisation handles outliers better |
| Fault Codes | Discrete count | Lower bound (0), unbounded upper | Standardisation | Count data without a natural maximum; may contain outliers during failures |
| Error Category | Categorical | N/A | Encoding only | Requires one-hot or label encoding; no scaling needed after encoding |
| Severity | Ordinal | Fixed scale (Low/Medium/High) | Ordinal encoding | Encode as 1/2/3; already on consistent scale, normalisation optional |
| Is Raining | Binary | {True, False} | None | Already binary (0/1); no scaling required |
| Route ID | Categorical | N/A | Encoding only | Nominal category; one-hot encoding recommended, no scaling |
| Technician Notes | Text | N/A | NLP processing | Requires text vectorisation (e.g., TF-IDF) rather than numeric scaling |
| Hours Since Service | Continuous | Lower bound (0), unbounded upper | Standardisation | Can grow indefinitely; standardisation accounts for variability |
This analysis shows that battery temperature would benefit from normalisation due to its known physical bounds, while features like motor current and hours since service require standardisation to handle their unbounded nature and potential outliers.
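One way to express this strategy in code is scikit-learn's ColumnTransformer, which applies a different transformation to each group of columns. The sketch below assumes machine-friendly versions of the table's column names:

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import (
    MinMaxScaler,
    OneHotEncoder,
    OrdinalEncoder,
    StandardScaler,
)

# A sketch of the scaling strategy above; column names are assumed
preprocessor = ColumnTransformer(
    transformers=[
        ("normalise", MinMaxScaler(), ["battery_temp_c"]),
        ("standardise", StandardScaler(), ["motor_current", "fault_codes", "hours_since_service"]),
        ("one_hot", OneHotEncoder(handle_unknown="ignore"), ["error_category", "route_id"]),
        ("ordinal", OrdinalEncoder(categories=[["Low", "Medium", "High"]]), ["severity"]),
    ],
    remainder="passthrough",  # leave binary flags such as is_raining untouched
)
# Technician notes would be handled separately with a text vectoriser (e.g. TF-IDF)

# X_train would be a DataFrame of raw features; fitting learns the scaling
# parameters from the training data only, then transform applies them:
# X_scaled = preprocessor.fit_transform(X_train)
```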
The purpose of scaling is not to alter the physical meaning of features but to place them on an even footing. When all features share a common scale, the model’s optimisation algorithms can move efficiently through the data space, and each feature’s contribution reflects its actual predictive power rather than its numerical size.
In the smart maintenance scenario, properly scaled data allows the model to give balanced attention to battery temperature, vibration, speed, and usage duration, rather than fixating on whichever variable happens to be measured in the largest units. The result is a model that learns patterns rooted in reality, not in the quirks of measurement.
Feature selection and dimensionality
Not every feature improves a model’s performance; some add noise or redundancy. Feature selection identifies the most informative variables and discards those that confuse the algorithm. Too many features relative to the number of samples – a phenomenon known as the curse of dimensionality – can lead to overfitting, where the model learns the training data perfectly but performs poorly on new data. Simpler models with fewer, well-chosen features often generalise better.
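As one possible illustration, scikit-learn's SelectKBest scores each feature against the target and keeps only the strongest k; the feature matrix and target names here are placeholders:

```python
from sklearn.feature_selection import SelectKBest, f_regression

# Keep only the k features with the strongest statistical relationship
# to the target; k is a tuning choice, not a fixed rule
selector = SelectKBest(score_func=f_regression, k=5)

# X is the scaled feature matrix, y the maintenance target (hypothetical names)
# X_reduced = selector.fit_transform(X, y)
# kept_columns = selector.get_support(indices=True)  # indices of the retained features
```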
From features to regression
In a regression problem, features serve as independent variables to predict a dependent variable, often called the target. In the self-driving van scenario, each feature, such as average battery temperature, motor current, total driving time, or recent vibration variance, provides a measurable input that may influence the target variable: the likelihood of a maintenance event in the coming week. The algorithm analyses how fluctuations in each of these features correlate with the probability of a fault, gradually learning a mathematical relationship between sensor readings and maintenance outcomes.
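As a brief preview of the next module, the sketch below fits a linear regression to synthetic stand-in data; the features and target are invented purely to show the mechanics of independent and dependent variables:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for four scaled features and a numeric target,
# e.g. hours until the next maintenance event (illustrative only)
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

# Each coefficient quantifies how the target shifts when that feature
# changes by one (scaled) unit, holding the others constant
print("coefficients:", model.coef_)
print("R^2 on held-out data:", model.score(X_test, y_test))
```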
Understanding data features – types, scales, and interactions – is crucial to constructing reliable predictive models. In the next module, we’ll explore how these features can be combined using linear regression to model relationships between variables, quantify their influence, and make data-driven predictions about the future performance of complex systems, such as autonomous vehicles.