Plant Health Classification
Simon Fraser University (SFU) | CMPT 459: Data Mining | Fall 2025 | Taught by Dr. Martin Ester

This project uses machine learning, data analysis, and a custom classifier to predict a plant's stress level. Our trained models can give automated warnings when a plant is likely experiencing stress. This is valuable because gardeners often notice issues too late, which can lead to smaller harvests or even plant loss. The data can be collected with affordable equipment such as soil probes, light sensors, and temperature monitors. We can also analyze this data to discover patterns and relationships, such as how soil quality, water levels, and other features affect growth and plant stress. In the future, the trained models could run on IoT devices for real-time monitoring. This would result in a smart garden assistant that supports healthier plants, reduces waste, and promotes sustainable living.
To analyze the plant health dataset, we began with exploratory data analysis (EDA) to inspect the data, visualize its structure and key features, and identify any data quality issues. We also projected the dataset into lower dimensions to better understand feature relationships. Clustering was then used as an additional exploratory tool to uncover natural groupings that were not obvious from EDA. By revealing overlaps, separations, and potential anomalies, clustering provides insight into the dataset’s structure and informs our modelling decisions by showing how well the underlying classes are separated.
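The exploratory step above can be sketched as follows. This is a minimal, illustrative example using synthetic data in place of the real sensor readings; the cluster count and PCA setup are assumptions, not the project's exact configuration.

```python
# Exploratory sketch: project the features to 2-D with PCA, then run
# k-means to look for natural groupings. Synthetic data stands in for
# the real plant-health readings.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           n_classes=3, n_clusters_per_class=1, random_state=0)

X_scaled = StandardScaler().fit_transform(X)   # cluster on comparable scales
X_2d = PCA(n_components=2, random_state=0).fit_transform(X_scaled)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_2d)

# Compare cluster assignments against the true classes to judge separability.
print(X_2d.shape, np.unique(labels))
```

Plotting `X_2d` coloured by `labels` versus by the true classes is how overlaps and separations between stress levels become visible.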
While clustering helped us understand the natural structure of the dataset, it also highlighted irregular data points that did not fit well into any group. To address these irregular points, we implemented outlier-detection methods to identify and remove readings that were likely caused by sensor noise or measurement errors. With these findings, we constructed a preprocessing pipeline for each classification algorithm; it cleaned the dataset, handled missing values, normalized the features, and removed outliers. This ensured the classifiers in the next stage were trained on high-quality data without noise or inconsistencies.
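A minimal sketch of this preprocessing, assuming median imputation and Local Outlier Factor as the detector (one of several detectors we used; the exact settings here are illustrative):

```python
# Preprocessing sketch: impute missing sensor values, scale features,
# and drop points flagged as outliers by LocalOutlierFactor.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
X[rng.integers(0, 200, 10), 0] = np.nan        # simulate missing sensor readings

X_imputed = SimpleImputer(strategy="median").fit_transform(X)
X_scaled = StandardScaler().fit_transform(X_imputed)
mask = LocalOutlierFactor(n_neighbors=20).fit_predict(X_scaled) == 1
X_clean = X_scaled[mask]                        # keep only inliers
print(X_clean.shape)
```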
After preprocessing, we decided to train the following classifiers: K-NN, Random Forest, and SVM. To ensure consistent results across classifiers, we pre-split the dataset (80/20) into train.csv and test.csv. For K-NN, we first removed outliers using LOF, then applied SelectKBest with Mutual Information to retain the top eight features. We then trained the K-NN classifier using both default parameters and a hyperparameter-tuned configuration. On a separate working branch, knn-inference, we saved the final model, along with the encoder, scaler, and selected features, for an inference demo.
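The K-NN stage can be sketched as below. The data, parameter grid, and split are illustrative; the real pipeline also applies LOF removal beforehand, as described above.

```python
# K-NN sketch: keep the top eight mutual-information features, then
# tune n_neighbors with GridSearchCV on an 80/20 split.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=400, n_features=12, n_informative=5,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

pipe = Pipeline([
    ("select", SelectKBest(mutual_info_classif, k=8)),
    ("knn", KNeighborsClassifier()),
])
grid = GridSearchCV(pipe, {"knn__n_neighbors": [3, 5, 7, 9]}, cv=5)
grid.fit(X_tr, y_tr)
print(grid.best_params_, round(grid.score(X_te, y_te), 3))
```

Wrapping selection and the classifier in one `Pipeline` keeps the feature-selection step inside each cross-validation fold, avoiding leakage from the held-out fold.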
For Random Forest, we followed a similar preprocessing strategy: cleaning the data, normalizing, and removing outliers using Isolation Forest. After preprocessing, we encoded the target labels and applied Recursive Feature Elimination to identify the most important features for the model. With the reduced feature set, we performed hyperparameter tuning using GridSearchCV to find the best-performing configuration. We then trained the final Random Forest model using the optimal parameters and evaluated its performance through a classification report, confusion matrix, ROC curve, and feature importance plot.

For SVM, similar preprocessing was used, including feature normalization. While outlier detection was performed, no significant outliers were found, so all samples were kept. Mutual information was used before training to identify the most useful features; since most features were redundant, we kept every feature with mutual information greater than zero. Finally, grid search was performed for hyperparameter tuning. We found that the default SVM settings in Scikit-Learn were the optimal configuration. A confusion matrix was generated for analyzing performance, and ROC curves were computed for each class.
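A sketch of the Random Forest stage follows. The sample sizes, feature count, and parameter grid are illustrative stand-ins, not the project's actual values.

```python
# Random Forest sketch: drop Isolation Forest outliers, select features
# with RFE, then grid-search the forest's hyperparameters.
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest, RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=400, n_features=12, n_informative=5,
                           n_classes=3, random_state=0)

# Keep only the samples Isolation Forest labels as inliers (+1).
inliers = IsolationForest(random_state=0).fit_predict(X) == 1
X_in, y_in = X[inliers], y[inliers]

# Recursive Feature Elimination driven by forest feature importances.
rfe = RFE(RandomForestClassifier(random_state=0), n_features_to_select=6)
X_sel = rfe.fit_transform(X_in, y_in)

grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    {"n_estimators": [100, 200], "max_depth": [None, 10]}, cv=3)
grid.fit(X_sel, y_in)
print(grid.best_params_)
```

The SVM path is analogous, with the RFE step replaced by keeping features whose `mutual_info_classif` score is above zero.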
With K-NN using the top eight selected features and hyperparameter tuning, the classifier achieves 76% accuracy, compared to 70% without tuning. Without feature selection but with hyperparameter tuning, the performance drops to 69%, and the lowest accuracy (65%) occurs when neither tuning nor feature selection is applied. However, reducing the selected features to four significantly improves accuracy to 88%, highlighting the importance of both feature selection and hyperparameter optimization. ROC analysis shows strong separation between the Healthy and High Stress classes, while Moderate Stress cases are less confidently classified.
With Random Forest, the classifier achieved near-perfect classification on this dataset even without feature selection or hyperparameter tuning, reaching 96.7% accuracy out of the box. After applying both feature selection and hyperparameter tuning, accuracy remained at 96.7%. Please refer to src/alex/random_forest.py and src/alex/random_forest_analysis.ipynb to view the full comparison across all Random Forest models.
With SVM, feature selection noticeably improved accuracy, from 82% to 90%. Hyperparameter tuning offered no improvement, as the optimal hyperparameters were the defaults. All classes are predicted well; Moderate Stress is the weakest with an AUC of 98%, though that is already very high. Please refer to src/isaac/svm.ipynb.
Stress detection appears to be hierarchical. The model easily distinguishes between extreme states (Healthy vs. High Stress) but struggles with Moderate Stress.
While we didn’t have any issues with data format or the validity of entries, training on our dataset was difficult because of the time dependence of our samples and our many uninformative features.
Most of our features had no relationship to our class labels, which made training inefficient and degraded the accuracy of our distance-based models, such as SVM and KNN. To mitigate this, we only selected features that had relationships with our class labels. Doing so allowed us to get very high accuracy on our test set. Further, we were dealing with time series data, which usually needs special feature extraction methods like wavelet transforms. Surprisingly, when we applied wavelet transforms to the data, our accuracy dropped: using the raw values was better than feature extraction, even though our data was periodic.
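To make concrete the kind of wavelet feature extraction we experimented with, here is a minimal hand-rolled Haar decomposition (no external wavelet library) that turns a periodic trace into per-level energy features. It is purely illustrative and not the exact transform or settings we used.

```python
# Haar wavelet sketch: decompose a signal into detail levels and summarize
# each level by its energy, a common time-series feature-extraction scheme.
import numpy as np

def haar_energy_features(signal, levels=3):
    """Return the energy of each Haar detail level plus the final approximation."""
    x = np.asarray(signal, dtype=float)
    feats = []
    for _ in range(levels):
        approx = (x[0::2] + x[1::2]) / np.sqrt(2)   # low-pass half
        detail = (x[0::2] - x[1::2]) / np.sqrt(2)   # high-pass half
        feats.append(np.sum(detail ** 2))
        x = approx
    feats.append(np.sum(x ** 2))
    return np.array(feats)

t = np.linspace(0, 4 * np.pi, 64)
sig = np.sin(t)                                     # periodic, like hourly readings
feats = haar_energy_features(sig)
# Each orthonormal Haar step preserves energy, so the features sum to the
# signal's total energy.
print(np.allclose(feats.sum(), np.sum(sig ** 2)))   # True
```

Features like these compress a whole window of readings into a handful of numbers; in our case that compression evidently discarded information the classifiers needed, which is why the raw values won.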
[Figure: Mutual Information Plot]
[Figure: Hourly Stress Distribution Plot]
Realizing that feature extraction was not optimal for our data was important in achieving the high accuracy we did.
In conclusion, a key insight from the analysis for this specific dataset is that soil moisture strongly influences plant health, suggesting that for most home gardening scenarios, simply maintaining proper watering is sufficient. While feature extraction methods like wavelet transforms were explored, raw sensor values proved more effective, emphasizing the importance of data-specific preprocessing decisions. Overall, this work demonstrates that machine learning can provide real-world insights for plant care, and with future development, we could integrate IoT devices to better gather data and create a smart garden system to improve plant health, resulting in better yields and promoting sustainability.
The project code can be found on GitHub.