Smart Farming: Data Processing Model Training Report
Machine Learning Model Development & Methodology

Key Facts

  • Data split strategy: GroupShuffleSplit grouped by Plot ID to avoid leakage across dates.
  • Nitrogen prediction model: Gaussian Naive Bayes (65.5% accuracy), selected for stability and low overfitting risk.
  • Irrigation prediction model: Decision Tree (49.1% accuracy), selected to capture non-linear threshold logic.
  • Preprocessing: StandardScaler normalization; targets encoded via Label Encoding.
  • Operationalization: models + scalers + label encoders persisted as .pkl; inference pipeline replays identical preprocessing on XML + flight dates.
  • Known constraint: dataset scarcity limits generalization, especially for irrigation.

Overview

This report describes the training methodology, selection criteria, and operational processes for classification models trained on processed agricultural data: DAS (days after sowing), spectral statistics, and plot IDs.

Model Training Strategy and Methodology

  • Group-based splitting (GroupShuffleSplit): group by Plot ID to prevent memorization when the same plot appears across dates.
  • Algorithm selection emphasizes suitability for limited data and agricultural domain logic over raw benchmark accuracy.
  • Standardization and labeling: normalize inputs via StandardScaler; encode targets via Label Encoding.
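
The standardization and labeling steps above can be sketched as follows; the feature values and class labels are synthetic placeholders, not values from the project dataset.

```python
# Minimal sketch of the preprocessing described above, assuming a feature
# matrix of spectral statistics and a string-valued target column.
import numpy as np
from sklearn.preprocessing import StandardScaler, LabelEncoder

X = np.array([[0.62, 0.41], [0.55, 0.38], [0.71, 0.47]])  # e.g. NDVI stats
y = np.array(["low", "high", "low"])                      # nitrogen class labels

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)    # zero mean, unit variance per feature

encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y)  # "high" -> 0, "low" -> 1 (alphabetical)
```

Keeping the fitted scaler and encoder objects, rather than only the transformed arrays, is what later allows inference to replay identical preprocessing.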

Key idea

Evaluate generalization on plots the model has never seen before by preventing Plot ID leakage across train/test.
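
A minimal sketch of the group-based split, assuming a plot-ID array aligned row-for-row with the feature matrix; the arrays here are illustrative, not project data.

```python
# GroupShuffleSplit keeps all rows of a given plot on one side of the split,
# so the test set contains only plots the model has never seen.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

X = np.arange(20).reshape(10, 2)                     # feature rows
y = np.array([0, 1] * 5)                             # encoded targets
plot_id = np.array([1, 1, 2, 2, 3, 3, 4, 4, 5, 5])   # same plot across dates

gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(gss.split(X, y, groups=plot_id))

# No plot ID appears in both sets, preventing leakage across dates.
assert set(plot_id[train_idx]).isdisjoint(set(plot_id[test_idx]))
```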

Selected Models and Rationale

Nitrogen Prediction

Gaussian Naive Bayes — 65.5% accuracy

Selected for stability with limited observations and strong correlation between NDVI/spectral indices and nitrogen, reducing overfitting risk.
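
An illustrative fit of the Gaussian Naive Bayes classifier; the features (e.g. NDVI mean and standard deviation) and labels below are synthetic placeholders, not the project's training data.

```python
# Gaussian Naive Bayes models each feature per class with a normal
# distribution, which keeps parameter count low on limited observations.
import numpy as np
from sklearn.naive_bayes import GaussianNB

X_train = np.array([[0.6, 0.1], [0.7, 0.2], [0.3, 0.1], [0.2, 0.2]])
y_train = np.array([1, 1, 0, 0])  # encoded nitrogen classes

nb = GaussianNB()
nb.fit(X_train, y_train)
pred = nb.predict([[0.65, 0.15]])  # high-NDVI sample -> class 1 here
```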

Irrigation Prediction

Decision Tree — 49.1% accuracy

Selected to capture non-linear threshold-driven effects of irrigation on plant morphology.
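
A sketch of the threshold logic a shallow decision tree can capture, assuming irrigation level is separable on a morphology-related feature; the feature and labels are hypothetical.

```python
# A decision tree splits on learned feature thresholds, matching the
# threshold-driven effect of irrigation described above.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X_train = np.array([[0.2], [0.3], [0.7], [0.8]])  # e.g. canopy-cover stat
y_train = np.array([0, 0, 1, 1])                  # encoded irrigation levels

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X_train, y_train)
pred = tree.predict([[0.75]])  # falls above the learned threshold -> class 1
```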

Model Management

  • Object persistence: serialize trained models, scalers, and label encoders to .pkl for reuse without retraining.
  • Inference pipeline: accept raw XML data and flight dates; apply identical scaling/encoding steps used in training.
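
The persistence step can be sketched with Python's pickle module, matching the .pkl workflow described above; the file name is illustrative.

```python
# Serialize a fitted preprocessing object to .pkl and reload it, so
# inference can replay the exact training-time transform without refitting.
import pickle
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit([[0.0], [1.0], [2.0]])

with open("scaler.pkl", "wb") as f:
    pickle.dump(scaler, f)

with open("scaler.pkl", "rb") as f:
    restored = pickle.load(f)

# The restored object reproduces the training-time scaling exactly.
```

The same pattern applies to the trained models and label encoders; the inference pipeline loads all three artifacts and applies them in the training order.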

General Evaluation and Constraints

The main bottleneck is the limited volume of the dataset; irrigation accuracy in particular is constrained by insufficient diversity of training samples for generalization.

Conclusion

The training infrastructure is logically validated; accuracy is expected to improve through retraining as data volume increases, without structural code changes.

Figures

Supplementary figures and visual materials

Cover page with report title and project code.

Cover: Smart Farming, Data Processing Model Training Report (23 January 2026).

How to Cite

Use the citation below to reference this report in your work

Kuru, E., & Bulut, M. A. (2026, January 23). Smart Farming: Data Processing Model Training Report (Project 2023-1-DE01-KA220-HED-000166720). Preunec.
