capstone-crop-yield

Crop Yield Estimator

Jordan Wheeler

Data Analytics Capstone Project 80/81FA23

Abstract

This project explores the application of machine learning to crop yield prediction, focusing on ten crops and integrating data from organizations that report global statistics. It builds, trains, and tests several machine learning models: Linear Regression, Random Forest, Gradient Boost, Decision Tree, K-Nearest Neighbors, and Neural Networks. Data integrity is handled through comprehensive preprocessing and analysis. The Gradient Boost model stands out for its accuracy in forecasting yields. The main findings indicate that crop type and regional factors influence yields the most, while climatic elements and pesticide usage have a surprisingly minimal impact. The project offers a new outlook on agricultural predictive analytics and challenges common assumptions about farming practices. It also provides insights for sustainable agriculture, emphasizing the role of machine learning in improving crop production strategies. Finally, the research recommends including more diverse environmental variables, such as sunlight intensity and soil conditions, to refine future predictive models.

Project Goal

The goal of this project is to build a model that estimates the maximum crop yield obtainable for a given set of parameters. This is a widely discussed topic in the agriculture industry, as improving crop yield reduces costs and helps provide sustenance for an ever-growing population. The project tries to identify the scenario that maximizes crop yields while enabling a more sustainable form of agriculture and reducing environmental impact. The full report can be read on Overleaf.

Introduction

This project followed a standard data science workflow, as shown in the image below. It began by defining the scenario the research would address. Data collection and preparation came next. After cleaning, the data was extracted, transformed, and loaded into a Jupyter Notebook file. An exploratory analysis then examined what the data contained and built a better understanding of what we were working with. With that understanding in place, model building and testing took place. The goal of this stage was to find a model that gave accurate results and neither underfit nor overfit the data. The models and their uses were explained, and insights from the models were explored. Finally, the project looked at the models' limitations and future uses. For a more in-depth review of these, please visit the Overleaf report.
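The check for underfitting and overfitting mentioned above can be illustrated by comparing a model's score on the training data with its score on held-out test data. The following is a minimal sketch under stated assumptions, not the project's code: it uses only the Python standard library, a hand-written K-Nearest Neighbors regressor, and synthetic made-up data rather than the crop yield dataset. The helper names `r2_score`, `knn_predict`, and `train_test_gap` are hypothetical and chosen for illustration.

```python
import random


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def knn_predict(train_x, train_y, query_x, k):
    """Predict each query as the mean target of its k nearest training points."""
    preds = []
    for q in query_x:
        order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - q))
        nearest = order[:k]
        preds.append(sum(train_y[i] for i in nearest) / k)
    return preds


def train_test_gap(train_x, train_y, test_x, test_y, k):
    """Return (train R^2, test R^2) for a k-nearest-neighbors regressor."""
    train_r2 = r2_score(train_y, knn_predict(train_x, train_y, train_x, k))
    test_r2 = r2_score(test_y, knn_predict(train_x, train_y, test_x, k))
    return train_r2, test_r2


# Synthetic data (an assumption for illustration): a linear trend plus noise.
rng = random.Random(42)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [3.0 * x + rng.gauss(0, 4.0) for x in xs]

# Simple 80/20 split; the data is already in random order.
split = int(0.8 * len(xs))
train_x, test_x = xs[:split], xs[split:]
train_y, test_y = ys[:split], ys[split:]

# k=1 memorizes the training set, k=split always predicts the training mean,
# and k=10 sits in between.
results = {}
for k in (1, 10, split):
    results[k] = train_test_gap(train_x, train_y, test_x, test_y, k)
    print(f"k={k:3d}  train R^2={results[k][0]:.3f}  test R^2={results[k][1]:.3f}")
```

A training score far above the test score (as with `k=1`, which reproduces the training targets) suggests overfitting, while scores near zero on both sets (as with `k=split`) suggest underfitting. The same comparison applies to any of the model types listed in the abstract.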

Process

Files Used

Getting Started

Requirements

  1. Git
  2. Python 3.7+ (3.11+ preferred)
  3. VS Code Editor
  4. VS Code Extension: Python (by Microsoft)
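
As a quick way to confirm the Python requirement above, a small script along these lines could be run before opening the notebook. It is a sketch only: the function name `check_python` and the three status labels are hypothetical, and the version bounds are taken from the requirements list.

```python
import sys

MINIMUM = (3, 7)     # lowest version listed in the requirements
PREFERRED = (3, 11)  # version the requirements mark as preferred


def check_python(version=None):
    """Classify a (major, minor) version against the stated requirements."""
    if version is None:
        version = sys.version_info[:2]
    version = tuple(version)
    if version < MINIMUM:
        return "unsupported"
    if version < PREFERRED:
        return "supported"
    return "preferred"


print(f"Python {sys.version_info.major}.{sys.version_info.minor}: {check_python()}")
```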

Data Loading

Exploratory Data Analysis (EDA)

Model Building

Model Assessment

Conclusions

References

  1. The World Databank: Climate change overview: Country summary (2023), accessed on October 20, 2023
  2. Food and Agriculture Organization of the United Nations: FAOSTAT (2023), accessed on October 20, 2023
  3. GeoPard: Predicting crop yield with remote sensing data (2023), accessed on November 21, 2023
  4. Project Jupyter: Project Jupyter: Open source software for interactive computing (2023), accessed on November 2, 2023
  5. OECD: Crop Production (2023), accessed on October 25, 2023
  6. Ritchie, H., Rosado, P., Roser, M.: Crop yields. Our World in Data (2022), accessed on October 20, 2023
  7. The pandas development team: pandas: Powerful data structures for data analysis (2023), accessed on November 2, 2023
  8. USDA: Food security status of U.S. households in 2022 (2023), accessed on October 25, 2023