
Through the systematic exploration of these factors, they conducted a total of 60 experiments to ascertain the optimal combination of architectural configurations. While the study by Mohanty et al. primarily focused on deep convolutional neural networks, subsequent research has demonstrated that convolutional neural networks are not the sole approach to achieving excellent performance in image classification tasks. These claims come from the paper “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale” by Alexey Dosovitskiy et al. Their paper finds that a pure transformer applied directly to sequences of image patches, when pre-trained on substantial volumes of data and transferred to multiple mid-sized or small image recognition benchmarks such as ImageNet, CIFAR-100, or VTAB, can yield highly competitive results. The Vision Transformer architecture has shown remarkable performance compared to state-of-the-art convolutional networks while requiring significantly fewer computational resources to train. Consequently, the motivation for this project is to loosely follow the framework outlined in the study by Mohanty et al. However, instead of employing a deep convolutional network architecture, a Vision Transformer model pre-trained on the ImageNet-21k data set will be used to implement transfer learning and to train a disease classification model on the Plant Village data set.
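
As a rough sketch of what this transfer-learning setup can look like, the pre-trained ViT backbone can be loaded from the Hugging Face hub and given a fresh classification head sized for the 38 crop-disease labels. The checkpoint name and the abbreviated label list below are assumptions for illustration, not the project's exact configuration.

```python
from transformers import ViTForImageClassification, ViTImageProcessor

# Assumed ImageNet-21k ViT checkpoint; the project's exact checkpoint may differ.
CHECKPOINT = "google/vit-base-patch16-224-in21k"

# Placeholder label list; the real Plant Village set has 38 crop-disease labels.
labels = ["Apple-Apple Scab", "Apple-Healthy", "Tomato-Yellow Leaf Curl"]

processor = ViTImageProcessor.from_pretrained(CHECKPOINT)
model = ViTForImageClassification.from_pretrained(
    CHECKPOINT,
    num_labels=len(labels),                               # new 38-way head
    id2label=dict(enumerate(labels)),                     # id -> label name
    label2id={name: i for i, name in enumerate(labels)},  # label name -> id
)
```

Because the ImageNet-21k checkpoint ships without a classification head, the head above is randomly initialized and only it plus the backbone are fine-tuned on the Plant Village images.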

The data for this project is the Plant Village data set, which was found through the paper “Using Deep Learning for Image-Based Plant Disease Detection”. The data consists of 54,306 color images of healthy and diseased crop leaves. In a machine learning sense, a data set of 54,306 images is considered small. Each image is 256 × 256 pixels, has three color channels (RGB), and is categorized under a crop-disease classification label. Each label has a crop species name and either a plant disease name or “healthy”. There are 14 different crop species and 20 different crop diseases, which create 38 different crop-disease classification labels in this data set. See Table 2.1 for a detailed list of all classification labels. Table 2.1 shows the total number of images each classification label contains and how that amount translates to the overall percentage contribution to the data set. Classification labels with over 5,000 images had the largest percentage contributions to the data set. These labels are Orange-Haunglongbing with 10.1%, Tomato-Yellow Leaf Curl with 9.9%, and Soybean-Healthy with 9.4%. The classification labels with fewer than 500 images, and therefore the smallest percentage contributions to the data set, are Peach-Healthy, Raspberry-Healthy, and Tomato-Mosaic Virus, all with 0.7%, Apple-Cedar Apple Rust with 0.5%, and Potato-Healthy with 0.3%. Some crop species in this data set have both healthy and diseased leaf images, while others have only diseased or only healthy images; that is, not every crop species contributes to both the healthy and the diseased portions of the data.
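
As an illustration of how the per-label counts and percentage contributions in Table 2.1 can be tabulated, the sketch below loads the images as a Hugging Face `datasets` Dataset and counts the labels. The `imagefolder` loading path and directory name are assumptions about how the images are stored, not the project's exact pipeline.

```python
from collections import Counter
from datasets import load_dataset

# Assumed local directory of Plant Village images, one sub-folder per label.
dataset = load_dataset("imagefolder", data_dir="plant_village", split="train")

label_names = dataset.features["label"].names
counts = Counter(dataset["label"])
total = len(dataset)

# Print each classification label with its image count and share of the data set.
for label_id, count in counts.most_common():
    share = 100.0 * count / total
    print(f"{label_names[label_id]:<40} {count:>6} images  {share:5.1f}%")
```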

For a more detailed view of the 20 crop diseases included in this data set, see Table 2.3, which lists the total number of images for each disease and its percentage contribution to the overall data set. The top three diseases in this data set are Bacterial Spot, Haunglongbing, and Yellow Leaf Curl. Bacterial Spot, which composes 10% of the data, is a bacterial disease that affects many crops by causing their leaves to develop yellow spots that turn brown in the middle; it also causes crops to develop black or brown spots of rot on their fruits. Haunglongbing composes 10% of the data set. This bacterial disease affects citrus trees, causing their fruits to stay green and fall to the ground early before becoming ripe. The disease is common across citrus, but keep in mind that this data set only has images of it affecting oranges. Yellow Leaf Curl composes 9.9% of the data set and is a viral infection that affects only tomatoes. “Yellow leaf curl virus is undoubtedly one of the most damaging pathogens of tomatoes, and it limits the production of tomatoes in many tropical and subtropical areas of the world. It is also a problem in many countries that have a Mediterranean climate, such as California. Thus, the spread of the virus throughout California must be considered a serious potential threat to the tomato industry”. Note that diseased images make up 72.2% of this data set; the remaining 27.8% are healthy crop images. See Table 2.4 for a comparison of the number of diseased and healthy crop images in this data set. The crops that contributed diseased images are Apple, Bell Pepper, Cherry, Grape, Maize, Orange, Peach, Potato, Strawberry, Squash, and Tomato.

The crops that contributed healthy images are Apple, Bell Pepper, Blueberry, Cherry, Grape, Maize, Peach, Potato, Raspberry, Strawberry, and Tomato. For a visual representation of what the diseased and healthy crop images look like, see Figure 2.1, which shows nine different crop images. Each crop image has its classification label above it to identify the crop name and the disease (or healthy). The images in Figure 2.1 are crop images of the classification labels Apple-Apple Scab, Apple-Healthy, Peach-Healthy, Grape-Healthy, Raspberry-Healthy, Soybean-Healthy, Grape-Black Rot, and Peach-Bacterial Spot.

While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. This was the motivation for Dosovitskiy et al. to investigate the transformer model for image classification tasks, which led to the vision transformer. The transformer architecture for natural language processing works similarly to a vision transformer. In natural language processing, sentences are broken down into words, and each word is treated as a sub-token of the original sentence. Similarly, the vision transformer breaks down an image into smaller patches, each patch representing a small sub-section of the original image. To see how sentences are broken down into word tokens and images broken down into patches, see Figure 2.2. Keep in mind that the position of each image patch is very important: if the image patches are out of order, then the original image will also be out of order. This project will implement the vision transformer developed by Dosovitskiy et al. and described in their paper. The framework of their ViT model will be used and accessed through the Hugging Face platform and their transformers package in Python. The vision transformer model comes pre-trained on ImageNet-21k, a benchmark data set consisting of 14 million images and 21k classes. The vision transformer model has been pre-trained on images of pixel size 224 × 224; therefore, any data that is to be further trained on this model must also be of pixel size 224 × 224.

“Data augmentation is the process of transforming images to create new ones for training machine learning models”. “Data augmentation increases the number of examples in the training set while also introducing more variety in what the model sees and learns from. Both these aspects make it more difficult for the model to memorize mappings while also encouraging the model to learn general patterns. Data augmentation can be a good substitute when resources are constrained” because it artificially creates more of your data when it is not possible to get more data. In the case of this project, the function being used to perform the data augmentations is set_transform from the Hugging Face Datasets package in Python. This function applies the transformations only when model training begins, so they can be done on the fly and save computational resources. At each epoch, the transformations are applied to every image given to the model, so the amount of training data stays constant but variation is added to the original data through the transformations. This does not increase the number of training images as other data augmentation packages would; instead, it artificially augments the data with transformations and variation.
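
A minimal sketch of this on-the-fly augmentation with set_transform is shown below, assuming torchvision transforms and the same `imagefolder` loading and checkpoint as in the earlier sketches. The specific augmentations are illustrative, not necessarily the ones used in the project.

```python
from datasets import load_dataset
from torchvision import transforms
from transformers import ViTImageProcessor

# Assumed checkpoint and data location, as in the earlier sketches.
CHECKPOINT = "google/vit-base-patch16-224-in21k"
processor = ViTImageProcessor.from_pretrained(CHECKPOINT)
train_dataset = load_dataset("imagefolder", data_dir="plant_village", split="train")

# Illustrative augmentations: crop to the 224 x 224 size the model expects,
# plus random flips and color jitter to add variation to the fixed set of images.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
])

def train_transforms(batch):
    # set_transform applies this lazily whenever a batch is read, so each epoch
    # sees a slightly different version of the same 54,306 training images.
    images = [augment(img.convert("RGB")) for img in batch["image"]]
    encoded = processor(images=images, return_tensors="pt")
    encoded["labels"] = batch["label"]
    return encoded

train_dataset.set_transform(train_transforms)
```

Because the transform runs at read time rather than being pre-computed, the number of stored images never changes; only what the model sees at each epoch varies.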

Data augmentation is an important step when training machine learning models because powerful models can start to over-fit when the given training data set is too small, which is a problem because the model then memorizes mappings between the inputs and the expected outputs. There are 54,306 images in this data set, which may seem like a lot of images, but for a machine learning model it is not that much. That is why data augmentation is being implemented as a step to reduce possible model over-fitting.

See Table 3.1 for the training results of the pre-trained ViT image classification model trained on the Plant Village data set. The table displays the model's Training Loss, Validation Loss, Precision, Recall, F1 score, and Accuracy over the 10 epochs. The training loss indicates how well the model is fitting the training data, and the validation loss indicates how well the model fits new data. The best model chosen was the one from epoch 10, shown in bold in Table 3.1. The model from epoch 10 has a training loss of 0.088 and a validation loss of 0.073, the lowest pair of all the epochs. For this model, Precision, Recall, the F1 score, and Accuracy all have the value 1.00. These results indicate that the model has a high positive identification rate. The model appears to have converged between epochs 8 and 10. The training results for epoch 10 also show that the model's Evaluation Samples per Second is 50.29, meaning the model can process and make predictions on 50.29 samples in one second, and that its Evaluation Steps per Second is 1.58, the number of iterations the model can complete in one second. See Figure 3.1 for a comparison plot of the model's training loss and validation loss over the course of the 10 epochs. Over the 10 epochs, the training and validation loss both decrease, and the two lines do not cross each other. Since both lines gradually decrease toward 0, the model is slowly learning the underlying patterns in the data.

Model testing was done with the 15% of the data that was reserved for testing and never shown to the model during training. See Table 3.2 for the overall testing results of the model. The model has an Accuracy of 99%, meaning it can guess the classification label correctly 99% of the time. Total Time in Seconds is the amount of time it took for the image classification model to be tested on the test data set, which is 1460.88 seconds (24.34 minutes). Samples per Second is 1.15, which is the number of data samples or instances that the image classification model can process and make predictions on in one second. Latency in Seconds is 0.86, which is the amount of time it takes the image classification model to make a single prediction.

To see how the image classification model classifies an image, see Figure 3.2 and Table 3.3. Figure 3.2 shows an example image on which the image classification model was tested. The image has its true classification label above it, which is Apple-Apple Scab. Table 3.3 has the scores and labels of the five classification labels the model predicts the image in Figure 3.2 could be.
The score is on a scale from 0 to 1, with 1 meaning 100% confidence in the classification label that the model is predicting. The score is split across the five predictions, so the score values of all five predictions add up to 1. The model's top prediction, with a score of 0.91, is the classification label Apple-Apple Scab, which is the true classification label of the image in Figure 3.2; a sketch of how such per-label scores can be obtained from the fine-tuned model is given at the end of this section.

The overall performance of the pre-trained ViT image classification model with data augmentation shows good promise. In the best epoch, F1, Accuracy, Precision, and Recall were all equal to 1.00. This is not the best situation, but it shows there is room for improvement.
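
Returning to the predictions in Table 3.3, the sketch below shows one way such top-five scores can be read from the fine-tuned model using the Hugging Face pipeline API. The saved model directory and the test image file name are hypothetical placeholders.

```python
from transformers import pipeline

# Hypothetical path to the fine-tuned ViT model saved after training.
classifier = pipeline("image-classification", model="./vit-plant-village")

# Hypothetical test image; top_k=5 returns the five highest-scoring labels,
# analogous to the five predictions shown in Table 3.3.
predictions = classifier("apple_scab_example.jpg", top_k=5)
for pred in predictions:
    print(f"{pred['label']:<30} score = {pred['score']:.2f}")
```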