Image interpretation during the labeling stage is often much more difficult when working with remotely sensed images, due to spatial resolutions that are at the limit of being able to resolve objects, or because agricultural field classes appear similar to one another; crop types in particular are difficult to visually distinguish. In an ongoing study referenced by Elmes et al., the accuracy of trained field annotators was evaluated against an expert-generated reference set. A skill statistic that accounts for the true positive and false positive rates was generated for each annotator, and each image was mapped by at least five annotators. The study found a large range in agreement with the reference data among annotators, indicating that the cost of generating high-quality training data from earth observations may also need to cover multiple annotators per unit of annotation, as well as expert-derived reference data to assess the accuracy of difficult annotations. Because of the high cost and lack of efficient tools for developing high-quality geospatial training labels, most research using machine learning and earth observations develops or makes use of small, regional datasets that are constrained to arbitrary political boundaries. These constraints bias models toward geographies that contain political entities with the means to produce labeled training data. It is an open question to what extent these data can be leveraged to make predictions outside of the limited geography and time in which they were created.

Nebraska is a major producer of crops such as corn, millet, and soybean, which are used for biofuels, animal feed, and human consumption. Center pivots are a dominant feature across the state, particularly along the Platte River.

However, there are multiple confounding features that make monitoring of field-level landscape change difficult, including non-irrigated row crops, forested areas, circular hills, and the variability of center pivots themselves, which can be multi-cropped, semicircular, and in various states of development ranging from cleared, to irrigated, to fallowing, to fallow. The majority of these fields are irrigated using groundwater from the Northern High Plains Aquifer (NHPA). While the NHPA has not experienced as much groundwater depletion as the Southern and Central High Plains Aquifer (SHPA and CHPA), portions of it, including the Upper Republican Natural Resources District, have experienced water table declines of up to 40 meters, mostly to irrigate corn. Compared with the SHPA and CHPA, the NHPA is less depleted due to historically later irrigation development, higher recharge, and higher precipitation. The NHPA also contains a larger volume of water than the SHPA and CHPA: 2940 km^3 versus 636 km^3 and 171 km^3, respectively.

This research contributed new functionality to the Python library solaris, which was used to tile large Landsat 5 scenes into 128 by 128 pixel image chips for training models, while also excluding portions of scenes that fell outside of the Nebraska state boundary, where no labels were digitized. A 128 x 128 pixel chip size was selected in order to make full use of GPU memory, speeding up training and providing the model with greater geographic diversity in each batch. Because many features are computed for each image used to train the model, GPU memory use per image is much larger than the image's actual memory footprint, so all images must be served iteratively during training. Landsat 5 scene chips were saved both as GeoTIFFs, to preserve geospatial metadata, and as rescaled 8-bit JPEG files. The JPEG files were used for model training, since detectron2's default training logic expects JPEG-formatted images at the time of this writing.
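The tiling step can be illustrated with a minimal windowed-read sketch. The study used solaris for this step, so the rasterio-based code below only demonstrates the general idea; the file paths and chip naming are hypothetical, and the Nebraska state-boundary filter is omitted.

```python
# Minimal sketch of tiling a Landsat 5 ARD scene into 128 x 128 pixel chips.
# Illustrative only: the study used solaris, whose API differs, and this
# sketch omits the Nebraska state-boundary filter. Paths are hypothetical.
import rasterio
from rasterio.windows import Window

CHIP = 128  # chip size in pixels

with rasterio.open("landsat5_ard_scene.tif") as src:
    profile = src.profile.copy()
    for row in range(0, src.height - CHIP + 1, CHIP):
        for col in range(0, src.width - CHIP + 1, CHIP):
            window = Window(col, row, CHIP, CHIP)
            chip = src.read(window=window)  # array of shape (bands, 128, 128)
            profile.update(height=CHIP, width=CHIP,
                           transform=src.window_transform(window))
            with rasterio.open(f"chips/chip_{row}_{col}.tif", "w", **profile) as dst:
                dst.write(chip)
```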

Landsat 5 ARD scenes contain large amounts of NoData at scene edges, so vector labels that overlapped with these NoData areas were masked out. As the GeoTIFF and JPEG files were tiled, the vector labels were also tiled and saved as GeoJSON files. Fields at scene edges with an area smaller than 4000 m^2 were discarded in order to remove processed labels that do not represent half or full center pivot fields. Solaris Python functions were also used to rasterize the geospatial vectors into multi-channel GeoTIFFs, where each channel represents a binary mask of a unique field instance for training. These rasterized masks were paired with, and tiled to the same size as, the corresponding imagery.

Two different training sets were used to evaluate the effect of training data size on model accuracy. The first set of processed scene chips was taken from the least cloudy scenes across the state of Nebraska during the 2005 growing season, such that no samples overlapped geographically. These image chips were divided into training, validation, and test sets, with 13,625, 3,264, and 861 samples assigned, respectively. The second training set used 50% of the chips from the first training set. These splits were determined by collecting geospatial IDs for all image chips and randomly assigning chips with particular geospatial IDs to each set. This means that all sets are geographically independent from each other, though there is inevitably scene similarity since most of Nebraska has a humid climate and an agricultural landscape. A given set may have some geospatial IDs represented multiple times, in cases where Landsat 5 captured multiple scenes during the study period; this provides more representation of center pivot fields at different stages of development, from cleared for development, to cultivating, to harvested. Blue, Green, Red, and NIR bands were used to train two sets of models with RGB and GRNIR three-band combinations. In each case, the channel-wise means for the whole training set were calculated in order to normalize the model's inputs by subtracting the channel mean from each individual scene's respective channels. A ResNet-50 CNN backbone loaded with pretrained weights from ImageNet was used as a starting point for training and fine-tuning the FCIS and Mask R-CNN models.
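A minimal sketch of the per-instance rasterization step is shown below. The file names are hypothetical rather than the study's actual tiling outputs, and a projected, metric CRS is assumed so that the 4000 m^2 area filter is meaningful.

```python
# Minimal sketch of rasterizing per-field vector labels into a multi-channel
# instance mask aligned to a 128 x 128 image chip. File names are hypothetical,
# and a projected, metric CRS is assumed for the 4000 m^2 area filter.
import geopandas as gpd
import numpy as np
import rasterio
from rasterio.features import rasterize

with rasterio.open("chips/chip_0_0.tif") as src:
    transform, out_shape = src.transform, (src.height, src.width)

fields = gpd.read_file("chips/chip_0_0.geojson")
fields = fields[fields.geometry.area >= 4000]   # drop edge fragments < 4000 m^2

# One binary channel per field instance, matching the chip's grid.
instance_mask = np.stack([
    rasterize([(geom, 1)], out_shape=out_shape, transform=transform,
              fill=0, dtype="uint8")
    for geom in fields.geometry
])
```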

Fine-tuning from pretrained weights is a common practice in deep learning that has been shown to lead to faster model convergence with similar final model accuracies and less hyperparameter tuning, though recent work has shown that pretraining does not always lead to higher overall accuracies. All of the following Mask R-CNN and FCIS results were obtained from a Standard NC6 Microsoft Azure virtual machine, with 6 virtual CPUs, 56 GB of memory, and 4 Titan V GPUs. Mask R-CNN models took roughly two hours each to train, while the FCIS model took eight hours to train, given the implementation's limitation of a batch size of one image and only being able to use one GPU at a time.

For the Mask R-CNN model, a standard detectron2 training loop was used. Table 1 contains the hyperparameters used for the model, including the optimizer, learning rate, batch size, etc. Particular configurations were changed from the defaults in order to adapt the training process to the Nebraska dataset. Relative to ImageNet or the COCO dataset, the Nebraska dataset has a larger average and a wider range in the number of instances per image; therefore, the maximum number of detections was set to 100. The default number of warm-up iterations was decreased in order to stop the learning rate from increasing to a value that would make convergence difficult, and the initial base learning rate was increased by an order of magnitude for faster learning, given that we started from pretrained weights and a training dataset smaller than ImageNet or COCO. All decisions to adjust these hyperparameters were made after reviewing average precision and average recall metrics, and the validation loss relative to the training loss, during the training process. Referring to Table 1, the non-max suppression configurations were increased to improve detections on scenes with many center pivot fields. Non-max suppression is a step that removes low confidence region proposals prior to later steps in the model that refine the bounding box around an object and generate a final object mask. However, the result from this setting did not substantially differ from the default. Likewise, the detections-per-image limit was increased so that scenes with many instances did not have missed detections because of an arbitrary limit.
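The configuration changes described above can be sketched in detectron2 roughly as follows. The dataset names, the specific numeric values, and the choice of which NMS-related key to raise are illustrative assumptions; Table 1 lists the hyperparameters actually used.

```python
# Rough sketch of the detectron2 configuration adjustments described above.
# Dataset names and numeric values are illustrative assumptions.
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultTrainer

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = "detectron2://ImageNetPretrained/MSRA/R-50.pkl"  # ImageNet-pretrained ResNet-50 backbone
cfg.DATASETS.TRAIN = ("nebraska_train",)     # assumed registered dataset names
cfg.DATASETS.TEST = ("nebraska_val",)
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1          # single class: center pivot field
cfg.TEST.DETECTIONS_PER_IMAGE = 100          # raise the per-image detection cap
cfg.SOLVER.BASE_LR = 0.02                    # base LR raised ~an order of magnitude (assumed value)
cfg.SOLVER.WARMUP_ITERS = 100                # fewer warm-up iterations than the default
cfg.SOLVER.IMS_PER_BATCH = 16                # assumed batch size
cfg.MODEL.BACKBONE.FREEZE_AT = 2             # freeze the earliest backbone stages
cfg.MODEL.RPN.POST_NMS_TOPK_TRAIN = 2000     # keep more proposals after NMS (assumed value)

trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()
```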

The Model Freeze setting determines which parameters are allowed to be learned at different stages of the pretrained model. By changing the setting from 0 to 2, parameters that represent higher-level features were learned. This did not impact model accuracy compared to using the default, but it did speed up model training. The maximum warm-up iteration was adjusted so that the learning rate increased more quickly during training, given that the model was run for fewer iterations than the default configuration, which is set up for the larger COCO dataset.

Average Precision and Average Recall are metrics that account for the variation in confidence scores associated with model detections as well as the trade-off between precision and recall. Each detection has a confidence score ascribed to it by the model, or a sequence of confidence scores in the case of multi-class classification. A confidence score threshold determines whether a prediction is considered when calculating precision or recall. Lower confidence thresholds tend to increase recall by keeping more predictions, while higher confidence thresholds decrease recall as lower confidence predictions that match the reference labels are not considered. To calculate Average Precision, predictions are generated for all images and ranked by confidence score. Precision and recall are then calculated for a range of confidence score thresholds, and the resulting precision and recall values are averaged to provide an Average Precision and Average Recall for an entire validation or test dataset. This process is repeated for IoU values ranging from .5 to .95, and the results are then averaged to report an average of averages, commonly referred to as AP:.5-.95 and AR:.5-.95; henceforth, we will refer to these simply as AP and AR. AP and AR are also calculated for specific object size ranges to evaluate the model's ability to generate high confidence predictions across different field sizes. The field size ranges used for both the FCIS and Mask R-CNN models are as follows: Small is 0 – 0.43 km^2, Medium is 0.43 – 0.52 km^2, and Large is >0.52 km^2.

These metrics are not sufficient by themselves to resolve whether the Mask R-CNN or FCIS model performed better on the Nebraska dataset. Therefore, to understand this aspect of performance, I inspected multiple examples visually, stratifying by qualitative landscape complexity. I plotted and compared high confidence detection results from Mask R-CNN and the FCIS model in order to evaluate the most accurate predictions from each model, those most likely to be used in subsequent analysis. While these metrics illustrate the models' ability to roughly map an object's boundary, given that the minimum criterion for an object detection is an IoU of 50%, it is also informative to inspect visual differences in how well object boundaries match the reference labels. For comparisons 1 and 3, examples were chosen across scenes of varying landscape complexity in order to provide a more comprehensive survey of the CNN models' ability to map fields using relatively coarse resolution satellite imagery. This analysis primarily focuses on examples where the model fails to correctly detect fields with high confidence. I also evaluated the distribution of likelihood scores associated with predictions for each example image to determine whether there were differences in how confident model predictions were across difficult to classify scenes.
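As a concrete illustration of the AP:.5-.95 averaging described above, the following is a minimal numerical sketch. The toy detection scores, match flags, and the simple step-wise integration of the precision-recall curve are assumptions and do not reproduce the exact COCO-style interpolation used by the evaluation tooling.

```python
# Minimal sketch of AP:.5-.95: compute AP from confidence-ranked detections at
# each IoU threshold from 0.5 to 0.95, then average. Toy data and a simple
# step integration are assumptions, not the exact COCO procedure.
import numpy as np

def average_precision(scores, matched, n_reference):
    """AP at a single IoU threshold from confidence-ranked detections."""
    order = np.argsort(-scores)                 # rank detections by confidence
    tp = np.cumsum(matched[order])
    fp = np.cumsum(~matched[order])
    precision = tp / (tp + fp)
    recall = tp / n_reference
    # step-wise area under the precision-recall curve
    return float(np.sum(np.diff(np.concatenate(([0.0], recall))) * precision))

# Toy example: 4 detections against 3 reference fields. matched[i, j] records
# whether detection j overlaps a reference field at IoU threshold i.
scores = np.array([0.95, 0.90, 0.70, 0.40])
iou_thresholds = np.arange(0.50, 1.00, 0.05)
matched = np.array([[True, True, t < 0.75, False] for t in iou_thresholds])

ap = np.mean([average_precision(scores, m, n_reference=3) for m in matched])
print(f"AP:.5-.95 = {ap:.3f}")
```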
In each boxplot of detection likelihood scores, the red line represents the median of all confidence scores for detections in the small, medium, and large categories, while the red triangle represents the mean across all detections. The distribution shown is for all detections made for a particular scene, including center pivots that did not meet the 90% confidence threshold and were therefore not displayed. Figures 10 and 12 highlight that the FCIS model generates boundaries that are eccentric relative to both the reference label boundaries and the actual field boundaries in the image.
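For reference, the boxplot convention used here (a red line for the median and a red triangle for the mean) can be reproduced in matplotlib roughly as follows; the confidence scores below are random placeholders rather than model output.

```python
# Sketch of the confidence-score boxplot convention described above: red
# median line, red triangle for the mean. Scores are random placeholders.
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
scores_by_size = {"small": rng.uniform(0.3, 1.0, 40),
                  "medium": rng.uniform(0.5, 1.0, 60),
                  "large": rng.uniform(0.6, 1.0, 25)}

fig, ax = plt.subplots()
ax.boxplot(list(scores_by_size.values()), labels=list(scores_by_size.keys()),
           showmeans=True,
           medianprops={"color": "red"},
           meanprops={"marker": "^", "markerfacecolor": "red",
                      "markeredgecolor": "red"})
ax.set_ylabel("detection confidence score")
plt.show()
```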