Data considerations for developing deep learning models for dairy applications: A simulation study on mastitis detection
2022
Naqvi, S Ali | King, Meagan T.M. | DeVries, Trevor J. | Barkema, Herman W. | Deardon, Rob
With growing adoption of precision dairy technologies, the use of big data is becoming increasingly common in the dairy industry. The speed at which data are generated has led to increased interest in developing detection and predictive models for animal health and disease events using real-time records. When combining data from multiple sources, statistical methods exist to account for the underlying heterogeneity in data collected from commercial farms, although the impact of this heterogeneity on predictive models is not known. We investigated how 4 issues commonly seen in these large datasets affect the performance of deep recurrent neural networks (RNNs) trained to detect the onset of clinical mastitis (CM) in dairy cows. Data were simulated by first sampling from real-world data and adding noise, then defining the association between predictor variables and CM while incorporating parameters to reflect underlying heterogeneity: 1) random effects to reflect unmeasured variability at the farm level (3 levels – none, moderate, high); 2) random effects to reflect unmeasured variability at the cow level (3 levels – none, moderate, high); 3) missed recording of CM cases (3 false-negative rates – 0.10, 0.25, 0.50); and 4) incomplete observations due to certain farms not having a somatic cell count sensor (SCC data missing vs SCC data included). At baseline (moderate farm and cow random effects; moderate misclassification; 42% of herds with an SCC sensor), the model achieved a sensitivity and specificity of 86% and 90%, respectively. Higher levels of unmeasured variability at the farm and cow levels resulted in reduced model performance (sensitivity and specificity of 76% and 85% at the highest levels), indicating that data collection and feature selection should be informed by previous knowledge of the associations between the outcome and predictors when possible, and that model performance may be limited when predictors are selected only from routinely collected data.
However, even when 50% of CM cases were incorrectly recorded as CM-negative, model performance did not decrease, demonstrating that deep RNNs are robust to the level of misclassification that would typically be encountered in dairy datasets. RNNs were also able to accurately detect CM onset even when a highly predictive variable, somatic cell count, was excluded from training and test data, but the models took longer to train. The effect of unmeasured variability on model performance shows that predictors for RNNs should be selected with care, whereas RNNs appear to be very robust to misclassification in training data as well as to missing variables. Researchers developing studies using deep learning should therefore focus their attention more on predictor selection than on reducing errors in, or standardizing, outcome recording, since RNNs appear to be robust to the latter while being more strongly affected by the former.
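The four sources of heterogeneity described in the abstract can be illustrated with a minimal simulation sketch. This is not the authors' code: the function names, the two standardized predictors, the logistic link, and all coefficient values are assumptions chosen for illustration, and the sketch is cross-sectional (one observation per cow), whereas the study used time-series records suited to RNNs.

```python
import numpy as np


def simulate_cm(n_farms=20, cows_per_farm=50, farm_sd=0.5, cow_sd=0.5,
                false_negative_rate=0.25, scc_sensor_share=0.42, seed=0):
    """Simulate one observation per cow (hypothetical, simplified setup)."""
    rng = np.random.default_rng(seed)
    n = n_farms * cows_per_farm
    farm_id = np.repeat(np.arange(n_farms), cows_per_farm)

    # Issues 1 and 2: unmeasured variability as random intercepts,
    # one per farm (shared by its cows) and one per cow.
    farm_effect = rng.normal(0.0, farm_sd, n_farms)[farm_id]
    cow_effect = rng.normal(0.0, cow_sd, n)

    # Hypothetical standardized predictors (assumed, not from the paper).
    scc = rng.normal(size=n)
    milk_yield = rng.normal(size=n)

    # Assumed logistic association between predictors and CM.
    logit = -2.0 + 1.5 * scc - 0.8 * milk_yield + farm_effect + cow_effect
    p_cm = 1.0 / (1.0 + np.exp(-logit))
    true_cm = rng.random(n) < p_cm

    # Issue 3: missed recording. Each true case is recorded as
    # CM-negative with probability false_negative_rate.
    missed = rng.random(n) < false_negative_rate
    recorded_cm = true_cm & ~missed

    # Issue 4: farms without an SCC sensor have no SCC values at all.
    has_sensor = rng.random(n_farms) < scc_sensor_share
    scc_observed = np.where(has_sensor[farm_id], scc, np.nan)

    return {
        "farm_id": farm_id,
        "scc_observed": scc_observed,
        "milk_yield": milk_yield,
        "true_cm": true_cm,
        "recorded_cm": recorded_cm,
    }


def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = np.sum(y_true & y_pred)
    fn = np.sum(y_true & ~y_pred)
    tn = np.sum(~y_true & ~y_pred)
    fp = np.sum(~y_true & y_pred)
    return tp / (tp + fn), tn / (tn + fp)
```

In a setup like this, a model would be trained on `recorded_cm` and could be scored with `sensitivity_specificity` against `true_cm`, which is only possible because the data are simulated and the true status is known.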
Bibliographic information
This bibliographic record has been provided by National Agricultural Library