Big Data and Machine Learning in Animal Breeding
Dinesh Thekkoot PhD, Genesus Inc
Machine learning (ML) has been a buzzword for the past few years. ML is a subfield of artificial intelligence (AI) dedicated to the study of algorithms that learn from data to predict outcomes. Knowingly or unknowingly, we depend heavily on ML in our day-to-day lives.
Virtual personal assistants like Siri, Alexa and Google Now, personalized news feeds on social media, friend suggestions on Facebook, email spam and malware filtering, and the traffic predictions we see on our GPS are some of the services and technologies we use regularly that are based on ML algorithms.
Even though ML plays a very important role in our day-to-day lives, the application of this technology in animal breeding and production is still in its infancy. Recent developments in technologies like automated feeding and weighing systems, digital imaging, and large-scale genotyping have enabled farmers, breeders, and related industries to continuously monitor and collect a large amount of data (big data) at the animal level at a reasonable cost.
The number of rows and columns in this big data is often so large that it is very difficult to visualize using regular computer programs. Also, this data is often not “clean”, as it can contain missing values, outliers, and unwanted data points.
Another issue, in the case of genetic evaluation, is the amount of data per animal. Most current data analysis methods assume that the number of pieces of data per animal is not large. For example, suppose we have 500 sows. Traditionally, each sow would have a few litter records and a few growth records, i.e. we would have 500 sows with 10 or 15 data points each. With updated data collection technology, however, on each of those 500 sows we may have more than 50,000 pieces of genotype information, several thousand lactation feed intake records, and several thousand farrowing room environment measures, such as temperature and humidity recorded every 5 minutes. So, the same 500 sows will each have thousands (or even millions) of pieces of data. Many statistical methods have been developed to address this problem, but they require extremely large amounts of computer resources. ML has proven to be an efficient method to address all of these issues.
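The "more data points than animals" problem can be made concrete with a small sketch. Everything in it is an assumption made for illustration: the data are simulated, the sizes are scaled down from the 500 sows and 50,000 markers mentioned above, and ridge regression is used only as a simple stand-in for a penalized model, not as the method used in any particular genetic evaluation. The sketch shows two things: with more markers than animals, ordinary least squares has no unique solution, and a penalized model can be solved through an animals-by-animals system instead of a much larger markers-by-markers one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Scaled-down stand-ins for 500 sows x 50,000 markers (hypothetical sizes)
n_animals, n_markers = 50, 2000
X = rng.integers(0, 3, size=(n_animals, n_markers)).astype(float)  # simulated 0/1/2 genotype codes
X -= X.mean(axis=0)                  # centre each marker column
y = rng.normal(size=n_animals)       # simulated phenotype (pure noise, illustration only)

# The rank of X cannot exceed the number of animals, so with more markers
# than animals ordinary least squares has infinitely many solutions.
rank = np.linalg.matrix_rank(X)

lam = 10.0                           # penalty value, chosen arbitrarily for this sketch

# Primal form: solve a (markers x markers) system.
beta_primal = np.linalg.solve(X.T @ X + lam * np.eye(n_markers), X.T @ y)

# Dual form: solve an (animals x animals) system, then map back to marker effects.
alpha = np.linalg.solve(X @ X.T + lam * np.eye(n_animals), y)
beta_dual = X.T @ alpha

same_solution = np.allclose(beta_primal, beta_dual)

# Storage for the square system at full scale, in float64 (8 bytes per entry)
primal_gb = 50_000 ** 2 * 8 / 1e9    # markers x markers
dual_mb = 500 ** 2 * 8 / 1e6         # animals x animals
print(rank, same_solution, primal_gb, dual_mb)
```

At the full scale of the example, the markers-by-markers matrix alone would need about 20 GB of memory, while the animals-by-animals matrix needs about 2 MB. This is one reason why the way a model is formulated matters as much as the amount of data.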
Learning from data is the core principle of machine learning. It aims to choose, from a large pool of data analysis models, the one that predicts the outcomes most accurately. This part is called the training process, and there are two types of training:
- Supervised training (the machine learns from existing examples, such as genotypes and their corresponding phenotypes), and
- Unsupervised training (no prior examples are required, as in the situation where we have only genotypes).
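The difference between the two types of training can be sketched in a few lines. The example below is a minimal illustration under stated assumptions: the genotypes, marker effects and phenotypes are all simulated, ridge regression stands in for a supervised method, and principal component analysis stands in for an unsupervised one. Neither is presented as the approach used in practice.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data (all values hypothetical)
n_animals, n_markers = 200, 50
genotypes = rng.integers(0, 3, size=(n_animals, n_markers)).astype(float)  # 0/1/2 codes
true_effects = rng.normal(scale=0.5, size=n_markers)
phenotypes = genotypes @ true_effects + rng.normal(size=n_animals)

# Supervised: learn from animals with genotypes AND phenotypes,
# then predict phenotypes for animals that have genotypes only.
train, test = np.arange(150), np.arange(150, 200)
marker_means = genotypes[train].mean(axis=0)
X_train = genotypes[train] - marker_means
X_test = genotypes[test] - marker_means
y_mean = phenotypes[train].mean()
lam = 1.0  # penalty value, chosen arbitrarily for this sketch
effects = np.linalg.solve(
    X_train.T @ X_train + lam * np.eye(n_markers),
    X_train.T @ (phenotypes[train] - y_mean),
)
predicted = X_test @ effects + y_mean
accuracy = np.corrcoef(predicted, phenotypes[test])[0, 1]

# Unsupervised: no phenotypes are used; summarise the structure
# in the genotypes alone with principal components (via SVD).
centred = genotypes - genotypes.mean(axis=0)
U, s, Vt = np.linalg.svd(centred, full_matrices=False)
pc_scores = U[:, :2] * s[:2]                  # each animal's position on the first two components
variance_explained = s ** 2 / np.sum(s ** 2)  # share of genotype variance per component

print(round(accuracy, 2), pc_scores.shape)
```

In the supervised part, the quality of the model can be checked by comparing predictions with phenotypes that were held back. In the unsupervised part there is nothing to compare against; the output is a description of the data, for example groups of related animals.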
Some of the applications of machine learning in animal science:
- Genomic prediction: One of the earliest uses of ML in this field was genome-enabled prediction. Initial results show that ML methods performed better than traditional methods when the underlying genetic architecture was complex, that is, when traits were influenced by dominance and/or epistasis (Li et al. (2018) Front. Genet. 9:237).
- Genome-wide association studies (GWAS): Publications have shown that ML methods can be used for conducting GWAS. Also, ML methods have been shown to be more efficient in identifying a subset of SNPs with a direct link to candidate genes (Li et al. (2018) Front. Genet. 9:237).
- Genotype imputation: While genotyping, not all markers will get genotyped due to quality issues, and we will have to predict the missing marker genotypes using a process called imputation. Studies have shown that ML methods have higher accuracy for imputing these missing genotypes (Morota et al. (2018) J. Anim. Sci. 96:1540–1550).
- Phenotype quality check: ML models have been shown to be successful in identifying outliers in the data and can be applied to filter and edit data prior to genetic evaluation (Morota et al. (2018) J. Anim. Sci. 96:1540–1550).
- Image analysis: ML methods can be used to predict body weight from camera images instead of using a weigh scale, which is laborious, time-consuming and stressful for the animals. These methods can also be used to predict carcass composition from on-line camera images in real time.
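As a point of reference for the phenotype quality check item above, the sketch below flags suspicious records with a simple robust rule based on the median and the median absolute deviation (MAD). This is a rule-based baseline and not the ML approach described in the cited study; the function name `flag_outliers`, the cut-off of 3.5 and the example weights are all assumptions made for illustration.

```python
import numpy as np

def flag_outliers(values, threshold=3.5):
    """Return a boolean mask that is True where the modified z-score exceeds the threshold.

    Missing values (NaN) are ignored when computing the median and are never flagged.
    """
    values = np.asarray(values, dtype=float)
    median = np.nanmedian(values)
    mad = np.nanmedian(np.abs(values - median))
    if mad == 0:
        # No spread in the data: nothing can be flagged with this rule.
        return np.zeros(values.shape, dtype=bool)
    modified_z = 0.6745 * (values - median) / mad
    return np.abs(modified_z) > threshold  # comparisons with NaN evaluate to False

# Hypothetical body weights in kg; 1010.0 mimics a misplaced decimal point.
weights = [102.5, 98.0, 105.2, 99.7, 1010.0, 101.3, np.nan, 97.8]
mask = flag_outliers(weights)
print(mask.tolist())
```

With these example weights only the 1010.0 record is flagged. The missing value is left alone, since a missing record is a different problem from an implausible one and would normally be handled in a separate step.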
At Genesus, we generate large amounts of data from individual feed intake machines, from carcass and pork quality programs that date back more than 20 years, and from genotyping many selection candidates per week. These large swaths of data can be classified as big data and have been an integral part of our regular genetic evaluation program, along with our routine growth and reproductive phenotypes.
Currently, we are in the process of investigating predictive ML approaches for analyzing these data more efficiently. All these steps will help to increase the genetic improvement rate and will ultimately benefit Genesus customers.