Deep learning classification method for boar sperm morphology analysis

Abstract

Background

Boar semen quality assessment emphasizes three major criteria: sperm concentration, motility, and morphology. Quick, objective methods to analyze concentration and motility readily exist, but few exist for analyzing morphology beyond subjective manual counting. Other factors vital for fertilization, such as acrosome health, lack efficient detection methods: they are difficult to detect by eye, and biomarker analysis is costly and therefore rarely used in semen diagnostics.

Objective

To overcome these challenges, we propose a novel approach integrating deep-learning technology with high-throughput image-based flow cytometry (IBFC) for objective and accurate analysis of both morphology and label-free acrosome health of thousands of individual spermatozoa at once, as opposed to manually counting on a microscope slide.

Materials and methods

Images of 10,000 spermatozoa were captured using an IBFC and manually annotated by primary morphological defect or acrosome health status to train a convolutional neural network (CNN). The CNN was trained on these images and then applied to unannotated images to estimate model accuracy.

Results

The CNNs attained high F1 scores for morphological classification: 96.73%, 98.55%, and 99.31% at 20x, 40x, and 60x magnification, respectively. Additionally, the model detected subtle acrosome health variations with an F1 score of 99.8% at the 60x magnification.

Discussion and conclusions

We have established an integrated approach to rapidly collect and classify morphological defects and acrosome health status, without the use of manual counting or biomarker labeling. Our study underscores the potential of artificial intelligence in semen diagnostics, reducing technician variability, streamlining assays, and facilitating the development of additional label-free detection methods. This innovative approach addresses the barriers hindering biomarker adoption in semen analysis, offering a promising avenue for enhancing reproductive health assessments.

1 INTRODUCTION

Sperm morphology has been used as a tool to diagnose male fertility for decades, alongside motility and concentration. It is well understood that a high percentage of proximal and distal retained cytoplasmic droplets in boar semen is negatively correlated with both pregnancy rate and litter size.1 Proper sperm morphology is also essential for the spermatozoon's ability to move through the female reproductive tract,2 bind the oviductal sperm reservoir,3 and navigate to the oocyte.4

Historically, sperm morphology analysis has been performed manually using slide microscopy and various bright-field whole-cell stains, such as hematoxylin and eosin. Ideally, this analysis would take place at 100x magnification, a luxury for most semen diagnostic laboratories, so most analyses are performed at around 60x magnification. This labor-intensive task is highly susceptible to the subjectivity of the technicians observing and recording the parameters, and how liberal or strict the criteria for morphological classification are can vary greatly between laboratories, making comparisons difficult. Additionally, prolonged and repetitive microscope use can lead to neck and vision problems5, 6 and compromise workplace ergonomics. The implementation of computer-assisted semen analysis (CASA) in fertility laboratories has reduced the variation in results caused by technician subjectivity7 but remains too expensive for widespread use in livestock semen diagnostic laboratories. While CASA systems are effective for motility and a limited set of morphological defect classes, there is a need for models that can detect a wider range of morphological defects at higher throughput.

One such high throughput method that could facilitate this is image-based flow cytometry (IBFC). Image-based flow cytometry combines the fluorometric capabilities of traditional flow cytometry with high-speed, single-cell imaging. Such techniques offer high-throughput analysis of sperm quality8, 9 and provide an avenue to increase throughput without the need for manual sperm assessment and imaging under a microscope. Consequently, deep learning models, such as convolutional neural networks (CNNs), pair exceptionally well with IBFC and offer a novel approach to detecting and classifying many morphological defects within a sample at once without human intervention.

CNNs trace their origins to 1943, when McCulloch and Pitts laid the groundwork for artificial neural networks with their logical calculus of nervous activity.10 In the 1980s, Fukushima conducted pioneering research using algorithms for basic image recognition,11 which LeCun further advanced, particularly in automated zip code recognition.12 Since their inception, CNNs have become pivotal tools in image analysis and pattern recognition. Today, these deep learning models find extensive application across domains including facial recognition on social media platforms, surveillance,13 emotion recognition,14 and the analysis of diagnostic images such as magnetic resonance imaging (MRI),15 computed tomography,16 histology,17 and numerous other fields.

In the human reproductive field, semen quality is typically lower than in animals, with the World Health Organization (WHO) specifying anything above 4% normal morphology as indicative of good fertility, making the meticulous counting of morphological defects a labor-intensive task.7, 18 Given this complexity, deep learning-based models have become a popular approach, with several promising models already existing. In machine and deep learning, precision is the fraction of events predicted as a class that truly belong to it, recall is the fraction of events truly in a class that the model correctly identifies, and F1 is the harmonic mean of precision and recall, providing a single balanced score. Iqbal et al. proposed a CNN model that could distinguish between 11 different human spermatozoa head defects with a recall of 88%.19 Similarly, in 2021, Yüzkat et al. employed six different CNN models over three publicly accessible human spermatozoa image databases, with F1 scores varying from 28% to 95% across 12 morphological categories.20 Numerous other human models exist with similar results, achieving F1 scores from 55% to 98% using multiple algorithms and deep learning models to find the best approach.21-26 Many of these studies utilized publicly available datasets of images obtained through conventional microscopy. Some of these datasets used morphology-stained sperm cells while others were unstained gray-scale images, but all contained fewer than 2000 unique images. Even andrology experts classifying defects for these datasets faced challenges in reaching unanimous classifications due to the inherently low resolution and quality of images obtained by standard microscopy.27 This highlights not only the subjectivity of morphology classification but also the effect of poor image quality and the time-consuming nature of capturing enough images to train such a model without the assistance of a high-throughput method.

Furthermore, utilization of high throughput methods for phenotypic screening of sperm characteristics (such as biomarkers) that can be associated with fertility can provide a greater understanding of how these intricate relationships between sperm characteristics and morphology affect fertility outcomes.8 Traditional flow cytometry is already widely used to detect plasma membrane integration, DNA damage, acrosome health, reactive oxygen species, and more.28 By employing IBFC, the localization of zinc ions (a sperm fertility biomarker) elucidated distinct patterns, classified as zinc signatures, associated with the capacitation status of spermatozoa in boar, bull, and man.8 However, the applications of IBFC and deep learning for comprehensive assessment of boar sperm morphology and acrosome status have yet to be sufficiently explored.

The aim of the present study was to establish a deep-learning workflow for boar sperm. Emphasis was placed on the following major morphological classes: normal sperm morphology, proximal retained cytoplasmic droplet (PCD), distal retained cytoplasmic droplet (DCD), distal midpiece reflex (DMR), and coiled tail (Figure 1). These four abnormal morphologies make up 63% of common sperm abnormalities in boar.29 Henning et al. (2021) reported that boar spermatozoa samples with >15% retained cytoplasmic droplets have a significantly decreased response to the sperm capacitation inducers bicarbonate and calcium. This is important, as sperm capacitation is the final maturation event, endowing spermatozoa with the capacity to fertilize the oocyte.30, 31 Additionally, cytoplasmic droplets reduce the shelf life of fresh boar semen, which affects the profitability of using extended semen over multiple days.29 Of the remaining 37% of defects not assessed by our morphological model, 29% are acrosomal defects and the other 7% are head abnormalities.29 Currently, the best way to detect acrosomal defects is with biomarkers such as fluorochrome-conjugated PNA lectin, which makes analyzing acrosomal defects during routine semen analysis uncommon due to the demanding nature of biomarkers and the expensive instruments needed to detect them. Acrosomal defects are therefore a perfect candidate for label-free detection: high-throughput IBFC paired with deep-learning CNN models can streamline semen analysis and add an extra layer of defect detection to current methods of fertility diagnostics.

FIGURE 1 Bright-field images acquired by image-based flow cytometry (IBFC) at 60x magnification of boar spermatozoa, representative of the manually annotated populations for each of the five morphological categories used: normal, proximal cytoplasmic droplet (PCD), distal cytoplasmic droplet (DCD), distal midpiece reflex (DMR), and coiled tail.

For this, we focused on determining the best microscope objective for this type of detection as well as the sensitivity of the model to detect both severe and moderate acrosomal defects. In this study, we coupled IBFC, for high throughput image collection, with Amnis AI (AAI) software (AMNIS Cytek Biosciences) to establish a workflow for boar sperm morphology classification and label-free detection of acrosome health. These methods illustrate the promising implications of coupling these two technologies for increased morphological classifications and fertility diagnostics by label-free biomarker detection.

2 METHODS AND MATERIALS

2.1 Ethics statement

This study involved the use of boar sperm samples which were obtained in collaboration with industry partners. The samples were not collected specifically for scientific research but were excess materials from standard production processes. The collection and handling of these samples adhered to standard industry protocols, designed to ensure ethical and humane treatment of the animals. Given the nature of the sample collection, being a byproduct of routine production activities, and its alignment with standard agricultural practices, this study was exempt from Iowa State University Institutional Animal Care and Use Committee (IACUC) oversight. Nonetheless, all procedures were conducted in strict accordance with relevant guidelines and regulations to ensure ethical integrity and animal welfare.

2.2 Semen collection and processing

Semen from 10 boars with varying morphology was used in the study. Semen was collected, extended, and delivered by a private boar stud following their established standard operating procedures. Upon delivery, semen was fixed in 2% formaldehyde for 40 min at room temperature, washed in PBS, and stored at +4°C until image acquisition.

2.3 IBFC configuration and acquisition

Images were collected using the ImageStreamX Mark II (AMNIS Cytek Biosciences) IBFC, fitted with 20x, 40x, and 60x objective lens magnifications and a 647 nm laser. Images used for neural network training were acquired at all three magnifications. A minimum of 10,000 spermatozoa event images were collected per sample, with brightfield images acquired in channel 1 and conjugated PNA lectin images acquired in channel 5.

2.4 Sample preparation and labeling

Spermatozoa were washed of seminal fluids and sustaining media by centrifugation at 500 × g for 4 min. Lectin PNA derived from Arachis hypogaea (peanut) conjugated to Alexa Fluor 647 (Thermo Fisher, cat# L32460) was added to 100 µL of sample (about 5 million spermatozoa) at a 1:2000 dilution from the stock concentration (1 mg/mL), for a final concentration of 0.5 µg/mL. The fluorescent label was incubated with the cells for 30 min at room temperature before the cells were washed and resuspended in PBS for IBFC analysis.
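The dilution arithmetic above (1 mg/mL stock at 1:2000 yields 0.5 µg/mL) can be checked with a one-line helper; the function name is ours, for illustration only:

```python
def final_concentration_ug_per_ml(stock_mg_per_ml: float, dilution: int) -> float:
    """Convert a stock concentration (mg/mL) through a 1:dilution step
    into a final concentration in µg/mL (1 mg = 1000 µg)."""
    return stock_mg_per_ml * 1000 / dilution
```

For the labeling step described here, `final_concentration_ug_per_ml(1.0, 2000)` gives 0.5 µg/mL, matching the stated final concentration.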

2.5 IBFC IDEAS data analysis

Images were initially gated for those in focus and containing a single sperm cell (Figure S1A,B), similar to previous work.8 Only cells anteriorly/posteriorly aligned to the camera (as opposed to laterally aligned) were included; cells oriented both head up and head down were retained. Bright-field images were manually annotated for morphology classification in Amnis IDEAS version 6.3 software (AMNIS Cytek Biosciences). Images used for label-free analysis were first gated for PNA intensity (Figure S1C). Each annotated class had 2000 images for training and testing. Annotated populations were exported to a new Data Analysis File (.daf) for each category of morphological defect or PNA intensity and imported into Amnis AI Software (AMNIS Cytek Biosciences).

2.6 Amnis AI software description

Amnis AI software version 2.0.7 was used for deep learning analysis. This software uses the Keras Application Programming Interface (API) version 2.1.5,32 with the TensorFlow version 1.7.0 library,33 to train models on ground truth input data and to apply trained models to classify new data. The CNN architecture used in the AI software is based on the VGG16 network.34 Pixel values of all imported images are normalized to the range [0, 1], and the image size was set to 350 × 350 pixels to accommodate the largest images expected for the application. Truth data are split into training, validation, and test sets using an 80/10/10 ratio. The software utilizes data augmentation35 and class balancing to control for classification bias and enhance the robustness of the trained model, and it automatically calculates the necessary number of epochs based on incident numbers (AMNIS Cytek Biosciences). All computations were run on a custom-built deep-learning computer composed of a Ryzen 9 3900X CPU (12 cores/24 threads), an Nvidia RTX 2080 Super graphics card, 64 GB of random-access memory, and a 1 TB non-volatile memory express data drive, with Windows 10 as the host operating system.
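The preprocessing steps described above (pixel normalization to [0, 1], a fixed 350 × 350 image size, and the 80/10/10 split) can be sketched in plain NumPy. This is an illustrative reconstruction, not Amnis AI's internal code; in particular, centering smaller images on a zero-padded canvas is our assumption about how images reach the fixed size:

```python
import numpy as np

def preprocess(image: np.ndarray, target: int = 350) -> np.ndarray:
    """Min-max normalize pixel values to [0, 1] and center the image
    on a zero-padded target x target canvas (350 px, the size set to
    accommodate the largest expected images)."""
    img = image.astype(np.float64)
    img = (img - img.min()) / max(img.max() - img.min(), 1e-12)
    canvas = np.zeros((target, target))
    h, w = img.shape
    top, left = (target - h) // 2, (target - w) // 2
    canvas[top:top + h, left:left + w] = img
    return canvas

def split_80_10_10(n_images: int, seed: int = 0):
    """Shuffle image indices and split them into training, validation,
    and test sets using the 80/10/10 ratio described for Amnis AI."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_images)
    n_train = int(0.8 * n_images)
    n_val = int(0.1 * n_images)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]
```

For a truth population of 2000 images per class, this split yields 1600 training, 200 validation, and 200 test images per class.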

2.7 AI model training

The manually annotated images were used to train the CNN classifier models using Amnis AI software. For the morphology models, five classes were created: normal, PCD, DCD, DMR, and coiled tail. Classifications were performed using bright field channel 1. Truth populations for each class contained 2000 images for a total of 10,000 images.

For the label-free acrosome health models, images were gated into three populations based on fluorescence intensity: high intensity, denoted PNA double positive (PNA++), indicating extreme acrosome defects; moderate intensity, denoted moderate PNA, indicating moderate acrosome defects; and no intensity, denoted PNA negative (PNA-), indicating no acrosome defects. The models were trained on two classes: PNA+ and PNA-. Unless otherwise noted, images used for the deep learning PNA+ class were obtained from cells with both moderate and extreme acrosome defects (Figure S1). Classifications were performed using bright-field channel 1; no fluorescence channels were used in training. Truth populations for each class in the initial 20x, 40x, and 60x models contained 2000 images, for a total of 4000 images per model. Truth populations for each class in the expanded 40x model contained 10,000 images, for a total of 20,000 images. Truth populations for each class in the 40x PNA++-only model contained 5000 images, for a total of 10,000 images; images in its PNA+ population were obtained only from cells with extreme acrosome defects.
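The gating-to-class mapping described above can be sketched as follows. The intensity cutoffs are hypothetical placeholders; in the study the gates were set manually in IDEAS (Figure S1C):

```python
def pna_class(intensity: float,
              moderate_cut: float = 1e3,
              high_cut: float = 1e4) -> str:
    """Assign an acrosome-health population from PNA fluorescence
    intensity. Cutoff values here are illustrative placeholders, not
    the gates used in the study."""
    if intensity >= high_cut:
        return "PNA++"          # extreme acrosome defect
    if intensity >= moderate_cut:
        return "moderate PNA"   # moderate acrosome defect
    return "PNA-"               # no acrosome defect

def training_label(population: str) -> str:
    """Collapse the three gated populations into the two classes used
    to train the CNN: PNA+ (moderate + extreme defects) vs. PNA-."""
    return "PNA+" if population in ("PNA++", "moderate PNA") else "PNA-"
```

For the PNA++-only model described above, one would instead build the PNA+ class solely from images in the `"PNA++"` population.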

2.8 Performance metrics

Amnis AI software automatically calculates statistical output for precision, recall, and F1. It generates accuracy matrix tables and confusion matrices as presented throughout.

3 RESULTS

3.1 20x/40x/60x morphology models

In this study, accurate, automated detection and classification of five major classes of boar sperm morphology was achieved. The results show that thousands of images can be classified accurately, free of subjectivity and technician fatigue, by coupling IBFC with deep-learning CNN models.

The workflow described below is summarized in Figure 2. Three morphology models were trained using images collected by an ImageStreamX Mark II (AMNIS Cytek Biosciences) IBFC with three objective lens magnifications: 20x, 40x, and 60x. Each model was trained with AAI software (AMNIS Cytek Biosciences) using the same five morphological categories: normal, PCD, DCD, DMR, and coiled tail. The same number of images for each morphology category was used to train each model (n = 2000/category) to isolate the objective lens as the experimental variable.

FIGURE 2 Workflow summary for acquisition and processing of images for use in convolutional neural network (CNN) model training.
For each morphology model, a total of 10,000 truth-defined images were used to train, validate, and test the model using AAI. The AAI software provided accuracy statistics evaluating model performance and efficiency and included precision, recall, and F1, all of which are commonly used metrics for machine learning. These are defined below:

True positives (TP) and true negatives (TN) represent objects whose predicted class matches the manually tagged truth, while false positives (FP) and false negatives (FN) represent objects incorrectly predicted by the model. From these counts, precision = TP/(TP + FP), recall = TP/(TP + FN), and F1 = 2 × (precision × recall)/(precision + recall).
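These metrics can be written as a short sketch; the check below reproduces the 20x normal-class F1 in Table 1 (93.47%) from its reported precision and recall:

```python
def precision_recall(tp: int, fp: int, fn: int):
    """Precision = TP/(TP+FP): of the images predicted as a class, how
    many truly belong to it. Recall = TP/(TP+FN): of the images truly
    in a class, how many the model recovered."""
    return tp / (tp + fp), tp / (tp + fn)

def f1_score(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def weighted_average(values, supports):
    """Support-weighted average, as reported in the tables; with equal
    class supports it reduces to the plain mean."""
    total = sum(supports)
    return sum(v * s for v, s in zip(values, supports)) / total
```

For example, `f1_score(94.62, 92.35)` evaluates to about 93.47, matching the normal class in the 20x block of Table 1.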

For training, validation, and testing, accuracy metrics are given for each category (Table S1). The weighted average F1 of each model (Table 1) is the metric designated as each model’s overall performance.

TABLE 1. Accuracy metrics for each class for the 20x, 40x, and 60x objective magnifications.
20x
Model class Precision (%) Recall (%) F1 (%) Support
Normal 94.62 92.35 93.47 2000
PCD 92.44 96.55 94.45 2000
DCD 98.93 97.05 97.98 2000
DMR 98.4 98.3 98.35 2000
Coiled Tails 99.4 99.35 99.37 2000
Weighted average 96.76 96.72 96.73 10,000
40x
Model class Precision (%) Recall (%) F1 (%) Support
Normal 97.5 95.7 96.59 2000
PCD 95.98 98 96.98 2000
DCD 99.35 99.25 99.25 2000
DMR 99.95 100 99.98 2000
Coiled Tails 100 99.9 99.95 2000
Weighted average 98.56 98.55 98.55 10,000
60x
Model class Precision (%) Recall (%) F1 (%) Support
Normal 99.54 97.1 98.3 2000
PCD 97.74 99.6 98.66 2000
DCD 99.3 99.85 99.58 2000
DMR 100 100 100 2000
Coiled Tails 100 100 100 2000
Weighted average 99.32 99.31 99.31 10,000

Due to the image quality (number of pixels per micron) at each objective (Figure 3), the 60x objective was expected to produce the most accurate model, which was reflected in the models' weighted average F1 scores (%) of 96.73, 98.55, and 99.31 for 20x, 40x, and 60x, respectively (Table 1). By comparison, a human technician given the same datasets for each objective would be expected to classify each image correctly 95% of the time, as defined by common industry metrics, making an F1 of >95% equal to or better than what is considered humanly acceptable.

FIGURE 3 Comparison of the quality of images acquired at the 20x, 40x, and 60x objective magnifications using image-based flow cytometry (IBFC) for boar spermatozoa, for each morphology classification used in the morphology models: normal, proximal cytoplasmic droplet (PCD), distal cytoplasmic droplet (DCD), distal midpiece reflex (DMR), and coiled tail. Images were acquired in bright-field channel 1.

Each model was trained independently, ensuring no additive learning between the models for each objective. F1 scores for each category, across all objectives, ranged from 93.47% to 100% (Table 1). As predicted, the 20x model had the lowest F1 for every category, ranging from 93.47% to 99.37%, while the 40x F1 scores ranged from 96.59% to 99.98% and the 60x scores from 98.3% to 100%. The 40x and 60x models exceed the 95% standard and are excellent candidates for further training and classification on larger datasets that include additional morphological defects.

For each model, a normalized confusion matrix (Figure 4) was produced, as well as a truth versus predicted count (Figure 5A). The primary constraint for all models was distinguishing between spermatozoa with proximal cytoplasmic droplets and normal spermatozoa. Presumably, this distinction is challenging because some droplets sit close to the post-acrosomal sheath region at the base of the head, making them less pronounced than droplets situated farther from the head. The models tend to underpredict normal spermatozoa and overpredict PCD-containing spermatozoa, as seen in Figure 5A (exact numbers in Table S2). As anticipated, increased image quality (pixel number) from 20x to 60x removes much of the confusion between the PCD and normal categories, with respective recalls of 96.5% and 92.3% at 20x, 98% and 95.7% at 40x, and 99.6% and 97.1% at 60x (Figure 4). In contrast, this pattern is not seen in the actual number of images underpredicted for the normal category (Figure 5A): 48 (20x), 37 (40x), and 49 (60x) individual cells, an increase in underpredicted normal images at 60x compared with 20x. Whether this discrepancy is due to the AAI model or the training set requires further investigation.
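Row normalization and the over/underprediction counts discussed above can be illustrated with a toy two-class confusion matrix; the counts below are made up for illustration and are not the study's values:

```python
import numpy as np

def normalize_confusion(cm: np.ndarray) -> np.ndarray:
    """Row-normalize a confusion matrix so each row (true class) sums
    to 1; the diagonal then gives per-class recall, as in the
    normalized matrices of Figure 4."""
    return cm / cm.sum(axis=1, keepdims=True)

def over_under_prediction(cm: np.ndarray) -> np.ndarray:
    """Predicted count minus true count per class (column sums minus
    row sums); negative values mean the class is underpredicted."""
    return cm.sum(axis=0) - cm.sum(axis=1)
```

With a toy matrix `[[1930, 70], [20, 1980]]` (rows = true normal/PCD, columns = predicted), the normalized diagonal gives recalls of 0.965 and 0.99, and the difference vector shows the first class underpredicted by 50 images and the second overpredicted by 50.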

FIGURE 4 Normalized confusion matrices given by Amnis AI (AAI) for the 20x, 40x, and 60x magnification training models, detailing the performance for each morphology category within each model. Within these matrices, deep red and light cream indicate reduced confusion, while shades of orange show higher levels of confusion. The number in each cell represents the percentage of images classified within those parameters.
FIGURE 5 True versus predicted counts provided by Amnis AI (AAI) for the 20x, 40x, and 60x magnification convolutional neural network (CNN) training models, detailing the performance of each model. (A) Results from the morphology models and (B) results from the PNA label-free models. Green bars represent data from the 20x models, blue bars represent data from the 40x models, yellow bars represent data from the 60x models, and grey bars represent the truth population for all the models.

3.2 20x/40x/60x label free acrosomal health model

For the label-free detection of acrosomal health, five total models were trained using a similar workflow (Figure 2). Sperm cells were stained with lectin PNA conjugated to Alexa Fluor 647 for acrosome health detection. Bright-field and PNA-intensity images were collected using an ImageStreamX Mark II IBFC at the 20x, 40x, and 60x objective magnifications. Examples of these images across the three objectives are shown in Figure 6. Flow cytometry data for PNA fluorescent intensity were analyzed using IDEAS (AMNIS Cytek Biosciences) and broken down into three populations based on PNA intensity: no/low PNA, moderate PNA, and high PNA (Figure S1C). These three populations relate to the severity of acrosomal defects, from absent to extreme. Two categories of images were created from these populations: PNA- (images from the no/low PNA population only) and PNA+ (images from the moderate and high PNA populations).

FIGURE 6 Comparison of the quality of images acquired at the 20x, 40x, and 60x objective magnifications using image-based flow cytometry (IBFC) for boar spermatozoa classified as PNA negative (PNA-) or PNA positive (PNA+), as determined by lectin PNA labeling. Lectin PNA (magenta) binds to damaged/ruffled acrosomes but not intact ones in live boar spermatozoa. Images were acquired in bright-field channel 1 and fluorescent channel 5 and merged for visualization.

Initially, three models were trained at the 20x, 40x, and 60x objective magnifications using truth-defined images (2000 per class; Table 2). Each model was trained independently to avoid additive learning. Again, accuracy statistics were provided by the AAI software, and the weighted average F1 is the metric designated for each model's overall performance. Similar to the morphological models, the lowest F1 came from the 20x model (64.01%), with an increase at 40x (68.14%) and the highest at 60x (99.8%) (Table 2). Precision, recall, and F1 scores for all training, validation, and testing classes are in Table S3. The increase in predictability observed in the morphological models from 20x to 40x was not mirrored by a similar rise in these models' F1 scores. To explore this, two additional models were trained, designed to investigate whether the diminished F1 was due to the model's inability to detect moderate acrosomal defects or to the size of the training dataset. The fourth model increased the number of images in the training dataset from 4000 to 20,000, and the fifth model used only images from the population of spermatozoa with extreme acrosomal defects. The F1 scores for the fourth and fifth models were 66.07% and 82.27%, respectively (Table 2), suggesting that at the 40x objective, the ability to detect moderate acrosomal abnormalities is limited.

TABLE 2. Accuracy metrics for each class for all label-free acrosome health convolutional neural network (CNN) models.
20x, 2000 cells
Model class Precision (%) Recall (%) F1 (%) Support
PNA + 61.92 74.8 67.75 2000
PNA – 68.18 54 60.27 2000
Weighted average 65.05 64.4 64.01 4000
40x, 2000 cells
Model class Precision (%) Recall (%) F1 (%) Support
PNA + 63.93 88.8 74.34 2000
PNA – 81.67 49.9 61.95 2000
Weighted average 72.8 69.35 68.14 4000
60x, 2000 cells
Model class Precision (%) Recall (%) F1 (%) Support
PNA + 100 100 100 2000
PNA – 100 100 100 2000
Weighted average 100 100 100 4000
40x, 10,000 cells
Model class Precision (%) Recall (%) F1 (%) Support
PNA + 61.94 99.47 76.35 10,000
PNA – 98.66 38.89 55.79 10,000
Weighted average 80.3 69.18 66.07 20,000
40x, PNA++ only
Model class Precision (%) Recall (%) F1 (%) Support
PNA + 89 73.94 80.77 5000
PNA – 77.71 90.86 83.77 5000
Weighted average 83.35 82.4 82.27 10,000

Upon comparing the number of predicted images (Figure 5B; exact numbers in Table S4), the models tend to overpredict the PNA+ category at the 20x and 40x magnifications, indicated by increased counts in the predicted classification compared with truth. Analysis of images falsely predicted as PNA+ suggests that image focusing may be a contributing factor. Normalized confusion matrices are provided for all five models in Figure 7.

FIGURE 7 Normalized confusion matrices given by Amnis AI (AAI) for each training model. (A) Performance of acrosome health predictions across the three magnification levels (20x, 40x, and 60x). (B) Performance at the 40x magnification specifically, showing variations based on the number of images used or the exclusion of moderate acrosome defects from the PNA+ images. Within these matrices, deep red and light cream indicate reduced confusion, while shades of orange show higher levels of confusion. The number in each cell represents the percentage of images classified within those parameters.

4 DISCUSSION

Fertility diagnostics has relied on morphology classification for decades and has primarily been achieved through microscopy performed by trained laboratory technicians and reflected by current WHO health standards. Technician training differs between labs and relies on the subjectivity of the technician performing the diagnostics (see review by Gatimel et al.7). Additionally, prolonged microscope use leads to technician fatigue and neck and vision injuries.5, 6 Technician subjectivity and fatigue can lead to misrecognition of cell characteristics and inaccurate sperm analysis. While CASA systems have become more widely used and alleviate technician subjectivity in motility analysis, these systems have limited ability for morphology analysis, making it essential to develop better methods for unbiased morphological analysis. Advancements in CNNs, which are innately good at pattern recognition, make them an ideal candidate for this type of biological analysis. Furthermore, coupling these technologies to high-throughput image collection methods such as IBFC increases the efficiency of training these models by decreasing the labor needed to image thousands of cells on a microscope.

In boar semen, cytoplasmic droplets and bent or coiled tails comprise 63% of morphological defects, while the remaining 37% are acrosomal or head defects. Some defects, such as cytoplasmic droplets, decrease shelf life and fertilization rates29 at abundances greater than 15%. Therefore, correct morphological analysis in boar semen is vital to achieving the best fertilization rates. For this reason, focus was placed on creating an accurate and robust model for classifying these most common boar sperm defects, including proximal cytoplasmic droplets, distal cytoplasmic droplets, distal midpiece reflexes, and coiled tails. Through examination of three commonly used microscope objective lenses, it was established that all three (20x, 40x, and 60x) were sufficient for classification of morphological defects, with F1 scores (%) of 96.73, 98.55, and 99.31, respectively. An F1 score was deemed sufficient if it was at or above 95%, the typical threshold used for human technician accuracy in morphological analysis. Greater attention should be given to the 40x and 60x magnifications, though, as they provide increased accuracy compared with images acquired at the 20x magnification. It should be noted that while this model performed well on the chosen morphology classes, supervised learning such as this can only classify the defects on which it is trained. Therefore, future studies should include additional defects to capture a higher percentage of all morphological defects in boar spermatozoa.

Conventional diagnostic approaches predominantly rely on biomarker labeling to classify the remaining 37% of acrosomal and head defects, as discerning subtle differences in acrosomal membranes through bright-field microscopy is inherently challenging for the human eye. However, biomarker labeling has several constraints. First, its implementation can be financially burdensome due to the cost of acquiring the instrumentation and the per-sample reagent expenses. Additionally, the preparation and analysis of samples in biomarker labeling assays are time-intensive, typically ranging from 2 to 8 h per assay, in stark contrast to label-free methodologies, which offer results within seconds. Moreover, chemicals such as DMSO used for biomarker reconstitution and downstream labeling pose cytotoxicity concerns, potentially distorting cellular health metrics and compromising the suitability of cells for subsequent use in intracytoplasmic sperm injection.36 Collectively, these limitations render the incorporation of biomarkers into routine diagnostic procedures within andrology laboratories economically unfeasible.

Nonetheless, recent advances in deep learning demonstrate promising prospects for achieving high precision, recall, and F1 scores in spermatozoa classification, surpassing conventional human consistency and accuracy standards. In this study, we achieved 100% precision, recall, and F1 scores for two morphological categories and no less than 97% for the other three. This underscores the potential of deep learning techniques to characterize spermatozoa accurately and efficiently based on their morphological characteristics, particularly in identifying common defects found in boar spermatozoa.
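The per-class metrics reported above follow the standard precision, recall, and F1 definitions. A minimal sketch of how such per-class scores are computed from annotated versus predicted labels is shown below; the class names and label sequences are hypothetical illustrations, not study data.

```python
# Illustrative computation of per-class precision, recall, and F1.
# Labels here are hypothetical examples, not data from this study.

def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, F1) for one class in a multi-class task."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

# Hypothetical annotated (true) and model-predicted labels:
y_true = ["normal", "coiled_tail", "normal", "proximal_droplet", "coiled_tail"]
y_pred = ["normal", "coiled_tail", "coiled_tail", "proximal_droplet", "coiled_tail"]

p, r, f1 = per_class_metrics(y_true, y_pred, "coiled_tail")
# All coiled tails are found (recall = 1.0), but one normal cell is
# misclassified as coiled (precision = 2/3), giving F1 = 0.8.
```

In practice, library routines such as scikit-learn's `precision_recall_fscore_support` perform the same calculation across all classes at once.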

Moreover, the study’s findings shed light on the efficacy of label-free detection methodologies, particularly with respect to the choice of microscopy objective. The 60x objective detects a broad spectrum of acrosomal defects, from moderate to extreme, whereas the 40x objective is more limited, suited predominantly to identifying extreme defects. These insights provide valuable guidance for optimizing diagnostic approaches in andrology laboratories, enhancing accuracy and efficiency in spermatozoa classification.

In conclusion, our study demonstrates the efficacy of leveraging rapidly maturing technologies such as CNNs to improve boar sperm morphology analysis. By coupling IBFC with deep learning, we trained a CNN model capable of accurately detecting both moderate and extreme acrosomal defects in boar spermatozoa, achieving an F1 score of 99.8% at 60x magnification without biomarker labeling. This methodology offers a viable alternative to traditional biomarker labeling techniques, circumventing their drawbacks of expense, time-intensiveness, and potential cytotoxicity.

Moreover, our findings pave the way for deep-learning-based approaches in fertility diagnosis that mitigate variability arising from technician subjectivity and fatigue. Harnessing CNNs, which have proven effective in applications ranging from facial recognition to assessing cellular health, offers a promising avenue for enhancing the consistency and accuracy of fertility assessments. This approach holds significant potential for standardizing diagnostic protocols and streamlining workflows in andrology laboratories, ultimately contributing to improved reproductive health outcomes. Furthermore, our study lays the groundwork for future research aimed at optimizing label-free detection methodologies in boar spermatozoa.

AUTHOR CONTRIBUTIONS

Karl Kerns conceptualized, supervised, and secured funding. Alexandra Keller, McKenna Maus, and Emma Keller carried out experiments. Alexandra Keller analyzed all data and carried out model training. Alexandra Keller wrote the original draft. All authors provided feedback on the manuscript.

ACKNOWLEDGMENTS

This project was supported by Agriculture and Food Research Initiative Competitive Grant no. 2022-67015-36298 from the US Department of Agriculture’s (USDA) National Institute of Food and Agriculture (KK).

CONFLICT OF INTEREST STATEMENT

The authors declare no conflict of interest.

DATA AVAILABILITY STATEMENT

The data underlying this article will be shared on reasonable request to the corresponding author.