New Calibration Method for Machine Learning Classifiers Improves Automated Diagnosis of Non-melanoma Skin Cancer

by Christos Evangelou, MSc, PhD – Medical Writer and Editor

The clinical utility of deep learning has been investigated extensively. Indeed, deep learning algorithms have shown promising diagnostic performance in digital pathology applications. However, site-dependent differences in preanalytical variables, such as the type of slide scanner and staining procedure, may lead to the generation of whole slide images (WSIs) with visual presentations that may differ from those of training images.

This variation in the characteristics of training, validation, and test images results in a phenomenon called domain shift, contributing to the reduced diagnostic performance of deep learning when applied to test WSIs generated at external sites (i.e., different from where training WSIs were generated).

In a recent study, researchers developed a new method to ameliorate preanalytic visual variances in WSIs generated at different sites and to improve the generalizability of machine learning classifiers.1

“Our results demonstrated that machine classifiers trained on slide images from the training site can be calibrated and adjusted to new images from the test site using a small number of images corresponding to a different body region,” said Anant Madabhushi, PhD, professor of Biomedical Engineering at Emory University and Georgia Institute of Technology, Research Career Scientist at the Atlanta Veterans Affairs Medical Center and corresponding author of the study.

“The main takeaway from this study is that the machine classifiers can be calibrated to images on the test site so that they are more harmonized in terms of slide appearance. This approach enables the development of more generalizable classifiers that will be clinically useful and deployable,” he added.

The study was a collaborative effort between Case Western Reserve University, Emory University, Georgia Institute of Technology, Cantonal Hospital Aarau, Southern Sun Pathology, and the University of Queensland. The findings of the study were published in Medical Image Analysis.


Study Rationale: Improving the Generalizability of Machine Learning Classifiers

Despite increasing interest in the use of artificial intelligence (AI) in routine pathology practice, a major concern with the use of deep learning algorithms for diagnosis is their sensitivity to preanalytic variations across different sites. These variations lead to batch artifacts and reduce the reliability of AI-based image analysis.

“Machine classifiers developed using data from one laboratory do not generalize to data from another laboratory,” Dr. Madabhushi said. “This is because of the differences in preanalytic sources of variance. Slides prepared and digitized across different sites or laboratories have different staining protocols and slide appearances. Hence a classifier trained on one site might not work on slide images from another site,” he explained.

“To ameliorate batch artifacts and improve the generalizability of machine learning classifiers, we presented an approach where we would use a small set of slide images from the test site, but from a different organ compared to the training dataset, to calibrate the classifier so that it generalizes to slide images from the test site.”

They termed this new approach “multisite cross-organ calibration-based deep learning,” or simply MuSCID. The method allows pathologists to calibrate machine learning classifiers using WSIs of an off-target organ that were generated at the same site as the WSIs of the target organ. Calibration is based on the “assumption that cross-organ slides are subjected to the same preanalytical sources of variance.”


Classifier Calibration Mitigates Variability in Visual Presentations of WSIs

The team used WSIs of an off-target organ to calibrate the training data. These WSIs were generated at the test site. They defined “off-target organ” as any organ not used in the training and testing of the diagnostic pipeline.

In a proof-of-concept study, they showed that calibration of training data using WSIs of an off-target organ from the test site reduced variations in visual presentations between training and testing data.1
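To make the idea of calibrating training data toward the test site more concrete, the sketch below uses simple per-channel mean and standard deviation matching. This is a swapped-in illustration, not the calibration procedure used in the study: the function names, the pixel values, and the choice of statistic matching are all assumptions. Note also that matching skin pixels directly to statistics of another organ mixes tissue differences with site differences, which a real method would need to handle.

```python
from statistics import mean, pstdev

def channel_stats(pixels):
    """Return [(mean, std), ...] for each channel of a list of (r, g, b) pixels."""
    channels = list(zip(*pixels))
    return [(mean(c), pstdev(c)) for c in channels]

def match_stats(pixels, source_stats, target_stats):
    """Shift and scale each channel so its mean/std move from source to target."""
    out = []
    for px in pixels:
        new_px = []
        for v, (s_mu, s_sd), (t_mu, t_sd) in zip(px, source_stats, target_stats):
            scale = t_sd / s_sd if s_sd > 0 else 1.0
            # Clip to the valid 8-bit intensity range.
            new_px.append(min(255.0, max(0.0, (v - s_mu) * scale + t_mu)))
        out.append(tuple(new_px))
    return out

# Hypothetical pixel samples (made-up numbers, for illustration only).
train_pixels = [(200, 120, 180), (210, 130, 190), (190, 110, 170), (220, 140, 200)]
calib_pixels = [(170, 100, 160), (180, 110, 175), (160, 90, 150), (175, 105, 165)]

train_stats = channel_stats(train_pixels)    # training-site appearance
target_stats = channel_stats(calib_pixels)   # test-site appearance (off-target organ)
calibrated = match_stats(train_pixels, train_stats, target_stats)

print(channel_stats(calibrated))
```

The point of the sketch is only that the calibration statistics come from test-site images of a different organ, so no image of the target organ from the test site is touched before evaluation.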

Dr. Madabhushi noted that this cross-organ approach also “prevented data leakage during calibration, a phenomenon where machine classifiers can inadvertently learn attributes on the test set during the training procedure, a significant concern when performing cross-site calibration of machine learning classifiers.”


Classifier Calibration Facilitates Automated Diagnosis of Non-melanoma Skin Cancer

To determine the diagnostic performance of MuSCID when applied to WSIs generated in different laboratories, the team used their pipeline to analyze training and testing WSIs generated in two different countries. WSIs from an Australian cohort (n = 85) served as the training dataset, and those from a Swiss cohort (n = 352) served as the testing dataset.1 WSI analysis using MuSCID reduced variation (measured as Wasserstein distance) in image contrast, color, and brightness between the two WSI datasets. No batch artifacts were observed.
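The Wasserstein distance mentioned above can be pictured with a small sketch. The code below computes the one-dimensional Wasserstein (earth mover's) distance between two sets of per-image brightness values by integrating the gap between their empirical cumulative distributions. The brightness numbers and the simple mean shift standing in for calibration are hypothetical; the study's exact feature definitions are not given here.

```python
def wasserstein_1d(a, b):
    """1-D Wasserstein distance between two samples with uniform weights.

    Computed as the integral of |F_a(x) - F_b(x)| over the merged support,
    where F_a and F_b are the empirical cumulative distribution functions.
    """
    a, b = sorted(a), sorted(b)
    merged = sorted(a + b)
    dist = 0.0
    ia = ib = 0
    for left, right in zip(merged, merged[1:]):
        # Count how many points of each sample lie at or below `left`.
        while ia < len(a) and a[ia] <= left:
            ia += 1
        while ib < len(b) and b[ib] <= left:
            ib += 1
        dist += abs(ia / len(a) - ib / len(b)) * (right - left)
    return dist

# Hypothetical mean-brightness values per slide (made-up numbers).
site_a = [0.62, 0.65, 0.70, 0.68, 0.66]   # training site
site_b = [0.48, 0.52, 0.55, 0.50, 0.53]   # test site

before = wasserstein_1d(site_a, site_b)

# Crude stand-in for calibration: shift site B to match site A's mean.
shift = sum(site_a) / len(site_a) - sum(site_b) / len(site_b)
site_b_shifted = [v + shift for v in site_b]
after = wasserstein_1d(site_a, site_b_shifted)

print(f"distance before: {before:.4f}, after: {after:.4f}")
```

A smaller distance after adjustment means the two brightness distributions overlap more closely, which is the sense in which variation between the datasets is said to be reduced.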

“In this case, we focused on developing a machine classifier for diagnosis of non-melanoma skin cancer using slide images from Sydney,” Dr. Madabhushi noted. “The test images were from Switzerland. In order to calibrate the classifier, we used a small set of lung pathology images from Switzerland to harmonize the classifier in terms of site-specific variations corresponding to the testing site,” he added.

The team also assessed the ability of MuSCID to automate diagnosis of non-melanoma skin cancer. To this end, they used MuSCID to identify basal cell carcinomas, in situ squamous cell carcinomas, and invasive squamous cell carcinomas. Receiver operating characteristic (ROC) curve analysis showed that MuSCID was able to discriminate between different subtypes of non-melanoma skin cancer, yielding high area under the curve (AUC) values. MuSCID performed best in discriminating basal cell carcinomas and invasive squamous cell carcinomas (AUC, 0.92 for both) from other skin cancer subtypes, although discriminatory ability for in situ squamous cell carcinomas was also high (AUC, 0.87).1
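For readers unfamiliar with how a per-subtype AUC is obtained, the sketch below shows a one-vs-rest calculation: each subtype in turn is treated as the positive class and all other subtypes as negative. The AUC is computed as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as half. The class labels and predicted probabilities are hypothetical and do not come from the study.

```python
def roc_auc(scores, labels):
    """AUC as P(score_pos > score_neg) + 0.5 * P(score_pos == score_neg)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def one_vs_rest_auc(prob_rows, true_classes, classes):
    """Return {class: AUC} treating each class in turn as the positive class."""
    result = {}
    for k, cls in enumerate(classes):
        scores = [row[k] for row in prob_rows]
        labels = [1 if t == cls else 0 for t in true_classes]
        result[cls] = roc_auc(scores, labels)
    return result

# Hypothetical predicted probabilities per slide (made-up numbers).
classes = ["BCC", "SCC_in_situ", "SCC_invasive"]
prob_rows = [
    [0.70, 0.20, 0.10],
    [0.10, 0.60, 0.30],
    [0.20, 0.30, 0.50],
    [0.40, 0.65, 0.05],
]
true_classes = ["BCC", "SCC_in_situ", "SCC_invasive", "BCC"]

aucs = one_vs_rest_auc(prob_rows, true_classes, classes)
for cls in classes:
    print(cls, round(aucs[cls], 3))
```

An AUC of 1.0 means the subtype is always ranked above the others, while 0.5 corresponds to chance-level ranking.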

Analysis of machine learning-based subtyping of non-melanoma skin cancer with and without classifier calibration revealed that calibration using off-target tissue mitigated domain shift and improved classification performance when applied to WSIs generated at different sites.

Commenting on the clinical relevance of these findings, Dr. Madabhushi said: “Our approach provided a higher accuracy in terms of diagnosis of non-melanoma skin cancer on the test site, suggesting that calibrated classifiers are better than non-calibrated classifiers in automated diagnosis of skin cancer using WSIs generated at different laboratories.”


Novelty of Approach

“The most novel aspect of this study is that we calibrated the machine classifiers using a small number of cases from the test site,” Dr. Madabhushi noted. Another novel aspect of this approach is that the team used images from a different region of the body (i.e., an off-target organ) to calibrate machine learning classifiers for skin cancer diagnosis. This enabled them to minimize the risk of data leakage while also increasing the generalizability of the classifiers.

Regarding the generalizability of this approach, Dr. Madabhushi said that “while in this study, MuSCID was employed in the context of non-melanoma skin cancer, this approach could be extended to any domain or body region.”


Looking Ahead

Despite the promising diagnostic performance of MuSCID, several questions remain to be addressed in future studies. “In this study, we focused on skin cancer, and the logical question is whether the same machine learning framework can be extended to other indications. This approach was validated on data from a single test site, and the question is whether this approach will work equally well on multiple different test sites,” said Dr. Madabhushi.

“In this study, we used lung pathology images from the test site to calibrate a skin cancer classifier. In future work, we want to assess whether images from other body regions, such as the spleen and pancreas, might help to the same extent for calibration,” he added.



  1. Zhou Y, Koyuncu C, Lu C, et al. Multi-site cross-organ calibrated deep learning (MuSCID): Automated diagnosis of non-melanoma skin cancer. Med Image Anal. 2023;84:102702. doi:
