Researchers Develop Comprehensive Evaluation Toolkit for Deep Learning in Pathology

by Christos Evangelou, MSc, PhD – Medical Writer and Editor

Although deep learning has shown great potential for improving the accuracy and efficiency of histopathological analysis, there is a lack of standardized evaluation protocols and comprehensive evaluation frameworks for patch-based histopathological classification. This makes it challenging to compare and identify the best-performing models, hindering the progress of deep learning applications for histopathology.

In a recent study, researchers at Stony Brook Medicine and Cold Spring Harbor Laboratory developed a comprehensive, user-friendly evaluation toolkit called ChampKit to facilitate systematic evaluation and benchmarking of patch classification models and identification of the model that performs the best on specific histopathology datasets.1

Using ChampKit, the authors systematically and comprehensively tested various neural networks across six histopathology datasets. In most of these datasets, random weight initialization and transfer learning displayed similar performance. The findings of the study provide proof of concept that ChampKit is an easy-to-use and reproducible tool that enables researchers and pathologists to evaluate hundreds of deep learning models across a variety of pathology tasks.

“There is strong interest in clinical applications of deep learning in pathology. Our findings strongly suggest that model developers must explore various models and hyperparameters to arrive at the best model,” said Peter Koo, PhD, assistant professor at Cold Spring Harbor Laboratory and corresponding author of the study.

“In our study, some models performed very poorly in some tasks and necessitated a search over multiple configurations,” he added.

The report was published in the journal Computer Methods and Programs in Biomedicine, and the source code for the tool is freely accessible on GitHub.

Study Rationale: Addressing Challenges of Identifying Optimal Models and Training Configurations for Different Histopathology Classification Tasks

Histopathology plays a key role in the diagnosis of various cancers. Recent advances in deep learning have paved the way for the comprehensive characterization of tumors using histopathology images, including the quantification of certain cell populations and the identification of microsatellite instability.

However, the lack of tools to identify robust training configurations and optimal models for different histopathological classification tasks hinders the widespread adoption of deep learning in routine histopathology practice.

“In our experience with creating patch-based classification models for histology, we realized that as model developers, we are faced with many choices,” noted Dr. Koo. He explained that for any given network architecture, there are numerous knobs and switches that might — or might not — affect model performance.

“We conducted this study because we wanted to improve the experience of model developers by creating a one-stop-shop toolkit that enables a search over model architectures, hyperparameters, and other configurations,” he added.

ChampKit: A Tool for Comprehensive Evaluation of Patch-based Classification Models for Histology

To address the challenges of identifying optimal models and training configurations for different histopathology classification tasks, the research team developed ChampKit (Comprehensive Histopathology Assessment of Model Predictions toolkit), an easy-to-use, lightweight tool that enables robust, systematic evaluation of neural network models for patch classification in histology.1

Evaluation of neural network models for patch classification using ChampKit is extensible and fully reproducible. For model training and benchmarking, the toolkit curates six publicly available datasets.

“We created a Python-based toolkit that includes a model training script, evaluation script, and a set of six public benchmarking datasets that we believe represent important tasks in digital pathology. Our toolkit builds on PyTorch, the TIMM model repository, and Weights & Biases, which enable access to the latest image classification models, transfer learning from ImageNet, and rich logging of model performance,” said Jakub Kaczmarzyk, first author of the study and an MD/PhD researcher at Stony Brook Medicine.

Integration of these machine learning frameworks and model repositories ensures that researchers and pathologists can use the toolkit to train and evaluate existing and new patch-based classification models and deep learning architectures without the need to write code.1

“The most novel aspect of this work is that ChampKit is a one-stop shop for model development and benchmarking in histological patch classification. Our ultimate goal is for people to use ChampKit in their work, and towards that, we made every effort to develop a comprehensive yet user-friendly toolkit,” Kaczmarzyk noted.

In addition to making the toolkit freely available on GitHub, the authors also provided detailed documentation on how to use the toolkit in an effort to maximize its use by the digital pathology community.

ChampKit Enables the Evaluation of Hundreds of Deep Learning Models Across Various Pathology Tasks

To demonstrate the usefulness of ChampKit, the team established baseline performance for commonly used deep learning models, including ResNet18, ResNet50, and R26-ViT, a hybrid vision transformer.1 They also compared each model trained from random weight initialization with the same model trained using transfer learning from ImageNet pre-trained weights.

The team used ChampKit to systematically evaluate multiple neural networks across six datasets and observed mixed results regarding the benefit of pretraining over random initialization.

Although transfer learning could improve classification performance, its usefulness was inconsistent across tasks. Transfer learning from self-supervised weights rarely enhanced performance. However, transfer learning from ImageNet pre-trained models improved model performance in the low-data regime.

“Our study suggests that of the models tested, there is no one model that performs the best for all tasks and datasets. To find the optimal model and training configuration for a specific dataset, one must search over different architectures and hyperparameters, and we developed ChampKit to streamline this process,” said Kaczmarzyk.
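The kind of search Kaczmarzyk describes can be pictured as a grid of training runs, with one run per combination of dataset, architecture, and weight initialization. The following is a minimal Python sketch of that idea, not ChampKit's actual interface: the architecture names come from the study as reported here, while the dataset labels, the dictionary layout, and the helper functions `build_grid` and `best_per_dataset` are hypothetical and used only for illustration. The grid also leaves out the self-supervised weights mentioned above.

```python
from itertools import product

# Architecture names are taken from the article. The dataset labels and the
# dictionary layout are placeholders, not ChampKit's configuration format.
ARCHITECTURES = ["ResNet18", "ResNet50", "R26-ViT"]
INITIALIZATIONS = ["random", "imagenet"]
DATASETS = [f"dataset_{i}" for i in range(1, 7)]  # six tasks, names hypothetical


def build_grid(architectures, initializations, datasets):
    """Return one run configuration per (dataset, architecture, initialization)."""
    return [
        {"dataset": d, "architecture": a, "initialization": i}
        for d, a, i in product(datasets, architectures, initializations)
    ]


def best_per_dataset(results):
    """Pick the highest-scoring configuration for each dataset.

    `results` is a list of (config, score) pairs where a higher score is better.
    """
    best = {}
    for config, score in results:
        name = config["dataset"]
        if name not in best or score > best[name][1]:
            best[name] = (config, score)
    return best


grid = build_grid(ARCHITECTURES, INITIALIZATIONS, DATASETS)
print(len(grid))  # 6 datasets x 3 architectures x 2 initializations = 36
```

In practice, each configuration would be trained and scored on held-out data, and `best_per_dataset` would then report a separate winner for each task, in line with the observation that no single model performed best everywhere.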

Looking Ahead

The findings of the study suggest that ChampKit can help pathologists choose a suitable model for a given digital pathology dataset by enabling the evaluation of hundreds of existing or new deep learning models across a variety of pathology tasks. However, future improvements are needed to expand its applicability.

The study’s baseline comparisons were limited to three network architectures with different pre-trained weights, a small sample of the architectures accessible through ChampKit. Future work is needed to expand the range of baseline comparisons to include a wider variety of network architectures.

Commenting on future efforts to expand the range of network architectures, Dr. Koo said: “One future study we would love to see is a large-scale benchmarking analysis of many deep learning models across multiple pathology datasets. This type of study could help establish best practices for future studies, but it may be challenging to accomplish. We hope that ChampKit will enable these studies in the future.”

In addition, all models in the study were trained using identical hyperparameters for each dataset to ensure fair comparisons. Further optimization of each model could potentially improve performance, and future work is needed to explore the impact of hyperparameter tuning on model performance.

“Feedback is extremely important to us, and we encourage the community to submit bug reports, feature requests, and other feedback via our GitHub repository,” Dr. Koo added.


  1. Kaczmarzyk JR, Gupta R, Kurc TM, Abousamra S, Saltz JH, Koo PK. ChampKit: A framework for rapid evaluation of deep neural networks for patch-based histopathology classification. Comput Methods Programs Biomed. 2023;239:107631. doi:10.1016/j.cmpb.2023.107631
