The tools for developing AI models for biomedical image analysis have recently become accessible to non-computer scientists as well. With that accessibility comes the question: are the models we build good enough?
How do we check the model?
How do we validate it and be sure that, when deployed according to its intended use, it will perform adequately and help us make the right decisions based on the correct premises?
In this episode, Thomas Westerling-Bui from Aiforia explains the validation principles that should be applied to AI image analysis solutions.
Validating an AI image analysis model is like validating any other assay. It starts with finding the boundaries of the assay’s usability. As with any assay, the model’s precision and recall are the most important parameters to check. We need to perform a conceptual validation to find out whether the platform does what we want it to do, and an analytical validation to precisely quantify the accuracy of the method.
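The two headline metrics of an analytical validation can be computed directly from the model’s predictions against ground truth. A minimal sketch, using only the standard library; the function name and the toy labels are illustrative, not from the episode:

```python
def precision_recall(y_true, y_pred):
    """Compute precision and recall for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Toy example: the model finds 4 of 6 true positives
# and makes 1 false positive call.
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 0, 0, 1, 0, 0, 0]
p, r = precision_recall(y_true, y_pred)
print(round(p, 2), round(r, 2))  # → 0.8 0.67
```

Precision answers “of the regions the model flagged, how many were real?”, while recall answers “of the real regions, how many did the model find?” — a screening tool and a diagnostic tool will typically weigh the two differently.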
Validation is different from improving the AI model on a given data set and must always be performed on an independent data set. Unfortunately, there seems to be confusion about this in the scientific community, which weakens many biomedical publications describing the development and use of AI models.
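In imaging data, independence means more than a random split: when several tiles come from the same slide or patient, they must all land on the same side of the split, or the validation set leaks information from training. A minimal sketch of a group-wise split; the names and toy data are assumptions for illustration:

```python
import random

def split_by_group(samples, validation_fraction=0.3, seed=0):
    """samples: list of (group_id, item) pairs. All items sharing a
    group_id (e.g. one patient's tiles) end up on the same side."""
    groups = sorted({g for g, _ in samples})
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_val = max(1, round(len(groups) * validation_fraction))
    val_groups = set(groups[:n_val])
    train = [s for s in samples if s[0] not in val_groups]
    val = [s for s in samples if s[0] in val_groups]
    return train, val

# 5 hypothetical patients, 4 image tiles each:
tiles = [(f"patient{p}", f"tile{t}") for p in range(5) for t in range(4)]
train, val = split_by_group(tiles)
# No patient contributes tiles to both sets:
assert {g for g, _ in train}.isdisjoint({g for g, _ in val})
```

Splitting at the patient level rather than the tile level is what makes the held-out set genuinely independent of training.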
Another important concept, intended use, is crucial not only for using the assay but also for validating it. A screening tool will be validated differently than a diagnostic tool.
As powerful as they are, AI-based tools are just tools: they will not do things they were not designed (trained) to do, so validation should be tailored to the things they ARE trained to do.
Because supervised AI methods rely on human-generated ground truth for both training and validation, the decision of how many validation regions to include depends heavily on the human capacity to provide adequate ground truth; in image analysis this usually means annotations. If users are pressed to generate a large number of annotations, precision may suffer, so a middle ground must be found that provides an adequate number while maintaining quality.
Another important aspect of generating ground truth is interobserver variability. It needs to be quantified and accounted for during validation, which is why comparing model outputs against ground truth generated by a single individual is of limited value.
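Interobserver variability can be quantified with a chance-corrected agreement statistic such as Cohen’s kappa. The episode does not prescribe a particular metric; kappa is simply one standard choice, sketched here with toy annotator labels:

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two raters, corrected for the
    agreement expected by chance from each rater's label frequencies."""
    n = len(labels_a)
    categories = set(labels_a) | set(labels_b)
    observed = sum(1 for a, b in zip(labels_a, labels_b) if a == b) / n
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n)
        for c in categories
    )
    return (observed - expected) / (1 - expected)

# Two hypothetical annotators labeling the same 10 regions:
rater1 = ["tumor", "tumor", "stroma", "tumor", "stroma",
          "tumor", "stroma", "stroma", "tumor", "stroma"]
rater2 = ["tumor", "stroma", "stroma", "tumor", "stroma",
          "tumor", "stroma", "tumor", "tumor", "stroma"]
print(round(cohens_kappa(rater1, rater2), 2))  # → 0.6
```

If two expert annotators only reach, say, kappa 0.6 with each other, a model cannot reasonably be expected to agree perfectly with either one, which is why single-annotator ground truth gives a misleading picture of model accuracy.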
In a nutshell, the subject is complex; to understand these and other nuances of AI model validation, the following resources may be of use: