Quis Custodiet?

My group is interested in methods of establishing whether diagnostic tests are really useful – that is, whether patients benefit from them. This is an underdeveloped area, as it has always been easier to establish the utility of a new drug than to assess whether patients actually benefit from a test.

At a Glance

  • Evaluation of diagnostic tests receives far less attention than, for example, pharmaceuticals; as a result, tests receive less regulatory scrutiny and barriers for marketing are low 
  • Analytical performance, clinical performance and clinical effectiveness of a test should be considered in the context of its intended application 
  • Unfortunately, many published evaluations of medical tests do not include the key information necessary for a valid assessment of the test, and indeed may be over-optimistic in the interpretation of the data they report
  • The STARD checklist is changing this situation by helping readers, authors and editors to verify that critical data are included in reports of the evaluations of medical tests

Fit for purpose?

One of the main problems is that patients don’t directly benefit from the test; the benefit typically arises from how the test results are used to guide clinical management. That’s different from other interventions, such as drugs, where there is a direct link between the intervention and the patient outcome. So to evaluate a medical test, you have to understand how it is used in the clinical pathway – how the results are communicated and how they guide decisions and influence outcomes. But lab professionals usually focus on the tests and the results, and don’t always consider how these results affect clinical management and patient outcomes. 

Another factor is that the medical testing field receives far less attention than pharmaceuticals and other interventions; for example, it is given scant attention during the training of medical doctors. In particular, it receives less emphasis from a regulatory and reimbursement perspective, and in consequence the barriers to marketing are worryingly low for medical tests compared with pharmaceuticals. There is accordingly less pressure on companies to produce direct evidence that their tests improve outcomes. Indeed, if a new test provides the same results as an existing one, it can replace the existing test without going through the expense of a randomized trial. So the need for a more complex evaluation – one that requires consideration of patient outcomes – in combination with lower regulatory hurdles has left the field of medical test evaluation slightly disadvantaged.

A pragmatic approach 

This is not to say that every new test should be validated by a randomized trial – that would be an extremely exaggerated position to take – but we should consider a middle ground. Clearly, some personalized and precision medicine tests really must be evaluated in randomized trials before the clinical community can accept that they are effective in improving patient outcomes. For example, if you propose using a marker to identify patients likely to respond to therapy, you should provide evidence that marker-based stratification in combination with treatment gives better outcomes than the alternatives (i.e. either no treatment or treatment of all patients). But many other tests could legitimately rely only on diagnostic or analytical accuracy studies for approval. So I’d like to see a staged approach in which some types of test require randomized trial evidence while others do not.

In any case, the methods by which tests are evaluated deserve close consideration. Typically, one assesses three features of a lab test: analytical performance, clinical performance and clinical effectiveness. Evaluation of analytical performance determines the trustworthiness of the test – does it reliably give results that correspond with the true value? Evaluation of clinical performance indicates whether the test result is meaningful – does it distinguish diseased from non-diseased patients, or patients who progress from those who don’t? But the bottom-line evaluation is that of clinical effectiveness – is the test useful, that is, does it guide management better than not relying on the test? So these three types of evaluation answer different questions, and each should be answered satisfactorily before releasing a test into clinical care.

Furthermore, each of these three evaluations should be guided by the intended application of the test, in particular with regard to the kind of patients who are to be tested in clinical practice. Thus, there is some interdependence between these three types of evaluation. For example, the evaluation of analytical performance should be connected to the clinical effectiveness evaluation, in that once the test is sufficiently developed, you should anticipate the required level of clinical effectiveness in the context of the intended use of the test. So a proper evaluation should take into account the link between the test and its intended application in healthcare. 

The test of time? 

Unfortunately, medical test evaluation suffers from silo thinking – evaluations of analytical performance and clinical performance are disconnected. Often, manufacturers and healthcare professionals develop tests without a proper understanding of how the test will provide a level of effectiveness sufficient to persuade payers or clinicians to use the test in the first place.

This is not the only problem in the field of medical test evaluation. Like other areas of biomedical research, the test evaluation literature suffers from widespread problems, such as failure to report negative results and presentation of results with an optimistic ‘spin’. Commonly, this is a result of researchers feeling that they must emphasize positive findings in order to achieve publication. This, in combination with the fact that some studies will generate positive findings by chance or through poor study design, may explain why reports of medical test performance are not replicable, or differ significantly from performance estimates in systematic reviews and meta-analyses. But by the time systematic reviews have been generated, it may be too late: the initial positive and encouraging publications have received a lot of attention, and may have prompted the premature introduction of the test into clinical care. Of course, introduction of suboptimal tests is hardly ever due to fraud or misconduct; rather, it reflects the fact that we have an abundance of studies that are not always well conducted and that present their results in an over-optimistic way. This contributes to waste in medical research, because these findings generate other studies that then fail to replicate the initial finding – studies that could have been avoided if the prevailing culture supported publication of negative findings as well as positive ones.

Figure 1: STARD flowchart

The STARD flowchart and checklist tools (1) have been developed to support the robust evaluation of medical tests intended for various purposes (diagnosis, screening, staging, monitoring, surveillance, prediction and prognosis), and for various clinical roles (for example, to use before an existing test, to replace an existing test, or to use after an existing test). A typical study evaluates the ability of a medical test (the “index test”) to correctly classify study participants as having a target condition – for example, disease presence, disease stage or response to therapy. This is usually done by comparing the distribution of the index test results with those of the “reference standard”, i.e., the best alternative method for establishing the presence or absence of the target condition. Cross-tabulation of index test results against those of the reference standard can be used to estimate the sensitivity of the index test (the proportion of participants with the target condition who have a positive index test), and its specificity (the proportion without the target condition who have a negative index test). This approach permits derivation of other statistical measures, such as the positive and negative predictive values of the test. Confidence intervals around estimates of accuracy can then be calculated to quantify the statistical uncertainty of the measurements. Standardization of the evaluation of medical tests through wider use of STARD tools, such as the flowchart illustrated, is expected to improve the quality and reliability of reported evaluations of medical tests.
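
The cross-tabulation described in the caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the counts are invented, the function names are hypothetical, and the Wilson score interval is one common choice for a confidence interval around a proportion, not a method prescribed by STARD.

```python
from math import sqrt


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a proportion (roughly 95% when z=1.96)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half


def accuracy_measures(tp, fp, fn, tn):
    """Estimate accuracy measures from a 2x2 table of index test vs reference standard.

    tp: index test positive, target condition present
    fp: index test positive, target condition absent
    fn: index test negative, target condition present
    tn: index test negative, target condition absent
    """
    return {
        "sensitivity": (tp / (tp + fn), wilson_interval(tp, tp + fn)),
        "specificity": (tn / (tn + fp), wilson_interval(tn, tn + fp)),
        "positive predictive value": (tp / (tp + fp), wilson_interval(tp, tp + fp)),
        "negative predictive value": (tn / (tn + fn), wilson_interval(tn, tn + fn)),
    }


# Invented counts, for illustration only (not data from any study).
results = accuracy_measures(tp=90, fp=30, fn=10, tn=170)
for name, (estimate, (low, high)) in results.items():
    print(f"{name}: {estimate:.3f} (95% CI {low:.3f} to {high:.3f})")
```

Note that the predictive values computed this way depend on how common the target condition is among the study participants, which is one reason the intended application of the test matters when interpreting such a table.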
Don’t be blinded by spin 

So how can we improve the evaluation of medical tests? I believe that there are a number of improvements that can be made. Clearly, it would help if editors and peer-reviewers were better trained to recognize spin, but it’s a difficult situation – the journal’s reputation is built on the number of citations it gets, so it too is incentivized to publish optimistic findings. More fundamentally, I think that we should raise awareness among manufacturers that useful tests must improve patient outcomes – and that manufacturers are partially responsible for providing the evidence that tests are not only accurate but also useful. This would encourage manufacturers to undertake more trials and effectiveness studies than they currently do. 

Another helpful advance would be to increase the understanding and appreciation of the methods for evaluating medical tests. This would benefit not only statisticians – who are usually less familiar with methods for evaluating medical tests than they are with methods for evaluating randomized trials of drugs – but also many healthcare professionals, whose understanding of medical tests is not as extensive as it should be and who would benefit from additional training. For example, clinicians often have high expectations for precision of medical tests but only limited understanding of intrinsic variability, and a poor appreciation of the links between test results, management actions and patient outcomes. 

Finally, I expect additional changes to be forced by the growing reluctance among healthcare payers to support expensive drugs and automatically reimburse new markers and new tests. Increasingly, payers are expecting proof of effectiveness for these new tests before they are willing to support their use, especially if they are expensive multimarker panels or completely new forms of testing like clinical mass spectrometry. And that implies that the community will have to improve its understanding of the link between medical tests and management, and between management and patient outcomes.

In the short term...

In the near term, however, there are pragmatic actions we can take that will improve the quality of reports of the performance of medical tests. Our group, like several others, has found that crucial items of information are often lacking in published studies. The age and gender of the study subjects may not be disclosed, details of how and where subjects were recruited may be omitted, and the actual results may not be provided in full; for example, the report might give no more than the number of subjects and the percentage of correct classifications provided by the test! That’s why I have collaborated with an international group of researchers, editors and authors to develop a one-page checklist; the intent is to help verify that essential information is included in studies reporting the results of medical test evaluations. That checklist, known as STARD (from ‘standards for reporting diagnostic accuracy studies’; see Figure 1), was initially published in 2003 and updated in October 2015 (1).
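
The idea of checking a report against a list of essential items can be sketched as follows. This is a toy illustration only: the field names are hypothetical and cover just the omissions mentioned above; they are not the actual STARD item list, which should be taken from the published checklist (1).

```python
# Hypothetical field names chosen for illustration; this is NOT the STARD item list.
REQUIRED_FIELDS = {
    "participant_age": "age of study participants",
    "participant_sex": "sex/gender of study participants",
    "recruitment": "how and where participants were recruited",
    "cross_tabulation": "full index test results against the reference standard",
}


def missing_items(report):
    """Return descriptions of required fields that are absent or empty in a report."""
    return [desc for key, desc in REQUIRED_FIELDS.items() if not report.get(key)]


# An invented report that gives little more than a sample size and a percentage.
example_report = {
    "participant_age": "median 54 years",
    "recruitment": "",
    "n_participants": 300,
    "percent_correct": 87.0,
}

for item in missing_items(example_report):
    print("Missing:", item)
```

A report like the invented one here would be flagged for three gaps, even though it states a headline percentage of correct classifications.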

STARD can be used by authors to verify that they have included all the essential information in the study report, and by peer reviewers, editors and readers to see if the study report can validly answer the question of whether the evaluated test is useful for a specific application. Since the initial release of STARD, we’ve seen a small but significant increase in the number of reported items per study, and we’re very much encouraged by the fact that some journals have started to use STARD systematically in their peer review process for evaluations of medical tests. We’re not there yet – I want to see STARD adopted even more broadly – but we are at least starting to make a difference. 

Patrick Bossuyt is Professor of Clinical Epidemiology at the University of Amsterdam, the Netherlands.

  1. PM Bossuyt et al., “STARD 2015: An updated list of essential items for reporting diagnostic accuracy studies”, BMJ, 351, h5527 (2015). PMID: 26511519. Also accessed in full at
