Atezolizumab Assay Angst
A response to “Which Assay for Atezolizumab?” by Torlakovic and Gown
David Rimm | Opinion
I appreciate the comments Drs. Gown and Torlakovic have made related to my essay. In fact, as a pathology resident, Allen Gown was a role model for me as an example of a successful scientist-pathologist. As such, I am honored to have attracted his attention with my essay. However, as is often the case among scientists, we interpret data differently, and thus I would take issue with the following statements from the Gown and Torlakovic response to my essay:
1. “Nonetheless, it is a truism that more training and experience will lead to more reproducible results.”
I believe this is false with respect to immunohistochemistry. No matter how well you train me to flap my arms, I will never fly. It is a “truism” that humans, without help from a machine, cannot fly. Similarly, there are tasks that cannot be done by the human eye, even with training. Reading immune cell PD-L1 expression may be such a task. The variability and non-uniformity of the staining pattern may be of sufficient complexity that it is impossible for human eyes to reproducibly score, especially with a multi-category scoring system. In fact, Tsao and colleagues spent 1.5 days training 25 pathologists who subsequently failed to reproducibly score immune cell PD-L1 (1). These were all very experienced pathologists, many of them leaders in the lung cancer field, but their intraclass correlation coefficients (ICCs) for immune cell scoring were mostly in the 0.2 range – certainly unacceptable for a clinical assay. These data suggest that it is not a “truism” that more training and experience will lead to more reproducible results.
2. “And why is Rimm so exercised about this ‘limitation’ of the Roche Ventana PD-L1 (SP142) assay? The situation is identical with the FDA approved 28-8, 22c3, and SP263-based PD-L1 assays from various vendors, all using specific instruments.”
I think the limitation they are referring to is the closed-system nature of SP142. It is true that all of these assays are closed systems, and I appreciate the value of closed systems for reproducibility – although, when staining intensity is quantitatively assessed, these systems still show variation spanning ±2 SD (2), along with other issues (3). Specifically, SP142 does not come with an accuracy standard, so there is no way to be sure that, for any given staining run, the signal threshold is accurate. Other PD-L1 assays can simply be compared with a second assay, because the Blueprint (1,4) and NCCN (5) studies have shown that they are comparable. Because the SP142 assay is not comparable in sensitivity to any other assay, there is no mechanism for standardizing its day-to-day reproducibility.
3. “It may be true that tumor cells are stained to a lesser degree with the SP142-based IHC assay, but the opposite is true for immune cells, which appear to be ‘overstained’ compared with other IHC PD-L1 assays; but it seems the intent of the assay was lower analytic sensitivity for tumor cells and higher analytic sensitivity for inflammatory cells – a goal that was achieved.”
This statement shows the success of the agenda of the Ventana marketing program, but it has been scientifically proven false by a number of studies (1,2,4,5). It seems that neither the authors nor the marketers at Ventana remember their college-level biochemistry. The interaction between an antibody and its antigen is governed by the laws of physics. The interaction’s affinity (binding strength) is defined by the concentration of the bound complex divided by the product of the concentrations of the free ligands (6). This interaction is independent of cell type; in other words, the antibody does not “know” if the PD-L1 is expressed in the immune cell or the tumor cell.
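In symbols, the mass-action relationship invoked here (standard biochemistry; the notation is mine, not taken from reference 6) is:

```latex
K_A \;=\; \frac{[\mathrm{Ab{\cdot}Ag}]}{[\mathrm{Ab}]\,[\mathrm{Ag}]}
\qquad\text{and}\qquad
K_D \;=\; \frac{1}{K_A} \;=\; \frac{[\mathrm{Ab}]\,[\mathrm{Ag}]}{[\mathrm{Ab{\cdot}Ag}]}
```

Nothing in this expression depends on which cell type displays the antigen: the association constant is a property of the antibody–epitope pair alone.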
The only plausible biochemical explanation could be that there is post-translational modification of the epitope in tumor cells that does not occur in immune cells (or vice versa) – but the observation that staining is lower in cell lines, tumor cells, and immune cells suggests that there are no such post-translational modifications. When read by 13 pathologists on a statistically powered number of cases, the SP142 assay shows uniformly lower sensitivity in both tumor and immune cell scoring (5).
4. “… the most significant mistake in Rimm’s analysis is his confusion between analytic sensitivity and specificity and diagnostic sensitivity and specificity.”
By this, the authors refer to the ability of the assay to select patients who benefit in the IMpassion130 trial. The statement is correct in that only the SP142 assay was tested on the clinical trial tissue, and its association with outcome was good. Statisticians tell us that the best way to test this association is to assess the hazard ratio (HR) and, indeed, the HR for overall survival in SP142-positive cases was 0.62 (95% CI 0.49–0.78) (7). However, Torlakovic and Gown may not be aware of the reanalysis of the data with two other assays and updated survival data. Hope Rugo presented a study at the fall congress of the European Society for Medical Oncology showing that the HR for SP142 had slipped to 0.74 (95% CI 0.54–1.01): in spite of a difference in median survival, the association is no longer statistically significant.
She also presented data on reassessment of the tissue slides with two other PD-L1 assays (22C3 and SP263). For SP263, the HR was 0.75 (95% CI 0.59–0.96). This assay has essentially the same HR but, because the upper bound of the confidence interval is below 1, the relationship is statistically significant – even though the SP263 assay found 75 percent of the population positive, compared with SP142’s 46 percent. These data show that the SP263 assay is as good as (or arguably better than) SP142 with respect to diagnostic sensitivity and specificity.
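For readers who want to check the significance claims above, here is a minimal sketch (my own, using the standard Wald/normal approximation, not the trial’s actual analysis) that recovers an approximate two-sided p-value from a reported hazard ratio and its 95% confidence interval:

```python
import math

def hr_pvalue(hr, ci_lower, ci_upper):
    """Approximate two-sided p-value for a hazard ratio.

    Recovers the log-scale standard error from the reported 95% CI
    (Wald approximation: CI = exp(ln(HR) +/- 1.96 * SE)), then converts
    the z statistic to a p-value with the normal CDF.
    """
    se = (math.log(ci_upper) - math.log(ci_lower)) / (2 * 1.96)
    z = math.log(hr) / se
    # Normal CDF via the error function
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# Updated SP142 result: 0.74 (0.54-1.01) -> p just above 0.05
# SP263 result:         0.75 (0.59-0.96) -> p below 0.05
```

Applied to the two intervals quoted above, the approximation reproduces the qualitative conclusion: nearly identical HRs, but only the SP263 interval excludes 1.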
5. “We find [the SP142 assay] both simple and highly reproducible, paralleling the assessment of this assay in a study (not cited by Rimm) involving six pathologists and three sites.”
I did not cite the study by Vennapusa and colleagues (8) because, although the work is beautifully illustrated, it is statistically underpowered and untested (no ICCs or p values). Thus, it presents an agenda-based truth, rather than a fact-based truth supported by scientific methods. But this brings to light a puzzling conundrum. How can the FDA approve an assay with overall percent agreement (OPA) in the 0.95 range when ICCs on the same assay in statistically powered comparison studies with multiple pathologists (1) are in the 0.2 range? We recently presented a potential explanation for this conundrum at the San Antonio Breast Cancer Symposium. We note that, similar to Vennapusa and colleagues, the FDA required comparison of each observer to a consensus score – essentially two observers: the test pathologist and the consensus score.
So, we ask the question, “How many Observers are Needed to Evaluate a Subjective Test?” Using the capitalized letters as an acronym, we draw an ONEST plot that graphs the OPA on the y-axis versus the number of observers (pathologists) on the x-axis. There are over a billion combination comparisons – far too many to plot them all – so we plot 100 random representative comparisons (see Figure 1). One can see that the range for any two observers goes from an OPA of 0.55 to 0.95, while no combination of 10 or more observers exceeds an OPA of 0.6 and the OPA for 19 observers is 0.41. We believe this approach shows why there can be discordance between multiple studies reporting agreement between pathologist observers. Furthermore, we believe that the plot’s plateau shows the minimum number of observers that should be included in construction of a subjectively read assay. By the time you read this, we will have had our first meetings with the FDA to discuss this issue. Furthermore, a manuscript carefully describing this ONEST approach is currently under review. In the real world, where well over 1,000 pathologist observers read this assay, we believe the ONEST plot plateau is most representative of the true OPA of the assay.
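The construction just described can be sketched in a few lines of illustrative Python. This is not the implementation from the manuscript under review, and the score data in the usage example are synthetic; it simply shows how OPA shrinks as random subsets of observers grow:

```python
import random

def opa(scores, observers):
    """Overall percent agreement: the fraction of cases on which every
    observer in the subset assigns the same categorical score.

    `scores` maps observer name -> list of per-case scores (equal lengths).
    """
    cases = list(zip(*(scores[o] for o in observers)))
    agree = sum(1 for case in cases if len(set(case)) == 1)
    return agree / len(cases)

def onest_curves(scores, n_curves=100, seed=0):
    """For each of `n_curves` random orderings of the observers, compute
    the OPA of the first 2, 3, ..., N observers in that ordering.

    Plotting every curve (OPA vs. number of observers) gives an ONEST plot;
    the curves typically fall and then plateau as observers are added.
    """
    rng = random.Random(seed)
    names = list(scores)
    curves = []
    for _ in range(n_curves):
        order = names[:]
        rng.shuffle(order)
        curves.append([opa(scores, order[:k]) for k in range(2, len(names) + 1)])
    return curves
```

Because adding an observer can only preserve or break full agreement on each case, every curve is non-increasing, which is why a two-observer comparison (as in the FDA submissions discussed above) can report a far higher OPA than a many-observer study.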
I thank the editor of The Pathologist for publishing this response and I look forward to future discussions, perhaps in person, with Torlakovic and Gown to further explore this issue. It is my hope that the scientific facts will prevail, and that the assays that most accurately select patients for response to immunotherapy will soon be better standardized for accurate and reproducible use by pathologists.
1. MS Tsao et al., “PD-L1 immunohistochemistry comparability study in real-life clinical samples: results of Blueprint Phase 2 Project”, J Thorac Oncol, 13, 1302 (2018). PMID: 29800747.
2. S Martinez-Morilla et al., “Quantitative assessment of PD-L1 as an analyte in immunohistochemistry diagnostic assays using a standardized cell line tissue microarray”, Lab Invest, 100, 4 (2020). PMID: 31409885.
3. CC Cheung et al., “Uneven staining in automated immunohistochemistry: cold and hot zones and implications for immunohistochemical analysis of biopsy specimens”, Appl Immunohistochem Mol Morphol, 26, 299 (2018). PMID: 29734239.
4. FR Hirsch et al., “PD-L1 immunohistochemistry assays for lung cancer: results from Phase 1 of the Blueprint PD-L1 IHC Assay Comparison Project”, J Thorac Oncol, 12, 208 (2017). PMID: 27913228.
5. DL Rimm et al., “A prospective, multi-institutional, pathologist-based assessment of 4 immunohistochemistry assays for PD-L1 expression in non-small cell lung cancer”, JAMA Oncol, 3, 1051 (2017). PMID: 28278348.
6. SA Hunter, JR Cochran, “Cell-binding assays for determining the affinity of protein-protein interactions: technologies and considerations”, Methods Enzymol, 580, 21 (2016). PMID: 27586327.
7. P Schmid et al., “Atezolizumab and nab-paclitaxel in advanced triple-negative breast cancer”, N Engl J Med, 379, 2108 (2018). PMID: 30345906.
8. B Vennapusa et al., “Development of a PD-L1 complementary diagnostic immunohistochemistry assay (SP142) for atezolizumab”, Appl Immunohistochem Mol Morphol, 27, 92 (2019). PMID: 29346180.