How to analyse candidate comparative diagnostic tests

A representation of data from a paired comparative diagnostic accuracy study is given in Table 1. The subjects are initially divided according to whether they are discovered, via the gold standard test, to be diseased or non-diseased. They are then further subdivided as to whether they test positive or negative on tests A and B. For example, the cell n_A represents subjects who were found to have the disease via the gold standard test and also tested positive on both tests A and B, while cell n_F denotes subjects who tested negative on the gold standard and test B but positive on test A.

Table 1 Paired study design

A possible initial sample size calculation, using a normal approximation of the logarithm of the ratio of sensitivities and specificities, and assuming a comparison between a new test, test A, and an existing test, test B, follows from Alonzo et al. [21]; a full derivation can be found therein. The experiment as a whole jointly tests whether both sensitivity and specificity improve to pre-specified levels: a sample size is calculated for each, and the larger of the two is chosen to power the study. Note that this paper concentrates on the situation in which superiority is tested for both sensitivity and specificity. However, the method elaborated below should be extendable to situations where we are interested in testing non-inferiority in either or both of sensitivity and specificity. For details on the construction of the confidence intervals and hypothesis tests in these situations see Alonzo et al. [21]. For the estimation of a sample size for superiority, the initial sample size calculation for sensitivity is given by:

$$ n_{p1} = \left(\frac{Z^{(1-\beta)} + Z^{(1-\alpha/2)}}{\log \gamma_1}\right)^2 \left(\frac{(\gamma_1 + 1)\,TPR_B - 2\,TPPR}{\gamma_1\, TPR_B^2}\right) \Big/ \pi $$

(1)

where α is the type I error rate of the study and β is the type II error rate, so that 1 − β is the power of the study. The main quantity of interest, γ_1, is the ratio of true positive rates, γ_1 = TPR_A / TPR_B. TPR_B is the true positive rate (sensitivity) of test B, i.e. TPR_B = (n_A + n_C) / (n_A + n_B + n_C + n_D); TPR_A is the true positive rate (sensitivity) of test A, i.e. TPR_A = (n_A + n_B) / (n_A + n_B + n_C + n_D); TPPR is the proportion of diseased patients who test positive on both tests, i.e. TPPR = n_A / (n_A + n_B + n_C + n_D); and π is the prevalence of disease. The null hypothesis is that γ_1 = 1; the alternative hypothesis is that γ_1 ≠ 1.
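For illustration, eq. (1) can be evaluated directly in R. The helper below is a minimal sketch; the function name n_sens is ours rather than taken from the Appendix, and the rounding convention is left to the reader:

```r
# Sample size for the sensitivity comparison, eq. (1).
# gamma1 = TPR_A / TPR_B under the alternative; prev = disease prevalence.
n_sens <- function(alpha, beta, gamma1, TPR_B, TPPR, prev) {
  z   <- qnorm(1 - beta) + qnorm(1 - alpha / 2)
  eff <- ((gamma1 + 1) * TPR_B - 2 * TPPR) / (gamma1 * TPR_B^2)
  (z / log(gamma1))^2 * eff / prev   # round up (or round) to obtain n_p1
}
```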

For testing superiority of specificity we are interested in the true negative rates so the formula is instead:

$$ n_{n1} = \left(\frac{Z^{(1-\beta)} + Z^{(1-\alpha/2)}}{\log \gamma_2}\right)^2 \left(\frac{(\gamma_2 + 1)\,TNR_B - 2\,TNNR}{\gamma_2\, TNR_B^2}\right) \Big/ (1-\pi) $$

(2)

where γ_2, the main quantity of interest, is the ratio of true negative rates, γ_2 = TNR_A / TNR_B. TNR_A is the true negative rate (specificity) of test A, i.e. TNR_A = (n_G + n_H) / (n_E + n_F + n_G + n_H); TNR_B is the true negative rate (specificity) of test B, i.e. TNR_B = (n_F + n_H) / (n_E + n_F + n_G + n_H); and TNNR is the proportion of non-diseased patients who test negative on both tests, i.e. TNNR = n_H / (n_E + n_F + n_G + n_H).
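The specificity calculation of eq. (2) is analogous; a matching sketch (again with an illustrative name, n_spec):

```r
# Sample size for the specificity comparison, eq. (2).
# gamma2 = TNR_A / TNR_B under the alternative; prev = disease prevalence.
n_spec <- function(alpha, beta, gamma2, TNR_B, TNNR, prev) {
  z   <- qnorm(1 - beta) + qnorm(1 - alpha / 2)
  eff <- ((gamma2 + 1) * TNR_B - 2 * TNNR) / (gamma2 * TNR_B^2)
  (z / log(gamma2))^2 * eff / (1 - prev)
}
```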

It is interesting to note that, following the notation of Vacek [25] and considering the population 2 × 2 table (Table 1), the conditional dependence of the two tests can be denoted by e_b and e_a, the conditional covariances when the gold standard disease status is positive or negative, respectively [25]. The probability of both tests being positive in a diseased patient can therefore be expressed as TPPR = TPR_A ∙ TPR_B + e_b, and the probability of both tests being negative in a non-diseased patient as TNNR = TNR_A ∙ TNR_B + e_a. When e_a = e_b = 0 the tests are conditionally independent; when e_a and/or e_b ≠ 0 the response on one test changes the probability of that response on the other test. For example, when e_b > 0 a diseased individual who responds positively on test A is more likely to respond positively on test B.

For initial estimates of TPPR and TNNR we can, following Alonzo et al. [21], use the facts that TPPR ≥ (1 + γ_1)TPR_B − 1 and TNNR ≥ (1 + γ_2)TNR_B − 1 to obtain the lower bounds of the possible values of TPPR and TNNR under the specified hypotheses. The required sample size is largest when TPPR = (1 + γ_1)TPR_B − 1 and TNNR = (1 + γ_2)TNR_B − 1; these estimates therefore represent the "worst case scenario" of maximal negative conditional dependence between the tests, conditional on the fixed values of TPR_A and TPR_B. The sample size implied by these levels of TPPR and TNNR would very likely overpower the study, i.e. more participants would be recruited than is strictly necessary to achieve the power specified by 1 − β. The required sample size is smallest when the conditional dependence between tests A and B is maximally positive, conditional on the fixed values of TPR_A and TPR_B, i.e. when TPPR = TPR_B and TNNR = TNR_B. The sample size implied in this "best case scenario" would likely underpower the study, i.e. too few participants would be recruited to reach the power specified by 1 − β, and it can be substantially lower than the worst case value.

Conservatively, it might be thought a good idea always to use the "worst case scenario" sample size estimate, which will always power the study sufficiently. However, where the recruitment and testing of participants comes at a premium, both financially and in terms of discomfort to patients, a more nuanced strategy may be preferable. Furthermore, the "worst case scenario" corresponds to the highly unlikely situation of maximal negative conditional dependence between two tests performed on the same patients to detect the same disease, and a sample size based on this condition is not recommended [28]. One possibility, which enables a more accurate evaluation of the conditional dependence between the two tests, and thus of the required sample size, is to perform a planned interim sample size re-estimation using the accrued data to refine the initial estimate.
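These bounds are straightforward to compute; a minimal sketch (the helper names are ours), assuming γ_1, γ_2 > 1 and that the lower bounds are non-negative:

```r
# Worst-case (maximal negative conditional dependence) and best-case
# (maximal positive conditional dependence) values of TPPR and TNNR,
# given the hypothesised marginals.
tppr_range <- function(gamma1, TPR_B) {
  c(worst = (1 + gamma1) * TPR_B - 1, best = TPR_B)
}
tnnr_range <- function(gamma2, TNR_B) {
  c(worst = (1 + gamma2) * TNR_B - 1, best = TNR_B)
}
```

Feeding the "worst" and "best" values into eqs. (1) and (2) gives the range of sample sizes discussed above.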

At a planned interim, where a proportion of the overall sample has been collected, we have some information about the true values of TPPR, TNNR, π, TPR_B and TNR_B; however, these estimates come from a limited sample. The crucial parameters to use in re-estimation are those related to the conditional dependence between the tests, i.e. TPPR and TNNR, as these are difficult to estimate and it is unlikely that existing research can provide approximate values for them. Conversely, TPR_B and TNR_B, the sensitivity and specificity of an established test, may have known values in the literature, and these should preferably be used in place of estimates from the relatively small interim sample. For the prevalence π, a judgement must be made as to whether any pre-existing estimate would reflect the true prevalence in the specific study population more accurately than the interim estimate. In the example given below, we use the interim values of TPPR, TNNR and π in the sample size calculation.

Naively, it might appear that interim sample size re-estimation would entail a straightforward re-application of eqs. (1) and (2), with π, and TPPR in the case of (1) or TNNR in the case of (2), replaced by their interim estimates. However, this approach does not take into account the inherent uncertainty in the interim estimates of TPPR, TNNR and π, nor the fact that only a specific range of values for TPPR and TNNR is actually possible under the alternative hypothesis. An approach which does take these factors into account is to re-estimate the sample size using maximum likelihood estimates, at the interim, of the parameters in question under a multinomial model constrained by the hypothesised values of TPR_A, TPR_B, TNR_A and TNR_B, i.e. the marginals in Table 1.

Application

The numerical example we use involves an interim sample size recalculation for a study comparing the incremental benefits to sensitivity and specificity of augmenting current methods for diagnosing pancreatic cancer with positron emission tomography (PET) and computed tomography (CT). The alternative hypotheses were that sensitivity would rise from 81% to 90% and that specificity would rise from 66% to 80%; the expected prevalence of pancreatic cancer from the literature was 47%.

To calculate the sample size for sensitivity, eq. (1) was used with \( \alpha = 0.05 \), \( \beta = 0.2 \), \( \widehat{\gamma}_1 = 0.9/0.81 \), \( \widehat{TPR}_B = 0.81 \), \( \widehat{TPPR} = 0.71 \) and \( \widehat{\pi} = 0.47 \), giving a sample size of 598. To calculate the sample size for specificity, eq. (2) was used with \( \alpha = 0.05 \), \( \beta = 0.2 \), \( \widehat{\gamma}_2 = 0.8/0.66 \), \( \widehat{TNR}_B = 0.66 \), \( \widehat{TNNR} = 0.46 \) and \( \widehat{\pi} = 0.47 \), giving a sample size of 409. The minimum sample sizes for sensitivity and specificity, given \( \widehat{TPPR} = 0.81 \) and \( \widehat{TNNR} = 0.66 \), are 186 and 106, respectively. Given the disparity between the minimum and maximum sample size estimates, it was decided to re-assess the sample size at a planned interim.
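Assuming the helpers sketched earlier (n_sens, n_spec, which are illustrative rather than the paper's own code) faithfully implement eqs. (1) and (2), these figures can be reproduced as follows:

```r
g1 <- 0.9 / 0.81   # hypothesised ratio of sensitivities
g2 <- 0.8 / 0.66   # hypothesised ratio of specificities

round(n_sens(0.05, 0.2, g1, TPR_B = 0.81, TPPR = 0.71, prev = 0.47))  # 598 (worst case)
round(n_spec(0.05, 0.2, g2, TNR_B = 0.66, TNNR = 0.46, prev = 0.47))  # 409 (worst case)
round(n_sens(0.05, 0.2, g1, TPR_B = 0.81, TPPR = 0.81, prev = 0.47))  # 186 (best case)
round(n_spec(0.05, 0.2, g2, TNR_B = 0.66, TNNR = 0.66, prev = 0.47))  # 106 (best case)
```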

Table 2 gives the results after data from 187 participants had been collected. The observed values at the interim are \( \widehat{TPPR} = 0.80 \), \( \widehat{TNNR} = 0.66 \) and \( \widehat{\pi} = 0.44 \). Taking a naive approach and plugging these values directly into eqs. (1) and (2), the implied sample size becomes 242 for sensitivity and 100 for specificity, giving a total sample size for the study of 242 (or 342 and 145, respectively, had we also used the interim values of TPR_B and TNR_B). However, this method does not take into account the fact that \( \widehat{TPPR} \) and \( \widehat{TNNR} \) are random variables; we are actually interested in the true values of TPPR and TNNR under the specified alternative hypothesis. In fact, had the observed value of TPPR been 0.86, the sample size given by the naive method would have been −22, because \( \widehat{TPPR} \) would have been larger than both TPR_A and TPR_B. Clearly the naive method, which uses the random value of a single cell, is inappropriate, and a method that uses information about the value of TPPR from all of the observed cells and the specified marginals is required.
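With the same hypothetical helpers, the naive interim recalculation, and the nonsensical result obtained when \( \widehat{TPPR} \) exceeds both sensitivities, can be reproduced:

```r
round(n_sens(0.05, 0.2, g1, TPR_B = 0.81, TPPR = 0.80, prev = 0.44))  #  242
round(n_spec(0.05, 0.2, g2, TNR_B = 0.66, TNNR = 0.66, prev = 0.44))  #  100
round(n_sens(0.05, 0.2, g1, TPR_B = 0.81, TPPR = 0.86, prev = 0.44))  #  -22 (nonsensical)
```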

Table 2 Interim PET diagnostic study results

For illustration purposes, we discuss the re-estimation of the sample size for sensitivity; the estimation procedure for specificity is analogous. Taking test A as the test with the higher expected diagnostic utility, i.e. the "new" test whose performance we are comparing to the "standard", the probabilities corresponding to the cells in Table 1 under maximal negative conditional dependence between the tests are: p_1 = TPR_B − (1 − TPR_A), p_2 = 1 − TPR_B, p_3 = 1 − TPR_A, p_4 = 0. The cell probabilities when the conditional dependence between the tests is maximally positive are: p_1 = TPR_B, p_2 = TPR_A − TPR_B, p_3 = 0, p_4 = 1 − TPR_A. We could alternatively specify these cell probabilities in terms of the covariance between the two tests. Specifically, Vacek [25] gives the maximum value of the covariance as TPR_B(1 − TPR_A) and the minimum value as −(1 − TPR_A)(1 − TPR_B). Thus, the maximum and minimum values for the cells can be obtained by taking the product of the marginal probabilities associated with a cell and adding the minimum or maximum value of the covariance, for cells p_1 and p_4, or subtracting it, for cells p_2 and p_3. For example, the minimum value of p_1 is TPR_A ∙ TPR_B − (1 − TPR_A)(1 − TPR_B). Between the minimum and maximum values lies every permissible joint configuration. Let these possible joint configurations be expressed as a vector p, with p_1 = TPPR, where \( \sum_{i=1}^4 p_i = 1 \), p_1 + p_2 = TPR_A and p_1 + p_3 = TPR_B.
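As a sketch (illustrative helper names, assuming TPR_A > TPR_B and TPR_A + TPR_B > 1), the two extreme joint configurations can be written down directly:

```r
# Cell probabilities among the diseased: p1 = positive on both tests,
# p2 = positive on A only, p3 = positive on B only, p4 = negative on both.
cells_max_neg <- function(TPR_A, TPR_B) {   # maximal negative dependence
  c(p1 = TPR_B - (1 - TPR_A), p2 = 1 - TPR_B, p3 = 1 - TPR_A, p4 = 0)
}
cells_max_pos <- function(TPR_A, TPR_B) {   # maximal positive dependence
  c(p1 = TPR_B, p2 = TPR_A - TPR_B, p3 = 0, p4 = 1 - TPR_A)
}
```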

When the conditional dependence is maximally positive the required sample size is at its smallest; when it is maximally negative it is at its largest. At the beginning of the experiment we do not know under which of these possible levels of conditional dependence our data will be generated, and thus we use the largest possible, usually overly conservative, sample size estimate.

However, at the interim we can use the observed data to infer the likelihood of those data having been generated under each of the permissible joint configurations of cell probabilities under a multinomial model. A simple method of extracting an estimate of TPPR is to maximise the likelihood function of the interim data over the values of p permitted by the marginal probabilities:

$$ \mathcal{L}(\boldsymbol{p} \mid x) = \prod_{i=1}^{4} p_i^{x_i} $$

(3)

where p is the vector of joint probabilities defined above and x is the vector of observed cell frequencies. The constraints imposed on the above multinomial likelihood make the parameter space one dimensional; substituting the constraints so as to express the likelihood in terms of p_1 gives:

$$ \mathcal{L}(p_1 \mid x) = p_1^{x_1}\,(TPR_A - p_1)^{x_2}\,(TPR_B - p_1)^{x_3}\,(1 - TPR_A - TPR_B + p_1)^{x_4} $$

(4)

$$ p_1 \in \left[\,TPR_B - (1 - TPR_A),\ TPR_B\,\right] $$

Code to estimate this in R, via optimisation of the negative log-likelihood, is given in the Appendix. In effect, this method bounds the value of the conditional dependence between its minimum and maximum values under the specified marginals, and then uses the frequencies in all four cells of the table to infer the most probable value of p_1. We can use this estimate \( \widehat{p}_1 \) as our value of \( \widehat{TPPR} \), and the observed prevalence (if required) as \( \widehat{\pi} \), in eq. (1) to re-estimate the sample size at the interim.
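The Appendix code is not reproduced here; a minimal sketch of the same idea, optimising the negative log-likelihood of eq. (4) over the admissible interval for p_1, might look like the following (the function name and the use of optimise() are ours, assuming TPR_A > TPR_B and TPR_A + TPR_B > 1):

```r
# Constrained MLE of p1 = TPPR under the hypothesised marginals TPR_A and TPR_B.
# x = observed interim counts (n_A, n_B, n_C, n_D) among gold-standard positives.
estimate_tppr <- function(x, TPR_A, TPR_B, eps = 1e-9) {
  negloglik <- function(p1) {
    -(x[1] * log(p1) +
      x[2] * log(TPR_A - p1) +
      x[3] * log(TPR_B - p1) +
      x[4] * log(1 - TPR_A - TPR_B + p1))
  }
  lower <- TPR_B - (1 - TPR_A)   # maximal negative conditional dependence
  upper <- TPR_B                 # maximal positive conditional dependence
  optimise(negloglik, interval = c(lower + eps, upper - eps))$minimum
}

# e.g. with the interim counts of Table 2 (not reproduced numerically here):
# p1_hat <- estimate_tppr(x = c(n_A, n_B, n_C, n_D), TPR_A = 0.90, TPR_B = 0.81)
# round(n_sens(0.05, 0.2, g1, TPR_B = 0.81, TPPR = p1_hat, prev = 0.44))
```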



Mean sample size and bias, by TPPR and N

| TPPR | Mean, N = 50 | Mean, N = 100 | Mean, N = 150 | Mean, N = 200 | Bias, N = 50 | Bias, N = 100 | Bias, N = 150 | Bias, N = 200 |
|------|--------------|---------------|---------------|---------------|--------------|---------------|---------------|---------------|
| 0.81 | 217 (77) | 202 (35) | 198 (23) | 205 (17) | −0.00091 | −0.00027 | 0.00018 | 0.00048 |
| 0.80 | 256 (114) | 241 (71) | 238 (56) | 241 (48) | −0.00031 | 0.00035 | 0.00062 | 0.00064 |
| 0.79 | 297 (139) | 283 (92) | 281 (72) | 282 (62) | −0.00007 | 0.00069 | 0.00072 | 0.00068 |
| 0.78 | 338 (155) | 326 (105) | 325 (83) | 325 (70) | 0.00045 | 0.00056 | 0.00082 | 0.00062 |
| 0.77 | 381 (166) | 371 (114) | 369 (89) | 369 (75) | 0.00043 | 0.00054 | 0.00058 | 0.00050 |
| 0.76 | 423 (170) | 415 (118) | 413 (92) | 413 (78) | 0.00054 | 0.00035 | 0.00054 | 0.00041 |
| 0.75 | 465 (171) | 460 (118) | 457 (93) | 456 (79) | 0.00069 | 0.00056 | 0.00029 | 0.00033 |
| 0.74 | 506 (166) | 503 (115) | 501 (91) | 500 (78) | 0.00029 | 0.00028 | 0.00031 | 0.00031 |
| 0.73 | 546 (156) | 546 (107) | 545 (86) | 543 (73) | 0.00047 | 0.00045 | 0.00022 | 0.00022 |
| 0.72 | 585 (143) | 588 (95) | 88 (76) | 586 (65) | 0.00043 | 0.00027 | 0.00017 | 0.00022 |
| 0.71 | 621 (124) | 629 (75) | 630 (59) | 629 (50) | 0.00024 | 0.00037 | 0.00033 | 0.00019 |

Coverage and RMSE, by TPPR and N

| TPPR | Coverage, N = 50 | Coverage, N = 100 | Coverage, N = 150 | Coverage, N = 200 | RMSE, N = 50 | RMSE, N = 100 | RMSE, N = 150 | RMSE, N = 200 |
|------|------------------|-------------------|-------------------|-------------------|--------------|---------------|---------------|---------------|
| 0.81 | 0.923 | 0.925 | 0.924 | 0.923 | 80 | 36 | 23 | 18 |
| 0.80 | 0.936 | 0.937 | 0.936 | 0.936 | 115 | 71 | 62 | 48 |
| 0.79 | 0.942 | 0.943 | 0.944 | 0.943 | 140 | 92 | 72 | 62 |
| 0.78 | 0.947 | 0.947 | 0.947 | 0.946 | 156 | 105 | 80 | 70 |
| 0.77 | 0.948 | 0.948 | 0.949 | 0.947 | 166 | 114 | 89 | 75 |
| 0.76 | 0.949 | 0.950 | 0.950 | 0.949 | 171 | 118 | 92 | 78 |
| 0.75 | 0.950 | 0.950 | 0.949 | 0.950 | 171 | 119 | 93 | 79 |
| 0.74 | 0.950 | 0.950 | 0.950 | 0.950 | 166 | 115 | 91 | 78 |
| 0.73 | 0.950 | 0.951 | 0.951 | 0.951 | 156 | 107 | 86 | 73 |
| 0.72 | 0.951 | 0.949 | 0.950 | 0.951 | 143 | 95 | 76 | 65 |
| 0.71 | 0.949 | 0.949 | 0.950 | 0.950 | 124 | 75 | 59 | 50 |