Interpretation of the Confusion Matrix

Actual Positive, Predicted Positive = TP

Actual Positive, Predicted Negative = FN

Actual Negative, Predicted Negative = TN

Actual Negative, Predicted Positive = FP

From the sample we know:

Actual Positive = TP + FN

Actual Negative = TN + FP

1. Prevalence: if the sample is representative of the population, prevalence is estimated by the ratio Actual Positive / sample size.

The population prevalence of a disease is defined as the probability that a randomly selected person from this population will have the disease. As the table shows, the prevalence is estimated by (TP + FN)/(TP + FP + FN +TN) = nD/n. For the Victoria BC data the prevalence is 665/96420 = 0.0069. This is a valid estimator only if the table is a summary of a representative sample of the population under analysis.

In other words, the sample should have been taken at random and the tabulation made subsequently. This is not the case in many studies. The prevalence of some diseases in a general population is often so small that insisting on a random sample would require huge sample sizes in order to obtain a nonzero TP or FN table entries. When the table is made from available cases and controls (convenience samples), the prevalence for the population cannot be estimated from it.  — <<Statistics for Bioengineering Science>>
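As a quick check of the quoted figure, the estimate can be computed directly. This is only a minimal sketch: the function name `estimate_prevalence` is made up for illustration, and the result is meaningful only when the table summarizes a representative sample.

```python
def estimate_prevalence(actual_positive, sample_size):
    """Estimate prevalence as (TP + FN) / n.

    Only meaningful when the table summarizes a representative (random) sample.
    """
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    if not 0 <= actual_positive <= sample_size:
        raise ValueError("actual_positive must lie between 0 and sample_size")
    return actual_positive / sample_size


# Victoria BC figures quoted above: nD = 665 diseased out of n = 96420.
victoria_prevalence = estimate_prevalence(665, 96420)
print(round(victoria_prevalence, 4))  # 0.0069
```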

2. PPV: positive predictive value. PPV = TP/(TP+FP). How the sample was drawn matters. If the sample was drawn at random from the population, the table may reflect the real population; check whether the sample prevalence is close to the known prevalence. If the sample was drawn by convenience, we need Bayes' rule, with the prevalence taken from external information, to re-estimate the PPV.

Also known as precision; equal to 1 - FDR, where FDR is the false discovery rate.

FDR = #FP/(#FP + #TP)

One of the most important measures is the positive predictive value (PPV). This is correct only if the population prevalence is well estimated by nD/n, that is, if the table is representative of its population. This is approximately the case for the Victoria BC data; the PPV is well estimated by 495/5401 = 0.0916.

If the table is constructed from a convenience sample, the prevalence (Pre) would be external information, and the PPV is calculated as:

Sensitivity * Prevalence / {Sensitivity * Prevalence + (1-Specificity) * (1-Prevalence)}

— <<Statistics for Bioengineering Science>>

Details: 

Prevalence: P(ill). Normalization condition: 1-prevalence = P(well).

Sensitivity: P(pos|ill). Normalization condition: 1-sensitivity = P(neg|ill).

Specificity: P(neg|well). Normalization condition: 1-specificity = P(pos|well).

PPV = Sensitivity * Prevalence / {Sensitivity * Prevalence + (1-Specificity) * (1-Prevalence)} = P(pos|ill) * P(ill) / {P(pos|ill) * P(ill) + P(pos|well) * P(well)} = P(pos|ill) * P(ill) / P(pos) = P(ill|pos)
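The Bayes-rule form can be sketched as a small function. The name `ppv_from_rates` is hypothetical, and the four cell counts below are not given in the book excerpt: they are back-calculated from the quoted totals (495 true positives, 5401 predicted positives, 665 actual positives, 96420 people in total), so treat them as an assumption. When the prevalence is taken from the same table, the formula should reproduce the direct estimate 495/5401.

```python
def ppv_from_rates(sensitivity, specificity, prevalence):
    """PPV = P(ill | pos) via Bayes' rule."""
    true_pos_mass = sensitivity * prevalence                   # P(pos | ill) * P(ill)
    false_pos_mass = (1.0 - specificity) * (1.0 - prevalence)  # P(pos | well) * P(well)
    p_pos = true_pos_mass + false_pos_mass                     # P(pos)
    if p_pos == 0:
        raise ValueError("P(pos) is zero; PPV is undefined")
    return true_pos_mass / p_pos


# Cell counts back-calculated from the quoted Victoria BC totals (assumption).
tp = 495
fp = 5401 - tp                 # predicted positives: 5401
fn = 665 - tp                  # actual positives: 665
tn = 96420 - tp - fp - fn      # sample size: 96420

sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
prevalence = (tp + fn) / (tp + fp + fn + tn)

ppv = ppv_from_rates(sensitivity, specificity, prevalence)
print(round(ppv, 4), round(tp / (tp + fp), 4))  # both 0.0916
```

For a convenience sample, the same function would be called with a prevalence taken from external information instead of the table.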

PPV is important!!!

Example: a sample with 1 actual positive (TP = 1, FN = 0) and 102 actual negatives (TN = 100, FP = 2).

Sensitivity = TP/#T = 1/1 = 100%

Specificity = TN/#N = 100/102 = 98%

PPV = TP/(TP+FP) = 1/(1+2) = 33.3%

Understanding:

Sensitivity only describes what fraction of the actual positives the test catches. Specificity only describes what fraction of the actual negatives the test correctly rejects. But TP = sensitivity * (number of actual positives), and FP = (1 - specificity) * (number of actual negatives). So the size ratio between actual positives and actual negatives has a huge effect on PPV. That's why we need to know the prevalence!
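To see the effect numerically, the sketch below holds the sensitivity and specificity of the example fixed and varies only the prevalence. The helper `ppv_from_rates` and the list of prevalence values are chosen for illustration; they are not from the book.

```python
def ppv_from_rates(sensitivity, specificity, prevalence):
    """PPV = P(ill | pos) via Bayes' rule."""
    true_pos_mass = sensitivity * prevalence
    false_pos_mass = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos_mass / (true_pos_mass + false_pos_mass)


# Test characteristics from the example: 100% sensitivity, 100/102 specificity.
sensitivity = 1.0
specificity = 100 / 102

# Prevalence in the example sample: 1 actual positive out of 103.
example_ppv = ppv_from_rates(sensitivity, specificity, 1 / 103)
print(f"example PPV = {example_ppv:.3f}")

# Same test, different (illustrative) prevalences.
prevalences = [0.001, 0.01, 0.1, 0.5]
ppvs = [ppv_from_rates(sensitivity, specificity, p) for p in prevalences]
for p, v in zip(prevalences, ppvs):
    print(f"prevalence = {p:<6} PPV = {v:.3f}")
```

With the test itself unchanged, the PPV climbs from under 5% at a prevalence of 0.1% to about 98% at a prevalence of 50%.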

3. Sensitivity = TP/#T. Here #T = TP + FN, the number of actual positives, is the normalization factor. 1-Sensitivity = FN/#T = False Negative Rate (FNR)

Also known as True Positive Rate (TPR), recall, or hit rate.

4. Specificity = TN/#N. Here #N = TN + FP, the number of actual negatives, is the normalization factor.

Also known as selectivity or True Negative Rate (TNR).

1-Specificity = FP/#N = False Positive Rate (FPR)
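The rates in these notes can all be read off the four cells of the table. The function below is a minimal sketch with a made-up name (`confusion_rates`); it is applied to the counts implied by the example above (TP = 1, FN = 0, TN = 100, FP = 2).

```python
def confusion_rates(tp, fn, tn, fp):
    """Return the row- and column-normalized rates of a 2x2 confusion matrix."""
    actual_pos = tp + fn       # #T: normalization factor for sensitivity
    actual_neg = tn + fp       # #N: normalization factor for specificity
    predicted_pos = tp + fp    # normalization factor for PPV
    if min(actual_pos, actual_neg, predicted_pos) == 0:
        raise ValueError("a normalization factor is zero; some rates are undefined")
    return {
        "sensitivity": tp / actual_pos,   # TPR, recall
        "fnr": fn / actual_pos,           # 1 - sensitivity
        "specificity": tn / actual_neg,   # TNR
        "fpr": fp / actual_neg,           # 1 - specificity
        "ppv": tp / predicted_pos,        # precision
        "fdr": fp / predicted_pos,        # 1 - PPV
    }


rates = confusion_rates(tp=1, fn=0, tn=100, fp=2)
for name, value in rates.items():
    print(f"{name}: {value:.3f}")
```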

 
