Explained variance
Statistical techniques usually applied
to quantify the relationship between Y and X
are based on the assumption that all errors are in the
dependent variable, the independent variable having been measured
without error. This is never strictly the case, but as long as errors
in the independent variable are small the consequences are usually
limited. Other usual requirements are that the variables
are linearly related and that they are normally distributed
(ref. 1).
'Normal' is a statistical notion that refers to a particular bell-shaped
distribution of measured variables, such as in the accompanying
illustrations. A normal distribution is symmetric
and is fully described by the mean of all measurements and the
standard deviation (SD), a measure of the spread around the
mean. From inspection of the X-Y diagrams (see figures) it
appears that both conditions (linearity and normality) are in all
likelihood met. In statistical software the regression line, which describes
the relation between Y and X, is then computed by minimizing
the sum of the squared differences between observed (y) and predicted (Y)
values (ref. 2).
With the best fitting line the deviations y - Y are minimized:
there are negative and positive deviations, but their mean
is nil. We can now subdivide the bell into two portions: a
part of total variation is explained by the computed relationship
between dependent (IVC in our case) and independent (length)
variable, shown as the dark bell; what remains is unexplained
variance, i.e. variation not related to differences in length. The unexplained
or residual variance is the remaining yellow bell area. The
explained variance can be expressed as a proportion:
the ratio of the area of the dark bell to the total bell
area. The strength of the relationship between IVC and length
is expressed in the coefficient of correlation. How the computations
are performed is beyond the scope of these texts; it suffices
to know that the coefficient of correlation (symbol r) is
the square root of the proportion of explained variance. Hence, if r = 0.80,
then the explained variance is 0.80² = 0.64 or 64%. If
r = 0.10, then only 1% of total variance is explained by length.
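For readers who nevertheless want to see the arithmetic, the following is a minimal sketch in Python. The length and IVC values are invented for illustration and are not data from this text; the names fit_line, explained and residuals are likewise hypothetical. It fits the line by least squares, checks that the mean deviation is nil, and derives r from the proportion of explained variance.

```python
from math import sqrt

# Hypothetical data, invented for illustration: length (cm) and IVC (litres).
length = [150.0, 160.0, 165.0, 170.0, 175.0, 180.0, 185.0, 190.0]
ivc = [3.1, 3.6, 4.2, 4.0, 4.9, 4.6, 5.4, 5.6]


def mean(values):
    return sum(values) / len(values)


def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    a = my - b * mx
    return a, b


a, b = fit_line(length, ivc)
predicted = [a + b * xi for xi in length]
residuals = [yi - pi for yi, pi in zip(ivc, predicted)]

# Total variation around the mean, and the part left over after the fit.
total_ss = sum((yi - mean(ivc)) ** 2 for yi in ivc)
residual_ss = sum(e ** 2 for e in residuals)

# Proportion of explained variance; r is its square root with the sign of b.
explained = 1.0 - residual_ss / total_ss
r = sqrt(explained) if b >= 0 else -sqrt(explained)

print(f"mean residual      = {mean(residuals):.6f}")
print(f"explained variance = {explained:.2f}")
print(f"r                  = {r:.2f}")
```

The sums of squares play the role of the bell areas described above: total_ss corresponds to the whole bell and residual_ss to the unexplained part.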
In the relationship Y = a + b·X we have now computed a (the intercept) and b (the slope). We have also estimated the residual scatter (RSD, residual standard deviation). This allows us to compute the expected value of Y for each value of X (within the original range!) and, assuming a normal distribution, to estimate from the RSD how measurements will scatter around Y in practice.
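As an illustration of how a, b and the RSD might be used together, here is a small Python sketch. The data are again invented, and the helper expected_ivc is a hypothetical name. The RSD is computed with n - 2 degrees of freedom, and the band of ±1.96 RSD is only an approximation of where about 95% of measurements would fall under a normal distribution, since it ignores the uncertainty in a and b themselves.

```python
from math import sqrt

# Hypothetical data, invented for illustration: length (cm) and IVC (litres).
length = [150.0, 160.0, 165.0, 170.0, 175.0, 180.0, 185.0, 190.0]
ivc = [3.1, 3.6, 4.2, 4.0, 4.9, 4.6, 5.4, 5.6]

n = len(length)
mean_x = sum(length) / n
mean_y = sum(ivc) / n

# Least-squares slope b and intercept a for Y = a + b*X.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(length, ivc))
sxx = sum((x - mean_x) ** 2 for x in length)
b = sxy / sxx
a = mean_y - b * mean_x

# Residual standard deviation: two parameters were estimated, hence n - 2.
residual_ss = sum((y - (a + b * x)) ** 2 for x, y in zip(length, ivc))
rsd = sqrt(residual_ss / (n - 2))


def expected_ivc(x):
    """Expected IVC at length x; refuses to extrapolate beyond the data."""
    if not min(length) <= x <= max(length):
        raise ValueError("length outside the range of the original data")
    return a + b * x


y_hat = expected_ivc(172.0)
low, high = y_hat - 1.96 * rsd, y_hat + 1.96 * rsd
print(f"expected IVC at 172 cm: {y_hat:.2f} L")
print(f"approximate 95% range : {low:.2f} to {high:.2f} L")
```

The range check mirrors the warning in the text: the equation describes the data only within the range of X that was actually measured.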
The example considers only the relationship between IVC and length. In adults the IVC declines with age. A portion of the residual variance, i.e. the variance that remains after allowing for differences in length (standing height), can therefore be explained by differences in age. We can thus include age in the regression analysis and assess whether its addition reduces residual variance in a meaningful way (‘significant’ is a statistical term). In adults this is invariably the case. The regression equation in adults is therefore:
IVC = a + b·length + c·age
where b, the regression coefficient for length, is positive because the IVC increases with increasing length, but c is negative, indicating that the IVC declines with age.
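The effect of adding age can be sketched as follows in Python. The data are invented so that they behave like the adult case described above, and all variable names are hypothetical. The sketch solves the least-squares equations for two predictors directly and compares the residual variance with and without age; it does not perform a significance test, which statistical software would normally report.

```python
# Hypothetical adult data, invented for illustration only.
length = [160.0, 165.0, 170.0, 175.0, 180.0, 185.0, 170.0, 175.0, 165.0, 180.0]  # cm
age = [25.0, 60.0, 35.0, 70.0, 30.0, 45.0, 55.0, 20.0, 40.0, 65.0]  # years
ivc = [4.28, 3.48, 4.53, 3.83, 5.25, 4.95, 3.98, 5.18, 4.03, 4.28]  # litres

n = len(ivc)


def mean(values):
    return sum(values) / len(values)


def cross(u, v):
    """Sum of products of deviations from the means of u and v."""
    mu, mv = mean(u), mean(v)
    return sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v))


s_ll, s_aa, s_la = cross(length, length), cross(age, age), cross(length, age)
s_ly, s_ay = cross(length, ivc), cross(age, ivc)

# Model 1: IVC = a1 + b1*length
b1 = s_ly / s_ll
a1 = mean(ivc) - b1 * mean(length)
rss_length = sum((y - (a1 + b1 * x)) ** 2 for x, y in zip(length, ivc))

# Model 2: IVC = a + b*length + c*age (normal equations, solved by Cramer's rule)
det = s_ll * s_aa - s_la ** 2
b = (s_aa * s_ly - s_la * s_ay) / det
c = (s_ll * s_ay - s_la * s_ly) / det
a = mean(ivc) - b * mean(length) - c * mean(age)
rss_both = sum(
    (y - (a + b * x + c * z)) ** 2 for x, z, y in zip(length, age, ivc)
)

# Residual variance: residual sum of squares over the degrees of freedom.
var_length = rss_length / (n - 2)
var_both = rss_both / (n - 3)

print(f"length only : residual variance = {var_length:.4f}")
print(f"length + age: residual variance = {var_both:.4f}")
print(f"b (length) = {b:.4f}, c (age) = {c:.4f}")
```

With these invented values b comes out positive and c negative, in line with the signs described above.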
Ref. 1 - The normal distribution
Altman DG, Bland JM. The normal distribution. BMJ 1995; 310: 298.
Ref. 2 - Regression analysis
Greenhalgh T. Statistics for the non-statistician. II: “Significant”
relations and their pitfalls. BMJ 1997; 315: 422-425.