Statistical techniques commonly used to express the relationship between Y and X in numerical form assume that all errors are in the dependent variable, the independent variable having been measured without error. This is never strictly the case, but as long as errors in the independent variable are small the consequences are usually limited. Other requirements are usually that the variables are linearly related, and that they are normally distributed (ref. 1). 'Normal' is a statistical notion that refers to a particular bell-shaped distribution of measured values, such as in the accompanying illustrations. In a normal distribution the bell is symmetric and can be described by the mean of all measurements and the standard deviation (SD), a measure of the spread around the mean. From inspection of the X-Y diagrams (see figures) it appears that both conditions are in all likelihood met. In statistical software the regression line, which describes the relation between Y and X, is then computed by minimizing the sum of the squared differences between observed (y) and predicted (Y) values (ref. 2).
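The least-squares computation described above can be sketched in a few lines of Python. This is only an illustration, not the procedure of any particular statistical package: the function name fit_line and the five length/IVC pairs are invented for the example.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of Y = a + b*X; returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of squares of X and sum of cross-products around the means
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx              # slope
    a = mean_y - b * mean_x    # intercept
    return a, b


# Hypothetical data, invented for illustration:
# standing height (cm) and IVC (litres) of five subjects
length = [150, 160, 170, 180, 190]
ivc = [3.1, 3.9, 4.4, 5.2, 5.9]

a, b = fit_line(length, ivc)

# Deviations y - Y between observed and predicted values
residuals = [yi - (a + b * xi) for xi, yi in zip(length, ivc)]
print(f"a = {a:.2f}, b = {b:.3f}, mean deviation = {sum(residuals) / len(residuals):.6f}")
```

With these made-up numbers the slope is 0.069 L per cm, and the positive and negative deviations cancel, so their mean is zero apart from rounding.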
With the best fitting line the deviations y - Y are minimized: there are negative and positive deviations, but their mean is zero. We can now subdivide the bell into two portions: part of the total variation is explained by the computed relationship between the dependent variable (IVC in our case) and the independent variable (length), shown as the dark bell; what remains is unexplained variance, i.e. variance not related to differences in length. This unexplained or residual variance is the remaining yellow bell area. The explained variance can be expressed as a proportion: the ratio of the area of the dark bell to the total bell area. The strength of the relationship between IVC and length is expressed in the coefficient of correlation. How the computations are performed is beyond the scope of these texts; it suffices to know that the coefficient of correlation (symbol r) is the square root of the proportion of explained variance. Hence, if r = 0.80, then the explained variance is 0.80² = 0.64 or 64%. If r = 0.10, then only 1% of total variance is explained by length.
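As a rough illustration of the link between r and explained variance, the sketch below computes r from hypothetical length/IVC pairs and compares r² with the proportion of variance explained by the fitted line. The data and variable names are assumptions made for the example.

```python
import math

# Hypothetical data, invented for illustration:
# standing height (cm) and IVC (litres)
length = [150, 160, 170, 180, 190]
ivc = [3.1, 3.9, 4.4, 5.2, 5.9]

n = len(length)
mean_x = sum(length) / n
mean_y = sum(ivc) / n
sxx = sum((x - mean_x) ** 2 for x in length)
syy = sum((y - mean_y) ** 2 for y in ivc)    # total variation in IVC
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(length, ivc))

# Coefficient of correlation
r = sxy / math.sqrt(sxx * syy)

# Least-squares line and the variation it leaves unexplained
b = sxy / sxx
a = mean_y - b * mean_x
ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(length, ivc))

# Proportion of total variation explained by length
explained = 1 - ss_res / syy

print(f"r = {r:.3f}, r squared = {r ** 2:.3f}, explained = {explained:.3f}")
```

In a regression with a single independent variable the two quantities agree: the square of r equals the explained proportion of the total variation.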
In the relationship Y = a + b·X we have now computed a (the intercept) and b (the slope). We have also estimated the residual scatter (RSD, residual standard deviation). This allows us to compute for each value of X (within the original range!) the expected value of Y, and to estimate from the RSD how measurements will in practice scatter around Y in the case of a normal distribution.
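The use of a, b and the RSD can be sketched as follows. The data and the helper names fit_with_rsd and predict are hypothetical. The band of 1.96 × RSD around the expected value assumes normally distributed residuals and, as a simplification, ignores the uncertainty in a and b themselves.

```python
import math


def fit_with_rsd(x, y):
    """Least-squares fit of Y = a + b*X; returns (a, b, rsd)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    # Two coefficients were estimated, hence n - 2 degrees of freedom
    rsd = math.sqrt(ss_res / (n - 2))
    return a, b, rsd


def predict(x_new, a, b, rsd, x_min, x_max):
    """Expected Y and an approximate 95% band for single measurements."""
    if not x_min <= x_new <= x_max:
        # The line is only supported by data within the original range
        raise ValueError("X lies outside the range used to fit the line")
    expected = a + b * x_new
    half_width = 1.96 * rsd   # assumes normally distributed residuals
    return expected, expected - half_width, expected + half_width


# Hypothetical data, invented for illustration
length = [150, 160, 170, 180, 190]   # standing height, cm
ivc = [3.1, 3.9, 4.4, 5.2, 5.9]      # IVC, litres

a, b, rsd = fit_with_rsd(length, ivc)
expected, low, high = predict(175, a, b, rsd, min(length), max(length))
print(f"expected IVC = {expected:.2f} L, approximate 95% band {low:.2f} to {high:.2f} L")
```

Raising an error for values of X outside the fitted range is one way to enforce the warning above: the line says nothing reliable about lengths that were not represented in the original data.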
The example relates only to the relationship between IVC and length. In adults the IVC declines with age. A portion of the residual variance, i.e. the variance that remains after allowing for differences in standing height, can therefore be explained by differences in age. We can thus include age in the regression analysis and assess whether its addition reduces the residual variance in a meaningful way (‘significant’ is a statistical term). In adults this is invariably the case. The regression equation in adults is therefore:
IVC = a + b·length + c·age
where b, the regression coefficient for length, is positive because the IVC increases with increasing length, but c is negative, indicating that the IVC declines with age.
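A minimal sketch of a regression on both length and age is given below. The data are hypothetical and were constructed without noise from the assumed coefficients a = -4.0, b = 0.06 and c = -0.02, so that the fit can be checked against known values. The function name fit_two_predictors is likewise an assumption. A formal test of significance is not shown.

```python
def fit_two_predictors(x1, x2, y):
    """Least-squares fit of Y = a + b*x1 + c*x2; returns (a, b, c)."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    # Sums of squares and cross-products around the means
    s11 = sum((u - m1) ** 2 for u in x1)
    s22 = sum((v - m2) ** 2 for v in x2)
    s12 = sum((u - m1) * (v - m2) for u, v in zip(x1, x2))
    s1y = sum((u - m1) * (w - my) for u, w in zip(x1, y))
    s2y = sum((v - m2) * (w - my) for v, w in zip(x2, y))
    # Solve the two normal equations for b and c (Cramer's rule)
    det = s11 * s22 - s12 ** 2
    b = (s1y * s22 - s12 * s2y) / det
    c = (s11 * s2y - s12 * s1y) / det
    a = my - b * m1 - c * m2
    return a, b, c


# Hypothetical subjects: standing height (cm) and age (years)
length = [160, 170, 180, 165, 175, 185]
age = [25, 40, 55, 60, 30, 45]
# IVC (litres) generated from the assumed equation, without noise
ivc = [-4.0 + 0.06 * l - 0.02 * g for l, g in zip(length, age)]

a, b, c = fit_two_predictors(length, age, ivc)

# Residual sum of squares with length and age in the equation
rss_both = sum((y - (a + b * l + c * g)) ** 2 for l, g, y in zip(length, age, ivc))

# Residual sum of squares with length only, for comparison
n = len(ivc)
ml, my = sum(length) / n, sum(ivc) / n
b1 = sum((l - ml) * (y - my) for l, y in zip(length, ivc)) / sum((l - ml) ** 2 for l in length)
a1 = my - b1 * ml
rss_length_only = sum((y - (a1 + b1 * l)) ** 2 for l, y in zip(length, ivc))

print(f"b = {b:.3f}, c = {c:.3f}")
print(f"residual sum of squares: length only {rss_length_only:.4f}, length and age {rss_both:.4f}")
```

In this constructed example the coefficient for length comes out positive and that for age negative, and adding age lowers the residual sum of squares. With real measurements the reduction would be smaller and would have to be tested for significance.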
Ref. 1 - The normal distribution
Altman DG, Bland JM. The normal distribution. BMJ 1995; 310: 298.
Ref. 2 - Regression analysis
Greenhalgh T. Statistics for the non-statistician. II: “Significant” relations and their pitfalls. BMJ 1997; 315: 422-425.