
Regression
Regression means change in the measurements of a variable character, on the positive or negative side, beyond the mean. The regression coefficient is a measure of the change in the dependent character (Y) for one unit of change in the independent character (X).

It is denoted by the letter ‘b’, which indicates the relative change (yc) in one variable (Y) from the mean (Ȳ) for one unit of move, deviation or change (x) in another variable (X) from the mean (X̄) when both are correlated. This helps to calculate or predict any expected value of Y, i.e. Yc corresponding to X. When the corresponding values Yc1, Yc2, …, Ycn are plotted on a graph, a straight line called the regression line or the mean correlation line (Y on X) is obtained. The same was referred to as an imaginary line while explaining various types of correlation.

The regression line may be Y on X, if corresponding values of Y, i.e. Yc, are calculated for X values using the regression coefficient denoted as byx, which is the value of yc, i.e. Yc – Ȳ, for one unit of x beyond X̄, or vice versa for X on Y:
yc = Yc – Ȳ = byx (X – X̄)
and
xc = Xc – X̄ = bxy (Y – Ȳ)
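These deviation-form coefficients and the resulting prediction can be sketched in Python; the paired height-weight numbers below are purely illustrative, not data from the text:

```python
# Hypothetical paired observations: heights (cm) and weights (kg).
X = [150, 155, 160, 165, 170]
Y = [50, 53, 55, 58, 60]

n = len(X)
x_bar = sum(X) / n   # mean of X
y_bar = sum(Y) / n   # mean of Y

# Sums of squared deviations and of cross-products
ss_x = sum((x - x_bar) ** 2 for x in X)
ss_y = sum((y - y_bar) ** 2 for y in Y)
sp_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))

# Regression coefficients for each direction
b_yx = sp_xy / ss_x   # Y on X: change in Y per unit change in X
b_xy = sp_xy / ss_y   # X on Y: change in X per unit change in Y

# Deviation-form prediction: Yc = y_bar + b_yx * (X - x_bar)
y_c = y_bar + b_yx * (161 - x_bar)
```

Plotting the predicted Yc values for a range of X values would trace the regression line of Y on X described above.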

Regression coefficients of either of the two variables, X and Y, i.e. byx or bxy for one unit change of the other, can be found by the appropriate formulae.
The above-mentioned formulae could also be written as
Y´ = bX + a
where Y´ represents the predicted value; X represents the known value; and b and a represent numbers calculated from the original correlation analysis.

If height changes by 1 cm from the mean height (X – X̄ = x, e.g. 161 – 160 = 1 cm) on the baseline, the increment in weight from the mean weight Ȳ on the vertical axis is calculated by finding the regression coefficient byx. It is the increase in weight in kg (Yc – Ȳ = b) corresponding to an increase in height (x) of 1 cm from the mean (X̄).

Correlation gives the degree and direction of the relationship between two variables, whereas regression analysis enables us to predict the values of one variable on the basis of the other. Thereby, the cause-and-effect relationship between two variables is understood more precisely.

In Figures 1C and D, the two regression lines are shown: one is X on Y and the other is Y on X, indicating moderately positive and moderately negative correlations, respectively. The two regression lines intersect at the point where perpendiculars drawn from the means of the X and Y variables meet.

When there is perfect correlation (r = +1 or –1), the two regression lines coincide into one straight line (Figs 1A and B). On the other hand, when correlation is partial, the lines are separate and diverge, forming an acute angle at the meeting point of the perpendiculars drawn from the means of the two variables.

Figures 1A to E: Diagrams based on hypothetical data showing different types of correlation and regression lines

The lesser the correlation, the greater the divergence of the angle. When correlation becomes nil (r = 0), i.e. the variables are independent, the two lines intersect at a right angle (Fig. 1E).

Steepness of the lines indicates the extent of correlation: the closer the correlation, the greater the steepness of the regression lines of X on Y and Y on X.

Mathematical Expression:
The above-mentioned formula could also be written as
Y´ = bX + a
Where, Y´ represents the predicted value
X represents the known value
b and a represent numbers calculated from the original correlation analysis

This is referred to as the Least Square Regression Equation.

Adding positive and negative predictive errors (for points above and below the regression line, respectively) always produces an arithmetic standoff of zero. To avoid this, the regression line is placed so as to minimize not the total predictive error but the total squared predictive error, that is, the total of all squared predictive errors. When located in this fashion, the regression line is referred to as the least squares regression line. Although more difficult to visualize, this approach is consistent with the original aim: to minimize the total predictive error, or some version of it, thereby providing a more favourable prognosis for our predictions.
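A brief Python check of this idea: for the least squares line, any small change to the slope or intercept increases the total squared predictive error. The data here are hypothetical:

```python
# Hypothetical data for the demonstration
X = [1, 2, 3, 4, 5]
Y = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(X)
x_bar, y_bar = sum(X) / n, sum(Y) / n

# Least squares slope and intercept from deviation scores
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))
     / sum((x - x_bar) ** 2 for x in X))
a = y_bar - b * x_bar

def total_squared_error(a0, b0):
    """Total of squared predictive errors for the line Y' = a0 + b0*X."""
    return sum((y - (a0 + b0 * x)) ** 2 for x, y in zip(X, Y))

best = total_squared_error(a, b)

# Nudging the intercept or slope in either direction only increases the total
for da, db in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.05), (0.0, -0.05)]:
    assert total_squared_error(a + da, b + db) > best
```

The positive and negative errors of the least squares line itself sum to zero, which is exactly why the squared errors are minimized instead.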

Finding the values of ‘a’ and ‘b’:
To obtain a working regression equation, solve each of the following expressions, first for b and then for a, using data from the original correlation analysis. The expression for b reads:
b = r √(SSy / SSx)
Where, r is the calculated Pearson’s correlation coefficient; SSy represents the sum of squares for all Y scores; and SSx represents the sum of squares for all X scores.

The expression for a reads:
a = Ȳ – bX̄

Where, Ȳ and X̄ refer to the sample means for all Y and X scores, respectively, and b is defined by the preceding expression.
The values of all terms in the expressions for b and a can be obtained from the original correlation analysis, either directly, as with the value of r, or indirectly, as with the values of the remaining terms: SSy, SSx, Ȳ, and X̄.
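A minimal Python sketch of these two expressions, assuming illustrative summary values (r, SSx, SSy, and the two means) taken from a prior correlation analysis rather than from the text:

```python
import math

# Illustrative summary values from a prior correlation analysis
r = 0.8                      # Pearson's correlation coefficient
ss_x, ss_y = 250.0, 160.0    # sums of squares for X and Y
x_bar, y_bar = 160.0, 55.0   # sample means

b = r * math.sqrt(ss_y / ss_x)   # b = r * sqrt(SSy / SSx)
a = y_bar - b * x_bar            # a = Y_bar - b * X_bar
```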

Assumptions:
Use of the regression equation requires that the underlying relationship be linear. You need to worry about violating this assumption only when the scatterplot for the original correlation analysis reveals an obviously bent or curvilinear dot cluster; in the unlikely event that a dot cluster describes a pronounced curvilinear trend, consult more advanced statistical tools. Use of the standard error of estimate (represented as sy|x) assumes that, except for chance, the dots in the original scatterplot are dispersed equally about all segments of the regression line. You need to worry about violating this assumption of homoscedasticity only when the scatterplot reveals a dramatically different type of dot cluster.

Calculation method:
Using sample data, the Least Square Regression Equation is determined through the following steps.

Step 1. Determine the value of r using the correlation coefficient formula:
r = Σ(x – x̄)(y – ȳ) / √[Σ(x – x̄)² Σ(y – ȳ)²]
Where, r is the correlation coefficient; x̄ is the mean of all x scores; and ȳ is the mean of all y scores.

Step 2. Determine the constant b:
b = r √[Σ(y – ȳ)² / Σ(x – x̄)²]
Where, b is the coefficient value of the slope B; x̄ is the mean of all x scores; and ȳ is the mean of all y scores.

Step 3. Determine the constant a:
a = ȳ – bx̄
Where, ȳ and x̄ refer to the sample means for all y and x scores, respectively, and b is the coefficient value of the slope B as determined in the preceding step (Step 2).

Step 4. Determine the Least Square Regression Equation:
Y´ = a + bX
Where, Y´ represents the predicted value; X represents the known value; and b and a represent numbers calculated from the original correlation analysis.

Step 5. Compute the value of R². The value of R² (coefficient of determination) is calculated using this computation formula:
R² = r²
Where, r is the Pearson’s correlation coefficient.

Step 6. Inference
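Steps 1 to 5 can be sketched in Python; all numbers below are illustrative, not from the worked example that follows:

```python
import math

# Illustrative paired scores
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Step 1: Pearson's r from deviation scores
ss_x = sum((xi - x_bar) ** 2 for xi in x)
ss_y = sum((yi - y_bar) ** 2 for yi in y)
sp = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
r = sp / math.sqrt(ss_x * ss_y)

# Step 2: slope b = r * sqrt(SSy / SSx)
b = r * math.sqrt(ss_y / ss_x)

# Step 3: intercept a = y_bar - b * x_bar
a = y_bar - b * x_bar

# Step 4: least squares regression equation Y' = a + bX
def predict(x_new):
    return a + b * x_new

# Step 5: coefficient of determination
r_squared = r ** 2
```

The inference in Step 6 then interprets r_squared as the proportion of variation in y explainable by x under the linear model.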
Example:
A research student recorded the body weight (in kg) and the mean arterial blood pressure (in mmHg) of 26 subjects. Find the value of the Pearson’s correlation coefficient (r) between body weight and mean arterial blood pressure, and derive the least squares regression equation.
S.No. | Body Weight (in kg) (x) | Mean Arterial Blood Pressure (in mmHg) (y) |
1 | 20 | 80 |
2 | 30 | 78 |
3 | 40 | 90 |
4 | 50 | 92 |
5 | 60 | 76 |
6 | 70 | 78 |
7 | 80 | 86 |
8 | 90 | 76 |
9 | 100 | 108 |
10 | 110 | 74 |
11 | 120 | 85 |
12 | 130 | 108 |
13 | 140 | 110 |
14 | 150 | 88 |
15 | 160 | 90 |
16 | 170 | 80 |
17 | 180 | 118 |
18 | 20 | 150 |
19 | 30 | 89 |
20 | 40 | 90 |
21 | 50 | 75 |
22 | 60 | 78 |
23 | 70 | 108 |
24 | 80 | 145 |
25 | 90 | 198 |
26 | 100 | 149 |

Solution:
S.No. | x | y | (x – x̄) | (y – ȳ) | (x – x̄)² | (y – ȳ)² |
1 | 20 | 80 | 20 – 142.8 | 80 – 99.9 | 15079.84 | 396.01 |
2 | 30 | 78 | 30 – 142.8 | 78 – 99.9 | 12723.84 | 479.61 |
3 | 40 | 90 | 40 – 142.8 | 90 – 99.9 | 10567.84 | 98.01 |
4 | 50 | 92 | 50 – 142.8 | 92 – 99.9 | 8611.84 | 62.41 |
5 | 60 | 76 | 60 – 142.8 | 76 – 99.9 | 6855.84 | 571.21 |
6 | 70 | 78 | 70 – 142.8 | 78 – 99.9 | 5299.84 | 479.61 |
7 | 80 | 86 | 80 – 142.8 | 86 – 99.9 | 3943.84 | 193.21 |
8 | 90 | 76 | 90 – 142.8 | 76 – 99.9 | 2787.84 | 571.21 |
9 | 100 | 108 | 100 – 142.8 | 108 – 99.9 | 1831.84 | 65.61 |
10 | 110 | 74 | 110 – 142.8 | 74 – 99.9 | 1075.84 | 670.81 |
11 | 120 | 85 | 120 – 142.8 | 85 – 99.9 | 519.84 | 222.01 |
12 | 130 | 108 | 130 – 142.8 | 108 – 99.9 | 163.84 | 65.61 |
13 | 140 | 110 | 140 – 142.8 | 110 – 99.9 | 7.84 | 102.01 |
14 | 150 | 88 | 150 – 142.8 | 88 – 99.9 | 51.84 | 141.61 |
15 | 160 | 90 | 160 – 142.8 | 90 – 99.9 | 295.84 | 98.01 |
16 | 170 | 80 | 170 – 142.8 | 80 – 99.9 | 739.84 | 396.01 |
17 | 180 | 118 | 180 – 142.8 | 118 – 99.9 | 1383.84 | 327.61 |
18 | 20 | 150 | 20 – 142.8 | 150 – 99.9 | 15079.84 | 2510.01 |
19 | 30 | 89 | 30 – 142.8 | 89 – 99.9 | 12723.84 | 118.81 |
20 | 40 | 90 | 40 – 142.8 | 90 – 99.9 | 10567.84 | 98.01 |
21 | 50 | 75 | 50 – 142.8 | 75 – 99.9 | 8611.84 | 620.01 |
22 | 60 | 78 | 60 – 142.8 | 78 – 99.9 | 6855.84 | 479.61 |
23 | 70 | 108 | 70 – 142.8 | 108 – 99.9 | 5299.84 | 65.61 |
24 | 80 | 145 | 80 – 142.8 | 145 – 99.9 | 3943.84 | 2034.01 |
25 | 90 | 198 | 90 – 142.8 | 198 – 99.9 | 2787.84 | 9623.61 |
26 | 100 | 149 | 100 – 142.8 | 149 – 99.9 | 1831.84 | 2410.81 |
Mean | 142.8 | 99.9 | Â | Â | 139643.84 | 22901.06 |
 | x̄ | ȳ |  |  |  |  |
Step 1. Determine the value of r using the correlation coefficient formula:
r = Σ(x – x̄)(y – ȳ) / √[Σ(x – x̄)² Σ(y – ȳ)²]
Where, r is the correlation coefficient; x̄ is the mean of all x scores; and ȳ is the mean of all y scores.

By substituting the values in the correlation coefficient formula, we get:

r = +0.57

Step 2. Determine the constant b:
b = r √[Σ(y – ȳ)² / Σ(x – x̄)²]
Where, b is the coefficient value of the slope B; x̄ is the mean of all x scores; and ȳ is the mean of all y scores.

By substituting the data in the formula, we get:

b = 0.235

Step 3. Determine the constant a:
a = ȳ – bx̄
Where, ȳ and x̄ refer to the sample means for all y and x scores, respectively, and b is the coefficient value of the slope B as determined in the preceding step (Step 2).
By substituting the values in the formula, we get:
a = 99.9 – (0.235 × 142.8)
a = 66.2

Step 4. Determine the Least Square Regression Equation:
Y´ = a + bX
Where, Y´ represents the predicted value; X represents the known value; and b and a represent numbers calculated from the original correlation analysis.

Suppose we want to predict the mean arterial blood pressure of a particular subject from our cohort whose weight is 129 kg:
Y´ = a + bX
Y´ = 66.2 + 0.235 (129)
Y´ = 96.5 mmHg
Nevertheless, it is also necessary to determine how much of this mean arterial blood pressure (Y´) is explainable by his body weight (x), according to this model.

Step 5. Compute the value of R². The value of R² (coefficient of determination) is calculated using this computation formula:
R² = r²
Where, r is the Pearson’s correlation coefficient.
R² = (0.57)²
R² = 0.32, or 32%

Step 6. Inference
This value of R² means that, of all possible reasons for the patient to present a mean arterial blood pressure of 96.5 mmHg (Y´), 32% can be explained by his 129-kg body weight (x), according to the linear regression model.
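As a cross-check, the worked figures can be reproduced in Python from the summary values in the solution table (SSx, SSy, the two means) together with the reported r; small differences from the printed figures arise from rounding r to two decimals:

```python
import math

# Summary values from the solution table and the reported r
r = 0.57
ss_x, ss_y = 139643.84, 22901.06   # sums of squared deviations of x and y
x_bar, y_bar = 142.8, 99.9         # sample means

b = r * math.sqrt(ss_y / ss_x)     # Step 2: slope
a = y_bar - b * x_bar              # Step 3: intercept
y_pred = a + b * 129               # Step 4: prediction for a 129-kg subject
r_squared = r ** 2                 # Step 5: coefficient of determination

# These agree with the printed 0.235, 66.2, 96.5 and 0.32 to within
# the rounding of r
print(round(b, 3), round(a, 1), round(y_pred, 1), round(r_squared, 2))
```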