Research Methodology Chapter 12.2


Regression

In the experimental sciences, after the correlation between two variables has been understood, there are situations in which it is necessary to estimate or predict the value of one variable (say Y) from the known value of the other (say X), for example to estimate weight when height is known. This is possible when the two variables are linearly correlated. The variable being estimated (Y, i.e. weight) is called the dependent variable, and the known variable (X, i.e. height) is called the independent variable. The estimation is done by finding a constant called the regression coefficient (b).

Regression means a change in the measurement of a variable character, on the positive or negative side, beyond the mean. The regression coefficient is a measure of the change in the dependent character (Y) for one unit of change in the independent character (X).

 

It is denoted by the letter ‘b’ and indicates the relative change (yc) in one variable (Y) from its mean (Ȳ) for one unit of deviation or change (x) in the other variable (X) from its mean (X̄), when the two are correlated. This makes it possible to calculate or predict the expected value of Y, i.e. Yc, corresponding to any X. When the corresponding values Yc1, Yc2, …, Ycn are plotted on a graph, a straight line called the regression line or the mean correlation line (Y on X) is obtained. The same line was referred to as an imaginary line while explaining the various types of correlation.

 

The regression line may be of Y on X, if the corresponding values of Y (i.e. Yc) are calculated for given X values using the regression coefficient byx, which is the value of yc, i.e. Yc – Ȳ, for one unit of x beyond X̄; or vice versa for X on Y.

yc = Yc – Ȳ = byx (X – X̄ ) 

and 

xc = Xc –X̄ = bxy (Y –Ȳ)
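Using made-up numbers for the means and regression coefficients (none of these values come from the text; they are assumptions for the sketch), the two prediction equations can be expressed as:

```python
# Predicting Y from X, and X from Y, with regression coefficients.
# All numeric values here are hypothetical, for illustration only.

def predict_y(x, x_mean, y_mean, b_yx):
    """Yc = Ybar + byx * (X - Xbar)."""
    return y_mean + b_yx * (x - x_mean)

def predict_x(y, x_mean, y_mean, b_xy):
    """Xc = Xbar + bxy * (Y - Ybar)."""
    return x_mean + b_xy * (y - y_mean)

# Suppose mean height Xbar = 160 cm, mean weight Ybar = 55 kg,
# and byx = 0.5 kg per cm (assumed values).
print(predict_y(161, 160, 55, 0.5))  # → 55.5
```

Plotting `predict_y` for a run of X values traces out the straight regression line of Y on X described above.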

 

Regression coefficients of either of the two variables, X and Y, i.e. byx or bxy for one unit change of the other, can be found by the appropriate formulae.


 

If height changes by 1 cm from the mean height (x = X – X̄, e.g. 161 – 160 = 1 cm) on the baseline, the corresponding increment in weight from the mean weight (Ȳ) on the vertical axis is obtained from the regression coefficient byx. It is the increase in weight in kg (Yc – Ȳ = b) corresponding to an increase in height (x) of 1 cm from the mean (X̄).

 

Correlation gives the degree and direction of the relationship between two variables, whereas regression analysis enables us to predict the values of one variable on the basis of the other. The dependence of one variable on the other is thereby understood much more precisely.

 

In Figures 1C and 1D, the two regression lines, X on Y and Y on X, are shown, indicating conditions of moderately positive and moderately negative correlation, respectively. The two regression lines intersect at the point where perpendiculars drawn from the means of the X and Y variables meet.

 

When there is perfect correlation (r = + 1 or –1), the two regression lines will coincide or become one straight line (Figs 1 A and B). On the other hand, when correlation is partial, the lines will be separate and diverge forming an acute angle at the meeting point of perpendiculars drawn from the means of two variables.

 

Figures 1A to E: Diagrams drawn from hypothetical numbers to show different types of correlation and regression lines

 

The lesser the correlation, the greater the divergence of the angle. When the correlation becomes nil (r = 0), i.e. the variables are independent, the two lines intersect at a right angle (Fig. 1E).

 

The steepness of the lines indicates the extent of correlation: the closer the correlation, the greater the steepness of the regression lines of X on Y and Y on X.

 

Mathematical Expression:

The above-mentioned formulae could also be written as

Y´ = bX + a

Where,  Y´ represents the predicted value

X represents the known value

b and a represent numbers calculated from the original correlation analysis

 

This is referred to as the Least Square Regression Equation.

 

To avoid the arithmetic standoff of zero always produced by adding positive and negative predictive errors (associated with errors above and below the regression line, respectively), the placement of the regression line minimizes not the total predictive error but the total squared predictive error, that is, the total for all squared predictive errors. When located in this fashion, the regression line is often referred to as the least squares regression line. Although more difficult to visualize, this approach is consistent with the original aim—to minimize the total predictive error or some version of the total predictive error, thereby providing a more favourable prognosis for our predictions.
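The least squares property can be checked numerically: shifting the fitted line's intercept or slope away from the least squares values can only increase the total squared predictive error. The small dataset below is hypothetical:

```python
# Numerical check of the least squares property on made-up data.
xs = [1, 2, 3, 4]
ys = [3, 5, 4, 6]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Slope and intercept of the least squares regression line.
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
    sum((x - x_bar) ** 2 for x in xs)
a = y_bar - b * x_bar

def sse(a_, b_):
    """Total squared predictive error for the line Y' = a_ + b_ * X."""
    return sum((y - (a_ + b_ * x)) ** 2 for x, y in zip(xs, ys))

# Perturbed lines always do at least as badly.
assert sse(a, b) <= sse(a + 0.5, b)
assert sse(a, b) <= sse(a, b - 0.2)
```

Because squared error is a smooth bowl-shaped function of a and b, the least squares line sits at its unique minimum, which is why the perturbation checks above always pass.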

 

Finding the values of ‘a’ and ‘b’:

To obtain a working regression equation, solve each of the following expressions, first for b and then for a, using data from the original correlation analysis. The expression for b reads:

b = r √(SSy / SSx)

Where, r is the calculated Pearson’s correlation coefficient value; SSy represents the sum of squares for all Y scores; and SSx represents the sum of squares for all X scores.

 

The expression for a reads:

a = Ȳ – bX̄ 

 

Where, Ȳ and X̄ refer to the sample means for all Y and X scores, respectively, and b is defined by the preceding expression.

The values of all terms in the expressions for b and a can be obtained from the original correlation analysis either directly, as with the value of r, or indirectly, as with the values of the remaining terms: SSy, SSx, Ȳ, and X̄.
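A minimal sketch of recovering b and a, assuming only the summary statistics of the correlation analysis (r, SSx, SSy, and the two means) are available; the numeric values below are hypothetical:

```python
import math

def slope(r, ss_y, ss_x):
    """b = r * sqrt(SSy / SSx)."""
    return r * math.sqrt(ss_y / ss_x)

def intercept(y_bar, b, x_bar):
    """a = Ybar - b * Xbar."""
    return y_bar - b * x_bar

# Hypothetical summary statistics from a correlation analysis.
b = slope(r=0.8, ss_y=100.0, ss_x=400.0)    # → 0.4
a = intercept(y_bar=50.0, b=b, x_bar=20.0)  # → 42.0
print(a, b)
```

The resulting working equation for these assumed numbers would be Y´ = 42.0 + 0.4X.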

 

Assumptions:

Use of the regression equation requires that the underlying relationship be linear. You need to worry about violating this assumption only when the scatterplot for the original correlation analysis reveals an obviously bent or curvilinear dot cluster; in the unlikely event that a dot cluster describes a pronounced curvilinear trend, consult more advanced statistical tools. Use of the standard error of estimate (represented as sy|x) assumes that, except for chance, the dots in the original scatterplot will be dispersed equally about all segments of the regression line. You need to worry about violating this assumption of homoscedasticity only when the scatterplot reveals a dramatically different type of dot cluster.
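As a rough, informal illustration of checking the equal-spread assumption, one can fit the line and compare the residual spread over the lower and upper halves of X. The dataset, helper names, and the "ratio near 1" reading below are all assumptions for the sketch, not a formal test:

```python
# Informal homoscedasticity check on a made-up dataset.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9, 8.2, 8.8]

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

# Least squares slope and intercept.
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
    sum((x - x_bar) ** 2 for x in xs)
a = y_bar - b * x_bar

# Residuals (vertical distances from the regression line).
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
lower = residuals[: n // 2]   # residuals for the smaller X values
upper = residuals[n // 2:]    # residuals for the larger X values

def spread(rs):
    """Mean squared residual, a crude measure of dispersion."""
    return sum(r * r for r in rs) / len(rs)

ratio = spread(lower) / spread(upper)
print(f"slope={b:.3f}, spread ratio={ratio:.2f}")  # ratio near 1 suggests equal spread
```

A ratio far from 1 (say, several-fold) would be the kind of "dramatically different dot cluster" the paragraph above warns about; a ratio near 1 is reassuring but not conclusive.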

 

Calculation method:

Using sample data, the Least Square Regression Equation is determined through the following steps.

 

Step 1. Determine the value of r using the correlation coefficient formula:

r = Σ(x – x̄)(y – ȳ) / √[Σ(x – x̄)² × Σ(y – ȳ)²]

Where, r is the correlation coefficient, x̄ is the mean of all x scores; and ȳ is the mean of all y scores.

 

Step 2. Determine the constant b:

b = r √[Σ(y – ȳ)² / Σ(x – x̄)²]

Where, b is the slope coefficient, x̄ is the mean of all x scores; and ȳ is the mean of all y scores.

 

Step 3. Determine the constant a:

a = ȳ – bx̄ 

Where, ȳ and x̄ refer to the sample means for all Y and X scores, respectively, and b is the slope coefficient determined in the preceding step (Step 2).

 

Step 4. Determine the Least Square Regression Equation: 

Y´ = a + bX

Where, Y´ represents the predicted value; X represents the known value; and b and a represent numbers calculated from the original correlation analysis.

 

Step 5. Compute the value of R². The value of R² (coefficient of determination) is calculated using this computation formula:

R² = r²

Where, r is the Pearson’s correlation coefficient.

 

Step 6. Inference
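The steps above can be sketched end-to-end, here in Python on a small made-up dataset (the numbers are illustrative only, not taken from the worked example below):

```python
import math

# Steps 1-5 of the least squares procedure on hypothetical data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

ss_x = sum((x - x_bar) ** 2 for x in xs)                       # Σ(x - x̄)²
ss_y = sum((y - y_bar) ** 2 for y in ys)                       # Σ(y - ȳ)²
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))  # Σ(x - x̄)(y - ȳ)

r = s_xy / math.sqrt(ss_x * ss_y)   # Step 1: Pearson's r
b = r * math.sqrt(ss_y / ss_x)      # Step 2: slope
a = y_bar - b * x_bar               # Step 3: intercept
r_squared = r ** 2                  # Step 5: coefficient of determination

def predict(x):
    """Step 4: the least squares regression equation Y' = a + bX."""
    return a + b * x

print(f"r={r:.3f}, b={b:.3f}, a={a:.3f}, R^2={r_squared:.3f}")
```

The inference step (Step 6) then reads R² as the proportion of variation in Y accounted for by X under this linear model.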

 

 

Example:

A research student recorded the body weight and the mean arterial blood pressure of 26 subjects. Determine the Pearson’s correlation coefficient (r) between body weight and mean arterial blood pressure, and the least squares regression equation relating the two.

S.No. | Body Weight (in Kg) (x) | Mean Arterial Blood Pressure (mmHg) (y)
1 | 20 | 80
2 | 30 | 78
3 | 40 | 90
4 | 50 | 92
5 | 60 | 76
6 | 70 | 78
7 | 80 | 86
8 | 90 | 76
9 | 100 | 108
10 | 110 | 74
11 | 120 | 85
12 | 130 | 108
13 | 140 | 110
14 | 150 | 88
15 | 160 | 90
16 | 170 | 80
17 | 180 | 118
18 | 20 | 150
19 | 30 | 89
20 | 40 | 90
21 | 50 | 75
22 | 60 | 78
23 | 70 | 108
24 | 80 | 145
25 | 90 | 198
26 | 100 | 149

 

Solution: 

S.No. | x | y | (x – x̄) | (y – ȳ) | (x – x̄)² | (y – ȳ)²
1 | 20 | 80 | 20 – 142.8 | 80 – 99.9 | 15079.84 | 396.01
2 | 30 | 78 | 30 – 142.8 | 78 – 99.9 | 12723.84 | 479.61
3 | 40 | 90 | 40 – 142.8 | 90 – 99.9 | 10567.84 | 98.01
4 | 50 | 92 | 50 – 142.8 | 92 – 99.9 | 8611.84 | 62.41
5 | 60 | 76 | 60 – 142.8 | 76 – 99.9 | 6855.84 | 571.21
6 | 70 | 78 | 70 – 142.8 | 78 – 99.9 | 5299.84 | 479.61
7 | 80 | 86 | 80 – 142.8 | 86 – 99.9 | 3943.84 | 193.21
8 | 90 | 76 | 90 – 142.8 | 76 – 99.9 | 2787.84 | 571.21
9 | 100 | 108 | 100 – 142.8 | 108 – 99.9 | 1831.84 | 65.61
10 | 110 | 74 | 110 – 142.8 | 74 – 99.9 | 1075.84 | 670.81
11 | 120 | 85 | 120 – 142.8 | 85 – 99.9 | 519.84 | 222.01
12 | 130 | 108 | 130 – 142.8 | 108 – 99.9 | 163.84 | 65.61
13 | 140 | 110 | 140 – 142.8 | 110 – 99.9 | 7.84 | 102.01
14 | 150 | 88 | 150 – 142.8 | 88 – 99.9 | 51.84 | 141.61
15 | 160 | 90 | 160 – 142.8 | 90 – 99.9 | 295.84 | 98.01
16 | 170 | 80 | 170 – 142.8 | 80 – 99.9 | 739.84 | 396.01
17 | 180 | 118 | 180 – 142.8 | 118 – 99.9 | 1383.84 | 327.61
18 | 20 | 150 | 20 – 142.8 | 150 – 99.9 | 15079.84 | 2510.01
19 | 30 | 89 | 30 – 142.8 | 89 – 99.9 | 12723.84 | 118.81
20 | 40 | 90 | 40 – 142.8 | 90 – 99.9 | 10567.84 | 98.01
21 | 50 | 75 | 50 – 142.8 | 75 – 99.9 | 8611.84 | 620.01
22 | 60 | 78 | 60 – 142.8 | 78 – 99.9 | 6855.84 | 479.61
23 | 70 | 108 | 70 – 142.8 | 108 – 99.9 | 5299.84 | 65.61
24 | 80 | 145 | 80 – 142.8 | 145 – 99.9 | 3943.84 | 2034.01
25 | 90 | 198 | 90 – 142.8 | 198 – 99.9 | 2787.84 | 9623.61
26 | 100 | 149 | 100 – 142.8 | 149 – 99.9 | 1831.84 | 2410.81
Mean | 142.8 (x̄) | 99.9 (ȳ) | | | Σ(x – x̄)² = 139643.84 | Σ(y – ȳ)² = 22901.06

Step 1. Determine the value of r using the correlation coefficient formula:

r = Σ(x – x̄)(y – ȳ) / √[Σ(x – x̄)² × Σ(y – ȳ)²]

Where, r is the correlation coefficient, x̄ is the mean of all x scores; and ȳ is the mean of all y scores.

By substituting the values in the correlation coefficient formula, we get

r = +0.57

 

Step 2. Determine the constant b:

b = r √[Σ(y – ȳ)² / Σ(x – x̄)²]

Where, b is the slope coefficient, x̄ is the mean of all x scores; and ȳ is the mean of all y scores.

By substituting the data in the formula, we get

b = 0.235

 

Step 3. Determine the constant a:

a = ȳ – bx̄ 

Where, ȳ and x̄ refer to the sample means for all Y and X scores, respectively, and b is the slope coefficient determined in the preceding step (Step 2).

By substituting the values in the formula we get,

a = 99.9 – (0.235 × 142.8)

a = 66.2 

 

Step 4. Determine the Least Square Regression Equation: 

Y´ = a + bX 

Where, Y´ represents the predicted value; X represents the known value; and b and a represent numbers calculated from the original correlation analysis.

 

Suppose we want to predict mean arterial blood pressure for a particular subject from our cohort. His weight is 129 kg:

Y´ = a + bX 

 

Y´ = 66.2 + 0.235(129) 

Y´ = 96.5 mmHg

Nevertheless, it is also necessary to determine how much of this mean arterial blood pressure (Y´) is explainable by his body weight (x), according to this model.

 

Step 5. Compute the value of R². The value of R² (coefficient of determination) is calculated using this computation formula:

R² = r²

Where, r is the Pearson’s correlation coefficient.

R² = (+0.57)²

R² = 0.32 or 32%

 

Step 6. Inference

This figure of R² means that, of all the possible factors behind the patient presenting a mean arterial blood pressure of 96.5 mmHg (Y´), 32% can be explained by his 129-kg body weight (x), according to the linear regression model.
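The final arithmetic of the example can be checked directly with the rounded values reported above (b = 0.235, a = 66.2, r = +0.57):

```python
# Verifying the example's Step 4 and Step 5 arithmetic.
a, b = 66.2, 0.235

y_pred = a + b * 129      # predicted mean arterial BP for a 129-kg subject
r = 0.57
r_squared = r ** 2        # coefficient of determination, R² = r²

print(round(y_pred, 1))   # → 96.5 (mmHg)
print(round(r_squared, 2))  # → 0.32, i.e. 32%
```

Both printed values match the results quoted in the example, confirming the substitution into Y´ = a + bX and R² = r².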
