Research Methodology Chapter 12.2


Regression

In the experimental sciences, after the correlation between two variables has been understood, there are situations in which it is necessary to estimate or predict the value of one variable (say Y) from the known value of the other (say X), for example to estimate weight when height is known. This is possible when the two variables are linearly correlated. The variable being estimated (Y, i.e. weight) is called the dependent variable, and the known variable (X, i.e. height) is called the independent variable. The estimation is done by finding a constant called the regression coefficient (b).

Regression means a change in the measurement of a variable character, on the positive or negative side, beyond the mean. The regression coefficient is a measure of the change in the dependent character (Y) for one unit of change in the independent character (X).

 

It is denoted by the letter ‘b’ and indicates the relative change (yc) in one variable (Y) from its mean (Ȳ) for one unit of deviation or change (x) in the other variable (X) from its mean (X̄), when the two are correlated. This makes it possible to calculate or predict the expected value of Y, i.e. Yc, corresponding to any X. When the corresponding values Yc1, Yc2, …, Ycn are plotted on a graph, a straight line called the regression line or the mean correlation line (Y on X) is obtained. The same line was referred to as an imaginary line while explaining the various types of correlation.

 

The regression line may be of Y on X, if the corresponding values of Y (i.e. Yc) are calculated for given X values using the regression coefficient byx, which is the value of yc, i.e. Yc – Ȳ, for one unit of x beyond X̄; or vice versa for X on Y.

yc = Yc – Ȳ = byx (X – X̄ ) 

and 

xc = Xc –X̄ = bxy (Y –Ȳ)
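Using made-up numbers for the means and regression coefficients (none of these values come from the text; they are assumptions for the sketch), the two prediction equations can be expressed as:

```python
# Predicting Y from X, and X from Y, with regression coefficients.
# All numeric values here are hypothetical, for illustration only.

def predict_y(x, x_mean, y_mean, b_yx):
    """Yc = Ybar + byx * (X - Xbar)."""
    return y_mean + b_yx * (x - x_mean)

def predict_x(y, x_mean, y_mean, b_xy):
    """Xc = Xbar + bxy * (Y - Ybar)."""
    return x_mean + b_xy * (y - y_mean)

# Suppose mean height Xbar = 160 cm, mean weight Ybar = 55 kg,
# and byx = 0.5 kg per cm (assumed values).
print(predict_y(161, 160, 55, 0.5))  # → 55.5
```

Plotting `predict_y` for a run of X values traces out the straight regression line of Y on X described above.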

 

Regression coefficients of either of the two variables, X and Y, i.e. byx or bxy for one unit change of the other, can be found by the appropriate formulae.


 

If height changes by 1 cm from the mean height (x = X – X̄, e.g. 161 – 160 = 1 cm) on the baseline, the corresponding increment in weight from the mean weight (Ȳ) on the vertical axis is obtained from the regression coefficient byx. It is the increase in weight in kg (Yc – Ȳ = b) corresponding to an increase in height (x) of 1 cm from the mean (X̄).

 

Correlation gives the degree and direction of the relationship between two variables, whereas regression analysis enables us to predict the values of one variable on the basis of the other. The dependence of one variable on the other is thereby understood much more precisely.

 

In Figures 1C and 1D, the two regression lines, X on Y and Y on X, are shown, indicating conditions of moderately positive and moderately negative correlation, respectively. The two regression lines intersect at the point where perpendiculars drawn from the means of the X and Y variables meet.

 

When there is perfect correlation (r = + 1 or –1), the two regression lines will coincide or become one straight line (Figs 1 A and B). On the other hand, when correlation is partial, the lines will be separate and diverge forming an acute angle at the meeting point of perpendiculars drawn from the means of two variables.

 

Figures 1A to E: Diagrams drawn from hypothetical numbers to show different types of correlation and regression lines

 

The lesser the correlation, the greater the divergence of the angle. When the correlation becomes nil (r = 0), i.e. the variables are independent, the two lines intersect at a right angle (Fig. 1E).

 

The steepness of the lines indicates the extent of correlation: the closer the correlation, the greater the steepness of the regression lines of X on Y and Y on X.

 

Mathematical Expression:

The above-mentioned formulae could also be written as

Y´ = bX + a

Where,  Y´ represents the predicted value

X represents the known value

b and a represent numbers calculated from the original correlation analysis

 

This is referred to as the Least Square Regression Equation.

 

To avoid the arithmetic standoff of zero always produced by adding positive and negative predictive errors (associated with errors above and below the regression line, respectively), the placement of the regression line minimizes not the total predictive error but the total squared predictive error, that is, the total for all squared predictive errors. When located in this fashion, the regression line is often referred to as the least squares regression line. Although more difficult to visualize, this approach is consistent with the original aim—to minimize the total predictive error or some version of the total predictive error, thereby providing a more favourable prognosis for our predictions.
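The least squares property can be checked numerically: shifting the fitted line's intercept or slope away from the least squares values can only increase the total squared predictive error. The small dataset below is hypothetical:

```python
# Numerical check of the least squares property on made-up data.
xs = [1, 2, 3, 4]
ys = [3, 5, 4, 6]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Slope and intercept of the least squares regression line.
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
    sum((x - x_bar) ** 2 for x in xs)
a = y_bar - b * x_bar

def sse(a_, b_):
    """Total squared predictive error for the line Y' = a_ + b_ * X."""
    return sum((y - (a_ + b_ * x)) ** 2 for x, y in zip(xs, ys))

# Perturbed lines always do at least as badly.
assert sse(a, b) <= sse(a + 0.5, b)
assert sse(a, b) <= sse(a, b - 0.2)
```

Because squared error is a smooth bowl-shaped function of a and b, the least squares line sits at its unique minimum, which is why the perturbation checks above always pass.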

 

Finding the values of ‘a’ and ‘b’:

To obtain a working regression equation, solve each of the following expressions, first for b and then for a, using data from the original correlation analysis. The expression for b reads:

b = r √(SSy / SSx)

Where, r is the calculated Pearson’s correlation coefficient value; SSy represents the sum of squares for all Y scores; and SSx represents the sum of squares for all X scores.

 

The expression for a reads:

a = Ȳ – bX̄ 

 

Where, Ȳ and X̄ refer to the sample means for all Y and X scores, respectively, and b is defined by the preceding expression.

The values of all terms in the expressions for b and a can be obtained from the original correlation analysis either directly, as with the value of r, or indirectly, as with the values of the remaining terms: SSy, SSx, Ȳ, and X̄.
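A minimal sketch of recovering b and a, assuming only the summary statistics of the correlation analysis (r, SSx, SSy, and the two means) are available; the numeric values below are hypothetical:

```python
import math

def slope(r, ss_y, ss_x):
    """b = r * sqrt(SSy / SSx)."""
    return r * math.sqrt(ss_y / ss_x)

def intercept(y_bar, b, x_bar):
    """a = Ybar - b * Xbar."""
    return y_bar - b * x_bar

# Hypothetical summary statistics from a correlation analysis.
b = slope(r=0.8, ss_y=100.0, ss_x=400.0)    # → 0.4
a = intercept(y_bar=50.0, b=b, x_bar=20.0)  # → 42.0
print(a, b)
```

The resulting working equation for these assumed numbers would be Y´ = 42.0 + 0.4X.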

 

Assumptions:

Use of the regression equation requires that the underlying relationship be linear. You need to worry about violating this assumption only when the scatterplot for the original correlation analysis reveals an obviously bent or curvilinear dot cluster; in the unlikely event that a dot cluster describes a pronounced curvilinear trend, consult more advanced statistical tools. Use of the standard error of estimate (represented as sy|x) assumes that, except for chance, the dots in the original scatterplot will be dispersed equally about all segments of the regression line. You need to worry about violating this assumption of homoscedasticity only when the scatterplot reveals a dramatically different type of dot cluster.
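As a rough, informal illustration of checking the equal-spread assumption, one can fit the line and compare the residual spread over the lower and upper halves of X. The dataset, helper names, and the "ratio near 1" reading below are all assumptions for the sketch, not a formal test:

```python
# Informal homoscedasticity check on a made-up dataset.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9, 8.2, 8.8]

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

# Least squares slope and intercept.
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
    sum((x - x_bar) ** 2 for x in xs)
a = y_bar - b * x_bar

# Residuals (vertical distances from the regression line).
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
lower = residuals[: n // 2]   # residuals for the smaller X values
upper = residuals[n // 2:]    # residuals for the larger X values

def spread(rs):
    """Mean squared residual, a crude measure of dispersion."""
    return sum(r * r for r in rs) / len(rs)

ratio = spread(lower) / spread(upper)
print(f"slope={b:.3f}, spread ratio={ratio:.2f}")  # ratio near 1 suggests equal spread
```

A ratio far from 1 (say, several-fold) would be the kind of "dramatically different dot cluster" the paragraph above warns about; a ratio near 1 is reassuring but not conclusive.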

 

Calculation method:

Using sample data, the Least Square Regression Equation is determined through the following steps.

 

Step 1. Determine the value of r using the correlation coefficient formula:

r = Σ(x – x̄)(y – ȳ) / √[Σ(x – x̄)² × Σ(y – ȳ)²]

Where, r is the correlation coefficient, x̄ is the mean of all x scores; and ȳ is the mean of all y scores.

 

Step 2. Determine the constant b:

b = r √[Σ(y – ȳ)² / Σ(x – x̄)²]

Where, b is the slope coefficient, x̄ is the mean of all x scores; and ȳ is the mean of all y scores.

 

Step 3. Determine the constant a:

a = ȳ – bx̄ 

Where, ȳ and x̄ refer to the sample means for all Y and X scores, respectively, and b is the slope coefficient determined in the preceding step (Step 2).

 

Step 4. Determine the Least Square Regression Equation: 

Y´ = a + bX

Where, Y´ represents the predicted value; X represents the known value; and b and a represent numbers calculated from the original correlation analysis.

 

Step 5. Compute the value of R². The value of R² (coefficient of determination) is calculated using this computation formula:

R² = r²

Where, r is the Pearson’s correlation coefficient.

 

Step 6. Inference
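The steps above can be sketched end-to-end, here in Python on a small made-up dataset (the numbers are illustrative only, not taken from the worked example below):

```python
import math

# Steps 1-5 of the least squares procedure on hypothetical data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

ss_x = sum((x - x_bar) ** 2 for x in xs)                       # Σ(x - x̄)²
ss_y = sum((y - y_bar) ** 2 for y in ys)                       # Σ(y - ȳ)²
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))  # Σ(x - x̄)(y - ȳ)

r = s_xy / math.sqrt(ss_x * ss_y)   # Step 1: Pearson's r
b = r * math.sqrt(ss_y / ss_x)      # Step 2: slope
a = y_bar - b * x_bar               # Step 3: intercept
r_squared = r ** 2                  # Step 5: coefficient of determination

def predict(x):
    """Step 4: the least squares regression equation Y' = a + bX."""
    return a + b * x

print(f"r={r:.3f}, b={b:.3f}, a={a:.3f}, R^2={r_squared:.3f}")
```

The inference step (Step 6) then reads R² as the proportion of variation in Y accounted for by X under this linear model.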

 

 

Example:

A research student recorded the body weight and the mean arterial blood pressure of 26 subjects. Determine the Pearson’s correlation coefficient (r) between body weight and mean arterial blood pressure, and the least squares regression equation relating the two.

S.No. | Body Weight (in Kg) (x) | Mean Arterial Blood Pressure (mmHg) (y)
1 | 20 | 80
2 | 30 | 78
3 | 40 | 90
4 | 50 | 92
5 | 60 | 76
6 | 70 | 78
7 | 80 | 86
8 | 90 | 76
9 | 100 | 108
10 | 110 | 74
11 | 120 | 85
12 | 130 | 108
13 | 140 | 110
14 | 150 | 88
15 | 160 | 90
16 | 170 | 80
17 | 180 | 118
18 | 20 | 150
19 | 30 | 89
20 | 40 | 90
21 | 50 | 75
22 | 60 | 78
23 | 70 | 108
24 | 80 | 145
25 | 90 | 198
26 | 100 | 149

 

Solution: 

S.No. | x | y | (x – x̄) | (y – ȳ) | (x – x̄)² | (y – ȳ)²
1 | 20 | 80 | 20 – 142.8 | 80 – 99.9 | 15079.84 | 396.01
2 | 30 | 78 | 30 – 142.8 | 78 – 99.9 | 12723.84 | 479.61
3 | 40 | 90 | 40 – 142.8 | 90 – 99.9 | 10567.84 | 98.01
4 | 50 | 92 | 50 – 142.8 | 92 – 99.9 | 8611.84 | 62.41
5 | 60 | 76 | 60 – 142.8 | 76 – 99.9 | 6855.84 | 571.21
6 | 70 | 78 | 70 – 142.8 | 78 – 99.9 | 5299.84 | 479.61
7 | 80 | 86 | 80 – 142.8 | 86 – 99.9 | 3943.84 | 193.21
8 | 90 | 76 | 90 – 142.8 | 76 – 99.9 | 2787.84 | 571.21
9 | 100 | 108 | 100 – 142.8 | 108 – 99.9 | 1831.84 | 65.61
10 | 110 | 74 | 110 – 142.8 | 74 – 99.9 | 1075.84 | 670.81
11 | 120 | 85 | 120 – 142.8 | 85 – 99.9 | 519.84 | 222.01
12 | 130 | 108 | 130 – 142.8 | 108 – 99.9 | 163.84 | 65.61
13 | 140 | 110 | 140 – 142.8 | 110 – 99.9 | 7.84 | 102.01
14 | 150 | 88 | 150 – 142.8 | 88 – 99.9 | 51.84 | 141.61
15 | 160 | 90 | 160 – 142.8 | 90 – 99.9 | 295.84 | 98.01
16 | 170 | 80 | 170 – 142.8 | 80 – 99.9 | 739.84 | 396.01
17 | 180 | 118 | 180 – 142.8 | 118 – 99.9 | 1383.84 | 327.61
18 | 20 | 150 | 20 – 142.8 | 150 – 99.9 | 15079.84 | 2510.01
19 | 30 | 89 | 30 – 142.8 | 89 – 99.9 | 12723.84 | 118.81
20 | 40 | 90 | 40 – 142.8 | 90 – 99.9 | 10567.84 | 98.01
21 | 50 | 75 | 50 – 142.8 | 75 – 99.9 | 8611.84 | 620.01
22 | 60 | 78 | 60 – 142.8 | 78 – 99.9 | 6855.84 | 479.61
23 | 70 | 108 | 70 – 142.8 | 108 – 99.9 | 5299.84 | 65.61
24 | 80 | 145 | 80 – 142.8 | 145 – 99.9 | 3943.84 | 2034.01
25 | 90 | 198 | 90 – 142.8 | 198 – 99.9 | 2787.84 | 9623.61
26 | 100 | 149 | 100 – 142.8 | 149 – 99.9 | 1831.84 | 2410.81
Mean | 142.8 (x̄) | 99.9 (ȳ) | | | Σ(x – x̄)² = 139643.84 | Σ(y – ȳ)² = 22901.06

Step 1. Determine the value of r using the correlation coefficient formula:

r = Σ(x – x̄)(y – ȳ) / √[Σ(x – x̄)² × Σ(y – ȳ)²]

Where, r is the correlation coefficient, x̄ is the mean of all x scores; and ȳ is the mean of all y scores.

By substituting the values in the correlation coefficient formula, we get

r = +0.57

 

Step 2. Determine the constant b:

b = r √[Σ(y – ȳ)² / Σ(x – x̄)²]

Where, b is the slope coefficient, x̄ is the mean of all x scores; and ȳ is the mean of all y scores.

By substituting the data in the formula, we get

b = 0.235

 

Step 3. Determine the constant a:

a = ȳ – bx̄ 

Where, ȳ and x̄ refer to the sample means for all Y and X scores, respectively, and b is the slope coefficient determined in the preceding step (Step 2).

By substituting the values in the formula we get,

a = 99.9 – (0.235 × 142.8)

a = 66.2 

 

Step 4. Determine the Least Square Regression Equation: 

Y´ = a + bX 

Where, Y´ represents the predicted value; X represents the known value; and b and a represent numbers calculated from the original correlation analysis.

 

Suppose we want to predict mean arterial blood pressure for a particular subject from our cohort. His weight is 129 kg:

Y´ = a + bX 

 

Y´ = 66.2 + 0.235(129) 

Y´ = 96.5 mmHg

Nevertheless, it is also necessary to determine how much of this mean arterial blood pressure (Y´) is explainable by his body weight (x), according to this model.

 

Step 5. Compute the value of R². The value of R² (coefficient of determination) is calculated using this computation formula:

R² = r²

Where, r is the Pearson’s correlation coefficient.

R² = (+0.57)²

R² = 0.32 or 32%

 

Step 6. Inference

This figure of R² means that, of all the possible factors behind the patient presenting a mean arterial blood pressure of 96.5 mmHg (Y´), 32% can be explained by his 129-kg body weight (x), according to the linear regression model.
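The final arithmetic of the example can be checked directly with the rounded values reported above (b = 0.235, a = 66.2, r = +0.57):

```python
# Verifying the example's Step 4 and Step 5 arithmetic.
a, b = 66.2, 0.235

y_pred = a + b * 129      # predicted mean arterial BP for a 129-kg subject
r = 0.57
r_squared = r ** 2        # coefficient of determination, R² = r²

print(round(y_pred, 1))   # → 96.5 (mmHg)
print(round(r_squared, 2))  # → 0.32, i.e. 32%
```

Both printed values match the results quoted in the example, confirming the substitution into Y´ = a + bX and R² = r².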
