
## Step by Step Linear Regression in R


Linear regression models the relationship between a dependent/target variable (Y) and one or more independent variables/predictors (X) using a best-fit straight line (the regression line).

The regression line is represented by a linear equation Y = a + bX.

Where,

• a = Y intercept (value of Y when X is equal to 0)
• b = the change in Y for each 1-unit change in X

Simple Linear Regression: one dependent variable and one independent variable.

Regression equation:

Y’ = a + bX

Where,

Y’ = Dependent / Target Variable (Predicted Value of Y)

X = Independent / Predictor Variable

a = Y intercept (value of Y when X is equal to 0)

b = the change in Y for each 1-unit change in X

Linear regression makes several key assumptions:

• Linear relationship.
• Multivariate normality.
• No or little multicollinearity.
• No auto-correlation. 
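
A minimal sketch of fitting this equation in R with `lm()`; the data below are synthetic, made up purely for illustration:

```r
# Minimal sketch: fit Y' = a + bX with lm() on synthetic data
# (the x and y values here are made up purely for illustration)
set.seed(1)
x <- 1:20
y <- 2 + 3 * x + rnorm(20)   # true intercept a = 2, slope b = 3, plus noise

fit <- lm(y ~ x)
coef(fit)   # "(Intercept)" estimates a; "x" estimates b
```

The fitted coefficients should land close to the true intercept and slope used to generate the data.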

Multiple Linear Regression: One dependent variable and more than one independent variable.

Multiple Linear Regression equation:

Y’ = a + b1X1 + b2X2 + b3X3

Where,

Y’ = Dependent / Target Variable (Predicted Value of Y)

X1, X2, X3 = Independent / Predictor Variables

a = Y intercept (value of Y when all X are equal to 0)

b1 = the change in Y for each 1-unit change in X1
b2 = the change in Y for each 1-unit change in X2
b3 = the change in Y for each 1-unit change in X3

Multiple linear regression analysis makes several key assumptions:

• Linear relationship
• Multivariate normality
• No or little multicollinearity
• No auto-correlation
• Homoscedasticity
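
These assumptions can be eyeballed with base-R diagnostics; a sketch, using R's built-in `stackloss` dataset purely as an example:

```r
# Sketch: quick checks of the regression assumptions in base R,
# using the built-in stackloss dataset as an example
fit <- lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc., data = stackloss)

# Linearity / homoscedasticity: residuals vs fitted values should show no pattern
plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normality of residuals: points should hug the Q-Q line
qqnorm(resid(fit)); qqline(resid(fit))

# Multicollinearity: look for highly correlated predictor pairs
cor(stackloss[, c("Air.Flow", "Water.Temp", "Acid.Conc.")])
```

Formal tests (e.g. Durbin-Watson for autocorrelation, variance inflation factors for multicollinearity) live in add-on packages such as `car` and `lmtest`.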

Steps in Building a Simple Linear Regression model:

Ex: Fitting a Simple Linear Regression model to the dataset “faithful” in R.

• Data Collection and understanding the data:
• Fitting Linear Regression Model:
• Predicting the dependent variable based on the independent variable using the regression model:
• i.e. eruption = intercept + (slope × waiting)
• For the independent variable value of waiting = 80, the predicted dependent variable value is 4.17.
• Test of Significance:
• If the p-value is less than 0.05, we can reject the null hypothesis, i.e. there is a significant relationship between the independent and dependent variables.
• As the p-value < 0.05, the null hypothesis is rejected and there exists a significant relationship between the variables.
• Co-efficient of Determination – R Squared value:
• The R-squared value tells us how well the model explains the data, i.e. the percentage of variation in the dependent variable that is explained by the model.
• R-squared lies between 0 and 1.
• If R-squared = 1.00, the model explains all of the variation in the dependent variable; if R-squared = 0.00, it explains none of it.
• In the above case the R-squared value is 0.81, which means 81% of the variance in the dependent variable is explained by the model; the remaining 19%, which is not explained, is the residual or error term.
• Finding the Confidence Interval:
• Confidence interval (95%) for the mean value of the dependent variable for a given value of the independent variable.
• For the above case, when waiting time = 80 mins, the 95% confidence interval is centered on the predicted eruption value of 4.17.
• Finding the Prediction interval:
• Interval estimate of the dependent variable for a given value of the independent variable.
• The lower and upper limits are 3.19 and 5.15 respectively.
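
The steps above can be reproduced with a few lines of R; the approximate values in the comments correspond to the numbers quoted in the walkthrough:

```r
# The steps above on R's built-in "faithful" dataset
data(faithful)
str(faithful)                      # understand the data: eruptions, waiting

fit <- lm(eruptions ~ waiting, data = faithful)   # fit the model
summary(fit)                       # t-tests, p-values, R-squared (~0.81)

new <- data.frame(waiting = 80)
predict(fit, new)                                # point prediction (~4.17)
predict(fit, new, interval = "confidence")       # 95% CI for the mean eruption
predict(fit, new, interval = "prediction")       # 95% PI (~3.19 to 5.15)
```

Note that `interval = "confidence"` brackets the mean response, while `interval = "prediction"` brackets a single new observation, which is why the prediction interval is the wider of the two.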

This is how a Simple Linear Regression is fitted in R.

Steps in Building a Multiple Linear Regression Model:

Ex: Fitting a Multiple Linear Regression model to the dataset “stackloss” in R.

• Data Collection and understanding the data:
• Fitting the Multiple Linear Regression model:
• Predicting the dependent variable based on the independent variables using the regression model:
• i.e. stackloss = intercept + (b1 × Air.Flow) + (b2 × Water.Temp) + (b3 × Acid.Conc.)
• For the given values of the independent variables, the predicted dependent variable value is 24.58.
• Test of Significance:

• If the p-value is less than 0.05, we can reject the null hypothesis, i.e. there is a significant relationship between the independent and dependent variables.
• In the above case, except for Acid Concentration, all the other independent variables have a significant relationship with the dependent variable.
• As the p-value of Acid.Conc. is more than 0.05, the null hypothesis can’t be rejected, so Acid.Conc. has no significant relationship with the dependent variable.
• Co-efficient of Determination – R Squared value:
• The R-squared value tells us how well the model explains the data, i.e. the percentage of variation in the dependent variable that is explained by the model.
• In the above case the R-squared value is 0.91, which means 91% of the variance in the dependent variable is explained by the model; the remaining 9%, which is not explained, is the residual or error term.
• Finding the Confidence Interval:
• Confidence interval (95%) for the mean value of the dependent variable for the given values of the independent variables.
• For the above case, the 95% confidence interval is centered on the predicted stackloss value of 24.58.

• Finding the Prediction interval:
• Interval estimate of the dependent variable for the given values of the independent variables.
• The lower and upper limits are 16.46 and 32.69 respectively.
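
A sketch of the same workflow in R; note that the predictor values passed to `predict()` below are assumptions, since the article does not state which values produce the quoted prediction:

```r
# The steps above on R's built-in "stackloss" dataset
data(stackloss)
str(stackloss)                     # Air.Flow, Water.Temp, Acid.Conc., stack.loss

fit <- lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc., data = stackloss)
summary(fit)                       # Acid.Conc.'s p-value exceeds 0.05; R-squared ~0.91

# Predictor values below are illustrative assumptions (not given in the article)
new <- data.frame(Air.Flow = 72, Water.Temp = 20, Acid.Conc. = 85)
predict(fit, new)                               # point prediction
predict(fit, new, interval = "confidence")      # 95% CI for the mean response
predict(fit, new, interval = "prediction")      # 95% prediction interval
```

Since Acid.Conc. is not significant, a natural next step would be to refit the model without it and compare the two fits.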

This is how a Multiple Linear Regression is fitted in R.

Contributed by: Prasand Kumar
