
Step by Step Linear Regression in R

Linear Regression in R

Linear regression models the relationship between a dependent/target variable (Y) and one or more independent variables/predictors (X) using a best-fit straight line (the regression line).

The regression line is represented by a linear equation Y = a + bX.

Where,

  • a = Y intercept (value of Y when X is equal to 0)
  • b = the change in Y for each 1-unit change in X

Simple Linear Regression: One Dependent Variable and one independent variable.

Regression equation:

Y’ = a + bX

Where,

Y’ = Dependent / Target Variable (predicted value of Y)

X = Independent / Predictor Variable

a = Y intercept (value of Y when X is equal to 0)

b = the change in Y for each 1-unit change in X

Linear regression makes several key assumptions:

  • Linear relationship.
  • Multivariate normality.
  • No or little multicollinearity.
  • No auto-correlation. 

Multiple Linear Regression: One dependent variable and more than one independent variable.

Multiple Linear Regression equation:

Y’ = a + b1X1 + b2X2 + b3X3

Where,

Y’ = Dependent / Target Variable (predicted value of Y)

X1, X2, X3 = Independent / Predictor Variables

a = Y intercept (value of Y when X is equal to 0)

b1 = the change in Y for each 1-unit change in X1

b2 = the change in Y for each 1-unit change in X2

b3 = the change in Y for each 1-unit change in X3

Multiple linear regression analysis makes several key assumptions:

  • Linear relationship
  • Multivariate normality
  • No or little multicollinearity
  • No auto-correlation
  • Homoscedasticity
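
In practice these assumptions are usually checked with residual diagnostics after a model is fitted. Below is a minimal base-R sketch; the mtcars data and the mpg ~ wt + hp model are only illustrative stand-ins, not part of the examples that follow.

```r
# Illustrative assumption checks (mtcars is a stand-in dataset;
# the same calls work for any lm() fit).
fit <- lm(mpg ~ wt + hp, data = mtcars)

par(mfrow = c(2, 2))
plot(fit)   # residuals vs fitted (linearity, homoscedasticity),
            # normal Q-Q (normality), scale-location, residuals vs leverage

cor(mtcars[, c("wt", "hp")])   # strong pairwise correlation between
                               # predictors hints at multicollinearity
```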

Steps in Building a Simple Linear Regression model:

 

Ex: Fitting the Simple Linear Regression model for the “faithful” dataset in R.

  • Data Collection and understanding the data:
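
A minimal sketch of this step in base R; the faithful dataset ships with R, so nothing extra needs to be installed.

```r
# Old Faithful geyser data: 272 rows with eruptions (eruption duration, mins)
# and waiting (waiting time to the next eruption, mins).
data(faithful)
head(faithful)       # first few observations
str(faithful)        # variable names and types
summary(faithful)    # basic summary statistics

# Scatter plot to eyeball the linear relationship before fitting.
plot(faithful$waiting, faithful$eruptions,
     xlab = "Waiting time (mins)", ylab = "Eruption duration (mins)")
```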
  • Fitting Linear Regression Model:
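
The fit itself could look like the sketch below; the object name eruption.lm is just a label chosen here.

```r
# Regress eruption duration (dependent) on waiting time (independent).
eruption.lm <- lm(eruptions ~ waiting, data = faithful)

# a (intercept) and b (slope) of the fitted regression line.
coef(eruption.lm)
```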
  • Predicting the dependent variable based on the independent variable using the regression model:
  • i.e. eruption = intercept + (slope * waiting)
  • For the independent variable value of waiting = 80, the predicted dependent variable value is 4.17.
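
A sketch of the prediction step, continuing with the eruption.lm fit from above.

```r
# Predict eruption duration for a waiting time of 80 minutes,
# i.e. eruption = intercept + slope * 80.
predict(eruption.lm, data.frame(waiting = 80))   # about 4.17, as quoted above
```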
  • Test of Significance:
  • If the P-value is less than 0.05, we can reject the null hypothesis, i.e. there is a significant relationship between the independent and dependent variables.
  • As the P-value < 0.05, the null hypothesis is rejected and there exists a significant relationship between the variables.
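
The P-values are read from the coefficient table of summary(); a sketch:

```r
# The coefficient table reports, for each term, the estimate, standard error,
# t-statistic and P-value for the null hypothesis that the coefficient is 0.
summary(eruption.lm)
summary(eruption.lm)$coefficients   # P-value of "waiting" is far below 0.05
```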
  • Co-efficient of Determination – R Squared value:
  • The R-squared value tells us how well the model explains the data, i.e. the percentage of variation in the dependent variable that is explained by the model.
  • The R-squared value lies between 0 and 1.
  • If R-squared = 1.00, the model explains the relationship between the dependent and independent variables perfectly; if R-squared = 0.00, no linear relationship is explained.
  • In the above case the R-squared value is 0.81, which means 81% of the variance in the dependent variable is explained by the model, and the remaining 19% that is not explained is the residual or error term.
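
The R-squared value can be read off the same model summary; for example:

```r
# Coefficient of determination for the fitted model.
summary(eruption.lm)$r.squared   # roughly 0.81 for this model
```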
  • Finding the Confidence Interval:
  • Confidence interval (95%) for the mean value of the dependent variable for a given value of independent variable.
  • For the above case, when waiting time = 80 mins, the 95% confidence interval for the mean eruption duration is centred on the fitted value of 4.17.
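
A sketch of the confidence-interval step, reusing the same fit:

```r
# 95% confidence interval for the MEAN eruption duration when waiting = 80.
predict(eruption.lm, data.frame(waiting = 80),
        interval = "confidence", level = 0.95)
# Returns the fitted value (about 4.17) together with its lower and upper
# confidence limits.
```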
  • Finding the Prediction interval:
  • Interval estimate of the dependent variable for a given value of the independent variable.
  • The lower and upper limits are 3.19 and 5.15 respectively.
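
And the prediction-interval step:

```r
# 95% prediction interval for a SINGLE new eruption when waiting = 80;
# it is wider than the confidence interval because it also covers the
# scatter of individual observations around the mean.
predict(eruption.lm, data.frame(waiting = 80),
        interval = "prediction", level = 0.95)
# Lower and upper limits come out near 3.19 and 5.15, as quoted above.
```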

This is how a Simple Linear Regression is fitted in R.

Steps in Building a Multiple Linear Regression Model:

Ex: Fitting the Multiple Linear Regression model for the “stackloss” dataset in R.

  • Data Collection and understanding the data: 
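
A minimal sketch of this step; stackloss also ships with base R.

```r
# Operation of a plant oxidising ammonia to nitric acid: 21 runs with the
# predictors Air.Flow, Water.Temp, Acid.Conc. and the response stack.loss.
data(stackloss)
head(stackloss)
str(stackloss)
summary(stackloss)
```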
  • Predicting the dependent variable based on the independent variable using the regression model:
  • i.e. stack.loss = intercept + (b1 * Air.Flow) + (b2 * Water.Temp) + (b3 * Acid.Conc.)
  • For the independent variable values, the predicted dependent variable value is 24.58.
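
The sketch below first fits the model and then makes the prediction. The predictor values 72, 20 and 85 are an assumed example chosen because they reproduce the 24.58 quoted above; the text itself does not list them.

```r
# Fit stack.loss on all three predictors.
stackloss.lm <- lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc.,
                   data = stackloss)
coef(stackloss.lm)   # intercept plus one slope per predictor

# Predict stack loss for one assumed combination of predictor values.
predict(stackloss.lm,
        data.frame(Air.Flow = 72, Water.Temp = 20, Acid.Conc. = 85))
# about 24.58
```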
  • Test of Significance:
  • If the P-value is less than 0.05, we can reject the null hypothesis, i.e. there is a significant relationship between the independent and dependent variables.
  • In the above case, except for Acid Concentration, all the other independent variables have a significant relationship with the dependent variable.
  • As the P-value of Acid.Conc. is more than 0.05, the null hypothesis can't be rejected, so Acid.Conc. has no significant relationship with the dependent variable.
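
As before, the P-values are read from the coefficient table:

```r
# Estimate, standard error, t-statistic and P-value for each predictor.
summary(stackloss.lm)$coefficients
# Air.Flow and Water.Temp have P-values well below 0.05;
# Acid.Conc. does not, so its coefficient is not statistically significant.
```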
  • Co-efficient of Determination – R Squared value:
  • The R-squared value tells us how well the model explains the data, i.e. the percentage of variation in the dependent variable that is explained by the model.
  • In the above case the R-squared value is 0.91, which means 91% of the variance in the dependent variable is explained by the model, and the remaining 9% that is not explained is the residual or error term.
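
A sketch for reading off the R-squared values; in multiple regression the adjusted R-squared is also worth reporting, since it penalises extra predictors.

```r
summary(stackloss.lm)$r.squared       # multiple R-squared, about 0.91
summary(stackloss.lm)$adj.r.squared   # adjusted for the number of predictors
```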
  • Finding the Confidence Interval:
  • Confidence interval (95%) for the mean value of the dependent variable for the given values of independent variables.
  • For the above case, the 95% confidence interval for the mean stack loss is centred on the fitted value of 24.58.
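
A sketch of the confidence-interval step, using the same assumed predictor values as above:

```r
# 95% confidence interval for the MEAN stack loss at the assumed values.
predict(stackloss.lm,
        data.frame(Air.Flow = 72, Water.Temp = 20, Acid.Conc. = 85),
        interval = "confidence", level = 0.95)
# Returns the fit (about 24.58) with its lower and upper confidence limits.
```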

  • Finding the Prediction interval:
  • Interval estimate of the dependent variable for the given values of the independent variables.
  • The lower and upper limits are 16.46 and 32.69 respectively.
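
And the prediction-interval step:

```r
# 95% prediction interval for a SINGLE new run at the same assumed values.
predict(stackloss.lm,
        data.frame(Air.Flow = 72, Water.Temp = 20, Acid.Conc. = 85),
        interval = "prediction", level = 0.95)
# Lower and upper limits come out near 16.46 and 32.69, as quoted above.
```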

This is how a Multiple Linear Regression is fitted in R.

Contributed by: Prasand Kumar
