## Step by Step Linear Regression in R

**Linear Regression in R**

Linear regression models the relationship between a dependent (target) variable Y and one or more independent variables (predictors) X by fitting a best-fit straight line, called the regression line.

The regression line is represented by a linear equation **Y = a + bX**.

Where,

- a = Y intercept (value of Y when X is equal to 0)
- b = slope (the change in Y for each one-unit increase in X)
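
As a rough illustration of what the intercept a and the slope b mean, the sketch below estimates them by ordinary least squares and compares the result with R's `lm()`. The vectors `x` and `y` are made-up values chosen only for this example; they do not come from the article.

```r
# Made-up example data (an assumption, for illustration only)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Least-squares estimates for Y = a + bX
b <- cov(x, y) / var(x)      # slope: change in Y per one-unit change in X
a <- mean(y) - b * mean(x)   # intercept: value of Y when X is 0

# The same estimates from R's built-in linear model function
fit <- lm(y ~ x)
print(c(intercept = a, slope = b))
print(coef(fit))
```

Both printed pairs should agree, since `lm()` solves the same least-squares problem.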

__Simple Linear Regression__: One dependent variable and one independent variable.

Regression equation:

**Y’ = a + bX**

Where,

Y’ = Dependent / Target Variable (Predicted Value of Y)

X = Independent / Predictor Variable

a = Y intercept (value of Y when X is equal to 0)

b = the change in Y for each one-unit increase in X

**Linear regression makes several key assumptions:**

- Linear relationship.
- Multivariate normality.
- No or little multicollinearity.
- No auto-correlation.
- Homoscedasticity.

__Multiple Linear Regression__: One dependent variable and more than one independent variable.

Multiple Linear Regression equation:

**Y’ = a + b_{1}X_{1} + b_{2}X_{2} + b_{3}X_{3}**

Where,

Y’ = Dependent / Target Variable (Predicted Value of Y)

X_{1}, X_{2}, X_{3} = Independent / Predictor Variables

a = Y intercept (value of Y when X_{1}, X_{2} and X_{3} are all equal to 0)

b_{1} = the change in Y for each one-unit increase in X_{1}, with the other predictors held constant

b_{2} = the change in Y for each one-unit increase in X_{2}, with the other predictors held constant

b_{3} = the change in Y for each one-unit increase in X_{3}, with the other predictors held constant

**Multiple linear regression analysis makes several key assumptions:**

- Linear relationship
- Multivariate normality
- No or little multicollinearity
- No auto-correlation
- Homoscedasticity
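
The assumptions listed above can be checked informally once a model has been fitted. The sketch below is one possible set of base-R checks, not a complete diagnostic procedure; using the built-in `stackloss` dataset at this point is an assumption made purely for illustration.

```r
# Built-in dataset, used here only as an illustrative assumption
data(stackloss)
fit <- lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc., data = stackloss)
res <- residuals(fit)

# Multicollinearity: pairwise correlations between the predictors
predictor_cor <- cor(stackloss[, c("Air.Flow", "Water.Temp", "Acid.Conc.")])
print(round(predictor_cor, 2))

# Normality of residuals: Shapiro-Wilk test
# (a large p-value gives no evidence against normality)
normality <- shapiro.test(res)
print(normality$p.value)

# Auto-correlation: lag-1 autocorrelation of the residuals
lag1 <- acf(res, plot = FALSE)$acf[2]
print(lag1)

# Linearity and homoscedasticity are usually judged visually:
# residuals should scatter evenly around zero with no clear pattern
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```

Calling `plot(fit)` on the fitted model produces R's standard set of diagnostic plots, which cover much of the same ground.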

__Steps in Building a Simple Linear Regression model__:

Ex: Fitting the Simple Linear Regression model for the dataset “faithful” in R.

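**Data Collection and understanding the data:**

The built-in `faithful` dataset records, for the Old Faithful geyser, the duration of each eruption (`eruptions`, in minutes) and the waiting time until the next eruption (`waiting`, in minutes).

**Fitting Linear Regression Model:**

The model is fitted with `eruptions` as the dependent variable and `waiting` as the independent variable.

**Predicting the dependent variable based on the independent variable using the regression model:**

- i.e. eruptions = intercept + (slope * waiting)
- For the independent variable value of waiting = 80, the predicted dependent variable value is 4.17.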
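
The three steps above can be sketched in R as follows. This is a minimal version using the base `lm()` and `predict()` functions; the variable names `fit`, `newdata` and `pred` are illustrative choices, not taken from the article.

```r
# Load and inspect the built-in dataset
data(faithful)
str(faithful)       # 272 observations of eruptions and waiting
summary(faithful)

# Fit eruptions as a linear function of waiting
fit <- lm(eruptions ~ waiting, data = faithful)
print(coef(fit))    # intercept and slope

# Predict the eruption duration for a waiting time of 80 minutes
newdata <- data.frame(waiting = 80)
pred <- predict(fit, newdata)
print(pred)
```
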

**Test of Significance:**

- If the p-value is less than the chosen significance level (conventionally 0.05), we can reject the null hypothesis that the slope is zero, i.e. there is a significant relationship between the independent and dependent variables.
- As the p-value here is well below 0.05, the null hypothesis is rejected and there exists a significant relationship between the variables.
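
One way to read the p-value off the fitted model is through the coefficient table returned by `summary()`. In this sketch the threshold `alpha` is an assumed significance level; substitute whichever level you are working with.

```r
fit <- lm(eruptions ~ waiting, data = faithful)

# Coefficient table: estimate, standard error, t value and p-value per term
coefs <- summary(fit)$coefficients
print(coefs)

# p-value for the slope of waiting
p_value <- coefs["waiting", "Pr(>|t|)"]

alpha <- 0.05              # assumed significance level
print(p_value < alpha)     # TRUE means the null hypothesis is rejected
```
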

**Co-efficient of Determination – R Squared value:**

- The R-squared value tells us how well the model explains the data: it is the proportion of the variance in the dependent variable that is explained by the model.
- The R-squared value lies between 0 and 1.
- If R-squared = 1.00, the model explains all of the variance in the dependent variable. If R-squared = 0.00, the model explains none of it.
- In the above case the R-squared value is 0.81, which means 81% of the variance in the dependent variable is explained by the model; the remaining 19% is unexplained (residual) variance.
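
As a small sketch, the R-squared value can be taken from `summary()` and cross-checked against its definition, 1 minus the ratio of the residual sum of squares to the total sum of squares. The names `ss_res` and `ss_tot` are illustrative.

```r
fit <- lm(eruptions ~ waiting, data = faithful)

# R-squared as reported by summary()
r2 <- summary(fit)$r.squared
print(round(r2, 2))

# The same quantity from its definition
ss_res <- sum(residuals(fit)^2)                                   # unexplained variation
ss_tot <- sum((faithful$eruptions - mean(faithful$eruptions))^2)  # total variation
print(1 - ss_res / ss_tot)
```
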

**Finding the Confidence Interval:**

- Confidence interval (95%) for the mean value of the dependent variable for a given value of the independent variable.
- For the above case, when waiting time = 80 mins, the 95% confidence interval for the mean eruption duration is centered on the predicted value of 4.17.
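
A minimal sketch of how this interval might be obtained: `predict()` with `interval = "confidence"` returns the fitted value together with the lower and upper limits for the mean response.

```r
fit <- lm(eruptions ~ waiting, data = faithful)
newdata <- data.frame(waiting = 80)

# 95% confidence interval for the mean eruption duration at waiting = 80
ci <- predict(fit, newdata, interval = "confidence", level = 0.95)
print(ci)   # columns: fit (point estimate), lwr, upr
```
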

**Finding the Prediction interval:**

- Interval estimate of the dependent variable for a given value of the independent variable.
- The lower and upper limits of the 95% prediction interval are 3.19 and 5.15 respectively.
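
The prediction interval can be sketched the same way, switching to `interval = "prediction"`. It is wider than the confidence interval because it covers a single new observation rather than the mean response.

```r
fit <- lm(eruptions ~ waiting, data = faithful)
newdata <- data.frame(waiting = 80)

# 95% prediction interval for one new eruption at waiting = 80
pred_int <- predict(fit, newdata, interval = "prediction", level = 0.95)
print(pred_int)   # columns: fit (point estimate), lwr, upr
```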

This is how a Simple Linear Regression is fitted in R.

__Steps in Building a Multiple Linear Regression Model__:

Ex: Fitting the Multiple Linear Regression model for the dataset “stackloss” in R.

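**Data Collection and understanding the data:**

The built-in `stackloss` dataset holds 21 observations from a plant oxidizing ammonia to nitric acid, with the columns `Air.Flow`, `Water.Temp`, `Acid.Conc.` and `stack.loss`. Here `stack.loss` is the dependent variable and the other three columns are the independent variables.

**Predicting the dependent variable based on the independent variable using the regression model:**

- i.e. stack.loss = intercept + (b_{1} * Air.Flow) + (b_{2} * Water.Temp) + (b_{3} * Acid.Conc.)
- For the chosen values of the independent variables, the predicted dependent variable value is 24.58.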
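
A minimal sketch of the fit and prediction follows. The article does not state which predictor values were used, so the values in `newdata` (Air.Flow = 72, Water.Temp = 20, Acid.Conc. = 85) are an assumption; they were chosen because they give a prediction of about 24.58, in line with the figure quoted above.

```r
# Load and inspect the built-in dataset
data(stackloss)
str(stackloss)

# Fit stack.loss on the three predictors
fit <- lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc., data = stackloss)
print(coef(fit))   # intercept and one slope per predictor

# Assumed predictor values (not given in the article)
newdata <- data.frame(Air.Flow = 72, Water.Temp = 20, Acid.Conc. = 85)
pred <- predict(fit, newdata)
print(pred)
```
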

**Test of Significance:**

- If the p-value of a coefficient is less than the chosen significance level (conventionally 0.05), we can reject the null hypothesis that the coefficient is zero, i.e. there is a significant relationship between that independent variable and the dependent variable.
- In the above case, except for the acid concentration, all the other independent variables have a significant relationship with the dependent variable.
- As the p-value of Acid.Conc. is more than 0.05, the null hypothesis can’t be rejected, so Acid.Conc. has no statistically significant relationship with the dependent variable.

**Co-efficient of Determination – R Squared value:**

- The R-squared value tells us how well the model explains the data: it is the proportion of the variance in the dependent variable that is explained by the model.
- In the above case the R-squared value is 0.91, which means 91% of the variance in the dependent variable is explained by the model; the remaining 9% is unexplained (residual) variance.
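
Both the per-predictor p-values and the R-squared value can be read from `summary()` of the fitted model. This is a sketch; `alpha` is an assumed significance level and can be replaced by whichever level you use.

```r
fit <- lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc., data = stackloss)
model_summary <- summary(fit)

# p-value for each term in the model
p_values <- model_summary$coefficients[, "Pr(>|t|)"]
print(p_values)

alpha <- 0.05               # assumed significance level
print(p_values < alpha)     # TRUE marks a statistically significant term

# Proportion of variance explained by the model
r2 <- model_summary$r.squared
print(round(r2, 2))
```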

**Finding the Confidence Interval:**

- Confidence interval (95%) for the mean value of the dependent variable for the given values of the independent variables.
- For the above case, the 95% confidence interval for the mean stack loss is centered on the predicted value of 24.58.

**Finding the Prediction interval:**

- Interval estimate of the dependent variable for the given values of the independent variables.
- The lower and upper limits of the 95% prediction interval are 16.46 and 32.69 respectively.
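
Both intervals for the multiple regression model can be sketched with `predict()`. As before, the predictor values in `newdata` are an assumption, since the article does not list them; with these values the prediction limits come out close to the figures quoted above.

```r
fit <- lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc., data = stackloss)

# Assumed predictor values (not given in the article)
newdata <- data.frame(Air.Flow = 72, Water.Temp = 20, Acid.Conc. = 85)

# 95% confidence interval for the mean stack loss
ci <- predict(fit, newdata, interval = "confidence", level = 0.95)
print(ci)

# 95% prediction interval for a single new observation
pred_int <- predict(fit, newdata, interval = "prediction", level = 0.95)
print(pred_int)
```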

This is how a Multiple Linear Regression is fitted in R.

Contributed by: Prasand Kumar
