Simple Linear Regression
- SIMPLE LINEAR REGRESSION
Simple linear regression is a statistical method for summarizing and studying the relationship between two continuous (quantitative) variables. One variable, denoted x, is regarded as the predictor, explanatory, or independent variable; the other, denoted y, is regarded as the response, outcome, or dependent variable. Simple linear regression is a linear regression model with a single explanatory variable: it concerns two-dimensional sample points with one independent variable and one dependent variable (conventionally, the x and y coordinates in a Cartesian coordinate system) and finds a linear function (a non-vertical straight line) that predicts the dependent variable as accurately as possible from the independent variable. The adjective simple refers to the fact that the outcome variable is related to a single predictor.
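The note does not say which criterion defines "as accurately as possible"; the standard choice is ordinary least squares, which minimizes the sum of squared vertical distances between the points and the line. A minimal sketch of the closed-form estimates, on made-up sample points:

```python
import numpy as np

# Hypothetical sample points (x_i, y_i); the values are made up for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 3.6, 4.4, 5.1])

# Least-squares estimates of slope b and intercept c:
#   b = sum((x - x_bar) * (y - y_bar)) / sum((x - x_bar)^2)
#   c = y_bar - b * x_bar
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
c = y.mean() - b * x.mean()
print(f"fitted line: y = {c:.3f} + {b:.3f}*x")
```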
Correlation is excellent for showing the association between two variables. Simple linear regression takes correlation's ability to show the strength and direction of an association a step further by allowing the researcher to use the pattern of previously collected data to build a predictive model.
- LINEAR REGRESSION ANALYSIS
Linear regression is the most basic type of regression and a commonly used predictive analysis. The overall idea of regression is to examine two things. First, does a set of predictor variables do a good job of predicting an outcome variable, i.e., does the model account for the variability in the dependent variable? Second, which variables in particular are significant predictors of the dependent variable, and in what way, as indicated by the magnitude and sign of the beta estimates, do they impact it? The regression estimates are used to explain the relationship between one dependent variable and one or more independent variables, and the regression equation shows how the set of predictor variables can be used to predict the outcome. The simplest form of the equation, with one dependent and one independent variable, is y = c + b*x, where y is the estimated dependent score, c the constant (intercept), b the regression coefficient (slope), and x the independent variable.
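As a sketch of how the equation's quantities are estimated in practice, the snippet below fits y = c + b*x with SciPy's `scipy.stats.linregress`; the data are synthetic and the tool choice is an assumption, not something the note prescribes:

```python
import numpy as np
from scipy import stats

# Synthetic data: y depends roughly linearly on x (values are made up)
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 4.0 + 1.5 * x + rng.normal(scale=2.0, size=50)

# Estimate y = c + b*x by ordinary least squares
fit = stats.linregress(x, y)
c, b = fit.intercept, fit.slope
print(f"y = {c:.2f} + {b:.2f}*x   (r = {fit.rvalue:.2f})")

# Use the fitted equation to predict the outcome for a new x
x_new = 7.5
print(f"predicted y at x = {x_new}: {c + b * x_new:.2f}")
```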
However, linear regression analysis consists of more than just fitting a straight line through a cloud of data points. It consists of three stages:
- Analyzing the correlation and directionality of the data,
- Estimating the model, i.e., fitting the line, and
- Evaluating the validity and usefulness of the model.
Three major uses for regression analysis are:
(1) causal analysis.
(2) forecasting an effect.
(3) trend forecasting.
Unlike correlation analysis, which focuses on the strength of the relationship between two or more variables, regression analysis assumes a dependence or causal relationship between one or more independent variables and one dependent variable.
Firstly, the regression might be used to identify the strength of the effect that the independent variables have on the dependent variable. Typical questions are: What is the strength of the relationship between dose and effect? Between sales and marketing spend? Between age and income?
Secondly, it can be used to forecast the effects or impact of changes. That is, regression analysis helps us understand how much the dependent variable changes with a change in one or more independent variables. A typical question is: "How much additional Y do I get for one additional unit of X?" In the simple model y = c + b*x, the answer is the slope b: each additional unit of X adds b units to the predicted Y.
Thirdly, regression analysis predicts trends and future values, and can be used to obtain point estimates. Typical questions are: What will the price of gold be six months from now? What is the total effort for a task X?
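As an illustration of such a point estimate, the sketch below fits a linear trend to a made-up monthly series (not real gold prices) and extrapolates it six periods past the end of the sample:

```python
import numpy as np
from scipy import stats

# Made-up monthly series with an upward trend (not real gold prices)
rng = np.random.default_rng(5)
months = np.arange(24)
price = 1200 + 8.0 * months + rng.normal(scale=15.0, size=24)

fit = stats.linregress(months, price)

# Point estimate six months past the end of the sample
future_month = months[-1] + 6
forecast = fit.intercept + fit.slope * future_month
print(f"forecast for month {future_month}: {forecast:.1f}")
```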
- FIVE ASSUMPTIONS OF THE CLASSICAL REGRESSION MODEL
The following are the five key assumptions of the classical regression model:
- Linear relationship:
Linear regression requires the relationship between the independent and dependent variables to be linear. It is also important to check for outliers, since linear regression is sensitive to outlier effects. The linearity assumption is best tested with scatter plots; the following two examples depict cases with no and with little linearity present.
[Figures 1 and 2: scatter plots illustrating a relationship with no linearity and one with little linearity]
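A quick way to produce such diagnostic scatter plots, assuming matplotlib is available; the two synthetic data sets below mimic a roughly linear and a clearly non-linear relationship:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 150)
y_linear = 2 + 0.8 * x + rng.normal(scale=1.0, size=150)       # roughly linear
y_curved = 2 + 0.3 * x ** 2 + rng.normal(scale=1.0, size=150)  # clearly non-linear

fig, axes = plt.subplots(1, 2, figsize=(9, 4))
axes[0].scatter(x, y_linear, s=10)
axes[0].set_title("Linear relationship")
axes[1].scatter(x, y_curved, s=10)
axes[1].set_title("Non-linear relationship")
plt.show()
```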
- Multivariate normality:
Linear regression analysis requires all variables to be multivariate normal. This assumption is best checked with a histogram and a fitted normal curve, or with a Q-Q plot. Normality can also be checked with a goodness-of-fit test, e.g., the Kolmogorov-Smirnov test. When the data are not normally distributed, a non-linear transformation (e.g., a log transformation) might fix the issue; however, it can introduce effects of multicollinearity.
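A sketch of these checks (histogram with a fitted normal curve, Q-Q plot, and Kolmogorov-Smirnov test) using SciPy and matplotlib on a made-up sample:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(4)
data = rng.normal(loc=5.0, scale=2.0, size=300)  # made-up sample

# Histogram with a fitted normal curve
plt.hist(data, bins=25, density=True)
grid = np.linspace(data.min(), data.max(), 200)
plt.plot(grid, stats.norm.pdf(grid, data.mean(), data.std()))
plt.show()

# Q-Q plot against the normal distribution
stats.probplot(data, dist="norm", plot=plt)
plt.show()

# Kolmogorov-Smirnov goodness-of-fit test; note the parameters are estimated
# from the data here, so the p-value is only approximate
stat, p = stats.kstest(data, "norm", args=(data.mean(), data.std()))
print(f"KS statistic = {stat:.3f}, p = {p:.3f}")
```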
- No or little multicollinearity:
Linear regression assumes that there is little or no multicollinearity in the data. Multicollinearity occurs when the independent variables are not independent of each other. A second important independence assumption is that the error of the mean has to be independent of the independent variables.
Multicollinearity might be tested with four central criteria (a short VIF sketch follows the list):
- Correlation matrix
- Tolerance
- Variance inflation factor (VIF)
- Condition index.
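The note does not prescribe a particular tool; as one sketch, the correlation matrix and the variance inflation factors can be computed with NumPy and statsmodels on two deliberately correlated synthetic predictors:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + rng.normal(scale=0.3, size=n)  # deliberately correlated with x1

# Correlation matrix of the predictors (first criterion in the list)
print(np.corrcoef(x1, x2))

# Variance inflation factors; the design matrix needs an intercept column
X = sm.add_constant(np.column_stack([x1, x2]))
for i, name in enumerate(["x1", "x2"], start=1):
    vif = variance_inflation_factor(X, i)
    # common rule of thumb: VIF above ~10 signals problematic multicollinearity
    print(f"VIF({name}) = {vif:.2f}")
```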
If multicollinearity is found in the data, centering the data, that is, subtracting the mean score from each variable, might help to solve the problem. Another alternative for tackling the problem is to conduct a factor analysis and rotate the factors to ensure independence of the factors in the linear regression analysis.
- No auto-correlation:
Linear regression analysis requires that there is little or no autocorrelation in the data. Autocorrelation occurs when the residuals are not independent of each other; in other words, when the value of y(x+1) is not independent of the value of y(x). This typically occurs, for instance, in stock prices, where the price is not independent of the previous price.
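The note does not name a test for autocorrelation; a common choice (our addition, not the note's) is the Durbin-Watson statistic, sketched here on residuals from a synthetic series whose errors follow an AR(1) process:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(1)
n = 200
x = np.linspace(0, 10, n)

# Build errors that follow an AR(1) process, so consecutive residuals are correlated
e = np.zeros(n)
for t in range(1, n):
    e[t] = 0.8 * e[t - 1] + rng.normal()
y = 2.0 + 0.5 * x + e

residuals = sm.OLS(y, sm.add_constant(x)).fit().resid
# Durbin-Watson is ~2 when residuals are uncorrelated; values near 0 or 4
# indicate positive or negative autocorrelation respectively
print(f"Durbin-Watson statistic: {durbin_watson(residuals):.2f}")
```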
- Homoscedasticity:
The last assumption the linear regression analysis makes is homoscedasticity. A scatter plot is a good way to check whether homoscedasticity (that is, equal variance of the error terms along the regression line) is given. If the data are heteroscedastic, the scatter plot looks like the following example:
[Figure: scatter plot of a heteroscedastic pattern, with the spread of the points growing along the regression line]
The Goldfeld-Quandt test can check for heteroscedasticity. The test splits the data into a high-value and a low-value group and checks whether the two samples differ significantly. If heteroscedasticity is present, a non-linear correction might fix the problem.
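A sketch of this test using statsmodels' `het_goldfeldquandt`, on synthetic data whose error spread grows with x by construction:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_goldfeldquandt

rng = np.random.default_rng(2)
n = 200
x = np.linspace(1, 10, n)
# Error spread grows with x, i.e. the data are heteroscedastic by construction
y = 3.0 + 1.5 * x + rng.normal(scale=0.5 * x)

X = sm.add_constant(x)
f_stat, p_value, _ = het_goldfeldquandt(y, X)  # splits the ordered sample in two
# A small p-value means the two halves' residual variances differ significantly,
# i.e. homoscedasticity is rejected
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```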
...