Multiple Regression Analysis on Store Sites
Essay by rabbitcff • October 28, 2015 • Essay • 1,176 Words (5 Pages) • 1,249 Views
THE USE OF MULTIPLE REGRESSION ANALYSIS IN LOCATING NEW STORE
Feifei Chen
U28093738
Boston University
Abstract:
The purpose of this analysis is to build two models that can predict sales using data from current stores, based on statistical and managerial tools, and then to interpret and summarize the results of the models to provide managerial recommendations. One such powerful instrument is multiple regression analysis.
Review:
The multiple regression model studies the linear relationship between one dependent variable and two or more independent variables. The general form of the model is:
Y = b + m1X1 + m2X2 + m3X3 + …, where Y is the dependent or explained variable and the Xn are the independent or explanatory variables. The symbol b is the intercept, or constant term. The set of independent variables explains the dependent variable only to a certain extent; the rest is contained in the disturbance, or residual, which is not shown in the equation above. The coefficients m1, m2, m3, … give the marginal relationship between each of the X variables and the Y variable, and they can reveal managerially relevant facts about the data. Each one tells you the difference in the Y variable you expect to see associated with a unit difference in a given X variable when all other X variables stay the same [1].
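To make the "unit difference" reading of the coefficients concrete, here is a minimal sketch in Python. The numbers are made up for illustration and are not taken from the store data; b is treated as the constant term of the equation.

```python
def predict(b, slopes, xs):
    """Evaluate Y = b + m1*X1 + m2*X2 + ... for one observation."""
    return b + sum(m * x for m, x in zip(slopes, xs))

# Hypothetical constant and coefficients (not estimated from any real data).
b = 10.0
slopes = [2.0, -0.5, 4.0]

base = predict(b, slopes, [1.0, 2.0, 3.0])
# Raise X1 by one unit and hold X2 and X3 fixed.
bumped = predict(b, slopes, [2.0, 2.0, 3.0])

# The change in the forecast equals m1, the coefficient on X1.
difference = bumped - base
print(difference)
```

The same comparison works for any other X variable: change it by one unit, keep the rest fixed, and the forecast moves by that variable's coefficient.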
According to Peckoz [2], the assumptions for a multiple regression line to be a reasonable forecasting tool are:
- Linearity. Each variable should ideally have a roughly linear relationship with the variable being forecast, sloping either gently up or gently down.
- No multicollinearity. If two variables are almost perfectly correlated with each other, do not put both into the regression equation, as doing so makes the equation much less reliable.
- Technical assumptions:
  a) There is no obvious trend or pattern in the forecasting errors.
  b) Forecasting errors should approximately follow a normal distribution with roughly the same standard deviation over the whole range of forecasts.
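The multicollinearity assumption in the list above can be screened before fitting by looking at pairwise correlations. The following is a rough sketch only: the 0.95 cutoff, the variable named households, and all the numbers are assumptions chosen for illustration, not values from the store data.

```python
from itertools import combinations

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

def collinear_pairs(data, threshold=0.95):
    """Return pairs of variables whose absolute correlation exceeds the threshold."""
    return [(p, q) for p, q in combinations(data, 2)
            if abs(pearson(data[p], data[q])) > threshold]

# Hypothetical values: households is nearly proportional to population.
data = {
    "population": [1000, 2000, 3000, 4000, 5000],
    "households": [410, 790, 1220, 1580, 2010],
    "selling_sqrft": [30, 12, 45, 20, 33],
}

flagged = collinear_pairs(data)
print(flagged)
```

Any pair that is flagged is a candidate for keeping only one of the two variables in the regression equation.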
Analysis Methodology:
First, we use demographic data to build the regression model. There are about 30 demographic variables, such as the store’s size, its site, and so on, that we assume are related to sales. Following assumption 1, we first examine how each individual variable relates to what we want to forecast: the sales of the store. To observe these relationships, we draw a scatter plot of each variable against sales.
The scatter plots show that many variables have a linear relationship with sales (for simplicity, only a few of the scatter plots are shown). For some variables, however, the relationship with sales looks polynomial. We therefore transform such variables into X^2 form and check the linearity between the Y variable and the transformed X variables.
[pic 1]
[pic 2][pic 3][pic 4]
Figure 1. Scatter plots of variables versus sales
It turned out that for the variables medianinc and %inc10-14, X^2 has a better linear relationship with sales than X itself. We therefore use X^2 for these two variables in the next step.
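One way to put a number on "better linear relationship" is to compare the correlation of sales with X against the correlation of sales with X^2. The sketch below uses synthetic data built to be exactly quadratic, so it only illustrates the comparison; it does not reproduce the result for medianinc or %inc10-14.

```python
def pearson(a, b):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

# Synthetic example: sales depend on the square of x (assumed for illustration).
x = list(range(1, 11))
sales = [3 * v ** 2 + 5 for v in x]

r_linear = pearson(x, sales)                     # sales versus X
r_squared_term = pearson([v ** 2 for v in x], sales)  # sales versus X^2

print(round(r_linear, 3), round(r_squared_term, 3))
```

When the correlation with X^2 is clearly higher, the transformed variable is the better candidate for a linear model.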
Second, we run a stepwise procedure to produce the regression equation. In this approach, we start with the variable that is most correlated with the dependent variable. Then, at each step, we add the most significant of the remaining variables and remove any variable that has become insignificant [3]. We rely on the t-value to judge whether a variable is significant. The stepwise procedure gives the following results:
[pic 5]
Figure 2. Regression results by stepwise approach
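The add-and-remove logic of a stepwise procedure can be sketched as follows. This is an illustration only, not the software used for Figure 2: the cutoff of |t| > 2, the synthetic data, and the variable names x1, x2 and noise are all assumptions.

```python
import random

def invert(a):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        p = m[col][col]
        m[col] = [v / p for v in m[col]]
        for r in range(n):
            if r != col:
                f = m[r][col]
                m[r] = [v - f * w for v, w in zip(m[r], m[col])]
    return [row[n:] for row in m]

def t_values(columns, y):
    """Fit y on a constant plus the given columns; return the slopes' t-values."""
    n = len(y)
    rows = [[1.0] + [c[i] for c in columns] for i in range(n)]
    k = len(rows[0])
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(k)]
    inv = invert(xtx)
    beta = [sum(inv[a][b] * xty[b] for b in range(k)) for a in range(k)]
    resid = [yi - sum(bj * rj for bj, rj in zip(beta, r))
             for r, yi in zip(rows, y)]
    s2 = sum(e * e for e in resid) / (n - k)  # residual variance
    return [beta[j] / (s2 * inv[j][j]) ** 0.5 for j in range(1, k)]

def stepwise(data, y, t_cut=2.0):
    """Add the most significant remaining variable, then drop insignificant ones."""
    chosen = []
    for _ in range(2 * len(data)):  # guard against endless add/remove cycles
        best, best_t = None, t_cut
        for name in data:
            if name in chosen:
                continue
            t = abs(t_values([data[v] for v in chosen + [name]], y)[-1])
            if t > best_t:
                best, best_t = name, t
        if best is None:
            break
        chosen.append(best)
        ts = t_values([data[v] for v in chosen], y)
        chosen = [v for v, t in zip(chosen, ts) if abs(t) > t_cut]
    return chosen

# Synthetic data: y depends on x1 and x2 but not on the "noise" column.
random.seed(0)
n = 60
data = {name: [random.gauss(0, 1) for _ in range(n)]
        for name in ("x1", "x2", "noise")}
y = [5 + 3 * a - 2 * b + random.gauss(0, 1)
     for a, b in zip(data["x1"], data["x2"])]

selected = stepwise(data, y)
print(selected)
```

In the first pass the model has a single variable, so picking the largest |t| is the same as picking the variable most correlated with the dependent variable.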
Next, we check the technical assumptions. We create a graph of residuals versus forecasted figures and a histogram of the residuals in Figure 3. A residual is the difference between the actual sales figure and the forecasted sales figure.
[pic 8][pic 9][pic 6][pic 7]
Figure 3. Analysis of the residuals
We see that the residuals roughly follow a normal distribution. In the scatter plot above, there is no obvious pattern in the residuals, and the vertical spread appears approximately constant. We therefore conclude that relying on the t-value to judge whether a variable is significant is reasonable in this case.
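The visual checks on the residuals can be complemented with a few simple numbers. The sketch below is one possible way to do this, using made-up actual and forecasted figures: it reports the mean residual, the share of residuals within two standard deviations of the mean (roughly 95% under a normal distribution), and the ratio of the residual spread for high forecasts to the spread for low forecasts (close to 1 when the vertical spread is constant).

```python
from statistics import mean, stdev

def residual_checks(actual, forecast):
    """Summarize residuals (actual minus forecast) with three rough diagnostics."""
    residuals = [a - f for a, f in zip(actual, forecast)]
    center, spread = mean(residuals), stdev(residuals)
    share_within_2sd = (sum(abs(r - center) <= 2 * spread for r in residuals)
                        / len(residuals))
    # Split observations into low and high forecasts and compare the spreads.
    order = sorted(range(len(forecast)), key=lambda i: forecast[i])
    half = len(order) // 2
    low = [residuals[i] for i in order[:half]]
    high = [residuals[i] for i in order[half:]]
    return {
        "mean": center,
        "share_within_2sd": share_within_2sd,
        "spread_ratio": stdev(high) / stdev(low),
    }

# Hypothetical figures, not the store data.
forecast = [10, 20, 30, 40, 50, 60, 70, 80]
actual = [11, 19, 32, 38, 51, 59, 72, 78]

checks = residual_checks(actual, forecast)
print(checks)
```

These numbers do not replace the histogram and the scatter plot; they only give a quick summary of the same two questions.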
Now, we interpret the equation and summarize the managerial implications.
The final equation we obtained was the following, with an R^2 of 0.61 and a standard error of 7072:
Forecasted sales = 33148 + 50*%inc10-14^2 - 283*%inc14-20 + 0.125*medianhome - 86*%owner - 284*%freezer - 909*sechome - 263*%sch9-11 + 0.0095*population + 54*selling_sqrft
The R^2 tells us that variation in %inc10-14^2, %inc14-20, medianhome, %owner, %freezer, sechome, %sch9-11, population and selling_sqrft can explain about 61% of the variation in sales. The standard error of 7072 tells us that forecasts from this regression equation are typically about $7072 away from the actual sales figures, and almost all are within the margin of error of 2*7072 = $14144.
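As an illustration of how the equation and the margin of error would be applied to a candidate site, here is a short sketch. The coefficients, constant and standard error are the ones reported above; the site values are hypothetical, and the entry for %inc10-14^2 is assumed to hold the already squared percentage.

```python
INTERCEPT = 33148
STD_ERROR = 7072
COEFFS = {
    "%inc10-14^2": 50,
    "%inc14-20": -283,
    "medianhome": 0.125,
    "%owner": -86,
    "%freezer": -284,
    "sechome": -909,
    "%sch9-11": -263,
    "population": 0.0095,
    "selling_sqrft": 54,
}

def forecast_sales(site):
    """Apply the regression equation to one site's variable values."""
    return INTERCEPT + sum(COEFFS[name] * site[name] for name in COEFFS)

def forecast_range(site):
    """Return the forecast minus and plus two standard errors."""
    center = forecast_sales(site)
    return center - 2 * STD_ERROR, center + 2 * STD_ERROR

# Hypothetical candidate site (values invented for illustration).
site = {
    "%inc10-14^2": 9 ** 2,   # 9% of households in the bracket, squared
    "%inc14-20": 12,
    "medianhome": 60000,
    "%owner": 70,
    "%freezer": 30,
    "sechome": 2,
    "%sch9-11": 15,
    "population": 200000,
    "selling_sqrft": 300,
}

low, high = forecast_range(site)
print(forecast_sales(site), low, high)
```

Because the model is linear, adding one unit of selling_sqrft to a site while holding everything else fixed raises the forecast by 54, the coefficient on that variable.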
...
...