Regression Analysis
Essay by 24 • June 29, 2011 • 1,667 Words (7 Pages) • 1,692 Views
Factor analysis summarizes many variables with a few factors and helps to reveal the structure of a correlation matrix. It addresses multicollinearity among a large number of interrelated quantitative independent variables by grouping them into a few factors, which reduces the correlations among the predictors.
In our case, countries are the units of observation. We have data on different aspects of these countries, such as population, density, percentage of people living in cities, religion, life expectancy, literacy rates, daily calorie intake, number of AIDS cases, fertility, and death rates. For the purpose of this lab, we take LIFEXPF (female life expectancy) as the dependent variable and run a regression on it. Before doing that, however, we run a factor analysis on the independent variables, group them into a few factors, and use the resulting factor scores as the independent variables in the regression. This helps reduce the correlations among the independent variables in the model. The outputs from the factor analysis are analyzed below in separate sections, followed by an interpretation of the regression analysis.
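The workflow described above can be sketched in Python with NumPy. This is only an illustration under stated assumptions: the data are synthetic stand-ins (not the country data set), all variable names are hypothetical, and unrotated principal component scores are used in place of the rotated factor scores that a statistics package would save.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 100 "countries", 6 correlated predictors
# driven by 2 latent factors. This is NOT the essay's data set.
n, p = 100, 6
latent = rng.normal(size=(n, 2))
weights = np.array([[0.90, 0.00], [0.80, 0.10], [0.85, -0.10],
                    [0.00, 0.90], [0.10, 0.80], [-0.10, 0.85]])
X = latent @ weights.T + 0.3 * rng.normal(size=(n, p))
y = 70 + 5 * latent[:, 0] - 3 * latent[:, 1] + rng.normal(size=n)

# 1. Standardize the predictors and eigen-decompose their correlation matrix.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
R = np.corrcoef(Z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)      # eigh returns ascending order
order = np.argsort(eigvals)[::-1]         # re-sort to descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 2. Keep the components whose eigenvalue exceeds 1.
keep = eigvals > 1.0

# 3. Standardized component scores (unit variance, mutually uncorrelated).
scores = Z @ eigvecs[:, keep] / np.sqrt(eigvals[keep])

# 4. Regress the dependent variable on the scores (with an intercept).
design = np.column_stack([np.ones(n), scores])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)

score_corr = np.corrcoef(scores, rowvar=False)
print("components kept:", int(keep.sum()))
print("regression coefficients:", coef)
```

Because the component scores are uncorrelated by construction, the regression coefficients are not distorted by correlations among the predictors, which is the point of running the factor analysis first.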
1) The suitability of the data set for factor analysis (mention the correlation matrix & Bartlett’s)
First, a note on the data set used for the factor analysis. It has many missing values for both the independent variables and the dependent variable. I therefore exclude every observation that has a missing value on any of the variables, so that the analysis and the model are based on complete cases only.
I first ran a correlation matrix for all the independent variables to examine the strength of their relationship with the dependent variable LIFEXPF. The correlation matrix shows that the correlations of LIFEXPF with population in thousands, number of people per square km, region or economic group, AIDS cases, log base 10 of AIDS cases, log base 10 of population, and predominant climate are not significant, so I exclude these variables from my model. Similarly, birth-to-death ratio and population increase per year are only very weakly correlated with LIFEXPF, so I also exclude them from the factor analysis. In addition, log base 10 of GDP is more strongly correlated with LIFEXPF than GDP itself, so I use log base 10 of GDP instead of GDP in my analysis.
The variables that are highly, significantly and positively correlated with LIFEXPF are % of people living in cities, average male life expectancy, % of people who read, daily calorie intake, log base 10 of GDP, % of males who read, and % of females who read. The variables that are highly, significantly and negatively correlated with LIFEXPF are infant mortality, birth rate, death rate, number of AIDS cases per 100,000 people, and fertility. I include all twelve of these variables in the factor analysis because they are also highly correlated with each other (multicollinearity), which is what gives them potential for factor analysis. It is worth mentioning that factor analysis is only suitable for data sets with high levels of multicollinearity.
In addition, the KMO measure of sampling adequacy and Bartlett’s test of sphericity both assess the level of correlation and common variance among the variables, and so help determine whether the data are suitable for factor analysis. The analysis gives a KMO measure of 0.852, which qualifies as meritorious. This indicates a high level of common variance and therefore strong suitability for factor analysis, which may be helped by our large sample size. Similarly, Bartlett’s test of sphericity is significant, with a p-value below 0.05, so we reject the hypothesis that the correlation matrix is an identity matrix. This also indicates that there are significant relationships among the independent variables. Hence, the data set is suitable for factor analysis.
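For readers who want to see what these two statistics measure, here is a minimal sketch using their standard textbook formulas. The correlation matrix and the sample size below are hypothetical (a 4-variable matrix with every correlation equal to 0.8, and n = 100); they are not the values from this lab, and the function names are my own.

```python
import numpy as np

def bartlett_sphericity(R, n):
    """Chi-square statistic and degrees of freedom for H0: R is an identity matrix."""
    p = R.shape[0]
    _, logdet = np.linalg.slogdet(R)      # log of the determinant of R
    statistic = -(n - 1 - (2 * p + 5) / 6) * logdet
    dof = p * (p - 1) // 2
    return statistic, dof

def kmo_overall(R):
    """Overall Kaiser-Meyer-Olkin measure of sampling adequacy."""
    inv = np.linalg.inv(R)
    d = np.sqrt(np.diag(inv))
    partial = -inv / np.outer(d, d)       # partial correlations given all other variables
    off = ~np.eye(R.shape[0], dtype=bool) # mask selecting off-diagonal entries
    r2 = np.sum(R[off] ** 2)
    q2 = np.sum(partial[off] ** 2)
    return r2 / (r2 + q2)

# Hypothetical example: 4 variables, all pairwise correlations 0.8, n = 100.
R = np.full((4, 4), 0.8)
np.fill_diagonal(R, 1.0)

statistic, dof = bartlett_sphericity(R, n=100)
kmo = kmo_overall(R)
print(f"Bartlett chi-square = {statistic:.2f} on {dof} df")
print(f"KMO = {kmo:.3f}")
```

KMO is close to 1 when the partial correlations are small relative to the ordinary correlations, that is, when most of the correlation is shared across variables. The p-value for Bartlett's statistic is the upper tail of a chi-square distribution with the given degrees of freedom, for example via `scipy.stats.chi2.sf(statistic, dof)`.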
2) How many factors selected and how many important factors (Eigen values and percent variance)
The following table, “Total Variance Explained”, reports the number of factors selected, their eigenvalues, and the percentage of variance they explain.
In the table, the first four columns show how much of the variance in the original 12 independent variables is explained by each component. Component 1 explains about 69% of the variance, followed by the remaining components up to component 12 with their respective shares. The eigenvalues in column 2 likewise sum to 12, the number of variables.
The analysis extracted only the two components with eigenvalues greater than 1. These are the first two components, with eigenvalues of 8.279 and 1.312, which explain 68.994% and 10.932% of the variance respectively. Together they explain 79.926% of the variance, which is well above the 60% commonly accepted as a standard in social science, so the variance explained by the components is fairly strong. The last three columns show the adjusted eigenvalues and percentages of variance explained after rotation (the rotation sums of squared loadings). It is worth noting that the total variance explained does not change with rotation. The adjusted values are used in the further analysis.
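The link between the eigenvalues and the percentages reported above is simple arithmetic: with standardized variables, each of the 12 variables contributes one unit of variance, so a component's share is its eigenvalue divided by 12. The short check below uses only the two eigenvalues quoted in this section; the variable names are my own.

```python
n_variables = 12               # number of variables entered into the factor analysis
eigenvalues = [8.279, 1.312]   # eigenvalues of the two extracted components

# Percentage of total variance explained by each component.
percent = [100 * ev / n_variables for ev in eigenvalues]
cumulative = sum(percent)

# Eigenvalue-greater-than-1 rule used to decide which components to keep.
retained = [ev for ev in eigenvalues if ev > 1]

for ev, pct in zip(eigenvalues, percent):
    print(f"eigenvalue {ev:.3f} -> {pct:.2f}% of variance")
print(f"cumulative: {cumulative:.2f}%")
```

Small differences in the third decimal place from the table values arise because the eigenvalues quoted here are already rounded.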
The following scree plot gives a visual representation of factor selection and extraction. It graphs the eigenvalue of each component. The two extracted factors, with eigenvalues above the cut-off value of 1, are evident in the plot.
3) What “dimensions” do these important factors represent? How would you name them? How do individual variables load on these factors?
The following table, “Rotated Component Matrix”, shows the dimensions that the extracted factors represent. It gives the factor loading of each variable on each component (factor) after rotation. For an orthogonal rotation, each loading is the correlation between the variable and the rotated factor.
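To illustrate how a rotated loading matrix can be produced and read, here is a minimal sketch of a varimax rotation. Note the assumptions: the write-up does not state which rotation method was used, so varimax is my assumption; the loading matrix below is hypothetical (four made-up variables on two factors), and the function name is my own.

```python
import numpy as np

def varimax(loadings, max_iter=100, tol=1e-8):
    """Orthogonally rotate a (variables x factors) loading matrix by varimax."""
    p, k = loadings.shape
    rotation = np.eye(k)
    criterion = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        # Target matrix derived from the gradient of the varimax criterion.
        target = rotated ** 3 - rotated * (np.sum(rotated ** 2, axis=0) / p)
        u, s, vt = np.linalg.svd(loadings.T @ target)
        rotation = u @ vt                 # closest orthogonal matrix
        if s.sum() < criterion * (1 + tol):
            break                         # no meaningful improvement
        criterion = s.sum()
    return loadings @ rotation

# Hypothetical unrotated loadings: a clean two-group structure turned by 30 degrees.
simple = np.array([[0.90, 0.00], [0.80, 0.00], [0.00, 0.70], [0.00, 0.85]])
theta = np.deg2rad(30)
mix = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
unrotated = simple @ mix

rotated = varimax(unrotated)

# Assign each variable to the factor on which its absolute loading is largest.
groups = np.argmax(np.abs(rotated), axis=1)
print(np.round(rotated, 3))
print("factor assignment per variable:", groups)
```

Because the rotation is orthogonal, each variable's communality (the sum of its squared loadings) is unchanged, which is why rotation redistributes the explained variance between factors without changing its total.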
From the table, we can see that % of people who read, % of males who read, and % of females who read have high positive loadings on component 1. Similarly, infant mortality, birth rate, fertility, male life expectancy, GDP, daily calorie intake, and % of people living in cities all have high loadings on component 1, whereas AIDS cases and death rate have high loadings on component 2. Each independent variable is grouped under the factor on which it has a high loading. The following table shows the grouping of variables under each factor based on their loadings.
Factor 1 Factor 2 Loadings
%
...
...