Factor Analysis
Author: anton, 01 June 2011
Factor analysis summarizes many variables with a few factors and helps reveal the structure of a correlation matrix. It deals with multicollinearity among a large number of interrelated quantitative independent variables by grouping them into a few factors, which reduces the correlations among the predictors.
In our case, countries are the units of observation. We have data on different aspects of these countries, such as population, density, percentage of people living in cities, religion, life expectancy, literacy rates, daily calorie intake, number of people affected by AIDS, fertility, death rates, etc. For this lab we take LIFEXPF (female life expectancy) as the dependent variable and run a regression on it. Before doing that, we run a factor analysis on the other independent variables, group them into a few factors, and use the factor scores as independent variables in the regression. This helps reduce correlations among the independent variables in the model. The outputs from the factor analysis are analyzed below in separate sections, followed by an interpretation of the regression analysis.
1) The suitability of the data set for factor analysis (mention the correlation matrix & Bartlett's)
Here, I want to explain more about the data set I am using for factor analysis. The data set has many missing values for the independent as well as the dependent variables. I therefore exclude all observations with missing values to improve the analysis and the model.
First, I ran a correlation matrix for all the independent variables to examine the strength of their relationship with the dependent variable LIFEXPF. From the correlation matrix, I find that the correlations of population in thousands, number of people per square km, region or economic group, AIDS cases, log base 10 of AIDS cases, log base 10 of population, and predominant climate with LIFEXPF are not significant, so I exclude these variables from my model. Similarly, birth-to-death ratio and population increase per year are only very weakly correlated with LIFEXPF, so I also exclude them from the factor analysis. In addition, log base 10 of GDP is more strongly correlated with LIFEXPF than GDP itself, so I use log base 10 of GDP instead of GDP in my analysis.
The variables that are strongly, significantly and positively correlated with LIFEXPF are % of people living in cities, average male life expectancy, % of people who read, daily calorie intake, log base 10 of GDP, % of males who read, and % of females who read. The variables that are strongly, significantly and negatively correlated with LIFEXPF are infant mortality, birth rate, death rate, number of AIDS cases per 100,000 people, and fertility. I include all 12 of these variables in the factor analysis because they show multicollinearity (high correlations with each other) and therefore have potential for factor analysis. It is worth mentioning that factor analysis works well only on data sets whose variables are substantially correlated with one another.
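The screening step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are hypothetical toy values, not the country data set, the variable names are made up, and the significance check uses the Fisher z approximation rather than the t-test that a statistics package would normally report.

```python
import math
from statistics import NormalDist

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def approx_p_value(r, n):
    """Two-sided p-value from the Fisher z approximation (needs n > 3)."""
    r = max(min(r, 0.999999), -0.999999)  # keep atanh finite
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

# Hypothetical toy data, NOT the values from the essay's data set.
lifexpf = [61, 62, 63, 64, 65, 66, 67, 68]
candidates = {
    "strong_predictor": [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1],
    "unrelated_predictor": [5, 3, 6, 4, 5, 3, 6, 4],
}

# Keep only candidates whose correlation with the outcome is significant.
kept = []
for name, values in candidates.items():
    r = pearson_r(values, lifexpf)
    p = approx_p_value(r, len(lifexpf))
    if p < 0.05:
        kept.append(name)
    print(f"{name}: r = {r:.3f}, p = {p:.3f}")
```

With real data the same loop would run over every candidate variable, and only the names collected in `kept` would go forward into the factor analysis.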
In addition, the KMO measure of sampling adequacy and Bartlett's test of sphericity also check for high levels of correlation and common variance among the variables, and so help determine the suitability of the data for factor analysis. From the analysis, we get a KMO measure of sampling adequacy of 0.852, which qualifies as meritorious. This indicates a high level of common variance and therefore good suitability for factor analysis. Our large sample size may contribute to this. Similarly, Bartlett's test of sphericity is significant, with a significance level below 0.05, which indicates that there are significant relationships among the independent variables. Hence, the data set is suitable for factor analysis.
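To make the two diagnostics more concrete, here is a minimal sketch of how they can be computed from a correlation matrix. It uses the standard textbook formulas (KMO from squared correlations and squared partial correlations, Bartlett's statistic from the determinant of the correlation matrix). The 3 x 3 matrix and the sample size of 100 are hypothetical and are not taken from the essay's output. The p-value step is left out; assuming SciPy is available, it could be obtained from the chi-square distribution, for example with `scipy.stats.chi2.sf`.

```python
import math

def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def determinant(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    n = len(matrix)
    m = [list(row) for row in matrix]
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det
        det *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[col])]
    return det

def kmo(corr):
    """Overall Kaiser-Meyer-Olkin measure of sampling adequacy."""
    inv = invert(corr)
    n = len(corr)
    sum_r2 = 0.0  # squared correlations
    sum_a2 = 0.0  # squared partial correlations
    for i in range(n):
        for j in range(n):
            if i != j:
                partial = -inv[i][j] / math.sqrt(inv[i][i] * inv[j][j])
                sum_r2 += corr[i][j] ** 2
                sum_a2 += partial ** 2
    return sum_r2 / (sum_r2 + sum_a2)

def bartlett_sphericity(corr, n_obs):
    """Chi-square statistic and degrees of freedom of Bartlett's test."""
    p = len(corr)
    chi_square = -(n_obs - 1 - (2 * p + 5) / 6) * math.log(determinant(corr))
    df = p * (p - 1) // 2
    return chi_square, df

# Hypothetical correlation matrix and sample size, for illustration only.
toy_corr = [[1.0, 0.6, 0.6],
            [0.6, 1.0, 0.6],
            [0.6, 0.6, 1.0]]
toy_kmo = kmo(toy_corr)
toy_chi_square, toy_df = bartlett_sphericity(toy_corr, n_obs=100)
print(f"KMO = {toy_kmo:.3f}, chi-square = {toy_chi_square:.1f}, df = {toy_df}")
```

A KMO close to 1 means the partial correlations are small relative to the raw correlations, and a large Bartlett statistic means the correlation matrix is far from an identity matrix. Both point toward data that suit factor analysis.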
2) How many factors selected and how many important factors (Eigen values and percent variance)
The following table, "Total Variance Explained", shows the number of factors selected, the Eigen values and the percentage of variance.
In the table, the first four columns show the percentage of variance in the original 12 independent variables that is explained by each component. Component 1 explains about 69% of the variance, followed by the remaining components up to component 12 with their respective shares. The Eigen values in column 2 total 12, the number of variables.
The analysis extracted only the two components that have Eigen values greater than 1. These are the first two components, with Eigen values of 8.279 and 1.312, which explain 68.994% and 10.932% of the variance respectively. Together these two components explain 79.926% of the total variance, which is above the commonly used social science threshold of 60%. The variance explained by the components is therefore fairly strong. The last three columns show the adjusted Eigen values and % of variance explained, which are the rotation sums of squared loadings. It is worth noting that the total variance explained has not changed after rotation. The adjusted values will be used for further analysis.
The following scree plot gives a visual interpretation of factor selection and extraction. It plots the Eigen value of each component against the component number. The two extracted factors, with Eigen values above the cut-off value of 1, are evident in the plot.
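The link between Eigen values and percentage of variance can be checked with a little arithmetic. The sketch below uses the two Eigen values reported above and assumes, as is standard for an analysis of a correlation matrix, that each of the 12 standardized variables contributes one unit of variance. Because the Eigen values are rounded to three decimals, the percentages match the reported ones only to within rounding.

```python
# Eigen values of the two retained components, as reported in the text.
eigenvalues = [8.279, 1.312]
n_variables = 12  # Eigen values of a correlation matrix sum to the number of variables

def percent_variance(eigenvalue, n_variables):
    """Share of total variance explained by one component, in percent."""
    return 100.0 * eigenvalue / n_variables

# Eigen-value-greater-than-1 rule: keep components that explain more
# variance than a single standardized variable.
retained = [ev for ev in eigenvalues if ev > 1.0]
shares = [percent_variance(ev, n_variables) for ev in retained]
cumulative = sum(shares)

for number, (ev, share) in enumerate(zip(retained, shares), start=1):
    print(f"Component {number}: Eigen value {ev:.3f}, {share:.2f}% of variance")
print(f"Cumulative: {cumulative:.2f}%")
```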
3) What "dimensions" do these important factors represent? How would you name them? How do individual variables load on these factors?
The following table, "Rotated Component Matrix", shows the dimensions that the extracted factors represent. It gives the factor loadings of each variable on the components (factors) after rotation. Each number represents the correlation between the variable and the rotated factor, assuming an orthogonal rotation.
From the table, we can see that % of people who read, % of males who read, and % of females who read have high positive loadings on component 1, as do male life expectancy, log base 10 of GDP, daily calorie intake, and % of people living in cities. Infant mortality, birth rate, and fertility have high negative loadings on component 1. The variables AIDS cases and death rate have high loadings on component 2. Each independent variable is grouped under the factor on which it has the highest loading. The following table shows the grouping of variables under each factor based on their loadings.
Variable                                         Factor      Loading
% of people who read                             Factor 1     0.950
% of females who read                            Factor 1     0.937
% of males who read                              Factor 1     0.916
Infant mortality (deaths per 1000 live births)   Factor 1    -0.872
Birth rate per thousand persons                  Factor 1    -0.861
Fertility: average number of kids                Factor 1    -0.836
Average male life expectancy                     Factor 1     0.752
Log base 10 of GDP per cap                       Factor 1     0.748
Daily calorie intake                             Factor 1     0.698
% of people living in cities                     Factor 1     0.673
Number of AIDS cases per 100,000 people          Factor 2     0.929
Death rate per 1000 people                       Factor 2     0.764
This table helps in interpreting the extracted factors based on their groupings, and the grouping also makes naming the factors easier. In general, factor 1 consists of variables related to literacy, health and other socio-economic conditions, while factor 2 includes death-related variables. Higher scores on factor 1 go with higher life expectancy: the variables that raise life expectancy load positively on it, and infant mortality, birth rate and fertility load negatively. Factor 2 contains variables that reduce life expectancy. Naming factor 1 is difficult because it combines independent variables of different kinds and includes almost all the socio-economic variables. Since high scores on it promote life expectancy, we can name it Socio-economic factors of well being. Factor 2 we can name Life Expectancy Demoters.
4) Results of the regression with the factor score variables (vital stats of the model, how the residuals look, and significance of coefficients)
Now, having obtained two distinct factors from the 12 independent variables through factor analysis, we can run a multiple regression of the dependent variable LIFEXPF on the factor scores as independent variables. These factors capture the underlying elements of the variables used to create them. A summary of the outputs obtained from the regression analysis is given below.
Dependent variable: LIFEXPF
Independent variables: Factor 1: Socio-economic factors of well being
Factor 2: Life expectancy Demoters
Method used: Enter
Regression Equation: Predicted LIFEXPF = 65.831 + 8.827 Factor 1 - 6.335 Factor 2
Significance of coefficients: Constant and both the coefficients are significant.
Residual Plot: Residuals are fairly randomly distributed around zero. There are no obvious outliers either, and the plot shows no violation of the regression assumptions.
From the regression outputs, we can say that the model captures the relationship between the factors and the dependent variable very well, and that the factors predict LIFEXPF very well. An R² of 0.961 indicates a very good fit: 96.1% of the variance in the dependent variable is explained by the two factors. All the coefficients are significant, and the residual plot is acceptable. Interpretation of the model is given in the next section.
5) Brief interpretation of the model (What are the variable dimensions of female life expectancy?)
From the model, we can say that female life expectancy is very well predicted by the socio-economic factors of well being, such as literacy and mortality, and by the life expectancy demoting factors. From the regression equation,
Predicted LIFEXPF = 65.831 + 8.827 Factor 1 - 6.335 Factor 2
We can say that female life expectancy is positively affected by the combination of socio-economic factors of well being: with every unit increase in the Factor 1 score, predicted female life expectancy increases by 8.827 years. Similarly, the life expectancy demoters affect female life expectancy negatively: with every unit increase in the Factor 2 score, predicted female life expectancy decreases by 6.335 years.
If countries want to increase female life expectancy, their policies should focus chiefly on literacy rates, health, and economic opportunities that raise GDP. It would also be beneficial to raise awareness about life-threatening diseases like AIDS. A decrease in the death rate for the entire population will also increase female life expectancy. Lastly, we can infer that female life expectancy is not the product of a single variable but reflects a set of socio-economic variables that are highly inter-related. Policies should therefore, as far as possible, target all these interrelated variables at once to have a positive effect on female life expectancy.
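As a quick illustration of how the coefficients are read, the fitted equation can be written as a small function. The coefficients are the ones reported above. The function name and the factor score inputs are hypothetical, chosen only to show the effect of a one-unit change in each factor score.

```python
# Coefficients taken from the regression equation reported in the text.
INTERCEPT = 65.831
B_FACTOR1 = 8.827   # Socio-economic factors of well being
B_FACTOR2 = -6.335  # Life Expectancy Demoters

def predict_lifexpf(factor1, factor2):
    """Predicted female life expectancy for a pair of factor scores."""
    return INTERCEPT + B_FACTOR1 * factor1 + B_FACTOR2 * factor2

# Hypothetical factor scores, used only to show the effect of each term.
baseline = predict_lifexpf(0.0, 0.0)   # both factor scores at zero
plus_f1 = predict_lifexpf(1.0, 0.0)    # one unit higher on Factor 1
plus_f2 = predict_lifexpf(0.0, 1.0)    # one unit higher on Factor 2

print(f"Baseline:        {baseline:.3f}")
print(f"Factor 1 + 1:    {plus_f1:.3f} (change {plus_f1 - baseline:+.3f})")
print(f"Factor 2 + 1:    {plus_f2:.3f} (change {plus_f2 - baseline:+.3f})")
```

When both factor scores are zero the prediction equals the intercept, so the intercept can be read as the predicted female life expectancy for a country with zero scores on both factors.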