* The data file was created by randomly sampling 400 elementary schools from the California Department of Education's API 2000 dataset. It contains a measure of school academic performance as well as other attributes of the elementary schools, such as class size, enrollment, and poverty.
use http://www.ats.ucla.edu/stat/stata/webbooks/reg/elemapi, clear

save elemapi, replace
regress api00 acs_k3 meals full
* These variables measure the academic performance of the school (api00), the average class size in kindergarten through 3rd grade (acs_k3), the percentage of students receiving free meals (meals), which is an indicator of poverty, and the percentage of teachers who have full teaching credentials (full).
* From these results, we would conclude that lower class sizes are related to higher performance, that fewer students receiving free meals is associated with higher performance, and that the percentage of teachers with full credentials was not related to academic performance in the schools. 
d
* The data set has 400 observations, but the first regression used only 313, so some of the regression variables must have missing values.
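* A quick way to see where the 87 dropped observations (400 - 313) come from is to count missing values per variable. This is only a sketch using built-in commands; the second line counts observations with a missing value in any of the four regression variables.
misstable summarize api00 acs_k3 meals full
count if missing(api00, acs_k3, meals, full)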
codebook api00 acs_k3 meals full yr_rnd
* acs_k3 (average class size) ranges from -21 to 25, which is very odd because a class size cannot be negative.
* It also has 2 missing values.
* There are 85 missing values in meals.
* The variable yr_rnd is coded 0 = No (not year-round) and 1 = Yes (year-round).
* 308 schools are not year-round and 92 are year-round; none are missing.
summarize api00 acs_k3 meals full
sum  acs_k3 , detail
ta acs_k3
list snum dnum acs_k3 if acs_k3 < 0
*they all come from the same district
histogram acs_k3
graph box acs_k3
stem full
* The stem-and-leaf plot shows 104 observations where the percent with a full credential is less than one. This is over 25% of the schools, and seems very unusual.
ta full
* Let's see which district(s) these data came from.


tabulate dnum if full <= 1
count if dnum==401
*All of the observations from this district seem to be recorded as proportions instead of percentages.  
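* If the proportion explanation is right, rescaling would put these values back on the 0-100 scale. This is a hedged sketch only: full_pct is a hypothetical variable name, and the rule "values of 1 or less are proportions" is an assumption that should be checked against the source records before it is trusted. Missing values of full stay missing.
generate full_pct = cond(full <= 1, full*100, full)
summarize full full_pct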
graph matrix api00 acs_k3 meals full, half
  

* The corrected version of the data is called elemapi2. Let's load that data file and repeat our original regression to see whether the results are the same as before.

use http://www.ats.ucla.edu/stat/stata/webbooks/reg/elemapi2, clear
regress api00 acs_k3 meals full


save elemapi2, replace
*Simple Linear Regression
* In this type of regression, we have only one predictor variable. This variable may be continuous, meaning that it may assume all values within a range (for example, age or height), or it may be dichotomous, meaning that it may assume only one of two values (for example, 0 or 1).
regress api00 enroll

* The t-test for enroll equals -6.70 and is statistically significant, meaning that the regression coefficient for enroll is significantly different from zero. Note that (-6.70)^2 = 44.89, which matches the F-statistic up to rounding.
*The coefficient for enroll is -.1998674, or approximately -.2, meaning that for a one unit increase in enroll, we would expect a .2-unit decrease in api00
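* The link between the t- and F-statistics can be checked from the results Stata stores after the regression just run. This is a sketch; _b[], _se[] and e(F) are standard stored results, and no output values are asserted here.
display "t^2 = " %9.2f (_b[enroll]/_se[enroll])^2
display "F   = " %9.2f e(F)
* Expected change in api00 for a 100-student increase in enroll:
display "100 x b = " %9.2f 100*_b[enroll]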
predict fv
list api00 fv in 1/10

scatter api00 enroll
*We can combine scatter with lfit to show a scatterplot with fitted values.

twoway (scatter api00 enroll) (lfit api00 enroll)
 
twoway (scatter api00 enroll, mlabel(snum)) (lfit api00 enroll)
  
*This allows us to see, for example, that one of the outliers is school 2910.
predict e, residual

save "/Users/roiraca/Desktop/desktop/corso sustainability of public policy Ferrara/ols/elemapi3.dta", replace
clear
*Multiple regression
use http://www.ats.ucla.edu/stat/stata/webbooks/reg/elemapi2

regress api00 ell meals yr_rnd mobility acs_k3 acs_46 full emer enroll
*As with the simple regression, we look to the p-value of the F-test to see if the overall model is significant. With a p-value of zero to four decimal places, the model is statistically significant. The R-squared is 0.8446, meaning that approximately 84% of the variability of api00 is accounted for by the variables in the model. In this case, the adjusted R-squared indicates that about 84% of the variability of api00 is accounted for by the model, even after taking into account the number of predictor variables in the model.


* The coefficient for each variable indicates the amount of change one could expect in api00 given a one-unit change in that variable, holding all other variables in the model constant.
* The beta coefficients are used by some researchers to compare the relative strength of the various predictors within the model. Because the beta coefficients are all measured in standard deviations, instead of the units of the variables, they can be compared to one another. In other words, the beta coefficients are the coefficients that you would obtain if the outcome and predictor variables were all transformed to standard scores, also called z-scores, before running the regression.
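* A sketch of that equivalence for a single predictor, assuming api00 and enroll have no missing values in this file (z_api00 and z_enroll are hypothetical variable names). The slope on z_enroll in the first regression should match the Beta column of the second.
egen z_api00  = std(api00)
egen z_enroll = std(enroll)
regress z_api00 z_enroll
regress api00 enroll, beta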


regress api00 ell meals yr_rnd mobility acs_k3 acs_46 full emer enroll, beta
test ell==0
* A more interesting test would be to see whether the contribution of class size is significant. Since the information regarding class size is contained in two variables, acs_k3 and acs_46, we include both of these in the test command.


test acs_k3 acs_46

* One way to think of this is that there is a significant difference between a model with acs_k3 and acs_46 and a model without them, i.e., there is a significant difference between the "full" model and the "reduced" model.


correlate api00 ell meals yr_rnd mobility acs_k3 acs_46 full emer enroll

*If we look at the correlations with api00, we see meals and ell have the two strongest correlations with api00.
* We can also use the pwcorr command to do pairwise correlations. The most important difference between correlate and pwcorr is the way in which missing data are handled. With correlate, an observation is dropped if any variable has a missing value; in other words, correlate uses listwise (also called casewise) deletion. pwcorr uses pairwise deletion, meaning that an observation is dropped only if there is a missing value for the pair of variables being correlated. Two options that you can use with pwcorr, but not with correlate, are the sig option, which gives the significance levels for the correlations, and the obs option, which gives the number of observations used in each correlation.
pwcorr api00 ell meals yr_rnd mobility acs_k3 acs_46 full emer enroll, obs sig
* Regression diagnostics verify whether the data meet the assumptions of linear regression. Here, we will focus on the issue of normality.
* Strictly, it is the residuals that need to be normally distributed, not the variables themselves. In fact, the residuals need to be normal only for the t-tests to be valid.
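* A minimal sketch of checking the residuals rather than a raw variable, using the simple regression of api00 on enroll from earlier (res_enroll is a hypothetical variable name):
regress api00 enroll
predict res_enroll, residuals
kdensity res_enroll, normal
qnorm res_enroll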
histogram enroll
*We can use the normal option to superimpose a normal curve on this graph and the bin(20) option to use 20 bins.  The distribution looks skewed to the right.

histogram enroll, normal bin(20)


histogram enroll, normal bin(20) xlabel(0(100)1600)

kdensity enroll, normal 

*Kernel density plots have the advantage of being smooth and of being independent of the choice of origin, unlike histograms. Stata implements kernel density plots with the kdensity command.
graph box enroll
*qnorm is sensitive to non-normality near the tails, and indeed we see considerable deviations from normal, the diagonal line, in the tails. This plot is typical of variables that are strongly skewed to the right.

qnorm enroll
  
* Finally, the normal probability plot is also useful for examining the distribution of variables. pnorm is sensitive to deviations from normality nearer to the center of the distribution. Again, we see indications of non-normality in enroll.

pnorm enroll
  
* Having concluded that enroll is not normally distributed, how should we address this problem? First, we may try entering the variable as-is into the regression, but if we see problems, which we likely would, then we may try to transform enroll to make it more normally distributed. Potential transformations include taking the log, the square root, or raising the variable to a power. Selecting the appropriate transformation is somewhat of an art. Stata includes the ladder and gladder commands to help in the process: ladder reports numeric results and gladder produces a graphic display. Let's start with ladder and look for the transformation with the smallest chi-square.

ladder enroll
gladder enroll
*This also indicates that the log transformation would help to make enroll more normally distributed.
generate lenroll = log(enroll)  

hist lenroll, normal
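* The same diagnostics can be rerun on the transformed variable to see whether the log helped. This is a sketch: sktest is Stata's built-in skewness/kurtosis test for normality, applied here to both the original and the logged variable for comparison.
qnorm lenroll
sktest enroll lenroll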
* 1.7 Self Assessment
* 1. Make five graphs of api99: histogram, kdensity plot, boxplot, symmetry plot and normal quantile plot.
* 2. What is the correlation between api99 and meals?
* 3. Regress api99 on meals. What does the output tell you?
* 4. Create and list the fitted (predicted) values.
* 5. Graph meals and api99 with and without the regression line.
* 6. Look at the correlations among the variables api99 meals ell avg_ed using the corr and pwcorr commands. Explain how these commands are different. Make a scatterplot matrix for these variables and relate the correlation results to the scatterplot matrix.
* 7. Perform a regression predicting api99 from meals and ell. Interpret the output.
#delimit cr

capture log close
clear
exit