lab 5_regressione lineare semplice completo
                
      
      
      
        
          
              
                 Lab 5_regressione.txt
              
              
                  —
                  Plain Text,
                  9 kB (9512 bytes)
                Lab 5_regressione.txt
              
              
                  —
                  Plain Text,
                  9 kB (9512 bytes)
              
          
          
        
      
    
    
            
Contenuto del file
Simple Linear Regression Model using R
UNIFE_ STATISTICA MULTIVARIATA (DEP. MATEMATICA) 
Mini V. OTTOBRE-2019
RESEARCH QUESTION:
does exist a linear causal relationship between the number of cakes sold in a week (by a firm) and the unit�s price (the price applied per cake)?
Let�s observe a given dataset and perform a simple linear regression analysis
#Analysis: step by step
0. LET'S PREPARE THE DATASET
1. Visualize the relationship: the scatter plot
2. Identify the estimated model
3. The model on a graph
4. Prediction: the expected Y values given a X value
5. The model�s goodness of fit
6. Graphical analysis of Linear Regression Model�s assumptions
7. what about the inference? #
_________________________________________
#0.LET'S PREPARE THE DATASET
#we upload an external dataset
#A) CHECK THE DIRECTORY PROCESS
getwd()
#B) CHANGE THE ACTUAL DIRECTORY (IF NECESSARY) FROM THE FILE BAR
#C) CHECK AGAIN THE DIRECTORY PROCESS
getwd()
#D) WE IMPORT THE DATABASE OF INTERES
cake<-read.csv2("cake_reg lin.csv")
#E) CHECK THE UPLOADED DATASET
View(cake)
#F) CHECK THE DATABASE STRUCTURE
str(cake) #this command shows the structure and characteristics of the data
head(cake) #this command shows the first six rows of our dataset
# G) ...TO BE SURE THE DATABASE IS AVAILABLE WITHIN THE R SOFTWARE FOR NEXT ANALYSIS
attach(cake)
#BECAUSE WE ARE INTERESTED IN TWO VARIABLES (UNITS AND PRICE), WE EXCLUDE THE FIRST ONE
cake=cake[,-1]
#1. Graphical observation of the data#
plot(x,y)
#What we can say about the relationship between this couple of data?#
#2. We may identify the model using two different strategies: 
a)	Following all the steps seen in theory
b)	Using the lm function in R#
#2A: Let�s follow the steps we�ve seen in theory#
x.difference=x-mean(x)                       #xi - x average #
x.difference
y.difference=y-mean(y)                       #yi - y average #
y.scarti
dev.x=sum(x.difference^2)          #total sum of (xi - x average)#
dev.x
dev.y=sum(y.difference^2)          #total sum of (yi - y average)#
dev.y
# let�s compute the total sum of the product between x and y differences#
codev.xy=sum(x.difference*y.difference)
codev.xy
#now we have all the elements to compute the coefficients of our model#
b1=codev.xy/dev.x           #SSYX/SSX #
b1
b0=mean(y)-mean(x)*b1                  # average y -b1*average x#
b0
#using those information we may transcript the equation of our estimated model #
#y= b0+b1 * xi -->  #
#we may predict the value of weekly SOLD_KAKES for a given unit price #
#before to make any prediction, It�s important to individuate the X range, given by the minimum and maximum value  that X takes in our database: we have two different possibilities: 
>max(x)
>min(x)
>range(x)
#let�s now make prediction    ?  Using the model :   prediction=b0+b1*x
#How many cakes we estimate to sell in a week in which the unit�s price is 5.3$ ?#
prediction5.3=b0+b1*5.3
prediction5.3
#please, interpret the obtained result ? when the unit�s price is 5.3$, in that week we�ll expect to sell �... cakes#
#How many cakes we estimate to sell in a week in which the unit�s price is 7.2$ ?#
Prediction7.2=b0+b1*7.2
Prediction7.2
#please, interpret the obtained result#
#--------------#
#B. let�s compute the Simple Linear Regression Model  using the R function lm()#
#the function is  lm(dependent variable (Y)~explanatory variable (X))#
#how to write �tilde� using your keyboard? alt+126 (from the numerical small keyboard on the right side)#
reg.lin=lm(y~x)
#the result is an object in R: we may visualize the performed linear regression simply by re-calling the object�s name#
reg.lin
#when we want to visualize some specified contents of our analysis 
we need to use the dollar symbol between the model�s name and 
the specified contents we are interested in $#
# i.e.  regression$specification
#for instance we may want to visualize the coefficients of our model #
reg.lin$coefficients
# definitely we have individuate the equation of our estimated model #
#__________________#
#3. Plot of our linear model#
plot(x,y)    #pairs of coordinates 
lines(x,y)   #line which link all the coordinates
abline(reg.lin)        #graphical representation of the regression line
#__________________#
# 4. Prediction: the expected Y values given a X value
?	Already seen in the 2A step
#If in a given week, the company we are working for decides to apply a unit cake�s price equals to 6.8$, how many cakes we�ll expect to sell (in that week)?#
prev6.8=b0+b1*6.8
prev6.8
#comment the results: how many cakes the company should prepare for that week?#
#_________________#
#5. The model�s goodness of fit or the coefficient of determination (R2) 
# how much of the total variation in Y is explained by our simple regression model?
#three ways to identify R2:
a)	Computing SSR/SST
b)	Checking the regression model�s output
c)	Checking the ANOVA table
#5A. let�s compute R2=SSR/SST#
dev.tot=sum((y-mean(y))^2)                      #total residuals SST
dev.disp=sum(reg.lin$residuals^2)           #residuals SSE
dev.reg=dev.tot-dev.disp                            #regression�s residuals SSR
RQ=dev.reg/dev.tot
RQ
#how we can interpret the result?
#does the model we�ve performed explain a lot of the variation in Y?
#Is it a good model or not? 
#how much of the variation in Y is not explained by the model? So, how much of unexplained variation in Y still exists? (part of variation dues to different factors or not caught by the linear relationship)
#------------#
#5B. we may obtain the value of the coefficient of determination (R2) observing the summary of our regression model ? we use the command �summary�
summary(reg.lin)
#on the penultimate row of the obtained output we�ll see the R2 value
#_________________#
#5C. . we may obtain the value of the coefficient of determination (R2) observing the ANOVA output ? we use the anova command (analysis of variance)
anova(reg.lin)
 
SSR=77991
SST=(77991+91998)=169989
R2=77991/169989=0.4588
#__________________#
#6. CHECKING THE LINEAR REGRESSION ASSUMPTIONS #
We observe one plot for each assumption: 
a)	linearity between Y and X
plot(x,y)
abline(reg.lin)
b)	independence of the error terms from the explanatory variable
e=reg.lin$residuals
plot(x,e)
c)	constant variance for all levels of X
plot(x,e)
d)	normal distribution of the error terms
hist(e)
#please, comment each plot considering the basic assumptions#
#_______________#
 7. CHEKING THE INFERENCE ON THE RELATIONSHIP BETWEEN X AN Y 
#7. Making inference: test of the linear relationship between the variables within the reality (and not only in our sample)#
#We mail follow two ways:
#a.we compute the t-statistic and the critical value, thus we compare them
#b.we observe the p-value associated with the b1 coefficient within the summary of our regression analysis
#the system of  hypothesis: 
# H0 : B1= 0 (we exclude linear relation between Y and X within the population)
# H1 : B1 ? 0 (we don�t exclude the linear relationship between Y and X within the reality)
#if Tstat > v.c. --> we refuse H0 --> linear relationship between Y and X within the rality
#Tstat = (b1 - B1)/Sb1 --> where Sb1 = SSE^2/dev X
# let�s compute the elements using R: 
dev.disp=sum(reg.lin$residuals^2)           #residuals SSE
dev.disp    
x.difference=x-mean(x)                       #xi - x average #
dev.x=sum(x.difference^2)          #total sum of (xi - x average)#
dev.x         
b1=47.577
b1            #already computed to individuate the SLR coefficients
#thus, the T-statistics is: 
tstat= b1/sqrt(dev.disp/(32*dev.x))
tstat           # equals 5.208#
#let�s now calculate the t a/2  critical value, assuming a level of significance a equals to 0.05 (so a level of confidence = 1-0.05= 95% )#
#vc = ta/2 --> alfa = 0.01--> alfa/2 = 0.025; n-2 = 34-2=32#
#the command for the critical value  t sub a over 2 : qt(a/2, degree of freedom)
qt(0.025,32)      #il v.c. -2.037
#let�s compare the obtained results (in absolute value) for making a final decision: 
#t-stat > ta/2   5.208>2.037 --> we refuse H0 thus
#it exists an empirical evidence that at the 95% of confidence there is a linear relationship between number of sold cakes and unit�s cake prices#
#Please observe the summary to individuate the confidence level of our inference: 
summary(reg.lin)
## 7. BIS : INFERENCE ABOUT RHO
# #TOPIC:  the strength of the correlation between x and y within the population 
#we test the system of hypothesis based on the correlation coefficient RHO (?) #
#H0: RHO=0 (no correlation)
#H1: RHO dif. from 0 (correlazione �with given intensity)
#if t-statr > v.c. --> we reject H0 --> thus, it exists a correlation �with a given intensity � within the reality between x and y
#tstat.r=(r-RHO)/root.sq of ((1-r^2)/n-2))    c.v.:  a/2  and d.f = n-2
# we find the R^2 in summary
RQ= valore
r=sqrt(RQ)
r #there is a "moderate/strong" positive correlation
num=r #numerator of tstat.r
den=sqrt((1-(RQ))/32) # denominator of tstat.r
tstat.r=num/den
tstat.r 
vc.r=qt(0.025,32)
vc.r #-2.037
# (ABS VALUE!): tstat.r>C.V.r --> we reject H0 --> it exists a correlation �whit that given strenght- within the reality 
# significant correlation between  x and  y #
             
    