lab 4_LRM

Simple Linear Regression Model using R
UNIFE 
Spring Semester
Mini V. 20-02-2019

RESEARCH QUESTION:
Is there a linear causal relationship between the number of cakes a firm sells in a week and the unit price (the price charged per cake)?


Let's examine a given dataset and perform a simple linear regression analysis.

#Analysis: step by step
0. LET'S PREPARE THE DATASET
1. Visualize the relationship: the scatter plot
2. Identify the estimated model
3. The model on a graph
4. Prediction: the expected Y value given an X value
5. The model's goodness of fit
6. Graphical analysis of the Linear Regression Model's assumptions
7. What about the inference? #











#0. LET'S PREPARE THE DATASET
#we load an external dataset

#A) CHECK THE WORKING DIRECTORY
getwd()
#B) CHANGE THE CURRENT DIRECTORY (IF NECESSARY) FROM THE MENU BAR
#C) CHECK THE WORKING DIRECTORY AGAIN
getwd()
#D) WE IMPORT THE DATASET OF INTEREST
cake<-read.csv2("cake_reg lin.csv")
#E) CHECK THE LOADED DATASET
View(cake)
#F) CHECK THE DATASET STRUCTURE
str(cake) #this command shows the structure and characteristics of the data
head(cake) #this command shows the first six rows of our dataset
#G) BECAUSE WE ARE INTERESTED IN ONLY TWO VARIABLES (UNITS AND PRICE), WE EXCLUDE THE FIRST COLUMN
cake=cake[,-1]
#H) ...TO BE SURE THE DATASET IS AVAILABLE WITHIN R FOR THE NEXT ANALYSIS
#(the steps below refer to the variables as x, the unit price, and y, the cakes sold per week, so the columns are assumed to have those names)
attach(cake)




#1. Graphical observation of the data#

plot(x,y)


#What can we say about the relationship between these two variables?#

#2. We may identify the model using two different strategies: 
#a)	Following all the steps seen in theory
#b)	Using the lm function in R#

#2A: Let's follow the steps we've seen in theory#

x.difference=x-mean(x)                       #xi - x average #
x.difference
y.difference=y-mean(y)                       #yi - y average #
y.difference

dev.x=sum(x.difference^2)          #sum of squared (xi - x average): SSX#
dev.x
dev.y=sum(y.difference^2)          #sum of squared (yi - y average): SSY#
dev.y

#let's compute the sum of the products of the x and y differences#

codev.xy=sum(x.difference*y.difference)
codev.xy

#now we have all the elements to compute the coefficients of our model#

b1=codev.xy/dev.x           #SSYX/SSX #
b1

b0=mean(y)-mean(x)*b1                  # average y -b1*average x#
b0

#using this information we can write out the equation of our estimated model #
#estimated y = b0 + b1 * xi #

#we may predict the number of cakes sold weekly for a given unit price #
#before making any prediction, it's important to identify the X range, given by the minimum and maximum values that X takes in our dataset; we have two different possibilities:
max(x)
min(x)
range(x)

#let's now make a prediction using the model: prediction=b0+b1*x
#How many cakes do we expect to sell in a week in which the unit price is $5.3?#
prediction5.3=b0+b1*5.3
prediction5.3

#please, interpret the obtained result: when the unit price is $5.3, in that week we expect to sell ... cakes#
#How many cakes do we expect to sell in a week in which the unit price is $7.2?#

prediction7.2=b0+b1*7.2
prediction7.2
#please, interpret the obtained result#
#--------------#
#2B. Let's compute the Simple Linear Regression Model using the R function lm()#

#the function is  lm(dependent variable (Y)~explanatory variable (X))#
#how to type the tilde (~) on your keyboard? On Windows: Alt+126 (using the numeric keypad on the right side)#

reg.lin=lm(y~x)

#the result is an object in R: we may display the fitted linear regression simply by calling the object's name#

reg.lin

#when we want to view a specific component of our analysis
#we put the dollar symbol ($) between the model's name and
#the name of the component we are interested in#
# i.e.  regression$specification
#for instance we may want to view the coefficients of our model #
reg.lin$coefficients

#we have now identified the equation of our estimated model #
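
#A minimal sketch to check that strategies 2A and 2B give the same coefficients.
#The price and sold vectors are made-up illustrative numbers, NOT the cake dataset,
#and the names b0.manual, b1.manual and fit are chosen only for this example.

```r
# Made-up illustrative data (assumption: not the values in the cake dataset)
price <- c(4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0)   # unit price (X)
sold  <- c(420, 380, 350, 300, 310, 260, 240, 200)   # cakes sold per week (Y)

# Strategy 2A: coefficients from the sums of squares and cross-products
b1.manual <- sum((price - mean(price)) * (sold - mean(sold))) / sum((price - mean(price))^2)
b0.manual <- mean(sold) - b1.manual * mean(price)

# Strategy 2B: coefficients from lm()
fit <- lm(sold ~ price)

# TRUE when the two strategies agree up to floating-point tolerance
same <- isTRUE(all.equal(unname(coef(fit)), c(b0.manual, b1.manual)))
same
```

#coef(fit) returns the intercept first and the slope second, which is why b0.manual comes before b1.manual in the comparison#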
#__________________#

#3. Plot of our linear model#

plot(x,y)    #pairs of coordinates 

lines(x,y)   #line which links all the coordinates

abline(reg.lin)        #graphical representation of the regression line

#__________________#

# 4. Prediction: the expected Y value given an X value
# (already seen in step 2A)
#If, in a given week, the company we are working for decides to set the unit price of a cake at $6.8, how many cakes do we expect to sell in that week?#

prev6.8=b0+b1*6.8
prev6.8
#comment on the results: how many cakes should the company prepare for that week?#
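
#A possible alternative to the hand computation: a sketch using predict() on an lm object,
#after checking that the new price lies inside the observed X range.
#The data are made-up illustrative numbers, NOT the cake dataset; new.price, inside, pred and by.hand are names chosen only for this example.

```r
# Made-up illustrative data (assumption: not the values in the cake dataset)
price <- c(4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0)   # unit price (X)
sold  <- c(420, 380, 350, 300, 310, 260, 240, 200)   # cakes sold per week (Y)
fit <- lm(sold ~ price)

new.price <- 6.8

# Is the new price inside the range of prices we actually observed?
inside <- new.price >= min(price) && new.price <= max(price)
inside

# predict() needs a data frame whose column name matches the explanatory variable
pred <- predict(fit, newdata = data.frame(price = new.price))
pred

# The same prediction from the estimated equation b0 + b1 * x
by.hand <- coef(fit)[1] + coef(fit)[2] * new.price
by.hand
```

#a prediction for a price outside the observed range is an extrapolation: the estimated line has not been checked against data there#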
#_________________#

#5. The model's goodness of fit, or the coefficient of determination (R2) 
# how much of the total variation in Y is explained by our simple regression model?
#three ways to obtain R2:
#a)	Computing SSR/SST
#b)	Checking the regression model's output
#c)	Checking the ANOVA table

#5A. let's compute R2=SSR/SST#

dev.tot=sum((y-mean(y))^2)                      #total sum of squares SST
dev.disp=sum(reg.lin$residuals^2)           #residual (error) sum of squares SSE
dev.reg=dev.tot-dev.disp                            #regression sum of squares SSR

RQ=dev.reg/dev.tot
RQ

#how can we interpret the result?
#does the model we've estimated explain much of the variation in Y?
#Is it a good model or not? 
#how much of the variation in Y is not explained by the model? In other words, how much unexplained variation in Y remains (variation due to other factors, or not captured by the linear relationship)?
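
#A minimal sketch of the decomposition SST = SSR + SSE and of the explained and unexplained shares of variation.
#The data are made-up illustrative numbers, NOT the cake dataset; in simple linear regression R2 also equals the squared correlation between X and Y, which the last comparison checks.

```r
# Made-up illustrative data (assumption: not the values in the cake dataset)
price <- c(4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0)   # unit price (X)
sold  <- c(420, 380, 350, 300, 310, 260, 240, 200)   # cakes sold per week (Y)
fit <- lm(sold ~ price)

SST <- sum((sold - mean(sold))^2)   # total sum of squares
SSE <- sum(residuals(fit)^2)        # residual (error) sum of squares
SSR <- SST - SSE                    # regression sum of squares

R2 <- SSR / SST                     # share of variation explained by the model
unexplained <- SSE / SST            # share of variation left unexplained (1 - R2)

# Cross-checks: the value stored by summary() and the squared correlation
R2.summary <- summary(fit)$r.squared
R2.cor <- cor(price, sold)^2

c(R2 = R2, unexplained = unexplained)
```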

#------------#
#5B. we may obtain the value of the coefficient of determination (R2) from the summary of our regression model: we use the command summary()

summary(reg.lin)
#on the penultimate row of the output we'll see the R2 value (Multiple R-squared)
#_________________#
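#5C. we may obtain the value of the coefficient of determination (R2) from the ANOVA output: we use the anova command (analysis of variance)
anova(reg.lin)
 
#we copy the sums of squares from the "Sum Sq" column of the ANOVA table
SSR=77991                #regression sum of squares
SSE=91998                #residual sum of squares
SST=SSR+SSE              #169989
R2=SSR/SST               #0.4588
R2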
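
#Instead of copying the numbers by hand, the sums of squares can be read directly from the ANOVA table.
#A sketch on made-up illustrative data, NOT the cake dataset; the row name "price" comes from the variable name used in this example's formula.

```r
# Made-up illustrative data (assumption: not the values in the cake dataset)
price <- c(4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0)   # unit price (X)
sold  <- c(420, 380, 350, 300, 310, 260, 240, 200)   # cakes sold per week (Y)
fit <- lm(sold ~ price)

tab <- anova(fit)                     # ANOVA table: one row per term plus "Residuals"

SSR <- tab["price", "Sum Sq"]         # regression sum of squares
SSE <- tab["Residuals", "Sum Sq"]     # residual sum of squares
SST <- SSR + SSE                      # total sum of squares

R2 <- SSR / SST
R2
```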
#__________________#

#6. CHECKING THE LINEAR REGRESSION ASSUMPTIONS #
#We observe one plot for each assumption: 
#a)	linearity between Y and X
plot(x,y)
abline(reg.lin)
#b)	independence of the error terms from the explanatory variable
e=reg.lin$residuals
plot(x,e)
#c)	constant variance for all levels of X
plot(x,e)
#d)	normal distribution of the error terms
hist(e)
#please, comment on each plot with respect to the basic assumptions#
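
#Two further plots that are often used for the same checks, offered here as an optional sketch and not as part of the lab's required steps:
#residuals against fitted values (for b and c) and a normal Q-Q plot (for d).
#The data are made-up illustrative numbers, NOT the cake dataset.

```r
# Made-up illustrative data (assumption: not the values in the cake dataset)
price <- c(4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0)   # unit price (X)
sold  <- c(420, 380, 350, 300, 310, 260, 240, 200)   # cakes sold per week (Y)
fit <- lm(sold ~ price)

e <- residuals(fit)          # estimated error terms
fitted.y <- fitted(fit)      # values of Y predicted by the model

# Residuals against fitted values: look for no pattern and a constant spread around zero
plot(fitted.y, e)
abline(h = 0, lty = 2)       # dashed horizontal reference line at zero

# Normal Q-Q plot: points close to the line suggest approximately normal residuals
qqnorm(e)
qqline(e)
```

#with a sample this small the plots can only give a rough indication#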
#_______________#