#Lab 11: Hierarchical Cluster Analysis


#SEB UNIFE 2019  -- Mini V.

###### Hierarchical Cluster Analysis #######

#step-by-step technique

#0. database creation/preparation
#1. checking variables and observations
#2. create and comment the plot
#3. observe the distance matrix
#4. normalize the data
#5. perform hierarchical clustering (using a dendrogram)
#6. compare two hierarchical structures
#7. characterize the clusters and check for outliers
#8. how to decide the "right" number of clusters?


######## 0. database creation and/or preparation
getwd()
#change the working directory with setwd() if needed

wh=read.csv2("wh.csv")
attach(wh)

######## 1. checking variables and observations
str(wh)
View(wh)

######### 2. create and comment the plot
pairs(x=wh,panel=panel.smooth)
#or you can use:
plot(peso~altezza,wh)
#we may add the label m for male and f for female
with(wh,text(peso~altezza,labels=sex))
#we may adjust the labels' position and the text size
with(wh,text(peso~altezza,labels=sex,pos=1,cex=.6))
#pos=1 places the label below the point (2 = to the left, 3 = above, 4 = to the right)
#cex=.6 sets the text size (the higher the number, the bigger the text)
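
#a small illustrative variant (same data, just different settings): pos=3 puts
#the labels above the points and a slightly larger cex makes them easier to read
plot(peso~altezza,wh)
with(wh,text(peso~altezza,labels=sex,pos=3,cex=.8))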

#please, comment on the scatter plot looking at the groups you may identify.

########## 3. observe the distance matrix (Euclidean distance)
#we want just quantitative variables

wh1=wh[,-1]    #drop the ID column
wh2=wh1[,-3]   #drop the remaining qualitative variable (sex)
str(wh2)
#all the variables are numeric

hw2.dist=dist(wh2)   #Euclidean distances on the raw (non-standardized) data
hw2.dist

######## 4. normalize the data
#after standardization the mean of each variable will be 0 and
#the standard deviation will be 1
#standardized value = (observed value - mean)/standard deviation

#you can compute all the elements using R: 
sd(peso)
mean(peso)

sd(altezza)
mean(altezza)

#and then compute each standardized value
#e.g. for the 1st student: Zpeso=(60-65.48)/12.94
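
#the same value can also be computed directly in R
#(a quick sketch: peso[1] is the weight of the first student,
#available here thanks to attach(wh) above)
(peso[1]-mean(peso))/sd(peso)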
 
#or we can use the command scale in R
#obviously we need to use only numeric variables

z=wh2

m=apply(z,2,mean)
s=apply(z,2,sd)
z=scale(z,m,s)
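
#note: scale() with its default arguments already centers each column on its
#mean and divides by its standard deviation, so the same matrix can be obtained
#in one line (z2 is just an illustrative name used for the comparison)
z2=scale(wh2)
all.equal(as.numeric(z),as.numeric(z2))   #should be TRUE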

#now we can calculate the distance
#we choose to calculate the Euclidean distance

distance=dist(z)
distance

#NOTE: by default R uses Euclidean method (without specification)
#if you want to use different distance method you should specify that: 

distance2=dist(z,method="manhattan")
distance3=dist(z,method="minkowski")

#etc.: to see all the available distance measures, type: help(dist)


#the matrix is very big, so let's make the output more compact
#using the print command:
print(distance,digits=1)

#this is the Euclidean distance between all the records in our dataset
#it shows how close or how far they are from one another

#for instance, 4.803 is quite a large distance between the 53rd and the 47th students:
#it means that those students are very dissimilar in terms of those variables,
#because the distance is quite large

#on the other hand, 0.129 tells us that the 58th and 46th students are
#very similar in terms of the observed variables (height and weight)
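
#to check a single pairwise distance without scanning the whole printout,
#one option is to convert the dist object into a square matrix and index it
#by the row/column numbers of the two students (a small sketch):
round(as.matrix(distance)[53,47],3)
round(as.matrix(distance)[58,46],3)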

########### 5. Performing hierarchical clustering (using a dendrogram)

#we decide to use complete linkage
#the command to perform a hierarchical cluster is: hclust
#the argument of that command is the distance matrix

######## 5.A Performing hierarchical clustering using complete linkage
#we create an object called hc.c, in which .c denotes complete linkage
#by default R uses the complete linkage method

hc.c=hclust(distance)


#to create a cluster dendrogram we use the plot command
#the DENDROGRAM:
#- is the main tool for looking at a hierarchical cluster solution
#- is a tree structure used to visualize the results of a cluster calculation
#- on the x-axis: the objects which are clustered
#- on the y-axis: the distance at which each cluster is formed

plot(hc.c)
hc.c

#initially each student is treated as a single cluster
#and then, using distance, we group those students who are closest
#for instance, the cases 21 and 47
#the process continues until all the students are merged into a single cluster

#if we want to visualize the statistical units with their label (in our case, the ID):

plot(hc.c,labels=wh$ID)

#(note: the hclust object itself does not store the ID column;
#it only keeps the row names in hc.c$labels)


#we may want to align all the leaves at the same level (hang=-1 makes the labels hang down from height 0)
plot(hc.c,hang=-1)
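
#the two options can also be combined (a short sketch reusing the same objects):
plot(hc.c,labels=wh$ID,hang=-1,main="Complete linkage")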

#######5.B Performing hierarchical clustering using average linkage

hc.a=hclust(distance, method="average")
hc.a

plot(hc.a)

plot(hc.a,labels=wh$ID)

plot(hc.a,hang=-1)

#as you can see the clusters are slightly different

#to see all the possible methods please check: help(hclust)
#for instance, the ward.D method tends to produce clusters
#of fairly equal size and can be useful when other methods
#find clusters containing just a few observations
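
#a brief sketch of what that alternative would look like
#(same distance matrix, only the linkage method changes):
hc.w=hclust(distance,method="ward.D")
plot(hc.w,labels=wh$ID,hang=-1)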

 

########6. compare two hierarchical structures

#let's compare those clusterings by looking at their members

member.c=cutree(hc.c,4)
member.a=cutree(hc.a,4)
table(member.c,member.a)

#observe and comment the results:
#using hc.a there are 1+11+10 = 22 students belonging to cluster 1;
#2+19 = 21 students belonging to cluster 2; 12 students belonging to cluster 3 and
#3 students belonging to cluster 4
#On the other hand, using hc.c:
#11 students in cluster 1; 3+19 = 22 in cluster 2; 12+3 = 15 in cluster 3
#and 10 students in cluster 4

#there is a sort of mismatch:
#e.g. 11 students are placed in cluster 1 by both methods, but hc.a assigns
#11 more students (1+10) to that same cluster, which hc.c puts elsewhere (a big difference)

#this table is useful to compare the two methods and the different
#cluster solutions they arrive at
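
#to see exactly which students fall into a given combination of clusters,
#we can index the ID column with both membership vectors
#(an illustrative check, e.g. for the students placed in cluster 1 by both methods):
wh$ID[member.c==1 & member.a==1]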


#######7. characterize the clusters and check for outliers

#we can calculate the average value of each variable within each cluster:
#this helps us to characterize each cluster

aggregate(z,list(member.c),mean)

#observe and comment the results
#the larger the differences among the cluster means, the more important
#that variable is in creating the clusters

#in our example both weight and height seem to have an impact in defining cluster membership
#it seems that students belonging to cluster 4 have low height and low weight
#students belonging to cluster 3 have very high height and weight
#etc.

#to make this interpretation easier, we may create that table using
#original values: 

aggregate(wh2,list(member.c),mean)

#the interpretation becomes easier because we can see the "real"
#cluster averages, and not the standardized ones
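
#another way to characterize the clusters is to cross the membership vector
#with the qualitative variable we dropped earlier
#(assuming the column is called sex, as used for the scatter plot labels above):
table(member.c,wh$sex)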

#we may interpret the clusters using a graphical representation:
#the silhouette plot (the silhouette function comes from the cluster package)

library(cluster)
plot(silhouette(cutree(hc.c,4),distance))

#the silhouette width is high: this is a good sign
#it means that all the members belong to the "right" cluster
#we don't have negative values, so we don't have outliers
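
#a small additional check: summary() on the silhouette object reports the
#average silhouette width per cluster and overall, which is handy when the
#plot is hard to read
sil=silhouette(cutree(hc.c,4),distance)
summary(sil)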

#######8. how to decide the right number of clusters?

#actually there is no unique/proper solution: 
#1. we should observe and try to interpret the dendrogram and
#2. we need to perform a scree plot

#######8.1. to interpret the dendrogram we may use the rect.hclust command
plot(hc.c)
rect.hclust(hc.c,h=2)   #draw boxes around the clusters obtained by cutting the tree at height 2
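
#a variant worth trying: instead of cutting at a given height h, rect.hclust
#can draw the rectangles for a chosen number of clusters k
plot(hc.c)
rect.hclust(hc.c,k=4,border="red")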

#######8.2. to interpret our solution we may use a scree plot
#the scree plot requires calculating the within-groups sum of squares,
#which is called wss

wss=(nrow(z)-1)*sum(apply(z,2,var))                        #wss for a single cluster
for(i in 2:20)wss[i]<-sum(kmeans(z,centers=i)$withinss)    #wss for 2 to 20 clusters
plot(1:20,wss,type="b",xlab="number of clusters",ylab="within groups SS")

#"within groups SS or wss" means within cluster variability
#using that plot we can decide the number of cluster to retain
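
#note: kmeans starts from random centers, so the scree plot can change slightly
#between runs; one possible refinement is to fix the random seed and use
#several random starts (nstart is a standard kmeans argument):
set.seed(123)
wss=(nrow(z)-1)*sum(apply(z,2,var))
for(i in 2:20)wss[i]<-sum(kmeans(z,centers=i,nstart=25)$withinss)
plot(1:20,wss,type="b",xlab="number of clusters",ylab="within groups SS")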


######9. see cluster membership

#after our dendrogram interpretation and the scree plot,
#we decide to keep 3 clusters
#now we want to know which students belong to which cluster

cutree(hc.c,3)

#to make the interpretation easier: 

cbind(wh$ID,cutree(hc.c,3))

#first column = student ID; second column = cluster membership
#e.g. the first student belongs to the first cluster;
#e.g. the 10th student belongs to the 2nd cluster;
#e.g. the 24th student belongs to the 2nd cluster
#....

#nb: you obtain the same result using the data.frame command:

data.frame(wh$ID,cutree(hc.c,3))

#this command is useful when you have real names (and not just IDs)

#if we want to know how many students belong to each cluster:

table(cutree(hc.c,3))

#i.e.: 21 in cluster 1; 22 in cluster 2 and 15 in cluster 3. 
# the groups are quite well balanced
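
#finally, the characterization seen in step 7 can be repeated for the
#3-cluster solution we decided to keep (a short sketch; member3 is just an
#illustrative name):
member3=cutree(hc.c,3)
aggregate(wh2,list(member3),mean)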