Introduction
An important aim of many analytic techniques is to reduce the number of observations being dealt with. Reducing observations does not mean omitting important observations from the data set. Here, ‘reducing observations’ means clubbing observations into groups, because groups are easier to deal with than individual observations.
This reduction of observations into groups is also known as segmentation. A well-known segmentation technique is to form clusters and analyse them in order to understand the nature of the different segments. Such an analysis is known as Cluster Analysis.
Objective of Cluster Analysis
The basic aim of cluster analysis is to reduce the number of observations. Although the function that cluster analysis performs is very similar to that of Factor Analysis, the two differ in what they reduce: Factor Analysis reduces the number of variables in a regression model by clubbing the correlated explanatory variables into factors, while cluster analysis reduces the number of observations by forming groups. Cluster analysis is useful for segmentation of markets.
Let us consider the following example:
Suppose Airtel is trying to launch a new talk-time plan aimed at making ISD calls cheaper. The segment of the market that they should focus on is then the professionals who frequently travel abroad for onsite projects, so our aim is to club such observations together. The objective of Cluster Analysis is thus to group nearly identical observations together.
Details of the case study
The Indian Premier League (IPL) is a professional Twenty20 cricket league in India. It was initiated by the Board of Control for Cricket in India (BCCI), headquartered in Mumbai, Maharashtra. It is a franchise-based tournament.
There are 5 ways in which a franchisee can acquire a player:
(i) In the annual auction
(ii) By signing domestic players
(iii) By signing uncapped players
(iv) Through trading
(v) By signing replacements
The game is mainly dominated by the batsmen, so the valuation of a batsman depends on the cluster he belongs to. At the same time, not all performance indicators are of equal importance. Given this, the governing body wants to find out how many clusters there should be and which player should belong to which cluster.
Description of the Variables
| Variable | Description |
|---|---|
| Player | List of International players representing their respective countries |
| Mat | No. of matches they played Internationally |
| Inns | Total No. of innings they have batted |
| Not_Outs | No. of times they remained Not Out |
| Runs | Total runs scored in all innings combined |
| HS | Best individual score in an innings |
| Ave | No. of runs scored divided by the number of times they got out |
| BF | No. of Balls faced in all innings combined |
| SR | Strike Rate is no. of runs scored per Hundred balls faced |
| hundreds | Total no. of centuries in entire cricketing career |
| fifties | Total no. of half centuries in entire cricketing career |
| Ducks | No. of times they got out without scoring |
| fours | No. of boundaries hit in lifetime |
| sixes | No. of sixes hit lifetime |
Objectives of the case study
- The basic aim of cluster analysis is to reduce the number of observations.
- Although cluster analysis performs a function very similar to that of Factor Analysis, Factor Analysis reduces the number of variables in a regression model by clubbing the correlated explanatory variables into factors, while cluster analysis reduces the number of observations by forming groups.
- Cluster analysis is useful for segmentation of markets.
- In this case study, the aim is to find out how many clusters of batsmen there should be and which player belongs to which cluster.
Steps Involved in the Case Study:
1. Importing the dataset
2. Standardizing the dataset
3. Applying distance formula
4. Constructing the Dendrogram
5. Grouping into respective Clusters
6. Segmentation of Players
Importing the dataset
Using read.csv(), we can import the dataset cluster_ipl from the specified location. We set header = TRUE so that the first row of the file is read as the variable names when importing the data into R.
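The import step can be sketched as follows. The original file location is not given, so this example writes a tiny, made-up CSV to a temporary file and reads it back; in the case study the path would instead point to the cluster_ipl file. The column values and the name csv_path are purely illustrative.

```r
# Hypothetical stand-in for the cluster_ipl file: a tiny CSV in a temp location.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Player,Mat,Runs",
  "Player A,10,350",
  "Player B,12,410"
), csv_path)

# header = TRUE makes read.csv() treat the first row as the column names.
ipl <- read.csv(csv_path, header = TRUE)
head(ipl)
```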
Checking for the structure of the dataset
str() shows the characteristics of each variable in the dataset, such as its type and its first few values.
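As a small illustration, here is str() applied to a made-up data frame (ipl_toy is a hypothetical stand-in for the real dataset, with only three of its columns):

```r
# Hypothetical three-row stand-in for the real dataset.
ipl_toy <- data.frame(
  Player = c("Player A", "Player B", "Player C"),
  Mat    = c(10L, 12L, 8L),
  Ave    = c(35.2, 41.0, 28.7),
  stringsAsFactors = FALSE
)

# str() lists each column with its type and the first few values.
str(ipl_toy)
```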
Standardizing the dataset
In cluster analysis, the idea is to club together related observations. In order to club together homogeneous observations, we need some sort of composite weight. This weight is missing in the given data set, since the different variables are in different units. It makes no sense to add up ‘runs scored’ and the number of ‘not outs’ or the ‘number of sixes hit’.
So, the first step is to standardize the entire data set, so that all the variables become comparable.
iplstandard <- scale(ipl[,2:14])
scale() is a generic function in R whose default method centres and/or scales the columns of a numeric matrix. Put simply, with its default arguments it standardizes each column to have mean = 0 and standard deviation = 1.
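A quick way to see what standardization does is to check the column means and standard deviations afterwards. The sketch below uses a small made-up data frame (toy) rather than the real dataset:

```r
# Made-up values in very different units, as in the case study.
toy <- data.frame(
  Runs     = c(350, 410, 120, 600),
  Not_Outs = c(2, 5, 1, 3),
  sixes    = c(10, 25, 4, 30)
)

# Default scale(): subtract each column's mean, divide by its standard deviation.
toy_std <- scale(toy)

# After standardizing, every column has mean 0 and standard deviation 1.
round(colMeans(toy_std), 10)
apply(toy_std, 2, sd)
```

Because every column is now on the same unit-free scale, no single variable dominates the distance calculation merely because its raw numbers are larger.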
Applying distance formula
dist() calculates the distances between the observations; by default it uses the Euclidean distance formula.
To get brief summary statistics of the calculated distances, we use summary():
summary(distmat)
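The call that creates distmat is not shown above. A minimal sketch, assuming it was built by applying dist() to the standardized data, looks like this (the toy values are made up):

```r
# Made-up data standing in for the standardized IPL variables.
toy <- data.frame(
  Runs = c(350, 410, 120, 600),
  SR   = c(120, 135, 98, 150)
)
toy_std <- scale(toy)

# Assumption: distmat holds pairwise Euclidean distances between observations.
distmat <- dist(toy_std, method = "euclidean")

# Brief statistics (minimum, quartiles, mean, maximum) of all pairwise distances.
summary(distmat)

# The same distance for observations 1 and 2, computed by hand.
d12 <- sqrt(sum((toy_std[1, ] - toy_std[2, ])^2))
```

With n observations, dist() stores n(n-1)/2 distances, one for each pair.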
The hclust() function takes the distance matrix as input and forms the clusters using the hierarchical clustering technique.
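The text does not say which linkage method was used, so the sketch below assumes hclust()'s default, complete linkage, and runs it on four made-up points:

```r
# Four made-up points: A and B are close together, as are C and D.
toy <- matrix(
  c(1.0, 1.0,
    1.2, 0.9,
    5.0, 5.0,
    5.1, 4.8),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("A", "B", "C", "D"), c("x", "y"))
)

# Assumption: complete linkage (the hclust() default).
hc <- hclust(dist(toy), method = "complete")

hc$merge   # which observations or clusters were joined at each step
hc$height  # the distance at which each join happened
```

With n observations there are n - 1 merge steps, and the last step joins everything into a single cluster.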
Constructing the Dendrogram
The top portion of the dendrogram gives us an idea of how many significant clusters are formed at each level. The height at which clusters merge reflects the homogeneity attained: the lower the height of a merge, the more similar the observations being merged. From the dendrogram we consider approximately 4 significant clusters. Too many clusters can be cumbersome to examine, whereas too few clusters can leave heterogeneous observations within one cluster.
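Drawing the dendrogram and marking a chosen number of clusters can be sketched as follows. The data are made up (eight points in four obvious pairs), so the choice of 4 here only mirrors the number used in the case study:

```r
# Made-up data: eight points forming four well-separated pairs.
toy <- matrix(
  c(0.0, 0.0,   0.2, 0.1,
    5.0, 0.0,   5.1, 0.3,
    0.0, 5.0,   0.3, 5.2,
    5.0, 5.0,   5.2, 4.9),
  ncol = 2, byrow = TRUE,
  dimnames = list(paste("Player", LETTERS[1:8]), c("x", "y"))
)
hc <- hclust(dist(toy))

# Draw the tree, then outline the branches that make up 4 clusters.
plot(hc, main = "Cluster Dendrogram (toy data)", xlab = "", sub = "")
rect.hclust(hc, k = 4, border = "red")
```

A long vertical gap between successive merge heights is a common visual cue for where to cut the tree.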
Grouping into respective clusters
The cutree() function cuts a tree, such as one resulting from hclust(), into several groups, either by specifying the desired number(s) of groups or the cut height(s).
It assigns a cluster number to each observation in the dataset as per the hierarchical clustering method.
To examine the groups separately, we subset the main dataset by these cluster numbers to get the players along with their respective variable values.
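A minimal sketch of this step on made-up data is shown below. The column name ClusterNum follows the text; ipl_toy, cluster1 and cluster2 are hypothetical names, and k = 2 is used only because the toy data have two obvious groups:

```r
# Made-up players: three modest scorers and three heavy scorers.
ipl_toy <- data.frame(
  Player = paste("Player", LETTERS[1:6]),
  Runs   = c(100, 120, 110, 900, 950, 920),
  SR     = c(90, 95, 92, 140, 150, 145)
)

# Standardize the numeric columns, compute distances, build the tree.
hc <- hclust(dist(scale(ipl_toy[, 2:3])))

# Cut the tree into 2 groups and store the cluster number beside each player.
ipl_toy$ClusterNum <- cutree(hc, k = 2)

# One subset of the main dataset per cluster.
cluster1 <- subset(ipl_toy, ClusterNum == 1)
cluster2 <- subset(ipl_toy, ClusterNum == 2)
cluster1
cluster2
```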
Segmentation of Players
Each cluster consists of its respective players and their information, as per the segmentation recorded under ClusterNum.
To understand which cluster is superior to the others in terms of batsmen, we need to obtain the descriptive statistics of each cluster.
By looking at the mean values of the batting-related variables, we can judge whether one cluster is superior to the others in terms of batsmen.
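One possible way to compare clusters is to compute the mean of each batting variable per cluster with aggregate(). The data below are made up, and the cluster numbers are assigned by hand purely for illustration:

```r
# Made-up players with hand-assigned cluster numbers.
ipl_toy <- data.frame(
  Player     = paste("Player", LETTERS[1:6]),
  Runs       = c(100, 120, 110, 900, 950, 920),
  Ave        = c(15, 18, 16, 42, 45, 44),
  SR         = c(90, 95, 92, 140, 150, 145),
  ClusterNum = c(1, 1, 1, 2, 2, 2)
)

# Mean of each batting variable within each cluster.
cluster_means <- aggregate(cbind(Runs, Ave, SR) ~ ClusterNum,
                           data = ipl_toy, FUN = mean)
cluster_means
```

In this toy example cluster 2 has higher mean runs, average and strike rate, so it would be read as the stronger batting segment.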