Cluster analysis is a statistical technique used to identify how various units (people, groups, or societies) can be grouped together because of characteristics they have in common. It is an exploratory data analysis tool that aims to sort objects into groups so that objects in the same group have a maximal degree of association and objects in different groups have a minimal degree of association. Unlike some other statistical techniques, cluster analysis does not explain or interpret the structures it uncovers: it discovers structure in the data without explaining why that structure exists.
Clustering exists in almost every aspect of our daily lives. Take, for example, items in a grocery store. Items of the same type are displayed in the same or nearby locations: meat, vegetables, soda, cereal, paper products, and so on. Researchers often want to do the same with data and group objects into clusters that make sense.
In another example, let’s say we are looking at countries and want to group them into clusters based on characteristics such as division of labor, militaries, technology, educated population, etc. We would find that Britain, Japan, France, Germany, and the United States have similar characteristics and would be clustered together. China, Uganda, and Nicaragua would also be grouped together in a different cluster because they share a different set of characteristics, including low levels of wealth, simpler divisions of labor, relatively unstable and undemocratic political institutions, and low technological development.
Cluster analysis is typically used in the exploratory phase of research, when the researcher does not have any preconceived hypotheses. It is usually not the only statistical method used; rather, it is carried out early in a project to help guide the rest of the analysis. For this reason, significance testing is usually neither relevant nor appropriate.
There are several different types of cluster analysis. The two most commonly used are K-means clustering and hierarchical clustering.
K-means clustering treats the observations in the data as objects having locations and distances from each other (note that the distances used in clustering often do not represent spatial distances). It partitions the objects into K mutually exclusive clusters so that objects within each cluster are as close to each other as possible and, at the same time, as far from objects in other clusters as possible. Each cluster is then characterized by its mean, or center point.
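The partitioning described above can be illustrated with a minimal sketch in Python. It assumes Euclidean distance and a simple deterministic "farthest point" choice of starting centers; both are illustrative assumptions, and statistics packages typically use more refined initialization. The function name kmeans and the sample data are hypothetical.

```python
import math


def kmeans(points, k, iterations=100):
    """Alternate an assignment step and a mean-update step until stable."""
    # Assumed initialization: take the first point, then repeatedly add
    # the point farthest from the centers chosen so far.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(
            points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = None
    for _ in range(iterations):
        # Assignment step: each object joins the cluster with the nearest center.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each center becomes the mean of its cluster's members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centers


# Hypothetical data: two visibly separated groups in two dimensions.
data = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
        (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels, centers = kmeans(data, k=2)
print(labels)   # cluster index for each observation
print(centers)  # the mean (center point) of each cluster
```

Each observation receives exactly one label, which reflects the mutually exclusive nature of the K clusters, and each returned center is the mean of the observations assigned to it.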
Hierarchical clustering is a way to investigate groupings in the data simultaneously over a variety of scales and distances. It does this by creating a cluster tree with various levels. Unlike K-means clustering, the tree is not a single set of clusters. Rather, the tree is a multi-level hierarchy where clusters at one level are joined as clusters at the next higher level. The algorithm that is used starts with each case or variable in a separate cluster and then combines clusters until only one is left. This allows the researcher to decide what level of clustering is most appropriate for his or her research.
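The bottom-up procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: it assumes Euclidean distance and single linkage (the distance between two clusters is the distance between their closest members), which is only one of several possible linkage rules. The function names and the sample data are hypothetical.

```python
import math
from itertools import combinations


def single_linkage(a, b, points):
    """Distance between two clusters: that of their closest pair of members."""
    return min(math.dist(points[i], points[j]) for i in a for j in b)


def agglomerate(points):
    """Merge the two closest clusters repeatedly until only one is left.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples,
    which is the information a cluster tree displays.
    """
    clusters = [(i,) for i in range(len(points))]  # each case starts alone
    history = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: single_linkage(pair[0], pair[1], points))
        history.append((a, b, single_linkage(a, b, points)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return history


def cut(points, history, n_clusters):
    """Replay the merges until n_clusters remain: one level of the tree."""
    clusters = [(i,) for i in range(len(points))]
    for a, b, _ in history:
        if len(clusters) <= n_clusters:
            break
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return sorted(tuple(sorted(c)) for c in clusters)


# Hypothetical one-dimensional data with structure at more than one scale.
data = [(0.0,), (1.0,), (5.0,), (6.5,), (20.0,)]
history = agglomerate(data)
for a, b, distance in history:
    print(a, b, distance)
print(cut(data, history, 3))  # a finer level of the hierarchy
print(cut(data, history, 2))  # a coarser level of the same hierarchy
```

Because the full merge history is kept, the same tree can be cut at different levels after the fact, which is how the researcher chooses the level of clustering that suits the research question.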
Performing A Cluster Analysis
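Most statistics software programs can perform cluster analysis. In SPSS, select Analyze from the menu, then Classify, and then one of the clustering procedures listed there (such as K-Means Cluster or Hierarchical Cluster). In SAS, the PROC CLUSTER procedure performs hierarchical clustering.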
Mathworks. (Accessed January 2012). Statistics Toolbox: Cluster Analysis. http://www.mathworks.com/products/statistics/demos.html?file=/products/demos/shipping/stats/clusterdemo.html
StatSoft, Inc. (2011). Electronic Statistics Textbook. Tulsa, OK: StatSoft. WEB: http://www.statsoft.com/textbook/.