Help for CLUSTEST

PURPOSE

    CLUSTEST determines how well the clusters in a statistics dataset are
separated by computing the Mahalanobis distance between pairs of clusters.
It can also help determine whether the clusters are statistically
significant.
    In another mode of operation CLUSTEST can compare two statistics datasets
to determine how far apart the clusters of the different datasets are.  This
is useful for finding out if two different datasets are essentially the same.



EXECUTION

CLUSTEST STAT.SDS LEVEL=4 'CONFID

or

CLUSTEST (STAT1.SDS,STAT2.SDS)  LEVEL=3 'MAHA

    where the input files are statistical datasets of the type produced by
STATS, USTATS, and CLUSAN.  LEVEL is a parameter that specifies the amount
of printout desired; LEVEL=4 prints everything.  If two input datasets
are specified then the comparison mode of operation goes into effect.  The
keyword ('MAHA or 'CONFID) specifies whether the Mahalanobis distance or the
corresponding fractional confidence level is output.  'CONFID is the default.




OPERATION

    For each cluster the statistical dataset contains the following
information:
1.  The class name (8 characters) which is not used in this program.
2.  The number of objects in the cluster.
3.  The mean or centroid of the cluster in each band.
4.  The covariance matrix of the cluster which contains information about the
	size, shape, and orientation of the cluster.
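
    As an illustration only, the four items above can be held in a small
in-memory record.  The Python sketch below is hypothetical: the name
ClusterStats and its field names are assumptions made for this example, and
it does not describe the on-disk layout of a statistics dataset.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClusterStats:
    """Hypothetical in-memory record for one cluster (not the file layout)."""
    name: str                      # class name (8 characters); unused by CLUSTEST
    count: int                     # number of objects in the cluster
    mean: List[float]              # centroid, one value per band
    covariance: List[List[float]]  # nbands x nbands covariance matrix

    def __post_init__(self):
        # The covariance matrix must be square and match the number of bands.
        nbands = len(self.mean)
        if len(self.covariance) != nbands or any(
            len(row) != nbands for row in self.covariance
        ):
            raise ValueError("covariance must be nbands x nbands")


# Made-up two-band cluster used purely as an example.
example = ClusterStats("WATER", 120, [10.0, 20.0], [[4.0, 1.0], [1.0, 9.0]])
```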


    For each pair of clusters CLUSTEST calculates the Mahalanobis distance
between the clusters.  The Mahalanobis distance is the distance between the
means of the clusters divided by the average of the sizes of the clusters.
Thus a large Maha distance indicates that the clusters are well separated
relative to their size.  If the clusters are up against each other (but not
intermingled) the Maha distance will be about 3 to 4.  If the Maha distance
is greater than about 4.5 then the clusters are quite distinct.  The mathe-
matical formula used is:

	MahaDist = Sqrt[ (Mu2-Mu1) *  S**-1  * (Mu2-Mu1) ]

	where Mu1 and Mu2 are the mean vectors for the clusters
	    and S is the pooled covariance matrix:

		S = ( (n1-1)*Cov1 + (n2-1)*Cov2 ) / (n1+n2-2)

	where n1 and n2 are the number of objects in the clusters and
	    Cov1 and Cov2 are the covariance matrices for the two clusters
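
    The formula above can be sketched in Python as follows.  This is a
minimal illustration, not the program's own code: the function names
(pooled_covariance, solve, maha_dist) are assumptions, and the linear system
is solved by plain Gaussian elimination instead of forming S**-1 explicitly.

```python
import math


def pooled_covariance(n1, cov1, n2, cov2):
    """S = ((n1-1)*Cov1 + (n2-1)*Cov2) / (n1+n2-2)."""
    d = len(cov1)
    w = n1 + n2 - 2
    return [[((n1 - 1) * cov1[i][j] + (n2 - 1) * cov2[i][j]) / w
             for j in range(d)] for i in range(d)]


def solve(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            raise ValueError("pooled covariance matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def maha_dist(mu1, n1, cov1, mu2, n2, cov2):
    """Sqrt[(Mu2-Mu1) * S**-1 * (Mu2-Mu1)] with S the pooled covariance."""
    diff = [b - a for a, b in zip(mu1, mu2)]
    s = pooled_covariance(n1, cov1, n2, cov2)
    y = solve(s, diff)  # y = S**-1 * (Mu2-Mu1)
    return math.sqrt(sum(d * v for d, v in zip(diff, y)))


# Two made-up clusters with unit covariance whose means are 5 units apart.
identity = [[1.0, 0.0], [0.0, 1.0]]
print(maha_dist([0.0, 0.0], 10, identity, [3.0, 4.0], 20, identity))
```

    With unit covariance the Mahalanobis distance reduces to the ordinary
Euclidean distance between the means, which makes the example easy to check
by hand.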


    Determining whether the clusters are statistically significant is a tricky
business.  Even if the original data came from a population that was uniform
and had no intrinsic structure the random sampling process will introduce
some variations that the clustering algorithm will put into different clusters.
Thus it is necessary to know how well separated the clusters are to determine
if they do indeed represent actual structure in the underlying probability 
distribution that we are trying to learn about.  
    To determine if the clusters are statistically significant a statistical
test must be devised, which requires making some assumptions.  For this program
the null hypothesis (i.e. that the clusters are not statistically significant)
is that the data points come from a multidimensional uniform distribution which
has been cut in two along one axis.  For example, in two dimensions the null 
hypothesis is that the data points are equally likely to lie in any part of a 
square with all of those in one cluster coming from one side and all of those
in the other cluster coming from the other side.  Experimental tests have been
carried out to determine the probability of a particular Mahalanobis distance
for various numbers of data points, number of dimensions, and cutoff values.
The program contains an approximation to these results, so that Mahalanobis
distances can be translated into confidence levels.  The uniform distribution
assumption gives the most conservative confidence levels.  Other assumptions,
however, would be far too lenient unless one is assured that the data
followed the assumption.  The uniform assumption is likely to be too
conservative when there are many dimensions because there are then a huge
number of corners where real data is not likely to live.
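
    The two-dimensional null hypothesis can be illustrated with a small
simulation.  The Python sketch below is an assumption-laden illustration
(the names null_maha_dist and mean_and_cov are hypothetical) and does not
reproduce the approximation built into the program: it draws points
uniformly from the unit square, cuts them in two along the first axis, and
computes the Mahalanobis distance between the two halves.  For a cut at the
midpoint and many points the distance tends to Sqrt[12], about 3.46.

```python
import math
import random


def mean_and_cov(points):
    """Return the mean (mx, my) and covariance terms (sxx, sxy, syy)."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def null_maha_dist(npoints, rng, cut=0.5):
    """Mahalanobis distance between the two halves of a cut uniform square."""
    pts = [(rng.random(), rng.random()) for _ in range(npoints)]
    left = [p for p in pts if p[0] < cut]
    right = [p for p in pts if p[0] >= cut]
    (m1, c1), (m2, c2) = mean_and_cov(left), mean_and_cov(right)
    n1, n2 = len(left), len(right)
    # Pooled covariance, term by term: sxx, sxy, syy.
    w = n1 + n2 - 2
    sxx, sxy, syy = [((n1 - 1) * a + (n2 - 1) * b) / w for a, b in zip(c1, c2)]
    dx, dy = m2[0] - m1[0], m2[1] - m1[1]
    # Quadratic form with the explicit inverse of the 2x2 pooled matrix.
    det = sxx * syy - sxy * sxy
    q = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(q)


# A fixed seed keeps the run repeatable; the seed value is arbitrary.
print(null_maha_dist(2000, random.Random(1994)))
```

    Repeating the call with different seeds and smaller point counts shows
how much the distance scatters under the null hypothesis, which is the
scatter a confidence level has to account for.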



RESTRICTIONS

The maximum number of clusters accepted is 200.
The maximum number of bands or channels is 12.


PRECISION AND PORTING NOTES  5-Sept-94
    1.	The original test PDFs for both CLUSAN and CLUSTEST called GAUSNOIS
 	to create data files without specifying a SEED for the Random number
 	generator.  In order to produce repeatable test results, the procedures
 	have been changed to specify the same SEED each time the data sets are
 	generated.
 
    2.	CLUSTEST  uses double precision. However, the CLUSAN output 
 	statistical files which are used by CLUSTEST are single precision
 	(REAL*4).  The stated precision (from the VAX FORTRAN manuals) for
 	REAL*4 is "approximately 2**23, that is, typically seven decimal
 	digits".  In order to eliminate test differences (VAX vs. UNIX)
 	caused by reading the single precision values into double precision
 	variables, it is necessary to truncate the input values to six
 	decimal digits.  The truncation is currently being implemented by
 	this "ported" version of CLUSTEST.

 
    3.	The changes made for porting cause CLUSAN and CLUSTEST to perform
 	differently from the original baseline versions.  The main difference
 	appears to be the result of specifying default values for TEMP0 and
 	NCYCLES.  Not specifying default values for these parameters produces
 	inconsistent results between the different platforms.  For comparison
 	testing purposes, the original program was modified to use the same
 	random number generator with the same seed value as the new ported
 	versions.  However, even with these changes the original program does
 	not produce exactly the same results when run multiple times.
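
    The truncation described in note 2 can be sketched as follows.  This
Python illustration is not the ported program's code: the names to_single
and truncate_sig are hypothetical, and reading "six decimal digits" as six
significant digits is an assumption.

```python
import math
import struct
from decimal import Decimal, ROUND_DOWN


def to_single(x):
    """Round a Python float (double) to the nearest single precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def truncate_sig(x, digits=6):
    """Truncate x toward zero to the given number of significant digits."""
    if x == 0.0 or not math.isfinite(x):
        return x
    d = Decimal(x)                       # exact decimal value of the double
    exponent = d.adjusted()              # power of ten of the leading digit
    quantum = Decimal(1).scaleb(exponent - digits + 1)
    return float(d.quantize(quantum, rounding=ROUND_DOWN))


stored = to_single(0.1)        # single precision value widened to double
print(repr(stored))            # carries extra digits beyond single precision
print(truncate_sig(stored))    # the extra digits are cut off
```

    Cutting the widened value back to six digits removes the trailing
digits that are an artifact of the single-to-double conversion, which is
the source of the platform differences the note describes.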



Written by:	Frank Evans  	November 1985

Ported to UNIX: C.R. Schenk (CRI) 5-Sept-94

Cognizant Programmer:   K. F. Evans



PARAMETERS:


INP

Input statistical dataset(s)

LEVEL

Specifies amount of printout

MODE

'MAHA for Mahalanobis distances, 'CONFID for confidence levels (default)

See Examples:

