Help for CLUSAN
PURPOSE
clusan is a clustering algorithm that uses the simulated annealing
optimization technique to find the best cluster partition. The input is
multidimensional data in MSS format. The data are grouped into clusters
which have the minimum scaled variance. This clustering algorithm is
designed for finding the best cluster partition of a small number of
multidimensional data points; the maximum is 50000 points.
EXECUTION EXAMPLES
clusan DATA.MSS (DATA.SDS,DATA.CLS) MSS=6 BANDS=(2,3,5,6) NCLUS=(6,4) INCR=10 SIZE=(100,100,500,500) THRESH=10.
The input data are in MSS format; i.e. the bands are appended together
in the sample direction using the program MSS. The pixels can be in any
format since they are converted to real in the program. The MSS parameter
specifies how many bands are in the input file.
The first output file is a statistics dataset (like those produced by
USTATS, etc) which contains the number of data points, the mean vector,
and the covariance matrix for each cluster. This file can be used by
programs CLUSTEST and DENDSTAT.
The second output file contains the input data for those bands that were
used in the clustering and another band which gives the cluster number
that the pixel was assigned to. This file is in MSS format and has the
same pixel format as the input file.
The BANDS parameter specifies which bands in the input file will be
used for the clustering. This allows the user to try clustering various
combinations of bands without having to make up a new input dataset each
time. There must be at least two, but no more than twelve bands.
The NCLUS parameter guides the algorithm in choosing the number of
clusters (see theory of operation). The first number is the desired number
of clusters and the second number is about how many clusters the algorithm
can stray from the desired number.
The INCR (or LINC/SINC) parameter is used to cut down on the number of
points to be used (since the maximum number is 50000). This is also the
purpose of the SIZE parameter.
The THRESH parameter sets a minimum value for valid pixels (objects).
Note that in the cluster dataset output all pixels on a line that lie
below THRESH are moved to the end of the line and assigned the value 0.
clusan DATA.CLS (DATA2.SDS,DATA2.CLS) MSS=5 BANDS=(1,2,3,4) NCLUS=(8,2) 'RESTART NCYCLE=25 NITER=100 TEMP=0.5
This example which uses the remaining parameters shows how to restart a
clustering run. To restart the run, the RESTART keyword is specified
and the second output file from the previous run is used for the input
file. Note that the bands in the new input file are in different positions
than the bands in the old input file, and that the INCR and SIZE parameters
need not be respecified, as the file has been compressed accordingly.
MSS is now 5, as the input contains only the 4 bands that were processed
in the first run, plus an additional pseudo-band containing cluster numbers.
The last three parameters will not usually have to be specified, but allow
some control over the optimization algorithm (see theory of operation).
The NCYCLE parameter is the maximum number of temperature cycles the
algorithm will perform. If it is large then the algorithm will lower the
temperature very slowly, but if it is too small the system will be
quenched and will not be at a true minimum energy configuration.
The algorithm will stop before doing NCYCLE temperature cycles if
the system is in a stable state (i.e. the energy is no longer decreasing).
The NITER parameter specifies how many iterations are to be done for
each cooling cycle. The default is to do 4 times the number of data points,
so that on average each point will be considered about four times per
cycle.
The TEMP parameter gives the initial temperature to start the system
off with. The initial temperature should be of the order of unity.
A higher starting temperature will make the system behave more randomly
at first.
RESTRICTIONS
1. The maximum number of data points is 50,000. (The LINC/SINC parameters
may be used to reduce the number of points used from the input image.)
2. The maximum number of clusters is 100.
3. The number of bands to cluster must be between 2 and 12 inclusive.
4. The maximum sample size of the input MSS file is 60,000 samples.
5. The maximum number of lines that are used (i.e., (NL-SL)/LINC+1)
in the input MSS file is 10,000.
THEORY OF OPERATION
Clustering is the technique of grouping data points (or objects) into a
number of groups based on similarities in the objects' measured attributes.
While for a small number of attributes (one or two variables) graphing the
data points can be used to easily find the clusters, as the number of attri-
butes becomes larger visual methods become impossible. Many computer
algorithms exist which cluster data points in any number of dimensions.
One approach to the problem of clustering is that of optimization. An
objective function which measures the degree of clusterness is maximized over
all possible partitions of the data points into clusters. Of course, a wide
range of such functions could be imagined. There are also a wide variety of
optimization methods available to maximize or minimize functions. Most such
methods are inappropriate because of the nature of clustering. Clustering
involves discrete movements of objects between clusters, so that derivatives
are not a particularly useful concept. Also there are many local minima
for optimization algorithms to get caught in. One new algorithm which over-
comes these difficulties is called simulated annealing.
Simulated annealing is analogous to the cooling and crystalization process
of a crystalline solid. The objective function to be minimized is called the
energy of the system. An object is randomly moved from one cluster to another
in analogy with the random motions of the atoms in a crystal. If the energy
is thus decreased the change is accepted. If the energy is increased the
change is still accepted with a probability depending on the Boltzmann factor:
exp(-deltaE/Temp). Thus increases in the energy comparable to the temperature
still have a good chance of being accepted while large increases in the energy
have very little chance of being accepted. This allows the algorithm to
rattled its way out of local minima. The temperature is at first set high so
that a lot of random motion occurs. It is then gradually lowered until the
system freezes. How exactly the temperature is lowered is called the
annealing schedule. There are many possibilities for the annealing schedule,
but the one used in this implementation is that of Newton's law of cooling,
i.e. the temperature decreases exponentially:
Temp = InitialTemp * exp(-const*Time) .
The objective function in this program is the normalized average of the
cluster variances. The variances in each attribute dimension are scaled by
the total variance of all of the data points in that dimension. Thus the
results are independent of all scalings and translations of the attributes.
The variances of the clusters are averaged by weighting by the number of
points in each cluster. Finally the average of the variances is scaled by
the number of attributes and scaled according to the number of clusters, so
that the energy function is invariant and can be compared between configura-
tions with different number of clusters. The energy should always be of
order one. The energy function treats all of the attributes the same and will
tend to make hyperspherical clusters even if the clusters should be long and
stringy.
The algorithm also uses the simulated annealing technique to find the
number of clusters. Every once in a while, it will try lumping two clusters
together or splitting one cluster up into two and will do the same Boltzmann
calculation to determine whether to accept the change. The algorithm can be
guided to give about a certain number of clusters with the NCLUS parameter.
The first value gives the number of clusters desired and the second value gives
a rough number of how many in either direction is acceptable. NCLUS=(6,2)
means "I want 6 clusters give or take about 2". This feature is implemented
by adding a parabolic term to the energy function, i.e. a term whose value
goes as the square of the difference between the actual number of clusters and
the desired number.
OPERATION STRATEGY
The program starts out by randomly selecting the specified number of
objects and setting all of the objects to the closest cluster. Run the
program a few times on the same data to see if same minimum energy is
reached. If you want the program to decide entirely by itself the best
number of clusters, make NCLUS(2) have some large value. If the program
has trouble converging to the same result it probably means that the data
is not particularly clustered. Generally the program quickly and reliably
converges in cases where the data has well defined groupings.
PRECISION: 5-Sept-94
1. This portable version uses double precision,which provides a
capability to adjust the sensitivity to changes in the "ENERGY"
For portablity testing, this version ignores delta "ENERGY"
changes less than .0000005.
2. CLUSTEST has always used double precision. The CLUSAN output
statistical files which are used by CLUSTEST are single precision
(REAL*4). The stated precision (from the VAX FORTRAN manuals) for
REAL*4 is "approximately 2**23, that is, typically seven decimal
digits". In order to eliminate test differences (VAX vs. UNIX)
caused by reading the single precision values into double precision
variables, it is necessary to truncate the input values to six
decimal digits. The truncation is currently being implemented by
the "ported' version of CLUSTEST.
3. The changes made for porting cause CLUSAN and CLUSTEST to perform
differently than the original baseline versions. The main difference
appears to be the result of specifying default values for TEMP0 and
NCYCLES. Not specifing default values for these parameters produce
inconsistent results between the different platforms. For comparison
testing purposes, the original program was modified to use the same
random number generator with the same seed value as the new ported
versions. However, even with these changes the original program does
not produce exactly the same results when run mutliple times.
REFERENCES
Optimization by Simulated Annealing
Kirkpatrick, Gelatt, and Vecchi,
Science, May 13, 1983 , No. 220
Cluster Analysis For Applications,
Anderberg, Michael R., Academic Press (1973).
Classification: Methods for the Exploratory Analysis of Multivariate Data,
Gordon, A. D., Chapman and Hall (1981).
HISTORY
Original Programmer: Frank Evans November 1985
Cognizant Progrmmer: Frank Evans
Revisions:
1987-04-06 L.W.Kamp, added INCR/LINC/SINC & THRESH parameters, fixed
SS/NS, increased BUFFER size.
1994-09-05 C.R. Schenk (CRI) Ported to UNIX & changed to double
precision.
2016-11-15 W.L. Bunch Set return type of DISTFUNC to REAL*8.
PARAMETERS:
INP
Input attribute file
OUT
Output statistics data set
and cluster file
SIZE
Standard VICAR size field
SL
Starting line
SS
Starting sample (per band !)
NL
Number of lines
NS
Number of samples (per band !)
MSS
Number of bands in MSS format
BANDS
List of bands (variables)
to cluster
INCR
Line/sample increment
LINC
Line increment
SINC
Sample increment
THRESH
Minimum valid DN
NCLUS
The number of clusters desired
The range in number of clusters
NCYCLE
The number of temperature cycles
NITER
The number of iterations per
cycle
TEMP
The initial temperature
RESTART
'RESTART to restart a clustering
See Examples:
Cognizant Programmer: