Help for IBISSTAT
PURPOSE
"ibisstat" performs statistical analyses on IBIS tabular files.
Output is to the terminal and/or another tabular file.
Currently, nine types of statistical analysis methods are available:
'SUMMARY Statistical summary (median, mean, etc.)
'HIST Histogram
'SCATTER Scatter plot
'CORR Correlation
'BEHRENS Behrens-Fisher test for different means (multivariate)
'REGRESS Multiple linear regression
'ANOVA One-way analysis of variance
'FACTOR Principal components factor analysis
'DENSITY Selected probability densities
EXECUTION
ibisstat INP=TABLE.INT OUT=STAT.INT 'method COLS=( , ) COLNAMES=("","")
ibisstat TABLE.INT COLS=(...) 'method (parameters)
Each option requires that the COLS parameter be specified. This
parameter specifies the columns in the tabular file which contain the
data to be operated on. COLNAMES is an optional parameter that is used
to apply an eight character label to each column for use in the printouts.
The 'NOPRINT keyword suppresses all printout when the output should
go only to a file. An output IBIS tabular file may optionally be
specified; however, not all of the options use an output file.
DESCRIPTION OF THE STATISTICAL ANALYSIS METHODS
'SUMMARY STATISTICAL SUMMARY
The statistical summary option calculates some simple statistical
values independently for each column specified. A table is printed
showing the median, mean, standard deviation, minimum, and maximum
for each column. The standard deviation is calculated with N-1
weighting. The optional output tabular file has 5 rows (median,
mean, std, min, max) and as many columns as input.
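The summary values can be reproduced outside of ibisstat for checking.
The following is a minimal Python sketch, not the program's own code; the
function name summarize and the sample data are hypothetical.

```python
import statistics

def summarize(column):
    """Return (median, mean, std, min, max) for one column of numbers."""
    return (
        statistics.median(column),
        statistics.mean(column),
        statistics.stdev(column),  # sample standard deviation (N-1 weighting)
        min(column),
        max(column),
    )

# Hypothetical table: each inner list is one column of the tabular file.
columns = [[2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0], [1.0, 2.0, 3.0, 4.0]]

# Five rows (median, mean, std, min, max), one column per input column,
# mirroring the layout described for the optional output file.
rows = list(zip(*(summarize(c) for c in columns)))
```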
'HIST HISTOGRAM
The histogram option produces a terminal plot of the histogram
of each column specified. The limits on the histogram are the
minimum and maximum data values in the column, and the number
of bins is given by the BINS parameter (default is 20). For
each bin the center data value and the number of data points
in the bin are printed and then the appropriate number of *'s
are printed. If more than one column is specified then the
histograms are done in order.
The optional output file consists of two columns for every
input column and a row for each histogram bin. The first column
in each pair has the center data value of the bins and the second
column has the histogram count.
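The binning described above can be sketched in Python as follows. This is
an illustration only: the function name histogram is hypothetical, and
placing the maximum value in the last bin is an assumption about how the
upper limit is handled.

```python
def histogram(values, bins=20):
    """Return (centers, counts) for equal-width bins spanning min..max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # Assumption: the maximum data value is counted in the last bin.
        i = min(int((v - lo) / width), bins - 1) if width > 0 else 0
        counts[i] += 1
    centers = [lo + (i + 0.5) * width for i in range(bins)]
    return centers, counts

centers, counts = histogram([0.0, 1.0, 1.5, 2.0, 3.5, 4.0], bins=4)
for c, n in zip(centers, counts):
    # Bin center, count, then one * per data point in the bin.
    print(f"{c:10.3f} {n:6d} " + "*" * n)
```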
'SCATTER SCATTER PLOT
The scatter plot option produces terminal plots of two variables.
The COLS parameter specifies a pair of columns for each plot: the first
column in the pair is the X variable and the second is the Y variable.
The size of the plot (in characters on the terminal) is specified with
the SCATSIZE parameter. The plot prints *'s where there is one data point,
the digits "2" through "8" for two to eight data points, and "9" for
nine or more points.
The optional output file is not used.
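The plotting symbols can be illustrated with a short Python sketch. The
function names, the default plot size, and the scaling of data values to
character cells are assumptions made for the illustration; only the symbol
rule (* for one point, digits for two to eight, 9 for nine or more) comes
from the description above.

```python
def cell_symbol(count):
    """Map the number of points in a character cell to its plot symbol."""
    if count == 0:
        return " "
    if count == 1:
        return "*"
    return str(min(count, 9))  # "2".."8", and "9" for nine or more

def scatter_plot(xs, ys, width=40, height=20):
    """Return the plot as a list of text lines, largest Y at the top."""
    x0, x1, y0, y1 = min(xs), max(xs), min(ys), max(ys)
    grid = [[0] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        # Assumed scaling: data limits map onto the first and last cells.
        col = int((x - x0) / (x1 - x0) * (width - 1)) if x1 > x0 else 0
        row = int((y - y0) / (y1 - y0) * (height - 1)) if y1 > y0 else 0
        grid[height - 1 - row][col] += 1
    return ["".join(cell_symbol(n) for n in line) for line in grid]

for line in scatter_plot([0, 0, 1], [0, 0, 1], width=3, height=2):
    print(line)
```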
'CORR CORRELATION
The correlation option calculates the correlation, covariance,
and the significance level for each pair of variables in a
multivariate dataset. Each column is a different variable, and
each row is a data point. If there are M variables (columns) then
there is an M by M matrix of correlation values. The matrix is
symmetric, and only the lower triangular portion is printed. For
each matrix entry in the printout the first number is the correlation,
the second number is the covariance, and the third one is the
probability or significance level. On the diagonal of the matrix
the correlations are 1 (variables are perfectly correlated with
themselves) and the covariances are the variance of each variable.
The significance level is the probability that the level of correlation
has come about by chance. This test assumes that the data is normally
distributed. The test is a one-tailed test of statistical significance.
The optional output file has M rows and M columns and contains the
covariance matrix.
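The relation between the covariance and correlation matrices can be
sketched in Python as below. The function names are hypothetical, and the
N-1 normalization of the covariance is an assumption; the significance
level is not computed here.

```python
import math

def covariance_matrix(columns):
    """Return the M by M covariance matrix of M equal-length columns."""
    n = len(columns[0])
    m = len(columns)
    means = [sum(c) / n for c in columns]
    cov = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i + 1):  # lower triangle; the matrix is symmetric
            s = sum((a - means[i]) * (b - means[j])
                    for a, b in zip(columns[i], columns[j]))
            cov[i][j] = cov[j][i] = s / (n - 1)  # assumed N-1 normalization
    return cov

def correlation_matrix(cov):
    """Scale each covariance by the two standard deviations."""
    m = len(cov)
    return [[cov[i][j] / math.sqrt(cov[i][i] * cov[j][j])
             for j in range(m)] for i in range(m)]

cov = covariance_matrix([[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]])
corr = correlation_matrix(cov)  # diagonal entries are 1
```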
'BEHRENS BEHRENS-FISHER TEST
The Behrens option is used to test whether two multivariate samples
have the same mean vector. It is a multivariate generalization of
the Student's T test, and does not assume that the two distributions
have the same size and shape. The columns containing the first
sample are specified with the COLS parameter, while the columns
with the second sample are specified with the BCOLS parameter.
The numbers of data points in the two samples are specified with the
NUMPOINT parameter, and thus can be different. Hotelling's T squared
statistic and the attained level of significance are printed out.
The samples are assumed to come from two multivariate normal
distributions with possibly different covariance matrices.
The optional output file is not used.
'REGRESS MULTIPLE REGRESSION
The regression option performs a multiple linear regression and
calculates some statistics. The COLS parameter specifies the
columns containing the independent variables and the DEPCOL parameter
specifies the dependent variable column. The optional ERRCOL parameter
specifies the column that contains the estimated errors (uncertainties)
in the dependent data.
The regression constant and coefficient for each variable are printed
out in the COEFFICIENT column. The one-sigma confidence interval for
each regression coefficient is printed out in the ERROR column. The
number after the 'ERROR:' is the probability of the coefficient being
within plus or minus the error. If a higher probability confidence
interval is desired just multiply the interval by the appropriate
number from the T statistic for N-M-1 degrees of freedom, e.g. 2.660
for 99% probability for df=60 (N is number of data points and M is
number of variables). This confidence interval is calculated from
the scatter in the data and is not based in any way on the input
estimated uncertainties.
The R squared statistic is the fraction of the total variance in
the dependent data that is explained by the regression. The standard
error of the estimate is the RMS average of the residuals (the misfit
between the predicted and actual dependent data). The F ratio test
determines the significance of the overall regression, i.e. the
probability that not all of the coefficients are actually zero. A high
level of significance does not mean that all of the variables are
significant, just that at least one is. The Durbin-Watson statistic
indicates how sequentially correlated the residuals are; uncorrelated
residuals have a statistic around 2, while correlated residuals will
have a statistic less than 1.
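Three of the statistics above can be sketched in Python from the observed
and predicted dependent data. This is an illustration, not the program's
code: the function name is hypothetical, and dividing by N-M-1 in the
standard error is an assumption. The F ratio and its probability are not
computed.

```python
import math

def fit_diagnostics(y, predicted, n_variables):
    """Return (R squared, standard error of estimate, Durbin-Watson)."""
    n = len(y)
    residuals = [a - b for a, b in zip(y, predicted)]
    mean_y = sum(y) / n
    ss_res = sum(r * r for r in residuals)        # unexplained variation
    ss_tot = sum((a - mean_y) ** 2 for a in y)    # total variation
    r_squared = 1.0 - ss_res / ss_tot
    # Assumption: RMS residual with N-M-1 degrees of freedom.
    std_error = math.sqrt(ss_res / (n - n_variables - 1))
    # Sum of squared successive residual differences over ss_res.
    durbin_watson = sum((residuals[i] - residuals[i - 1]) ** 2
                        for i in range(1, n)) / ss_res
    return r_squared, std_error, durbin_watson

r2, se, dw = fit_diagnostics([1.0, 2.0, 3.0, 4.0, 5.0],
                             [1.5, 1.5, 3.5, 3.5, 5.5], n_variables=1)
```

In this made-up example the residuals alternate in sign, so the
Durbin-Watson statistic comes out above 2; residuals that drift slowly
from one row to the next push it toward 0.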
If ERRCOL is specified then the goodness of fit chi squared will
be calculated. The chi squared statistic is the sum of the squares
of each residual divided by each uncertainty. If the estimated
uncertainties are correct and the regression fits the data then the
statistic will be about 1. The associated probability printed is the
probability of the statistic being that far from 1 just due to random
chance.
The optional output file contains two columns: the first contains
the M+1 regression coefficients (including the constant), and the
second contains the residuals.
NOTE: The multiple regression technique assumes that the residuals
are uncorrelated and come from a normal distribution.
'ANOVA ANALYSIS OF VARIANCE
The ANOVA option performs one-way analysis of variance on a
table of data. Each column represents a separate group that has
been treated differently. Analysis of variance is a statistical
test that determines whether any of the groups has a significantly
different mean value from the rest (e.g. whether the "treatment" has
had any significant effect). The groups need not have the same size:
the number of replications for each group is specified with the
REPLICS parameter. The analysis of variance technique assumes that
the data points in each group are sampled from a normal distribution
with the same variance.
In the printout the grand mean is the mean of all of the groups
put together, and the estimate of effects is the difference between
the group means and the grand mean. The F test determines the
significance level of the hypothesis that at least one group
has a nonzero estimate of effect.
The optional output file is not used.
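The quantities in the ANOVA printout can be sketched in Python as follows,
using the standard one-way formulas. The function name and sample groups
are hypothetical, and the significance level of the F ratio is not
computed.

```python
def one_way_anova(groups):
    """Return (grand mean, estimates of effects, F ratio)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    # Grand mean: mean of all of the groups put together.
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Estimate of effect: group mean minus grand mean.
    effects = [m - grand_mean for m in means]
    ss_between = sum(len(g) * e * e for g, e in zip(groups, effects))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    f_ratio = (ss_between / (k - 1)) / (ss_within / (n_total - k))
    return grand_mean, effects, f_ratio

# Groups may have different numbers of replications.
grand, effects, f = one_way_anova([[1.0, 2.0, 3.0],
                                   [2.0, 3.0, 4.0],
                                   [5.0, 6.0, 7.0]])
```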
'FACTOR FACTOR ANALYSIS
The factor option performs principal components factor analysis
on a multivariate dataset. This involves calculating the covariance
matrix of the data, and then finding the eigenvalues and eigenvectors
of the matrix. The eigenvectors are the coefficients for linearly
transforming the original variables into a new set of variables, or factors.
This new set of variables has the property that the data expressed in
terms of the factors has no cross-correlation between different variables.
The eigenvalues are the variances of the factors. The table gives
the coefficients for calculating the factors in terms of the original
variables.
Factor analysis can be used to find a new set of variables that
is smaller than the original set but that explain most of the variance
in the data. The first few factors may explain most of the variance
(according to the eigenvalues) and the rest can be ignored. In
geometrical terms the data points are a cloud with a certain size, shape,
and direction. Principal component analysis finds how to rotate the
coordinates so that the principal axes of the cloud lie along the
coordinate axes. The square roots of the variances are roughly the sizes
of the cloud in the principal directions. The eigenvectors, which are
orthonormal, are the unit vectors in the rotated space.
The optional output file contains the original data transformed to
the new set of variables. The correlation option of IBISSTAT will
verify that the transformed data has no cross-correlation.
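For two variables the rotation can be written in closed form, which makes
a compact illustration. The Python sketch below is not the program's
method: it handles only two columns, the function name is hypothetical,
and subtracting the column means before rotating and the N-1 normalization
are assumptions.

```python
import math

def principal_components_2d(xs, ys):
    """Return (eigenvalues, eigenvectors, factors) for two variables."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Covariance matrix [[a, b], [b, c]] (assumed N-1 normalization).
    a = sum((x - mx) ** 2 for x in xs) / (n - 1)
    c = sum((y - my) ** 2 for y in ys) / (n - 1)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    root = math.hypot((a - c) / 2.0, b)
    eigenvalues = [(a + c) / 2.0 + root, (a + c) / 2.0 - root]
    # Rotation angle of the first principal axis of the data cloud.
    theta = 0.5 * math.atan2(2.0 * b, a - c)
    v1 = (math.cos(theta), math.sin(theta))    # orthonormal unit vectors
    v2 = (-math.sin(theta), math.cos(theta))
    # Data expressed in the rotated coordinates (assumed mean-subtracted).
    factors = [((x - mx) * v1[0] + (y - my) * v1[1],
                (x - mx) * v2[0] + (y - my) * v2[1])
               for x, y in zip(xs, ys)]
    return eigenvalues, (v1, v2), factors

values, vectors, factors = principal_components_2d([1.0, 2.0, 3.0, 4.0],
                                                   [2.0, 1.0, 4.0, 3.0])
```

The variances of the two factor columns equal the eigenvalues, and their
cross-covariance is zero to rounding error.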
'DENSITY PROBABILITY DENSITIES
The density option calculates the cumulative probability of given
values for four distributions: normal, chi squared, Student's T, and
Fisher's F. The input IBIS tabular file contains the values of the
statistic. The output IBIS file contains two columns for each input
column: first, the values of the statistic from the input file, and second
the corresponding cumulative probabilities. The parameter DISTRIB
specifies the particular probability distribution (or density).
The chi squared and Student's T distributions require one degrees
of freedom parameter, and the Fisher's F distribution requires two.
The parameter NDF specifies the number of degrees of freedom to use
for each column. In this way a table can be generated with different
columns for each number of df's desired.
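For the normal distribution the calculation can be sketched in Python with
the standard library alone. This assumes a standard normal (zero mean,
unit variance) and that "cumulative probability" means the probability of
a value at or below the statistic; the names are hypothetical. The other
three distributions need incomplete gamma or beta functions and are not
shown.

```python
import math

def normal_cdf(z):
    """Cumulative probability of the standard normal distribution at z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Hypothetical input column of statistic values and the matching
# output column of cumulative probabilities.
statistic_column = [-1.96, 0.0, 1.0, 1.96]
probability_column = [normal_cdf(z) for z in statistic_column]
```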
EXAMPLES
To perform a cubic fit to some data:
Generate some fake data:
ibis-gen DATA.INT NC=4 NR=50
mf DATA.INT FUNCTION=("C1=INDEX/25", "C4=11.5 +5.0*C1 -2.3*C1*C1 +0.3*(C1**3)")
Assume X in column 1, and Y in column 4.
mf DATA.INT FUNC=("C2=C1*C1", "C3=C1*C1*C1")
ibisstat DATA.INT 'REGRESS COLS=(1,2,3) DEPCOL=4 COLNAMES=("X", "X**2", "X**3")
To perform principal components analysis on an MSS format image:
mssibis DATA.MSS DATA.INT MSS=4 NL=500 NS=400 INC=5
ibisstat DATA.INT FACT.INT 'FACTOR COLS=(1,2,3,4)
mssibis FACT.INT FACT.MSS 'TOMSS NL=100 NS=80
RESTRICTIONS
The maximum column length is 100,000.
The maximum amount of data (columns times column length) is 250,000.
The maximum number of input columns is 40.
WRITTEN BY: K. F. Evans October 1986
COGNIZANT PROGRAMMER: K. F. Evans
REVISION:
2-95 - Meredith Cox (CRI) - Made portable for UNIX
PARAMETERS:
INP
Input IBIS tabular file.
OUT
An optional output file.
Not used by all methods.
COLS
The columns on which to perform
the statistical analysis.
COLNAMES
An optional eight character
heading for each column.
OPTION
Keyword to select the
analysis method.
(SUMMARY,HIST,SCATTER,CORR,BEHRENS,
REGRESS,ANOVA,FACTOR,DENSITY)
NOPRINT
Keyword to suppress printout.
BINS
The number of bins.
Only used for HISTOGRAM.
SCATSIZE
The size of the plot (x,y).
Only used for SCATTER.
DEPCOL
Dependent variable column.
Only used for REGRESSION.
DEPNAME
An optional eight character
heading for the dependent
variable.
Only used for REGRESSION.
ERRCOL
The column containing the
estimated errors of the
dependent data. (optional).
Only used for REGRESSION.
REPLICS
The number of replications for
each cell. Only used for ANOVA.
BCOLS
The columns for the
"B" multivariate sample.
Only used in BEHRENS.
NUMPOINT
The number of data points
for the A and B samples.
Only used in BEHRENS.
DISTRIB
Keyword for the probability
distribution.
(NORMAL,CHISQ,STUDENTT,FISHERF)
Only used in DENSITY.
NDF
The number of degrees of freedom
for each column. One per column
for CHISQ and STUDENTT, and two
per column for FISHERF.
Only used in DENSITY.