Help for IBISSTAT

PURPOSE

    "ibisstat" performs statistical analyses on IBIS tabular files.
    Output is to the terminal and/or another tabular file.
    Currently, nine types of statistical analysis methods are available:

     'SUMMARY     Statistical summary (median, mean, etc.)
     'HIST        Histogram 
     'SCATTER     Scatter plot 
     'CORR        Correlation
     'BEHRENS     Behrens-Fisher test for different means (multivariate)
     'REGRESS     Multiple linear regression
     'ANOVA       One-way analysis of variance
     'FACTOR      Principal components factor analysis
     'DENSITY     Selected probability densities

EXECUTION

    ibisstat  INP=TABLE.INT OUT=STAT.INT  'method COLS=( , )  COLNAMES=("","")

    ibisstat TABLE.INT   COLS=(...)  'method   (parameters)



    Each option requires that the COLS parameter be specified.  This
parameter specifies the columns in the tabular file which contain the 
data to be operated on.  COLNAMES is an optional parameter that is used
to apply an eight character label to each column for use in the printouts.
The 'NOPRINT keyword can be used to suppress all printout for when the
output should only go to a file.  An output IBIS tabular file may be
optionally specified, however, not all of the options use an output file.

DESCRIPTION OF THE STATISTICAL ANALYSIS METHODS


'SUMMARY            STATISTICAL SUMMARY

        The statistical summary option calculates some simple statistical 
    values independently for each column specified.  A table is printed 
    showing the median, mean, standard deviation, minimum, and maximum 
    for each column.  The standard deviation is calculated with N-1 
    weighting.  The optional output tabular file has 5 rows (median, 
    mean, std, min, max) and as many columns as input.



'HIST                 HISTOGRAM 

        The histogram option produces a terminal plot of the histogram
    of each column specified.  The limits on the histogram are the
    minimum and maximum data values in the column, and the number
    of bins is given by the BINS parameter (default is 20).  For
    each bin the center data value and the number of data points
    in the bin are printed and then the appropriate number of *'s
    are printed.  If more than one column is specified then the
    histograms are done in order.
        The optional output file consists of two columns for every
    input column and a row for each histogram bin.  The first column
    in each pair has the center data value of the bins and the second
    column has the histogram count.

'SCATTER              SCATTER PLOT

    The scatter plot option produces terminal plots of two variables.
    The COLS parameter specifies a pair of columns for each plot: the first 
    column in the pair is the X variable and the second is the Y variable.  
    The size of the plot (in characters on the terminal) is specified with 
    the SCATSIZE parameter.  The plot prints *'s where there is one data point, 
    the digits "2" through "8" for two to eight data points, and "9" for 
    nine or more points.
    The optional output file is not used.


'CORR                 CORRELATION

        The correlation option calculates the correlation, covariance,
    and the significance level for each pair of variables in a 
    multivariate dataset.  Each column is a different variable, and
    each row is a data point.  If there are M variables (columns) then
    there is an M by M matrix of correlation values.  The matrix is
    symmetric, and only the lower triangular portion is printed.  For
    each matrix entry in the printout the first number is the correlation,
    the second number is the covariance, and the third one is the 
    probability or significance level.  On the diagonal of the matrix
    the correlations are 1 (variables are perfectly correlated with
    themselves) and the covariances are the variance of each variable.
    The significance level is the probability that the level of correlation
    has come about by chance.  This test assumes that the data is normally 
    distributed.  The test is a one-tailed test of statistical significance.
    The optional output file has M rows and M columns and contains the
    covariance matrix.
'BEHRENS           BEHRENS-FISHER TEST 

        The Behrens option is used to test whether two multivariate samples
    have the same mean vector.  It is a multivariate generalization of
    the Students T test, and does not assume that the two distributions
    have the same size and shape.  The columns containing the first
    sample are specified with the COLS parameter, while the columns
    with the second sample are specified with the BCOLS parameter.
    The numbers of data points in the two samples are specified with the
    NUMPOINTS parameter, and thus can be different.  Hotelling's T squared
    statistic and the attained level of significance are printed out.
    The samples are assumed to come from two multivariate normal 
    distributions with possibly different covariance matrices.
    The optional output file is not used.
'REGRESS            MULTIPLE REGRESSION

        The regression option performs a multiple linear regression and
    calculates some statistics.  The COLS parameter specifies the
    columns containing the independent variables and the DEPCOL parameter 
    specifies the dependent variable column.  The optional ERRCOL parameter 
    specifies the column that contains the estimated errors (uncertainties) 
    in the dependent data.  
        The regression constant and coefficient for each variable are printed 
    out in the COEFFICIENT column.  The one-sigma confidence interval for 
    each regression coefficient is printed out in the ERROR column.  The 
    number after the 'ERROR:' is the probability of the coefficient being 
    within plus or minus the error.  If a higher probability confidence 
    interval is desired just multiply the interval by the appropriate 
    number from the T statistic for N-M-1 degrees of freedom, e.g. 2.660 
    for 99% probability for df=60  (N is number of data points and M is 
    number of variables).  This confidence interval is calculating from 
    the scatter in the data and is not based in any way on the input 
    estimated uncertainties.  
        The R squared statistic is the fraction of the total variance in
    the dependent data that is explained by the regression.  The standard
    error of the estimate is the RMS average of the residuals (the misfit
    between the predicted and actual dependent data).  The F ratio test 
    determines the significance of the overall regression, i.e. the
    probability that not all of the coefficients are actually zero. A high
    level of significance does not mean that all of the variables are
    significant just that at least one is.  The Durbin-Watson statistic
    indicates how sequentially correlated the residuals are;  uncorrelated
    residuals have a statistic around 2, while correlated residual will
    have a statistic less than 1.
        If the ERRCOL is specified then the goodness of fit chi squared will
    be calculated.  The chi squared statistic is the sum of the squares
    of each residual divided by each uncertainty.  If the estimated
    uncertainties are correct and the regression fits the data then the
    statistic will be about 1.  The associated probability printed is the 
    probability of the statistic being that far from 1 just due to random
    chance.
        The optional output file contains two columns:  the first contains
    the M+1 regression coefficients (including the constant), and the
    second contains the residuals.
        NOTE: The multiple regression technique assumes that the residuals 
    are uncorrelated and come from a normal distribution.

'ANOVA                ANALYSIS OF VARIANCE

        The ANOVA option performs one-way analysis of variance on a
    table of data.  Each column represents a separate group that has
    been treated differently.  Analysis of variance is a statistical
    test that determines whether any of the groups has a significantly
    different mean value from the rest (e.g. whether the "treatment" has
    had any significant effect).  The groups need not have the same size:
    the number of replications for each group is specified with the
    REPLICS parameter.  The analysis of variance technique assumes that
    the data points in each group are sampled from a normal distribution
    with the same variance.
        In the printout the grand mean is mean of all of the groups 
    put together, and the estimate of effects is the difference between
    the group means and the grand mean.  The F test determines the 
    significance level of the hypothesis that at least one group
    has a non zero estimate of effect.
        The optional output file is not used.
'FACTOR              FACTOR ANALYSIS

        The factor option performs principal components factor analysis
    on a multivariate dataset.  This involves calculating the covariance
    matrix of the data, and then finding the eigenvalues and eigenvectors
    of the matrix.  The eigenvectors are the coefficients for linearly
    translating the original variables into a new set of variable or factors.
    This new set of variables has the property that the data expressed in 
    terms of the factors has no cross-correlation between different variables.
    The eigenvalues are the variances of the factors.   The table gives
    the coefficients for calculating the factors in terms of the original
    variables.
        Factor analysis can be used to find a new set of variables that
    is smaller than the original set but that explain most of the variance
    in the data.  The first few factors may explain most of the variance
    (according to the eigenvalues) and the rest can be ignored.  In
    geometrical terms the data points are a cloud with a certain size, shape,
    and direction.  Principal component analysis finds how to rotate the
    coordinates so that the principal axis of the cloud lie along the
    coordinate axis.  The square root of the variances is roughly the size
    of the cloud in the principal directions.  The eigenvectors, which are
    ortho-normal, are the unit vectors in the rotated space.
        The optional output file contains the original data translated to
    the new set of variables.  The correlation option of IBISSTAT will
    verify that the transformed data has no cross-correlation.



'DENSITY           PROBABILITY DENSITIES

    The density option calculates the cumulative probability of given
    values for four distributions: normal, chi squared, Student's T, and
    Fisher's F.  The input IBIS tabular file contains the values of the
    statistic.  The output IBIS file contains two columns for each input
    column: first, the values of statistic from the input file, and second
    the corresponding cumulative probabilities.  The parameter DISTRIB 
    specifies the particular probability distribution (or density).  
    The chi squared and Student's T distributions required one degrees 
    of freedom parameter, and the Fisher's F distribution requires two.  
    The parameter NDF specifies the number of degrees of freedom to use 
    for each column.  In this way a table can be generated with different 
    columns for each number of df's desired.

EXAMPLES

To perform a cubic fit to some data:

   Generate some fake data:
    ibis-gen  DATA.INT   NC=4 NR=50
    mf        DATA.INT   FUNCTION=("C1=INDEX/25",  				   "C4=11.5 +5.0*C1 -2.3*C1*C1 +0.3*(C1**3)")
   Assume X in column 1, and Y in column 4.
    mf         DATA.INT  FUNC=("C2=C1*C1", "C3=C1*C1*C1")
    ibisstat   DATA.INT  'REGRESS  COLS=(1,2,3) DEPCOL=4   				COLNAMES=("X", "X**2", "X**3")


To perform principal components analysis on an MSS format image:

    mssibis   DATA.MSS  DATA.INT  MSS=4  NL=500 NS=400 INC=5
    ibisstat  DATA.INT  FACT.INT  COLS=(1,2,3,4)
    mssibis   FACT.INT  FACT.MSS  'TOMSS  NL=100 NS=80

RESTRICTIONS

    The maximum column length is 100,000.
    The maximum amount of data (columns times column length) is 250,000.
    The maximum number of input columns is 40.


WRITTEN BY:            K. F. Evans	October 1986

COGNIZANT PROGRAMMER:  K. F. Evans

REVISION:  
           2-95 - Meredith Cox (CRI) - Made portable for UNIX


PARAMETERS:


INP

Input IBIS tabular file.

OUT

An optional output file. Not used by all methods.

COLS

The columns on which to perform the statistical analysis.

COLNAMES

An optional eight character heading for each column.

OPTION

Keyword to select the analysis method. (SUMMARY,HIST,SCATTER,CORR,BEHRENS, REGRESS,ANOVA,FACTOR,DENSITY)

NOPRINT

Keyword to suppress printout.

BINS

The number of bins. Only used for HISTOGRAM.

SCATSIZE

The size of the plot (x,y). Only used for SCATTER.

DEPCOL

Dependent variable column. Only used for REGRESSION.

DEPNAME

An optional eight character heading for the dependent variable. Only used for REGRESSION.

ERRCOL

The column containing the estimated errors of the dependent data. (optional). Only used for REGRESSION.

REPLICS

The number of replications for each cell. Only used for ANOVA.

BCOLS

The columns for the "B" multivariate sample. Only used in BEHRENS.

NUMPOINT

The number of data points for the A and B samples. Only used in BEHRENS.

DISTRIB

Keyword for the probability distribution. (NORMAL,CHISQ,STUDENTT,FISHERF) Only used in DENSITY.

NDF

The number of degrees of freedom for each column. One per column for CHISQ and STUDENTT, and two per column for FISHERF. Only used in DENSITY.

See Examples:


Cognizant Programmer: