STAT : Elementary Statistics On Symbolic Objects

Françoise and Jacques GOUPIL
LISE-CEREMADE
University Paris IX Dauphine 75775 Paris France
e-mail: goupil@ceremade.dauphine.fr


Presentation of STAT

Abstract of the methods, input and output

    1. Relative frequencies for multinominal variables
    2. Relative frequencies for interval variables
    3. Capacities and min/max/mean for probabilistic multinominal variables
    4. Biplot for interval variables

Installation

Running STAT

Using the listing

Using the graph

Error messages

Referenced documents

  1. Presentation of STAT

    STAT extends to symbolic objects, represented by their descriptions, several "elementary statistics" methods usually limited to conventional data.

    It is a component of the SODAS software package, and as such is designed to run under the SODAS workbench and process SODAS data bases.

    The relevant methods depend on the types of variables found in the SODAS base selected, and are filtered accordingly by the workbench:

    1. relative frequencies for multinominal variables

    2. relative frequencies for interval variables

    3. capacities and min/max/mean for probabilistic multinominal variables

    4. biplot for interval variables

    STAT also covers central object identification, which does not depend on specific variable types.

    The output of the selected methods can be viewed in two ways, as a listing or as a graph, each opened through its dedicated workbench icon.

    The graph can be edited and customized interactively (figures, shapes, colors, texts, comments, ...) and can be copied and saved.

  2. Abstract of the methods, input and output
For a more detailed description, refer to the documents listed in the References section.

For methods a), b) and c) below, the input is always an array with one row per symbolic object (individual) and one column per variable.

a) Relative frequencies for multinominal variables.

    This method computes the relative frequency of each modality of the multinominal variable, taking into account the rules given in the base. The graph associated with the variable distribution can be either a bar chart or a pie chart (see examples below).

    In the following example, the individuals are species of mushrooms described by several variables such as cap presence, cap shape, etc., and two kinds of graph can be associated with them.

    INDIVIDUALS :

    1 S1

    2 S2

    3 S3

    4 S4

    VARIABLES :

    1 Cap presence
        1 present
        2 absent

    2 Cap shape
        1 square
        2 round
        3 triangular
        4 not applicable

    3 Cap color
        1 red
        2 green
        3 white
        4 black
        5 yellow
        6 not applicable

    .....

    MATRIX :

            Var 1   Var 2   Var 3   .....
    Ind 1   1       1,3     3,4
    Ind 2   1       1,3     2,4
    Ind 3   1       1,3     2,3
    Ind 4   2       4       6

    RULES :

    1=2 --> 2=4

    1=2 --> 3=6

    3=4 --> 2=3

    3=4 --> 5=1,3

    3=3 --> 5=3

    Each rule has the form "variable=value --> variable=value(s)". The first rule means that if the cap is absent (value of var 1 = 2), the value of var 2 (cap shape) is "not applicable" (value 4).
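The way rules can enter the computation may be sketched as follows. This is only one plausible reading, not the STAT implementation: it assumes that each object is expanded into all single-valued descriptions allowed by its cells, that descriptions violating a rule are discarded, and that the remaining ones share the object's weight equally. The names `OBJECTS`, `RULES`, `is_coherent` and `relative_frequencies` are hypothetical, and only variables 1 to 3 and the rules involving them are used.

```python
from collections import defaultdict
from itertools import product

# The four mushroom objects above, restricted to variables 1-3.
# Each cell is the set of modality codes the object may take.
OBJECTS = {
    "S1": {1: {1}, 2: {1, 3}, 3: {3, 4}},
    "S2": {1: {1}, 2: {1, 3}, 3: {2, 4}},
    "S3": {1: {1}, 2: {1, 3}, 3: {2, 3}},
    "S4": {1: {2}, 2: {4}, 3: {6}},
}

# Rules as (if_var, if_value, then_var, allowed_values); only the rules
# that involve variables 1-3 are kept, since variable 5 is not listed.
RULES = [
    (1, 2, 2, {4}),  # 1=2 --> 2=4
    (1, 2, 3, {6}),  # 1=2 --> 3=6
    (3, 4, 2, {3}),  # 3=4 --> 2=3
]


def is_coherent(description, rules):
    """Return True if a single-valued description breaks no rule."""
    return all(
        description[if_var] != if_value or description[then_var] in allowed
        for if_var, if_value, then_var, allowed in rules
    )


def relative_frequencies(objects, rules, variable):
    """Average, over objects, the share of coherent descriptions per modality.

    Assumption (not taken from STAT): every coherent description of an
    object carries the same weight.
    """
    freq = defaultdict(float)
    for name, cells in objects.items():
        variables = sorted(cells)
        candidates = (
            dict(zip(variables, values))
            for values in product(*(sorted(cells[v]) for v in variables))
        )
        coherent = [d for d in candidates if is_coherent(d, rules)]
        if not coherent:
            raise ValueError(f"object {name} has no description allowed by the rules")
        for d in coherent:
            freq[d[variable]] += 1.0 / (len(coherent) * len(objects))
    return dict(freq)


cap_presence = relative_frequencies(OBJECTS, RULES, 1)
cap_shape = relative_frequencies(OBJECTS, RULES, 2)
print(cap_presence)
print(cap_shape)
```

Under these assumptions, "present" gets 0.75 and "absent" 0.25 for cap presence. For cap shape, the rule 3=4 --> 2=3 removes the "square and black" descriptions of S1 and S2, so "square" gets 7/24 and "triangular" 11/24.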

b) Relative frequencies for interval variables.

Let X be an interval variable observed on an array of symbolic objects. We can build a histogram for X on the interval [a,b], where a is the lower bound of X and b its upper bound. The relative frequency associated with a class Ck takes into account the overlap between Ck and the interval value of X on each symbolic object.

In the following example, the individuals are wine tasters and the variables are wine "chateaux". Each matrix cell holds the interval of ratings given by a taster to a wine.

INDIVIDUALS :

1 AC

2 BY

3 CG

4 CQ

......

VARIABLES :

1 Ausone

2 Cheval Blanc

3 Cos d'Estournel

4 Ducru-Beaucaillou

5 Haut-Brion

6 L'Evangile

7 Lafite-Rothschild

8 Lafleur

............

MATRIX :

        Var 1   Var 2   ......  Var 7   ......
Ind 1   56:74   75:92   ......  64:82   ......
Ind 2   83:85   89:94   ......  81:92   ......
Ind 3   84:90   86:92   ......  87:90   ......
Ind 4   80:91   85:93   ......  85:91   ......
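The overlap-based computation can be sketched as follows. This is a minimal illustration, not the STAT code: it assumes equal-width classes on [a,b] and assumes that each object spreads a unit weight uniformly over its interval, so that a class receives the fraction of the interval it overlaps. The function name `interval_histogram` and the choice of 5 classes are hypothetical.

```python
def interval_histogram(intervals, n_classes):
    """Relative frequencies of n_classes equal-width classes on [a, b].

    Assumption (not taken from STAT): each object spreads a unit weight
    uniformly over its interval, so a class receives the share of the
    interval that it overlaps.
    """
    a = min(lo for lo, _ in intervals)
    b = max(hi for _, hi in intervals)
    if a == b:
        raise ValueError("all intervals reduce to the same point")
    width = (b - a) / n_classes
    freq = [0.0] * n_classes
    for lo, hi in intervals:
        if lo == hi:
            # Degenerate interval: the whole weight goes to one class.
            k = min(int((lo - a) / width), n_classes - 1)
            freq[k] += 1.0
            continue
        for k in range(n_classes):
            c_lo, c_hi = a + k * width, a + (k + 1) * width
            overlap = max(0.0, min(hi, c_hi) - max(lo, c_lo))
            freq[k] += overlap / (hi - lo)
    return a, b, [f / len(intervals) for f in freq]


# Ratings of variable 1 (Ausone) by the four tasters shown above.
ausone = [(56, 74), (83, 85), (84, 90), (80, 91)]
a, b, freqs = interval_histogram(ausone, n_classes=5)
print(a, b, [round(f, 4) for f in freqs])
```

With these four rows, [a,b] is [56,91] and each class is 7 points wide. Taster 1 (56:74) contributes 7/18 of its weight to each of the first two classes and 4/18 to the third, so a wide interval is shared among several bars instead of being counted once.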


 
 
 
 

c) Capacities and min/max/mean for probabilistic multinominal variables.

The values of a probabilistic multinominal variable V on the symbolic objects SO1, SO2, ..., SOn are, for example:

SO1 --> p11 M1, p12 M2, p13 M3   with p11 + p12 + p13 = 1

SO2 --> p21 M1, p22 M2, p23 M3   with p21 + p22 + p23 = 1

.

.

SOn --> pn1 M1, pn2 M2, pn3 M3   with pn1 + pn2 + pn3 = 1

In a capacity histogram, the capacity of a modality is the union capacity. Thus the capacity of (SO1 and SO2) for M1 is p11 + p21 - p11 * p21, and the capacity of SO1, SO2, ..., SOn for M1 is obtained by applying this operation repeatedly, using its associativity.

A min/mean/max graph associates with each modality a kind of boxplot that represents the range and the mean of the probabilities of that modality.

The min value associated with M1 is the minimum of p11, p21, ..., pn1.

The mean value associated with M1 is the average of p11, p21, ..., pn1.

The max value associated with M1 is the maximum of p11, p21, ..., pn1.

In the following example, the individuals are areas of England and the variables deal with household equipment. Variable number 9 is a probabilistic multinominal variable whose modalities are 1, 2, 3, 4. The matrix gives the probability distribution over the modalities for each area.

INDIVIDUALS :

1 Northern metropolitan

2 North non-metropolitan

3 Yorks and humberside metropoli

4 Yorks and humberside non-metro

5 East midlands non-metropolitan

6 North west metropolitan

....

VARIABLES :

....

2 Central heating in property

....

....

9 QWEtelephone

1 AJ01 v=0

2 AJ02 v1-5

3 AJ03 v6-10

4 AJ04 v>10

....

MATRIX :

        ..... Var 9 .....
Ind 1   1(0.140), 2(0.640), 3(0.166), 4(0.053)
Ind 2   1(0.126), 2(0.561), 3(0.234), 4(0.076)
Ind 3   1(0.116), 2(0.576), 3(0.220), 4(0.087)
Ind 4   1(0.095), 2(0.600), 3(0.235), 4(0.070)
Ind 5   1(0.114), 2(0.590), 3(0.215), 4(0.079)
Ind 6   1(0.141), 2(0.541), 3(0.227), 4(0.089)
....
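As an illustration, the union capacity and the min/mean/max values of modality 1 can be computed from the six rows shown above. The sketch below folds the pairwise formula p + q - p * q over the objects, which is equivalent to 1 minus the product of the (1 - p) terms. The names `union`, `union_capacity` and `min_mean_max` are hypothetical, and the figures cover only the six areas listed, not the full base.

```python
from functools import reduce

# Probabilities of modality 1 of variable 9 for the six areas shown above.
p_m1 = [0.140, 0.126, 0.116, 0.095, 0.114, 0.141]


def union(p, q):
    """Union capacity of two probabilities: p + q - p * q."""
    return p + q - p * q


def union_capacity(probabilities):
    """Fold the pairwise union over all objects (0 is the neutral element)."""
    return reduce(union, probabilities, 0.0)


def min_mean_max(probabilities):
    """Values drawn by the min/mean/max graph for one modality."""
    mean = sum(probabilities) / len(probabilities)
    return min(probabilities), mean, max(probabilities)


print(round(union(p_m1[0], p_m1[1]), 5))  # capacity of the first two areas
print(round(union_capacity(p_m1), 4))     # capacity of all six areas
print(min_mean_max(p_m1))
```

For these six rows the capacity of modality 1 is about 0.542, while its min, mean and max are 0.095, 0.122 and 0.141. The capacity grows with the number of objects, which is why it is much larger than any single probability.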


 
 
 
 
 
 
 
 
 
 
 
 
 
 

d) Biplot for interval variables

This graph represents each symbolic object of the array as a rectangle in the plane of two variables chosen by the user. Each side of the rectangle is the range of the corresponding axis variable on the symbolic object.

In the following example, the individuals are breeds of dogs and the variables chosen for the biplot representation are the weight and the neck height.

INDIVIDUALS :

1 Caniche

2 Chihuahua

3 Pekinois

.....

9 Mastiff

.....

13 SaiBer

VARIABLES :

1 Hauteur du garrot (neck height in cm)

2 Poids (weight in kg)

........

MATRIX :

         Var 1   Var 2     .....
Ind 1    20:35   15:25
Ind 2    16:20   0.9:3.5
Ind 3    20:25   3:5
.....
Ind 9    75:75   100:100
.....
Ind 13   70:70   55:80
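The geometry of the biplot can be sketched as follows. Since the biplot involves no computation beyond reading the interval bounds, the sketch only converts each pair of intervals into a rectangle given as (x, y, width, height). Putting Var 1 on the horizontal axis and Var 2 on the vertical axis is an assumption, and the names `dogs` and `biplot_rectangles` are hypothetical.

```python
# Interval values (Var 1, Var 2) for the objects shown above.
dogs = {
    "Caniche": ((20, 35), (15, 25)),
    "Chihuahua": ((16, 20), (0.9, 3.5)),
    "Pekinois": ((20, 25), (3, 5)),
    "Mastiff": ((75, 75), (100, 100)),
    "SaiBer": ((70, 70), (55, 80)),
}


def biplot_rectangles(objects):
    """Map each object to (x, y, width, height) in the plane of the two variables."""
    return {
        name: (x_lo, y_lo, x_hi - x_lo, y_hi - y_lo)
        for name, ((x_lo, x_hi), (y_lo, y_hi)) in objects.items()
    }


rects = biplot_rectangles(dogs)
for name, rect in rects.items():
    print(name, rect)
```

Objects whose intervals reduce to a single value give degenerate rectangles: Mastiff (75:75, 100:100) becomes a point and SaiBer (70:70, 55:80) a vertical segment.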


 
 
 
 
 
 

  3. Installation

    STAT comes as part of the SODAS package and does not need any specific installation task.

  4. Running STAT
Because STAT is run through the workbench, the following steps have to be followed (see the relevant workbench documentation for details):

4.1. Insert the STAT method in the chaining

N.B.: The term "method" may be confusing: although STAT incorporates various methods statistically speaking, it is seen as a single "method" by the workbench, which manages it as a whole.

From here on, the term "method" applies to each individual method inside STAT.

4.2. Parametrize STAT

This is done in two steps, involving two dialog boxes:

    1. selecting the variables
    2. selecting the method to apply

The workbench prompts for one or several methods, depending on the types of variables.

4.3. Execute STAT

This is done within the chaining execution managed by the workbench.

4.4. Look at the output

After execution, the listing icon and the graph icon will show up next to the STAT icon. There are two exceptions:

The next two sections indicate how to use the listing and the graph.

  5. Using the listing

    The listing contains all the output from the method execution, whereas it generally takes several graphs to scan all the results (roughly, as many graphs as variables).

    A header indicates the SODAS base used, the method selected and the date and time of execution, as shown below:

    --------------------------------------------------------------------

    SODAS - STAT RELATIVE FREQUENCIES (MODAL) Nov 01 1999 18:36

    File: MUSHROO3.SDS

    Title: Mushrooms

    --------------------------------------------------------------------

    Cap presence

    pres present 0.7500

    .........

    The header is followed by detailed results which depend on the method selected.

    Necessary identification is ensured by copying titles, labels and short IDs from the SODAS base.

    N.B.: The biplot is not a method in itself, since it involves no computation and its only goal is to display a graph of the objects (as boxes); a listing is therefore irrelevant, for lack of results. Nevertheless, a listing is created in order to keep track of what has been done and to list the objects involved, for information.

    Conversely, there is no graph for the central object, so that the listing is the only way to get the results.

  6. Using the graph
The STAT graph offers many options, some of which may not apply depending on the selected method and the operation under way.

It is driven by a menu, which makes most functions self-explanatory. Moreover, some functions are the usual ones, such as Saving, Copying, Printing...

The menu items are managed dynamically so that, according to the context, non-applicable items are disabled and the labels of some items may change.

For ease of use, menu items likely to be used repeatedly are mirrored in toolbar buttons (e.g. the graph refresh button), as are those commonly found in toolbars (e.g. Save, Copy, ...).

This section only details the items that are specific to STAT.

6.1 Menu

Since the menu is divided into groups of logically related functions, this description follows the menu sequence.

1. File

Open: recalls a previously saved graph (see the Save option below).

A file selection dialog box pops up. The type of file is .GRF.

Recalling a graph implies leaving the current one; a message box asks whether to save it first.

The recalled graph keeps the same look and the same editing and handling facilities as the original one, and can in turn be saved after editing.

Save: saves the current graph.

The file name is taken from the SODAS file name, with the extension .GRF.

To save under another name, use the Save As option below.

This file may later be recalled with the Open option (see above) and reused as if the graph had never been left.

Save As: saves the current graph in either of two formats:

a. the internal STAT format .GRF, like Save above

b. the standard Windows bitmap format .BMP

A file selection dialog box pops up. Proper formats are enforced by the program.

Standard printing commands.

STAT supports both portrait and landscape layouts.

STAT prompts for saving the graph before exiting.

2. Edit

As usual.

3. View

As usual.

4. Process

Displays a list box of the variables for selection.

Only the variables previously selected in the workbench are available.

This is a context sensitive item.

Opens a scrollable window listing the symbolic objects from the SODAS base.

In addition, clicking on an object in that list displays its values for the applicable variables.

Applies only to Biplot.

Lets you select which objects are displayed on the graph (by default, all objects are displayed).

Displays the listing in a scrollable window.

This avoids having to go back to the workbench to look at the listing when already in the graph.

5. Draw

The pie chart is one of two ways to represent the relative frequencies for multinominal variables, the other being the bar chart.

This is a context sensitive item.

This is a dual-purpose item for bar charts:

    a) bar chart of relative frequencies for multinominal variables
    b) bar chart of union capacities for probabilistic variables

This is a context sensitive item, with label swapping.

It is another way to represent the probabilistic variables.

This is a context sensitive item.

Displays values on top of the bars in either the histogram or the bar charts (by default, no value is displayed).

This is a context sensitive item.

Self-explanatory.

This is a context sensitive item.

The graph aims to fill the available display area as fully as possible, so it is scaled to the actual ranges found in the objects rather than to the theoretical ranges derived from the variable limits.

As a result, if the leftmost horizontal limit is positive, the origin will not appear in the graph.

This item then forces the origin to be included.

This is a context sensitive item.

Similarly, for bars showing values always lower than or equal to 1 (probabilities), the topmost value may be lower than 1.

This item then forces the vertical range to extend up to 1.

This is a context sensitive item.

For graphs showing object labels, i.e. the biplot and the histogram of the relative frequencies for multinominal variables, this item toggles between the regular label and the four-character short ID.

Using the short ID improves graph readability when labels are rather long.

By default, the short ID is selected.

This is a context sensitive item.

Applies only to Biplot.

By default, the labels are displayed, starting above the top-left corner of the boxes.

This is a context sensitive item.

Lets you insert text in the graph at the position indicated by a mouse click (the mouse cursor changes for this purpose).

A dialog box shows up, with an editing zone and a pushbutton giving access to the Windows font and text color selection dialogs.

When returning from the dialog, the text is displayed.

The text can be moved, changed and deleted, all through mouse actions (see relevant section later in this document).

Note that almost any text in the graph, whether inserted by the program or by the user, may be handled in this way; the only exception is the scale markers, which are static.

Redraws the entire graph.

An alternate and quicker way is to use the associated toolbar button or, even quicker, the Space key.

6. Help

As usual. Displays an abstract of the editing functions using the mouse and the keyboard.

See details in the following section.

6.2 Mouse and keyboard

Help on the mouse and keyboard actions can be displayed on-line at any time by selecting the relevant item in the Help popup menu.

One of the following windows is displayed, depending on whether the current graph is a biplot or not.


 
 

  7. Error messages

    The way STAT reports errors depends on which main phase it is currently in:

    1. In the method execution phase, it reports errors in the LOG file, according to the workbench requirements.

    This file can be accessed via the listing icon, which in case of error is red-crossed and opens the LOG file instead of the listing file.

    2. In the graphics phase, it can no longer report errors in the LOG file: the LOG is an alternative to the listing file in case of errors during method execution, not a complement to it, so it is "too late".

    It then reports errors by way of message boxes.

  8. References
P. Bertrand, F. Goupil - "Descriptive Statistics for Symbolic Objects" - Symbolic Data Analysis in SODAS, to appear in 2000. Editors: Diday, Bock, Bertrand. Springer-Verlag.

J.L. Blanchard, H. Augendre - "Introduction de méthodes symboliques dans un logiciel statistique classique" - EDF-DER (1993).

A. Chouakria, P. Cazes, E. Diday - "Codage de variables intervalles en vue d'une analyse factorielle des correspondances multiples" - Journées ASU, Carcassonne (1997).

E. Diday, R. Emilion - "Treillis et capacités en analyse d'objets probabilistes" (1996).