Rodriguez OLDEMAR
LISE-CEREMADE
University Paris IX Dauphine 75775 Paris France
e-mail: florita.CESPEDES@wanadoo.fr
1 Presentation of PYR
2 The input
3 The parameters definitions
3.2 The parameter definition
5 Error messages
References
The pyramidal clustering model generalizes hierarchies by allowing
non-disjoint classes at a given level instead of a partition. Moreover, the
clusters of the pyramid are intervals of a total order on the set being
clustered; pyramids therefore constitute an intermediate model between
tree and lattice structures. The proposed method can cluster more complex
data than the tabular model can process, by taking into account the
variation in the values taken by the variables. The pyramid is built by
an agglomerative bottom-up algorithm. In symbolic pyramidal clustering,
each cluster formed is defined not only by the set of its elements – its
extension – but also by a symbolic object that describes their properties
– the intension of the cluster. The intension is inherited from predecessor
to successor, which yields an inheritance structure. The order structure
makes it possible to identify intermediate concepts, that is, concepts that
bridge the gap between well-identified classes.
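The interval property described above can be illustrated with a minimal Python sketch. It is not part of SODAS; the function names and the four-element example are hypothetical and serve only to show what "clusters are intervals of a total order" means, and how overlapping clusters remain admissible.

```python
def is_interval(cluster, order):
    """Return True if the (non-empty) cluster occupies consecutive positions in `order`."""
    position = {element: i for i, element in enumerate(order)}
    indices = sorted(position[e] for e in cluster)
    # Distinct indices are consecutive exactly when their span equals their count.
    return indices[-1] - indices[0] + 1 == len(indices)


def compatible_with_order(clusters, order):
    """Return True if every cluster is an interval of the given total order."""
    return all(is_interval(c, order) for c in clusters)


# Hypothetical example: four elements and a family of clusters.
order = ["a", "b", "c", "d"]
pyramid_clusters = [
    {"a", "b"},            # overlaps the next cluster on "b":
    {"b", "c"},            # allowed in a pyramid, not in a hierarchy
    {"a", "b", "c"},
    {"c", "d"},
    {"a", "b", "c", "d"},
]

print(compatible_with_order(pyramid_clusters, order))  # True
print(is_interval({"a", "c"}, order))                  # False: "b" is skipped
```

The clusters {a, b} and {b, c} share an element, which a hierarchy would forbid, yet both are intervals of the order a, b, c, d.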
The input of the PYR method is either a symbolic data matrix or
a distance matrix. If the input is a symbolic data matrix, the output
is a symbolic pyramid; if the input is a distance matrix, the output
is a numerical (classical) pyramid. A symbolic data matrix is a
data array that records complex data types (symbolic data types), where
a complex datum can be a set of real values, a set of intervals, a set
of nominal categories or a set of probability distributions. This data
matrix is saved in an ASCII file whose extension is .SDS.
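To make the kinds of cell values concrete, here is a small Python sketch of one row of such a matrix. It is only an in-memory illustration under stated assumptions: it does not reproduce the SDS file format, and the class `Interval`, the helper `is_distribution` and the variable names `colour` and `size` are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """An interval of real values [low, high]."""
    low: float
    high: float

    def __post_init__(self):
        if self.low > self.high:
            raise ValueError("interval lower bound exceeds upper bound")


def is_distribution(weights, tol=1e-9):
    """Return True if the weights are non-negative and sum to 1."""
    values = list(weights.values())
    return all(p >= 0 for p in values) and math.isclose(sum(values), 1.0, abs_tol=tol)


# One hypothetical symbolic object described by three variables of different types.
row = {
    "position1": Interval(0.2, 1.5),          # interval variable
    "colour": frozenset({"red", "blue"}),     # set of nominal categories
    "size": {"small": 0.3, "large": 0.7},     # probability distribution
}

print(is_distribution(row["size"]))  # True
```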
If the input is a symbolic data matrix, the aggregation criterion is always the "Generality Degree"; if the input is a distance matrix, the aggregation criterion is the "Maximum".
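The "Maximum" criterion is not defined further in this document. The sketch below assumes it denotes the usual maximum (complete-linkage) index, in which the dissimilarity between two clusters is the largest pairwise distance between their elements. The function name and the 4 x 4 distance matrix are hypothetical.

```python
def maximum_aggregation(dist, cluster_a, cluster_b):
    """Largest pairwise distance between an element of cluster_a and one of cluster_b.

    `dist` is a symmetric matrix (list of lists); clusters are lists of row indices.
    Assumption: this is what the "Maximum" criterion refers to.
    """
    return max(dist[i][j] for i in cluster_a for j in cluster_b)


# Hypothetical distance matrix on four objects.
dist = [
    [0.0, 1.0, 4.0, 6.0],
    [1.0, 0.0, 2.0, 5.0],
    [4.0, 2.0, 0.0, 3.0],
    [6.0, 5.0, 3.0, 0.0],
]

print(maximum_aggregation(dist, [0, 1], [2]))     # 4.0
print(maximum_aggregation(dist, [0, 1], [1, 2]))  # 4.0 (overlapping clusters are allowed)
```

In an agglomerative construction, the pair of clusters with the smallest such value would be merged at each step, subject to the ordering constraint of the pyramid.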
As an example, we take the symbolic data matrix stored in the file WAVE.SDS;
this matrix is composed of 30 symbolic objects described by 21 symbolic
interval variables. The "BASE" (see Figure 1) is the file WAVE.SDS.
FIGURE 1. Chaining example
The PYR method also takes as input a list of variables and some
parameters, which we now define.
The user must choose the variables that will be used to build
the pyramid. The variables can be of continuous type (real values), interval
type (intervals of real values) or histogram type (a set of probability
distributions), and they can be mixed in the same SDS file.
To choose the variables in the SODAS PYR variable selection window
(see Figure 2), the user chooses between qualitative and continuous
variables; the two kinds can be mixed. In the wave example, the 21 continuous
variables position1 to position21 are selected.
FIGURE 2. The SODAS window for the selection of the variables
FIGURE 3. The Pyramid SODAS window for the definition of the parameters
FIGURE 4. The chaining after running the PYR method
The listing contains:
2.="wave_1_1"
3.="wave_1_6"
4.="wave_1_9"
5.="wave_1_5"
6.="wave_1_7"
7.="wave_1_2"
8.="wave_1_8" ..............
y2.=position 2
y3.=position 3
y4.=position 4
y5.=position 5
y6.=position 6
y7.=position 7
y8.=position 8
The graphical output is shown in Figure 5. With
this editor, the user can change the scale of the pyramid, print the pyramid
and open the SDS file associated with the pyramid.
FIGURE 5. Graphical output of the PYR method
FIGURE 6. Error message of SODAS when the PYR method fails
To find out what the error is, the user should click on the error icon; the program then shows one of the following messages:
"The SDS file does not have a dissimilarity matrix (TRIANGLE_MATRIX)"
This message means that the user selected the option to build a numerical pyramid, but the SDS file does not contain a distance matrix.
This message means that the user selected fewer than 2 variables, but the algorithm needs at least 2 variables to run.
"Then it is impossible to use the Degree of Generality in this case."
This message means that there are variables of interval type and that some value taken by one of these variables has its minimum equal to its maximum; the Degree of Generality cannot be computed in this case because the computation produces an indeterminate 0/0.
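A user who wants to detect this situation before running the method could scan the interval data for degenerate intervals. The Python sketch below is only an illustration and is not part of SODAS: it assumes the interval values have already been loaded into a dictionary, and the function name and the numeric values are hypothetical.

```python
def find_degenerate_intervals(rows):
    """Return (object name, variable number) pairs whose interval has min == max.

    `rows` maps each object name to a list of (minimum, maximum) pairs,
    one pair per interval variable, numbered from 1.
    """
    problems = []
    for name, intervals in rows.items():
        for var_number, (low, high) in enumerate(intervals, start=1):
            if low == high:
                problems.append((name, var_number))
    return problems


# Hypothetical values for two objects described by two interval variables.
data = {
    "wave_1_1": [(0.2, 1.5), (0.7, 0.7)],  # second interval is degenerate
    "wave_1_6": [(0.1, 0.9), (0.3, 1.1)],
}

print(find_degenerate_intervals(data))  # [('wave_1_1', 2)]
```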