PYR : Pyramidal Clustering

Rodriguez OLDEMAR
LISE-CEREMADE
University Paris IX Dauphine 75775 Paris France
e-mail: florita.CESPEDES@wanadoo.fr


1 Presentation of PYR

2 The input

3 The parameter definitions

3.1 List of the selected variables

3.2 The parameter definition

4 The output

5 Error messages

6 References

  1. Presentation of PYR


     
     

    The pyramidal clustering model generalizes hierarchies by allowing non-disjoint classes at a given level instead of a partition. Moreover, the clusters of a pyramid are intervals of a total order on the set being clustered; pyramids are therefore an intermediate model between tree and lattice structures. The proposed method can cluster more complex data than the classical tabular model can handle, because it takes into account variation in the values taken by the variables. The pyramid is built by an agglomerative bottom-up algorithm. In symbolic pyramidal clustering, each cluster is defined not only by the set of its elements (its extension) but also by a symbolic object that describes their properties (the intension of the cluster). The intension is inherited from predecessor to successor, which leads to an inheritance structure. The order structure makes it possible to identify intermediate concepts, that is, concepts that bridge the gap between well-identified classes.
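    The statement that every cluster is an interval of a total order can be checked mechanically. The following is a minimal Python sketch, not part of SODAS; the function names `is_interval` and `is_compatible` and the toy data are assumptions made for illustration only.

```python
def is_interval(cluster, order):
    """Return True if the members of `cluster` are consecutive in `order`."""
    if not cluster:
        return True
    positions = sorted(order.index(x) for x in cluster)
    # Distinct positions are consecutive exactly when their span equals their count.
    return positions[-1] - positions[0] == len(positions) - 1


def is_compatible(clusters, order):
    """Return True if every cluster is an interval of the given total order."""
    return all(is_interval(c, order) for c in clusters)


# Toy example: four objects and a family of overlapping clusters.
order = ["a", "b", "c", "d"]
clusters = [{"a", "b"}, {"b", "c"}, {"a", "b", "c"}, {"c", "d"}, {"a", "b", "c", "d"}]

print(is_compatible(clusters, order))   # True
print(is_interval({"a", "c"}, order))   # False: "b" is skipped
```

    The clusters {a, b} and {b, c} overlap without being nested, which a hierarchy would not allow but a pyramid does, provided both remain intervals of the same order.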
     
     

  2. The input

    The input of the PYR method is a symbolic data matrix or a distance matrix. If the input is a symbolic data matrix, the output is a symbolic pyramid; if the input is a distance matrix, the output is a numerical (classical) pyramid. A symbolic data matrix is a data array that records complex (symbolic) data types, where a complex value can be a set of real values, a set of intervals, a set of nominal categories or a set of probability distributions. This data matrix is saved in an ASCII file with the extension .SDS.

    If the input is a symbolic data matrix, the aggregation criterion is always the "Generality Degree"; if the input is a distance matrix, the aggregation criterion is the "Maximum".
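    This manual does not define the "Maximum" criterion further. Assuming it denotes the largest pairwise dissimilarity between the members of the two clusters being merged (the complete-linkage index), it can be sketched in Python as follows; the function name `max_aggregation` and the distance matrix are hypothetical.

```python
def max_aggregation(dist, cluster_a, cluster_b):
    """Largest pairwise distance between members of the two clusters.

    Assumption: this is what the "Maximum" criterion computes.
    """
    return max(dist[i][j] for i in cluster_a for j in cluster_b)


# Hypothetical symmetric distance matrix on four objects (indices 0 to 3).
dist = [
    [0.0, 1.0, 4.0, 6.0],
    [1.0, 0.0, 2.0, 5.0],
    [4.0, 2.0, 0.0, 3.0],
    [6.0, 5.0, 3.0, 0.0],
]

# In a pyramid the two clusters may share an element (here object 1).
print(max_aggregation(dist, [0, 1], [1, 2]))  # 4.0
print(max_aggregation(dist, [0, 1], [2, 3]))  # 6.0
```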

    As an example, we take the symbolic data matrix stored in the file WAVE.SDS; this matrix is composed of 30 symbolic objects described by 21 symbolic interval variables. The "BASE" (see FIGURE 1) is the file WAVE.SDS.
     
     

    FIGURE 1. Chaining example


  3. The parameter definitions

    The PYR method also takes as input a list of variables and some parameters, which are defined below.

    3.1 List of the selected variables

      The user must choose the variables that will be used to build the pyramid. The variables can be of continuous type (real values), interval type (intervals of real values) or histogram type (a set of probability distributions), and they can be mixed in the same SDS file.

      To choose the variables in the variable selection window of SODAS PYR (see FIGURE 2), the user chooses between qualitative and continuous variables; the two kinds can be mixed. In the wave example, the 21 continuous variables position1 to position21 are selected.
       
       


      FIGURE 2. The SODAS window for the selection of the variables


    3.2 The parameter definition

      Three parameters have to be defined:

      1. PYR_SAT: if PYR_SAT=0 the program builds a non-saturated pyramid; if PYR_SAT=1 the program builds a saturated pyramid, that is, a pyramid with the maximum number of nodes.
      2. TYPE_PYR: if TYPE_PYR=0 the program builds a symbolic pyramid; if TYPE_PYR=1 the program builds a numerical pyramid.
      3. VAR_ORDER: if VAR_ORDER=0 the algorithm finds an order compatible with the pyramid; if VAR_ORDER=N the algorithm builds a pyramid with the order given a priori by variable number N. This variable should be ordinal.

The three parameters above can be selected in the dialog box shown in FIGURE 3.
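The three parameters and the constraints stated in this manual can be summarised in a small Python sketch. It is only an illustration: the class `PyrParameters`, its methods and the arguments `n_variables` and `has_distance_matrix` are hypothetical and are not part of SODAS, which sets these values through the dialog box.

```python
from dataclasses import dataclass


@dataclass
class PyrParameters:
    """Hypothetical container for the three PYR parameters."""
    pyr_sat: int = 0    # 0: non-saturated pyramid, 1: saturated pyramid
    type_pyr: int = 0   # 0: symbolic pyramid, 1: numerical pyramid
    var_order: int = 0  # 0: order found by the algorithm, N: order given by variable N

    def validate(self, n_variables, has_distance_matrix):
        """Raise ValueError if the values contradict the rules in this manual."""
        if self.pyr_sat not in (0, 1):
            raise ValueError("PYR_SAT must be 0 or 1")
        if self.type_pyr not in (0, 1):
            raise ValueError("TYPE_PYR must be 0 or 1")
        # Assumption: variable numbers run from 1 to n_variables.
        if not 0 <= self.var_order <= n_variables:
            raise ValueError("VAR_ORDER must be 0 or a variable number")
        if self.type_pyr == 1 and not has_distance_matrix:
            raise ValueError("a numerical pyramid needs a distance matrix")

    def aggregation_criterion(self):
        """Criterion implied by the pyramid type, as described in the input section."""
        return "Generality Degree" if self.type_pyr == 0 else "Maximum"


# Wave example: symbolic, non-saturated pyramid with an order found by the algorithm.
params = PyrParameters(pyr_sat=0, type_pyr=0, var_order=0)
params.validate(n_variables=21, has_distance_matrix=False)
print(params.aggregation_criterion())  # Generality Degree
```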

FIGURE 3. The Pyramid SODAS window for the definition of the parameters


 
 
  4. The output

After the parameters have been defined and the PYR method has been run, a listing is given as output (see FIGURE 4). The user can also run the pyramid graphical editor by double-clicking the pyramid icon to see the graph of the pyramid.

FIGURE 4. The chaining after running the PYR method



The listing contains:

    1.="wave_1_3"
    2.="wave_1_1"
    3.="wave_1_6"
    4.="wave_1_9"
    5.="wave_1_5"
    6.="wave_1_7"
    7.="wave_1_2"
    8.="wave_1_8" ..............

    y1.=position 1
    y2.=position 2
    y3.=position 3
    y4.=position 4
    y5.=position 5
    y6.=position 6
    y7.=position 7
    y8.=position 8

    P84=[y1=position_1=[-3.290,2.830]]^[y2=[-2.940,3.260]]^[y3=[ 0.080,3.310]]......... Ext(P84)={2,21,29,30}
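A node line such as the one for P84 carries both the intension (one interval per variable) and the extension (the numbers of the objects in the cluster). The Python sketch below shows one way to read such a line. It assumes the layout seen in this excerpt only; the function name `parse_node` and the regular expressions are hypothetical, and the full listing format may differ.

```python
import re

# One "[yK=...=[low,high]]" term; the variable label after yK is optional.
INTERVAL = re.compile(
    r"\[(y\d+)=(?:[^\[\]=]+=)?\[\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\]\]"
)
# The "Ext(Pn)={...}" part listing the object numbers.
EXTENSION = re.compile(r"Ext\((P\d+)\)=\{([^}]*)\}")


def parse_node(line):
    """Return (node name, {variable: (low, high)}, [object numbers])."""
    intension = {var: (float(lo), float(hi)) for var, lo, hi in INTERVAL.findall(line)}
    ext = EXTENSION.search(line)
    if ext is None:
        return None, intension, []
    members = [int(x) for x in ext.group(2).split(",") if x.strip()]
    return ext.group(1), intension, members


# The line from the listing, with its elided part left as dots.
line = ("P84=[y1=position_1=[-3.290,2.830]]^[y2=[-2.940,3.260]]"
        "^[y3=[ 0.080,3.310]]......... Ext(P84)={2,21,29,30}")

name, intension, members = parse_node(line)
print(name)       # P84
print(intension)  # {'y1': (-3.29, 2.83), 'y2': (-2.94, 3.26), 'y3': (0.08, 3.31)}
print(members)    # [2, 21, 29, 30]
```

Only the three intervals visible in the excerpt are recovered; the terms replaced by dots in the listing are not reconstructed.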

The graphical output is shown in FIGURE 5. With this editor the user can change the scale of the pyramid, print the pyramid and open the SDS file associated with the pyramid.
 
 

FIGURE 5. Graphical output of the PYR method

  5. Error messages
If an error occurs during the execution of the PYR method, SODAS shows the following message:
 
 
 
 

FIGURE 6. Error message of SODAS when the PYR method fails

To find out the possible cause, the user should click the error icon; the program then shows one of the following messages:

  1. "ERROR "
"The algorithm fails"

"The SDS file does not have a dissimilarity matrix (TRIANGLE_MATRIX)"

This message means that the user selected the option to build a numerical pyramid, but the SDS file does not contain a distance matrix.

  2. "ERROR "
"You have to select at least 2 variables."

This message means that the user selected fewer than 2 variables, but the algorithm needs at least 2 variables to run.

  3. "ERROR "
"SODAS file has an interval variable with a=b for each individual"

"Then it is impossible to use the Degree of Generality in this case."

This message means that there are variables of interval type and some value taken by one of these variables has the minimum of the interval equal to the maximum, so the Degree of Generality cannot be computed because this produces a 0/0.

  4. "ERROR "
"SODAS file does not have the correct format or the type of the variables are not supported by this version."

There are two possible mistakes that produce this message: either the SDS file has variables that are not of continuous, interval or histogram type, or there was a problem in the SODAS parsing of the file.
 
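The 0/0 behind the interval-variable error message can be illustrated with a short Python sketch. This manual does not give the formula for the Degree of Generality; the sketch assumes a definition commonly used for interval variables, namely the product over the variables of the interval length divided by the length of the variable's domain. The function name `generality_degree`, the way the domain is supplied and the numbers are all hypothetical.

```python
def generality_degree(intervals, domains):
    """Product over variables of (interval length / domain length).

    Assumption: this is the definition applied to interval variables.
    """
    degree = 1.0
    for (low, high), (dom_low, dom_high) in zip(intervals, domains):
        domain_length = dom_high - dom_low
        if domain_length == 0:
            # Every observed interval has a = b at the same point: 0/0.
            raise ZeroDivisionError("interval variable with a zero-length domain")
        degree *= (high - low) / domain_length
    return degree


# Hypothetical description on two interval variables with domains [0, 10] and [0, 4].
print(generality_degree([(2.0, 7.0), (1.0, 3.0)], [(0.0, 10.0), (0.0, 4.0)]))  # 0.25

# Degenerate case: the second variable has a = b = 3 for every individual.
try:
    generality_degree([(2.0, 7.0), (3.0, 3.0)], [(0.0, 10.0), (3.0, 3.0)])
except ZeroDivisionError as err:
    print("cannot compute:", err)
```

Under this assumption a description covering half of each domain has degree 0.5 * 0.5 = 0.25, while a variable whose domain collapses to a single point leaves the ratio undefined.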
 
  6. References

Bertrand P. Etude de la représentation pyramidale. Thèse de 3e cycle, Université Paris IX Dauphine, 1986.

Bertrand P. and Diday E. Une généralisation des arbres hiérarchiques: les représentations pyramidales. Statistique Appliquée, (3), 53-78, 1990.

Brito P. Analyse de données symboliques: pyramides d'héritage. Thèse de doctorat, Université Paris IX Dauphine, 1991.

Brito P. Symbolic pyramidal clustering. Indo-French Workshop on Symbolic Data Analysis and its Applications, Université Paris IX Dauphine, 1997.

Diday E. Une représentation visuelle des classes empiétantes. Rapport INRIA n. 291, Rocquencourt 78150, France, 1984.

Mfoumoune E. Les aspects algorithmiques de la classification ascendante pyramidale et incrémentale. Thèse de doctorat, Université Paris IX Dauphine, 1998.