User Manual of the SODAS method DI:

Dissimilarity Measures and Matching Operators
 
 

A Tool for Data Warehouse and Data Mining

Donato MALERBA
Dipartimento di Informatica, University of Bari, Italy





1 Presentation of DI

2 The input

3 The parameters definitions

        3.1 List of the selected variables

        3.2 The output file definition

        3.3 The parameter definition

4 The output

5 Bibliography

1 Presentation of DI

Several data analysis techniques are based on defining and quantifying a dissimilarity measure between the underlying objects. The DI method (Dissimilarities & Matching) implements several dissimilarity measures between Boolean symbolic objects (BSOs), as well as a canonical and a flexible matching operator. Matching is the process of comparing two or more structures to discover their likeness or difference. Defining matching operators for BSOs is considered important for the development of several symbolic data analysis techniques, such as factor analysis.
     

2 The input

The input of the DI method is a SODAS file containing a data matrix of BSOs.

We take as an example the symbolic transformation of the waveform recognition data proposed by L. Breiman, J.H. Friedman, R.A. Olshen and C.J. Stone in "Classification and Regression Trees"; Belmont Eds, 1984.

The symbolic data matrix is composed of 30 symbolic objects described by 21 symbolic interval variables. The "BASE" (see FIG 1) is the file wave30.sds.

FIG 1. Chaining example


3 The parameters definitions

The DI method also takes as input a list of variables and some parameters, which are defined below.
     

3.1 List of the selected variables

The user must choose the variables that will be used to compute either the dissimilarity matrix or the matching operator. When choosing a list of variables, take care that:

The variable selection window is shown in FIG 2. In the waveform example, the 21 continuous variables position1 to position21 are selected.

FIG 2. The window for the selection of the variables

3.2 The output file definition

When a dissimilarity measure is selected, the user must also define an output SODAS file (see the button "File created …" in FIG 2). Conversely, no output file needs to be specified when a matching operator is selected.
     

3.3 The parameter definition

The Workbench enables the specification of both the chosen dissimilarity measure and some related parameters (see Table 1).

Table 1. Dissimilarity measures available in the DI method for BSOs, and related parameters.

Dissimilarity measure                       Parameters                    Constraints         Default
U_1 (Gowda & Diday)                         none
U_2 (Ichino & Yaguchi)                      Gamma                         [0 .. 0.5]          0.5
                                            Order of power                1 .. 10
U_3 (Normalized Ichino & Yaguchi)           Gamma                         [0 .. 0.5]          0.5
                                            Order of power                1 .. 10             2
U_4 (Weighted Normalized Ichino & Yaguchi)  Gamma                         [0 .. 0.5]          0.5
                                            Order of power                1 .. 10             2
                                            List of weights per variable  Sum(weights) = 1.0  Equal weights
C_1 (Normalized De Carvalho)                Comparison function           D1, D2, D3, D4, D5  D1
                                            Order of power                1 .. 10             2
SO_1 (De Carvalho)                          Comparison function           D1, D2, D3, D4, D5  D1
                                            Order of power                1 .. 10             2
                                            List of weights per variable  Sum(weights) = 1.0  Equal weights
SO_2 (De Carvalho)                          Gamma                         [0 .. 0.5]          0.5
                                            Order of power                1 .. 10             2
SO_3 (De Carvalho)                          Gamma                         [0 .. 0.5]          0.5
                                            Order of power                1 .. 10             2
SO_4 (Normalized De Carvalho)               Gamma                         [0 .. 0.5]          0.5
                                            Order of power                1 .. 10             2
SO_5 (Normalized De Carvalho)               Gamma                         [0 .. 0.5]          0.5
                                            Order of power                1 .. 10             2
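As a rough illustration of how Gamma, the order of power and the optional weights enter the Ichino & Yaguchi family of measures, the sketch below computes the dissimilarity for interval variables. It assumes the commonly cited form of the component-wise measure, phi(A, B) = |A join B| - |A meet B| + Gamma (2 |A meet B| - |A| - |B|), normalized by the length of the variable domain and aggregated with a Minkowski-type sum. The function names, the `domains` argument and the handling of weights are hypothetical and are not taken from the DI implementation, whose results may differ.

```python
import math
from typing import Optional, Sequence, Tuple

Interval = Tuple[float, float]  # (lower bound, upper bound)


def length(iv: Interval) -> float:
    """Length of an interval (0 for a degenerate or empty one)."""
    return max(0.0, iv[1] - iv[0])


def phi(a: Interval, b: Interval, gamma: float = 0.5) -> float:
    """Component-wise Ichino & Yaguchi dissimilarity for one interval variable."""
    if not 0.0 <= gamma <= 0.5:
        raise ValueError("Gamma must lie in [0, 0.5]")
    join_len = max(a[1], b[1]) - min(a[0], b[0])            # smallest interval covering both
    meet_len = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))  # length of the overlap
    return join_len - meet_len + gamma * (2 * meet_len - length(a) - length(b))


def ichino_yaguchi(
    x: Sequence[Interval],
    y: Sequence[Interval],
    domains: Sequence[Interval],
    gamma: float = 0.5,
    order: int = 2,
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Normalized dissimilarity; pass weights summing to 1.0 for the weighted variant."""
    if not (len(x) == len(y) == len(domains)):
        raise ValueError("x, y and domains must have the same length")
    if not 1 <= order <= 10:
        raise ValueError("Order of power must lie in 1 .. 10")
    if weights is None:
        weights = [1.0] * len(x)  # assumption: unweighted sum when no weights are given
    elif len(weights) != len(x) or not math.isclose(sum(weights), 1.0):
        raise ValueError("weights must have one entry per variable and sum to 1.0")
    total = 0.0
    for a, b, dom, w in zip(x, y, domains, weights):
        dom_len = length(dom)
        if dom_len <= 0:
            raise ValueError("each variable domain must have positive length")
        total += w * (phi(a, b, gamma) / dom_len) ** order
    return total ** (1.0 / order)
```

Calling `ichino_yaguchi(x, y, domains)` with the defaults mirrors the Table 1 defaults for Gamma (0.5) and the order of power (2); supplying `weights` switches to the weighted form.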

The Workbench also enables the selection of both a canonical and a flexible matching operator together with some related parameters (see Table 2).

Table 2. Matching operators available in the DI method for BSOs, and related parameters.

Matching operator        Parameters  Constraints    Default
CM (Canonical matching)  Class BSO   1 .. no. BSO   2
FM (Flexible matching)   Class BSO   1 .. no. BSO   2

The "Class BSO" parameter identifies the BSO that will be used as the referent in the matching (by default, the second BSO). In the case of canonical matching, the subject to be matched can be any BSO, while in the case of flexible matching the subject must be a BSO representing an individual, that is, each variable must take a single value.
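The manual does not spell out the canonical matching test itself. A minimal sketch follows, under the assumption that, for interval variables, a subject matches the referent when each of the subject's intervals is contained in the corresponding interval of the referent. The function name and the data layout are hypothetical, and the DI method may apply a different or more general test.

```python
from typing import Sequence, Tuple

Interval = Tuple[float, float]  # (lower bound, upper bound)


def canonical_match(referent: Sequence[Interval], subject: Sequence[Interval]) -> bool:
    """Return True when every subject interval lies inside the referent's interval.

    Assumption: canonical matching reduces to variable-wise containment.
    """
    if len(referent) != len(subject):
        raise ValueError("referent and subject must describe the same variables")
    return all(
        r_lo <= s_lo and s_hi <= r_hi
        for (r_lo, r_hi), (s_lo, s_hi) in zip(referent, subject)
    )


# Hypothetical BSOs described by two interval variables each.
class_bso = [(0.0, 5.0), (1.0, 3.0)]
inside = [(1.0, 2.0), (1.0, 3.0)]
outside = [(1.0, 6.0), (1.0, 3.0)]

results = [canonical_match(class_bso, s) for s in (class_bso, inside, outside)]
```

Under this assumption a BSO always matches itself, which is consistent with a referent matching its own row in a matching vector.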

In the waveform example (see FIG 3) we have chosen the normalized dissimilarity by Ichino & Yaguchi (U_3), with Gamma = 0.5 and order of power equal to 2.

FIG 3. The SODAS window for the definition of the parameters

4 The output

When the user selects a dissimilarity measure, the method outputs both a SODAS file and a report. The SODAS file contains the same data as the input file plus an additional distance matrix. The distance between the i-th and the j-th BSOs is written in entry (i,j) of the matrix. Since the matrix is symmetric, only the lower triangular part is actually reported in the SODAS file. The new SODAS file can then be selected as the base file of a new chaining with another method (e.g., DKS) that has access to the distance matrix (see FIG 4).

FIG 4. A new chaining with the DKS method and a SODAS file generated by the DI method.
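Because only the lower triangular part of the distance matrix is stored, a program that reads it usually has to rebuild the full symmetric matrix. The sketch below shows one way to do so, using the first four rows of the waveform report as sample data. The function name is hypothetical, and reading the rows from the SODAS file is outside the scope of the sketch.

```python
from typing import List, Sequence


def expand_lower_triangle(rows: Sequence[Sequence[float]]) -> List[List[float]]:
    """Rebuild a full symmetric matrix from its lower triangle.

    Row i (0-based) must hold i + 1 values: the distances to BSOs 0 .. i.
    """
    n = len(rows)
    full = [[0.0] * n for _ in range(n)]
    for i, row in enumerate(rows):
        if len(row) != i + 1:
            raise ValueError(f"row {i} should contain {i + 1} values")
        for j, value in enumerate(row):
            full[i][j] = value
            full[j][i] = value  # mirror across the diagonal
    return full


# First four rows of the distance matrix from the waveform example.
lower = [
    [0.0000],
    [0.3568, 0.0000],
    [0.5854, 0.8574, 0.0000],
    [1.0359, 1.3228, 0.5445, 0.0000],
]
full = expand_lower_triangle(lower)
```

After the call, `full[0][3]` and `full[3][0]` both hold the distance between the first and the fourth BSO.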

More information can be found in the output report file (see FIG. 5).

FIG 5. The chaining after running the DI method

The report file is structured like the listing produced by the SPSS/PC software when distance-based clustering procedures are selected.

The report contains :

For instance, the report obtained with the waveform example is the following :

Page 1 SODAS 01/02/00

Sodas The Statistical Package for Symbolic Data Analysis

**************D I S T A N C E M E A S U R E S*************

Data Information:

Selected Distance Function: U_3 Normalized Ichino & Yaguchi

30 Boolean Symbolic Objects (BSOs) read.

21 Variables selected for each BSO: 1 -- 21

Distance Matrix

BSO 1 2 3 4

1 0.0000

2 0.3568 0.0000

3 0.5854 0.8574 0.0000

4 1.0359 1.3228 0.5445 0.0000

5 0.4191 0.7005 0.3004 0.6821

-------------------------------------------------------------------------

Page 2 SODAS 01/02/00

BSO 1 2 3 4

6 0.7293 1.0312 0.3326 0.3961

7 0.2634 0.2909 0.7514 1.2017

8 0.8760 1.1604 0.4457 0.3485

9 1.1640 1.4517 0.6396 0.3010

10 0.2761 0.5191 0.3930 0.8532

11 0.9493 1.1398 0.7867 1.0015

12 0.8982 1.0933 0.7056 0.9187

13 1.0239 1.3079 0.5530 0.4592

14 0.9250 1.1933 0.5333 0.6090

15 0.9272 1.1704 0.6221 0.7662

16 0.9464 1.1686 0.6320 0.7872

17 1.2169 1.5177 0.7114 0.3360

18 1.1544 1.4438 0.6722 0.3085

19 0.9787 1.2512 0.5413 0.4992

20 1.0574 1.3349 0.5676 0.3986

21 0.3467 0.2166 0.8632 1.3339

22 0.4291 0.5158 0.6684 1.1143

23 0.9119 1.0825 0.7861 1.0522

24 0.3956 0.4576 0.7001 1.1630

25 0.7238 0.8748 0.7065 1.0639

-------------------------------------------------------------------------

Page 3 SODAS 01/02/00

BSO 1 2 3 4

26 0.5730 0.7051 0.6381 1.0540

27 0.5873 0.6917 0.6584 1.0783

28 0.7946 0.9523 0.7575 1.0581

29 0.3752 0.3972 0.7417 1.2085

30 0.3822 0.2831 0.8359 1.2991

BSO 5 6 7 8

5 0.0000

6 0.4023 0.0000

7 0.5940 0.8833 0.0000

8 0.5517 0.3003 1.0223 0.0000

9 0.7827 0.4945 1.3253 0.4385

10 0.2753 0.5643 0.4345 0.6907

11 0.8078 0.8417 1.0369 0.8898

12 0.7476 0.7850 0.9833 0.8241

13 0.6882 0.4800 1.1794 0.4766

14 0.6416 0.5272 1.0680 0.5669

15 0.6874 0.6249 1.0511 0.6459

16 0.7085 0.6659 1.0592 0.7000

-------------------------------------------------------------------------

Page 4 SODAS 01/02/00

BSO 5 6 7 8

17 0.8536 0.5470 1.3779 0.4466

18 0.8014 0.4968 1.3024 0.4006

19 0.6475 0.4942 1.1237 0.4865

… … … … …
 
 

BSO 29 30

-----------------------------------------------------------------------

Page 9 SODAS 01/02/00

BSO 29 30

29 0.0000

30 0.2882 0.0000

This procedure was completed at 11:45:58
 
 

When the user selects a matching operator, the DI method outputs only a report file. The report contains :

For instance, the report obtained with the waveform example is the following :

Page 1 SODAS 01/02/00

Sodas The Statistical Package for Symbolic Data Analysis

*********** C A N O N I C A L M A T C H I N G **********

Data Information

30 Boolean Symbolic Objects (BSOs) read.

2 : Boolean Symbolic Object (BSO) selected as class.

Matching Vector

BSO 2

1 No Match

2 Match

3 No Match

4 No Match

5 No Match

6 No Match

7 No Match

8 No Match

-------------------------------------------------------------------------

Page 2 SODAS 01/02/00

BSO 2

9 No Match

10 No Match

11 No Match

12 No Match

13 No Match

14 No Match

15 No Match

16 No Match

17 No Match

18 No Match

19 No Match

20 No Match

21 No Match

22 No Match

23 No Match

24 No Match

25 No Match

26 No Match

27 No Match

28 No Match

-------------------------------------------------------------------------

Page 3 SODAS 01/02/00

BSO 2

29 No Match

30 No Match

This procedure was completed at 12:37:01

5 Bibliography

Esposito F., Malerba D., Tamma V., Bock H.-H. Classical resemblance measures. Chapter 8.1 in H.-H. Bock, E. Diday (eds.): Analysis of Symbolic Data. Exploratory methods for extracting statistical information from complex data. Springer Verlag, Heidelberg, 2000.