User Manual of the SODAS method DI:
Dissimilarity Measures and Matching
Operators
A Tool for Data Warehouse and Data Mining
Donato MALERBA
Dipartimento di Informatica, University of Bari, Italy
1 Presentation of DI
2 The input
3 The parameters definitions
3.1 List of the selected variables
3.3 The parameter definition
5 Bibliography
Several data analysis techniques are based on defining and quantifying
a dissimilarity measure between the underlying objects. The method DI (Dissimilarities
& Matching) implements several dissimilarity measures between Boolean
symbolic objects (BSOs), as well as a canonical and a flexible matching
operator. Matching is the process of comparing two or more structures to
discover their likeness or difference. The definition of matching
operators for BSOs is deemed important for the development of several
symbolic data analysis techniques, such as factor analysis.
The input of the DI method is a SODAS file containing a data matrix
of BSOs.
We take as an example the symbolic transformation of the waveform recognition data proposed by L. Breiman, J.H. Friedman, R.A. Olshen and C.J. Stone in "Classification and Regression Trees", Wadsworth, Belmont, 1984.
The symbolic data matrix here is composed of 30 symbolic objects described by 21 symbolic interval variables. The "BASE" (see FIG 1) is the file wave30.sds.
FIG 1. Chaining example
The DI method also takes as input a list of variables and some parameters,
which are defined below.
FIG 2. The window for the selection of the variables
The Workbench enables the specification of both the chosen dissimilarity measure and some related parameters (see Table 1).
Table 1. Dissimilarity measures available in the DI method for BSOs, and related parameters.
| Dissimilarity measure | Parameters | Constraints | Default |
| U_1 (Gowda & Diday) | none | | |
| U_2 (Ichino & Yaguchi) | Gamma | [0 .. 0.5] | 0.5 |
| | Order of power | 1 .. 10 | 2 |
| U_3 (Normalized Ichino & Yaguchi) | Gamma | [0 .. 0.5] | 0.5 |
| | Order of power | 1 .. 10 | 2 |
| U_4 (Weighted Normalized Ichino & Yaguchi) | Gamma | [0 .. 0.5] | 0.5 |
| | Order of power | 1 .. 10 | 2 |
| | List of weights per variable | Sum(weights) = 1.0 | Equal weights |
| C_1 (Normalized De Carvalho) | Comparison function | D1, D2, D3, D4, D5 | D1 |
| | Order of power | 1 .. 10 | 2 |
| SO_1 (De Carvalho) | Comparison function | D1, D2, D3, D4, D5 | D1 |
| | Order of power | 1 .. 10 | 2 |
| | List of weights per variable | Sum(weights) = 1.0 | Equal weights |
| SO_2 (De Carvalho) | Gamma | [0 .. 0.5] | 0.5 |
| | Order of power | 1 .. 10 | 2 |
| SO_3 (De Carvalho) | Gamma | [0 .. 0.5] | 0.5 |
| | Order of power | 1 .. 10 | 2 |
| SO_4 (Normalized De Carvalho) | Gamma | [0 .. 0.5] | 0.5 |
| | Order of power | 1 .. 10 | 2 |
| SO_5 (Normalized De Carvalho) | Gamma | [0 .. 0.5] | 0.5 |
| | Order of power | 1 .. 10 | 2 |
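The constraints in Table 1 can be checked before a run. The following Python fragment is only an illustrative sketch and not part of SODAS: the function name `validate_parameters` and its arguments are hypothetical, and it assumes that the order of power is an integer, which the table suggests but does not state.

```python
import math

def validate_parameters(gamma=0.5, order=2, weights=None, n_variables=None):
    """Check parameter values against the constraints listed in Table 1.

    Hypothetical helper: the name, arguments and return value are
    illustrative and do not correspond to any SODAS interface.
    """
    if not 0.0 <= gamma <= 0.5:
        raise ValueError("Gamma must lie in [0 .. 0.5]")
    # Assumption: the order of power is an integer between 1 and 10.
    if not (isinstance(order, int) and 1 <= order <= 10):
        raise ValueError("Order of power must be an integer in 1 .. 10")
    if weights is None:
        if n_variables is None:
            return gamma, order, None
        # Default in Table 1: equal weights for all selected variables.
        weights = [1.0 / n_variables] * n_variables
    if not math.isclose(sum(weights), 1.0, abs_tol=1e-9):
        raise ValueError("Weights must sum to 1.0")
    return gamma, order, list(weights)

# Defaults for a weighted measure over four variables.
print(validate_parameters(n_variables=4))
```

Called with only the number of variables, the sketch returns the defaults of Table 1: Gamma equal to 0.5, order of power equal to 2, and equal weights.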
The Workbench also enables the selection of both a canonical and a flexible matching operator together with some related parameters (see Table 2).
Table 2. Matching operators available in the DI method for BSOs, and related parameters.
| Matching operator | Parameters | Constraints | Default |
| CM (Canonical matching) | Class BSO | 1 .. no. BSO | 2 |
| FM (Flexible matching) | Class BSO | 1 .. no. BSO | 2 |
The "Class BSO" is the BSO that will be used as the referent in the matching (by default, the second BSO). In canonical matching the subject to be matched can be any BSO, while in flexible matching the subject must be a BSO representing an individual, that is, each variable must take a single value.
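The two conditions above can be illustrated for interval variables. The Python sketch below is not the DI implementation. It assumes that a BSO is stored as a mapping from variable names to `(low, high)` intervals, and it assumes the usual containment reading of canonical matching, in which the subject matches the referent when each of its intervals lies inside the corresponding interval of the referent. This manual does not spell out that definition (see chapter 8.4 in the bibliography). The function names are hypothetical, and flexible matching itself is not sketched.

```python
def is_individual(bso):
    """True when every variable takes a single value (low == high)."""
    return all(low == high for low, high in bso.values())

def canonical_match(referent, subject):
    """Assumed semantics: every interval of the subject is contained in
    the interval of the referent for the same variable."""
    return all(
        referent[var][0] <= low and high <= referent[var][1]
        for var, (low, high) in subject.items()
    )

# Illustrative data, not taken from wave30.sds.
referent = {"x1": (0.0, 2.0), "x2": (1.0, 5.0)}
individual = {"x1": (1.5, 1.5), "x2": (3.0, 3.0)}
wider = {"x1": (-1.0, 2.0), "x2": (1.0, 5.0)}

print(is_individual(individual))              # a valid subject for flexible matching
print(canonical_match(referent, individual))  # contained in the referent
print(canonical_match(referent, wider))       # x1 exceeds the referent's interval
```

Under this reading a BSO always matches itself, which agrees with the sample canonical matching report, where only BSO 2 matches the class BSO 2.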
In the waveform example (see FIG 3) we have chosen the normalized dissimilarity measure by Ichino & Yaguchi (U_3), with Gamma = 0.5 and order of power equal to 2.
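To show how Gamma and the order of power enter the computation, here is a minimal Python sketch of the Ichino & Yaguchi dissimilarity for interval variables. It follows the formula as commonly published, phi(A, B) = |A ⊕ B| - |A ∩ B| + Gamma (2 |A ∩ B| - |A| - |B|), where A ⊕ B is the smallest interval covering both A and B. The per-variable terms are normalized by the domain length and aggregated with a Minkowski sum of the given order. This manual does not give the formula, so the exact normalization used by DI is an assumption here, and all function names are hypothetical.

```python
def interval_length(interval):
    """Length of a (low, high) interval; empty intervals have length 0."""
    low, high = interval
    return max(0.0, high - low)

def ichino_yaguchi_component(a, b, gamma=0.5):
    """Per-variable term phi(A, B) for two intervals a and b."""
    join = (min(a[0], b[0]), max(a[1], b[1]))  # smallest interval covering both
    meet = (max(a[0], b[0]), min(a[1], b[1]))  # intersection, possibly empty
    len_join = interval_length(join)
    len_meet = interval_length(meet)
    return (len_join - len_meet
            + gamma * (2 * len_meet - interval_length(a) - interval_length(b)))

def u3_dissimilarity(x, y, domains, gamma=0.5, order=2):
    """Assumed U_3: Minkowski aggregation of domain-normalized terms."""
    total = 0.0
    for a, b, domain in zip(x, y, domains):
        phi = ichino_yaguchi_component(a, b, gamma) / interval_length(domain)
        total += phi ** order
    return total ** (1.0 / order)

# Two objects described by two interval variables (illustrative values).
x = [(0.0, 2.0), (0.0, 2.0)]
y = [(1.0, 3.0), (1.0, 3.0)]
domains = [(0.0, 4.0), (0.0, 4.0)]
print(u3_dissimilarity(x, y, domains, gamma=0.5, order=2))
```

With Gamma = 0.5 the term for each variable reduces to the length of the covering interval minus the mean length of the two intervals, so identical objects have dissimilarity 0.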
FIG 3. The SODAS window for the definition of the parameters
FIG 4. A new chaining with the DKS method and a SODAS file generated by the DI method.
More information can be found in the output report file (see FIG 5).
FIG 5. The chaining after running the DI method
The report file is structured like the listing produced by the SPSS/PC software when a distance-based clustering procedure is selected.
The report contains:
Page 1 SODAS 01/02/00
Sodas The Statistical Package for Symbolic Data Analysis
**************D I S T A N C E M E A S U R E S*************
Data Information:
Selected Distance Function: U_3 Normalized Ichino & Yaguchi
30 Boolean Symbolic Objects (BSOs) read.
21 Variables selected for each BSO: 1 -- 21
Distance Matrix
BSO 1 2 3 4
1 0.0000
2 0.3568 0.0000
3 0.5854 0.8574 0.0000
4 1.0359 1.3228 0.5445 0.0000
5 0.4191 0.7005 0.3004 0.6821
-------------------------------------------------------------------------
Page 2 SODAS 01/02/00
BSO 1 2 3 4
6 0.7293 1.0312 0.3326 0.3961
7 0.2634 0.2909 0.7514 1.2017
8 0.8760 1.1604 0.4457 0.3485
9 1.1640 1.4517 0.6396 0.3010
10 0.2761 0.5191 0.3930 0.8532
11 0.9493 1.1398 0.7867 1.0015
12 0.8982 1.0933 0.7056 0.9187
13 1.0239 1.3079 0.5530 0.4592
14 0.9250 1.1933 0.5333 0.6090
15 0.9272 1.1704 0.6221 0.7662
16 0.9464 1.1686 0.6320 0.7872
17 1.2169 1.5177 0.7114 0.3360
18 1.1544 1.4438 0.6722 0.3085
19 0.9787 1.2512 0.5413 0.4992
20 1.0574 1.3349 0.5676 0.3986
21 0.3467 0.2166 0.8632 1.3339
22 0.4291 0.5158 0.6684 1.1143
23 0.9119 1.0825 0.7861 1.0522
24 0.3956 0.4576 0.7001 1.1630
25 0.7238 0.8748 0.7065 1.0639
-------------------------------------------------------------------------
Page 3 SODAS 01/02/00
BSO 1 2 3 4
26 0.5730 0.7051 0.6381 1.0540
27 0.5873 0.6917 0.6584 1.0783
28 0.7946 0.9523 0.7575 1.0581
29 0.3752 0.3972 0.7417 1.2085
30 0.3822 0.2831 0.8359 1.2991
BSO 5 6 7 8
5 0.0000
6 0.4023 0.0000
7 0.5940 0.8833 0.0000
8 0.5517 0.3003 1.0223 0.0000
9 0.7827 0.4945 1.3253 0.4385
10 0.2753 0.5643 0.4345 0.6907
11 0.8078 0.8417 1.0369 0.8898
12 0.7476 0.7850 0.9833 0.8241
13 0.6882 0.4800 1.1794 0.4766
14 0.6416 0.5272 1.0680 0.5669
15 0.6874 0.6249 1.0511 0.6459
16 0.7085 0.6659 1.0592 0.7000
-------------------------------------------------------------------------
Page 4 SODAS 01/02/00
BSO 5 6 7 8
17 0.8536 0.5470 1.3779 0.4466
18 0.8014 0.4968 1.3024 0.4006
19 0.6475 0.4942 1.1237 0.4865
… … … … …
BSO 29 30
-----------------------------------------------------------------------
Page 9 SODAS 01/02/00
BSO 29 30
29 0.0000
30 0.2882 0.0000
This procedure was completed at 11:45:58
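Since the distance matrix is printed as a lower triangle split into column blocks and pages, it may be convenient to reassemble it before further processing. The Python sketch below is a hypothetical helper, not a SODAS tool. It assumes the whitespace-separated layout shown in the listing above: a header line starting with "BSO" that names the columns of the block, followed by rows made of a BSO number and the distances for the leading columns of that block.

```python
def parse_distance_blocks(lines):
    """Collect the lower-triangular blocks of a report into a symmetric
    dictionary keyed by (row BSO, column BSO)."""
    distances = {}
    columns = []
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue
        if tokens[0] == "BSO":
            # Header of a block: the remaining tokens are column numbers.
            columns = [int(t) for t in tokens[1:]]
            continue
        if not tokens[0].isdigit():
            continue  # page headers, separators, elisions, closing message
        try:
            values = [float(t) for t in tokens[1:]]
        except ValueError:
            continue  # a line starting with a number that is not a matrix row
        row = int(tokens[0])
        for col, value in zip(columns, values):
            distances[(row, col)] = value
            distances[(col, row)] = value  # the matrix is symmetric
    return distances

# A few lines copied from the sample report above.
report = [
    "BSO 1 2 3 4",
    "1 0.0000",
    "2 0.3568 0.0000",
    "3 0.5854 0.8574 0.0000",
    "-------------------------------------------------------------------------",
    "Page 2 SODAS 01/02/00",
    "BSO 1 2 3 4",
    "6 0.7293 1.0312 0.3326 0.3961",
]
d = parse_distance_blocks(report)
print(d[(1, 2)], d[(4, 6)])
```

Lines that do not look like matrix rows are skipped, so the same function can be run on the whole report file read line by line, provided the real file keeps the layout assumed above.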
When the user selects a matching operator, the DI method outputs only a report file. The report contains:
Page 1 SODAS 01/02/00
Sodas The Statistical Package for Symbolic Data Analysis
*********** C A N O N I C A L M A T C H I N G **********
Data Information
30 Boolean Symbolic Objects (BSOs) read.
2 : Boolean Symbolic Object (BSO) selected as class.
Matching Vector
BSO 2
1 No Match
2 Match
3 No Match
4 No Match
5 No Match
6 No Match
7 No Match
8 No Match
-------------------------------------------------------------------------
Page 2 SODAS 01/02/00
BSO 2
9 No Match
10 No Match
11 No Match
12 No Match
13 No Match
14 No Match
15 No Match
16 No Match
17 No Match
18 No Match
19 No Match
20 No Match
21 No Match
22 No Match
23 No Match
24 No Match
25 No Match
26 No Match
27 No Match
28 No Match
-------------------------------------------------------------------------
Page 3 SODAS 01/02/00
BSO 2
29 No Match
30 No Match
This procedure was completed at 12:37:01
F. Esposito, D. Malerba, V. Tamma, H.-H. Bock. Classical resemblance measures. Chapter 8.1 in H.-H. Bock, E. Diday (eds.): Analysis of Symbolic Data. Exploratory methods for extracting statistical information from complex data. Springer Verlag, Heidelberg, 2000.
F. Esposito, D. Malerba, F.A. Lisi. Matching symbolic objects. Chapter 8.4 in H.-H. Bock, E. Diday (eds.): Analysis of Symbolic Data. Exploratory methods for extracting statistical information from complex data. Springer Verlag, Heidelberg, 2000.
F. Esposito, D. Malerba, F.A. Lisi. Flexible Matching of Boolean Symbolic Objects. Proceedings of NTTS'98, Int. Seminar on New Techniques & Technologies for Statistics, 157-162, 1998.
F. Esposito, D. Malerba, V. Tamma. Dissimilarity measures in symbolic data analysis. In Book of Short Papers CLADAG 99, 137-140, CNR, Rome, 1999.