3.3 Waveform example
5.2 Splitting step
5.3 More details on each leaf or terminal node
5.4 Final result
In output, we obtain a new list of symbolic objects that permit to assign
future new objects to one class of the known prior partition.
We take as an example the symbolic transformation of the waveform recognition data proposed L. Breiman, J.H. Friedman, R.A. Oslhen and C. J. Stone in "Classification and Regression Trees"; Belmont Eds, 1984.
The symbolic data matrix is here composed of 30 symbolic objects described by 21 symbolic interval variables and a classification variable which is a qualitative variable. The "BASE" (see FIG1.) is the file wave30.sds.
FIG1. Chaining example
*default value : number of OS/2
*default value : Fuzzy
*default value : Gini for Pure or Likehood for Fuzzy
If Soft Assignment is Fuzzy then Splitting criterion is Likelihood
If Soft Assignment is Pure then Splitting criterion is Gini or Information.
*default value : 5
If the size of the leaf is smaller than Minimum size to split the node this leaf is not split.
*default value : 2
*default value : 1
*default value : 0.
FIG2. The SODAS window for the selection of the predictor variables
After you must selected only one qualitative variable which represents classification variable.
FIG3. The SODAS window for the selection of the classification variable
In the waveform example (see FIG3.) we have chosen a dissimilarity with no normalization, the number of classes is 3 which mean that the DIV method will perform a partition in two classes and a partition in 3 classes, and we have chosen not to create a partition file.
FIG4. The SODAS window for the definition of the parameters
After having defined the parameters and run the TREE method, a listing is given as output (see FIG5.)
FIG5. The chaining after running the DIV method
ID_CLASS NAME_CLASS 1 wave_1 2 wave_2 3 wave_3 CLASS SIZE LEARNING TEST 1 10 10 0 2 10 10 0 3 10 10 0 TOTAL 30 30 0In this case the test set is empty and each a priori class contains 10 symbolic objects. The first table gives the correspondence between the name NAME_CLASS and the number ID_CLASS for each a priori classes.
The aim is to build the set of binary questions that will be used in
order to grow the tree. TREE will employ standard binary questions.
For each node we have :
================================= | SPLIT OF A NODE : 1 | ================================= LEARNING SET ======================================================== | | N(k/t) | N(k) | P(k/t) | P(t/k) | ======================================================== | wave_1 | 10.00 | 10.00 | 33.33 | 100.00 | | wave_2 | 10.00 | 10.00 | 33.33 | 100.00 | | wave_3 | 10.00 | 10.00 | 33.33 | 100.00 | ========================================================
TREE CRITERION 0.666667 =========================================================== | Ord | variable | value | criterion | =========================================================== | 1 |( 13) position 13 | 3.8100 | 0.3333 | | 2 |( 5) position 5 | 0.4400 | 0.4167 | | 3 |( 9) position 9 | 3.5200 | 0.4167 | | 4 |( 7) position 7 | 2.1100 | 0.4284 | | 5 |( 8) position 8 | 2.8600 | 0.4444 | ===========================================================The variables "position 5", "position 9","position 7" and "position 8" are alternative variables of the variable "position 13".
SPLITTING NODE: 1 VARIABLE : ( 13) position 13 SPLIT : 3.810000 CRITERION : 0.333333The first division has been performed according to the variable "position 13" and to the cut value 3.81. The first binary question is [ position 13 <= 3.81]? The objects in node 2 have answered "yes" to this question. They correspond to the 20 objects ( 10 belong the a priori class wave_1 and 10 belong wave_2. The 10 other objects have answered "no" to this question and correspond to the objects of the node 3 .
LEARNING SET =================================================== | | left node | right node | Row totals | | node | 2 | 3 | 1 | =================================================== | wave_1 | 10.00 | 0.00 | 10.00 | | wave_2 | 10.00 | 0.00 | 10.00 | | wave_3 | 0.00 | 10.00 | 10.00 | =================================================== | Total | 20.00 | 10.00 | 30.00 | ===================================================Each node has to be tested in order to decide if it should be or not a terminal node (= a leaf). Classical tests will be used for this phase: a node will be considered terminal if its size is less than a given parameter (fixed by the user), or if it contains objects belonging all to the same class of the prior partition (the node is called pure). The node 3 is a terminal node.
================================= | SPLIT OF A NODE : 3 | ================================= LEARNING SET ======================================================== | | N(k/t) | N(k) | P(k/t) | P(t/k) | ======================================================== | wave_1 | 0.00 | 10.00 | 0.00 | 0.00 | | wave_2 | 0.00 | 10.00 | 0.00 | 0.00 | | wave_3 | 10.00 | 10.00 | 100.00 | 100.00 | ======================================================== THIS STOP-SPLITTING RULE IS TRUE : The size of the no-majority classes is too small SIZE OF THE NO-MAJORITY CLASSES 0.000000 VALUE OF STOP-SPLITTING RULE 2.000000 THIS NODE IS A TERMINAL NODEWe have the list of objects assigned to the terminal node 3.
List of objects : ( 3)"wave_3_0" ( 3)"wave_3_6" ( 3)"wave_3_1" ( 3)"wave_3_7" ( 3)"wave_3_3" ( 3)"wave_3_4" ( 3)"wave_3_5" ( 3)"wave_3_2" ( 3)"wave_3_8" ( 3)"wave_3_9"
LEAF : 4 ============== ======================================================== | | N(k/t) | N(k) | P(k/t) | P(t/k) | ======================================================== | wave_1 | 10.00 | 10.00 | 90.91 | 100.00 | | wave_2 | 1.00 | 10.00 | 9.09 | 10.00 | | wave_3 | 0.00 | 10.00 | 0.00 | 0.00 | ========================================================
RULE : IF [ position 9 <= 3.800000] IS TRUE AND [ position 13 <= 3.810000] IS TRUE THEN ASSIGN_CLASS IS wave_1 r(t)= 0.090909 p(t)= 0.366667 R(t)= 0.033333
List of objects : ( 1)"wave_1_3" ( 1)"wave_1_1" ( 1)"wave_1_6" ( 1)"wave_1_9" ( 1)"wave_1_5" ( 1)"wave_1_7" ( 1)"wave_1_2" ( 1)"wave_1_8" ( 1)"wave_1_0" ( 1)"wave_1_4" ( 2)"wave_2_5"
It contains the main following information?s :
RESULTS BY SYMBOLIC OBJECT ============================================== | No | Nom |Leaf | Class | | | | No | true | assig.| ============================================== | 1 | "wave_1_3" | 4 | 1 | 1 | | 2 | "wave_1_1" | 4 | 1 | 1 | | . | ... | . | . | . | | 13 | "wave_2_7" | 5 | 2 | 2 | | 14 | "wave_2_5" | 4 | 2 | 1 (*)| | 15 | "wave_2_4" | 5 | 2 | 2 | | . | ... | . | . | . | | 29 | "wave_3_8" | 3 | 3 | 3 | | 30 | "wave_3_9" | 3 | 3 | 3 | ==============================================For each object we have the code of the terminal node or leaf which is assigned, the a priori class or true class and the assign class. If the mark (*) exists then the object is misclassify.
R(T)= 0.0333R(T) is equal to the sum of R(t) for all leaf of the tree T. R(T) represents the resubstitution estimate of the misclassification cost of the tree T.
On the test set the value R(T) gives an estimate of the misclassification cost.
CONFUSION MATRIX FOR TRAINNING SET =========================================================== | | wave_1 | wave_2 | wave_3 | Total | =========================================================== | wave_1 | 10 | 0 | 0 | 10 | | wave_2 | 1 | 9 | 0 | 10 | | wave_3 | 0 | 0 | 10 | 10 | =========================================================== | Total | 11 | 9 | 10 | 30 | ===========================================================
MISCLASSIFICATION RATE BY CLASS TRUE CLASS ( ERROR /SIZE ) FREQUENCY wave_1 ( 0 / 10 ) 0.00 wave_2 ( 1 / 10 ) 10.00 wave_3 ( 0 / 10 ) 0.00 TOTAL ( 1 / 30 ) 3.33The value 1 represents the number of objects of a priori class wave_2 which are assigned to the assign class wave_2. The assign class wave_2 is build by the rules of the decision tree.
MISCLASSIFICATION RATE BY CLASS TRUE CLASS ( ERROR /SIZE ) FREQUENCY wave_1 ( 0 / 10 ) 0.00 wave_2 ( 1 / 10 ) 10.00 wave_3 ( 0 / 10 ) 0.00 TOTAL ( 1 / 30 ) 3.33The resubstitution global error rate is 3.33 %.
================================== | EDITION OF DECISION TREE | ================================== +---- [ 4 ]wave_1 ( 10.00 1.00 0.00 ) ! !----2[ position 9 <= 3.800000] ! ! ! +---- [ 5 ]wave_2 ( 0.00 9.00 0.00 ) ! !----1[ position 13 <= 3.810000] ! +---- [ 3 ]wave_3 ( 0.00 0.00 10.00 )If the answers of the binary question[ position 13 <= 3.810000]? is false then the object is assigned to a priori class wave_3 else the result depends of the reponse of the answers position 9 <= 3.800000]?. If the reponse is false then the object is assigned to a priori class wave_2 else the object is assigned to a priori class wave_1 and the vector (10.00, 1.00, 0.00) gives the number of objects of each a priori class assigned to leaf 4 ( 10 is the number of objects of a priori class 1 "wave_1").
[2] E. Périnel (1996) Segmentation et Analyse des Données Symboliques. Application à des données probabilistes imprécises. Thèse de doctorat de l'Université Paris IX Dauphine.
[3] E. Périnel, A. Ciampi. (1997). Growing a probabilistic decision tree, AMSDA 1997 Naples, Italy.
[4] E. Périnel (1998) Construire un arbre de discrimination binaire à partir de données imprécises.RSA.