Edwin DIDAY
University Paris 9 Dauphine, Ceremade. Pl. Du Ml de L. de Tassigny. 75016 Paris, FRANCE
Introduction
As in "Data Mining", our main aim is knowledge extraction from large data bases. The descriptions of the units are called "symbolic" when they are more complex than the standard ones because they contain internal variation and are structured. Symbolic data arise from many sources, for instance when huge sets of data are summarised. They need more complex data tables, called "symbolic data tables", because a cell of such a table does not necessarily contain, as usual, a single quantitative or categorical value. For instance, a cell can contain a distribution (Schweizer (1984) says that "distributions are the numbers of the future"!), several values linked by a taxonomy, intervals with logical rules, etc. The need to extend standard data analysis methods (exploratory, clustering, factorial analysis, discrimination, ...) to symbolic data tables is increasing, in order to get more accurate information and to summarise the extensive data sets contained in data bases. We define "Symbolic Data Analysis" (SDA) as the extension of standard Data Analysis to such tables. "Extracting knowledge" means getting explanatory results; that is why "symbolic objects" are introduced. They constitute an explanatory output of an SDA and, moreover, they can be used to define queries of a data base.
We now look at the historical and practical origins of the Symbolic Data Analysis field.
Aristotle's Organon (4th century B.C.) clearly distinguishes "first order individuals" (such as a horse or a man), each considered as a unit associated with an individual of the world, from "second order individuals" (such as the horse or the man), each also taken as a unit but associated with a class of individuals.
* This introduction is the first chapter of a book describing the methods
involved in the European Eurostat Project SODAS.
Our first aim is to extend standard data analysis to second order individuals. For instance, in the census of a country, each individual of each region is described by a set of numerical or categorical variables given in several relations of a data base. Such an individual is considered as a "first order individual". In order to study the regions, considered as "second order individuals", we can describe each of them by summarising the values taken by its inhabitants: by inter-quartile intervals, subsets of categorical values, histograms, probability distributions, etc., depending on the variable concerned. In this way we obtain a "symbolic data table" where each row defines the "description" of a region and each column is associated with a symbolic variable. Extending standard Data Analysis to such data tables is the first aim of what we have called "Symbolic Data Analysis".
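The summarising step described above can be sketched in Python as follows. This is only an illustration: the census records, the region names and the variable names are invented, and describing age by an inter-quartile interval and the categorical variable by relative frequencies is just one of the options listed in the text.

```python
from collections import Counter, defaultdict
from statistics import quantiles

# Hypothetical first-order individuals (invented values, not from the text).
census = [
    {"region": "North", "age": 25, "job": "worker"},
    {"region": "North", "age": 31, "job": "employee"},
    {"region": "North", "age": 34, "job": "worker"},
    {"region": "North", "age": 40, "job": "worker"},
    {"region": "South", "age": 45, "job": "employee"},
    {"region": "South", "age": 50, "job": "employee"},
    {"region": "South", "age": 61, "job": "worker"},
]

def summarise(records, group_key="region"):
    """Build one symbolic description per second-order individual."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[group_key]].append(rec)
    table = {}
    for name, members in groups.items():
        # Quantitative variable -> inter-quartile interval [Q1, Q3].
        q1, _median, q3 = quantiles([m["age"] for m in members], n=4)
        # Categorical variable -> relative frequencies (weighted multivalue).
        counts = Counter(m["job"] for m in members)
        freqs = {job: n / len(members) for job, n in counts.items()}
        table[name] = {"age": (q1, q3), "job": freqs}
    return table

symbolic_table = summarise(census)
for region, description in symbolic_table.items():
    print(region, description)
```

Each key of `symbolic_table` plays the role of a row of the symbolic data table (a region), and each entry of its description plays the role of a cell.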
Another important aim is to obtain (or "mine") explanatory results (i.e. knowledge) by extracting the so-called "symbolic objects", which model a "concept" or a "physical entity" of the real world. A "symbolic object" is defined by its "intent" and by a way of finding its "extent". For instance, the description of a region is called its "intent", and the set of individuals which satisfy this intent is called its "extent". The syntax of symbolic objects must have explanatory power. For instance, the symbolic object defined by the following expression (see section 4 for a formal definition): a(w) = [age(w) ∈ [30, 35]] ∧ [Number of children(w) ≤ 2], gives at the same time:
i) the intent of a class of individuals by the description d = ([30, 35], 2), where [30, 35] is the inter-quartile interval of the random variable associated with the region for the variable age,
ii) a way of calculating the extent by the mapping "a", defined with the help of the relations ∈ and ≤. It means that an individual "w" satisfies this intent (i.e. belongs to the "extent") if his age is between 30 and 35 years and he has at most 2 children.
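As a minimal sketch, the symbolic object above can be written as a Python predicate whose extent is obtained by filtering a population. The individuals and their values are invented for illustration.

```python
# Hypothetical individuals; names and values are invented for illustration.
individuals = {
    "w1": {"age": 32, "children": 1},
    "w2": {"age": 41, "children": 0},
    "w3": {"age": 30, "children": 2},
    "w4": {"age": 34, "children": 3},
}

def a(w):
    """a(w) = [age(w) in [30, 35]] AND [number of children(w) <= 2]."""
    return 30 <= w["age"] <= 35 and w["children"] <= 2

# The extent is the set of individuals that satisfy the intent.
extent = {name for name, w in individuals.items() if a(w)}
print(sorted(extent))
```

The same predicate could equally be phrased as the selection condition of a data base query, which is the sense in which a symbolic object can be translated into a query.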
This very simple kind of symbolic object can be extended at least in the following way: the individuals are of second order (as towns or regions) and represent classes of individuals of first order; therefore the descriptions of the individuals are provided by distributions (the histogram of the age in a town, for instance). In this case we have to define a different kind of relation "R" and a threshold in order to calculate the extent.
There are several advantages in the use of symbolic objects. One of them is their ability to be translated into a query of a data base, and therefore to propagate the concepts that they describe from one data base to another (e.g. from one country to another). What do we call a "concept"? There are two kinds of "concepts".
i) The "concepts of the real world", such as a town, a region, a scenario of road accident, a kind of unemployment, etc. This kind of concept is defined by an "intent" and an "extent", notions clearly defined by Arnauld and Nicole (1662) in the framework of the Port-Royal school:
"Now, in these universal ideas there are two things which it is important to keep quite distinct: comprehension and extension. I call the comprehension of an idea the attributes which it contains and which cannot be taken away from it without destroying it; thus the comprehension of the idea of a triangle includes, to a superficial extent, figure, three lines, three angles, the equality of these three angles to two right angles, etc. I call the extension of an idea the subjects to which it applies, which are also called the inferiors of a universal term, that being called superior to them. Thus the idea of triangle in general extends to all different kinds of triangle".
ii) The "concepts of our mind" (among the so-called "mental objects" as defined by J.P. Changeux (1983)), which represent in our mind concepts of the real world by their intent and by a "way of computing their extent", and not by the extent itself, as (for sure!) there is no room for all the possible extents. A concept of our mind can be mathematically modelled by a symbolic object, which is defined by a description "d" (i.e. its intent) and a mapping "a" able to compute its extent: for instance, the description of what we call a "car" and a way of recognizing that a given entity of the real world is a car. A concept of the real world can be modelled by a symbolic object and its extent. Whereas "concepts" or "entities" of the real world are mathematically modelled by "symbolic objects", their computer modelling is provided by the so-called "objects" of "object oriented languages" such as C++ or Java.
In the Aristotelian tradition, concepts are characterized by a logical conjunction of properties. In the Adansonian tradition (Adanson (1727-1806) was a French naturalist very much ahead of his time), a concept is characterized by a set of similar individuals. In contrast with the Aristotelian tradition, where all the members of the extent of a concept are equivalent, a third tendency derived from psychology and cognitive science (Rosch (1978)) is to consider that concepts must be represented by classes which "tend to become defined in terms of prototypes or prototypical instances that contain the attributes most representative of items inside the class". Wille (1981), following Wagner (1973), says that "in traditional philosophy things for which their intent describes all the properties valid for the individuals of their extent are called concepts". Symbolic objects combine the advantages of these four tendencies:
. The Aristotelian tradition as they can have the explanatory power of a logical description of the concepts that they represent.
. The Adansonian tradition, as the members of the extent of a symbolic object are similar in the sense that they must satisfy as well as possible the same properties (not necessarily Boolean). In that sense the concepts that they represent are polythetic.
. The Rosch point of view, as their membership function is able to provide prototypical instances characterized by the most representative attributes.
. The Wille property is satisfied by the so-called "complete symbolic objects", which can be proved to constitute a Galois lattice (see for instance Diday (1998)).
Symbolic Data Analysis was born from the simultaneous influence of several fields:
- Standard exploratory data analysis (Tukey (1958), Benzécri (1973), Diday et al (1984), Saporta (1990), Lebart et al (1998)), where more importance is given to individuals than in standard statistics, and where the symbolic approach extends the methods to more complex descriptions of the units and gives more explanatory results.
- Artificial Intelligence (AI), where much effort has been devoted to finding good languages to represent complex knowledge instead of the simple vectors of ℝ^p of the standard statistical units. Notice that the simple language used to represent symbolic objects is more inspired by languages based on first order logic (Michalski (1973), Hayes-Roth and McDermott (1977)) than by graph representations (Winston (1979), Sowa (1984)). Notice also that in symbolic data analysis we are not much interested in the computer language (SQL, C++, Java, ...) used to represent symbolic objects, but much more in their mathematical modelling, the way of inducing them from the data, their graphical representation, etc.
- Numerical Taxonomy in biology, Machine Learning in AI, Classification in Data Analysis.
In all these fields a natural question arose: how does one obtain classes and their description? Historically, we may say briefly that there are three tendencies:
The first, proposed by A. de Jussieu (1748), is in the Aristotelian tradition and consists in defining the classes top down, from the most general to the most specific, by a good choice of the properties which characterize them. In that way we obtain a decision tree where each node is characterized by a conjunction of properties. Many others have continued this tendency. Starting from individuals of first order: Belson (1959), Morgan and Sonquist (A.I.D. program (1963)), Lance and Williams (1967), Breiman et al. (1984), Quinlan (1986). Starting from individuals of second order: Pankhurst (1978), Payne (1975), Gower (1975), J. Lebbe, R. Vigne (1991), H. Ralambondrainy (1991), Ganascia (1991).
The second tendency was put forward by Adanson (1757), who gave the first "Sequential Agglomerative Hierarchical Clustering" (SAHC) algorithm. This well known "bottom up" algorithm starts from classes reduced to single individuals and merges at each stage the most "similar" classes. This tendency is well represented by Ward (1963), Lerman (1970), Jardine and Sibson (1971), Sneath and Sokal (1973), Jambu (1978), Roux (1985), Bock (1974), Celeux, Diday, Govaert, Lechevallier and Ralambondrainy (1989), etc. The classes obtained in this way contain similar objects. It is then possible to generalize them in terms of a disjunction of conjunctions of properties; that is why these classes are called "polythetic", in opposition to classes generalized by a conjunction of properties, which are called "monothetic". Whereas the first tendency yields monothetic classes by a top-down process, the second produces polythetic classes by a "bottom up" process. In this framework, a family of methods called "Conceptual Clustering" was developed in the eighties, such as Langley and Sage (1984), Lebowitz (1983), Fisher D.H. (1987); see Fisher and Langley (1986) for a review. Instead of producing trees, Diday (1984) and Bertrand (1986), for instance, describe an ascending process building a pyramid (i.e. a generalization of hierarchical trees allowing overlapping clusters) of polythetic classes. In Brito and Diday (1991) and Brito (1994), an ascending pyramid produces monothetic classes.
The third tendency consists in looking directly for classes and their representation. For instance, the "Dynamic Clustering Method" (Diday (1971), Diday et al (1979), Diday and Simon (1976)) defines a general framework and algorithms which aim to discover simultaneously classes and their representations in such a way that they "fit" together as well as possible. This approach has been used with several kinds of inter-class structure (partitions, hierarchies, ...) and of representation mode for each class (seeds, probability laws, factorial axes, regressions, ...). In Diday (1976), a logical representation of clusters is proposed. With regard to "Conceptual Clustering" algorithms based on the Dynamic Clustering Method or inspired by it, mention should be made of Diday, Govaert, Lechevallier, Sidi (1980), Michalski, Diday, Stepp (1982) and Michalski, Stepp (1983), among other pioneering papers in "Conceptual Clustering".
Since the first papers announcing the main principles of Symbolic Data Analysis (Diday (1987 a), (1987 b), (1989)), much work has been done in the same direction. In factorial analysis, P. Cazes, A. Chouakria, E. Diday, Y. Schecktman (1997) defined a principal component analysis of individuals described by a vector of numerical intervals and, in the same direction, R. Verde, F.A.T. De Carvalho (1998) took given dependence rules into account (see also Lauro, Palumbo (1998) and section 9.3 of this book). In the case where the individuals are described by symbolic data, Conruyt (1993), for structured data, and Ciampi, E. Diday, J. Lebbe, E. Périnel, R. Vigne (1995) and Périnel (1996) developed extensions of standard decision trees. In the same direction, E. Périnel presents in this book his work on "symbolic discrimination rules", M.C. Bravo, J.M. Garcia-Santesmases (1998) present "segmentation trees for stratified data", and J.P. Rasson and S. Lissoir (1998) a kernel discriminant analysis starting from a dissimilarity between symbolic descriptions. See also E. Auriol (1995) for a link with the domain of "Case Based Reasoning". In order to select the symbolic variables which best discriminate the individuals or classes of individuals, several works were carried out, such as R. Vignes (1991) and, more recently, Ziani (1996). It is often useful to calculate dissimilarities between symbolic objects; in that direction mention should be made of C. Gowda and E. Diday (1992) and De Carvalho (1994, 1998 a). If each cell of the data table is a random variable represented by a histogram (for instance, the histogram of the age of the inhabitants of a town), a histogram of histograms can be calculated, for instance by taking account of rules between the values of the variables, as in De Carvalho (1998 b) and Bertrand, Goupil in chapter 6 of this book, or by using capacity theory, as in Diday, Emilion (1995, 1997) and Diday, Emilion, Hillali (1996).
Noirhomme and Rouard (1998) give a way of representing multidimensional symbolic data (see chapter 7); see also E. Gigout (1998).
Starting from standard data, Gettler-Summa (1992), Smadhi (1995) proposed a way for extracting symbolic objects from a factorial analysis; in order to extract symbolic objects from a partition, see Stephan, Hebrail, Lechevallier in chapter 5 and Gettler-Summa (1997) in section 9.4 of this book. Starting from time-series, Ferraris, Gettler-Summa, C. Pardoux, H. Tong (1995), have defined a way for providing symbolic objects (see chapter 12) .
More recently, several dissertations have been presented at the Paris 9 - Dauphine University: Mfoumoune (1998), on the sequential building of a pyramid where each node is associated with a symbolic object; Chavent (1998), on building a partition of a set of symbolic objects by a top-down algorithm which also provides a symbolic object associated with each class obtained (see chapter 11); Stéphan (1998), on extracting symbolic objects from a data base (see chapter 5); Hillali (1998), on describing classes of individuals described by a vector of probability distributions; Polaillon (1998), on extending Galois lattices to symbolic data at input and "complete" symbolic objects at output (see section 11.4). More generally, the most recent algorithms in Symbolic Data Analysis are in this book.
2) The input of a symbolic data analysis: a "symbolic data table"
"Symbolic data tables" constitute the main input of a Symbolic Data Analysis. Their columns are "variables" which are used to describe a set of units called "individuals". Their rows are called "symbolic descriptions" of these individuals because they are not, as usual, only vectors of single quantitative or categorical values. Each cell of such a "symbolic data table" can contain data of different types:
(a) Single quantitative value: for instance, if "height" is a variable and w is an individual: height(w) = 3.5.
(b) Single categorical value: for instance, Town(w) = London.
(c) Multivalued: for instance, in the quantitative case height(w) = {3.5, 2.1, 5} means that the height of w can be either 3.5 or 2.1 or 5. Notice that (a) and (b) are special cases of (c).
(d) Interval: for instance height(w) = [3, 5], which means that the height of w varies in the interval [3, 5].
(e) Multivalued with weights: for instance a histogram or a membership function (notice that (a) and (b) and (c) are special cases of (e) when the weights are equal to 1).
Variables can be:
(g) Taxonomic: for instance, "the colour is considered to be "light" if it is "yellow", "white" or "pink"".
(h) Hierarchically dependent: for instance, we can describe the kind of computer of a company only if it has a computer; hence the variable "does the company have computers?" and the variable "kind of computer" are hierarchically linked.
(i) With logical dependencies: for instance, "if age(w) is less than 2 months then height(w) is less than 10".
Many examples of such symbolic data are given in chapter 3 of this book. Table 1 gives some examples of such data:
Table 1. A "symbolic data table": each cell contains an example of "symbolic data".
3) Sources of Symbolic Data:
Symbolic data are generated when we summarise huge sets of data. The need for such a summary can arise in different ways, for instance from any query to a data base which induces categories and descriptive variables. These categories can be, for instance, simply the towns or, in a more complex way, the socio-professional categories (SPC) crossed with categories of age (A) and regions (R). In this last case, we obtain a new categorical variable of cardinality |SPC| · |A| · |R|, where |X| is the cardinality of the set X. The descriptive variables of the households can then be used to describe these categories by symbolic data. Symbolic data can also appear after a clustering, in order to describe the clusters obtained in an explanatory way (by using the initial variables).
Symbolic data may also be "native" in the sense that they result from expert knowledge (scenarios of traffic accidents, types of emigration, species of insects, ...), from the probability distribution, the percentiles or the range of any random variable associated with each cell of a stochastic data table, from time series (by representing each time series by the histogram of its values or by describing intervals of time), from confidential data (in order to hide the initial data by reducing their accuracy), etc. They also result from relational data bases, when the study of a set of units requires the merging of several relations.
Example: We have two relations of a relational data base, defined as follows. The first one, called "Delivery", is given in table 1. It describes five types of deliveries, each characterised by the name of the supplier, its company and the town from which the supplying comes.
Table 1 Relation "Delivery"
The supplying is described by the relation "Supplying", defined in the following table 2.
i) Multivalued: this happens when the variables "Supplying" and "Town" have several values, as shown in table 3.
ii) Multivalued with weights: this is the case for the towns of the supplier F1. The weights ½ mean that the town of the supplier F1 is Paris or Lannion, each with a frequency equal to ½.
iii) Rules: some rules have to be given as input in addition to the data table 3. For instance: "if the town is Paris and the supplier is CNET, then the supplying is FT1".
iv) Taxonomy: by using regions we can replace, for instance, {Paris, Clamart} by "Parisian Region".
4) Main output of Symbolic Data Analysis algorithms:
Most of these algorithms give in their output the description "d" of a class of individuals by using a "generalization" process, which also gives a way, starting from this description, to find at least the individuals of this class.
More formally, let W be a set of individuals, D a set containing descriptions of individuals or of classes of individuals, and "y" a mapping defined from W into D which associates with each w ∈ W a description d ∈ D from a given symbolic data table. We denote by R a relation defined on D. It is defined by a subset E of D × D. If (x, y) ∈ E we say that x and y are connected by R, and this is denoted by x R y. The characteristic mapping of R is
hR: D × D → {0, 1} such that hR(x, y) = 1 iff x R y. We generalise the definition of the mapping hR by introducing the mapping HR: D × D → L. We will write [d' R d] = HR(d', d), where HR(d', d) is the result of the "comparison" between d' and d by HR. When L = {true, false}, [d' R d] = true means that there is a connection between d and d'. In the case where L = [0, 1], the value [d' R d] measures the degree of connection between d and d'. In this case, [d' R d] can be interpreted as the "truth value" of d' R d or "the degree to which d' is in relation R with d" (see Bandemer and Nather (1992), section 5.2 on fuzzy relations).
For instance, R may be one of the following relations {=, ≡, ≤, ⊆}, or an implication, or a kind of matching, etc. R can also use a set of such operators.
The description of an individual is called an "individual description". The description of a class of individuals is an "intensional description". For instance, the description of a scenario of accidents, of a class of failures, etc. is an intensional description. A "symbolic object" is defined both by a description "d" (generally intensional) and by a way of comparing it to individual descriptions, defined by a mapping "a" called "membership function". More formally:
Definition of a Symbolic Object
A symbolic object is a triple s = (a, R, d) where R is a relation between descriptions, "d" is a description and "a" is a mapping defined from W into L depending on R and d.
Symbolic Data Analysis in SODAS usually concerns classes of symbolic objects where R is fixed, "d" varies among a finite set of coherent descriptions and a(w) = [y(w) R d]. More generally, many other cases can be considered, for instance if the mapping "a" is of the following kind: a(w) = [he(y(w)) hJ(R) hi(d)], where the mappings he, hJ and hi are "filters" which are discussed below. There are two kinds of symbolic objects:
- "Boolean symbolic objects" if [y(w) R d] ∈ L = {true, false}. In this case, if y(w) = (y1, ..., yp), the yi are of type (a) to (d), defined in section 2.
Example:
Let a(w) = [y(w) R d] with R defined by [d' R d] = ∨i=1,2 [d'i Ri di], where ∨ has the standard logical meaning and Ri is the relation of set inclusion (i.e. ⊆). If y(w) = (colour(w), height(w)), d = ({red, blue, yellow}, [10, 15]) and u is an individual such that colour(u) = {red, yellow} and height(u) = {21}, then
a(u) = [colour(u) ⊆ {red, blue, yellow}] ∨ [height(u) ⊆ [10, 15]] = true ∨ false = true.
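The Boolean example above can be checked with a short Python sketch. The encoding of an interval as a (low, high) tuple and the function names are assumptions made for illustration only.

```python
def included(values, description):
    """Set inclusion of observed values in a description.

    A description is either a set of categories or a closed interval
    encoded as a (low, high) tuple; this encoding is an assumption.
    """
    if isinstance(description, tuple):
        low, high = description
        return all(low <= v <= high for v in values)
    return set(values) <= set(description)

def membership(y_w, d, combine=any):
    """a(w) = combine over i of [y_i(w) included in d_i]; any = OR, all = AND."""
    return combine(included(yi, di) for yi, di in zip(y_w, d))

d = ({"red", "blue", "yellow"}, (10, 15))
u = ({"red", "yellow"}, {21})

print(membership(u, d))               # disjunction, as in the example
print(membership(u, d, combine=all))  # conjunction of the same comparisons
```

With the disjunction, u satisfies the description through its colour alone; replacing the disjunction by a conjunction excludes u because its height lies outside [10, 15].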
- "Modal symbolic objects" if [y(w) R d] ∈ L = [0, 1].
Example:
Let a(u) = [y(u) R d] where, for instance, R is defined by [d' R d] = Max i=1,2 [d'i Ri di] and where the "matching" of two discrete probability distributions d'i = r and di = q of k values is defined by: r Ri q = Σj=1,k rj qj exp(rj - min(rj, qj)). By analogy with the Boolean case we denote [d' R d] = ∨*i=1,2 [d'i Ri di] where ∨* = Max. With these definitions it is possible to calculate the mapping "a" of a symbolic object s = (a, R, d), with d = ({(0.2)12, (0.8)[20, 28]}, {(0.4)employee, (0.6)worker}) and where SPC means "socio-professional category", by:
a(u) = [age(u) R1 {(0.2)12, (0.8)[20, 28]}] ∨* [SPC(u) R2 {(0.4)employee, (0.6)worker}]
Notice that in this example the weights (0.2), (0.8), (0.4), (0.6) represent frequencies but more generally other kinds of weights may be used as "possibilities", "necessities", "capacities", etc. (see Diday (1995), for instance).
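A minimal numerical sketch of this modal example follows. The weights of the description d are those given in the text; the weights of the individual u are invented, and the representation of a distribution as a list of weights over a fixed ordering of the categories is an assumption.

```python
import math

def matching(r, q):
    """r R_i q = sum over j of r_j * q_j * exp(r_j - min(r_j, q_j))."""
    return sum(rj * qj * math.exp(rj - min(rj, qj)) for rj, qj in zip(r, q))

# Description d from the text, as weights over ordered categories.
d_age = [0.2, 0.8]   # weights of 12 and of [20, 28]
d_spc = [0.4, 0.6]   # weights of employee and of worker

# Hypothetical individual u (invented weights over the same categories).
u_age = [0.5, 0.5]
u_spc = [1.0, 0.0]

def a(y_age, y_spc):
    """Modal membership: Max over the variables of the matching degrees."""
    return max(matching(y_age, d_age), matching(y_spc, d_spc))

print(matching(u_age, d_age), matching(u_spc, d_spc), a(u_age, u_spc))
```

For this invented u, the SPC variable gives the larger matching degree, so it determines a(u) under the Max combination.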
Syntax of symbolic objects in the case of "assertions":
If the initial data table contains p variables, we denote y(w) = (y1(w), ..., yp(w)), D = (D1, ..., Dp), d ∈ D: d = (d1, ..., dp) and R' = (R1, ..., Rp), where Ri is a relation defined on Di. We call "assertion" a special case of a symbolic object defined by s = (a, R, d) where R is defined by
[d' R d] = ∧i=1,p [d'i Ri di], where "∧" has the standard logical meaning, and "a" is defined by a(w) = [y(w) R d] in the Boolean case. Notice that the expression a(w) = ∧i=1,p [yi(w) Ri di] is enough to define the symbolic object s = (a, R, d). Hence, we can say that this explanatory expression defines a symbolic object called an "assertion".
For example, a Boolean assertion is:
a(w) = [age(w) ⊆ {12, 20, 28}] ∧ [SPC(w) ⊆ {employee, worker}]. If the individual u is described in the original symbolic data table by age(u) = {12, 20} and SPC(u) = {employee}, then: a(u) = [{12, 20} ⊆ {12, 20, 28}] ∧ [{employee} ⊆ {employee, worker}] = true.
In the modal case, the variables are multivalued and weighted. An example is given by
a(u) = [y(u) R d] with [d' R d] = f({[yi(w) Ri di]}i=1,p) where, for instance,
f({[yi(w) Ri di]}i=1,p) = Πi=1,2 [d'i Ri di] and where, in the case of probability distributions, the "matching" is defined for two discrete density distributions d'i = r = (r1, ..., rk) and
di = q = (q1, ..., qk) of k values by: r Ri q = Σj=1,k rj qj exp(rj - min(rj, qj)).
By analogy with the Boolean case we denote [d' R d] = ∧*i=1,2 pi [d'i Ri di], where the meaning of "∧*" is given by the definition of the mapping "f". For instance, with these choices, a modal assertion s = (a, R, d) is completely defined by the equality:
a(w) = [age(w) R1 {(0.2)12, (0.8)[20, 28]}] ∧* [SPC(w) R2 {(0.4)employee, (0.6)worker}]
Extent of a symbolic object s: in the Boolean case, the extent of a symbolic object is denoted Ext(s) and defined by the extent of a, which is: Extent(a) = {w ∈ W / a(w) = true}. In the modal case, given a threshold α, it is defined by Extα(s) = Extentα(a) = {w ∈ W / a(w) ≥ α}.
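The two definitions of the extent can be sketched with a single Python helper. The function name, the optional threshold argument and the membership degrees are assumptions made for illustration.

```python
def extent(individuals, a, alpha=None):
    """Ext(s) in the Boolean case, or the thresholded extent when alpha is given."""
    if alpha is None:
        return {w for w in individuals if a(w) is True}
    return {w for w in individuals if a(w) >= alpha}

# Hypothetical membership degrees a(w) for four individuals (invented).
degrees = {"w1": 0.9, "w2": 0.35, "w3": 0.6, "w4": 0.1}

print(sorted(extent(degrees, degrees.get, alpha=0.5)))
print(sorted(extent(degrees, degrees.get, alpha=0.3)))
```

Lowering the threshold can only enlarge the extent, since every individual that reaches the higher threshold also reaches the lower one.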
Other possible classes of symbolic objects: if for instance the mapping "a" is of the following kind: a(w) = [he(y(w)) hJ(R) hi(d)], different classes of symbolic objects may be defined depending on the choice of he, hJ and hi. In practice, these mappings may be used, for instance, in the following way: he is a filter on the extension of the symbolic object, hJ is a filter on the descriptive variables and hi is a filter on the descriptions. More details may be found in Diday (1998) and in chapter 3 of this book. The following example illustrates one kind of filter.
Example of filter on the extension:
We associate with each town a symbolic object defined by a(w) = [he(y(w)) R d], where "d" is the description of its inhabitants, using for instance the histogram associated with each variable (such as the histogram of the age). In order that the extension of such a symbolic object contains only members of its associated town, the mapping he is defined in the following way: he(y(w)) = y(w) if w is a member of the town and, if not, he(y(w)) = HS, where HS is a dummy value such that [HS R d] = 0 for any description d.
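The filter on the extension can be sketched as follows. The comparison used for R (an overlap of two histograms over the same bins), the town names and all the values are assumptions chosen only to make the example concrete; the text does not prescribe them.

```python
HS = object()  # dummy description standing for "not a member of the town"

def h_e(w, town):
    """Filter on the extension: keep y(w) only for members of the town."""
    return w["y"] if w["town"] == town else HS

def compare(description, d):
    """[description R d], built so that [HS R d] = 0 for any d.

    Assumption: R is a simple overlap of two histograms over the same bins.
    """
    if description is HS:
        return 0.0
    return sum(min(p, q) for p, q in zip(description, d))

def make_a(town, d):
    """Membership function a(w) = [h_e(y(w)) R d] of the town's symbolic object."""
    return lambda w: compare(h_e(w, town), d)

# Hypothetical data: age histograms over three bins (invented values).
d_paris = [0.2, 0.5, 0.3]
w1 = {"town": "Paris", "y": [0.0, 1.0, 0.0]}
w2 = {"town": "Lyon", "y": [0.0, 1.0, 0.0]}

a_paris = make_a("Paris", d_paris)
print(a_paris(w1), a_paris(w2))
```

The two individuals have the same description, but only the member of the town obtains a non-zero membership degree, so any positive threshold keeps non-members out of the extent.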
Order between symbolic objects: if r is a given order on D, then the induced order on the set of symbolic objects denoted by rs is defined by : s1 rs s2 iff d1 r d2.
If R is such that [d R d'] = true implies d r d', then Ext(s1) ⊆ Ext(s2) if s1 rs s2. If R is such that [d R d'] = true implies d' r d, then Ext(s2) ⊆ Ext(s1) if s1 rs s2.
Tools for symbolic objects: tools operating on symbolic objects (Diday (1995)) are needed, such as similarities (F. de Carvalho (1998), Esposito et al (1998)), matching, merging by generalisation, where a t-norm or a t-conorm (Schweizer, Sklar (1983) and Diday, Emilion (1995), (1997)) denoted T can be used, and splitting by specialisation (Ciampi et al. (1995)). Under some assumptions on the choice of R and T, it can be shown that the underlying structure of a set of symbolic objects is a Galois lattice (Brito (1994), Polaillon, Diday (1997), Polaillon (1998)), where the vertices are closed sets defined by "complete symbolic objects". More precisely, the associated Galois correspondence is defined by two mappings F and G:
- F: from P(W) (the power set of W) into S (the set of symbolic objects) such that F(C) = s, where s = (a, R, d) is defined by d = Tc∈C y(c) and so a(w) = [y(w) R Tc∈C y(c)], for a given R. For example, if Tc∈C y(c) = ∪c∈C y(c), R ≡ "⊆", y(u) = {pink, blue}, C = {c, c'}, y(c) = {pink, red} and y(c') = {blue, red}, then a(u) = [y(u) R Tc∈C y(c)] = [{pink, blue} ⊆ {pink, red} ∪ {blue, red} = {pink, red, blue}] = true and u ∈ Ext(s).
- G: from S into P(W) such that G(s) = Ext(s).
A " complete symbolic object " s is such that F(G(s)) = s. Such objects can be selected from the Galois lattice but also, from a partitioning, a hierarchical or a pyramidal clustering, from the most influential individuals to a factorial axis, from a decision tree, etc.
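The two mappings and the completeness test can be sketched in Python for the special case used in the example, where T is set union and R is set inclusion. Since R is fixed, a symbolic object is identified here with its description d; this identification, the key "c2" for the second member of C and the description `d_other` are assumptions made for illustration.

```python
def F(C, y):
    """Intension mapping: description of class C, with T taken as set union."""
    return frozenset().union(*(y[c] for c in C))

def G(d, y):
    """Extension mapping: individuals w whose description y(w) is included in d."""
    return frozenset(w for w in y if y[w] <= d)

def is_complete(d, y):
    """With R fixed, a symbolic object is complete when F(G(s)) = s."""
    return F(G(d, y), y) == d

# Individuals from the example in the text; "c2" is the second member of C.
y = {
    "c": frozenset({"pink", "red"}),
    "c2": frozenset({"blue", "red"}),
    "u": frozenset({"pink", "blue"}),
}

d = F({"c", "c2"}, y)
print(sorted(d), sorted(G(d, y)), is_complete(d, y))

d_other = frozenset({"pink", "red", "green"})  # hypothetical description
print(sorted(G(d_other, y)), is_complete(d_other, y))
```

The description built from C also covers u, and regenerating it from its own extent gives it back, so it is complete. The description `d_other` is not: its extent contains only c, whose generalisation drops the unused value "green".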
Finally we can summarize the mathematical framework of a symbolic data analysis in the following way (figure 1):
Figure 1. W: set of individuals. D: description set. L = {true, false} or L = [0, 1]. S: set of symbolic objects. y: description function. a: membership function from W into L = {true, false} or L = [0, 1]. R: comparison relation. T: generalization mapping. F: intension mapping. G: extension mapping. dw: y(w) = dw is an individual description. ws: ws = F(w) = (a, R, y(w)) is an individual symbolic object. dC: description of class C. s: intensional symbolic object given by F(C) = (a, R, dC). G(s) is the extension of s.
5) Some advantages in the use of symbolic objects
We can observe at least five kinds of advantages in the use of symbolic objects. First, they give a summary of the original symbolic data table in an explanatory way (i.e. close to the initial language of the user), by expressing descriptions based on properties concerning the initial variables or meaningful variables (such as factorial axes). Second, they can easily be transformed into a query of a data base. Third, being independent of the initial data table, they are able to identify any matching individual described in any data table. Fourth, through their descriptive part, they are able to give a new symbolic data table of higher level, on which a symbolic data analysis of second level can be applied. Fifth, in order to characterize a concept, they are able to join easily several properties based on different variables coming from different arrays and different underlying populations.
6) Some symbolic data analysis methods
Symbolic Data Analysis methods are mainly characterized by the following principles:
i) They start with a symbolic data table as input and give a set of symbolic objects as output. These symbolic objects explain the results in a language close to that of the user and, moreover, have all the advantages mentioned in 5).
ii) They use efficient generalization processes during the algorithms in order to select the best variables and individuals.
iii) They give graphical descriptions which take account of the internal variation of the symbolic objects.
The following methods are developed in this book and in the SODAS software:
- Principal Component and Discriminant Factorial Analysis of a symbolic data table. The output of these methods preserves the internal variation of the input data in the sense that the individuals are represented in the factorial plane not by a point, as usual, but by a rectangle, which allows the definition of a symbolic object with explanatory factorial axes as variables,
- extension of elementary descriptive statistics (central object, histograms, dispersion, co-dispersion, etc.) to symbolic data,
- mining symbolic objects from the answers to queries of a relational data base,
- partitioning, hierarchical or pyramidal clustering of a set of individuals described by a symbolic data table, such that each class is associated with a complete symbolic object,
- dissimilarities between Boolean or probabilistic symbolic objects,
- extension of decision trees to probabilistic symbolic objects, extension of a Parzen discrimination method to classes of symbolic objects,
- generalisation by a disjunction of symbolic objects of a class of individuals described in a standard way,
- interactive and ergonomic graphical representation of symbolic objects.
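To make the idea of a dissimilarity between probabilistic symbolic descriptions concrete, here is a minimal sketch. It assumes that each description maps a variable to a histogram {category: frequency} and uses half the L1 distance between histograms, averaged over the shared variables. This particular measure is an illustrative assumption and is not one of the dissimilarities implemented in SODAS.

```python
def histogram_distance(h1, h2):
    """Half the L1 distance between two histograms {category: frequency}.
    The value is 0 for identical histograms and 1 for disjoint ones."""
    categories = set(h1) | set(h2)
    return 0.5 * sum(abs(h1.get(c, 0.0) - h2.get(c, 0.0)) for c in categories)

def dissimilarity(desc1, desc2):
    """Mean histogram distance over the variables shared by two descriptions."""
    variables = sorted(set(desc1) & set(desc2))
    if not variables:
        raise ValueError("the two descriptions share no variable")
    return sum(histogram_distance(desc1[v], desc2[v]) for v in variables) / len(variables)

# Two hypothetical region descriptions with histogram-valued cells.
region_a = {"Bedroom": {1: 0.5, 2: 0.5}, "Dining-Living": {1: 1.0}}
region_b = {"Bedroom": {2: 0.5, 3: 0.5}, "Dining-Living": {1: 1.0}}
d_ab = dissimilarity(region_a, region_b)
```

Because the measure works cell by cell on histograms, it uses the internal variation of each class instead of reducing it to a single value first.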
7) Symbolic Data Analysis in the SODAS software
7.1 The general aim
The general aim of SODAS can be stated in the following way: building symbolic data in order to summarise huge data sets and then analysing them by Symbolic Data Analysis. For instance, if each household of a set is characterized by its region, its socio-economic group, the number of bedrooms and the number of dining-living rooms, we obtain a data table of the kind shown in Table 4:
| Household number | Region | Bedroom | Dining-Living | Socio-Econ group |
| 11404 | Northern- Metropolitan | 2 | 1 | 1 |
| 11405 | Northern- Metropolitan | 2 | 1 | 3 |
| 11406 | Northern- Metropolitan | 1 | 3 | 3 |
| 12111 | Northern- Metropolitan | | | |
| 12112 | East anglia | 1 | 3 | 3 |
| 12112 | East anglia | 2 | 2 | 1 |
| 12112 | Greater London N-E | 1 | 2 | 3 |
Table 4 : Standard Data Table of Households
In census data there is a huge set of households. We can summarize them by describing each region by the households of its inhabitants. In order to do so, we delete the first column of this table and obtain Table 5:
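| Region | Bedroom | Dining-Living | Socio-Econ group |
| Northern- Metropolitan | 2 | 1 | 1 |
| Northern- Metropolitan | 2 | 1 | 3 |
| Northern- Metropolitan | 1 | 3 | 3 |
| Northern- Metropolitan | | | |
| East anglia | 1 | 3 | 3 |
| East anglia | 2 | 2 | 1 |
| Greater London N-E | 1 | 2 | 3 |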
Table 5: The first column of table 4 concerning the household number has been deleted.
We can now describe each region by the histogram of the categories of each variable. This is done in Table 6, which is a symbolic data table since each cell contains a histogram and not a single quantitative or categorical value as in standard data tables. It is easy to see, for instance, that a decision tree will not be the same if the variables are the categories (each cell of the associated data table then contains a frequency) or if the variables are symbolic (each cell then contains a histogram). In the first case each branch of the decision tree represents an interval of frequencies (for instance, of the frequency of the category [20, 30] years old), whereas in the second case it represents an interval of values (for instance the interval [0, 30] years old).
| Region | Bedroom | Dining-Liv | Socio-Ec gr |
| Northern- Metropolitan | (2/3) 2, (1/3) 3 | (2/3) 1, (1/3) 3 | (1/3) 1, (2/3) 3 |
| East-anglia | (2/3) 1, (1/3) 2 | (2/3) 2, (1/3) 3 | (2/3) 1, (1/3) 2 |
| Greater London | | | |
Table 6: A symbolic data table where each cell contains a histogram
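The passage from a standard data table to a symbolic data table with histogram-valued cells can be sketched in Python. The input below contains only the complete rows displayed in Table 4 (without the household number), so the output describes that excerpt only; the function name `symbolic_table` is a hypothetical name and this is not the DB2SO implementation.

```python
from collections import Counter, defaultdict
from fractions import Fraction

# Complete rows shown in Table 4, household number removed.
rows = [
    {"Region": "Northern-Metropolitan", "Bedroom": 2, "Dining-Living": 1, "Socio-Econ group": 1},
    {"Region": "Northern-Metropolitan", "Bedroom": 2, "Dining-Living": 1, "Socio-Econ group": 3},
    {"Region": "Northern-Metropolitan", "Bedroom": 1, "Dining-Living": 3, "Socio-Econ group": 3},
    {"Region": "East Anglia", "Bedroom": 1, "Dining-Living": 3, "Socio-Econ group": 3},
    {"Region": "East Anglia", "Bedroom": 2, "Dining-Living": 2, "Socio-Econ group": 1},
    {"Region": "Greater London N-E", "Bedroom": 1, "Dining-Living": 2, "Socio-Econ group": 3},
]

def symbolic_table(rows, class_var):
    """Describe each class (value of class_var) by one histogram per variable."""
    counts = defaultdict(lambda: defaultdict(Counter))
    for row in rows:
        for var, value in row.items():
            if var != class_var:
                counts[row[class_var]][var][value] += 1
    table = {}
    for cls, per_var in counts.items():
        table[cls] = {}
        for var, counter in per_var.items():
            total = sum(counter.values())
            # Relative frequency of each category within the class.
            table[cls][var] = {value: Fraction(n, total) for value, n in counter.items()}
    return table

symbolic = symbolic_table(rows, "Region")
```

Each entry `symbolic[region][variable]` is a histogram, so one row of the result plays the role of one row of a table such as Table 6.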
The main steps for a symbolic data analysis in SODAS can then be defined as follows:
If there is more than one data table, put the data in a relational data base (ORACLE, ACCESS, ...). Then define a context by giving the units (individuals, households, ...), the classes (regions, socio-economic groups, ...) and the descriptive variables of the units. Then build a symbolic data table where the units are the preceding classes and the description of each class is obtained by a histogram, as in Table 6, or by a generalization process applied to its members. Finally, apply symbolic data analysis methods to this symbolic data table (histogram of each symbolic variable, dissimilarities between symbolic descriptions, clustering, factorial analysis, discrimination of a symbolic data table, graphical visualisation of symbolic descriptions, ...).
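As an alternative to histograms, the description of a class can be obtained by a generalization process applied to its members. The sketch below assumes one simple such process: an interval [min, max] for quantitative variables and the set of observed categories for the others. The function `generalize` and the example data are hypothetical.

```python
def generalize(members, quantitative):
    """Describe a class by generalizing the descriptions of its members:
    an interval (min, max) for each quantitative variable and the set of
    observed categories for each other variable."""
    if not members:
        raise ValueError("a class needs at least one member")
    description = {}
    for var in members[0]:
        values = [m[var] for m in members]
        if var in quantitative:
            description[var] = (min(values), max(values))
        else:
            description[var] = set(values)
    return description

# Hypothetical members of one region.
members = [
    {"Bedroom": 2, "Socio-Econ group": "1"},
    {"Bedroom": 1, "Socio-Econ group": "3"},
    {"Bedroom": 4, "Socio-Econ group": "3"},
]
region_description = generalize(members, quantitative={"Bedroom"})
```

Every member of the class satisfies the resulting description, which is the property expected of a generalization; a histogram additionally keeps the frequencies.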
7.2 Examples of application strategies in SODAS
We start from data provided by the three Statistical Institutes involved in SODAS (ONS (England), INE (Portugal), EUSTAT (Basque Country (Spain))), such as household consumption, census, labour force survey or road transportation data. Units are, for instance, defined as "regions" or as "unemployment types", the latter defined by each category of a new variable such as "unemployed people categories x age categories x country" given by a query to the relational data base. Then DB2SO associates a symbolic description with each unit. Hence we get a symbolic data table on which symbolic data analysis methods can be applied. In order to summarize this symbolic data table and to get an overview of it, we can, for instance, proceed as follows. We apply DIV (see chapter 11), which provides classes of units. It is then possible to apply DB2SO again on the same units but with the classes given by DIV, so that each class represents a set of regions or a set of unemployment types. Hence we obtain a new symbolic data table where each unit represents one of these classes. Several symbolic data analysis methods can then be applied: for instance, a principal component analysis (PCA, see chapter 9) in order to get a graphical overview of these classes, a graphical visualisation of each class by "stars" (see chapter 7), a description of each class by a disjunction of assertions (DSD, see section 9.4), etc.
7.3 SODAS software overview
Figure 2 gives an overview of the SODAS software. The input of DB2SO (see chapter 5) is a query to a data base. Its output is a symbolic data table. Once this data table has been obtained, any symbolic data analysis method can be applied.
Figure 2: A SODAS software overview
7.4 SODAS future
The next steps in the future of SODAS will mainly consist, first, in extracting symbolic objects from the clustering, factorial analysis, decision tree or discrimination (standard or symbolic) methods; second, in inducing from these symbolic objects a new symbolic data table in order to study them by a symbolic data analysis of higher level; third, in selecting the "best" symbolic objects and prototypes by using good criteria; fourth, in propagating the obtained symbolic objects (the concepts that they represent). This propagation can be done towards the same Data Base, for instance at different times (in order to study the time evolution of the retained concepts), or towards other data bases associated with different countries. In any case, we have to compare sets of concepts and their associated symbolic objects obtained from different data bases. This may be done in several ways, for instance by looking for a consensus tree or pyramid between the concepts obtained in two different countries. Among many other ways, we can also calculate the extent in another country of the symbolic objects obtained from a first country, and then compare the concepts associated with the symbolic objects of the first country to the concepts of the second country induced by the "complete symbolic objects" obtained from these extents. An overview of the next steps for the research and development of the SODAS project is given in figure 3.
Figure 3: The future for the research and software development of the SODAS project.
Conclusion
The need to extend standard data analysis methods (exploratory, clustering, factorial analysis, discrimination, ...) to symbolic data tables in order to extract new knowledge is increasing due to the expansion of information technology, which is now able to store a growing number of huge data sets. This need has led to a new methodology called "symbolic data analysis", whose aim is to extend standard data analysis methods (exploratory, clustering, factorial analysis, discrimination, decision trees, ...) to a new kind of data table called a "symbolic data table" and to give more explanatory results expressed by real-world concepts mathematically represented by easily readable "symbolic objects". The aim of the European Community project called SODAS, for "Symbolic Official Data Analysis System", in which 15 institutions from 9 European countries are involved, is to produce a first software for Symbolic Data Analysis. Three Official Statistical Institutions are involved in this project: EUSTAT (Basque Country (Spain)), INE (Portugal) and ONS (England). An example of a future application proposed on their census data consists in finding clusters of unemployed people and their associated mined symbolic objects in one country, calculating their extent in the census of another country, and describing this extent by new symbolic objects in order to compare the behaviour of the two countries. To that end, several new theoretical developments are needed, such as the selection and the stochastic convergence of symbolic objects, as well as the consensus and contrast between sets of symbolic objects and their associated concepts extracted from different data bases. New software developments are also needed, such as a tool able to transform a symbolic object extracted from a data base into a query of this data base or of another data base. This new tool may be called SO2DB as it is complementary to the existing DB2SO.
Moreover, the next steps will be to improve the current methods explained in this book (robustness, validity of the results, etc.) and to extend the symbolic data analysis methodology to regression, multidimensional scaling, neural networks, etc.
References
Adanson M. (1757) "Histoire Naturelle du Sénégal- Coquillages". Bauche Paris.
Aristotle (IV BC) "Organon" Vol. I Catégories, II De l'interprétation. J. Vrin edit. (Paris) (1994).
Arnauld A., Nicole P. (1662) "La logique ou l'art de penser", Frommann, Stuttgart (1965).
Auriol E. (1995) "Intégration d'approches symboliques pour le raisonnement à partir d'exemples" Thèse de doctorat, Université Paris 9 Dauphine.
Barbut M., Monjardet B. (1971) "Ordre et classification", T.2 Hachette, Paris.
Belson (1959), "Matching and prediction on the principle of biological classification", Applied Statistics, vol. VIII.
Benzecri J.P. et al. (1973) "L'Analyse de Données", Dunod, Paris.
Bertrand P. (1986) "Etude de la représentation pyramidale", Thèse de 3° cycle, Université Paris IX-Dauphine.
Bock H.H. (1974) "Automatische Klassifikation". Vandenhoeck and Ruprecht, Göttingen.
Bravo C., Garcia-Santesmases J. (1998) "Symbolic objects description of strata by segmentation trees". Proc. NTTS. Ph. Nanopoulos, Garonna, Lauro edit, Eurostat, Sorrento (Italy).
Breiman L., Friedman J.H., Olshen R.A., Stone C.J. (1984) "Classification and regression trees", Belmont, Wadsworth.
Brito P., Diday E. (1991) "Pyramidal representation of symbolic objects" NATO ASI Series, Vol. F 61. Proc. Knowledge Data and computer-assisted Decisions. Schader and Gaul edit. Springer-Verlag.
Brito P. (1994) "Order structure of symbolic assertion objects". IEEE TR. on Knowledge and Data Engineering Vol.6, n° 5, October.
Bandemer H., Nather W. (1992) "Fuzzy Data Analysis". Kluwer Academic Publisher.
Cazes P., Chouakria A., Diday E., Schektman Y. (1997) "Extension de l'Analyse en Composantes Principales à des données intervalles". Revue de Statistique Appliquée, vol. XXXVIII, n°3, 1990, pp. 35-51.
Celeux G., Diday E., Govaert G., Lechevallier Y., Ralambondrainy H. (1989), "Classification Automatique: environnement Statistique et Informatique". Dunod.
Changeux J.P. (1983) "L'homme neuronal". Fayard, Collection Pluriel.
Chavent M. (1997) "Analyse des Données symboliques. Une méthode divisive de classification". Thèse de doctorat, Université Paris 9 Dauphine.
Ciampi A., Diday E., Lebbe J., Périnel E., Vignes R. (1995) "Recursive partition with probabilistically imprecise data". OSDA'95. Editors: Diday, Lechevallier, Opitz. Springer Verlag (1996).
Conruyt N. (1994) "Amélioration de la robustesse des systèmes d'aide à la description, à la classification et à la détermination des objets biologiques". Thèse de doctorat, Université Paris 9 Dauphine.
De Carvalho F.A.T. (1998 a) "New metrics for constrained boolean symbolic objects" Proc. KESDA'98, Eurostat. Luxembourg.
De Carvalho F.A.T. (1998 b) "Statistical proximity functions of boolean symbolic objects based on histograms" IFCS, Roma, Springer-Verlag.
Diday E.(1971) "La méthode des nuées dynamiques" ; Revue de Statist. Appliquée. Vol XIX, n° 2, pp. 19-34.
Diday E.(1976) "Sélection typologique de variables". Rapport INRIA. Rocquencourt 78150, France.
Diday E. (1976) "Cluster analysis" in K.S. Fu (ed.). Digital Pattern Recognition. Springer Verlag, pp. 47-94.
Diday E. et al. (1979) "Optimisation en classification automatique". INRIA edit. Rocquencourt 78150, France.
Diday E., Govaert G., Lechevallier Y., Sidi J. (1980) "Clustering in Pattern Recognition". Proceed. NATO Adv. Study Institute on Digital Processing and Analysis, Bonas, J.C. Simon edit.
Diday E. (1984) "Une représentation visuelle des classes empiétantes". Rapport INRIA n° 291. Rocquencourt 78150, France.
Diday E., Lemaire J., Pouget J., Testu F. (1984) "Eléments d'Analyse des données". Dunod , Paris.
Diday E. (1986) "Orders and overlapping clusters by pyramids". Proceed. Multidimensional Data Analysis. Edits. J. De Leeuw et al, DSWO Press, Leiden, The Netherlands.
Diday E. (1987 a) "The symbolic approach in clustering and related methods of Data Analysis" in "Classification and Related Methods of Data Analysis", Proc. IFCS, Aachen, Germany. H. Bock ed. North-Holland.
Diday E. (1987 b) "Introduction à l'approche symbolique en Analyse des Données ". Première Journées Symbolique-Numérique. Université Paris IX Dauphine. Décembre 1987.
Diday E. (1989) "Introduction à l'approche symbolique en analyse des données". RAIRO (Revue, d'Automatique, d'informatique et de Recherche Opérationnelle), vol. 23, n°2.
Diday E. (1995) " Probabilist, possibilist and belief objects for knowledge analysis " .Annals of Operations Research . 55, 227-276.
Diday E., Emilion R. (1995) "Lattices and Capacities in Analysis of Probabilist Objects". Proceed. of OSDA'95 (Ordinal and Symbolic Data Analysis). Springer Verlag Editor (1996).
Diday E., Emilion R. (1997) " Treillis de Galois maximaux et Capacités de Choquet" Compte rendu à l'Académie des Sciences. Analyse Mathématique, t. 324, série 1.
Diday E., Emilion R., Hillali Y. (1996) "Symbolic data analysis of probabilist objects by capacities and credibilities". XXXVIII Societa Italiana Di Statistica. Rimini, Italy.
Diday E.(1998) "L'Analyse des Données Symboliques: un cadre théorique et des outils" . Cahiers du CEREMADE.
Esposito F., Malerba D., Lisi F. (1998) "Flexible matching of boolean symbolic objects" Proc. NTTS'98 Sorrento, Italy. Nanopoulos, Garonna, Lauro edit. Eurostat (Luxembourg).
Ferraris, Gettler-Summa, C. Pardoux, H. Tong (1995) "Knowledge extraction using stochastic matrices: Application to elaborate a fishing strategy" Proc. Ordinal and Symbolic Data Analysis. Paris; Diday, Lechevallier, Opitz edit. Springer Studies in Classification.
Fisher D.H., Langley P. (1986) "Conceptual clustering and its relation to Numerical Taxonomy". Workshop on Artificial Intelligence and Statistics, Addison-Wesley, W. Gale edit.
Fisher D.H.,(1987) a, "Conceptual clustering learning from examples and inference". Proceed. 4th Workshop on Machine Learning. Irvine, California.
Ganascia J.G. (1991) "Charade: apprentissage de bases de connaissances". Cepadues, Kodratoff, Diday edit.
Gettler-Summa M. (1992) "Factorial axis interpretation by symbolic objects". Journées - Symbolique - Numérique. Université Paris IX- Dauphine. Lise-Ceremade.
Gettler-Summa M. (1997) "Symbolic marking: application on car accidents scenari" Proc. AMSDA, Capri, Italy.
Gigout E. (1998) " Graphical interpretation of symbolic objects resulting from data mining". Proc. KESDA'98, Eurostat. Luxembourg.
Gowda K.C., Diday E. (1992) "Symbolic clustering using a new similarity measure". IEEE Trans. Syst. Man and Cybernet. 22 (2), 368-378.
Gower J.C. (1974) "Maximal predictive classification". Biomet. Vol. 30, p. 643-644.
Hayes-Roth F., McDermott J. (1978) "An interference matching technique for inducing abstractions". Comm. ACM. Artificial Intelligence, Language processing.
Hebrail G. (1996) "SODAS (Symbolic Official Data Analysis System)". Proceedings of IFCS'96, Kobe, Japan. Springer Verlag.
Jambu M. (1978) "Classification Automatique pour l'Analyse des Données". Dunod, Paris.
Jardine N., Sibson R. (1971) "Mathematical Taxonomy". John-Wiley and Sons. New-York.
Jussieu A.L. (1748) "Taxonomy. Coup d'oeil sur l'histoire et les principes des classifications botaniques". Dictionnaire d'Histoire Universelle.
Lance G.N., Williams W.T. (1967) "A general theory of Classification sorting strategies: hierarchical systems". Comp. Journ. Vol. 9 n°4.
Langley P., Sage S. (1984) "Conceptual clustering as discrimination learning". Proceed. Fifth Biennial Conf. the Canadian Soc. for Comp. Studies of Intelligence.
Lebowitz M. (1983) "Generalization from natural language text" Cognit. Science 7, 1.
Lauro C., Palumbo F. (1998) "New approaches to Principal Component Analysis of Interval Data". Proc. NTTS'98 Sorrento, Italy. Nanopoulos, Garonna, Lauro edit. Eurostat (Luxembourg).
Lebart L., Morineau A., Piron M. (1995) "Statistique Exploratoire Multidimensionnelle" . Dunod, Paris.
Lebbe J. and Vignes R. (1991) "Génération de graphes d'identification à partir de descriptions de concepts", in Induction Symbolique-Numérique. Kodratoff, Diday edit. Cepadues (Toulouse).
Lerman I.C. (1970) "Les bases de la classification automatique" Gauthier-Villars, Paris.
Noirhomme-Fraiture, Rouard M. (1998) " Representation of Sub-Populations and Correlation with Zoom Star". Proc. NTTS'98 Sorrento, Italy. Nanopoulos, Garonna, Lauro edit. Eurostat (Luxembourg).
Mfoumoune E. (1998) "Les aspects algorithmiques de la classification ascendante pyramidale et incrémentale" . Thèse de doctorat, Université Paris 9 Dauphine.
Michalski R. (1973) "AQVAL/1 - Computer implementation of a variable-valued logic system VL1 and examples in Pattern Recognition". Proc. Int. Joint Conf. on Pattern Recognition, Washington D.C., pp 3-17.
Michalski R., Stepp R.E. (1983) "Automated construction of classifications: Conceptual Clustering versus Numerical Taxonomy", IEEE Trans. on Pattern Analysis and Machine Intelligence. Vol. 5, n°4.
Michalski R., Diday E., Stepp R.E. (1982) "A recent advance in Data Analysis: clustering objects into classes characterized by conjunctive concepts". Progress in Pattern Recognition, vol 1. L. Kanal and A. Rosenfeld Eds.
Morgan J.N., Sonquist J.A. (1963) "Problems in the analysis of survey data : a proposal". J.A.S.A. 58, p. 417-434.
Pankhurst R.J. (1978) "Biological identification. The principle and practice of identification methods in biology". London, Edward Arnold.
Payne R.W. (1975) "Genkey: a program for construction diagnostic keys". Biological Identification with Computer .Pankhurst edit. P. 65-72. Acad. Press. London
Périnel (1996) "Segmentation et Analyse de Données Symboliques: Application à des données Probabilistes Imprécises". Thèse de doctorat, Université Paris 9 Dauphine.
Pollaillon G., Diday E. (1997) " Galois lattices of symbolic objects " Rapport du Ceremade University Paris9- Dauphine (February).
Pollaillon G. (1998) "Organisation et interprétation par les treillis de Galois de données de type multivalué, intervalle ou histogramme". Thèse de doctorat, Université Paris 9 Dauphine.
Rasson J.P., Lissoir S. (1998) "Symbolic Kernel discriminant analysis" Proc. NTTS'98 Sorrento, Italy. Nanopoulos, Garonna, Lauro edit. Eurostat (Luxembourg).
Quinlan J.R. (1986) "Induction of decision trees". Machine Learning 1, pp 81-106. Kluwer Acad. Publishers, Boston.
Ralambondrainy H. (1991) "Apprentissage dans le contexte d'un schéma de base de Données" Kodratoff, Diday edit. CEPADUES, Toulouse.
Rosch E. (1978) "Principles of categorization". E. Rosch, B. Lloyd edits. Cognition and Categorization, pp 27-48. Hillsdale, N.J.: Erlbaum.
Roux M. (1985) "Algorithmes de classification", Masson.
Saporta G. (1990) "Probabilités, Analyse des Données et Statistiques". Edit. Technip Paris.
Schweizer B. (1985) "Distributions are the numbers of the future". Proc. sec. Napoli Meeting on "The mathematics of fuzzy systems". Instituto di Mathematica delle Faculta di Architectura, Universita degli studi di Napoli, pp. 137-149.
Schweizer B., Sklar A. (1983) "Probabilistic metric spaces". Elsevier North-Holland, New-York.
Sneath P.H.A., Sokal R.R. (1973) "Numerical Taxonomy" Freeman and Comp. Publishers. San Francisco.
Sowa J. (1984) Conceptual Structures: Information processing in mind and machine. Addison Wesley.
Stéphan (1998) "Construction d'objets symboliques par synthèse des résultats de requêtes SQL". Thèse de doctorat, Université Paris 9 Dauphine.
Tukey J. W. (1958) "Exploratory Data Analysis". Addison Wesley, Reading, Mass.
Vignes (1991) "Caractérisation automatique de groupes biologiques" . Thèse de doctorat, Université Paris 9 Dauphine.
Verde R., F.A.T. De Carvalho (1998) "Dependance rules influence on factorial representation of boolean symbolic objects". Proc. KESDA'98, Eurostat. Luxembourg.
Wagner H. (1973) "Begriff", Handbuch Philosophischer Grundbegriffe, eds H. Krings, H.M. Baumgartner and C. Wild, Kösel, München, pp. 191-209.
Ward J.H. (1963) "Hierarchical grouping to optimize an objective function". J. Amer. Stat. Assoc. 58, pp. 236-244.
Wille R. (1982) "Restructuring lattice theory: an approach based on hierarchies of concepts." Proceed. Symp. Ordered Sets (I. Rival ed.), Reidel, Dordrecht-Boston.
Wille R. (1989) "Knowledge Acquisition by methods of formal concepts analysis", in Data Analysis, Learning symbolic and Numeric Knowledge. Diday edit. Nova Sciences Publishers.
Winston P. (1979) "Artificial Intelligence". Addison Wesley.
Ziani D. (1996) "Sélection de variables sur un ensemble d'objets symboliques" Thèse de doctorat, Université Paris 9 Dauphine.