DB2SO: SO EXTRACTION FROM DB ( Version 2 - 28/8/99)


1.General information on DB2SO

    DB2SO is the part of SODAS software which enables the user to build a set of assertions from data stored in a relational database.

    It is assumed that a set of individuals is stored in the database and that these individuals are distributed into some groups. Then DB2SO can build one assertion for each group of individuals. In this process, mother/daughter variables and taxonomies on variable domains can also be associated with generated assertions.

    Theoretical aspects of DB2SO are developed in Chapter 5 of the SODAS Scientific Report.

    DB2SO is invoked from the IMPORT Menu of the SODAS software.

2. Typical use of DB2SO

    The typical interaction between the end-user and DB2SO is the following:

    + connection to a database

    + retrieving individuals distributed into groups by a SQL query

    . optionally defining mother/daughter variables among variables describing individuals

    . optionally adding single-valued variables to assertions

    . optionally adding multi-valued variables to assertions

    . optionally adding taxonomies on variable domains

    . optionally simplifying generated assertions using the reduction facility

    + specifying exportation and visualisation format of variables

    + visualising all work that you have already done

    + exporting generated assertions to a SODAS file

    + saving the current session to be able to restart it later

    Above items marked with a "+" are compulsory steps you typically always go through, while those marked with a "." are optional.

3. Connection to the database

DB2SO access to relational database is done by using the Microsoft ODBC facility. Please refer to Microsoft ODBC documentation which is available on your computer under Windows 95 or Windows NT.

We assume here you have already installed on your PC the necessary ODBC drivers to access your database. Standard ODBC drivers are included with Microsoft software (like EXCEL, MSACCESS drivers). If you need to access ORACLE databases or SAS files, you must ask your retailer for the corresponding driver.

DB2SO does not provide any menu to connect to the database because the user is automatically prompted to connect when necessary, i.e. if not yet connected before running a query.

If the user wants to change the database during a DB2SO session, the procedure is the following :

- the user disconnects from the current database by using the File/Disconnect Menu (see p.16),

- the user will then be prompted to specify a database before running the next query.

Another way to change the database is to click on the Modify button on the window in which SQL queries are written (see the Writing a SQL query Section on p.13).

Note that simultaneous connections to several databases are not supported.

When the user is prompted to connect to a database, the left window below appears. The user has to click on the Machine Data Source button, which leads to the right window below. In this window, you select the database. If you use an ACCESS database, you have to select the MS ACCESS driver which is available on your computer, and you will then be prompted to choose the file containing the database.

If you use another database system, you have to choose the corresponding ODBC driver.

Note : if you often use the same database file, it is possible to define an entry in the above right window which corresponds to a particular database file : this is done by clicking the New button in this window and follow instructions.

4. Writing a SQL query

In different steps of DB2SO, you will be prompted to write SQL queries. Depending on the step you are performing, a different window appears (below is shown the window for retrieving individuals from the database). But these windows always contain the two parts marked below by an arrow :

The Modify button enables the user to change the database to which the query is submitted.

The SQL query space is for typing the query in the SQL language.

Important note on SQL :

- you have to type standard ODBC SQL in this window but not SQL of the database system. For instance, if you use MS ACCESS database, strings are delimited by ² while they are delimited by ¢ in the standard ODBC SQL.

- if you ignore everything about ODBC SQL, the best way to run a query is to prepare it in your database system, store it under a name (for instance Q1) as a query or as a view, and type the following query in the above window : select * from Q1. This will return to DB2SO exactly what the query returns to your favourite database system.

5. File Menu

5.2 File/New Menu

Initialises a new DB2SO session : this is the command which leads you to specify the query which extracts individuals from the database. So, if you are not already connected to a database, you will be prompted by the ODBC menu to choose the database SQL queries will be submitted (See Connection to the database ).

Once connected to the database, the window above appears :

In this window, you have to type a SQL query of type 1, as specified in Chapter 5 of the SODAS Scientific Report.

The query must return one row for each individual and the expected structure of rows is (ID_IND, ID_GROUP, ATT1, ..., ATTp), where ID_IND is the individual identifier, ID_GROUP is the group identifier, and ATTi are attributes describing individuals. The number of attributes is variable but a minimum of one is required.

If the last column of the query is a weight associated with each individual, select the 'Last column is ponderation' option in the window. If you forget to select this option, the last column will lead to an interval variable in generated assertions.

If the query is supposed to return a very large number of tuples, this may overload the main memory: in this case, the user should select the sampling option in the window and specify the maximum number of individuals to be kept for each group (1000 by default, as shown in the above window).

When you press OK, the SQL query is submitted to the database and if there is no error in the query, the following operations are performed :

- individuals are retrieved from the database and stored in main memory,

- the array of assertions is generated.

Both the array of individuals and the array of assertions can then be viewed using the View Menu .

5.2 File/New by Join Menu

This option allows to build an array of assertions by joining two arrays of assertions stored into *.gaj files. Files named *.gaj are files storing DB2SO sessions .

The join operation takes as input two arrays of assertions. It produces a new array of assertions featuring the intersection of the two sets of assertions described by the union of the two corresponding sets of variables. Please refer to Chapter 5 of the SODAS Scientific Report for more information on the join operation on two arrays of assertions.

After specifying the names of two *.gaj files, the join of the two arrays of assertions is performed and can be saved in another *.gaj file, or exported to a SODAS file (*.sds).

Note that no array of individuals is associated with an array of assertions which is the result of a join operation. Consequently, all DB2SO options related to individuals are inactive in this case.

Join of two SODAS files (*.sds) is not supported by DB2SO.

5.3 File/Save (as) Menu

Saves the current session in a *.gaj file. This enables the user to retrieve later the current state of a DB2SO session, by using the File/Open Menu (see p.16).

Just give the name of the file. The extension of these files is .gaj by default.

5.4 File/Open Menu

Recovers the session stored in the selected *.gaj file. Just give the name of the file when prompted.

Note that *.sds files (see the File/Export (and view) Menu Section ) cannot be opened by this command.

5.5 File/Export (and view) Menu

Builds a SODAS file containing the array of assertions of the current session. SODAS files are files used as the input of SODAS methods. The extension of SODAS files is .sds.

A title and a subtitle must be given as comments to the contents of the SODAS file.

If assertions have been reduced the user can either export the reduced assertions (by selecting the option in the window above) or export the non-reduced assertions.

Assertions are exported according to the properties defined in the View/Variable properties Menu . These properties enable the user to specify which variables describing the assertions are exported and if they are exported as multi-valued (possibly modal) variables in the case of categorical variables.

The File/Export command just builds the *.sds file while the File/Export and view one also shows the generated file using Wordpad.

5.6 File/Disconnect Menu

Disconnects DB2SO from the database it is currently connected. This command is useful if the user wants to extract data from another database for further operations with DB2SO.

6. Modif Menu

After running a File/New, or File/New by join command, an array of assertions is generated by DB2SO. In the case of File/New, variables describing individuals lead to multi-valued variables describing assertions.

The Modify Menu enables the user to modify assertions handled by the current session. Several operations can be performed:

- adding and removing single-valued or multi-valued ( also called a set-valued multiple) variables to assertions,

- associating taxonomies with variable domains,

- specifying mother/daughter variables by the mean of rule definition,

- reducing assertions.

6.1 Modify/Add single-valued variables Menu

Adds one or several single-valued variables to assertions handled by the current session. Values of these variables for the assertions of the current session are extracted from the database by a SQL query of type 2 .

When choosing this option, the window above appears. The user just has to write a SQL query of type 2 which returns the values of the new variables. A SQL query of type 2 has the following structure:

ID_ASSERTION, ASSERTION_ATT1, ...., ASSERTION_ATTp

where ID_ ASSERTION is the assertion identifier and ASSERTION_ATTi are one or several variables describing assertions. Only one row should be returned by this query for each assertion ID.

If the query returns a null value for some variables of some assertions, a missing value will be associated with the corresponding assertion for these variables.

If the query returns ASSERTION ID's which do not exist in the array of the current session, information about these ASSERTION ID's is lost.

If the query does not return a row for an ASSERTION ID of the array of the current session, a missing value will be associated with all created variables of this assertion.

6.2 Modify/Add one set-valued multiple variable Menu

Adds one set-valued multiple variable to the array of assertions. Values for this new variable are extracted from the database by a SQL query of type 3.

The user just has to supply a SQL query in a window which is similar to the one for adding a single-valued variable.

In the case where the set of assertion ID's of the current session differs from those returned by the query, same rules apply as described in the preceding Section.

6.3 Modify/Remove variables Menu

Enables the user to remove variables which have been added either by the Modify/Add single-valued variables Menu or the Modify/Add one set-valued multiple variable Menu.

Just select the variable you want to remove in the window shown below :

6.4 Modify/Create a taxonomy Menu

Adds a taxonomy structure to the domain of a variable describing generated assertions.

Data associated with the taxonomy is extracted from the database by a SQL query of type 5a or 5b .

After invoking this command, the window shown above is displayed. The user has to select in the pop-down list the variable to which the taxonomy will be added. The system recognises if it is a query of type 5a or 5b.

Messages are displayed to the user if data returned by the query is invalid.

6.5 Modify/Remove taxonomies Menu

Removes the taxonomy associated with the selected variable.

6.6 Modify/Add a rule Menu

Enables the user to define mother/daughter variables among variables describing individuals (it is not possible to define mother/daughter variables on variables added with the Modify/Add single-valued variables Menu or Modify/Add one set-valued multiple variable Menu).

Mother/daughter variables are defined by rules of the form:

Salary IS APPLICABLE IF Activity IN ('Working', 'Retired with a pension')

where Salary and Activity are variable names, and 'Working' and 'Retired with a pension' are values of the Children variable.

This rules mean that there is no value of the Salary variable for individuals having a value of the Activity variable different from 'Working' and 'Retired with a pension', for instance for individuals having a value 'Unemployed'.

The specification of one rule is done through the window shown above. In this sample window, the following rule has been specified:

Childrens_school_meals IS APPLICABLE IF Children IN ('Yes')

First select the mother variable in the left list, then all values of the mother variable appear in the central list. Select one or several values in the central list. Finally select one or several variables which will be the daughter variables. When several daughter variables are selected, there will be as many rules generated.

After clicking OK in this window, DB2SO checks if individuals respect the definition of the rules:

Limitations:

Mother/daughter variables and their associated rules can be displayed by the View/Rules Menu .

6.7 Modify/Reduce assertions Menu

Performs a reduction of assertions: DB2SO automatically simplifies generated assertions by removing untypical individuals.

Please refer to the SODAS Scientific Report for more information about this reduction process.

The user has to give a threshold which corresponds to the minimum percentage of individuals still recognized by each assertion after the reduction process.

7. View Menu

7.1 View/Individuals Menu

Displays in a Wordpad window the current array of individuals (see the example below). Changes made in Wordpad are ineffective in DB2SO.

Note that you can save what is displayed with Wordpad in a file you may use for other purposes.

7.2 View/ Assertions Menu

Displays in a Wordpad window the generated assertions in the SOL language (see the example below).

7.3 View/ Taxonomies Menu

Selecting this option in the View Menu opens the window shown above. Select in the pop-down list the variable (here Socio_Economic_Group) for which the associated taxonomy has to be displayed. Then the taxonomy is displayed in the scrollbar window as shown above.

7.4 View/ Rules Menu

Selecting this option in the View Menu opens the window shown above. In the left part, rules representing mother/daughter variables are displayed. In the right part, the hierarchy induced by the definition of mother/daughter variable rules is displayed.

7.5 View/ Reduced assertions Menu

Same as the View/Assertions Menu except that reduced assertions are displayed, if a reduction process (see Modify/Reduce assertions Menu ) has been performed before.

7.1 View? Volume matrix Menu

Displays the volume matrix associated with the reduction process of assertions. For a definition of the volume of an assertion, please refer to Section 5.4.3 of Chapter 5 of the SODAS Scientific Report.

The matrix is a square matrix in which each cell corresponds to a couple of assertions. Each cell shows :

- a white square with a surface proportional to the volume of the corresponding row assertion,

- another white square with a surface proportional to the volume of the corresponding column assertion,

- a black square representing the volume of the intersection of the row and column assertions. (note that the two white squares are laid out so that their intersection is the black square).

When selecting the View initial volume matrix, red line squares show the state of this volume matrix without running the reduction process. This visualisation gives an idea of the effect of assertion reduction (the reduction process is supposed to reduce intersection - overlap - between assertions).

A zoom is available in this window.

7.7 View/Variable properties Menu

In this window, the user specifies the format of assertions for display (in the View/Assertions Menu and the View/Reduced assertions Menu) and exportation (in the File/Export (and view) Menu). Both variables generated from individual variables and variables directly added to assertions are accessible in this menu.

The user selects one variable in the left part of the window and chooses properties of this variable in the right part:

- selecting 'Inactive' makes the variable invisible both for display and exportation,

- for nominal multiple variables, the user can choose between a boolean, cardinality or probabilist representation.

Note that this last option is available for numerical variables and single-valued variables.

Note that cardinality representation is not available if a weight has been associated with individuals (see File/New Menu ).

8. SQL query types

8.1 SQL query of type 1

This query type is used to retrieve the following information from the database:

- individuals ID and attributes, possibly associated with a weight for each individual. These attributes lead to multi-valued variables describing generated assertions.

- groups ID which become the names of generated assertions.

- membership of each individual to one or several groups.

To summarize, the structure of this query type is the following:

IND_ID, GROUP_ID, IND_V1, IND_V2, ...., IND_Vk [, IND_WEIGHT]

This query must contain at least three columns (individual ID, group ID and one variable describing individuals). The weight column is optional and the number of individual attributes is variable.

Example:

CASENO

Region

Bedroom

Dining\living

Socio - Economic Group

114051 Northern metropolitan 2 1 Managers : large establishment
114061 Northern metropolitan 2 1 Own account non_professional
114071 Northern metropolitan 2 1 Intermediate non_manual ancill
114101 Northern metropolitan 3 2 Semi-skilled manual
114111 Northern metropolitan 2 1 Managers : large establishment
114131 Northern metropolitan 1 1 Intermediate non_manual ancill
114141 Northern metropolitan 3 1 Unskilled manual
114161 Northern metropolitan 3 1 Managers : large establishment
114171 Northern metropolitan 3 1 Skilled manual
201081 North non-metropolitan 2 1 Semi-skilled manual
201091 North non-metropolitan 4 3 Skilled manual
201101 North non-metropolitan 3 1 Personal service
201111 North non-metropolitan 2 1 Skilled manual
201131 North non-metropolitan 2 1 Junior non_manual
201141 North non-metropolitan 3 2 Unskilled manual
201171 North non-metropolitan 3 1 Intermediate non_manual ancill
... ...

This query returns one row per household. Individuals are households identified by the first column: CASENO. The group ID is defined by the second column, here REGION. There will be one assertion generated for each region. Households are described by two numerical and one qualitative attribute: the number of two types of rooms in the house and the socio-economic group of the household. Both numerical and qualitative attributes can describe individuals. The only difference is in the generalisation process where numerical variables lead to intervals of values while qualitative ones lead to lists of values or probability distributions. In this query, there is no weight column. If a weight column is present, this must be the last column and the user has to tell it to the system.

The weight column is used in the generalisation process for the computation of probability values associated with modal multi-valued variables. When no weight column is present, all individuals are assumed to have the same weight, equal to 1.

8.2 SQL query of type 2

This query type is used to retrieve the following information from the database:

- single-valued attributes describing groups: these attributes lead to single-valued variables describing generated assertions.

The structure of this query type is the following:

GROUP_ID, GROUP_V1, GROUP_V2, ...., GROUPE_Vp

This query must contain at least two columns: the group ID and one group attribute. The number of group attributes is variable.

Example:

Region

Number of cars

%Unemployment

East anglia 273000 8
East midlands non-metropolitan 491000 9
Greater london north east 197000 10
Greater london north west 154000 7
Greater london south east 149000 8
Greater london south west 160000 6
North non-metropolitan 260000 10
North west metropolitan 458000 11
... ... ...

This query returns two single-valued attributes describing regions. There is one row per region, thus one value per region for every attribute. Here the two attributes are numerical but they could have been qualitative. Both qualitative and numerical attributes can be returned in the same query.

8.3 SQL query of type 3

This query type is used to retrieve the following information from the database:

- exactly one native multi-valued (qualitative) attribute describing groups, which lead to a multi-valued variable describing generated assertions

The structure of this query type is the following:

GROUP_ID, MULT_VAL_ATT, CARDINALITY

Example:

Region

Accomodation type

Number of cases

East anglia Caravan houseboat 1
East anglia Part of house converted flat o 16
East anglia Purpose-built flat or maisonet 22
East anglia Whole house bungalow detache 85
East anglia Whole house bungalow semi-de 83
East anglia Whole house bungalow terrace 66
East midlands non-metropolitan Caravan houseboat 3
East midlands non-metropolitan Dwelling with business premise 1
East midlands non-metropolitan Part of house converted flat o 17
East midlands non-metropolitan Purpose-built flat or maisonet 42
East midlands non-metropolitan Whole house bungalow detache 134
East midlands non-metropolitan Whole house bungalow semi-de 182
East midlands non-metropolitan Whole house bungalow terrace 112
Greater london north east Dwelling with business premise 1
Greater london north east Part of house converted flat o 17
... ... ...

The first column of the query describes the group ID, the second one contains modalities of the multi-valued attribute, and the third one is the cardinality of the modality within the group.

This query type is useful to acquire multi-valued variables describing assertions when they are not computed from an underlying population. It is not adviced to use this feature if the attribute describes an underlying population: in this case, it is preferable to use a SQL query of type 1 retrieving the underlying population, run the reduction process, and finally join generated assertions to the current ones.

8.4 SQL query of type 4 (not supported by version V2 of DB2SO)

This query type is used to retrieve the following information from the database:

- exactly one native interval (numerical) attribute describing groups, which lead to an interval variable describing generated assertions

The structure of this query type is the following:

GROUP_ID, MIN_VALUE, MAX_VALUE

where MIN_VALUE and MAX_VALUE are the bounds of the interval of the new variable describing groups identified by GROUP_ID.

Example:

REGION

LOWER

UPPER

East anglia 1531,59 3058,34
East midlands non-metropolitan -84,31 259,95
Greater london north east 3525,06 6878,74
Greater london north west 27132,72 32852,73
Greater london south east 33847,21 41307,62
Greater london south west -52,99 326,95
North non-metropolitan -65,69 405,41
North west metropolitan 1833,84 3336,49
... ... ...

In this example, this query returns, for each region, a confidence interval for the average income of a particular class of people. This will create an interval numerical variable describing assertions. As for queries of type 3, it is not adviced to use this feature when this data can be calculated from an underlying population.

Remark:

This SQL type can be remplaced by a SQL  query of type 1 on the folowing table;

ID

REGION

V1

ID_1 East anglia 1531,59
ID_2 East anglia 3058,34
ID_3 East midlands non-metropolitan -84,31
ID_4 East midlands non-metropolitan 259,95
ID_5 Greater london north east 3525,06
ID_6 Greater london north east 6878,74
ID_7 Greater london north west 27132,72
ID_8 Greater london north west 32852,73
ID... ... ...

The struture of this query type is the following:

IND_ID, GROUP_ID, IND_V1

8.5 SQL query of type 5

This query type is used to retrieve the following information from the database:

- a taxonomy on values of a particular variable

This information can be retrieved from the database by two means, corresponding to two different ways of representing a taxonomy in a table:

SQL query of type 5.a ATT_VALUE, LEV1_VALUE, LEV2_VALUE, ...., LEVp_VALUE

This is the representation suited when all leaves of the taxonomy are at the same depth in the tree. A typical example is a query returning TOWN, DEPARTMENT, REGION, COUNTRY to define a taxonomy on a TOWN variable.

SQL query of type 5.b CHILD_ATT_VALUE, PARENT_ATT_VALUE

This is a child/parent representation of the taxonomy which links together the different values of an attribute domain. An example of this representation is given below.

Example of the child/parent representation

Socio - Economic Group

Parent Group

Agricultural workers Employed
Armed forces Employed
Employers Self-employed
Employers: large establishment Employers
Employers: small establishment Employers
Farmers Self-employed
Farmers:emp&mgrs Farmers
Farmers:own account Farmers
Intermediate non_manual ancill Non-manual
Intermediate non_manual foreme Non-manual
Junior non_manual Non-manual
Managers Employed
Managers : large establishment Managers
Managers : small establishment Managers
Manual Employed
Manual - foremen\supervisors Managers
Non-manual Employed
Own account non_professional Self-employed
Personal service Employed
Professional - employee Employed
Professional - self employed Self-employed
Semi-skilled manual Manual
Skilled manual Manual
Unskilled manual Manual

which leads to the following taxonomy:

Employed

Self-employed

9. Error Messages

Not any errors messages are coded. They are clearly expressed and tell the user how to fix the problem.