DB2SO: SO EXTRACTION FROM DB ( Version 2 - 28/8/99)
1 General information on DB2SO
5.1 File/New Menu
5.4 File/Open Menu
6.1 Modify/Add single-valued variables Menu
6.2 Modify/Add one set-valued multiple variable Menu
6.3 Modify/Remove variables Menu
6.4 Modify/Create a taxonomy Menu
1.General information on DB2SO ![]()
DB2SO is the part of SODAS software which enables the user to build a set of assertions from data stored in a relational database.
It is assumed that a set of individuals is stored in the database and that these individuals are distributed into some groups. Then DB2SO can build one assertion for each group of individuals. In this process, mother/daughter variables and taxonomies on variable domains can also be associated with generated assertions.
Theoretical aspects of DB2SO are developed in Chapter 5 of the SODAS Scientific Report.
DB2SO is invoked from the IMPORT Menu of the SODAS software.
The typical interaction between the end-user and DB2SO is the following:
+ retrieving individuals distributed into groups by a SQL query
. optionally defining mother/daughter variables among variables describing individuals
. optionally adding single-valued variables to assertions
. optionally adding multi-valued variables to assertions
. optionally adding taxonomies on variable domains
. optionally simplifying generated assertions using the reduction facility
+ specifying exportation and visualisation format of variables
+ visualising all work that you have already done
+ exporting generated assertions to a SODAS file
+ saving the current session to be able to restart
it later
Above items marked with a "+" are compulsory steps you typically always go through, while those marked with a "." are optional.
DB2SO access to relational database is done by using the Microsoft ODBC facility. Please refer to Microsoft ODBC documentation which is available on your computer under Windows 95 or Windows NT.
We assume here you have already installed on your PC the necessary ODBC drivers to access your database. Standard ODBC drivers are included with Microsoft software (like EXCEL, MSACCESS drivers). If you need to access ORACLE databases or SAS files, you must ask your retailer for the corresponding driver.
DB2SO does not provide any menu to connect to the database because the user is automatically prompted to connect when necessary, i.e. if not yet connected before running a query.
If the user wants to change the database during a DB2SO session, the procedure is the following :
- the user disconnects from the current database by using the File/Disconnect Menu (see p.16),
- the user will then be prompted to specify a database before running
the next query.
Another way to change the database is to click on the Modify button on the window in which SQL queries are written (see the Writing a SQL query Section on p.13).
Note that simultaneous connections to several databases are not supported.
When the user is prompted to connect to a database, the left window below appears. The user has to click on the Machine Data Source button, which leads to the right window below. In this window, you select the database. If you use an ACCESS database, you have to select the MS ACCESS driver which is available on your computer, and you will then be prompted to choose the file containing the database.
If you use another database system, you have to choose the corresponding ODBC driver.
Note : if you often use the same database file, it is possible to define an entry in the above right window which corresponds to a particular database file : this is done by clicking the New button in this window and follow instructions.
In different steps of DB2SO, you will be prompted to write SQL queries. Depending on the step you are performing, a different window appears (below is shown the window for retrieving individuals from the database). But these windows always contain the two parts marked below by an arrow :
The Modify button enables the user to change the database to which the query is submitted.
The SQL query space is for typing the query in the SQL language.
Important note on SQL :
- you have to type standard ODBC SQL in this window but
not SQL of the database system. For instance, if you use MS ACCESS
database, strings are delimited by ² while they are delimited by ¢
in the standard ODBC SQL.
- if you ignore everything about ODBC SQL, the best way to run a query is to prepare it in your database system, store it under a name (for instance Q1) as a query or as a view, and type the following query in the above window : select * from Q1. This will return to DB2SO exactly what the query returns to your favourite database system.
Initialises a new DB2SO session : this is the command which leads you to specify the query which extracts individuals from the database. So, if you are not already connected to a database, you will be prompted by the ODBC menu to choose the database SQL queries will be submitted (See Connection to the database ).
Once connected to the database, the window above appears :
In this window, you have to type a SQL query of type 1, as specified in Chapter 5 of the SODAS Scientific Report.
The query must return one row for each individual and the expected structure of rows is (ID_IND, ID_GROUP, ATT1, ..., ATTp), where ID_IND is the individual identifier, ID_GROUP is the group identifier, and ATTi are attributes describing individuals. The number of attributes is variable but a minimum of one is required.
If the last column of the query is a weight associated with each individual, select the 'Last column is ponderation' option in the window. If you forget to select this option, the last column will lead to an interval variable in generated assertions.
If the query is supposed to return a very large number of tuples, this may overload the main memory: in this case, the user should select the sampling option in the window and specify the maximum number of individuals to be kept for each group (1000 by default, as shown in the above window).
When you press OK, the SQL query is submitted to the database and if there is no error in the query, the following operations are performed :
- individuals are retrieved from the database and stored in main memory,
- the array of assertions is generated.
Both the array of individuals and the array of assertions can then be viewed using the View Menu .
This option allows to build an array of assertions by joining two arrays of assertions stored into *.gaj files. Files named *.gaj are files storing DB2SO sessions .
The join operation takes as input two arrays of assertions. It produces a new array of assertions featuring the intersection of the two sets of assertions described by the union of the two corresponding sets of variables. Please refer to Chapter 5 of the SODAS Scientific Report for more information on the join operation on two arrays of assertions.
After specifying the names of two *.gaj files, the join of the two arrays of assertions is performed and can be saved in another *.gaj file, or exported to a SODAS file (*.sds).
Note that no array of individuals is associated with an array of assertions which is the result of a join operation. Consequently, all DB2SO options related to individuals are inactive in this case.
Join of two SODAS files (*.sds) is not supported by DB2SO.
Saves the current session in a *.gaj file. This enables the user to retrieve later the current state of a DB2SO session, by using the File/Open Menu (see p.16).
Just give the name of the file. The extension of these files is .gaj by default.
Recovers the session stored in the selected *.gaj file. Just give the name of the file when prompted.
Note that *.sds files (see the File/Export (and view) Menu Section ) cannot be opened by this command.
5.5 File/Export (and view) Menu![]()
Builds a SODAS file containing the array of assertions of the current session. SODAS files are files used as the input of SODAS methods. The extension of SODAS files is .sds.
A title and a subtitle must be given as comments to the contents of the SODAS file.
If assertions have been reduced the user can either export the reduced assertions (by selecting the option in the window above) or export the non-reduced assertions.
Assertions are exported according to the properties defined in the View/Variable properties Menu . These properties enable the user to specify which variables describing the assertions are exported and if they are exported as multi-valued (possibly modal) variables in the case of categorical variables.
The File/Export command just builds the *.sds file while the File/Export and view one also shows the generated file using Wordpad.
Disconnects DB2SO from the database it is currently connected. This command is useful if the user wants to extract data from another database for further operations with DB2SO.
After running a File/New, or File/New by join command, an array of assertions is generated by DB2SO. In the case of File/New, variables describing individuals lead to multi-valued variables describing assertions.
The Modify Menu enables the user to modify assertions handled by the current session. Several operations can be performed:
- adding and removing single-valued or multi-valued ( also called a set-valued multiple) variables to assertions,
- associating taxonomies with variable domains,
- specifying mother/daughter variables by the mean of rule definition,
- reducing assertions.
6.1 Modify/Add single-valued variables
Menu![]()
Adds one or several single-valued variables to assertions handled by the current session. Values of these variables for the assertions of the current session are extracted from the database by a SQL query of type 2 .
When choosing this option, the window above appears. The user just has to write a SQL query of type 2 which returns the values of the new variables. A SQL query of type 2 has the following structure:
ID_ASSERTION, ASSERTION_ATT1, ...., ASSERTION_ATTp
where ID_ ASSERTION is the assertion identifier and ASSERTION_ATTi are one or several variables describing assertions. Only one row should be returned by this query for each assertion ID.
If the query returns a null value for some variables of some assertions, a missing value will be associated with the corresponding assertion for these variables.
If the query returns ASSERTION ID's which do not exist in the array of the current session, information about these ASSERTION ID's is lost.
If the query does not return a row for an ASSERTION ID of the array of the current session, a missing value will be associated with all created variables of this assertion.
6.2 Modify/Add one set-valued multiple
variable Menu![]()
Adds one set-valued multiple variable to the array of assertions. Values for this new variable are extracted from the database by a SQL query of type 3.
The user just has to supply a SQL query in a window which is similar to the one for adding a single-valued variable.
In the case where the set of assertion ID's of the current session differs from those returned by the query, same rules apply as described in the preceding Section.
6.3 Modify/Remove variables Menu![]()
Enables the user to remove variables which have been added either by the Modify/Add single-valued variables Menu or the Modify/Add one set-valued multiple variable Menu.
Just select the variable you want to remove in the window shown below :
6.4 Modify/Create a taxonomy Menu![]()
Adds a taxonomy structure to the domain of a variable describing generated assertions.
Data associated with the taxonomy is extracted from the database by a SQL query of type 5a or 5b .
After invoking this command, the window shown above is displayed. The user has to select in the pop-down list the variable to which the taxonomy will be added. The system recognises if it is a query of type 5a or 5b.
Messages are displayed to the user if data returned by the query is invalid.
6.5 Modify/Remove taxonomies Menu![]()
Removes the taxonomy associated with the selected variable.
Enables the user to define mother/daughter variables among variables describing individuals (it is not possible to define mother/daughter variables on variables added with the Modify/Add single-valued variables Menu or Modify/Add one set-valued multiple variable Menu).
Mother/daughter variables are defined by rules of the form:
Salary IS APPLICABLE IF Activity IN ('Working', 'Retired with a pension')
where Salary and Activity are variable names, and 'Working' and 'Retired with a pension' are values of the Children variable.
This rules mean that there is no value of the Salary variable for individuals having a value of the Activity variable different from 'Working' and 'Retired with a pension', for instance for individuals having a value 'Unemployed'.
The specification of one rule is done through the window shown above. In this sample window, the following rule has been specified:
Childrens_school_meals IS APPLICABLE IF Children IN ('Yes')
First select the mother variable in the left list, then all values of the mother variable appear in the central list. Select one or several values in the central list. Finally select one or several variables which will be the daughter variables. When several daughter variables are selected, there will be as many rules generated.
After clicking OK in this window, DB2SO checks if individuals respect the definition of the rules:
Limitations:
Mother/daughter variables and their associated rules can be displayed by the View/Rules Menu .
6.7 Modify/Reduce assertions Menu![]()
Performs a reduction of assertions: DB2SO automatically simplifies generated assertions by removing untypical individuals.
Please refer to the SODAS Scientific Report for more information about this reduction process.
The user has to give a threshold which corresponds to the minimum percentage of individuals still recognized by each assertion after the reduction process.
Displays in a Wordpad window the current array of individuals (see the example below). Changes made in Wordpad are ineffective in DB2SO.
Note that you can save what is displayed with Wordpad in a file you may use for other purposes.
Displays in a Wordpad window the generated assertions in the SOL language (see the example below).
Selecting this option in the View Menu opens the window shown above. Select in the pop-down list the variable (here Socio_Economic_Group) for which the associated taxonomy has to be displayed. Then the taxonomy is displayed in the scrollbar window as shown above.
Selecting this option in the View Menu opens the window shown above. In the left part, rules representing mother/daughter variables are displayed. In the right part, the hierarchy induced by the definition of mother/daughter variable rules is displayed.
7.5 View/ Reduced assertions Menu![]()
Same as the View/Assertions Menu except that reduced assertions are displayed, if a reduction process (see Modify/Reduce assertions Menu ) has been performed before.
Displays the volume matrix associated with the reduction process of assertions. For a definition of the volume of an assertion, please refer to Section 5.4.3 of Chapter 5 of the SODAS Scientific Report.
The matrix is a square matrix in which each cell corresponds to a couple of assertions. Each cell shows :
- a white square with a surface proportional to the volume of the corresponding row assertion,
- another white square with a surface proportional to the volume of the corresponding column assertion,
- a black square representing the volume of the intersection of the
row and column assertions. (note that the two white squares are laid out
so that their intersection is the black square).
When selecting the View initial volume matrix, red line squares show the state of this volume matrix without running the reduction process. This visualisation gives an idea of the effect of assertion reduction (the reduction process is supposed to reduce intersection - overlap - between assertions).
A zoom is available in this window.
7.7 View/Variable properties Menu![]()
In this window, the user specifies the format of assertions for display (in the View/Assertions Menu and the View/Reduced assertions Menu) and exportation (in the File/Export (and view) Menu). Both variables generated from individual variables and variables directly added to assertions are accessible in this menu.
The user selects one variable in the left part of the window and chooses properties of this variable in the right part:
- selecting 'Inactive' makes the variable invisible both for display and exportation,
- for nominal multiple variables, the user can choose between
a boolean, cardinality or probabilist representation.
Note that this last option is available for numerical variables and single-valued variables.
Note that cardinality representation is not available if a weight has been associated with individuals (see File/New Menu ).
This query type is used to retrieve the following information from the database:
- individuals ID and attributes, possibly associated with a weight for each individual. These attributes lead to multi-valued variables describing generated assertions.
- groups ID which become the names of generated assertions.
- membership of each individual to one or several groups.
To summarize, the structure of this query type is the following:
IND_ID, GROUP_ID, IND_V1, IND_V2, ...., IND_Vk [, IND_WEIGHT]
This query must contain at least three columns (individual ID, group ID and one variable describing individuals). The weight column is optional and the number of individual attributes is variable.
Example:
|
CASENO |
Region |
Bedroom |
Dining\living |
Socio - Economic Group |
| 114051 | Northern metropolitan | 2 | 1 | Managers : large establishment |
| 114061 | Northern metropolitan | 2 | 1 | Own account non_professional |
| 114071 | Northern metropolitan | 2 | 1 | Intermediate non_manual ancill |
| 114101 | Northern metropolitan | 3 | 2 | Semi-skilled manual |
| 114111 | Northern metropolitan | 2 | 1 | Managers : large establishment |
| 114131 | Northern metropolitan | 1 | 1 | Intermediate non_manual ancill |
| 114141 | Northern metropolitan | 3 | 1 | Unskilled manual |
| 114161 | Northern metropolitan | 3 | 1 | Managers : large establishment |
| 114171 | Northern metropolitan | 3 | 1 | Skilled manual |
| 201081 | North non-metropolitan | 2 | 1 | Semi-skilled manual |
| 201091 | North non-metropolitan | 4 | 3 | Skilled manual |
| 201101 | North non-metropolitan | 3 | 1 | Personal service |
| 201111 | North non-metropolitan | 2 | 1 | Skilled manual |
| 201131 | North non-metropolitan | 2 | 1 | Junior non_manual |
| 201141 | North non-metropolitan | 3 | 2 | Unskilled manual |
| 201171 | North non-metropolitan | 3 | 1 | Intermediate non_manual ancill |
| ... | ... |
This query returns one row per household. Individuals are households identified by the first column: CASENO. The group ID is defined by the second column, here REGION. There will be one assertion generated for each region. Households are described by two numerical and one qualitative attribute: the number of two types of rooms in the house and the socio-economic group of the household. Both numerical and qualitative attributes can describe individuals. The only difference is in the generalisation process where numerical variables lead to intervals of values while qualitative ones lead to lists of values or probability distributions. In this query, there is no weight column. If a weight column is present, this must be the last column and the user has to tell it to the system.
The weight column is used in the generalisation process for the computation
of probability values associated with modal multi-valued variables. When
no weight column is present, all individuals are assumed to have the same
weight, equal to 1.
This query type is used to retrieve the following information from the database:
- single-valued attributes describing groups: these attributes lead to single-valued variables describing generated assertions.
The structure of this query type is the following:
GROUP_ID, GROUP_V1, GROUP_V2, ...., GROUPE_Vp
This query must contain at least two columns: the group ID and one group attribute. The number of group attributes is variable.
Example:
|
Region |
Number of cars |
%Unemployment |
| East anglia | 273000 | 8 |
| East midlands non-metropolitan | 491000 | 9 |
| Greater london north east | 197000 | 10 |
| Greater london north west | 154000 | 7 |
| Greater london south east | 149000 | 8 |
| Greater london south west | 160000 | 6 |
| North non-metropolitan | 260000 | 10 |
| North west metropolitan | 458000 | 11 |
| ... | ... | ... |
This query returns two single-valued attributes describing regions. There is one row per region, thus one value per region for every attribute. Here the two attributes are numerical but they could have been qualitative. Both qualitative and numerical attributes can be returned in the same query.
This query type is used to retrieve the following information from the database:
- exactly one native multi-valued (qualitative) attribute describing groups, which lead to a multi-valued variable describing generated assertions
The structure of this query type is the following:
GROUP_ID, MULT_VAL_ATT, CARDINALITY
Example:
|
Region |
Accomodation type |
Number of cases |
| East anglia | Caravan houseboat | 1 |
| East anglia | Part of house converted flat o | 16 |
| East anglia | Purpose-built flat or maisonet | 22 |
| East anglia | Whole house bungalow detache | 85 |
| East anglia | Whole house bungalow semi-de | 83 |
| East anglia | Whole house bungalow terrace | 66 |
| East midlands non-metropolitan | Caravan houseboat | 3 |
| East midlands non-metropolitan | Dwelling with business premise | 1 |
| East midlands non-metropolitan | Part of house converted flat o | 17 |
| East midlands non-metropolitan | Purpose-built flat or maisonet | 42 |
| East midlands non-metropolitan | Whole house bungalow detache | 134 |
| East midlands non-metropolitan | Whole house bungalow semi-de | 182 |
| East midlands non-metropolitan | Whole house bungalow terrace | 112 |
| Greater london north east | Dwelling with business premise | 1 |
| Greater london north east | Part of house converted flat o | 17 |
| ... | ... | ... |
The first column of the query describes the group ID, the second one contains modalities of the multi-valued attribute, and the third one is the cardinality of the modality within the group.
This query type is useful to acquire multi-valued variables describing assertions when they are not computed from an underlying population. It is not adviced to use this feature if the attribute describes an underlying population: in this case, it is preferable to use a SQL query of type 1 retrieving the underlying population, run the reduction process, and finally join generated assertions to the current ones.
8.4 SQL query of type 4 (not supported
by version V2 of DB2SO)![]()
This query type is used to retrieve the following information from the database:
- exactly one native interval (numerical) attribute describing groups, which lead to an interval variable describing generated assertions
The structure of this query type is the following:
GROUP_ID, MIN_VALUE, MAX_VALUE
where MIN_VALUE and MAX_VALUE are the bounds of the interval of the new variable describing groups identified by GROUP_ID.
Example:
|
REGION |
LOWER |
UPPER |
| East anglia | 1531,59 | 3058,34 |
| East midlands non-metropolitan | -84,31 | 259,95 |
| Greater london north east | 3525,06 | 6878,74 |
| Greater london north west | 27132,72 | 32852,73 |
| Greater london south east | 33847,21 | 41307,62 |
| Greater london south west | -52,99 | 326,95 |
| North non-metropolitan | -65,69 | 405,41 |
| North west metropolitan | 1833,84 | 3336,49 |
| ... | ... | ... |
In this example, this query returns, for each region, a confidence interval for the average income of a particular class of people. This will create an interval numerical variable describing assertions. As for queries of type 3, it is not adviced to use this feature when this data can be calculated from an underlying population.
Remark:
This SQL type can be remplaced by a SQL query of type 1 on the folowing table;
|
ID |
REGION |
V1 |
| ID_1 | East anglia | 1531,59 |
| ID_2 | East anglia | 3058,34 |
| ID_3 | East midlands non-metropolitan | -84,31 |
| ID_4 | East midlands non-metropolitan | 259,95 |
| ID_5 | Greater london north east | 3525,06 |
| ID_6 | Greater london north east | 6878,74 |
| ID_7 | Greater london north west | 27132,72 |
| ID_8 | Greater london north west | 32852,73 |
| ID... | ... | ... |
The struture of this query type is the following:
IND_ID, GROUP_ID, IND_V1
This query type is used to retrieve the following information from the database:
- a taxonomy on values of a particular variable
This information can be retrieved from the database by two means, corresponding to two different ways of representing a taxonomy in a table:
SQL query of type 5.a ATT_VALUE, LEV1_VALUE, LEV2_VALUE, ...., LEVp_VALUE
This is the representation suited when all leaves of the taxonomy are at the same depth in the tree. A typical example is a query returning TOWN, DEPARTMENT, REGION, COUNTRY to define a taxonomy on a TOWN variable.
SQL query of type 5.b CHILD_ATT_VALUE, PARENT_ATT_VALUE
This is a child/parent representation of the taxonomy which links together the different values of an attribute domain. An example of this representation is given below.
Example of the child/parent representation
|
Socio - Economic Group |
Parent Group |
| Agricultural workers | Employed |
| Armed forces | Employed |
| Employers | Self-employed |
| Employers: large establishment | Employers |
| Employers: small establishment | Employers |
| Farmers | Self-employed |
| Farmers:emp&mgrs | Farmers |
| Farmers:own account | Farmers |
| Intermediate non_manual ancill | Non-manual |
| Intermediate non_manual foreme | Non-manual |
| Junior non_manual | Non-manual |
| Managers | Employed |
| Managers : large establishment | Managers |
| Managers : small establishment | Managers |
| Manual | Employed |
| Manual - foremen\supervisors | Managers |
| Non-manual | Employed |
| Own account non_professional | Self-employed |
| Personal service | Employed |
| Professional - employee | Employed |
| Professional - self employed | Self-employed |
| Semi-skilled manual | Manual |
| Skilled manual | Manual |
| Unskilled manual | Manual |
which leads to the following taxonomy:
Employed
Agricultural workers
Armed forces
Personal service
Professional - employee
Manual
Semi-skilled manual
Skilled manual
Unskilled manual
Non-manual
Intermediate non_manual ancill
Intermediate non_manual foreme
Junior non_manual
Managers
Managers : large establishment
Managers : small establishment
Manual - foremen\supervisors
Self-employed
Professional - self employed
Own account non_professional
Employers
Employers: large establishment
Employers: small establishment
Farmers
Farmers:own account
Farmers:emp&mgrs
Not any errors messages are coded. They are clearly expressed and tell
the user how to fix the problem.