We will first assume that the dataset of Figure 4 already exists and the knowledge engineer wishes to analyze it and transfer the resulting knowledge structures to a shell. He or she may create a dataset by using the interactive elicitation facilities to enter the cases, attributes, values and control information such as the decision attribute.
Running the Induct tool with this dataset generates the rules shown in Figure 8, which are logically the same as those in Figure 7. The two numbers following each rule are, respectively, the percentage of cases for which the rule is correct and the percentage probability that the rule might arise by chance. For example, the first rule in Figure 8 is 100% correct and has a 0.904% probability of arising by chance.
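The two figures attached to each rule can be illustrated with a short sketch. It assumes that the chance probability is a binomial tail: the probability that randomly selecting the same number of cases would yield at least as many correct decisions. This section does not state the formula Induct uses, so the function names and the binomial model are assumptions. The counts in the example (a rule selecting 3 cases, all correct, for a decision that holds in 5 of 24 cases) are hypothetical. They give a value close to the 0.904% quoted above, but the counts behind Figure 8 are not given here.

```python
from math import comb


def chance_probability(selected, correct, prior):
    """Binomial tail: probability of at least `correct` hits when
    `selected` cases are drawn at random and each is a hit with
    probability `prior` (assumed model, not confirmed by the text)."""
    return sum(
        comb(selected, i) * prior**i * (1 - prior) ** (selected - i)
        for i in range(correct, selected + 1)
    )


def rule_statistics(selected, correct, prior):
    """Return (percent correct, percent probability of arising by chance)."""
    percent_correct = 100.0 * correct / selected
    percent_chance = 100.0 * chance_probability(selected, correct, prior)
    return percent_correct, percent_chance


# Hypothetical counts, chosen only for illustration.
percent_correct, percent_chance = rule_statistics(selected=3, correct=3, prior=5 / 24)
print(f"{percent_correct:.0f}% correct, {percent_chance:.3f}% chance")
```

Under this assumed model, a rule that covers few cases or predicts a common decision receives a high chance probability even when it is 100% correct.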
Figure 8 Induct analysis of dataset in KSS0
The dataset of Figure 4 is without conflicts, in that a given set of values of the attributes always corresponds to the same decision. Hence all the rules are 100% correct. As shown later, if conflicting data is entered, Induct may generate rules that are less than 100% correct. The required maximum probability that a rule might be due to chance is a parameter which can be set in the Induct dialog. It corresponds exactly to the normal "level of significance" of conventional statistics, for example, that a result "is significant at the 5% level."
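The notion of a conflict-free dataset can be checked mechanically. The following is a minimal sketch, assuming each case is a dictionary of attribute values with one key naming the decision attribute. The function name, attribute names and values are hypothetical and are not taken from KSS0.

```python
from collections import defaultdict


def find_conflicts(cases, decision):
    """Group cases by their non-decision attribute values and return
    the groups that map to more than one decision."""
    decisions_seen = defaultdict(set)
    for case in cases:
        key = tuple(sorted((k, v) for k, v in case.items() if k != decision))
        decisions_seen[key].add(case[decision])
    return {key: ds for key, ds in decisions_seen.items() if len(ds) > 1}


# Hypothetical cases: the first two agree, the third contradicts the first.
cases = [
    {"tear production": "normal", "astigmatic": "no", "lens": "soft"},
    {"tear production": "normal", "astigmatic": "yes", "lens": "hard"},
    {"tear production": "normal", "astigmatic": "no", "lens": "none"},
]

print(find_conflicts(cases[:2], decision="lens"))  # no conflicts
print(find_conflicts(cases, decision="lens"))      # one conflicting group
```

An empty result means that every combination of attribute values corresponds to a single decision, which is the condition under which all rules can be 100% correct.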
The adjustment of this parameter is particularly interesting because the datasets elicited from experts may be very much smaller than those required for statistical testing yet still be effective in terms of knowledge transfer. The essence of expertise transfer techniques is that the expert's expertise makes available high-quality data. That is, experts can often provide a minimal set of relevant attributes and a complete set of critical cases with correct decisions. In these circumstances the significance level can be set to accept rules which from a statistical point of view might well arise by chance. The statistical logic is that the data is correct and representative of many independent cases, and hence it could be entered in exactly the same form a number of times. This repetition of correct data would make the statistical significance of the results as great as required. Hence it is usual to start with the significance threshold in Induct set to 100%, so that no rule is rejected as a possible chance result.
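The repetition argument can be made concrete with a little arithmetic. The sketch below assumes the same binomial model of chance as elsewhere in these examples and a rule that is correct for every case it selects, so that its chance probability is the prior raised to the number of selected cases. The function name and the numbers are hypothetical.

```python
def chance_after_repetition(prior, selected, copies):
    """Chance probability of a fully correct rule when every case in
    the dataset is entered `copies` times (assumed binomial model)."""
    return prior ** (selected * copies)


# A rule covering a single case, for a decision with prior 0.5.
for copies in (1, 2, 5, 10):
    p = chance_after_repetition(prior=0.5, selected=1, copies=copies)
    print(f"{copies:2d} copies: {100 * p:.4f}% chance")
```

Entering the same correct case repeatedly leaves the rule unchanged but drives its chance probability toward zero, which is why the threshold carries little weight when the data is known to be correct.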
In practice, experts may enter incomplete sets of attributes or incorrect cases and some of the rules generated by Induct will be spurious. The statistical significance measure is then usually a very good indicator of the most suspect rules. The expert can inspect the rules and see if they make sense. The knowledge engineer can then set the significance threshold in Induct to exclude the spurious rules.
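Applying a significance threshold amounts to partitioning the rules by their chance probability. The following is a minimal sketch. The `Rule` record and `split_by_significance` function are hypothetical and do not reflect Induct's actual interface. The rule texts are placeholders, and the two chance values are those quoted in this section.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    text: str
    percent_correct: float
    percent_chance: float


def split_by_significance(rules, max_percent_chance):
    """Separate rules at or below the chance threshold from those
    above it, which are the most suspect."""
    kept = [r for r in rules if r.percent_chance <= max_percent_chance]
    suspect = [r for r in rules if r.percent_chance > max_percent_chance]
    return kept, suspect


rules = [
    Rule("placeholder rule A", 100.0, 0.904),
    Rule("placeholder rule B", 100.0, 39.1),
]

# A threshold of 100% accepts every rule; 5% excludes rule B.
kept_all, suspect_all = split_by_significance(rules, max_percent_chance=100.0)
kept_5, suspect_5 = split_by_significance(rules, max_percent_chance=5.0)
print([r.text for r in kept_5], [r.text for r in suspect_5])
```

In the workflow described above, the suspect list is what the expert would inspect first before the threshold is tightened.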
The contact lens data is complete and correct, so the rules in Figure 8 do not need to be tested for significance. Even so, the significance level indicates which rules are most difficult to conclude from the data. For example, the fifth through seventh rules each have a 39.1% probability of arising by chance if the dataset were a random sample. This is because they correspond to three critical exceptions to a much simpler rule set: if the tear production is normal, then fit hard lenses if the client is astigmatic and soft lenses if not. The three exceptions to this simple rule set are exhibited only once each in the lens dataset, and hence the rules that cover them are not well validated. This is what one might expect, in that exceptions are by definition infrequent events.
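The simpler rule set, and the idea of an exception to it, can be sketched as follows. The function names, attribute names and the example cases are hypothetical and are not the actual dataset of Figure 4. The simplified rule set as stated says nothing about cases where tear production is not normal, so the sketch returns `None` there.

```python
def simple_lens_rule(tear_production, astigmatic):
    """Simplified rule set: with normal tear production, fit hard
    lenses if astigmatic and soft lenses if not."""
    if tear_production == "normal":
        return "hard" if astigmatic else "soft"
    return None  # not covered by the simplified rule set


def find_exceptions(cases):
    """Return the cases covered by the simple rule set whose recorded
    decision differs from what the rule set predicts."""
    exceptions = []
    for case in cases:
        predicted = simple_lens_rule(case["tear production"], case["astigmatic"])
        if predicted is not None and predicted != case["lens"]:
            exceptions.append(case)
    return exceptions


# Hypothetical cases; only the last one contradicts the simple rule set.
cases = [
    {"tear production": "normal", "astigmatic": False, "lens": "soft"},
    {"tear production": "normal", "astigmatic": True, "lens": "hard"},
    {"tear production": "normal", "astigmatic": True, "lens": "none"},
]

print(find_exceptions(cases))
```

Each exception found this way needs its own rule, and a rule supported by a single case is exactly the kind that receives a high chance probability.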
gaines@cpsc.ucalgary.ca 19-Sep-95