
Complementary materials for the paper


"Transparent but accurate evolutionary regression combining new linguistic fuzzy grammar and a novel interpretable linear extension".

accepted in the International Journal of Fuzzy Systems


Authors:

Carmen Biedma-Rdguez, María José Gacto, Augusto Anguita-Ruiz, Jesús Alcalá-Fdez and Rafael Alcalá


Cite as:

Biedma-Rdguez, C., Gacto, M.J., Anguita-Ruiz, A., Alcalá-Fdez, J. and Alcalá, R. Transparent but Accurate Evolutionary Regression Combining New Linguistic Fuzzy Grammar and a Novel Interpretable Linear Extension. Int. J. Fuzzy Syst. (2022). DOI: 10.1007/s40815-022-01324-w
https://doi.org/10.1007/s40815-022-01324-w

Summary:

  1. Preliminaries: Interpretability measures
  2. Experimental Study
  3. Analysis of some linguistic model examples
  4. Comparison with some state-of-the-art general purpose "accuracy oriented" methods

In the following sections, the complementary materials (datasets) for the above paper can be downloaded.

Even though there are a few new proposals regarding the design of interpretable models for classification problems, they are not directly applicable to regression problems, where modeling complex continuous surfaces with only a few rules that aim to separate the different values of the output variable is quite difficult. This is why our proposal instead aims to find and separate "tendencies", since they better capture the continuous nature of regression problems. To the best of our knowledge, there are no recent proposals that allow us to obtain interpretable and really simple FRBS models (with only a few rules) for regression.

 

A. Preliminaries: Interpretability measures considered to ensure model comprehensibility

This part complements the analogous section in the paper and provides a short description of the well-known metrics considered in order to ensure clear linguistic semantics. In the following two subsections, we briefly introduce the Gm3m and Rmi indices, devoted to assessing the preservation of the initial linguistic concepts (proximity to the fully interpretable, equidistributed strong fuzzy partition) and the consistency of the rules (absence of contradiction), respectively.

A.1. Gm3m index for Semantic Interpretability at the DB level

Formally, the Gm3m (Gacto et al 2010) index is defined as the geometric mean of three complementary metrics used to quantify the proximity between a given membership function and the function initially defined for the associated linguistic term (Equation I). Each of the three metrics takes into account, respectively, aspects such as the relative displacement of the membership function (δ), differences in the relative symmetry of their slopes (γ) and differences in the area (ρ) between both membership functions.

These metrics were defined to measure interpretability when the original definitions of the membership functions need to be modified, which is essential for learning accurate/trustworthy models. The geometric mean is used so that, if any one of the metrics takes very small values (low interpretability), the value of Gm3m is also small. Gm3m takes values in the range [0,1], with 0 being the lowest level of interpretability and 1 the highest.
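
Although Equation I is only shown as an image in the original material, a minimal reconstruction based on the description above (the geometric mean of the three metrics δ, γ and ρ) would be:

    Gm3m = \sqrt[3]{\delta \cdot \gamma \cdot \rho}, \qquad \delta, \gamma, \rho \in [0, 1]   (Equation I)

so that Gm3m equals 1 only when the tuned membership function fully preserves the displacement, symmetry and area of the original one; please check (Gacto et al 2010) for the exact definition of each metric.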

 

The complete description, formulation and some examples of how to compute these metrics can be found in (Gacto et al 2010) (for the commonly used triangular membership functions) and (Galende et al 2014) (extension for other types of membership functions, based on considering core and slopes separately for each metric).

A.2. Rmi index for Semantic Interpretability at the RB level

Rmi (Rule Meaning Index) (Galende et al 2014) is a semantic interpretability index at the RB level that can be used together with Gm3m to optimize and compare the semantic interpretability of different linguistic FRBSs. This index aims to assess whether the linguistic model outputs are the same as those described by its rules in their respective activation zones (more particularly, in their cores or, to a certain degree, their α-cuts).

 

For a given linguistic FRBS, Rmi is computed as the worst case among the individual values Rmi(Ri) of each rule Ri in the whole RB. Individually, i.e. for each rule, the goal of Rmi(Ri) is to evaluate the degree of reliability of rule Ri with respect to the global output that the whole model would infer in the activation zone of this rule (in its core or, to a certain degree, its α-cut). Therefore, this index also takes into account the particular inference system used by the FRBS through the inferred output (which is also important, since the inference system could also affect the semantic interpretability of the RB).

 

The way to calculate each Rmi(Ri) is as follows:

  1. An FRBS input is defined from the cores (or the cores of the α-cuts) of the membership functions in the n antecedents of Ri, forming an n-dimensional fuzzy set. In this contribution we directly consider the cores (α=1.0).
  2. The output is estimated by inferring with the whole FRBS on the input generated in the previous step.
  3. Rmi(Ri) is computed as the matching degree between the estimated output and the Ri consequent membership function, thus measuring how different the system output and the local Ri output are.

Rmi is defined in the range [0,1], where 0 indicates the lowest level of reliability and 1 the highest. The complete description, formulation and some examples of how to compute Rmi can be found in (Galende et al 2014).
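
Assuming that the matching in step 3 is measured as the membership degree of the inferred (defuzzified) output in the consequent fuzzy set of Ri, a schematic formulation of the index would be:

    Rmi(R_i) = \mu_{B_i}(\hat{y}_i), \qquad Rmi = \min_{i = 1, \ldots, M} Rmi(R_i)

where \hat{y}_i is the output inferred by the whole FRBS for the input built from the cores of the antecedents of R_i, B_i is the consequent fuzzy set of R_i, and M is the number of rules in the RB. The exact matching function may differ; please refer to (Galende et al 2014) for the complete formulation.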

 

B. Experimental Study

This section includes the experimental study of the proposed method. The experimentation is undertaken with 23 real-world datasets, with a number of variables within the interval [2, 60] and a number of examples within the interval [43, 4177]. In all the experiments, a 5-fold cross-validation scheme (5fcv) has been adopted, i.e., each dataset has been split randomly into 5 folds, each one containing 20% of its patterns; four folds have been used for training and one for testing. The properties of these datasets are presented in Table I: name of the dataset (NAME), short name or acronym of the dataset (ACRO), number of variables (VAR), and number of examples (CASES). You may download the 5-fold cross-validation partitions for all the datasets in the KEEL format by clicking here.
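
Only as an illustration, the following sketch shows how an equivalent 5-fold partition could be generated with scikit-learn (the data used here are a random placeholder; the partitions actually used in the paper are the KEEL-format files linked above):

    import numpy as np
    from sklearn.model_selection import KFold

    # Random placeholder standing in for a real dataset (e.g., 506 cases, 13 variables).
    rng = np.random.RandomState(0)
    X, y = rng.rand(506, 13), rng.rand(506)

    # 5-fold cross-validation: each fold keeps roughly 20% of the patterns for testing.
    kf = KFold(n_splits=5, shuffle=True, random_state=0)
    for fold, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
        X_tra, y_tra = X[train_idx], y[train_idx]  # four folds for training
        X_tst, y_tst = X[test_idx], y[test_idx]    # one fold for testing
        print(f"Fold {fold}: {len(train_idx)} training / {len(test_idx)} test cases")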

These datasets have been downloaded from the following web pages:

 

Table I: Properties of the datasets

NAME                      ACRO   VAR   CASES
Abalone                   ABA      8    4177
Anacalt                   ANA      7    4052
Baseball                  BAS     16     337
Boston housing            BOS     13     506
Diabetes                  DIA      2      43
Machine CPU               CPU      6     209
Electrical Maintenance    ELE      4    1056
Body fat                  FAT     14     252
Forest Fires              FOR     12     517
Friedman                  FRI      5    1200
Mortgage                  MOR     15    1049
Auto Mpg 6                MPG6     5     392
Auto Mpg 8                MPG8     7     392
AutoPrice                 PRI     15     159
Quake                     QUA      3    2178
Stocks domain             STP      9     950
Strike                    STR      6     625
Treasury                  TRE     15    1049
Triazines                 TRI     60     186
Weather Ankara            WAN      9    1609
Weather Izmir             WIZ      9    1461
Wisconsin Breast Cancer   WBC     32     194
Yacht Hydrodynamics       YH       6     308

 

 

C. Analysis of some linguistic model examples

In this section, we include some representative examples of the linguistic models obtained in two of the benchmark problems used for comparison in the previous section: WAN (Weather in Ankara) and WBC (Wisconsin Breast Cancer). Figures 2 and 3 depict both models in order to demonstrate not only the accuracy of the method but also the simplicity and easy readability of the rules obtained. Variables in these figures are ordered following the order of the splits in the tree generated when learning the rules. Thus, each split can be seen as a way to recognize the different divisions of the data, from the most general to the most specific.

 

We have used colors to ease the recognition of the different cases represented by the rules (same color per variable and split). Gray texts are only included to provide additional information; this information is not actually a part of the proposed rule structure (and therefore it is not needed for inference or for understanding). The same applies to the percentage of covered instances and the Gm3m and Rmi values, which have a purely informative purpose and describe the semantic quality of each partition and rule, respectively. As previously explained, Rmi goes from 1.0 (what a single rule affirms in its main covering region is equal to what the model produces) to 0.0 (what a single rule affirms is completely different from what the model produces). In general, we can see that almost all the rules obtain an Rmi equal to 1.0, which indicates (together with the high Gm3m values) that these rules do not interfere significantly with each other, and so the locality of each rule is preserved.

Finally, please take into account that our initial linguistic partitions are strong (which are accepted in the specialized literature as being highly interpretable), and that Gm3m values near 0.8 indicate that their meanings are preserved at a high level (see an example in Figure 1 with the Gm3m values reported in Figure 2). Again, instead of only including the linguistic terms, we are providing the definition points of the membership functions as additional information. This is because the expert in our real case study (the childhood obesity problem) asked us about these numbers after analyzing the rules in order to check the approximate division values, so we think that they might be of interest to an expert in any potential problem. Please skip these numbers if you are not really an expert on the given problem, and remember that the corresponding linguistic terms come from a strong linguistic partition.

 


Figure 1: Example of a linguistic partition with Gm3m equal to 0.81 (blue), with respect to the corresponding strong fuzzy partition (gray)

 


Figure 2: KB obtained with the proposed method on the WAN dataset. The MSE-Tst obtained is 1.565. For a higher-quality version, see the PDF here

 

Figure 2 shows the DB and RB obtained for the WAN dataset (estimation of the average temperature from measured climate factors), whose accuracy (MSE-Tst) is 1.565. The first division (by MinTemp) distinguishes three different situations depending on the minimum temperature values (the coldest, medium and hottest situations). Taking into account the easiest one (R5, hottest), it determines that when minimum temperatures are very high, the mean temperature should be high (centered on 68.6°F) and move up (or down) depending on the maximum temperature, by 0.71 per degree over (or under) 82.2°F.

In the cases where the minimum temperature is medium (R3 and R4), we find two different situations depending on the dew point. When the dew point is high or above, the mean temperature should be medium (centered on 57.7°F). When the dew point is up to medium, the mean temperature should be a little lower, i.e. between low and medium (centered on 34.2°F). In both cases, the variability is once again explained by the maximum temperature variations, where, depending on the dew point, these maximum temperatures move in different ranges (MaxTemp is centered at 55.5 or 39.9, respectively, to which higher or lower values are added or subtracted). At this point, we can also see that the variability depending on the maximum temperatures is higher in the R5 case than in the R3 and R4 cases (when temperatures are high, in general, changes to the maximum temperature have a greater effect on the mean value estimation). This kind of relative information among consequent factors cannot be found (or is not easy to find) in the models obtained using classic linguistic rules, which makes it a new, additional and useful piece of information not provided by previous linguistic fuzzy proposals.

Finally, the cases where minimum temperatures are low (R1 and R2, the coldest cases) could be analyzed in the same manner, taking into account that both rules depend on visibility (whether it is a clear day or not) and vary on different factors (on maximum temperatures for clear days, or on the dew point for foggy days).
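
For instance, the behaviour described for R5 corresponds to a local linear consequent of approximately the following form (written here only as a reading aid; the exact notation in the paper may differ):

    MeanTemp \approx 68.6 + 0.71 \cdot (MaxTemp - 82.2)   [°F]

i.e., the estimated mean temperature increases by 0.71 degrees for each degree that the maximum temperature exceeds 82.2°F, and decreases likewise below it.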

 


Figure 3: KB obtained with the proposed method on the WBC dataset. The MSE-Tst obtained is 640.9. For a higher-quality version, see the PDF here

 

The linguistic model obtained for the WBC dataset (predicting the number of months until breast cancer is likely to recur, based on the characteristics of individual cells taken from images) is shown in Figure 3. The obtained model is quite interesting, since with only 3 rules it obtains very precise results with respect to those obtained by the methods in our comparisons. In this case, we leave the interpretation up to the reader, who should take into account that: the texture of the cell nucleus is measured as the variance of the gray-scale intensity (i.e., the higher the value, the more irregular the cells, thus meaning that they are more malignant); and Fractal Dimension is the approximation to the coastline (i.e., the higher the value, the closer the approximation, so that contours are more regular and therefore more benign). R3 represents the cases with the highest severity, R2 the intermediate cases, and R1 the cases with the least severity.

 

D. Comparison with some state-of-the-art general purpose "accuracy oriented" methods

While accuracy is not the main focus of the article, the proposed algorithm has also been compared to some highly accurate state-of-the-art algorithms**, in order to help readers appreciate the achieved accuracy as compared to other methods in the literature (simply for benchmarking purposes). The representative algorithms that we consider in this contribution are shown in Table II, which briefly describes these algorithms and provides their corresponding literature references. In relation to the algorithmic parameters, we consider the standard parameters recommended by the authors (those included in each tool as the default recommended parameters). The only exception is the total number of trees in the Random Forest-based algorithm, which is not 500 by default: setting this value to 500 systematically improved the results, and since no significant improvements were observed beyond this value, it was set to 500 for this comparison. Finally, since our MSE is divided by 2, we multiplied our results by 2 to perform this comparison.

 

**They are available via recognized software tools such as:

  1. JSAT: Java Statistical Analysis Tool, a Library for Machine Learning available via the link
  2. R: A Language and Environment for Statistical Computing available via the link
  3. Scikit-learn: Machine Learning in Python available via this link
  4. Matlab M5PrimeLab: M5' regression tree, model tree, and tree ensemble toolbox for Matlab/Octave available via this link

 

Table II: Algorithms considered as representatives of more accurate, non-transparent approaches

Algorithm Type                  Cite / Reference                                                                                              Description
Model Trees (MT)                M5PrimeLab: M5' regression tree, model tree, and tree ensemble toolbox for Matlab/Octave (M5PrimeLab code)    M5' (M5 prime) regression method implementation
Neural Networks (NNET)          Adam: A Method for Stochastic Optimization (Kingma and Ba 2017)                                               MLP with squared loss and stochastic gradient (100 hidden neurons)
Random Forests (RF)             Gene selection with guided regularized random forest (Deng et al 2013)                                        Regularized random forest algorithm with 500 trees
Support Vector Machines (SVM)   Large-Scale Linear Support Vector Regression (Ho et al 2012)                                                  Dual coordinate descent for large-scale linear SVM

 

These algorithms and the 23 regression datasets are publicly available, so for the sake of simplicity we directly provide the statistical test results. Table III shows the rankings, obtained with Friedman's test on the test error, of the different methods considered in this study. In this case, the proposed algorithm is ranked second behind RF, which seems to have performed quite well.

 

Table III: Rankings of the algorithms obtained with Friedman's test on the test error

Algorithm Ranking
RF 1.609
Proposed method   2.174
MT 3.174
SVM 3.522
NNET 4.522

 

Table IV shows the adjusted p-values (apv) obtained using Holm's test, comparing all the methods versus the proposed method on the test error. The results show that the proposed method outperforms the methods ranked below it with low apvs (0.128 in the closest case). On the other hand, the apv obtained in the comparison with RF is considerably higher, indicating that the results of these two approaches are not so far apart.

 

Table IV: Adjusted p-values using Holm's test. Proposed Method versus all on Tst.

Algorithm apv on Tst
Proposed vs NNET   4.289E-6
Proposed vs SVM 0.023
Proposed vs MT 0.128
Proposed vs RF 0.451
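
For readers who wish to reproduce this kind of analysis, the following sketch shows one standard way of computing Friedman average ranks (as in Table III) and Holm-adjusted p-values of a control method versus the rest (as in Table IV). The error matrix used here is a random placeholder, not the real results reported above:

    import numpy as np
    from scipy.stats import rankdata, norm

    # Random placeholder: rows = 23 datasets, columns = 5 methods (the real matrix
    # would contain the MSE-Tst value of each method on each dataset).
    errors = np.random.rand(23, 5)
    methods = ["RF", "Proposed", "MT", "SVM", "NNET"]
    n, k = errors.shape

    # Friedman average ranks (rank 1 = lowest test error on a dataset).
    avg_ranks = np.apply_along_axis(rankdata, 1, errors).mean(axis=0)

    # Holm's post-hoc procedure: control method ("Proposed") versus the rest.
    control = methods.index("Proposed")
    se = np.sqrt(k * (k + 1) / (6.0 * n))
    p = 2.0 * (1.0 - norm.cdf(np.abs(avg_ranks - avg_ranks[control]) / se))
    others = sorted((i for i in range(k) if i != control), key=lambda i: p[i])
    apv, worst = {}, 0.0
    for step, i in enumerate(others):
        worst = max(worst, (len(others) - step) * p[i])
        apv[methods[i]] = min(1.0, worst)

    print(dict(zip(methods, avg_ranks.round(3))))
    print(apv)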

 

As previously mentioned, while accuracy is not the main objective of the article, in our opinion, and taking into account that the proposed approach obtains fewer than 7 rules in all the datasets (fewer than 5 on average), these results also show a really competitive performance from an accuracy point of view. The proposed approach competes well even with ensemble models using as many as 500 trees.

 

