Several packages provide SA functions specifically optimized for variable selection

Several packages provide SA functions specifically optimized for variable selection. The anneal function in package subselect, e.g., can be used for variable selection in situations like discriminant analysis, PCA, and linear regression, according to the criterion employed. For LDA, this function takes the between-groups covariance matrix, the minimal and maximal number of variables to be selected, the within-groups covariance matrix and its expected rank, and a criterion to be optimized (see below) as arguments. For the wine example above, a solution to find the optimal three-variable subset would look like this:

    > winesHmat <- ldaHmat(twowines.df[, -1], twowines.df[, 1])
    > wines.anneal <-
    +   anneal(winesHmat$mat, kmin = 3, kmax = 3,
    +          H = winesHmat$H, criterion = "ccr12", r = 1)
    > wines.anneal$bestsets
           Var.1 Var.2 Var.3
    Card.3     2     7    10
    > wines.anneal$bestvalues
     Card.3
    0.83281

Repeated application (using, e.g., nsol = 10) in this case leads to the same solution every time. Rather than direct estimates of prediction error, the anneal function uses functions of the within- and between-groups covariance matrices (Silva 2001). In this case, using the ccr12 criterion, the first root of $BW^{-1}$ is optimized, analogous to Fisher's formulation of LDA in Sect. 7.1.3. As another example, Wilks' $\Lambda$ is given by

$$\Lambda = \det(W) / \det(T) \qquad (10.6)$$

and is (in a slightly modified form) available in the tau2 criterion. For the current case, where the dimensionality of the within-covariance matrices is estimated to be one, all criteria lead to the same result.
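To make the covariance-based criteria more concrete, the following is a minimal base-R sketch that computes the first root of $BW^{-1}$ and the ratio in Eq. (10.6) directly. It uses the built-in iris data rather than the wine data, and the object names (Tmat, Wmat, Bmat, ccr1.sq, wilks) are illustrative only. The conversion of the first root to a squared canonical correlation is assumed to correspond to what ccr12 reports; the subselect documentation is the authority on the exact definition.

```r
# Illustration on the built-in iris data (not the wine data from the text).
X   <- as.matrix(iris[, 1:4])
grp <- iris$Species

# Total sums-of-squares-and-cross-products (SSCP) matrix
Tmat <- (nrow(X) - 1) * var(X)

# Within-groups SSCP: sum of the per-class SSCP matrices
Wmat <- Reduce(`+`, lapply(split(seq_len(nrow(X)), grp), function(idx)
  (length(idx) - 1) * var(X[idx, , drop = FALSE])))

# Between-groups SSCP follows from T = B + W
Bmat <- Tmat - Wmat

# Roots (eigenvalues) of W^{-1} B; these equal the roots of B W^{-1}
roots <- Re(eigen(solve(Wmat, Bmat), only.values = TRUE)$values)

first.root <- max(roots)
ccr1.sq    <- first.root / (1 + first.root)  # squared first canonical correlation
wilks      <- det(Wmat) / det(Tmat)          # ratio from Eq. (10.6)

c(first.root = first.root, ccr1.sq = ccr1.sq, wilks = wilks)
```

Because det(W)/det(T) equals the product of 1/(1 + root) over all roots, the two criteria are functions of the same eigenvalues, which is why they coincide in ranking subsets when only one root is non-zero.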

The new result differs from the subset found by our own implementation in only one instance: variable 11, color hue, is swapped for the malic acid concentration. The reason, of course, is that the two functions optimize different criteria. Let us see how the two solutions fare when evaluated with the criterion of the other algorithm. The value for the ccr12 criterion of the solution using variables 7, 10 and 11, found with our own simplistic SA implementation, can be assessed easily:

    > ccr12.coef((nrow(twowines.df) - 1) * var(twowines.df[, -1]),
    +            winesHmat$H, r = 1, c(7, 10, 11))
    [1] 0.82293

which, as expected, is slightly lower than that of the set consisting of variables 2, 7 and 10. Conversely, the prediction quality of the newer set is slightly worse (two misclassifications):

    > selection <- rep(0, ncol(twowines))
    > selection[c(2, 7, 10)] <- 1
    > lda.loofun(selection, twowines.df[, -1], twowines.df[, 1])
    [1] 1.6807

Clearly, there may be many sets with the same or similar values for the quality criterion of interest, and to some extent it is a matter of chance which one is returned by the search algorithm. Moreover, the number of possible quality values can be limited, especially with criteria based on the number of misclassifications. This can make it more difficult to discriminate between two candidate subsets.
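The limited resolution of a misclassification-based criterion can be shown with a few lines of arithmetic. The sketch below assumes that the two-class wine data contain 119 objects; that number is not stated in this section and is inferred from the reported value of 1.6807 for two misclassifications.

```r
n <- 119                         # assumed number of objects (inferred, see above)
loo.values <- 100 * (0:n) / n    # every percentage a leave-one-out error can take

length(loo.values)               # number of distinct criterion values
round(loo.values[3], 4)          # two misclassifications: 1.6807
diff(loo.values)[1]              # smallest possible difference between two subsets
```

Any two subsets whose error counts are equal receive exactly the same criterion value, so the search cannot prefer one over the other.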

The anneal function for subset selection is also applicable to problems other than classification: e.g., for variable selection in PCA it uses a measure of similarity between the original data matrix and the projections on the k-variable subspace. Again, several different criteria are available. The speed and applicability in several domains are definite advantages of this particular implementation. However, there are some disadvantages, too. Firstly, because of the formulation using covariance matrices it is hard to apply anneal to problems with large numbers of variables. Finding the most important discriminating variables in the prostate data set would stretch your computer to the limit; in fact, even the gasoline example requires the argument force = TRUE, since the default is to refuse cooperation (and give a serious-looking warning) as soon as the number of variables exceeds 400.

Secondly, the function does not allow one to submit an evaluation function, and one has to make do with the predefined set. Crossvalidation-based approaches such as those used in the examples above cannot be implemented, which increases the danger of overfitting. Finally, it can be important to monitor the progress of the optimization, or at least to keep track of the speed with which improvements are found. Especially when fine-tuning the SA parameters (temperature, cooling rate) one would like to be able to assess acceptance rates. Currently, no such functionality is provided in the subselect package.
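The two wishes expressed here, a user-supplied evaluation function and a record of acceptance rates, can be illustrated with a small base-R sketch. This is not the implementation used earlier in the text, nor part of subselect: the function sa.subset, its arguments and the toy criterion are all hypothetical, and the criterion is chosen only so that the example runs without any data.

```r
# Toy SA for picking k out of p variables. eval.fun is any user-supplied
# function of the selected indices that returns a value to be minimised.
sa.subset <- function(eval.fun, p, k, niter = 500, temp = 1, cool = 0.99) {
  current <- sample.int(p, k)
  cur.val <- eval.fun(current)
  best <- current
  best.val <- cur.val
  accepted <- logical(niter)
  for (i in seq_len(niter)) {
    # step: swap one selected variable for one that is not selected
    outside <- setdiff(seq_len(p), current)
    trial <- current
    trial[sample.int(k, 1)] <- outside[sample.int(length(outside), 1)]
    trial.val <- eval.fun(trial)
    # Metropolis rule: always accept improvements, sometimes accept worse
    if (trial.val <= cur.val ||
        runif(1) < exp((cur.val - trial.val) / temp)) {
      current <- trial
      cur.val <- trial.val
      accepted[i] <- TRUE
      if (cur.val < best.val) {
        best <- current
        best.val <- cur.val
      }
    }
    temp <- temp * cool   # geometric cooling
  }
  # acceptance rate per block of 100 iterations, for monitoring
  list(best = sort(best), value = best.val,
       acc.rate = tapply(accepted, ceiling(seq_len(niter) / 100), mean))
}

set.seed(1)
# toy criterion: distance of the subset to an arbitrary target subset
toy.eval <- function(idx) sum(abs(sort(idx) - c(3, 5, 8)))
res <- sa.subset(toy.eval, p = 13, k = 3)
res$best
res$acc.rate
```

A falling acceptance rate across the blocks indicates that the temperature is doing its job; a rate that stays near one or drops to zero immediately suggests that the starting temperature or the cooling rate needs adjusting.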

One other dedicated SA approach for variable selection can be found in the caret package mentioned in Chap. 7 in the form of the safs (simulated annealing feature selection) function. This function does allow crossvalidation-based quality measures to guide the optimization, but also supports external test sets and criteria like AIC. Parallelization is supported at several different levels. 

10.3.2 Genetic Algorithms

Genetic Algorithms (GAs, Goldberg 1989) manage a population of candidate solutions, rather than one single solution as is the case with most other optimization methods. Every solution in the population is represented as a string of values, and in a process called cross-over, mimicking sexual reproduction, offspring is generated by combining parts of the parent solutions. Random mutations, occurring with relatively low frequency, ensure that some diversity is maintained in the population. The quality of the offspring is measured in an evaluation phase; again in analogy with biology, this quality is often called "fitness". Strings with a low fitness will have no or only a low probability of reproduction, so that subsequent generations will generally consist of better and better solutions. This obvious imitation of the process of natural selection has led to the name of the technique. GAs have been applied to a wide range of problems in very diverse fields, and several overviews of applications within chemistry can be found in the literature (e.g., Leardi 2001; Niazi and Leardi 2012).

Just as with Simulated Annealing, GAs need an evaluation function to obtain fitness values for trial solutions. A step function, on the other hand, is not needed: the genetic machinery (cross-over and mutation operations) will take care of that. Several parameters need to be set, such as the size of the population, the number of iterations, and the chances of crossover and mutation, but that is all. Population sizes are typically in the order of 50–100; the number of iterations is in the order of several hundreds.

There are some aspects, however, that are particular to GAs. The first choice we have to make is on the representation of the candidate solutions, i.e., the candidate subsets. For variable selection, two obvious possibilities present themselves: either a vector of indices of the variables in the subset, or a string of zeros and ones. For other optimization problems, e.g., non-linear fitting, real numbers can also be used. Secondly, the selection function needs to be defined. This determines which solutions are allowed to reproduce, and is the driving force behind the optimization: if all solutions had the same probability, the result would be a random search. Typical selection procedures are to use random sampling with equal probabilities for all solutions above a quality cutoff, or to use random sampling with (scaled) quality indicators as probability weights.
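The ingredients listed above (binary representation, selection with quality-based weights, cross-over and mutation) can be put together in a few lines of base R. The following is a minimal sketch, not the code of any package: the function toy.ga, its arguments and the toy fitness function are hypothetical, and the elitism step (copying the best solution into the next generation) is an addition not discussed in the text.

```r
# Minimal binary GA; fitness is a user-supplied function to be maximised.
toy.ga <- function(fitness, nbits, popsize = 50, niter = 100,
                   pcross = 0.8, pmut = 0.01) {
  pop <- matrix(rbinom(popsize * nbits, 1, 0.5), popsize, nbits)
  for (gen in seq_len(niter)) {
    fit <- apply(pop, 1, fitness)
    # selection: sample parents with scaled fitness as probability weights
    w <- fit - min(fit) + 1e-6
    parents <- pop[sample.int(popsize, popsize, replace = TRUE, prob = w), ,
                   drop = FALSE]
    # one-point cross-over on consecutive pairs of parents
    for (i in seq(1, popsize - 1, by = 2)) {
      if (runif(1) < pcross) {
        cut <- sample.int(nbits - 1, 1)
        tail.idx <- (cut + 1):nbits
        tmp <- parents[i, tail.idx]
        parents[i, tail.idx] <- parents[i + 1, tail.idx]
        parents[i + 1, tail.idx] <- tmp
      }
    }
    # mutation: flip each bit with a small probability
    flip <- matrix(runif(popsize * nbits) < pmut, popsize, nbits)
    parents[flip] <- 1 - parents[flip]
    # elitism (an addition): keep the best solution of this generation
    parents[1, ] <- pop[which.max(fit), ]
    pop <- parents
  }
  fit <- apply(pop, 1, fitness)
  list(solution = pop[which.max(fit), ], fitness = max(fit))
}

set.seed(42)
# toy problem: recover an arbitrary 13-bit target string
target <- c(0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0)
toy.fitness <- function(bits) sum(bits == target)
res <- toy.ga(toy.fitness, nbits = length(target))
res$solution
res$fitness
```

Replacing the weights w by a constant vector turns the selection step into uniform sampling, which makes the loop behave like the random search mentioned above.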

The GA package (Scrucca 2013, 2017) provides a convenient and efficient toolbox, supporting binary, real-valued and permutation representations, and several standard genetic operators. In addition, users can define their own operators. Parallel evaluation of population members is supported (especially useful if the evaluation of a single solution takes some time), and to speed up proceedings even further, local searches can be allowed at random intervals to inject new and useful information into the population. Finally, populations can be "seeded", i.e., one can provide one or more solutions that are thought to be approximately correct.
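As a rough illustration of how such a run might be set up, the sketch below calls the ga function of the GA package with a binary representation and one seeded solution. The toy fitness function, the target string and the seed solution are made up for this example, and the parameter values are arbitrary rather than recommendations; the code is skipped when the package is not installed.

```r
if (requireNamespace("GA", quietly = TRUE)) {
  # toy problem: recover an arbitrary 13-bit target string
  target <- c(0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0)
  toy.fitness <- function(bits) sum(bits == target)   # ga() maximises this

  # one "seed" solution thought to be close to the optimum (two bits off)
  seed.sol <- matrix(c(0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0), nrow = 1)

  ga.res <- GA::ga(type = "binary", fitness = toy.fitness,
                   nBits = length(target),
                   popSize = 50, maxiter = 100,
                   pcrossover = 0.8, pmutation = 0.1,
                   suggestions = seed.sol,
                   monitor = FALSE, seed = 1)

  print(ga.res@solution)       # best bit string(s) found
  print(ga.res@fitnessValue)   # corresponding fitness
}
```

For variable selection, the toy fitness function would be replaced by one that translates the bit string into a set of columns and returns a quality measure to be maximised, for example the negative of a crossvalidated error.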
