Global Optimization Methods
Fig. 10.4 Non-zero coefficients in the lasso and elastic net models. A small vertical offset has been added to facilitate the comparison.

Given the speed of modern-day computing, it is possible to examine large numbers of different models and select the best one. However, as we already saw with leaps-and-bounds approaches, even in cases with a moderate number of variables it is practically impossible to assess the quality of all subsets. One must, therefore, limit the number of subsets considered to a manageable size. The stepwise approach does this by performing a very local search around the current best solution before adding or removing one variable; it can be compared to a steepest-descent strategy. The obvious disadvantage is that many areas of the search space will never be visited. For regression or classification cases with many variables, the method will almost surely end up in a local optimum, very often of low quality.
An alternative is given by random search: just sampling randomly from all possible subsets until time is up. Of course, the chance of finding the global optimum in this way is smaller than the chance of winning the lottery. What is needed is a search strategy that combines random elements with "gradient" information; that is, a strategy that uses the information available in solutions of higher quality, but is able to throw that information away if needed in order to escape from local optima. Approaches of this type have become known as global search strategies. The two best-known ones in the area of chemometrics are Simulated Annealing and Genetic Algorithms. Both will be treated briefly below.
What constitutes quality, in this respect, again depends on the application. In most cases, the property of interest will be the quality of prediction of unseen data, which for larger data sets can conveniently be estimated by crossvalidation approaches. For data sets with few samples, this will not work very well because of the coarse granularity of the criterion: many subsets will lead to an equal number of errors. Additional information should be used to distinguish between these.
In Simulated Annealing (SA, Kirkpatrick et al. 1983; Cerny 1985), a sequence of candidate solutions is assessed, starting from a random initial point. A new solution with quality $E_{t+1}$, not too far away from the current one (with quality $E_t$), is unconditionally accepted if it is better than the current one. If $E_t > E_{t+1}$, on the other hand, accepting the move corresponds to a deterioration. However, and this is the defining feature of SA, such a move can be accepted, with a probability equal to

$$p_{acc} = \exp\left(\frac{E_{t+1} - E_t}{T_t}\right) \qquad (10.5)$$

where $T_t$ is the state of the control parameter at the current time point $t$. Note that $p_{acc}$, defined in this way, is always between zero and one (since $E_t > E_{t+1}$). This criterion is known as the Metropolis criterion (Metropolis et al. 1953). Other criteria are possible, too, but are rarely used.
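The acceptance rule can be written down in a few lines of R. The following is only a minimal sketch, assuming that quality is to be maximized as in Eq. 10.5; the function names `metropolis_prob` and `accept_move` are hypothetical and do not come from any package.

```r
## Acceptance probability under the Metropolis criterion.
## Assumption: e_new and e_cur are qualities (higher is better), temp > 0.
metropolis_prob <- function(e_new, e_cur, temp) {
  if (e_new >= e_cur) return(1)     # improvements are always accepted
  exp((e_new - e_cur) / temp)       # deterioration: Eq. 10.5
}

## Random decision: TRUE means the candidate replaces the current solution.
accept_move <- function(e_new, e_cur, temp) {
  runif(1) < metropolis_prob(e_new, e_cur, temp)
}

metropolis_prob(0.8, 0.9, temp = 0.1)    # a deterioration of 0.1 at T = 0.1
metropolis_prob(0.8, 0.9, temp = 0.01)   # same deterioration, lower T
```

The second call returns a much smaller probability than the first, which is the effect of a lower value of the control parameter.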
The name Simulated Annealing comes from an analogy to annealing in metallurgy, where crystals with fewer defects can be created by repeatedly heating and cooling a material: during the (slow) cooling, the atoms are able to find their energetically most favorable positions in a regular crystal lattice, whereas the heating allows atoms that have been caught in unfavorable positions (local optima) to "try again" in the next cooling stage. The analogy with the optimization task is clear: if an improvement is found (better atom positions) it is accepted; if not, then sometimes a deterioration in quality is accepted in order to be able to cross a ridge in the solution landscape and to find a solution that is better in the end. Very often, the control parameter is therefore indicated with $T$, to stress the analogy with temperature. During the optimization it is slowly decreased in magnitude (the cooling), so that fewer and fewer solutions of lower quality are accepted. In the end, only real improvements are allowed. It can be shown that SA leads to the global optimum if the cooling is slow enough (Granville et al. 1994); unfortunately, the practical importance of this proof is limited since the cooling may have to be infinitely slow. Note that random search is a special case that can be achieved simply by setting $T_t$ to an extremely large value, leading to $p_{acc} \approx 1$ whatever the values of $E_{t+1}$ and $E_t$.
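The effect of cooling on the acceptance probability can be illustrated numerically. The sketch below assumes a geometric cooling schedule, $T_t = T_0 \alpha^t$, which is one common choice and not necessarily the schedule used by any particular implementation; the function name `cooling` and the values of `temp0`, `alpha` and the deterioration `delta` are arbitrary.

```r
## Geometric cooling schedule (an assumption made for illustration only)
cooling <- function(t, temp0 = 1, alpha = 0.95) temp0 * alpha^t

temps <- cooling(0:100)

## Acceptance probability for a fixed deterioration of 0.05 in quality
delta <- -0.05
p_acc <- exp(delta / temps)

round(p_acc[c(1, 51, 101)], 4)   # start, halfway, end of the schedule

## An extremely large temperature accepts practically everything,
## which corresponds to random search
exp(delta / 1e6)
```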
A naive implementation of SA can therefore be very simple: one needs a function that generates a new solution in the neighborhood of the current one, an evaluation function to assess the quality of the new solution, and the acceptance function, including a cooling schedule for the control parameter $T$. The evaluation needs to be defined specifically for each problem. In regression or classification cases, typically some estimate of prediction accuracy is used, such as crossvalidation. Note that the evaluation function probably is the most time-consuming step, and since it will be executed many times (typically thousands or, in complicated cases, even millions of solutions are evaluated by global search methods) it should be very fast. If enough data are available, one could think of using a separate test set for the evaluation, or of using quality criteria such as Mallows's Cp, or AIC or BIC values, mentioned in Chap. 9. The whole SA algorithm can therefore easily be summarized in a couple of steps:
1. Choose a starting temperature and state;
2. Generate and evaluate a new state;
3. Decide whether to accept the new state;
4. Decrease the temperature parameter;
5. Terminate or go to step 2.

Several SA implementations are available in R. We will have a look at the optim function from the core stats package, which implements a general-purpose SA function.
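The five steps listed above can be sketched as a short loop in base R. This is not the algorithm used by optim, only an illustration under stated assumptions: quality is maximized, cooling is geometric, and the names `simple_sa`, `toy_quality`, `toy_step` and the hidden `target` vector are all hypothetical.

```r
simple_sa <- function(init, eval_fun, step_fun,
                      temp0 = 1, alpha = 0.99, n_iter = 2000, ...) {
  cur <- init                        # step 1: starting state ...
  e_cur <- eval_fun(cur, ...)
  temp <- temp0                      # ... and starting temperature
  best <- cur
  e_best <- e_cur
  for (i in seq_len(n_iter)) {       # step 5: stop after n_iter iterations
    cand <- step_fun(cur, ...)       # step 2: generate a new state
    e_cand <- eval_fun(cand, ...)    #         and evaluate it
    ## step 3: Metropolis criterion (higher quality is better)
    if (e_cand >= e_cur || runif(1) < exp((e_cand - e_cur) / temp)) {
      cur <- cand
      e_cur <- e_cand
    }
    if (e_cur > e_best) {            # remember the best state seen so far
      best <- cur
      e_best <- e_cur
    }
    temp <- alpha * temp             # step 4: decrease the temperature
  }
  list(par = best, value = e_best)
}

## Toy problem: recover a hidden 0/1 vector.
## Quality is the number of positions that agree with the target.
toy_quality <- function(sel, target, ...) sum(sel == target)
toy_step <- function(sel, ...) {
  i <- sample(length(sel), 1)        # flip one randomly chosen position
  sel[i] <- 1 - sel[i]
  sel
}

set.seed(7)
target <- c(1, 0, 0, 1, 1, 0, 1, 0, 0, 1)
res <- simple_sa(rep(0, 10), toy_quality, toy_step, target = target)
res$value
```

Keeping track of the best state separately from the current one is a design choice: the current state may wander off to worse solutions while the temperature is still high.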
Let us see how this works in the two-class wines example from Sect. 10.1.2, excluding the Barolo variety. This is a simple example for which it still is quite difficult to assess all possible solutions, especially since we do not force a model with a specific number of variables. We will start with the general-purpose optim approach, since this provides most insight into the inner workings of the SA. First we need to define an evaluation function. Here, we use the fast built-in LOO classification estimates of the lda function:

    library(MASS)   # provides lda()

    lda.loofun <- function(selection, xmat, grouping, ...) {
      ## no variables selected: return the worst possible error rate
      if (sum(selection) == 0) return(100)
      ## CV = TRUE gives leave-one-out predictions in $class;
      ## drop = FALSE keeps a matrix when only one variable is selected
      lda.obj <- lda(xmat[, selection == 1, drop = FALSE], grouping, CV = TRUE)
      100 * sum(lda.obj$class != grouping) / length(grouping)
    }
Argument selection is a vector of numbers here, with ones at the positions of the selected variables, and zeroes elsewhere. Since optim by default does minimization, the evaluation function returns the percentage of misclassified cases. Note that if no variables are selected, a value of 100 is returned.
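To make the encoding of selection concrete, the evaluation function can be tried on a small built-in data set. The sketch below uses two classes of the iris data as a stand-in for the wine data (an assumption made purely so that the snippet runs on its own) and repeats a version of the evaluation function for the same reason.

```r
library(MASS)   # provides lda()

lda.loofun <- function(selection, xmat, grouping, ...) {
  if (sum(selection) == 0) return(100)
  lda.obj <- lda(xmat[, selection == 1, drop = FALSE], grouping, CV = TRUE)
  100 * sum(lda.obj$class != grouping) / length(grouping)
}

## Stand-in data: versicolor versus virginica, four variables
x <- as.matrix(iris[51:150, 1:4])
g <- droplevels(iris$Species[51:150])

sel <- c(1, 0, 1, 0)                   # select variables 1 and 3
colnames(x)[sel == 1]                  # names of the selected variables
err_sel  <- lda.loofun(sel, x, g)      # LOO error (%) for this subset
err_none <- lda.loofun(rep(0, 4), x, g)  # empty selection: 100 by definition
```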
Now that we have defined what exactly we are going to optimize, we need to define a step function, leading from the current solution to the next. A simple approach could be to do one of three things: either remove a variable, add a variable, or replace a variable. If too few variables are selected, we could increase the number by adding one previously unselected variable randomly (so the escape clause in the evaluation function checking for zero selected variables should never be reached). That seems easy enough to put in a function:

    saStepFun <- function(selected, ...) {
      maxval <- length(selected)
      selection <- which(selected == 1)
      ## draw two distinct candidate variable numbers
      newvar2 <- sample(1:maxval, 2)

      if (length(selection) < 2) {
        ## too few variables: fill up with random ones until there are two
        result <- unique(c(selection, newvar2))[1:2]
      } else {
        presentp <- newvar2 %in% selection
        if (all(presentp)) {
          ## both are in the selection: remove the first
          result <- selection[selection != newvar2[1]]
        } else if (all(!presentp)) {
          ## neither is in the selection: add the first
          result <- c(selection, newvar2[1])
        } else {
          ## otherwise swap: drop the one present, add the one absent
          result <- c(selection[selection != newvar2[presentp]],
                      newvar2[!presentp])
        }
      }

      newselected <- rep(0, length(selected))
      newselected[result] <- 1
      newselected
    }
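With an evaluation function and a step function available, the pieces can be handed to optim. For method "SANN", optim uses its gr argument as the function that generates a new candidate, and additional named arguments are passed on to both fn and gr. The sketch below shows this on a toy problem rather than on the wine data; `toy_eval`, `toy_step` and the `target` vector are hypothetical stand-ins, and the values of maxit and temp are arbitrary.

```r
set.seed(1)
target <- c(1, 0, 1, 1, 0, 0, 1, 0)    # hypothetical "ideal" subset

## Toy evaluation: number of positions differing from the target
## (optim minimizes, so lower is better)
toy_eval <- function(selection, target, ...) sum(selection != target)

## Toy step: flip one randomly chosen position
toy_step <- function(selection, ...) {
  i <- sample(length(selection), 1)
  selection[i] <- 1 - selection[i]
  selection
}

init <- rep(0, length(target))
res <- optim(init, fn = toy_eval, gr = toy_step, method = "SANN",
             target = target,
             control = list(maxit = 2000, temp = 1))
res$par     # best selection found
res$value   # its evaluation
```

In the setting of this section one would presumably supply lda.loofun as fn and saStepFun as gr, passing the data matrix and class labels as the extra arguments.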