Discussion
Other GA implementations are available, too, of course. The caret package includes a gafs function that is very similar to the safs function we saw earlier for SA. The genetic function in the subselect package provides a fast Fortran-based GA. The details of the crossover and mutation functions are slightly different from the description above; indeed, there are probably very few implementations that share the exact same crossover and mutation operators, testimony to the flexibility and power of the evolutionary paradigm. Having seen the working of the anneal function, most input parameters will speak for themselves:

> wines.genetic <-
+   genetic(winesHmat$mat, kmin = 3, kmax = 5, nger = 20,
+           popsize = 50, maxclone = 0,
+           H = winesHmat$H, criterion = "ccr12", r = 1)
> wines.genetic$bestvalues
 Card.3  Card.4  Card.5
0.83281 0.84368 0.85248
> wines.genetic$bestsets
       Var.1 Var.2 Var.3 Var.4 Var.5
Card.3     2     7    10     0     0
Card.4     2     3     7    10     0
Card.5     2     3     7    10    12
And indeed, the same three-variable solution is found as the optimal one. This time, four- and five-variable solutions are returned as well, because kmin and kmax were set to 3 and 5.
The maxclone argument tries to enforce diversity by replacing duplicate offspring with random solutions (which are not themselves checked for duplication, however). Leaving out this argument would, in this simple example, lead to a premature end of the optimization because the population becomes completely homogeneous. Both anneal and genetic provide the possibility of a further local optimization of the final best solution.
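The idea behind replacing clones can be illustrated with a few lines of base R. This is only a sketch of the principle, not the code used inside subselect: the function name replace_clones, the toy population and the value n_vars = 13 are assumptions made for the illustration.

```r
# Illustration only: NOT the implementation inside subselect::genetic.
# Each row of `pop` is one candidate subset of k variable indices.
replace_clones <- function(pop, n_vars) {
  k <- ncol(pop)
  dup <- duplicated(pop)   # TRUE for rows identical to an earlier row
  for (i in which(dup)) {
    # Random replacement; as in the text, the new subset is not itself
    # checked against the rest of the population
    pop[i, ] <- sort(sample(n_vars, k))
  }
  pop
}

set.seed(1)
# Toy population that has almost collapsed onto a single solution
pop <- rbind(c(2, 7, 10), c(2, 7, 10), c(2, 7, 10), c(1, 5, 13))
new_pop <- replace_clones(pop, n_vars = 13)
print(new_pop)
```

Rows 2 and 3 are copies of row 1 and are therefore overwritten by random three-variable subsets, while rows 1 and 4 are left untouched. Without such a step a population of identical solutions can only change through mutation.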
10.3.3 Discussion

Variable selection is a difficult process. Simple stepwise methods only work with a small number of variables, whereas the largest gains can be made in the nowadays typical situation of hundreds or even thousands of variables. More complicated methods containing elements of random search, such as SA or GA approaches, can have a high variability, especially in cases where correlations between variables are high. One approach is to repeat the variable selection multiple times, and to use those variables that are consistently selected. Although this strategy is intuitively appealing, it does have one flaw: suppose that variables a and b are highly correlated, and that a combination of either a or b with a third variable c leads to a good model. In repeated selection runs, c will typically be selected twice as often as a or b; if the overall selection threshold is chosen to include c but neither a nor b, the model will not work well.

In addition, the optimization criterion is important. It has been shown that LOO crossvalidation as a criterion for variable selection is inconsistent, in the sense that even with an infinitely large data set it will not choose the correct model (Shao 2003). Baumann et al. advocate the use of leave-multiple-out crossvalidation for this purpose (Baumann et al. 2002a, b), even though the computational burden is high. In this approach, the data are repeatedly split, randomly, into training and test sets, where the number of repetitions needs to be greater than the number of variables, and for every split a separate crossvalidation is performed to optimize the parameters of the modelling method, such as the number of latent variables in PCR or PLS. A workable alternative is to fix the number of latent variables to a “reasonable” number, and to find the subset of variables that with this particular setting leads to the best results. This takes away the nested crossvalidation but may lead to subsets that are suboptimal. In general, one should accept the fact that there is no guarantee that the optimal subset will be found, and it is wise to accept a subset that is “good enough”.
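The flaw of selecting by frequency when two variables are highly correlated can be made concrete with a small simulation. The sketch below is a made-up example, not taken from the wine data: a and b are noisy copies of a shared signal, c is an independent predictor, d is pure noise, and every run draws a fresh data set as a stand-in for the run-to-run variability of a stochastic search. The function name one_run, the noise levels and the threshold of 0.75 are all assumptions.

```r
# Hypothetical simulation: in every run, the best two-variable linear model
# is found by exhaustive search over all pairs of four candidate variables.
one_run <- function(n = 100) {
  z <- rnorm(n)                           # shared signal behind a and b
  X <- cbind(a = z + rnorm(n, sd = 0.3),
             b = z + rnorm(n, sd = 0.3),
             c = rnorm(n),
             d = rnorm(n))                # d is unrelated to y
  y <- z + X[, "c"] + rnorm(n, sd = 0.5)
  pairs <- combn(colnames(X), 2)          # all two-variable subsets
  rss <- apply(pairs, 2, function(p) {
    fit <- lm.fit(cbind(1, X[, p]), y)    # least squares with intercept
    sum(fit$residuals^2)
  })
  pairs[, which.min(rss)]                 # names of the winning pair
}

set.seed(42)
n_runs <- 200
selected <- replicate(n_runs, one_run())  # 2 x n_runs character matrix
freq <- table(factor(selected, levels = c("a", "b", "c", "d"))) / n_runs

threshold <- 0.75
chosen <- names(freq)[freq >= threshold]
print(round(freq, 2))
print(chosen)
```

In a setup like this one would expect c to be picked in nearly every run, and a and b in roughly half of the runs each. A frequency threshold placed between those two levels then retains c alone, even though no single run ever proposed a model without a or b.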