Changes in version 2.1 - The cross-validation loop is now parallelized. The functions attempt to guess a sensible number of cores to use, or the user can specify how many through new argument n.cores. - A fair amount of code refactoring. - Added type='response' for predict when distribution='adaboost'. - Fixed a bug that caused offset not to be used if the first element of offset was 0. - Updated predict.gbm and plot.gbm to cope with objects created using gbm version 1.6. - Changed default value of verbose to 'CV'. gbm now defaults to letting the user know which block of CV folds it is running. If verbose=TRUE is specified, the final run of the model also prints its progress to screen as in earlier versions. - Fixed bug that caused predict to return wrong result when distribution == 'multinomial' and length(n.trees) > 1. - Fixed bug that caused n.trees to be wrong in relative.influence if no CV or validation set was used. - Relative influence was computed wrongly when distribution="multinomial". Fixed. - Cross-validation predictions now included in the output object. - Fixed bug in relative.influence that caused labels to be wrong when sort.=TRUE. - Modified interact.gbm to do additional sanity check, updated help file - Fixed bug in interact.gbm so that it now works for distribution="multinomial" - Modified predict.gbm to improve performance on large datasets Changes in version 2.0 Lots of new features added so it warrants a change to the first digit of the version number. Major changes: - Several new distributions are now available thanks to Harry Southworth and Daniel Edwards: multinomial and tdist. - New distribution 'pairwise' for Learning to Rank Applications (LambdaMART), including four different ranking measures, thanks to Stefan Schroedl. - The gbm package is now managed on R-Forge by Greg Ridgeway and Harry Southworth. Visit http://r-forge.r-project.org/projects/gbm/ to get the latest or to contribute to the package Minor changes: - the "quantile" distribution now handles weighted data - relative.influence changed to give names to the returned vector - Added print.gbm and show.gbm. These give basic summaries of the fitted model - Added support function and reconstructGBMdata() to facilitate reconstituting the data for certain plots and summaries - gbm was not using the weights when using cross-validation due to a bug. That's been fixed (Thanks to Trevor Hastie for catching this) - predict.gbm now tries to guess the number of trees, also defaults to using the training data if no newdata is given. - relative.influence has has 2 new arguments, scale. and sort. that default to FALSE. The returned vector now has names. - gbm now tries to guess what distribution you meant if you didn't specify. - gbm has a new argument, class.stratifiy.cv, to control if cross-validation is stratified by class with distribution is "bernoulli" or "multinomial". Defaults to TRUE for multinomial, FALSE for bernoulli. The purpose is to avoid unusable training sets. - gbm.perf now puts a vertical line at the best number of trees when method = "cv" or "test". Tries to guess what method you meant if you don't tell it. - .First.lib had a bug that would crash gbm if gbm was installed as a local library. Fixed. - plot.gbm has a new argument, type, defaulting to "link". For bernoulli, multinomial, poisson, "response" is allowed. - models with large interactions (>24) were using up all the terminal nodes in the stack. The stack has been increased to 101 nodes allowing interaction.depth up to 49. A more graceful error is now issued if interaction.depth exceeds 49. (Thanks to Tom Dietterich for catching this). - gbm now uses the R macro R_NaN in the C++ code rather than NAN, which would not compile on Sun OS. - If covariates marked missing values with NaN instead of NA, the model fit would not be consistent (Thanks to JR Lockwood for noting this) Changes in version 1.6 - Quantile regression is now available thanks to a contribution from Brian Kriegler. Use list(name="quantile",alpha=0.05) as the distribution parameter to construct a predictor of the 5% of the conditional distribution - gbm() now stores cv.folds in the returned gbm object - Added a normalize parameter to summary.gbm that allows one to choose whether or not to normalize the variable influence to sum to 100 or not - Corrected a minor bug in plot.gbm that put the wrong variable label on the x axis when plotting a numeric variable and a factor variable - the C function gbm_plot can now handle missing values. This does not effect the R function plot.gbm(), but it makes gbm_plot potentially more useful for computing partial dependence plots - mgcv is no longer a required package, but the splines package is needed for calibrate.plot() - minor changes for compatibility with R 2.6.0 (thanks to Seth Falcon) - corrected a bug in the cox model computation when all terminal nodes had exactly the minimum number of observations permitted, which caused gbm and R to crash ungracefully. This was likely to occur with small datasets (thanks to Brian Ring) - corrected a bug in Laplace that always made the terminal node predictions slightly larger than the median. Corrected again in a minor release due to a bug caught by Jon McAuliffe - corrected a bug in interact.gbm that caused it to crash for factors. Caught by David Carslaw - added a plot of cross-validated error to the plots generated by gbm.perf Changes in version 1.5 - gbm would fail if there was only one x. Now drop=FALSE is set in all data.frame subsetting (thanks to Gregg Keller for noticing this). - Corrected gbm.perf() to check if bag.fraction=1 and skips trying to create the OOB plots and estimates. - Corrected a typo in the vignette specifying the gradient for the Cox model. - Fixed the OOB-reps.R demo. For non-Gaussian cases it was maximizing the deviance rather than minimizing. - Increased the largest factor variable allowed from 256 levels to 1024 levels. gbm stops if any factor variable exceeds 1024. Will try to make this cleaner in the future. - predict.gbm now allows n.trees to be a vector and efficiently computes predictions for each indicated model. Avoids having to call predict.gbm several times for different choices of n.trees. - fixed a bug that occurred when using cross-validation for coxph. Was computing length(y) when y is a Surv object which return 2*N rather than N. This generated out-of-range indices for the training dataset. - Changed the method for extracting the name of the outcome variable to work around a change in terms.formula() when using "." in formulas. Changes in version 1.4 - The formula interface now allows for "-x" to indicate not including certain variables in the model fit. - Fixed the formula interface to allow offset(). The offset argument has now been removed from gbm(). - Added basehaz.gbm that computes the Breslow estimate of the baseline hazard. At a later stage this will be substituted with a call to survfit, which is much more general handling not only left-censored data. - OOB estimator is known to be conservative. A warning is now issued when using method="OOB" and there is no longer a default method for gbm.perf() - cv.folds now an option to gbm and method="cv" is an option for gbm.perf. Performs v-fold cross validation for estimating the optimal number of iterations - There is now a package vignette with details on the user options and the mathematics behind the gbm engine. Changes in version 1.3 - All likelihood based loss functions are now in terms of Deviance (-2*log likelihood). As a result, gbm always minimizes the loss. Previous versions minimized losses for some choices of distribution and maximized a likelihood for other choices. - Fixed the Poisson regression to avoid predicting +/- infinity which occurs when a terminal node has only observations with y=0. The largest predicted value is now +/-19, similar to what glm predicts for these extreme cases for linear Poisson regression. The shrinkage factor will be applied to the -19 predictions so it will take 1/shrinkage gbm iterations locating pure terminal nodes before gbm would actually return a predicted value of +/-19. - Introduces shrink.gbm.pred() that does a lasso-style variable selection Consider this function as still in an experimental phase. - Bug fix in plot.gbm - All calls to ISNAN now call ISNA (avoids using isnan) Changes in version 1.2 - fixed gbm.object help file and updated the function to check for missing values to the latest R standard. - gbm.plot now allows i.var to be the names of the variables to plot or the index of the variables used - gbm now requires "stats" package into which "modreg" has been merged - documentation for predict.gbm corrected Changes in version 1.1 - all calculations of loss functions now compute averages rather than totals. That is, all performance measures (text of progress, gbm.perf) now report average log-likelihood rather than total log-likelihood (e.g. mean squared error rather than sum of squared error). A slight exception applies to distribution="coxph". For these models the averaging pertains only to the uncensored observations. The denominator is sum(w[i]*delta[i]) rather than the usual sum(w[i]). - summary.gbm now has an experimental "method" argument. The default computes the relative influence as before. The option "method=permutation.test.gbm" performs a permutation test for the relative influence. Give it a try and let me know how it works. It currently is not implemented for "distribution=coxph". - added gbm.fit, a function that avoids the model.frame call, which is tragically slow with lots of variables. gbm is now just a formula/model.frame wrapper for the gbm.fit function. (based on a suggestion and code from Jim Garrett) - corrected a bug in the use of offsets. Now the user must pass the offset vector with the offset argument rather than in the formula. Previously, offsets were being used once as offsets and a second time as a predictor. - predict.gbm now has a single.tree option. When set to TRUE the function will return predictions from only that tree. The idea is that this may be useful for reweighting the trees using a post-model fit adjustment. - corrected a bug in CPoisson::BagImprovement that incorrectly computed the bagged estimate of improvement - corrected a bug for distribution="coxph" in gbm() and gbm.more(). If there was a single predictor the functions would drop the unused array dimension issuing an error. - corrected gbm() distribution="coxph" when train.fraction=1.0. The program would set two non-existant observations in the validation set and issue a warning. - if a predictor variable has no variation a warning (rather than an error) is now issued - updated the documentation for calibrate.plot to match the implementation - changed the some of the default values in gbm(), bag.fraction=0.5, train.fraction=1.0, and shrinkage=0.001. - corrected a bug in predict.gbm. The C code producing the predictions would go into an infinite loop if predicting an observation with a level of a categorical variable not seen in the training dataset. Now the routine uses the missing value prediction. (Feng Zeng) - added a "type" parameter to predict.gbm. The default ("link") is the same as before, predictions are on the canonical scale (gradient scale). The new option ("response") converts back the same scale as the outcome (probability for bernoulli, mean for gaussian, etc.). - gbm and gbm.more now have verbose options which can be set to FALSE to suppress the progress and performance indicators. (several users requested this nice feature) - gbm.perf no longer prints out verbose information about the best iteration estimate. It simply returns the estimate and creates the plots if requested. - ISNAN, since R 1.8.0, R.h changed declarations for ISNAN(). These changes broke gbm 1.0. I added the following code to buildinfo.h to fix this #ifdef IEEE_754 #undef ISNAN #define ISNAN(x) R_IsNaNorNA(x) #endif seems to work now but I'll look for a more elegant solution. Changes in version 0.8 - Additional documentation about the loss functions, graphics, and methods is now available with the package - Fixed the initial value for the adaboost exponential loss. Prior to version 0.8 the initial value was 0.0, now half the baseline log-odds - Changes in some headers and #define's to compile under gcc 3.2 (Brian Ripley) Changes in version 0.7 - gbm.perf, the argument named best.iter.calc has been renamed "method" for greater simplicity - all entries in the design matrix are now coerced to doubles (Thanks to Bonnie Ghosh) - now checks that all predictors are either numeric, ordinal, or factor - summary.gbm now reports the correct relative influence when some variables do not enter the model. (Thanks to Hugh Chipman) - renamed several #define'd variables in buildinfo.h so they do not conflict with standard winerror.h names. Planned future changes 1. Add weighted median functionality to Laplace 2. Automate the fitting process, ie, selecting shrinkage and number of iterations 3. Add overlay factor*continuous predictor plot as an option rather than lattice plots 4. Add multinomial and ordered logistic regression procedures Thanks to RAND for sponsoring the development of this software through statistical methods funding. Kurt Hornik, Brian Ripley, and Jan De Leeuw for helping me get gbm up to the R standard and into CRAN. Dan McCaffrey for testing and evangelizing the utility of this program. Bonnie Ghosh for finding bugs. Arnab Mukherji for testing and suggesting new features. Daniela Golinelli for finding bugs and marrying me. Andrew Morral for suggesting improvements and finding new applications of the method in the evaluation of drug treatment programs. Katrin Hambarsoomians for finding bugs. Hugh Chipman for finding bugs. Jim Garrett for many suggestions and contributions.
Generated by dwww version 1.15 on Thu Jun 27 23:20:20 CEST 2024.