Benchmarking the optim function

NOTE: The post below is valid for package versions 1.4.0 and earlier.

Estimating the shape parameters of the Beta-Binomial Distribution

Introduction

I wrote a blog post earlier this month to understand and compare the optimization functions in R. Here I am taking my time to go through one function at a time, focusing only on functions that offer more than one analytical method.

Today I will scrutinize the optim function, which has six analytical methods. Setting the stage, we have the Beta-Binomial distribution and Binomial Outcome data, and we need to estimate shape parameters which minimize the Negative Log Likelihood value (or, equivalently, maximize the Log Likelihood value) of the Beta-Binomial distribution for that data. In this case the Alcohol Consumption data from the fitODBOD package will be used.
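For reference, and using my own notation rather than anything taken from the fitODBOD package, the quantity being minimized can be written down explicitly. With \(n = 7\) trials (days), observed frequencies \(f_y\) and the Beta function \(B(\cdot,\cdot)\), the Beta-Binomial probabilities and the Negative Log Likelihood are

\[
P(Y = y) = \binom{n}{y}\frac{B(y+\alpha,\, n-y+\beta)}{B(\alpha,\beta)}, \qquad
-\ell(\alpha,\beta) = -\sum_{y=0}^{n} f_y \log P(Y = y).
\]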

In this blog post we compare, in tables, the processing time of the optimization, the estimated shape parameters, the minimized Negative Log Likelihood value, the expected frequencies, the p-value and the Over-dispersion.

Below are the six analytical methods in question:

  1. Nelder Mead
  2. BFGS
  3. CG
  4. L-BFGS-B
  5. SANN
  6. Brent

The Alcohol Consumption data has two sets of frequency values, but only the values from week 1 will be used. Below is the Alcohol Consumption data, where the number of observations is 399 and the Binomial Random variable takes values from zero to seven.

library(fitODBOD)
library(knitr)      # provides the kable function
library(kableExtra) # provides kable_styling, row_spec, column_spec and the %>% pipe

kable(Alcohol_data,"html",align=c('c','c','c')) %>%
  kable_styling(bootstrap_options = c("striped"),font_size = 14,full_width = F) %>%
  row_spec(0,color = "blue") %>%
  column_spec(1,color = "red")
Days week1 week2
0 47 42
1 54 47
2 43 54
3 40 40
4 40 49
5 41 40
6 39 43
7 95 84

Brief of optim Function

Reading through the brief of the optim function in the previous post will help the reader with operational questions about the function.

So, for the initial parameters a = 0.1 and b = 0.2, we will find the estimated parameters from the different analytical methods which minimize the Negative Log Likelihood value of the Beta-Binomial distribution.

First we wrap the given EstMLEBetaBin function so that it satisfies the conditions of the optim function.

# new function to facilitate optim criteria
# only one input but has two elements
foroptim <- function(a) {
  EstMLEBetaBin(x = Alcohol_data$Days, freq = Alcohol_data$week1,
                a = a[1], b = a[2])
}
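
As a quick check (my own addition, not from the original post), the wrapper can be evaluated directly at the starting values; optim only needs fn to return a single numeric value to minimize.

# a single call to the wrapper; it should return one Negative Log Likelihood
# value for the starting shape parameters a = 0.1 and b = 0.2
foroptim(c(0.1, 0.2))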

So the foroptim function can be used as above to estimate the parameters \(\alpha\) and \(\beta\) (or a, b) for week 1 of the Alcohol Consumption data. Further, the optim function can be scrutinized as below.

  • package : stats
  • No of Inputs: 7
  • Minimum required Inputs : 2
  • Class of output : list
  • No of outputs: 5
  • No of Analytical Methods : 6
  • Default Method : Nelder-Mead
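
To see the list output described above (my own illustration, not from the original post), the components can be inspected with str(); the returned list has the elements par, value, counts, convergence and message (hessian is added only when hessian = TRUE).

# inspecting the structure of the list returned by optim
str(optim(par = c(0.1, 0.2), fn = foroptim))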

Nelder and Mead method

The default analytical method is the Nelder and Mead method. According to the documentation, it uses only function values and is robust but relatively slow. It will work reasonably well for non-differentiable functions.

Reference : Nelder, J.A. and Mead, R., 1965. A simplex method for function minimization. The computer journal, 7(4), pp.308-313.

Below is the code for using the optim function with the Nelder and Mead analytical method.

# optimizing values for a,b using default analytical method or Nelder and Mead
NM_answer<-optim(par=c(0.1,0.2),fn=foroptim)

# the outputs
NM_answer$par # estimated values for a, b
NM_answer$value # minimized function value 
NM_answer$counts  # see the documentation to understand
NM_answer$convergence # indicates successful completion
NM_answer$message # additional information

# fitting the Beta-Binomial distribution with estimated shape parameter  values
fitBetaBin(Alcohol_data$Days,Alcohol_data$week1,NM_answer$par[1],NM_answer$par[2])

BFGS method

The documentation indicates that BFGS is a Quasi-Newton method (also known as a variable metric algorithm), specifically that published simultaneously in 1970 by Broyden, Fletcher, Goldfarb and Shanno. This uses function values and gradients to build up a picture of the surface to be optimized.

Reference : Broyden, C.G., 1967. Quasi-Newton methods and their application to function minimization. Mathematics of Computation, 21(99), pp.368-381.

Below is the code for using the optim function with the BFGS analytical method.

# optimizing values for a,b using BFGS inputs
BFGS_answer<-optim(par=c(0.1,0.2),fn=foroptim,method = "BFGS")

# the outputs
BFGS_answer$par # estimated values for a, b
BFGS_answer$value # minimized function value 
BFGS_answer$counts  # see the documentation to understand
BFGS_answer$convergence # indicates successful completion
BFGS_answer$message # additional information

# fitting the Beta-Binomial distribution with estimated shape parameter  values
fitBetaBin(Alcohol_data$Days,Alcohol_data$week1,
           BFGS_answer$par[1],BFGS_answer$par[2])
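
Since BFGS uses gradients, and optim approximates them numerically when gr is not supplied, an explicit gradient function can also be passed through the gr argument. Below is a minimal sketch assuming the numDeriv package is available; it is not used in the original post.

# optional: passing an explicit (finite-difference) gradient of the
# Negative Log Likelihood to BFGS through the gr argument
library(numDeriv)
forgrad <- function(a) grad(foroptim, a)
optim(par = c(0.1, 0.2), fn = foroptim, gr = forgrad, method = "BFGS")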

CG method

The documentation indicates the Method CG is a conjugate gradients method based on that by Fletcher and Reeves (1964) (but with the option of Polak–Ribiere or Beale–Sorenson updates). Conjugate gradient methods will generally be more fragile than the BFGS method, but as they do not store a matrix they may be successful in much larger optimization problems.

Reference : Fletcher, R. and Reeves, C.M., 1964. Function minimization by conjugate gradients. The computer journal, 7(2), pp.149-154.

Using the CG method with the optim function is shown below.

# optimizing values for a,b using CG inputs
CG_answer<-optim(par=c(0.1,0.2),fn=foroptim,method = "CG")

# the outputs
CG_answer$par # estimated values for a, b
CG_answer$value # minimized function value 
CG_answer$counts  # see the documentation to understand
CG_answer$convergence # indicates successful completion
CG_answer$message # additional information

# fitting the Beta-Binomial distribution with estimated shape parameter  values
fitBetaBin(Alcohol_data$Days,Alcohol_data$week1,CG_answer$par[1],CG_answer$par[2])
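
The Polak–Ribiere and Beale–Sorenson updates mentioned above are selected through the control argument: type = 1 is Fletcher–Reeves (the default), type = 2 is Polak–Ribiere and type = 3 is Beale–Sorenson. A minimal sketch switching to the Polak–Ribiere update:

# same CG optimization, but with the Polak-Ribiere update (type = 2)
CG_PR_answer<-optim(par=c(0.1,0.2),fn=foroptim,method = "CG",
                    control = list(type = 2))
CG_PR_answer$par # estimated values for a, b
CG_PR_answer$value # minimized function value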

L-BFGS-B method

Method L-BFGS-B is that of Byrd et al. (1995), which allows box constraints; that is, each variable can be given a lower and/or upper bound. The initial value must satisfy the constraints. This uses a limited-memory modification of the BFGS quasi-Newton method. If non-trivial bounds are supplied, this method will be selected, with a warning.

Reference : Byrd, R.H., Lu, P., Nocedal, J. and Zhu, C., 1995. A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing, 16(5), pp.1190-1208.

Refer to the code chunk below to understand the L-BFGS-B method of the optim function.

# optimizing values for a,b using L-BFGS-B inputs
L_BFGS_B_answer<-optim(par=c(0.1,0.2),fn=foroptim,method = "L-BFGS-B")

# the outputs
L_BFGS_B_answer$par # estimated values for a, b
L_BFGS_B_answer$value # minimized function value 
L_BFGS_B_answer$counts  # see the documentation to understand
L_BFGS_B_answer$convergence # indicates successful completion
L_BFGS_B_answer$message # additional information

# fitting the Beta-Binomial distribution with estimated shape parameter  values
fitBetaBin(Alcohol_data$Days,Alcohol_data$week1,
           L_BFGS_B_answer$par[1],L_BFGS_B_answer$par[2])
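
Since the shape parameters of the Beta-Binomial distribution must be positive, the box constraints of L-BFGS-B can be used to keep the search inside the valid region. The bound values below are my own choice for illustration and are not part of the original post.

# same L-BFGS-B optimization, but with a small positive lower bound on a and b
L_BFGS_B_bounded<-optim(par=c(0.1,0.2),fn=foroptim,method = "L-BFGS-B",
                        lower = c(1e-5,1e-5), upper = c(Inf,Inf))
L_BFGS_B_bounded$par # estimated values for a, b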

SANN method

Method SANN is by default a variant of simulated annealing given in Belisle (1992). Simulated annealing belongs to the class of stochastic global optimization methods. It uses only function values but is relatively slow. It will also work for non-differentiable functions. This implementation uses the Metropolis function for the acceptance probability.

By default the next candidate point is generated from a Gaussian Markov kernel with scale proportional to the actual temperature. If a function to generate a new candidate point is given, method SANN can also be used to solve combinatorial optimization problems. Temperatures are decreased according to the logarithmic cooling schedule as given in Belisle (1992, p.890); specifically, the temperature is set to \(temp / \log\left(\lfloor (t-1)/tmax \rfloor \cdot tmax + e\right)\), where \(t\) is the current iteration step and temp and tmax are specifiable via control.

Note that the SANN method depends critically on the settings of the control parameters. It is not a general-purpose method but can be very useful in getting to a good value on a very rough surface.

Reference : Belisle, C.J., 1992. Convergence theorems for a class of simulated annealing algorithms on \(R^d\). Journal of Applied Probability, 29(4), pp.885-895.

The code chunk below simply uses the SANN method with the optim function.

# optimizing values for a,b using SANN inputs
SANN_answer<-optim(par=c(0.1,0.2),fn=foroptim,method = "SANN")

# the outputs
SANN_answer$par # estimated values for a, b
SANN_answer$value # minimized function value 
SANN_answer$counts  # see the documentation to understand
SANN_answer$convergence # indicates successful completion
SANN_answer$message # additional information

# fitting the Beta-Binomial distribution with estimated shape parameter  values
fitBetaBin(Alcohol_data$Days,Alcohol_data$week1,
           SANN_answer$par[1],SANN_answer$par[2])
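
The temp and tmax parameters mentioned above, together with the number of function evaluations maxit, are set through the control argument. The values in the sketch below are only illustrative and affect both the quality of the solution and the run time.

# same SANN optimization with explicit control settings (illustrative values)
SANN_tuned<-optim(par=c(0.1,0.2),fn=foroptim,method = "SANN",
                  control = list(temp = 10, tmax = 10, maxit = 5000))
SANN_tuned$par # estimated values for a, b
SANN_tuned$value # minimized function value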

Brent method

The Brent method is for one-dimensional problems only and uses optimize(). It can be useful in cases where optim() is used inside other functions where only method can be specified, such as in mle from package stats4. Since we are estimating two shape parameters (a and b), the Brent method does not work for our situation.

Reference : Brent, R.P., 2013. Algorithms for minimization without derivatives. Courier Corporation.
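
To illustrate why (my own example, not from the original post), the call below passes both shape parameters to the Brent method; since Brent handles only a single parameter with finite lower and upper bounds, it is expected to stop with an error, so it is wrapped in try().

# Brent needs a one-dimensional parameter and finite bounds,
# so this two-parameter call is expected to fail with an error
try(optim(par=c(0.1,0.2),fn=foroptim,method = "Brent",
          lower = c(0,0), upper = c(100,100)))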

Summary of Time evaluation for different Analytical methods of optim function

The table below compares the system processing time for the different analytical methods. To do this time comparison, the benchmark function of the rbenchmark package is used. The code chunk below produces the output as a table containing the analytical methods and their respective time values. The parameter estimation with each method has been replicated 1000 times to obtain more reliable time values.

The table is sorted by the elapsed time column in ascending order. According to it, the least time is taken by the L-BFGS-B and Nelder and Mead methods (which are almost indistinguishable from each other) and the most time is taken by the SANN method. These times depend entirely on the Negative Log Likelihood function you need to minimize, the data you provide, the number of parameters to be estimated, the complexity of the function and, finally, the computer's processing power.

library(rbenchmark)

Results1<-benchmark(
          "NelderMead"={ optim(par = c(0.1,0.2), fn = foroptim)},
          "BFGS"={optim(par = c(0.1,0.2), fn = foroptim,method = "BFGS")},
          "CG"={optim(par = c(0.1,0.2), fn = foroptim,method = "CG")},
          "L-BFGS-B"={optim(par = c(0.1,0.2), fn = foroptim,method = "L-BFGS-B")},
          "SANN"={optim(par = c(0.1,0.2), fn = foroptim,method = "SANN")},
          replications = 1000,
          columns = c("test","replications","elapsed",
                      "relative","user.self","sys.self"),
          order = 'elapsed'
          )

kable(Results1,"html",align = c('c','c','c','c','c','c')) %>%
  kable_styling(full_width = T,bootstrap_options = c("striped"),font_size = 14) %>%
  row_spec(0,color = "blue") %>%
  column_spec(1,color = "red")
test replications elapsed relative user.self sys.self
4 L-BFGS-B 1000 11.83 1.000 11.61 0.01
1 NelderMead 1000 11.92 1.008 11.47 0.00
2 BFGS 1000 21.55 1.822 20.84 0.03
3 CG 1000 40.03 3.384 38.81 0.05
5 SANN 1000 1573.93 133.046 1546.35 1.27

Summary of results after using the optim function for different analytical methods

After using the methods Nelder-Mead, BFGS, CG, L-BFGS-B and SANN to estimate the shape parameters a and b, we can use the estimated parameters in the function fitBetaBin. Using this function we can find the expected frequencies for each of these analytical methods, compare the p-values and Over-dispersion, and understand whether using different analytical methods had any effect on them.

According to the table below, there are no significant differences between the expected frequencies, except when using the SANN method. The five methods generate Over-dispersion values that differ only after the first three decimal places, and the Negative Log Likelihood values and p-values are the same for all five methods up to the first three decimal places. This is a clear indication that, no matter which analytical method we use, the estimation is carried out effectively; only efficiency is affected.

BinomialRandomVariable     Frequency   NelderMead   BFGS        CG          LBFGSB      SANN
0                          47          54.61        54.62       54.62       54.62       54.75
1                          54          42           42          42          42          42.02
2                          43          38.91        38.9        38.9        38.9        38.89
3                          40          38.54        38.54       38.54       38.54       38.52
4                          40          40.07        40.07       40.07       40.07       40.04
5                          41          44           43.99       44          44          43.96
6                          39          53.09        53.09       53.09       53.09       53.05
7                          95          87.77        87.8        87.78       87.78       87.78
Total No of Observations   399         398.99       399.01      399         399         399.01
p-value                                0.0902       0.0903      0.0901      0.0901      0.0901
Estimated a                            0.7230707    0.7228930   0.7229414   0.7229432   0.7215669
Estimated b                            0.5809894    0.5807279   0.5808477   0.5808496   0.5802982
Negative Log Likelihood                813.4571     813.4571    813.4571    813.4571    813.4573
Over Dispersion                        0.4340165    0.4340992   0.4340675   0.4340668   0.4344303

Final Conclusion

We had six methods to compare, and as seen from our tables, choosing one over another is completely harmless to the final result of the estimation. Our situation simply forces us not to use the Brent method. The only issue is time; therefore I would recommend choosing the analytical method of the optim function based on your required output and research objective.

THANK YOU
