How should I set my experiment budget?

Setting an experiment budget is essential to the efficiency and efficacy of your SigOpt experiment.

  • Setting a budget informs the optimizer about how to trade off between exploring the hyperparameter space and exploiting the information it already has to resolve the global optimum.
  • SigOpt’s optimizer expects the budget you set to be the minimum number of observations you will report.
  • You may continue to report runs beyond the original budget, and SigOpt will continue to provide run suggestions. However, SigOpt will assume that each new sample is the last one, so those suggestions will be exploitative in nature.

How to determine your Experiment Budget

Choosing the correct budget can be both model and use-case dependent. We provide guidelines below that give you a feel for how the choice of observation budget is affected by other SigOpt features. These guidelines are based on empirical experiments and years of providing guidance to customers; they should be viewed as a starting point from which you can deviate as you become more comfortable with SigOpt. Most importantly, if the guidelines below lead you to set an observation budget of 100 but you only have enough resources or time for 50, then do 50.

Baseline

Let’s define a simple SigOpt experiment as follows:

  • Optimizing a single metric
  • Sequential optimization (i.e. parallel_bandwidth = 1)
  • Only integer and continuous parameters
  • No use of prior beliefs or manual runs

Assume that you want to set up a simple SigOpt experiment with D dimensions. We recommend setting the budget to a number between 10D and 20D and adjusting up or down in future experiments on the same problem as you see fit.

Example: If you have a 10-dimensional hyperparameter space where all parameters are continuous (real-valued numbers/floats) or integer-valued, we recommend creating your experiment with a budget of at least 100 (10 x #parameters). This can be adjusted in future experiments when the modeler has built up an intuition about roughly how many iterations SigOpt will take to find a sufficiently high-performing hyperparameter configuration for the given problem.
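
For concreteness, here is a minimal Python sketch of the 10D-to-20D rule of thumb; the baseline_budget helper is hypothetical and not part of the SigOpt API.

```python
# Minimal sketch of the 10D-20D rule of thumb; `baseline_budget` is a
# hypothetical helper, not part of the SigOpt API.

def baseline_budget(num_parameters: int, multiplier: int = 10) -> int:
    """Recommended starting budget for a simple experiment.

    Use multiplier=10 for the low end of the guideline, 20 for the high end.
    """
    return multiplier * num_parameters

print(baseline_budget(10))      # 100 -> low end for a 10-dimensional space
print(baseline_budget(10, 20))  # 200 -> high end
```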

Categoricals

Because the values of a categorical parameter lack cardinal and ordinal relationships, the problem becomes combinatorial and we recommend a larger budget. To better understand how to set the budget in this context, we first need to define the categorical breadth of an experiment: the total number of categorical values defined across all of your categorical parameters. For example, if your experiment has two categorical parameters, one with 4 possible values and another with 5 possible values, your experiment’s categorical breadth is 9.

Example: If you start from the baseline experiment and substitute two of the baseline parameters with categorical parameters that have a combined categorical breadth of 9, we recommend adding 135 runs to your budget (15 x categorical breadth), for a total of 235 observations.
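
As a hedged sketch of that arithmetic (the helper names are hypothetical, not part of the SigOpt API):

```python
# Hypothetical helpers illustrating the categorical guideline: add 15 runs
# per unit of categorical breadth on top of the baseline budget.

def categorical_breadth(values_per_categorical):
    """Total number of categorical values across all categorical parameters."""
    return sum(values_per_categorical)

def budget_with_categoricals(baseline_budget, values_per_categorical):
    return baseline_budget + 15 * categorical_breadth(values_per_categorical)

print(categorical_breadth([4, 5]))            # 9
print(budget_with_categoricals(100, [4, 5]))  # 235 = 100 + 15 * 9
```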

Multimetric

When you run a multimetric experiment, SigOpt generates an efficient Pareto frontier that trades off between the two metrics. Effectively resolving this frontier requires the underlying optimization algorithm to perform more function evaluations than it would need to optimize either metric on its own.

Example: If you start from the baseline and add a second metric over which you wish to optimize via SigOpt’s multimetric feature, we then recommend that you increase your budget by at least 3x to 300 runs.
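
As a quick hypothetical sketch of that multiplier:

```python
# Hypothetical helper: multimetric experiments get at least 3x the baseline budget.

def multimetric_budget(baseline_budget, factor=3):
    return factor * baseline_budget

print(multimetric_budget(100))  # 300
```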

Multisolution

Multisolution experiments require additional observations in order to resolve solutions in multiple regions of interest.

Example: If you choose to find two solutions and the rest of the experiment is defined as in the baseline, we recommend increasing the budget to 150. If you choose to find three solutions and the rest of the experiment is defined as in the baseline, we recommend increasing the budget to 200. The multiplier we apply to the baseline budget to arrive at these values is (1 + #solutions) / 2, where #solutions is the number of solutions you selected when creating the experiment.
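
The same arithmetic as a hypothetical Python sketch:

```python
# Hypothetical helper applying the multisolution multiplier
# (1 + #solutions) / 2 to the baseline budget.
import math

def multisolution_budget(baseline_budget, num_solutions):
    return math.ceil(baseline_budget * (1 + num_solutions) / 2)

print(multisolution_budget(100, 2))  # 150
print(multisolution_budget(100, 3))  # 200
```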

Conditionals

Conditional parameters are of interest when different branches of the conditional have different corresponding hyperparameters. A good example is the various Stochastic Gradient Descent optimizer variants available in most deep learning frameworks, each of which may take a different set of hyperparameters. Effectively resolving a conditional experiment requires a larger budget because the optimizer is now exploring a hyperparameter space in which the dimensions governed by the conditional decision are disjoint, and each branch may interact somewhat differently with the other parameters.

Example: If you update your baseline experiment to add a conditional with two branches, we recommend increasing the budget to 200 (baseline budget x #branches).
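
Here is a hedged sketch, assuming the budget scales with the number of conditional branches as in the two-branch example above (the helper name is hypothetical):

```python
# Hypothetical helper, assuming the budget scales with the number of
# conditional branches (2 branches -> 2x the baseline in the example above).

def conditional_budget(baseline_budget, num_branches):
    return baseline_budget * num_branches

print(conditional_budget(100, 2))  # 200
```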

Parallelism

Using parallelism lets you run an experiment on several machines at once. How to account for parallelism depends on the goal you want to achieve, as illustrated in the two examples below:

  • If your goal is to use the extra compute to achieve maximum gains, multiply the baseline budget by the number of workers. Example: if you parallelize the baseline experiment across 10 workers, set a budget of 1000 (budget * num_workers).
  • If your goal is to use the extra compute to get the same result in less wall-clock time, use the formula budget * (1 + log2(num_workers / 2)). Example: for that goal, still parallelizing the baseline across 10 workers, the recommended budget becomes 332 (i.e. 100 * (1 + log2(10 / 2))). A short sketch of both formulas follows this list.
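
Below is a minimal sketch of the two parallelism formulas above; the helper names are hypothetical and the second formula assumes at least 2 workers.

```python
# Hypothetical helpers for the two parallelism goals above.
import math

def parallel_budget_max_gain(baseline_budget, num_workers):
    # Use the extra compute to achieve maximum gains.
    return baseline_budget * num_workers

def parallel_budget_same_result(baseline_budget, num_workers):
    # Reach a comparable result in less wall-clock time.
    # Assumes num_workers >= 2 so the log term stays non-negative.
    return round(baseline_budget * (1 + math.log2(num_workers / 2)))

print(parallel_budget_max_gain(100, 10))     # 1000
print(parallel_budget_same_result(100, 10))  # 332
```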

Metric Thresholds and Metric Constraints

When using metric thresholds and metric constraints, SigOpt requires additional observations to understand the associated feasible parameter space.

Example 1: If you choose to set a metric threshold and the rest of the experiment is defined as in the baseline, we recommend increasing the budget to 125 observations (a 25% increase over the baseline per metric threshold). If the thresholds are set too tight, then more runs are needed.

Example 2: If you choose to set three metric constraints and the rest of the experiment is defined as in the baseline, we recommend increasing the budget to 175 runs (baseline budget x (1 + 0.25 x #metric constraints)).
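
A hypothetical sketch consistent with both examples: the budget grows by 25% of the baseline for each metric threshold or metric constraint.

```python
# Hypothetical helper: add 25% of the baseline budget for each metric
# threshold or metric constraint.

def constrained_budget(baseline_budget, num_thresholds_or_constraints):
    return round(baseline_budget * (1 + 0.25 * num_thresholds_or_constraints))

print(constrained_budget(100, 1))  # 125 (one metric threshold)
print(constrained_budget(100, 3))  # 175 (three metric constraints)
```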

Prior Beliefs

Users often have prior beliefs on how metric values might behave for certain parameters; this knowledge could be derived from domain expertise, similar models trained on different datasets, or certain known structure of the problem itself. SigOpt can take advantage of these prior beliefs on parameters to make the optimization process more efficient.


Thanks for posting such a detailed explanation; it is extremely useful :slight_smile:

I have a few questions about this topic (I hope this is the right place to post them, let me know if I am wrong):

  1. Does the type of the parameters affect the difficulty of the optimization process in any way? Should we take this into account when setting our experiment budget? I would expect that dealing with parameters of type int instead of type double would ease the optimization process. In a similar vein, for parameters of type double, reducing the precision or enabling the log transformation should also reduce the difficulty of finding the best value, shouldn’t it? :thinking:

  2. For parameters of type double or int, shouldn’t we take into account the width of the specified range when setting our experiment budget? The wider the range, the more difficult it should be to find the best value, right?

  3. Do parameter constraints affect in any way the budget of our experiments?

  4. Could you give us more details about why we should test more observations (100 vs 1000 vs 332 in your example) when using parallelism (even for getting the same result in less wall_clock time)? :pray:

Thanks for all the questions; this is absolutely the right place to post them :slight_smile: I’ll address each of your questions in the same order you asked them.

  1. The type of parameters within a given experiment definitely affects the way SigOpt approaches the optimization problem. I wouldn’t say that one type is necessarily “more difficult” than another, but types are important to select carefully when setting up a search, especially in the case of categorical variables. While int and double parameters are treated very similarly, categoricals do not possess ordinal relationships that we can exploit and learn from as effectively, which is why we suggest a higher experiment budget and enforce a more stringent constraint on the number of variables you can search through. As you mentioned, defining a log transformation when the distribution of the parameter is known would be “less difficult” in the sense that SigOpt would most probably find a performant solution in fewer iterations. Check out our docs here on how to apply prior distribution knowledge to your SigOpt experiment.

  2. Technically yes: the larger the range, the larger the space SigOpt has to search through to find optimal areas. However, we are impacted much more by the dimension of the problem than by the range of any particular parameter. Of course, if you already know with some certainty where your parameters are most performant, narrowing the range will speed up the process, but assuming you’re not setting an abnormally broad range, we should be able to learn where the “good” values are within the suggested budget.

  3. The same recommendations we give around setting standard experiments with int or double parameters should still apply to those with linear inequality constraints!

I hope these were helpful!

Could you give us more details about why we should test more observations (100 vs 1000 vs 332 in your example) when using parallelism (even for getting the same result in less wall_clock time)?

My sophomoric intuition of this is that Bayesian optimization is classically a sequential process: it uses the results of previous experiments to update its beliefs. If we have 100 experiments performed in series, the results from each one can inform the next, so each experiment has the maximal amount of information to improve upon. The second experiment can build off the first, the third off of that, and so on; by the time we get to the 11th experiment, it has the knowledge of 10 increasingly improved experiments behind it. If, however, we have 100 experiments performed in parallel batches of 10, then the first 10 experiments are performed essentially naively, and the next 10 only have the knowledge of 10 naive experiments to build off of (instead of 10 increasingly informed experiments).
So parallelism is beneficial, but with diminishing returns, which perhaps we can see represented by the logarithmic term in the stated calculations.

I guess maximizing the benefit of parallelization for Bayesian optimization is still an open area of research.
