An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models

1Tsinghua University
2Carnegie Mellon University

Abstract

The optimal training configurations of large language models (LLMs) with respect to model size and compute budget have been extensively studied, but how to optimally configure LLMs during inference has not been explored in comparable depth. We study compute-optimal inference: designing models and inference strategies that optimally trade off additional inference-time compute for improved performance. As a first step towards understanding and designing compute-optimal inference methods, we assess the effectiveness and computational efficiency of multiple inference strategies, including greedy search, majority voting, best-of-N, weighted voting, and their variants on two different tree search algorithms, across a range of model sizes and compute budgets. We find that a smaller language model combined with our novel tree search algorithm (Reward Balanced Search) typically achieves a Pareto-optimal trade-off. These results highlight the potential benefits of deploying smaller models equipped with more sophisticated decoding algorithms on end devices to enhance problem-solving accuracy.
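For concreteness, the following is a minimal Python sketch of the voting-based aggregation strategies named above (majority voting, weighted voting, and best-of-N). The function names and data layout are illustrative assumptions, not the code used in the paper; in practice the rewards would come from a reward model scoring each sampled solution.

```python
from collections import defaultdict

def majority_vote(answers):
    """Pick the final answer that appears most often among sampled solutions."""
    counts = defaultdict(int)
    for a in answers:
        counts[a] += 1
    return max(counts, key=counts.get)

def weighted_vote(answers, rewards):
    """Pick the final answer with the largest total reward-model score."""
    scores = defaultdict(float)
    for a, r in zip(answers, rewards):
        scores[a] += r
    return max(scores, key=scores.get)

def best_of_n(answers, rewards):
    """Pick the answer of the single solution with the highest reward."""
    return max(zip(answers, rewards), key=lambda ar: ar[1])[0]

# Example: four sampled solutions to the same question (hypothetical values).
answers = ["42", "41", "42", "42"]
rewards = [0.9, 0.7, 0.2, 0.3]
print(majority_vote(answers))           # "42"
print(weighted_vote(answers, rewards))  # "42" (0.9 + 0.2 + 0.3 > 0.7)
print(best_of_n(answers, rewards))      # "42"
```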

Inference Time Scaling

The inference compute scaling behavior of Pythia models, shown as error rate on the GSM8K test set. We evaluate Pythia models of various sizes with varying numbers of sampled solutions for majority voting. The left panel shows that the error rate for each model size decreases steadily as computation increases and eventually converges. The right panel shows model performance under given inference FLOPs budgets. In particular, the three stars highlight the optimal model sizes under budgets of \(10^{12}\), \(10^{13}\), and \(10^{14}\) FLOPs, indicating that the optimal model size can vary with the budget.
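To illustrate how a fixed FLOPs budget translates into sampled solutions for models of different sizes, here is a small sketch. It uses the common rule-of-thumb estimate of roughly 2 FLOPs per model parameter per generated token; the model sizes, solution length, and budgets are illustrative assumptions, not numbers taken from the figure.

```python
def inference_flops(n_params, n_tokens):
    # Rule of thumb: ~2 FLOPs per parameter per generated token (assumption).
    return 2 * n_params * n_tokens

# Hypothetical setup: a few Pythia sizes and an assumed average solution length.
model_sizes = {"pythia-410m": 410e6, "pythia-1.4b": 1.4e9, "pythia-6.9b": 6.9e9}
tokens_per_solution = 256  # assumed, for illustration only

for budget in (1e12, 1e13, 1e14):
    print(f"budget {budget:.0e} FLOPs")
    for name, n_params in model_sizes.items():
        per_sample = inference_flops(n_params, tokens_per_solution)
        n_samples = int(budget // per_sample)
        print(f"  {name}: ~{n_samples} sampled solutions for majority voting")
```

Smaller models can afford many more samples under the same budget, which is why the compute-optimal model size shifts as the budget changes.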

Overview of Reward Balanced Search (REBASE)


The Reward Balanced Search (REBASE) method inherits the exploitation and pruning properties of tree search, while using the reward model alone to estimate node quality, without additional computation for estimating values by sampling children. Efficiency is achieved by constraining the total expansion width of the tree at each depth. REBASE balances the expansion width among the nodes at the same depth based on the rewards given by the Process Reward Model (PRM). See our paper for more details.
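The sketch below illustrates the balancing step at one depth of the tree: the per-depth expansion budget is split across candidate nodes in proportion to a softmax over their PRM rewards. The temperature and the exact rounding scheme here are assumptions for illustration, not the paper's precise procedure.

```python
import math

def rebase_expansion_widths(rewards, total_width, temperature=1.0):
    """Allocate the per-depth expansion budget across candidate nodes
    proportionally to a softmax of their PRM rewards (simplified sketch)."""
    exps = [math.exp(r / temperature) for r in rewards]
    z = sum(exps)
    ideal = [total_width * e / z for e in exps]
    widths = [math.floor(x) for x in ideal]
    # Hand out the children lost to flooring, largest fractional part first,
    # so the total number of expansions matches the budget exactly.
    remainder = total_width - sum(widths)
    order = sorted(range(len(ideal)), key=lambda i: ideal[i] - widths[i], reverse=True)
    for i in order[:remainder]:
        widths[i] += 1
    return widths

# Example: three partial solutions at the same depth, scored by the PRM.
print(rebase_expansion_widths([0.9, 0.5, 0.1], total_width=8))  # [4, 2, 2]
# The highest-reward node receives the most children, but lower-reward
# nodes are not pruned outright, preserving some exploration.
```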

REBASE Results

This figure shows the performance of different inference strategies, where the x-axis represents the computation budget in FLOPs and the y-axis denotes the test error rate on the MATH dataset. The REBASE method consistently outperforms the sampling method and MCTS. REBASE also saturates later than the sampling methods, and at a higher accuracy.

Compared to sampling methods, REBASE achieves higher accuracy with much less computation.