Accelerating Decision Tree-Based Predictive Analytics

Yannick Gatscha
October 26, 2021

Gradient boosting frameworks such as XGBoost, LightGBM and CatBoost, as well as Random Forest algorithms, are often part of winning machine learning models in Kaggle competitions (the gradient boosting frameworks in particular). They are also widely used in recommender systems, search engines and payment platforms.

XGBoost, LightGBM, CatBoost, and Random Forest have another commonality: they are all based on learned decision tree ensembles. Such decision trees are fed with training data in order to teach them to ask the right questions about a data set. For example, if a decision tree is supposed to predict whether a user will like a certain movie recommended to them on a website, the tree learns which features of the movie (i.e. of the data set) are relevant to the user. After the training phase, new data are applied to the decision tree in order to make predictions autonomously without human interaction (prediction phase). Combining the outputs of many decision trees, for example through a voting mechanism, instead of relying on a single tree greatly improves the prediction accuracy and reduces overfitting on the training data. This is the central rationale behind XGBoost, LightGBM, CatBoost and Random Forest.
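To make the prediction phase concrete, here is a minimal sketch in plain Python. It is not how XGBoost or any of the other frameworks are implemented: the three trees, their feature names (`year`, `rating`, `runtime_min`) and their thresholds are hypothetical and hand-written rather than learned. The sketch uses a simple majority vote, which is typical for Random Forest classifiers; gradient boosting frameworks sum the outputs of their trees instead.

```python
from collections import Counter

# A tree is either a leaf (a class label) or an internal node:
# (feature_name, threshold, left_subtree, right_subtree).
# All trees, feature names and thresholds are hypothetical; nothing is trained here.
TREES = [
    ("year", 2000, ("rating", 7.0, "dislike", "like"), "like"),
    ("rating", 6.5, "dislike", "like"),
    ("runtime_min", 150, "like", "dislike"),
]


def predict_tree(tree, sample):
    """Walk one tree from the root to a leaf, asking one question per node."""
    while isinstance(tree, tuple):
        feature, threshold, left, right = tree
        tree = left if sample[feature] <= threshold else right
    return tree


def predict_ensemble(trees, sample):
    """Return the label that receives the most votes across all trees."""
    votes = Counter(predict_tree(tree, sample) for tree in trees)
    return votes.most_common(1)[0][0]


movie = {"year": 2015, "rating": 8.1, "runtime_min": 120}
prediction = predict_ensemble(TREES, movie)
```

In a real model, each tree is much deeper and the ensemble contains hundreds or thousands of trees, which is what makes the prediction phase expensive.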


In this post, we take a closer look at the prediction phase. Think of an XGBoost model consisting of thousands of learned decision trees. If this model is exposed through a web service to make predictions for user queries, the request rate can amount to thousands of simultaneous model executions. In such interactive scenarios, there are two performance metrics in addition to the prediction accuracy:

  • the sustained throughput (simultaneous queries)
  • the response time of a single query
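Both metrics can be measured with a few lines of Python. The sketch below is only an illustration under simplifying assumptions: `dummy_predict` is a hypothetical stand-in for a real model call, and the queries are issued one after another, whereas a realistic test would use a load generator that issues concurrent requests.

```python
import time


def dummy_predict(sample):
    """Hypothetical stand-in for a model call; it just sums the features."""
    return sum(sample)


def benchmark(predict, samples):
    """Measure sustained throughput (queries/s) and worst-case latency (s)."""
    latencies = []
    start = time.perf_counter()
    for sample in samples:
        t0 = time.perf_counter()
        predict(sample)
        latencies.append(time.perf_counter() - t0)
    # Guard against a zero reading on very coarse timers.
    elapsed = max(time.perf_counter() - start, 1e-9)
    return {
        "queries": len(samples),
        "throughput_qps": len(samples) / elapsed,
        "max_latency_s": max(latencies),
    }


samples = [[0.1 * i, 0.2 * i, 0.3 * i] for i in range(1000)]
result = benchmark(dummy_predict, samples)
```

Reporting the maximum (or a high percentile) rather than the mean latency matters here, because a response time guarantee has to hold for every single query.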

To meet throughput constraints, the concurrent model executions are often spread over several machines. In high-throughput cases, this can result in a distribution of the workload over many servers which operate in parallel. Running large server clusters on-premises in a data center or in the cloud results in high operational costs.

On the other hand, guaranteeing a maximum response time (e.g. 100 ms) per query becomes increasingly difficult as the server clusters grow, because data movement, scheduling and similar factors make the latency less predictable.

In this study, we have investigated how both throughput and response time can be optimized in other ways than distributing the workload across large server clusters. Specifically, we were looking for ways to reduce server cluster sizes (and hence save operational costs) under a guaranteed response time constraint. To this end, we offload the execution of decision tree ensembles to hardware accelerators, in particular to data center-grade Field-Programmable Gate Arrays (FPGAs) and Graphics Processing Units (GPUs). Both types of accelerators are available in major public cloud services such as AWS. We compare both against the reference case in which the workload is run on CPU-only servers.

In this benchmark, we used the Xelera Suite Acceleration Software to perform the FPGA tests. A similar study has recently been published by Amadeus IT Group.


The test setup is shown in the figure below. For the CPU-based reference case, scikit-learn was selected because it provides excellent performance for many machine learning algorithms. RAPIDS is a GPU-optimized machine learning library provided by NVIDIA. Xelera provides the corresponding machine learning kernels for FPGAs as well as a Python interface, which is also based on the scikit-learn API.

The test data set is a database for movie recommendations. However, the above benchmarking methodology can be applied to any other data set. We run the tests for different model sizes (number of parallel decision trees per ensemble) and different throughput constraints (number of concurrent queries).


We tested in two environments. The first is an on-premises setup. The second test compares CPU-only instances with GPU- and FPGA-enhanced instances in the AWS cloud. We specify a system with 10,000 decision trees (e.g. 100 different models with 100 trees each, or 10 different models with 1,000 trees each), and with the following constraints on the sustained throughput and the per-query response time:

  • 10,000 queries per second sustained throughput
  • 100 ms response time per single query

In this context, a query means applying an input sample (e.g. a sample received from a web service interface) to all of the 10,000 decision trees.
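The size of this workload follows directly from the specification. The short calculation below uses only the numbers stated above; the bound on queries in flight is an application of Little's law (queries in flight = arrival rate × time in system) and is our own illustration, not part of the original benchmark.

```python
TREES_PER_QUERY = 10_000       # every query is applied to all trees
QUERIES_PER_SECOND = 10_000    # sustained throughput target
MAX_RESPONSE_TIME_MS = 100     # per-query response time limit

# Every query traverses every tree once.
tree_evaluations_per_second = TREES_PER_QUERY * QUERIES_PER_SECOND

# Little's law: if no query may stay in the system longer than the limit,
# the average number of queries in flight cannot exceed rate x limit.
max_queries_in_flight = QUERIES_PER_SECOND * MAX_RESPONSE_TIME_MS // 1000
```

In other words, the system has to sustain 100 million tree traversals per second while keeping at most about 1,000 queries in flight on average.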


In this test, a single Intel Xeon D-2183IT CPU is compared against a single NVIDIA Tesla T4 GPU and a single Xilinx Alveo U50 FPGA card. All platforms were hosted in a Lenovo ThinkSystem SE350 Edge server. All measured times include the overhead for transferring the data and results over PCIe to the accelerators.


In this test, a CPU-only AWS c5.12xlarge instance is compared against a p3.2xlarge (NVIDIA Tesla V100 GPU), a g4dn.xlarge (NVIDIA Tesla T4 GPU), and an f1.2xlarge (Xilinx Virtex UltraScale+ VU9P FPGA) instance. Instead of repeating the same table as above, we have compared the cost efficiency of each hardware platform in terms of million queries per dollar. To this end, we have calculated how many instances are required to meet the above throughput constraint and applied the corresponding AWS pricing.
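The cost efficiency metric can be computed as sketched below. This is our reading of the calculation described above, not the exact script used for the benchmark. The function name and its parameters are hypothetical, and the input values are placeholders for illustration only; they are neither measured throughputs nor actual AWS prices.

```python
import math


def million_queries_per_dollar(target_qps, instance_qps, price_per_hour):
    """Cost efficiency of the smallest fleet that sustains target_qps."""
    # Number of instances needed to meet the throughput constraint.
    instances = math.ceil(target_qps / instance_qps)
    fleet_cost_per_hour = instances * price_per_hour
    queries_per_hour = target_qps * 3600
    return instances, queries_per_hour / 1e6 / fleet_cost_per_hour


# Placeholder inputs for illustration only: not measurements, not AWS prices.
instances, efficiency = million_queries_per_dollar(
    target_qps=10_000, instance_qps=2_500, price_per_hour=2.0
)
```

With these placeholder inputs, four instances are needed and the fleet serves 4.5 million queries per dollar. Because the instance count is rounded up, a platform that only just misses the target with one instance pays for a second one, which is why per-instance throughput and price have to be considered together.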


Operating decision tree ensembles in prediction mode with high throughput requirements can be computationally demanding. GPU accelerators show a noticeable improvement in speed and cost efficiency over CPU-only scale-out solutions. However, FPGA accelerators provide the most significant improvement over CPU and GPU platforms: we see two orders of magnitude speed-up over CPUs and GPUs and one order of magnitude cost savings over CPUs and GPUs. The secret as to why FPGAs perform so well on this class of workloads is their unique memory architecture, which consists of thousands of independent blocks of on-chip memory. This memory is not only highly parallel: a key difference from GPU memory is that it also handles irregular memory accesses very well.
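The following sketch illustrates why memory accesses during tree traversal are irregular. It stores one small, hypothetical tree as parallel arrays (a common layout for inference, assumed here for illustration) and records which node indices are read for two different samples. It does not model FPGA on-chip memory or the Xelera implementation.

```python
# One small hypothetical tree stored as parallel arrays.
# Node i tests sample[FEATURE[i]] <= THRESHOLD[i]; leaves have FEATURE[i] == -1.
FEATURE   = [0,   1,   -1,  -1,  2,   -1,  -1]
THRESHOLD = [0.5, 0.3, 0.0, 0.0, 0.7, 0.0, 0.0]
LEFT      = [1,   2,   -1,  -1,  5,   -1,  -1]
RIGHT     = [4,   3,   -1,  -1,  6,   -1,  -1]
VALUE     = [0.0, 0.0, 1.0, 2.0, 0.0, 3.0, 4.0]


def traverse(sample):
    """Return the leaf value and the list of node indices that were read."""
    node, visited = 0, [0]
    while FEATURE[node] != -1:
        go_left = sample[FEATURE[node]] <= THRESHOLD[node]
        node = LEFT[node] if go_left else RIGHT[node]
        visited.append(node)
    return VALUE[node], visited


value_a, path_a = traverse([0.2, 0.9, 0.0])  # reads nodes 0, 1, 3
value_b, path_b = traverse([0.8, 0.0, 0.1])  # reads nodes 0, 4, 5
```

Which node is read next is only known after the current comparison has been evaluated, and it differs from sample to sample. With thousands of trees and thousands of concurrent queries, this results in many small, data-dependent reads at unpredictable addresses.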


Xelera offers the plugin for accelerated decision tree ensembles to customers as part of Xelera Suite. The software is available for on-premises as well as cloud deployments. Xelera Suite integrates with standard machine learning frameworks so that users without any knowledge of accelerator technology can leverage the benefits shown above. The table below shows the current status of framework support.

We will provide an API to give trial users access to the acceleration software on the public cloud in due course. A more detailed analysis of the performance results is also available. Stay tuned and contact us if you are interested.

