Accelerating Decision Tree-Based Predictive Analytics

Gradient boosting frameworks such as XGBoost, LightGBM and CatBoost, as well as Random Forest algorithms, are often part of winning machine learning models in Kaggle competitions (the gradient boosting frameworks in particular). They are also widely used in recommender systems, search engines and payment platforms.

XGBoost, LightGBM, CatBoost, and Random Forest have another thing in common: they are all based on learned decision tree ensembles. Such decision trees are fed with training data in order to teach them to ask the right questions about a data set. For example, if a decision tree is meant to predict whether a user will like a movie recommended to them on a website, the tree learns which features of the movie (taken from the data set) are relevant to that user. After the training phase, new data are applied to the decision tree in order to make predictions autonomously, without human interaction (prediction phase). Combining the outputs of many decision trees, for example through a voting mechanism, instead of relying on a single tree typically improves the prediction accuracy and reduces overfitting on the training data. This is the central rationale behind XGBoost, LightGBM, CatBoost and Random Forest.
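To make the prediction phase concrete, here is a minimal pure-Python sketch of an ensemble with a majority vote. It is only an illustration and not how any of the frameworks above are implemented: the three hand-written trees, the feature names (`runtime_min`, `avg_rating`, `release_year`), the thresholds and the sample movie are all made up, and gradient boosting frameworks sum the numeric outputs of their trees rather than counting votes.

```python
from collections import Counter

# A tree is either a leaf (a class label) or an internal node that asks one
# question: "is feature <name> below <threshold>?"
# All feature names and thresholds are hypothetical, chosen for illustration.
def leaf(label):
    return {"label": label}

def node(feature, threshold, below, at_or_above):
    return {"feature": feature, "threshold": threshold,
            "below": below, "at_or_above": at_or_above}

def predict_tree(tree, sample):
    """Walk from the root to a leaf, answering one question per level."""
    while "label" not in tree:
        if sample[tree["feature"]] < tree["threshold"]:
            tree = tree["below"]
        else:
            tree = tree["at_or_above"]
    return tree["label"]

def predict_ensemble(trees, sample):
    """Majority vote over the predictions of the individual trees."""
    votes = Counter(predict_tree(t, sample) for t in trees)
    return votes.most_common(1)[0][0]

# Three hand-written trees standing in for a learned ensemble.
ensemble = [
    node("runtime_min", 120, leaf("like"), leaf("dislike")),
    node("avg_rating", 7.0, leaf("dislike"), leaf("like")),
    node("release_year", 2000,
         leaf("dislike"),
         node("avg_rating", 6.0, leaf("dislike"), leaf("like"))),
]

movie = {"runtime_min": 105, "avg_rating": 6.5, "release_year": 2015}
prediction = predict_ensemble(ensemble, movie)
print(prediction)  # two of the three trees vote "like"
```

Each tree on its own is cheap to evaluate. The cost of a prediction comes from walking every tree in the ensemble for every incoming sample.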

ACCELERATING DECISION TREE ENSEMBLES

In this post, we take a closer look at the prediction phase. Think of an XGBoost model consisting of thousands of learned decision trees. If this model is exposed through a web service to make predictions for user queries, the request rate can amount to thousands of simultaneous model executions. In such interactive scenarios, there are two performance metrics in addition to the prediction accuracy:

the sustained throughput (queries served per second)
the response time of a single query

To meet throughput constraints, the concurrent model executions are often spread over several machines. In high-throughput cases, this can result in a distribution of the workload over many servers which operate in parallel. Running large server clusters on-premises in a data center or in the cloud results in high operational costs.

On the other hand, guaranteeing a maximum response time (e.g. 100ms) per query becomes increasingly difficult as the server clusters grow (due to latency uncertainties for data movement, scheduling, etc.).

In this study, we have investigated how both throughput and response time can be optimized in other ways than distributing the workload across large server clusters. Specifically, we were looking for ways to reduce server cluster sizes (and hence save operational costs) under a guaranteed response time constraint. To this end, we offload the execution of decision tree ensembles to hardware accelerators, in particular to data center-grade Field-Programmable Gate Arrays (FPGAs) and Graphics Processing Units (GPUs). Both types of accelerators are available in major public cloud services such as AWS. We compare both against the reference case in which the workload is run on CPU-only servers.

In this benchmark, we used the Xelera Suite Acceleration Software to perform the FPGA tests. Amadeus IT Group has recently published a similar study.

TEST SETUP

The test setup is shown in the figure below. For the CPU-based reference case, scikit-learn was selected because it provides excellent performance for many machine learning algorithms. RAPIDS is a GPU-optimized machine learning library provided by NVIDIA. Xelera provides the corresponding machine learning kernels for FPGAs, together with a Python interface that is also based on the scikit-learn API.

The test data set is a database for movie recommendations (https://www.themoviedb.org/documentation/api). However, the above benchmarking methodology can be applied to any other data set. We run the tests for different model sizes (number of parallel decision trees per ensemble) and different throughput constraints (number of concurrent queries).
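As a rough sketch of what such a sweep over model sizes can look like, the snippet below times a stand-in prediction function with Python's `time.perf_counter`. It is not the harness used for this benchmark: `stand_in_predict` and `benchmark` are hypothetical names, the stand-in does arithmetic instead of evaluating real trees, and the queries are issued one after another rather than concurrently.

```python
import time

def stand_in_predict(n_trees, sample):
    """Placeholder for a real model call; its cost grows with n_trees."""
    return sum((sample * (t + 1)) % 7 for t in range(n_trees))

def benchmark(n_trees, n_queries):
    """Issue queries sequentially and record throughput and latency."""
    latencies = []
    start = time.perf_counter()
    for query in range(n_queries):
        t0 = time.perf_counter()
        stand_in_predict(n_trees, query)
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    latencies.sort()
    return {
        "n_trees": n_trees,
        "queries_per_second": n_queries / elapsed,
        "median_latency_s": latencies[len(latencies) // 2],
        "max_latency_s": latencies[-1],
    }

# Sweep over two (hypothetical) model sizes.
results = [benchmark(n_trees, n_queries=200) for n_trees in (100, 1_000)]
for row in results:
    print(row)
```

To compare platforms, the stand-in would be replaced by the prediction call of the library under test, with the same input samples on every platform.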

BENCHMARK RESULTS

We tested in two environments. The first is an on-premises setup. The second test compares CPU-only instances with GPU- and FPGA-enhanced instances in the AWS cloud. We specify a system with 10,000 decision trees (e.g. 100 different models with 100 trees each, or 10 different models with 1,000 trees each), and with the following constraints on the sustained throughput and the per-query response time:

10,000 queries per second sustained throughput
100ms response time per single query

In this context, a query means applying an input sample (e.g. a sample received from a web service interface) to all of the 10,000 decision trees.
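A quick back-of-the-envelope calculation shows what these constraints imply. The three input numbers come from the text above. The in-flight estimate uses Little's law (average number in the system = arrival rate × time in the system) and is our own illustration, not a figure measured in the benchmark.

```python
TREES_PER_QUERY = 10_000      # every query is applied to all trees
QUERIES_PER_SECOND = 10_000   # sustained throughput constraint
MAX_RESPONSE_TIME_MS = 100    # per-query response time constraint

# Each query traverses every tree once.
tree_evaluations_per_second = TREES_PER_QUERY * QUERIES_PER_SECOND

# Little's law: if every query finishes within 100 ms, at most this many
# queries are being processed at the same time on average.
max_queries_in_flight = QUERIES_PER_SECOND * MAX_RESPONSE_TIME_MS // 1000

print(f"{tree_evaluations_per_second:,} tree evaluations per second")
print(f"up to {max_queries_in_flight:,} queries in flight")
```

In other words, the system has to sustain 100 million tree traversals per second while keeping every individual query within the 100 ms budget.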

ON-PREMISES BENCHMARK

In this test, a single Intel Xeon D-2183IT CPU is compared against a single NVIDIA Tesla T4 GPU and a single Xilinx Alveo U50 FPGA card. All platforms were hosted in a Lenovo ThinkSystem SE350 Edge server. All measured times include the overhead for transferring the data and results over PCIe to the accelerators.

https://xelera.io/assets/blog_posts/on-prem_performance.png

CLOUD BENCHMARK

In this test, a CPU-only AWS c5.12xlarge instance is compared against a p3.2xlarge (NVIDIA Tesla V100 GPU), a g4dn.xlarge (NVIDIA Tesla T4 GPU), and an f1.2xlarge (Xilinx Virtex UltraScale+ VU9P FPGA) instance. Instead of repeating the same table as above, we have compared the cost efficiency of each hardware platform in terms of million queries per dollar. To this end, we have calculated how many instances are required to meet the above throughput constraint and applied the AWS on-demand pricing accordingly (https://aws.amazon.com/ec2/pricing/on-demand/?nc1=h_ls).
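The cost metric can be sketched as follows. This is one plausible way to compute it and an assumption on our part, not necessarily the exact formula behind the published comparison: the function name `million_queries_per_dollar`, the per-instance throughputs and the hourly prices are placeholders, not measured results or actual AWS prices.

```python
import math

def million_queries_per_dollar(target_qps, qps_per_instance, price_per_hour):
    """Return (instances needed, million queries served per dollar).

    Assumes the fleet serves exactly target_qps, so capacity that is
    paid for but left unused lowers the cost efficiency.
    """
    instances = math.ceil(target_qps / qps_per_instance)
    cost_per_hour = instances * price_per_hour
    queries_per_hour = target_qps * 3600
    return instances, queries_per_hour / cost_per_hour / 1e6

TARGET_QPS = 10_000  # throughput constraint from the benchmark definition

# Hypothetical platforms: (queries/s per instance, USD per hour).
platforms = {
    "platform_a": (2_500, 2.00),
    "platform_b": (12_000, 1.50),
}

for name, (qps, price) in platforms.items():
    n, mq_per_dollar = million_queries_per_dollar(TARGET_QPS, qps, price)
    print(f"{name}: {n} instance(s), {mq_per_dollar:.1f} million queries per dollar")
```

Because the instance count is rounded up, a platform that just misses the target on one instance has to pay for a second one, which can roughly halve its cost efficiency.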

CONCLUSION

Operating decision tree ensembles in prediction mode under high throughput requirements can be computationally demanding. GPU accelerators show a noticeable improvement in speed and cost efficiency over CPU-only scale-out solutions. However, FPGA accelerators provide the most significant improvement over CPU and GPU platforms: we see a speed-up of two orders of magnitude and cost savings of one order of magnitude over both CPUs and GPUs. The reason FPGAs perform so well on this class of workloads is their unique memory architecture, which consists of thousands of independent blocks of on-chip memory. This memory is not only highly parallel; a key difference from GPU memory is that it also handles highly parallel, irregular memory accesses very well.
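The following sketch illustrates why tree traversal produces irregular memory accesses. It stores one small tree in flat arrays, a common layout for inference, and records which node indices are read for two different samples. The array names, node values and samples are made up for illustration, and this is not a description of the Xelera FPGA kernels.

```python
# One small tree stored as parallel arrays (illustrative values only).
# Node i is a leaf when LEFT[i] == -1.
FEATURE   = [0,   1,   1,   -1,  -1,  -1,  -1]
THRESHOLD = [0.5, 0.3, 0.7, 0.0, 0.0, 0.0, 0.0]
LEFT      = [1,   3,   5,   -1,  -1,  -1,  -1]
RIGHT     = [2,   4,   6,   -1,  -1,  -1,  -1]
VALUE     = [0.0, 0.0, 0.0, 1.0, 2.0, 3.0, 4.0]

def traverse(sample):
    """Return (leaf value, node indices read along the way)."""
    i = 0
    visited = [i]
    while LEFT[i] != -1:
        # The next index depends on the outcome of this comparison, so the
        # sequence of memory reads is only known while the query is running.
        i = LEFT[i] if sample[FEATURE[i]] < THRESHOLD[i] else RIGHT[i]
        visited.append(i)
    return VALUE[i], visited

value_a, path_a = traverse([0.2, 0.9])
value_b, path_b = traverse([0.8, 0.1])
print(value_a, path_a)
print(value_b, path_b)
```

The two samples read different node indices after the root. With thousands of trees and thousands of concurrent queries, these data-dependent reads scatter across memory, which is the access pattern that many small, independent memory blocks serve well.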

NEXT STEPS

Xelera offers the plugin for accelerated decision tree ensembles to customers as part of Xelera Suite. The software is available for on-premises as well as cloud deployments. Xelera Suite integrates with standard machine learning frameworks so that users without any knowledge of accelerator technology can benefit from the improvements shown above. The table below shows the current status of framework support.

We will provide an API to give trial users access to the acceleration software on the public cloud in due course. Stay tuned and contact us at info@xelera.io if you are interested.
