Ultra-low Latency XGBoost with Xelera Silva

Published: November 7, 2023
Author: Felix Winterstein

Machine learning-enabled strategies are proliferating in electronic trading. We are seeing trading companies moving machine learning into the hot path of the ultra-low latency trading cycle - a task that remains challenging. Five microseconds of delay can draw the line between a successful and an unsuccessful trading strategy. Machine learning inference often takes up most of the latency of the trading loop, and in most cases it does not operate at microsecond latency. In this article, we take a closer look at low-latency machine learning. We focus on XGBoost, LightGBM and CatBoost, machine learning frameworks that are widely used across many different applications and that can provide ML inference in microseconds if done right.

These algorithms share a commonality: they are based on learned decision tree ensembles, which are trained on existing datasets and then applied to new incoming data (the prediction / inference phase). The input is a vector of numerical and/or categorical values, called features, which are often the outcome of a preprocessing step. Using ensembles of many decision trees (thousands) with a voting mechanism boosts prediction accuracy and prevents overfitting to the training data.
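To make the ensemble structure concrete, here is a minimal, self-contained C++ sketch of decision-tree ensemble inference: each tree is walked from the root to a leaf, and the leaf values of all trees are summed, as gradient-boosted models do at prediction time. This is purely illustrative and not the Xelera Silva implementation.

```cpp
#include <iostream>
#include <vector>

// One decision tree stored as flat arrays; a prediction walks from the
// root to a leaf. feature == -1 marks a leaf node.
struct Tree {
    std::vector<int>   feature;     // feature index tested at each node (-1 for leaf)
    std::vector<float> threshold;   // split threshold at each node
    std::vector<int>   left, right; // child node indices
    std::vector<float> value;       // leaf value (valid where feature == -1)

    float predict(const std::vector<float>& x) const {
        int node = 0;
        while (feature[node] != -1) {
            node = (x[feature[node]] < threshold[node]) ? left[node] : right[node];
        }
        return value[node];
    }
};

// Ensemble prediction: sum the outputs of all trees.
float predict_ensemble(const std::vector<Tree>& trees, const std::vector<float>& x) {
    float sum = 0.0f;
    for (const Tree& t : trees) sum += t.predict(x);
    return sum;
}

int main() {
    // One toy tree: the root splits on feature 0 at threshold 0.5;
    // the left leaf returns -1, the right leaf returns +1.
    Tree t{{0, -1, -1}, {0.5f, 0.0f, 0.0f}, {1, -1, -1}, {2, -1, -1}, {0.0f, -1.0f, 1.0f}};
    std::vector<Tree> trees{t, t};
    std::vector<float> x{0.7f};
    std::cout << predict_ensemble(trees, x) << "\n"; // prints 2 (two trees, each +1)
    return 0;
}
```

Note that each tree contributes only one root-to-leaf walk per prediction, which is why this class of models lends itself to very low-latency execution.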

A trained XGBoost model enables a systematic analysis of financial data to identify complex patterns, correlations, and trends in price movements. For example, based on historical data, short-term stock price movements can be predicted more accurately than with traditional methods. The ML model helps decipher the intricate details of order flow, bid-ask spreads, and liquidity dynamics to help predict price fluctuations.  

The inference latency of a machine learning model is measured as the time between the incoming data being sent to the model and the prediction result being received from the model. The many different machine learning / AI algorithms vary widely in the inference latency of their models. Large Language Models (LLMs) are very sophisticated but have inference latencies on the order of seconds. Long Short-Term Memory (LSTM) models are popular methods for analyzing time series data, but their inference latency does not go below a few tens of microseconds, even when run on highly optimized FPGA accelerators. Here we focus on Gradient Boosting Machines (GBMs), of which XGBoost, LightGBM, and CatBoost are the best-known representatives. GBMs provide a prediction accuracy comparable to LSTMs, and, as we shall see below, GBM models can provide inference latencies in the single-digit microsecond range (sub-microsecond in certain circumstances).

Ultra-low latency XGBoost inference is only achievable with special-purpose accelerator chips. In this case, the latency is composed not only of the execution time of the machine learning model itself, but also of the time it takes for the data to arrive at the chip. For example, a typical data flow is: submission of new data to a software API, propagation through the driver and the PCIe bus of the accelerator card, execution of the ML model on the chip, and propagation of the resulting prediction back through the PCIe bus and the driver to the software API. We refer to this as the API latency because it is the true, end-to-end latency seen by the user.
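As an illustration of how API latency can be measured, the following C++ sketch times a single round trip through an accelerator inference call. `accelerator_predict` is a hypothetical stand-in name (stubbed here so the sketch compiles); the actual Xelera Silva API is not shown in this article.

```cpp
#include <chrono>
#include <iostream>
#include <vector>

// Stub standing in for the accelerator inference call so this sketch
// compiles. "accelerator_predict" is a hypothetical name, not the
// Xelera Silva C++ API.
std::vector<float> accelerator_predict(const std::vector<float>& features) {
    return {0.0f}; // dummy prediction
}

int main() {
    std::vector<float> features(64, 0.0f); // example feature vector

    // API latency: time from handing the features to the software API
    // until the prediction comes back, which includes driver and PCIe
    // transfer overhead on top of the on-chip model execution time.
    auto t0 = std::chrono::steady_clock::now();
    std::vector<float> prediction = accelerator_predict(features);
    auto t1 = std::chrono::steady_clock::now();

    auto us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
    std::cout << "API latency: " << us << " us"
              << " (prediction: " << prediction[0] << ")\n";
    return 0;
}
```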

TEST SETUP

The test setup is shown in the figure below. Xelera Silva, the software package for XGBoost, LightGBM and CatBoost inference acceleration, offloads the ML model execution to AMD hardware accelerator platforms and provides a high-performance C++ API to the user. The software loads models created with the default open-source versions of XGBoost, LightGBM or CatBoost. In the benchmark below, LightGBM models have been used (please contact us for more details, including XGBoost or CatBoost benchmarks).
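For context, this is what loading such a model looks like with the C API of the stock open-source LightGBM, running a baseline CPU prediction; Xelera Silva consumes the same saved model files. A minimal sketch, assuming a model saved to the placeholder path "model.txt" (e.g. via Booster.save_model()) and LightGBM 3.0 or later.

```cpp
#include <cstdint>
#include <iostream>
#include <vector>
#include <LightGBM/c_api.h> // C API of the open-source LightGBM package

int main() {
    // Load a model file produced by the default open-source LightGBM.
    BoosterHandle booster;
    int num_iterations = 0;
    if (LGBM_BoosterCreateFromModelfile("model.txt", &num_iterations, &booster) != 0) {
        std::cerr << "failed to load model\n";
        return 1;
    }

    std::vector<double> features(64, 0.0); // one input row; width must match training
    int64_t out_len = 0;
    double prediction = 0.0;

    // Baseline CPU inference on a single feature vector.
    LGBM_BoosterPredictForMat(booster, features.data(), C_API_DTYPE_FLOAT64,
                              /*nrow=*/1, /*ncol=*/64, /*is_row_major=*/1,
                              C_API_PREDICT_NORMAL, /*start_iteration=*/0,
                              /*num_iteration=*/-1, /*parameter=*/"",
                              &out_len, &prediction);
    std::cout << "prediction: " << prediction << "\n";

    LGBM_BoosterFree(booster);
    return 0;
}
```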

The benchmark parameters are listed below. In typical use cases, more than one ML model is loaded into the accelerator and multiple models execute simultaneously on the same input vector.  

BENCHMARK RESULTS

The benchmark is performed with the hardware specified in the table above. We sweep over two of these parameters: the number of decision trees per model (model size) and the number of models executing concurrently. The API latencies are shown in microseconds in the table below.

All measurements include the overhead for transferring the data and results through the driver and over the PCIe bus to the accelerator card. The results show a stable latency that remains invariant with the number of concurrent models on the accelerator and that increases only by a small amount with larger models.

CONCLUSION

In electronic trading, latency is extremely important. At the same time, robust methods for short-term prediction of financial time series are of great value, and an increasing number of firms use machine learning algorithms to this end. XGBoost, LightGBM and CatBoost are popular frameworks in this area. However, low-latency machine learning is challenging because most algorithm implementations do not execute in the single-digit microsecond domain that many ML-based high-speed strategies require to become useful. We show that single-digit microsecond latency can be achieved with a purpose-built accelerator architecture. Beyond that, there are ways to further reduce the latency, as we discuss below.

NEXT STEPS

There are several ways to further reduce the latency of Xelera Silva. First, the I/O interface (in this case the PCIe bus) plays a dominant role in the overall latency budget. Higher PCIe generations as well as cache-coherent CPU-accelerator interfaces such as CXL (available on upcoming AMD Alveo accelerator boards) will have a favorable impact on end-to-end latency. Second, significant latency reduction is expected from an inline variant of Xelera Silva - one that eliminates PCIe transfers altogether and accesses the on-chip inference engine via the accelerator card's network ports.
