Xelera, AMD and HPE have completed the first independently audited submission under the STAC-ML El Popo specification. The configuration: Xelera Silva 8.1.0 running in Accelerator mode on an AMD Alveo V80 FPGA card installed in an HPE ProLiant DL385 Gen10 Plus v2 server. The El Popo suite evaluates GBT inference across three model topologies at a batch size of one. GBT_A (64 input features, 12K nodes) achieved a p99 latency of 1.60 microseconds at NMI 1 and 1.88 microseconds at NMI 15, with total throughput scaling near-linearly from 708,570 to 8,915,430 inferences per second. GBT_B (128 input features, 512K nodes) reached 1.78 microseconds p99 at NMI 1 and 1.95 microseconds at NMI 6, with throughput of 3,364,854 inferences per second. GBT_C (1,000 input features, 4M nodes), the most demanding topology in the suite, achieved 2.88 microseconds p99 at NMI 1. Inference precision is float64-equivalent across all three topologies.
STAC-ML El Popo is the benchmark specification the financial industry wrote for itself. The STAC Benchmark Council, whose membership spans more than 500 banks, hedge funds, electronic market makers, and exchanges, designed El Popo to measure gradient-boosted tree inference under the load conditions that live systematic strategies and risk systems operate under. An independently audited STAC result is produced on production hardware, with full methodology published in the STAC Vault for member institutions to review and reference in procurement decisions.
Configuration XLRA260312 is the first audited El Popo result. It establishes the reference performance baseline for GBT inference in capital markets.
Audited benchmark results — STAC-ML El Popo, configuration XLRA260312
GBT_A, 64 input features, 12K nodes
• p99 latency: 1.60µs at NMI 1 | 1.88µs at NMI 15
• Throughput: 708,570 inf/s at NMI 1, scaling to 8,915,430 inf/s at NMI 15
• Inference error: 3.55e-15, float64-equivalent precision
GBT_B, 128 input features, 512K nodes
• p99 latency: 1.78µs at NMI 1 | 1.95µs at NMI 6
• Throughput: 637,273 inf/s at NMI 1, scaling to 3,364,854 inf/s at NMI 6
• Inference error: 6.33e-15, float64-equivalent precision
GBT_C, 1,000 input features, 4M nodes
• p99 latency: 2.88µs at NMI 1
• Throughput: 379,026 inf/s at NMI 1
• Inference error: 2.23e-14, float64-equivalent precision
Test conditions: batch size of one | independent STAC audit
Xelera Silva: three execution modes, one API
Xelera Silva is an FPGA-based inference engine for gradient-boosted tree models and neural networks. It exposes a unified C, C++, C# and Python API and operates across three execution modes. The API call, the trained model file, and all integration code remain identical regardless of which mode is active. Switching is a single configuration parameter.
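The single-parameter mode switch can be pictured with a short sketch. The class, method, and parameter names below are illustrative stand-ins, not Silva's documented API; the point is only that the caller-facing code is identical across modes:

```python
from enum import Enum

class ExecutionMode(Enum):
    # The three Silva execution modes described above.
    CPU_ONLY = "cpu-only"
    ACCELERATOR = "accelerator"
    INLINE_SILICON = "inline-silicon"

class InferenceEngine:
    """Illustrative wrapper: the trained model file and the
    predict() call stay the same; only the mode parameter changes."""
    def __init__(self, model_path, mode=ExecutionMode.ACCELERATOR):
        self.model_path = model_path
        self.mode = mode

    def predict(self, features):
        # Inside a real engine, dispatch would differ per mode;
        # the caller-facing signature does not change.
        raise NotImplementedError("sketch only")

# Switching modes is a single configuration parameter;
# no other integration code changes.
engine = InferenceEngine("model.xgb", mode=ExecutionMode.CPU_ONLY)
```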
The STAC-ML El Popo benchmark was run in Accelerator mode, the mode designed for larger GBT topologies, up to 4M nodes, and for multi-strategy environments where several models run simultaneously. In Accelerator mode, inference moves from the host CPU to dedicated FPGA fabric on the AMD Alveo V80. Each model runs as an independent pipeline with its own allocated fabric resources. There is no shared execution queue. The practical outcome is that 99th-percentile latency remains flat as the number of concurrent instances increases: the latency of a strategy already running is not affected when a further model is added alongside it.
The other two modes extend the range. CPU-Only mode covers smaller GBT topologies, up to roughly 10K nodes, that fit within CPU cache and run pinned to an isolated core. At that model size, sub-microsecond p99 is achievable with no FPGA hardware at all. Inline Silicon mode removes the PCIe data path entirely. Inference runs as an on-chip IP core, producing a result in the low hundreds of nanoseconds with determinism that does not vary with system load. This is the mode for tick-to-trade paths where any tail behaviour, even an infrequent one, carries a direct cost.
From trained model to first inference takes minutes, not days. The integration path for Accelerator mode starts with a single RegisterModel() call, passing a native XGBoost, LightGBM, or CatBoost model file. Silva compiles and loads the model into Alveo V80 high-bandwidth memory at registration time. No retraining and no FPGA development skills are required. Models can be hot-swapped during operation without restarting the host process. The API is thread-safe and process-safe, which allows concurrent strategy deployments to share a single Silva runtime instance.
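That integration flow might look like the following sketch. RegisterModel() is named in the text; the runtime class, the model handle, and the SwapModel() hot-swap call shown here are hypothetical stand-ins that only record state, not the actual Silva API:

```python
class SilvaRuntime:
    """Illustrative stand-in for the runtime described above:
    register a trained model file, then hot-swap its weights
    without restarting the host process."""
    def __init__(self):
        self._models = {}

    def RegisterModel(self, name, model_file):
        # In the real engine this step compiles the XGBoost/
        # LightGBM/CatBoost file and loads it into Alveo V80
        # high-bandwidth memory; here we only record it.
        self._models[name] = model_file
        return name  # handle used for later calls

    def SwapModel(self, name, new_model_file):
        # Hot-swap: replace the weights behind a registered
        # handle while the host process keeps running.
        assert name in self._models, "model must be registered first"
        self._models[name] = new_model_file

runtime = SilvaRuntime()
handle = runtime.RegisterModel("alpha_signal", "alpha_v1.xgb")
# Later, during live operation: deploy retrained weights in place.
runtime.SwapModel(handle, "alpha_v2.xgb")
```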
A 30-day evaluation period is available, and every evaluation includes a model-specific benchmark assessment run on the customer’s own weights and feature distributions. Teams can verify performance against their actual production topology before any procurement commitment.
What the STAC audit measured and why the numbers matter
GBT inference is structurally difficult for standard CPU frameworks. Each scoring operation traverses a large number of decision trees following paths determined by the input feature vector. Memory access is irregular and branch-dependent. The processor cannot predict which location it will need next, so prefetching does not help, and the traversal bottleneck does not yield to the vectorisation techniques that are effective for matrix-heavy workloads. A 500K-node ensemble on a standard CPU framework costs approximately 30 to 33 microseconds p99, a figure that rises further under concurrent load.
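The data-dependent traversal can be made concrete with a minimal sketch, using invented toy trees: each node compares one feature against a threshold, and the next array index depends on the comparison result, so the access pattern cannot be known until the input arrives.

```python
# Minimal flattened-tree traversal. Each node is a tuple:
# (feature_index, threshold, left_child, right_child, leaf_value).
# Leaves use child index -1 and carry the prediction value.
def score_tree(nodes, x):
    i = 0
    while nodes[i][2] >= 0:  # walk until a leaf is reached
        feat, thr, left, right, _ = nodes[i]
        # The next index depends on the input feature value, so
        # memory access is irregular and branch-dependent.
        i = left if x[feat] < thr else right
    return nodes[i][4]

def score_ensemble(trees, x):
    # A GBT prediction sums the leaf values across all trees.
    return sum(score_tree(t, x) for t in trees)

# Toy two-tree ensemble over two features (values illustrative).
tree_a = [
    (0, 0.5, 1, 2, 0.0),      # root: split on feature 0
    (-1, 0.0, -1, -1, 1.0),   # leaf
    (-1, 0.0, -1, -1, 2.0),   # leaf
]
tree_b = [
    (1, 1.0, 1, 2, 0.0),      # root: split on feature 1
    (-1, 0.0, -1, -1, 0.5),   # leaf
    (-1, 0.0, -1, -1, -0.5),  # leaf
]
```

A production ensemble repeats this branch-dependent walk across hundreds of thousands of nodes per inference, which is why the workload resists CPU prefetching and vectorisation.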
Two aspects of the results stand out beyond the headline p99 figures. The first is throughput scaling. GBT_A throughput grows from 708,570 inferences per second at NMI 1 to 8,915,430 at NMI 15, a near-linear increase that reflects the absence of shared-resource contention between concurrent model instances on the Alveo V80 fabric. GBT_B shows the same pattern, reaching 3,364,854 inferences per second at NMI 6. In a CPU-based inference environment, adding concurrent instances degrades p99 latency for existing models. The FPGA architecture avoids this because each model occupies its own dedicated pipeline.
The second is p99 stability under concurrency. GBT_A p99 moves from 1.60 microseconds at NMI 1 to 1.88 microseconds at NMI 15, a difference of 280 nanoseconds across a 15-fold increase in concurrent instances. GBT_B p99 moves from 1.78 to 1.95 microseconds across NMI 1 to NMI 6. Inference precision across all three topologies is float64-equivalent, confirmed by error benchmarks of 3.55e-15, 6.33e-15, and 2.23e-14 for GBT_A, GBT_B, and GBT_C respectively.
The test hardware was the HPE ProLiant DL385 Gen10 Plus v2, configured with dual AMD EPYC 7763 processors, four terabytes of DDR4 memory, and an AMD Alveo V80 FPGA accelerator. AMD and HPE provided direct technical support throughout the benchmark configuration process. The DL385 was chosen as the host platform because its NUMA architecture and BIOS configuration options allow each model instance to be pinned to a dedicated processor core with direct PCIe locality to the Alveo V80, and kernel-level core isolation removes operating system scheduling from the timing path. The complete hardware, firmware, driver, and software configuration is published in the STAC Vault as XLRA260312.
Why gradient-boosted trees are the right inference target
GBT frameworks, specifically XGBoost, LightGBM, and CatBoost, are the architecture that systematic capital markets runs on. On structured tabular market data, which covers equity, rates, FX, and derivatives signals, GBT models produce stronger out-of-sample accuracy than deep learning architectures and meet the explainability requirements of MiFID II and SR 11-7 through SHAP-based feature attribution without approximation. They retrain on an intraday timescale when market conditions shift, a capability that transformer and LSTM architectures cannot currently match inside a production deployment cycle.
Deployment scenarios
Systematic trading desks. A live systematic operation typically runs several independent strategies in parallel, each driven by its own signal model. Accelerator mode maps directly to this architecture: each strategy registers its own model instance on the Alveo V80, where it runs as an isolated pipeline with no resource contention against the others. Adding a new strategy does not affect the latency of those already running. Existing trained weights load directly, with no retraining or conversion required.
Pre-trade risk. Order validation requires a scoring decision on every order, on the critical path, before execution. The relevant deployment is a single model instance called once per order, where what matters is that the latency overhead is deterministic and bounded. Models can be hot-swapped into the running system without process restarts, which means risk model updates do not require a maintenance window or trading halt.
Financial crime and AML. Transaction monitoring operates at high volume across a continuous flow of events. The relevant metric here is throughput: how many transactions can be scored per second against a GBT ensemble of meaningful size. A single HPE DL385 node with AMD Alveo V80 running multiple concurrent model instances consolidates what has historically required a large CPU server estate into a 2U footprint, with no changes to existing model training pipelines or feature engineering.
Next steps
The STAC-ML El Popo results for configuration XLRA260312 are available to member institutions in the STAC Vault. The published configuration documentation is sufficient to support a production procurement specification.
Evaluate Silva on your own model topology.
Contact Xelera at info@xelera.io or visit xelera.io/silva. Every evaluation includes a benchmark assessment on your own trained model weights. A 30-day trial period is available.
Review the independent audit.
Access configuration XLRA260312 in the STAC Vault at stacresearch.com. STAC Vault membership is required for full result details.
Procure the validated hardware configuration.
The HPE ProLiant DL385 Gen10 Plus v2 with AMD Alveo V80 is available through standard enterprise procurement and HPE GreenLake. Reference XLRA260312 as the production specification.