Gimlet is the agent-native inference cloud
Deploy frontier agentic workloads to Gimlet's managed inference API in minutes. Built on heterogeneous hardware to deliver step-change gains in latency and throughput.
Gimlet features
Custom agents
Build and run custom agents that combine large language models, multimodal models, diffusion models, code sandboxes, remote filesystems, web search, MCP servers, custom code, and data sources. Gimlet models these workloads as multi-stage graphs and runs them at scale with high performance.
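To give a rough picture of what a multi-stage agent graph can look like, here is a minimal sketch in Python. The stage names, structure, and scheduler are hypothetical illustrations only, not Gimlet's API: a workload that chains a web search, an LLM call, a code sandbox, and a final answer.

```python
# Hypothetical illustration of a multi-stage agent graph: a DAG of stages
# (tool calls, model calls, sandboxed code) executed in dependency order.
# All names here are made up for illustration; none of this is Gimlet's API.
from typing import Callable

# Each stage: (dependencies, function from dict-of-inputs to output)
GRAPH: dict[str, tuple[list[str], Callable[[dict], str]]] = {
    "search":  ([],                   lambda _: "top web results for the query"),
    "draft":   (["search"],           lambda d: f"LLM draft grounded in: {d['search']}"),
    "execute": (["draft"],            lambda d: f"sandbox output for: {d['draft']}"),
    "answer":  (["draft", "execute"], lambda d: f"final answer using {d['execute']}"),
}

def run(graph):
    """Run stages in topological order (simple repeated-pass scheduler)."""
    done: dict[str, str] = {}
    while len(done) < len(graph):
        for name, (deps, fn) in graph.items():
            if name not in done and all(d in done for d in deps):
                done[name] = fn({d: done[d] for d in deps})
    return done["answer"]

print(run(GRAPH))
```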
Managed inference API
Consume Gimlet as a managed inference API running on Gimlet's infrastructure, or deploy the same stack into your own data center for large-scale workloads. Gimlet handles orchestration, scaling, and scheduling.
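To make the consumption model concrete, here is a hedged sketch of calling a managed inference API over HTTPS. The endpoint URL, model name, and request schema are placeholder assumptions for illustration, not Gimlet's documented interface; consult the actual API reference for real field names.

```python
# Hypothetical example of consuming a managed inference API over HTTPS.
# The URL, model name, and payload schema below are illustrative
# assumptions, not Gimlet's documented interface.
import requests

resp = requests.post(
    "https://api.example-inference-cloud.dev/v1/chat/completions",  # placeholder URL
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "model": "example-agent-model",  # placeholder model name
        "messages": [{"role": "user", "content": "Summarize today's logs."}],
        "max_tokens": 256,
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```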
Low latency, high throughput
Gimlet delivers very low latency and high throughput through its multi-silicon architecture, compiler, and orchestration stack. By matching each part of the inference workload to the best-suited hardware, it achieves fundamentally better performance than homogeneous infrastructure.
Our approach
No single chip is universally best for inference
Agentic inference is inherently heterogeneous: different parts of the workload, from models to model layers and even individual ops, have different bottlenecks and hardware needs. GPUs, CPUs, SRAM-centric chips, and other accelerators offer complementary strengths, but today's infrastructure is predominantly homogeneous. This fundamentally limits performance, especially at scale.
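One concrete way to see why phases differ: prefill processes the whole prompt in parallel and is typically compute-bound, while autoregressive decode re-reads the model weights for every generated token and is typically memory-bandwidth-bound. A back-of-envelope sketch, using illustrative numbers rather than measurements of any real system:

```python
# Back-of-envelope arithmetic intensity (FLOPs per byte moved) for a dense
# transformer, showing why prefill and decode stress different hardware
# limits. Numbers are illustrative assumptions, not measurements.

params = 70e9                  # model parameters (assumed)
bytes_per_param = 2            # fp16/bf16 weights
flops_per_token = 2 * params   # ~2 FLOPs per parameter per token (rule of thumb)

prompt_tokens = 4096           # prefill: all prompt tokens in one pass
decode_tokens = 1              # decode: one token per step, weights re-read

weight_bytes = params * bytes_per_param
prefill_intensity = flops_per_token * prompt_tokens / weight_bytes
decode_intensity = flops_per_token * decode_tokens / weight_bytes

print(f"prefill: ~{prefill_intensity:.0f} FLOPs/byte -> compute-bound")
print(f"decode:  ~{decode_intensity:.0f} FLOPs/byte -> memory-bandwidth-bound")
```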
For optimal performance, different phases belong on different hardware
Gimlet slices and orchestrates agentic inference workloads across heterogeneous hardware to maximize throughput and minimize latency. Gimlet's compiler lowers each part of the inference graph onto the optimal hardware, from coarse-grained stages like prefill and decode to finer-grained layers and ops, capturing the performance advantages of each hardware type.
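A minimal sketch of the idea behind such lowering, under assumed cost models: for each node in the graph, estimate its runtime on each hardware type and place it where the estimate is lowest. Real compilers also weigh data-movement costs, memory limits, and co-location; the cost numbers and hardware names below are assumptions for illustration only.

```python
# Simplified illustration of mapping inference-graph nodes onto
# heterogeneous hardware by picking, per node, the backend with the lowest
# estimated runtime. Costs and hardware names are illustrative assumptions.

EST_MS = {
    # node            GPU     CPU     SRAM-centric accelerator
    "prefill":       {"gpu": 120.0, "cpu": 900.0, "sram": 180.0},
    "decode":        {"gpu": 9.0,   "cpu": 60.0,  "sram": 4.0},
    "embedding":     {"gpu": 1.5,   "cpu": 2.0,   "sram": 1.0},
    "tool_dispatch": {"gpu": 5.0,   "cpu": 0.5,   "sram": 5.0},
}

def place(costs: dict[str, dict[str, float]]) -> dict[str, str]:
    """Greedy per-node placement; ignores transfer costs for simplicity."""
    return {node: min(hw, key=hw.get) for node, hw in costs.items()}

for node, hw in place(EST_MS).items():
    print(f"{node:13s} -> {hw}")
```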
Gimlet unifies heterogeneous hardware under one inference cloud
Delivering step-change performance for agentic inference requires a new type of data center, built from the ground up to connect diverse hardware with a high-speed fabric so they can operate as one system. Gimlet is the first multi-silicon inference cloud, built for high performance at scale.
Performance
Push the latency-throughput Pareto frontier
By running each part of the inference workload on the best-suited hardware, Gimlet can generate more tokens per second than a homogeneous stack. Compared to traditional inference clouds, this approach enables much higher interactivity at the same throughput, or much higher throughput at the same interactivity.
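For readers who want the term made precise: a configuration sits on the latency-throughput Pareto frontier if no other configuration is both lower-latency and higher-throughput. A small sketch with made-up data points:

```python
# Computing the latency-throughput Pareto frontier from
# (latency_ms, tokens/s) measurements. Data points are made up.

points = [(50, 900), (80, 1500), (120, 1600), (60, 1400), (100, 1200)]

def pareto_frontier(pts):
    """Keep points not dominated by another point that has
    latency <= theirs AND throughput >= theirs."""
    return [
        p for p in pts
        if not any(q[0] <= p[0] and q[1] >= p[1] and q != p for q in pts)
    ]

for lat, tps in sorted(pareto_frontier(points)):
    print(f"{lat} ms latency @ {tps} tok/s")
```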
Get in touch
Whether you are evaluating Gimlet for frontier workloads or exploring a hardware partnership, we'd love to hear from you.
Gimlet is produced by the team at gimletlabs.ai. If you are interested in working with us, check out open positions at gimletlabs.ai/join_us.