Gimlet is the agent-native inference cloud
Deploy frontier agentic workloads to Gimlet's managed inference API in minutes. Built on heterogeneous hardware to deliver step-change gains in latency and throughput.
Gimlet features
Custom agents
Build and run custom agents that combine large language models, multimodal models, diffusion models, code sandboxes, remote filesystems, web search, MCP servers, custom code, and data sources. Gimlet models these workloads as multi-stage graphs and runs them at scale with high performance.
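To give a rough picture of what a multi-stage agent graph can look like, here is a minimal sketch in Python. The stage names, structure, and scheduler are hypothetical illustrations only, not Gimlet's API: a workload that chains a web search, an LLM call, a code sandbox, and a final answer.

```python
# Hypothetical illustration of a multi-stage agent graph: a DAG of stages
# (tool calls, model calls, sandboxed code) executed in dependency order.
# All names here are made up for illustration; none of this is Gimlet's API.
from typing import Callable

# Each stage: (dependencies, function from dict-of-inputs to output)
GRAPH: dict[str, tuple[list[str], Callable[[dict], str]]] = {
    "search":  ([],                   lambda _: "top web results for the query"),
    "draft":   (["search"],           lambda d: f"LLM draft grounded in: {d['search']}"),
    "execute": (["draft"],            lambda d: f"sandbox output for: {d['draft']}"),
    "answer":  (["draft", "execute"], lambda d: f"final answer using {d['execute']}"),
}

def run(graph):
    """Run stages in topological order (simple repeated-pass scheduler)."""
    done: dict[str, str] = {}
    while len(done) < len(graph):
        for name, (deps, fn) in graph.items():
            if name not in done and all(d in done for d in deps):
                done[name] = fn({d: done[d] for d in deps})
    return done["answer"]

print(run(GRAPH))
```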
Managed inference API
Consume Gimlet as a managed inference API running on Gimlet's infrastructure, or deploy the same stack into your own data center for large-scale workloads. Gimlet handles orchestration, scaling, and scheduling.
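To make the consumption model concrete, here is a hedged sketch of calling a managed inference API over HTTPS. The endpoint URL, model name, and request schema are placeholder assumptions for illustration, not Gimlet's documented interface; consult the actual API reference for real field names.

```python
# Hypothetical example of consuming a managed inference API over HTTPS.
# The URL, model name, and payload schema below are illustrative
# assumptions, not Gimlet's documented interface.
import requests

resp = requests.post(
    "https://api.example-inference-cloud.dev/v1/chat/completions",  # placeholder URL
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "model": "example-agent-model",  # placeholder model name
        "messages": [{"role": "user", "content": "Summarize today's logs."}],
        "max_tokens": 256,
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```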
Low latency, high throughput
Gimlet delivers very low latency and high throughput through its multi-silicon architecture, compiler, and orchestration stack. By matching each part of the inference workload to the best-suited hardware, it achieves fundamentally better performance than homogeneous infrastructure.
Our approach
No single chip is universally best for inference
Agentic inference is inherently heterogeneous: different parts of the workload, from models to model layers and even individual ops, have different bottlenecks and hardware needs. GPUs, CPUs, SRAM-centric chips, and other accelerators offer complementary strengths, but today's infrastructure is predominantly homogeneous. This fundamentally limits performance, especially at scale.
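One concrete way to see why phases differ: prefill processes the whole prompt in parallel and is typically compute-bound, while autoregressive decode re-reads the model weights for every generated token and is typically memory-bandwidth-bound. A back-of-envelope sketch, using illustrative numbers rather than measurements of any real system:

```python
# Back-of-envelope arithmetic intensity (FLOPs per byte moved) for a dense
# transformer, showing why prefill and decode stress different hardware
# limits. Numbers are illustrative assumptions, not measurements.

params = 70e9                  # model parameters (assumed)
bytes_per_param = 2            # fp16/bf16 weights
flops_per_token = 2 * params   # ~2 FLOPs per parameter per token (rule of thumb)

prompt_tokens = 4096           # prefill: all prompt tokens in one pass
decode_tokens = 1              # decode: one token per step, weights re-read

weight_bytes = params * bytes_per_param
prefill_intensity = flops_per_token * prompt_tokens / weight_bytes
decode_intensity = flops_per_token * decode_tokens / weight_bytes

print(f"prefill: ~{prefill_intensity:.0f} FLOPs/byte -> compute-bound")
print(f"decode:  ~{decode_intensity:.0f} FLOPs/byte -> memory-bandwidth-bound")
```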
For optimal performance, different phases belong on different hardware
Gimlet slices and orchestrates agentic inference workloads across heterogeneous hardware to maximize throughput and minimize latency. Gimlet's compiler lowers each part of the inference graph onto the optimal hardware, from coarse-grained stages like prefill and decode to finer-grained layers and ops, capturing the performance advantages of each hardware type.
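A minimal sketch of the idea behind such lowering, under assumed cost models: for each node in the graph, estimate its runtime on each hardware type and place it where the estimate is lowest. Real compilers also weigh data-movement costs, memory limits, and co-location; the cost numbers and hardware names below are assumptions for illustration only.

```python
# Simplified illustration of mapping inference-graph nodes onto
# heterogeneous hardware by picking, per node, the backend with the lowest
# estimated runtime. Costs and hardware names are illustrative assumptions.

EST_MS = {
    # node            GPU     CPU     SRAM-centric accelerator
    "prefill":       {"gpu": 120.0, "cpu": 900.0, "sram": 180.0},
    "decode":        {"gpu": 9.0,   "cpu": 60.0,  "sram": 4.0},
    "embedding":     {"gpu": 1.5,   "cpu": 2.0,   "sram": 1.0},
    "tool_dispatch": {"gpu": 5.0,   "cpu": 0.5,   "sram": 5.0},
}

def place(costs: dict[str, dict[str, float]]) -> dict[str, str]:
    """Greedy per-node placement; ignores transfer costs for simplicity."""
    return {node: min(hw, key=hw.get) for node, hw in costs.items()}

for node, hw in place(EST_MS).items():
    print(f"{node:13s} -> {hw}")
```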
Gimlet unifies heterogeneous hardware under one inference cloud
Delivering step-change performance for agentic inference requires a new type of data center, built from the ground up to connect diverse hardware with a high-speed fabric so they can operate as one system. Gimlet is the first multi-silicon inference cloud, built for high performance at scale.
Performance
Push the latency-throughput Pareto frontier
By running each part of the inference workload on the best-suited hardware, Gimlet can generate more tokens per second than a homogeneous stack. Compared to traditional inference clouds, this approach enables much higher interactivity at the same throughput, or much higher throughput at the same interactivity.
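For readers who want the term made precise: a configuration sits on the latency-throughput Pareto frontier if no other configuration is both lower-latency and higher-throughput. A small sketch with made-up data points:

```python
# Computing the latency-throughput Pareto frontier from
# (latency_ms, tokens/s) measurements. Data points are made up.

points = [(50, 900), (80, 1500), (120, 1600), (60, 1400), (100, 1200)]

def pareto_frontier(pts):
    """Keep points not dominated by another point that has
    latency <= theirs AND throughput >= theirs."""
    return [
        p for p in pts
        if not any(q[0] <= p[0] and q[1] >= p[1] and q != p for q in pts)
    ]

for lat, tps in sorted(pareto_frontier(points)):
    print(f"{lat} ms latency @ {tps} tok/s")
```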
Get in touch
Whether you are evaluating Gimlet for frontier workloads or exploring a hardware partnership, we'd love to hear from you.
Gimlet is produced by the team at gimletlabs.ai. If you are interested in working with us, check out open positions at gimletlabs.ai/join_us.