Kove Redefines AI Infrastructure: Software-Defined Memory Unlocks Scalable KV Capabilities for Next-Generation Inference
At AI Infra Summit, Kove demonstrated how Software-Defined Memory enables memory-bound KV workloads to scale well beyond local DRAM while maintaining like-local latency, a critical capability as AI inference grows increasingly memory-constrained.
Watch the full AI Infra 2025 Keynote
The Compute Race Has Left Memory Behind
The AI infrastructure narrative in 2025 has been dominated by compute. Meta, NVIDIA, and AWS are racing to deliver faster accelerators, larger GPU clusters, and new memory hierarchies. Benchmarks like MLPerf continue to push theoretical ceilings ever higher.
But enterprises running real workloads know the truth:
AI isn't compute-bound anymore; it's memory-bound.
- Most latency stalls originate in memory, not compute.
- GPUs frequently idle while waiting for data.
- DRAM remains rigid, tied to individual servers, and chronically underutilized.
The result is an expensive paradox:
Organizations buy more compute, more GPUs, and more hardware, yet workloads still slow down because memory hasn’t kept pace.
At AI Infra Summit 2025, Kove CEO John Overton introduced a fundamentally different paradigm: Software-Defined Memory (Kove:SDM™) — a platform that virtualizes DRAM across servers into a unified, elastic memory pool with latency equivalent to local DRAM, even when memory is served from across the data center.
Kove:SDM™ delivers the layer of AI infrastructure that has been missing: elastic, scalable, like-local memory that eliminates DRAM ceilings.
Why Memory Has Become the Bottleneck
AI workloads are scaling in every dimension: context length, model size, concurrency, and real-time demands. Compute is no longer the limiting factor; memory is.
Traditional DRAM is:
- Locked to a single server, unable to be pooled or shared.
- Provisioned for peak demand, stranding capacity and leaving utilization low.
- Inflexible, requiring oversized, high-memory servers that sit idle most of the time.
This leads to pervasive inefficiencies:
- Training pipelines fragment to fit memory constraints.
- Inference pipelines stall when memory limits are reached.
- Enterprises overspend on hardware yet still hit DRAM ceilings.
The next major unlock isn’t more compute. It’s removing memory constraints altogether.
The Inference Memory Problem: KV-Style Access Patterns
Today’s inference engines — including vLLM, SGLang, TensorRT-LLM, and other large-context architectures — rely heavily on KV-style access patterns.
Even though inference engines like vLLM do not run on Redis or Valkey, the technical access pattern is the same:
- High-velocity key/value lookups
- Strict latency sensitivity
- Performance collapse when memory limits are reached
- KV datasets growing rapidly as context windows expand
When inference workloads outgrow DRAM, CPU-side KV data spills into slower tiers or requires recomputation, which wastes GPU cycles and inflates infrastructure cost.
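To make the cost of that spill concrete, here is a minimal Python sketch, with hypothetical names and sizes rather than code from vLLM or Kove, of a bounded CPU-side KV cache: hits are cheap in-memory lookups, while misses force the kind of recomputation that wastes GPU cycles.

```python
import time

# Illustrative only: a bounded CPU-side KV cache.
# On a hit, precomputed data is reused from DRAM; on a miss
# (evicted or never cached), the value must be recomputed.

CACHE_CAPACITY = 10_000          # stand-in for the local DRAM ceiling
kv_cache: dict[str, bytes] = {}  # key -> cached value

def recompute(key: str) -> bytes:
    """Stand-in for re-running prefill/attention work."""
    time.sleep(0.005)            # simulated recompute cost
    return key.encode() * 16

def lookup(key: str) -> bytes:
    value = kv_cache.get(key)
    if value is not None:
        return value             # cheap path: served from memory
    value = recompute(key)       # expensive path: compute redone
    if len(kv_cache) < CACHE_CAPACITY:
        kv_cache[key] = value    # only cache while headroom exists
    return value
```

Raising CACHE_CAPACITY, which is what pooled memory does for the real systems discussed below, shifts traffic from the expensive miss path back to the cheap hit path.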
Expanding and sustaining DRAM-class performance for KV-style workloads is now one of the biggest unlocks for scalable inference.
This is where Kove:SDM™ shines.
Benchmarking SDM: Redis & Valkey as Proxies for KV Performance at Scale
During AI Infra Summit, Kove shared benchmark results using Redis and Valkey, not because hyperscalers use them directly inside inference engines like vLLM, but because they are:
- Widely adopted
- Well-understood
- Highly memory-bound
- Latency-sensitive
- Excellent proxies for evaluating KV-style workload behavior under DRAM pressure
These systems represent a clean, industry-recognized way to demonstrate how Kove:SDM™ handles the very access patterns that dominate inference.
Key Insight:
If Kove:SDM™ can sustain DRAM-class latency and stability running KV workloads far beyond local DRAM limits, it can also sustain larger CPU-side KV structures for inference frameworks without requiring tiering or recomputation.
Benchmark Highlights
Redis Benchmark (General KV Workload Scaling)
Environment: Redis OSS v7.2.4 on Oracle Cloud Infrastructure
Results demonstrated that Kove:SDM™ enabled:
- Workloads approximately 5x larger than the server’s physical DRAM
- Latency equivalent to or better than local memory in most operations
- Stable throughput even as working sets expanded significantly
Why this matters:
Redis runs in-memory and has well-understood performance properties. Its performance under SDM validates that memory pooling can dramatically expand DRAM-limited workloads without sacrificing latency.
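Kove has not published its benchmark harness here, but as a rough sketch of how a KV workload of this shape can be generated and timed with the standard redis-py client (the endpoint, key count, and value size below are placeholders, not the summit configuration):

```python
import os
import time
import redis  # pip install redis

# Placeholder endpoint; the published tests ran Redis OSS v7.2.4 on OCI.
r = redis.Redis(host="redis.example.internal", port=6379)

VALUE_SIZE = 1024       # bytes per value (placeholder)
NUM_KEYS = 5_000_000    # sized to exceed local DRAM in a real test
SAMPLE_EVERY = 100_000

payload = os.urandom(VALUE_SIZE)

# Load phase: grow the keyspace well past local DRAM capacity.
for i in range(NUM_KEYS):
    r.set(f"key:{i}", payload)

# Read phase: spot-check GET latency across the expanded working set.
for i in range(0, NUM_KEYS, SAMPLE_EVERY):
    start = time.perf_counter()
    r.get(f"key:{i}")
    elapsed_us = (time.perf_counter() - start) * 1e6
    print(f"key:{i}  GET latency: {elapsed_us:.1f} µs")
```

Because Kove:SDM™ requires no application changes, a script like this runs unchanged whether Redis is confined to local DRAM or drawing on the pooled memory tier.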
Valkey Benchmark (Relevant KV Pattern for Inference)
Environment: Valkey v8.0.2 in an Oracle RoCE test environment
Results demonstrated:
- Support for workloads nearly 5x larger than local DRAM
- Latency consistent with local DRAM behavior
- Stable throughput as dataset size scaled
Why this matters:
Valkey’s performance under SDM confirms that large, latency-sensitive KV datasets can operate at DRAM-equivalent performance even when the memory footprint far exceeds local server capacity.
These results apply directly to the KV-style access behavior seen in modern inference systems, even though the systems themselves differ.
Why It Matters for the Future of Inference
As John Overton emphasized:
“Every recompute avoided is GPU time returned to the business. Memory limits create structural waste. Removing those limits creates structural efficiency.”
— John Overton, CEO of Kove
The takeaway is clear:
- Redis proves SDM can scale KV workloads.
- Valkey proves SDM maintains DRAM-class performance across expanded memory footprints.
- Together, they demonstrate the ability to support larger CPU-side KV datasets, a key component of high-throughput, large-context inference engines.
This is the foundation for scaling AI inference sustainably.
The Business Impact
Memory ceilings inflate cost across the entire AI stack. Kove:SDM™ reverses that dynamic.
Enterprises typically see:
- Annual savings at large scale
- Reduced hardware spend by delaying server refresh cycles
- Lower power and cooling costs by eliminating memory overrun conditions
Why Now
Inference demand is accelerating faster than compute supply:
- Context windows are growing into the hundreds of thousands of tokens
- Inference costs already surpass training costs in many organizations
- DRAM density and pricing can’t keep up with model growth
- GPUs remain underutilized because they wait on memory, not compute
Without a new memory architecture, inference scaling becomes unsustainable.
With Kove:SDM™, organizations can:
- Scale workloads 5x larger on the same servers
- Reduce GPU idle time by eliminating memory-induced recompute
- Lower cost per token and per inference
- Operate within existing power, space, and cost envelopes
Frequently Asked Questions (FAQs)
Q: What is the memory bottleneck in AI?
A: AI models generate enormous datasets that outstrip the capacity of server DRAM. When memory fills up, data is evicted, forcing recomputation or disk spills. This slows workloads and wastes expensive GPU cycles.
Q: What is KV Cache and why is it important for inference?
A: The KV cache stores the key/value tensors computed for previous tokens during inference. By reusing this data, models avoid recomputation and generate responses faster. If the cache is too small, entries are evicted and GPUs must redo the work.
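For readers who want to see the mechanism, here is a minimal, framework-agnostic sketch of KV caching during autoregressive decoding (illustrative only, not vLLM's or any production engine's implementation):

```python
import numpy as np

d_model = 64
k_cache: list[np.ndarray] = []   # one key vector per processed token
v_cache: list[np.ndarray] = []   # one value vector per processed token

def decode_step(q: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
    # Store this token's key/value once...
    k_cache.append(k)
    v_cache.append(v)
    K = np.stack(k_cache)        # (tokens_so_far, d_model)
    V = np.stack(v_cache)
    # ...and attend over everything cached so far instead of
    # recomputing keys/values for the whole sequence.
    scores = K @ q / np.sqrt(d_model)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

# Example: decode three tokens with random projections.
rng = np.random.default_rng(0)
for _ in range(3):
    q, k, v = rng.standard_normal((3, d_model))
    out = decode_step(q, k, v)
```

If cached entries no longer fit in memory and are evicted, the corresponding keys and values must be recomputed from the original tokens, which is the GPU waste described above.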
Q: Why benchmark Redis and Valkey?
A: They are industry-standard KV systems and excellent proxies for stress-testing memory-bound KV access patterns. The results transfer to AI inference because the underlying memory behaviors are similar.
Redis has long been the world’s most popular in-memory KV store. Valkey is an open-source fork optimized for performance and integrated into AI inference frameworks like vLLM via LMCache. Benchmarks on Redis prove the general case; Valkey benchmarks show direct relevance to inference workloads.
Q: How does Kove:SDM™ improve Redis and Valkey?
A: By pooling DRAM across servers, SDM enables Redis and Valkey to handle 5x larger workloads at stable latency. This expands KV Cache capacity, reducing recomputes and improving inference throughput.
Q: How does Kove:SDM™ help inference engines?
A: By removing DRAM ceilings for CPU-side KV structures, SDM reduces recomputation, lowers GPU idle time, and sustains throughput as context windows grow.
Q: Is Kove:SDM™ available today?
A: Yes. It runs on existing x86 servers. No rewrites, no kernel changes, and no new hardware required.
Q: How is Kove:SDM™ different from other approaches (like CXL or storage tiering)?
A:
- CXL: Hardware-based and adds latency.
- Storage tiering (NVIDIA, DDN, Weka, etc.): Offloads cache to SSD/storage, which is slower than DRAM.
- Kove:SDM™: Software-only and available today; pools DRAM across servers at latency equal to or better than local DRAM, indistinguishable from local memory.