June 5, 2026

Ollama switches to llama.cpp—faster local inference

Share:

Tool of the Week

Ollama switches to llama.cpp backend architecture

0.30.0-rc31 replaces GGML with direct llama.cpp integration and GGUF compatibility, uses MLX for Apple Silicon inference.

Architectural shift to llama.cpp reduces abstraction layers and improves compatibility with GGUF ecosystem. Developers running local models on Apple Silicon see MLX acceleration; others need to validate performance/memory changes before production rollout.

Replaces GGML-based inference pipeline; requires testing against your model list (llama3.2-vision and laguna-xs.2 currently unsupported). Pre-release quality—only move to prod after benchmarking memory and speed on your hardware. Note: nomic-embed-text now lowercases inputs, breaking prior behavior.

  • This version of Ollama will change the architecture to directly support llama.cpp instead of building on top of GGML
  • allows for compatibility with GGUF file format
  • MLX is used to accelerate model inference on Apple Silicon
  • llama3.2-vision is not yet supported
  • nomic-embed-text now converts inputs to lowercase per the model card where prior Ollama versions incorrectly preserved mixed case
ollamallama-cpplocal-llmsapple-silicongguf

Dev Signal

Get issues like this in your inbox — free, 3x a week.

Quick Signals

Gate every AI request, not sessions alone

Inference theft scales because attackers can amortize auth checks across thousands of proxied calls—you must verify on every request, not per-session, using invisible bot detection that runs server-side before inference.

A single stolen frontier-model call costs ~$2 while your HTTP endpoint costs fractions of a cent; attackers resell at 5-10% discount for pure margin. Without per-request gates, your AI budget bleeds tens of thousands per attack cycle.

Replaces session-layer rate limits and IP blocks with per-request bot classification. Requires Vercel BotID client/server setup (~15 lines of code) or equivalent invisible CAPTCHA. Production-ready now—Vercel's own docs endpoint blocks >10k bot requests within minutes using this pattern.

  • a single prompt to an agent on a frontier model can cost $2
  • Vercel charges ~$2/million, a fraction of a cent per call
  • verification has to run on every AI request
  • Any check that runs per session amortizes the attacker's bypass cost across every subsequent inference call
  • BotID deep analysis detected and blocked more than ten thousand bot requests in the first minutes of the spike
  • inference cost run rate of over ten thousand dollars per day
api-securityai-abusebot-detectionrate-limitinginference-costs

OpenSearch Serverless lands in Vercel Marketplace

Provision vector search directly from Vercel dashboard with automatic env injection; replaces manual AWS setup for RAG and agentic workflows.

Eliminates context-switching between Vercel and AWS consoles for search infrastructure, cutting provisioning friction for vector-grounded LLM applications. Built-in hybrid/vector/lexical search support and 20× faster autoscaling directly address agentic workload patterns.

Ready now for new projects; replaces raw AWS OpenSearch onboarding. Requires Vercel + AWS account linkage. Worth trying immediately if building RAG apps on Vercel—$100 credit lowers barrier. Existing OpenSearch deployments need manual migration.

  • Amazon OpenSearch Serverless is now available in the Vercel Marketplace
  • Autoscales up to 20× faster, built for the bursty, unpredictable load patterns of agentic workloads
  • Scale to zero with no idle costs
  • Up to 60% cost savings by paying for actual usage instead of peak capacity
  • Unified support for vector, lexical, hybrid, and agentic search in a single collection
vector-searchvercelopensearchragserverless

Agent loops replace frameworks—ship in fifty lines

An AI agent is one synchronous loop: send conversation to model, run requested tools, feed results back, repeat until answered—no framework required.

Developers spend cycles evaluating agent frameworks when the core pattern fits in a single code block. Understanding this cuts through abstraction overhead and makes tool-use agents debuggable and portable.

Replaces opaque agent SDKs with transparent control flow. Requires Claude API access and basic async handling. Ready now—copy the loop, add your tools, deploy.

  • an agent is one loop: Send the conversation to the model. If it asks to use a tool, run the tool. Feed the result back. Repeat until it answers.
  • That's it. Tool use is just: the model emits a tool_use block, you run the matching function, you hand back a tool_result.
  • you can read every line of an agent in one sitting
agentstool-useclaude-apipatternsimplementation

Starlette Host header bypass breaks downstream auth chains

Malformed Host headers (containing /, ?, or #) bypass path-based access controls in Starlette by shifting URL parsing boundaries, affecting AI agent infrastructure and MCP servers.

If your FastAPI/Starlette middleware gates auth decisions on request.url.path, you're vulnerable to authentication bypass regardless of how correctly your individual components behave. Patch urgency is high for any AI service exposed without reverse-proxy protection.

Upgrade to Starlette 1.0.1 immediately. This is not a single-file bug—it's a three-layer interaction issue (ASGI → Starlette → middleware). The vulnerability is mitigated if you front with CDN/load-balancer/reverse-proxy, but internal LLM deployments and MCP servers lack this protection by default. Worth testing now against badhost.org scanner.

  • 325 million weekly downloads
  • allows attackers to use malformed HTTP Host headers to bypass path-based access controls
  • The vulnerability only emerges from the interaction between them
  • the path from Starlette quirk to LLM-serving primitive is not theoretical; it is the discovery path
  • potentially affected AI services are often deployed on internal networks, lab subnets, and LLM research environments that lack the reverse-proxy protection
  • the MCP spec mandates unauthenticated OAuth discovery endpoints, providing a reliable path for exploitation
starletteauthentication-bypasshost-header-injectionai-infrastructuresecurity-patch

AWS HTTP API bypasses Lambda authorizer with trailing slash

Path normalization mismatch between HTTP API's route matching and authorizer layers allows unauthenticated access when request paths include trailing slashes; authorizer context drops during integration mapping.

Teams relying on Lambda authorizers as the sole authentication gate on HTTP API can leak sensitive data or enable unauthorized state changes without code changes. Requires immediate audit of path handling and backend validation logic.

Immediate action: test protected routes with and without trailing slashes; add independent userId validation in every Lambda function rather than trusting authorizer context alone. Consider switching to REST API for security-sensitive endpoints despite cost/performance trade-off. HTTP API development was quietly put on hold 4-5 years ago, reducing likelihood of platform-level fixes.

  • GET /v1/accounts returned 401 Unauthorized. GET /v1/accounts/ returned 200 OK with full account data
  • HTTP API does greedy path matching by default
  • The authorizer sets context.authorizer.userId on the authenticated request
  • When the trailing-slash path hit the integration, userId arrived as undefined
  • It is the newer API but development was quietly put on hold 4-5 years ago
aws-api-gatewayauthentication-bypasslambda-authorizerpath-normalizationsecurity-audit

Enjoying Dev Signal? Get every issue in your inbox.

Free forever · 3 issues a week · One-click unsubscribe