
Cloud layer: forge-cloud

An OpenAI-compatible, reasoning-aware inference proxy. Point your existing OpenAI client at forge-cloud instead of directly at vLLM/Ollama, and it auto-routes thinking mode, manages the context budget, tunes MTP, and blocks known-broken configs.

How it works

  1. You send a standard OpenAI chat completion request to forge-cloud
  2. forge-cloud inspects the request: query complexity, model target, backend
  3. The session layer (qwen-think) decides: thinking mode on/off, sampling params, context budget (a toy sketch of this decision follows the list)
  4. The optimize layer (qwen3.6-mtp) checks: is MTP appropriate? Any blocked combos?
  5. Request is forwarded to your backend (vLLM, SGLang, Ollama, DashScope)
  6. Response comes back instrumented (thinking tokens vs response tokens, budget consumption)
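
As a rough illustration of step 3, a complexity-based think/no-think decision might look like the sketch below. The heuristic, thresholds, and function names are assumptions for illustration, not forge-cloud's actual internals:

# Hypothetical sketch of complexity-based routing. The heuristic and
# thresholds here are illustrative assumptions only.

def estimate_complexity(messages: list[dict]) -> float:
    # Toy heuristic: longer, question-dense prompts score higher.
    text = " ".join(
        m["content"] for m in messages if isinstance(m.get("content"), str)
    )
    length_score = min(len(text) / 2000, 1.0)
    question_score = min(text.count("?") / 3, 1.0)
    return 0.7 * length_score + 0.3 * question_score

def route(messages: list[dict]) -> dict:
    # Above the threshold: thinking mode with a reasoning-friendly
    # sampling profile; below it: the fast no-think path.
    if estimate_complexity(messages) > 0.5:
        return {"thinking": True, "temperature": 0.6, "context_budget": 8192}
    return {"thinking": False, "temperature": 0.7, "context_budget": 2048}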

Install

pip install forge-cloud

Usage

export FORGE_ADMIN_KEY=my-secret
export FORGE_BACKEND_URL=http://localhost:8000
forge-cloud  # runs on port 8741

Then point your OpenAI client at it:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8741/v1", api_key="your-key")
response = client.chat.completions.create(
    model="Qwen/Qwen3.6-27B",
    messages=[{"role": "user", "content": "Explain merge sort"}],
)

forge-cloud handles thinking mode routing, parameter swaps, and backend normalization transparently.
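
Because responses come back instrumented, you can split reasoning cost from answer cost on the client side. The snippet below assumes the split is surfaced through the OpenAI-style completion_tokens_details.reasoning_tokens field; whether forge-cloud populates that field is an assumption here, so check the actual response shape your deployment returns:

# Hypothetical: inspect the instrumented usage split. Field availability
# is an assumption, not a documented forge-cloud schema.
usage = response.usage
print("total completion tokens:", usage.completion_tokens)
details = getattr(usage, "completion_tokens_details", None)
if details is not None:
    # Thinking/reasoning tokens vs. visible response tokens.
    print("thinking tokens:", getattr(details, "reasoning_tokens", "n/a"))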

What it includes

  • Complexity-based routing (think vs. no-think)
  • API key management and rate limiting
  • SQLite-backed usage tracking (see the sketch after this list)
  • CI/CD workflows
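
For a sense of what SQLite-backed usage tracking involves, here is a minimal sketch of one way such a log could be structured. The table layout, column names, and record_usage helper are assumptions for illustration, not forge-cloud's actual schema:

# Illustrative sketch only: a per-request usage log in SQLite.
# Schema and helper names are assumptions, not forge-cloud's code.
import sqlite3
import time

conn = sqlite3.connect("usage.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS usage (
           api_key TEXT,
           model TEXT,
           prompt_tokens INTEGER,
           completion_tokens INTEGER,
           ts REAL
       )"""
)

def record_usage(api_key: str, model: str,
                 prompt_tokens: int, completion_tokens: int) -> None:
    # Append one row per completed request; aggregate later for
    # rate limiting or billing.
    conn.execute(
        "INSERT INTO usage VALUES (?, ?, ?, ?, ?)",
        (api_key, model, prompt_tokens, completion_tokens, time.time()),
    )
    conn.commit()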

Status

forge-cloud is built and tested locally but not yet deployed as a hosted service. The proxy code is open source; a hosted version at a public endpoint is planned if there is sufficient community demand. See the roadmap.