Cloud layer: forge-cloud
An OpenAI-compatible, reasoning-aware inference proxy. Point your existing OpenAI client at forge-cloud instead of directly at vLLM/Ollama; it auto-routes thinking mode, manages the context budget, tunes MTP (multi-token prediction), and blocks known-broken configurations.
How it works
- You send a standard OpenAI chat completion request to forge-cloud
- forge-cloud inspects the request: query complexity, model target, backend
- The session layer (qwen-think) decides: thinking mode on/off, sampling parameters, context budget
- The optimize layer (qwen3.6-mtp) checks: is MTP appropriate? Any blocked combinations?
- The request is forwarded to your backend (vLLM, SGLang, Ollama, DashScope)
- The response comes back instrumented (thinking tokens vs. response tokens, budget consumption); see the sketch after this list
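A minimal sketch of that pipeline in Python. Every name in it (route_request, estimate_complexity, the 0.5 threshold, the blocklist entries, the extra_body fields) is illustrative, not forge-cloud's actual internals:

# Conceptual sketch of the forge-cloud request pipeline. All names,
# thresholds, and fields are illustrative; see the source for the real logic.

BLOCKED_COMBOS = {
    # (backend, feature) pairs known to misbehave; entries are hypothetical
    ("ollama", "mtp"),
}

def estimate_complexity(messages: list[dict]) -> float:
    """Crude stand-in for complexity scoring: longer, multi-turn prompts score higher."""
    text = " ".join(str(m.get("content", "")) for m in messages)
    return min(1.0, len(text) / 2000 + 0.1 * len(messages))

def route_request(request: dict, backend: str) -> dict:
    complexity = estimate_complexity(request["messages"])

    # Session layer: thinking mode only for complex queries, with matching sampling params.
    thinking = complexity > 0.5
    request["temperature"] = 0.6 if thinking else 0.7

    # Optimize layer: gate MTP and reject known-broken backend/feature combinations.
    use_mtp = (backend, "mtp") not in BLOCKED_COMBOS

    request["extra_body"] = {"enable_thinking": thinking, "enable_mtp": use_mtp}
    return request  # ready to forward to vLLM / SGLang / Ollama / DashScope

The point of the layering is that both decisions happen before the request ever reaches the backend, so broken combinations are rejected up front.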
Install
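The package name below is an assumption (the CLI is forge-cloud, so a PyPI package of the same name is the likely route); check the repository for the canonical install step:

pip install forge-cloud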
Usage
export FORGE_ADMIN_KEY=my-secret                 # admin secret for forge-cloud itself
export FORGE_BACKEND_URL=http://localhost:8000   # backend to forward to (vLLM, SGLang, Ollama, DashScope)
forge-cloud                                      # listens on port 8741
Then point your OpenAI client at it:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8741/v1", api_key="your-key")
response = client.chat.completions.create(
    model="Qwen/Qwen3.6-27B",
    messages=[{"role": "user", "content": "Explain merge sort"}],
)
print(response.choices[0].message.content)
forge-cloud handles thinking mode routing, parameter swaps, and backend normalization transparently.
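To see the instrumentation, inspect the usage block on the response. The sketch below assumes forge-cloud mirrors OpenAI's convention of reporting reasoning tokens under completion_tokens_details; the exact field names are an assumption, not documented forge-cloud behavior:

# Assumption: forge-cloud follows OpenAI's usage-reporting convention,
# where thinking tokens appear as completion_tokens_details.reasoning_tokens.
usage = response.usage
details = getattr(usage, "completion_tokens_details", None)
thinking = (getattr(details, "reasoning_tokens", None) or 0) if details else 0
print(f"thinking tokens: {thinking}")
print(f"response tokens: {usage.completion_tokens - thinking}")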
What it includes
- Complexity-based routing (think vs. no-think)
- API key management and rate limiting
- SQLite-backed usage tracking (see the sketch after this list)
- CI/CD workflows
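As a flavor of the usage tracking, here is a minimal sketch of what a SQLite-backed ledger could look like. The table name, columns, and sample values are illustrative, not forge-cloud's actual schema:

import sqlite3

# Illustrative schema only; forge-cloud's actual tables may differ.
conn = sqlite3.connect("usage.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS usage (
        api_key          TEXT    NOT NULL,
        model            TEXT    NOT NULL,
        thinking_tokens  INTEGER NOT NULL DEFAULT 0,
        response_tokens  INTEGER NOT NULL DEFAULT 0,
        created_at       TEXT    NOT NULL DEFAULT (datetime('now'))
    )
""")

# Record one request's token split.
conn.execute(
    "INSERT INTO usage (api_key, model, thinking_tokens, response_tokens) VALUES (?, ?, ?, ?)",
    ("your-key", "Qwen/Qwen3.6-27B", 512, 256),
)
conn.commit()

# Per-key totals are what rate limiting and billing can hang off.
for api_key, total in conn.execute(
    "SELECT api_key, SUM(thinking_tokens + response_tokens) FROM usage GROUP BY api_key"
):
    print(api_key, total)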
Status
forge-cloud is built and tested locally but is not yet deployed as a hosted service. The proxy code is open source. A hosted version at a public endpoint is planned if there is sufficient community demand; see the roadmap.