Forge
Framework for Orchestrating Reasoning in Generative Engines
Early alpha -- everything here is a work in progress
This entire project is in very early alpha. The libraries are functional but APIs will change, docs are incomplete, and nothing is production-ready. If you're here, you're early. Expect rough edges.
Inference control plane for reasoning-aware open-source models.
The three dominant self-hosting tools -- vLLM, SGLang, Ollama -- are pipes. They move tokens between your application and the inference engine, but none of them have a layer that understands thinking modes, hybrid KV memory profiles, or speculative decoding tradeoffs. Forge builds that layer.
Qwen3.6 is the starting point because it makes these problems explicit (thinking toggling, MTP, linear attention context budgets), but the same patterns show up in Mistral Small 4, GLM-5.1, and DeepSeek V4.
Architecture
Each layer has a clean contract with the one below it. ForgeEngine is the integration point that chains session + MTP + context into one configure-once object. The product layers (cloud proxy, dashboard, observe) sit on top. Apps (forge-studio, Open WebUI plugins) are the user-facing front doors.
Install
This pulls in qwen-think, qwen3.6-mtp, and qwen3-repo as dependencies.
Optional extras for the product layers:
pip install forge-infer[observe] # + forge-observe (OTel instrumentation)
pip install forge-infer[cloud] # + forge-infer-cloud (proxy)
pip install forge-infer[dashboard] # + forge-dashboard (observability UI)
pip install forge-infer[openai] # + openai SDK (for ForgeEngine)
pip install forge-infer[all] # everything
Individual packages are also installable standalone:
pip install qwen-think # session management
pip install qwen3.6-mtp # MTP speculative decoding tuner
pip install qwen3-repo # repo-to-context ingestion
Quick start
ForgeEngine (recommended)
The integration layer that chains session + MTP + context into one object:
from forge import ForgeEngine
engine = ForgeEngine(
model="Qwen3.6-27B",
base_url="http://localhost:8000/v1",
gpu_id="rtx-4090",
)
engine.ingest_local("/path/to/project")
response = engine.chat("Why is the auth middleware rejecting valid tokens?")
print(engine.status())
Individual layers
from forge.session import ThinkingSession
session = ThinkingSession(model="Qwen/Qwen3.6-27B")
response = session.chat("Explain merge sort", thinking=True)
from forge.mtp import recommend, UseCase, Objective
rec = recommend(
use_case=UseCase.SINGLE_USER,
objective=Objective.MINIMIZE_LATENCY,
gpu_id="rtx-4090",
)
print(rec.enable, rec.expected_gain)
See Getting started for more examples and the full layer-by-layer walkthrough.
The suite
Core libraries (Phase 1)
| Package | PyPI | What it does |
|---|---|---|
| forge-infer | pip install forge-infer |
Metapackage + ForgeEngine integration layer |
| qwen-think | pip install qwen-think |
Thinking-mode session control, backend normalization, context budget |
| qwen3.6-mtp | pip install qwen3.6-mtp |
MTP speculative decoding tuner, crossover analysis, config generation |
| qwen3-repo | pip install qwen3-repo |
Dependency-ordered repo ingestion for linear attention models |
| qwen-compat | -- | Compatibility test matrix and upstream bug fixes |
Product layers (Phase 2)
| Package | PyPI | What it does |
|---|---|---|
| forge-observe | pip install forge-observe |
Reasoning-aware OTel instrumentation |
| forge-cloud | pip install forge-infer-cloud |
OpenAI-compatible reasoning-aware inference proxy |
| forge-dashboard | pip install forge-dashboard |
Observability backend + web UI |
| forge-studio | -- | Power-user web app (FastAPI + React) |
| Open WebUI plugins | -- | Auto-route, budget tracker, repo ingest, OTel enrichment for Open WebUI |
License
Apache 2.0