Inference Cost Calculator

📦 Model

📏 Context Length

128K tokens

📝 Query Tokens

500 tokens

🧠 Effort / Thinking

Level 5 (Balanced)

⚡ Enable Sparse Attention

Off (Dense Mode)

💾 Dense Attention Cost

Input Cost $0.384

Output Cost $0.007

Thinking Cost $0.000

Total Cost $0.391

Est. Time 15.2s

⚡ Sparse Attention Cost

Input Cost $0.154

Output Cost $0.003

Thinking Cost $0.000

Total Cost $0.157

Est. Time 4.3s
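The figures in the two panels above can be reproduced with simple per-token arithmetic. The sketch below is one way to do that, not the calculator's actual code. The prices ($3 and $15 per million tokens), the 40% sparse cost factor, and the reading of "128K" as 128,000 tokens are assumptions inferred from the displayed values; the source does not state them. The function name `request_cost` is hypothetical.

```python
# Assumed values, inferred from the panel figures; the source does not state them.
INPUT_USD_PER_MTOK = 3.00    # assumption: USD per million input tokens
OUTPUT_USD_PER_MTOK = 15.00  # assumption: USD per million output/thinking tokens
SPARSE_COST_FACTOR = 0.40    # assumption: sparse mode is billed at 40% of dense


def request_cost(context_tokens, output_tokens, thinking_tokens=0, sparse=False):
    """Return the per-request cost breakdown in USD."""
    factor = SPARSE_COST_FACTOR if sparse else 1.0
    input_cost = context_tokens * INPUT_USD_PER_MTOK / 1_000_000 * factor
    output_cost = output_tokens * OUTPUT_USD_PER_MTOK / 1_000_000 * factor
    thinking_cost = thinking_tokens * OUTPUT_USD_PER_MTOK / 1_000_000 * factor
    return {
        "input": input_cost,
        "output": output_cost,
        "thinking": thinking_cost,
        "total": input_cost + output_cost + thinking_cost,
    }


# 128K context (taken as 128,000 tokens) and 500 generated tokens.
dense = request_cost(128_000, 500)
sparse = request_cost(128_000, 500, sparse=True)
saving = 1 - sparse["total"] / dense["total"]
print(f"dense ${dense['total']:.4f}, sparse ${sparse['total']:.4f}, saving {saving:.0%}")
```

Under these assumptions the totals come out at about $0.3915 (dense) and $0.1566 (sparse), a saving of roughly 60%, in line with the panels. The estimated times, 15.2s versus 4.3s, correspond to a speedup of about 3.5x.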

💰 Cost Comparison Over Context Length

Model Pricing Strategy

Input tokens are cheaper than output tokens: the context is read once, while output has to be generated token by token. Claude 4.5 prices input to output at a 1:5 ratio, GPT-5.1 at 1:3. DeepSeek prices them almost equally (training is the cost-intensive part).
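To see what these ratios mean for a long-context request, costs can be expressed in units of the input price per token, so that no absolute price has to be assumed. This is a minimal sketch; the function names and the example token counts (128,000 input, 500 output) are illustrative choices, not part of the source.

```python
def cost_in_input_units(input_tokens, output_tokens, output_ratio):
    """Cost measured in multiples of the price of one input token.

    output_ratio is how many times more an output token costs than an
    input token (5 for a 1:5 ratio, 3 for 1:3).
    """
    return input_tokens + output_ratio * output_tokens


def output_share(input_tokens, output_tokens, output_ratio):
    """Fraction of the request cost that comes from output tokens."""
    total = cost_in_input_units(input_tokens, output_tokens, output_ratio)
    return output_ratio * output_tokens / total


for ratio in (5, 3):
    share = output_share(128_000, 500, ratio)
    print(f"1:{ratio} ratio -> output is {share:.1%} of the request cost")
```

With a 128K context and a short answer, the input side dominates the bill even at a 1:5 ratio, which is why long-context optimizations target input cost first.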

Sparse Attention Saves 60%

Sparse attention cuts KV-cache compute by 60% but adds overhead for the Lightning Indexer. Net effect: 40-60% cost savings on long contexts (256K+).
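The relationship between the gross reduction and the net saving can be sketched as a simple subtraction. The additive model and the idea of expressing the indexer overhead as a fraction of the dense cost are assumptions made for illustration; the source gives only the 60% reduction and the 40-60% net range.

```python
def net_saving(compute_reduction=0.60, indexer_overhead=0.0):
    """Net saving as a fraction of dense cost.

    Assumption: the indexer overhead is a fraction of the dense cost and
    simply subtracts from the gross reduction.
    """
    if not 0.0 <= indexer_overhead <= compute_reduction:
        raise ValueError("overhead must lie between 0 and the gross reduction")
    return compute_reduction - indexer_overhead


# An overhead between 0% and 20% of dense cost spans the quoted 40-60% range.
for overhead in (0.0, 0.10, 0.20):
    print(f"overhead {overhead:.0%} -> net saving {net_saving(0.60, overhead):.0%}")
```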

Long Context Trade-offs

Below 32K: sparse attention is not worthwhile (the overhead dominates). 32K to 256K: sparse wins. Above 256K: sparse is a must, because dense becomes prohibitively expensive.
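These thresholds translate directly into a small routing rule. The sketch below is illustrative: the function name, the returned labels, and the reading of "K" as 1,000 tokens are assumptions, and how the exact boundary values are assigned is a choice the source leaves open.

```python
K = 1_000  # assumption: "32K" is read as 32,000 tokens


def choose_attention_mode(context_tokens):
    """Pick an attention mode from the context length, per the thresholds above."""
    if context_tokens < 0:
        raise ValueError("context_tokens must be non-negative")
    if context_tokens < 32 * K:
        return "dense"            # indexer overhead outweighs the saving
    if context_tokens <= 256 * K:
        return "sparse"           # sparse is cheaper
    return "sparse-required"      # dense is prohibitively expensive


for n in (8 * K, 128 * K, 512 * K):
    print(n, choose_attention_mode(n))
```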

Effort / Thinking Budget

The effort parameter scales the number of thinking tokens: effort 1 = 100 tokens, effort 10 = 1,000 tokens. The cost relationship is linear. Per level: +10% cost, +5% quality.
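The linear mapping from effort level to thinking tokens can be written out as follows. The 100-tokens-per-level step follows from the two data points above; billing thinking tokens at the output rate and the $15 per million token price are assumptions for illustration only.

```python
TOKENS_PER_EFFORT_LEVEL = 100  # from the text: effort 1 = 100, effort 10 = 1,000


def thinking_tokens(effort):
    """Thinking-token budget for an effort level between 1 and 10."""
    if not 1 <= effort <= 10:
        raise ValueError("effort must be between 1 and 10")
    return TOKENS_PER_EFFORT_LEVEL * effort


def thinking_cost(effort, output_usd_per_mtok=15.00):
    """Thinking cost in USD.

    Assumption: thinking tokens are billed at the output rate; the default
    price is a placeholder, not a value from the source.
    """
    return thinking_tokens(effort) * output_usd_per_mtok / 1_000_000


for level in (1, 5, 10):
    print(level, thinking_tokens(level), f"${thinking_cost(level):.4f}")
```

Because the token budget is linear in the effort level, doubling the level doubles the thinking cost under this model.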

Cost per Quality Metric

DeepSeek-V3.2: cheaper, but weaker reasoning. Claude 4.5: more expensive, but better effort control. GPT-5.1: adaptive (decides automatically how much to think). The return on investment depends on the task.

Production Optimization

Smart caching: request 1 pays the full cost; request 2 (same context) pays only for output tokens. Hybrid: sparse + dense (rerank the top 10). Prompt caching: -50% input cost.
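The effect of the two caching schemes on a multi-request session can be compared with a short calculation. This is a sketch under stated assumptions: the function name and the `repeat_input_factor` parameter are hypothetical, the per-request costs are the dense figures from the calculator panel, and every request after the first is assumed to reuse the same context.

```python
def session_cost(n_requests, input_cost, output_cost, repeat_input_factor):
    """Total cost of n requests that share one context.

    repeat_input_factor is the share of the input cost paid on requests
    after the first: 1.0 = no caching, 0.5 = prompt caching (-50% input
    cost), 0.0 = smart caching (only output tokens are paid).
    """
    if n_requests < 1:
        raise ValueError("n_requests must be at least 1")
    first = input_cost + output_cost
    repeat = input_cost * repeat_input_factor + output_cost
    return first + (n_requests - 1) * repeat


# Dense per-request costs from the panel: $0.384 input, $0.007 output.
for label, factor in (("no caching", 1.0), ("prompt caching", 0.5), ("smart caching", 0.0)):
    print(f"{label}: ${session_cost(10, 0.384, 0.007, factor):.3f}")
```

The first request always pays full price, so the benefit of caching grows with the number of requests that reuse the context.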

Learning Objectives

Context: Where Are We?

Why It Matters

Key Takeaways