Should we fine-tune an LLM or use prompt engineering + RAG?

Default to prompt engineering + RAG for the first 80% of LLM workloads. Fine-tuning adds engineering cost (data curation, annotation, training infrastructure, eval harnesses, ongoing model maintenance) that often isn’t justified when prompt + RAG gets you within 5 quality points. Fine-tuning IS the right answer when: (a) you need consistent brand voice or response format that prompts can’t reliably enforce; (b) latency or cost requires a smaller model, and a fine-tuned small model beats a prompt-engineered large model; (c) your domain vocabulary or reasoning patterns are sufficiently different from base-model training data that retrieval alone doesn’t close the gap. Most fine-tuning programs we’re called in to fix should have stayed at prompt + RAG.

Which model should we use: GPT-5, Claude Sonnet 4.6, Gemini 2.5, or open-source?

Depends on the workload profile. GPT-5 and Claude Sonnet 4.6 offer the strongest reasoning and tool-use reliability; use them for steps where decisions matter and your data can be processed in the cloud. Claude Opus 4.6 wins on long-context analysis (1M tokens) and complex multi-step reasoning. Gemini 2.5 Flash wins on cost for high-volume cheap calls. Open-source (Llama 3.3, Qwen 3, Mistral, DeepSeek) wins when you need on-premises deployment for sovereignty, when transaction volume justifies a smaller fine-tuned model, or when unit cost is a hard constraint. Most production LLM stacks we ship use a hybrid with smart routing across multiple models, not single-model standardisation.

What's the difference between LoRA, QLoRA, and full fine-tuning?

LoRA (Low-Rank Adaptation) trains small adapter matrices on top of frozen base-model weights, typically 0.1 to 1% of full-fine-tune parameters. It is fast and cheap, and supports multi-LoRA serving on a single GPU. Right answer for most fine-tuning needs. QLoRA (Quantised LoRA) loads the base model in 4-bit precision while training the LoRA adapters, which lets you fine-tune 70B+ models on a single GPU. Right answer when you need bigger models on smaller hardware. Full fine-tuning updates all model weights: best quality lift, dramatically higher cost (10 to 100× LoRA), single model per deployment (no multi-LoRA serving). Right answer when LoRA’s quality lift isn’t enough and you have the data + budget.
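To make the parameter saving concrete, here is a minimal arithmetic sketch for a single weight matrix. The layer shape (4096 × 4096) and rank (8) are assumptions picked for illustration, not figures from any specific model, and the whole-model fraction depends on which layers receive adapters.

```python
def lora_param_count(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    LoRA learns two low-rank factors, B (d_out x rank) and A (rank x d_in),
    while the original matrix stays frozen.
    """
    return rank * (d_out + d_in)


# Hypothetical layer shape and rank, chosen only for illustration.
d_out, d_in, rank = 4096, 4096, 8

full = d_out * d_in                         # parameters updated by full fine-tuning
lora = lora_param_count(d_out, d_in, rank)  # parameters trained by LoRA
fraction = lora / full

print(f"full: {full:,}  lora: {lora:,}  fraction: {fraction:.2%}")
# prints: full: 16,777,216  lora: 65,536  fraction: 0.39%
```

For this one matrix the adapter is about 0.4% of the frozen weights, which sits inside the 0.1 to 1% range quoted above.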

How much does it cost to fine-tune an LLM?

Discovery and architecture sprints: $25k-$50k. LoRA / QLoRA fine-tuning engagement (data curation + annotation + training + eval + production deployment): $80k-$300k depending on model size, data complexity, and required eval rigour. Full fine-tuning of frontier-class models: $200k-$1M+ depending on model. Domain-specific continued pre-training: $300k-$1.5M depending on corpus size. Ongoing serving cost depends on volume and model: open-source on vLLM typically lands at $0.0001-$0.001 per 1K tokens; OpenAI GPT-5 fine-tunes typically run $0.005-$0.02 per 1K tokens. Most fine-tuning programs we’ve shipped pay back within 6 to 14 months on cost-per-call savings vs an always-GPT-5 baseline.
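The payback claim is simple arithmetic, sketched below. The function name, the upfront cost, and the monthly token volume are all hypothetical; the per-1K prices are picked from inside the ranges quoted above, and treating the higher range as the baseline price is an assumption.

```python
def payback_months(upfront_cost: float, monthly_tokens: float,
                   baseline_per_1k: float, tuned_per_1k: float) -> float:
    """Months until per-call savings cover the upfront engagement cost."""
    monthly_saving = monthly_tokens / 1000 * (baseline_per_1k - tuned_per_1k)
    if monthly_saving <= 0:
        raise ValueError("fine-tuned serving is not cheaper than the baseline")
    return upfront_cost / monthly_saving


months = payback_months(
    upfront_cost=150_000,   # hypothetical, inside the $80k-$300k range
    monthly_tokens=2e9,     # hypothetical volume: 2B tokens per month
    baseline_per_1k=0.01,   # assumed baseline price, inside $0.005-$0.02
    tuned_per_1k=0.0005,    # assumed self-hosted price, inside $0.0001-$0.001
)
print(f"payback: {months:.1f} months")
```

With these inputs the monthly saving is about $19,000, so payback lands at roughly 8 months. Halving the volume doubles the payback period, which is why volume belongs in the workload profile before any fine-tuning decision.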

Can we run LLMs on-premises or in air-gapped environments?

Yes. For sovereignty, regulatory, or air-gapped requirements we deploy open-source models (Llama 3.3, Qwen 3, Mistral, DeepSeek) on customer infrastructure using vLLM, Ollama, NVIDIA NIM, or Triton Inference Server. Healthcare PHI workloads, banking SR 11-7 model risk requirements, and defence / government / public sector deployments all routinely require this. Hybrid deployments (on-prem for sensitive, cloud frontier for non-sensitive) are increasingly common; we design these from the workload constraints, not vendor preference.

What's an LLM eval harness and why does every production system need one?

An eval harness is a curated set of test cases (golden datasets), automated metrics, and acceptance thresholds that every prompt change, fine-tune, or model swap must pass before reaching production. Without it, LLM quality silently regresses as prompts and models evolve, and you don’t notice until users complain or business metrics drop. We treat the eval harness as a first-class production deliverable: built day one, expanded over time, and integrated into CI/CD as a deploy gate. It is the single highest-ROI engineering practice in LLM development. Most production-quality LLM systems we ship have 200 to 2,000 golden test cases by month six.

How do you optimise LLM costs at scale?

Five levers. (1) Smart routing: frontier models for nuanced reasoning, fine-tuned open-source for high-volume deterministic tasks; classifier picks per request. (2) Reasoning-vs-execution split: reasoning models for planning, fast cheap models for execution. (3) Structured outputs: eliminates the “model wrote prose when we needed JSON” retry loop. (4) Multi-LoRA serving: one base model serves many customer-specific adapters from one GPU. (5) Speculative decoding: draft model proposes tokens, larger model verifies; 2 to 4× throughput gain. Combined, these typically cut LLM bills 60 to 80% with no quality regression. Most production LLM stacks burn 5 to 10× more than they should because no one designed for cost economics from the start.

How long does it take to ship a production LLM system?

A focused prompt-engineered + RAG production LLM application: 4 to 8 weeks. A LoRA / QLoRA fine-tuning engagement to production deployment: 6 to 12 weeks (most of that is data curation + eval harness, not training). Full LLM platform build (multi-model serving + routing + fine-tuning pipeline + eval + observability): 3 to 6 months. Domain-specific continued pre-training: 6 to 12 weeks training + 4 to 6 weeks eval and deployment. Fastest credible timeline to first measurable business outcome on a focused workload: 4 to 5 weeks if data is clean and use case is bounded.

How does LLM development relate to AI agent development and RAG?

They form a stack. LLM development (this page) is the model layer: selection, fine-tuning, serving, evaluation. RAG development services is the retrieval-grounding layer: pipelines that feed your enterprise knowledge into model context so reasoning is grounded in real documents. AI agent development is the autonomous-agent architecture: systems that plan, take actions, call tools, and traverse enterprise systems built on LLM foundations. Most real programs need all three. We typically lead with model + serving architecture, layer in RAG for grounding, then build agent orchestration on top.

How do you handle prompt injection and other LLM security threats?

Defence-in-depth across input and output. (1) Input guardrails: Llama Guard, OpenAI Moderation, and custom classifiers detect prompt injection patterns, jailbreak attempts, and PII before they reach the model. (2) Structured prompts: clear separation between system instructions and user input prevents many injection patterns. (3) Output guardrails: toxicity detection, regulated-content filtering, and hallucination flagging on model output before it reaches users or downstream systems. (4) Tool-call sandboxing: agents that execute code or call sensitive APIs run in sandboxed environments with limited blast radius. (5) Audit logging: every prompt + response logged for incident response and forensic review. (6) Eval coverage: adversarial test cases included in the eval harness to catch new injection patterns as they emerge. Built on Llama Guard, OpenAI Moderation API, NVIDIA NeMo Guardrails, Lakera Guard, and custom classifier guardrails per workload.

LLM Development | Fine-Tuning, Hosting & Apps

LLM Development Services We Deliver

Custom LLM Fine-Tuning

Domain-adapt frontier and open-source LLMs to your data: legal, medical, financial, technical vocabulary; brand voice; internal-policy reasoning. We use parameter-efficient fine-tuning (LoRA, QLoRA, IA3) for cost-effective adaptation and full fine-tuning when the domain shift demands it. Typical outcome: 15 to 35 point quality lift on domain-specific tasks vs base models at <5% of full-fine-tune cost.

Domain-Specific LLM Pre-Training

For organisations with massive proprietary corpora (10B+ tokens of clean domain text), we lead continued pre-training on a base model (typically Llama 3.3 8B / 70B or Qwen 3) to bake domain understanding into the weights themselves. Used by clients in pharma, financial services, and legal where retrieval alone isn’t enough.

Preference Optimisation (DPO / RLHF / RLAIF)

Align fine-tuned models to human preferences using Direct Preference Optimisation (DPO), Reinforcement Learning from Human Feedback (RLHF), or Reinforcement Learning from AI Feedback (RLAIF). Used to align models with brand voice, regulatory tone, or specific decision-making patterns. Critical for high-stakes generative use cases.

LLM Serving Infrastructure

vLLM for high-throughput open-source LLM serving (continuous batching, paged attention, multi-LoRA serving). NVIDIA Triton + TensorRT-LLM for GPU-optimised production deployments. NVIDIA NIM for enterprise-grade microservice serving. SGLang for structured generation. We design serving infrastructure for your latency budget, throughput requirement, and cost constraints.

Model Selection & Routing

For multi-LLM stacks (the production default in 2026), we build smart routing layers that pick the right model per request based on complexity, latency budget, sensitivity classification, and cost. GPT-5 for nuanced reasoning, Claude Sonnet 4.6 for long-context analysis, Gemini 2.5 Flash for high-volume cheap calls, fine-tuned open-source for domain-specific deterministic tasks. Smart routing typically delivers 5 to 15× cost advantage versus single-model architectures.

Evaluation Harnesses

Every LLM we ship is paired with a custom eval harness: golden datasets, automated metric tracking, human-in-the-loop validation panels, regression detection. Built on OpenAI Evals, Anthropic eval tooling, LangSmith, Braintrust, or custom frameworks. Without rigorous evaluation, LLM quality silently regresses as prompts and models evolve. We treat the eval harness as a first-class production deliverable, not an afterthought.

Prompt Engineering Frameworks

Production prompt engineering goes far beyond writing clever instructions. We build prompt templates with structured outputs (JSON schema validation), few-shot example selection patterns, chain-of-thought scaffolding, tool-use protocols, and meta-prompting layers. Codified in versioned prompt registries with eval-gated deployment.

LLM Observability & Operations

Production LLM stacks need traces (every prompt, response, latency, cost, model version), drift detection (output quality regression alerts), token-budget tracking per tenant / workflow / user, and incident response for prompt injection or output-quality incidents. Built on LangSmith, Langfuse, Helicone, Arize Phoenix, OpenTelemetry.

On-Premises & Sovereign LLM Deployments

For data-sovereignty, regulated, or air-gapped environments, we deploy open-source LLMs (Llama 3.3, Qwen 3, Mistral, DeepSeek) on customer infrastructure using vLLM / Ollama / Triton / NVIDIA NIM. AWS GovCloud, Azure Government, India MeitY-empanelled regions, customer-owned datacenters, and hybrid-cloud topologies are all supported.

2026 LLM Engineering Patterns We Implement

Reasoning Models for Planning, Fast Models for Execution

OpenAI o-series or Claude Sonnet 4.6 with extended thinking for the hard reasoning step. GPT-4.1 / Gemini 2.5 Flash for fast tool-calls and execution. Splitting these layers delivers 5 to 15× cost advantage vs always-using-reasoning-models architectures while retaining the reasoning quality where it matters.

Smart Routing Across the Frontier

Production LLM stacks route across multiple models per request. Classifier picks: “this request is high-stakes with regulatory implications → GPT-5; this is high-volume Tier-1 support summarisation → Llama 3.3 fine-tune; this is creative drafting → Claude Opus 4.6.” Routing decisions are evaluated and refined over time.
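A rule-based router is the simplest form of this pattern. The sketch below is illustrative only: the `Request` fields, task labels, and model identifiers are placeholders (not real API model names), and a production router would usually replace the hard-coded flags with a trained classifier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    text: str
    high_stakes: bool = False   # e.g. regulatory implications
    sensitive: bool = False     # data that must stay on-premises
    task: str = "general"       # hypothetical label from an upstream classifier


# Placeholder identifiers; swap in whatever your providers actually expose.
ROUTES = {
    "on_prem": "llama-3.3-finetune",
    "high_stakes": "gpt-5",
    "creative": "claude-opus-4.6",
    "bulk": "llama-3.3-finetune",
    "default": "gemini-2.5-flash",
}


def route(req: Request) -> str:
    """Pick a model tier; rules are checked from most to least restrictive."""
    if req.sensitive:
        return ROUTES["on_prem"]      # sensitivity overrides everything else
    if req.high_stakes:
        return ROUTES["high_stakes"]
    if req.task == "creative_drafting":
        return ROUTES["creative"]
    if req.task == "support_summarisation":
        return ROUTES["bulk"]
    return ROUTES["default"]


print(route(Request("Summarise this ticket", task="support_summarisation")))
```

Checking sensitivity first is a deliberate choice in this sketch: a high-stakes request that also carries restricted data stays on the self-hosted model rather than going to a cloud API.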

Structured Outputs Everywhere

Every model output is constrained to a JSON schema using OpenAI’s structured outputs feature, Anthropic’s tool-use JSON validation, Outlines, or external validators (Pydantic, Zod). Eliminates the “model returned malformed JSON” failure mode. Required for any LLM output that feeds a downstream system.
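As a minimal sketch of the external-validator option, the function below checks raw model output against a tiny field-to-type schema using only the standard library. The schema and function name are hypothetical; in practice Pydantic or a JSON Schema validator would do this job with richer error reporting.

```python
import json

# Hypothetical schema: field name -> required Python type.
SCHEMA = {"sentiment": str, "score": float, "tags": list}


def parse_structured(raw: str, schema: dict = SCHEMA) -> dict:
    """Parse model output and check required fields and types.

    Raises ValueError so the caller can retry or fall back.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for field, expected in schema.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        value = data[field]
        # JSON integers are acceptable where a float is expected (bools are not).
        if expected is float and isinstance(value, int) and not isinstance(value, bool):
            value = float(value)
        if not isinstance(value, expected):
            raise ValueError(f"{field} should be {expected.__name__}")
        data[field] = value
    return data


good = parse_structured('{"sentiment": "positive", "score": 1, "tags": []}')
print(good)
```

A validator like this catches malformed output after the fact; provider-side structured outputs or constrained decoding reduce how often it fires in the first place.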

Long-Context Workflows

1M-token context windows (GPT-5, Claude Sonnet 4.6, Gemini 2.5 Pro) enable workflows that were previously impossible: entire contracts, multi-document research, full codebase analysis. We design retrieval + ranking patterns for context windows in the 100K-1M token range, plus context-summarisation strategies when even 1M isn’t enough.

Multi-LoRA Serving for Per-Customer Customisation

vLLM and SGLang now support efficient multi-LoRA serving: one base model serves many customer-specific LoRA adapters from a single GPU. This lets multi-tenant SaaS deliver per-tenant fine-tuned behaviour without provisioning a separate model per tenant. 50 to 100× cost reduction vs separate model deployments.

Speculative Decoding & Draft-Then-Verify

Production-mature in 2026. Smaller draft model proposes tokens; larger model verifies. Delivers 2 to 4× throughput improvement for compatible workloads with no quality regression. Available natively in vLLM, TensorRT-LLM, and NVIDIA NIM.
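The draft-then-verify loop can be sketched with toy models. This is the greedy variant only, under stated assumptions: the "models" are hypothetical deterministic functions over integer tokens, the probabilistic acceptance rule and the extra bonus token used by real implementations are omitted, and the verification calls here run one at a time, whereas a real server scores all drafted positions in a single batched forward pass (which is where the throughput gain comes from).

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def greedy_decode(model: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Reference decoding: ask the model for one token at a time."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(model(seq))
    return seq


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], n_new: int, k: int = 4) -> List[int]:
    """Greedy draft-then-verify; output matches greedy decoding with `target`."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    while len(seq) < goal:
        # 1. Draft model cheaply proposes up to k tokens.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(seq))):
            proposal.append(draft(seq + proposal))
        # 2. Target verifies each drafted position in order.
        for tok in proposal:
            expected = target(seq)
            seq.append(expected)      # always keep the target's own choice
            if expected != tok:
                break                 # first mismatch: discard the rest of the draft
    return seq


# Toy stand-ins for a large and a small model (hypothetical).
def target_model(seq: List[int]) -> int:
    return (seq[-1] * 3 + 1) % 7


def draft_model(seq: List[int]) -> int:
    nxt = (seq[-1] * 3 + 1) % 7
    return nxt if len(seq) % 3 else (nxt + 1) % 7   # deliberately wrong sometimes


out = speculative_decode(target_model, draft_model, [2], n_new=10)
print(out == greedy_decode(target_model, [2], 10))
```

Because every kept token is the target model's own choice, the result is identical to plain greedy decoding with the target, which is the sense in which the technique carries no quality regression.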

Eval-Gated Deployment

No prompt change, fine-tune, or model swap reaches production without passing the eval harness. Evals run on every PR, on every nightly cron, on every model version bump. Quality gates block deploys automatically. This is the single most important LLM-ops practice.
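A deploy gate of this kind can be very small. The sketch below assumes an exact-match metric and a four-case golden set purely for illustration; the case contents, the 0.9 threshold, and `candidate_v2` (a stand-in for a prompt + model combination) are all hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical golden set: each case pairs an input with the expected answer.
GOLDEN_SET: List[Dict[str, str]] = [
    {"input": "2+2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
    {"input": "opposite of hot", "expected": "cold"},
    {"input": "3*3", "expected": "9"},
]


def pass_rate(candidate: Callable[[str], str], cases: List[Dict[str, str]]) -> float:
    """Fraction of golden cases where the candidate's output matches exactly."""
    passed = sum(candidate(c["input"]).strip() == c["expected"] for c in cases)
    return passed / len(cases)


def gate(candidate: Callable[[str], str], cases: List[Dict[str, str]],
         threshold: float = 0.9) -> bool:
    """Return True only if the candidate clears the acceptance threshold."""
    return pass_rate(candidate, cases) >= threshold


def candidate_v2(text: str) -> str:
    """Stand-in for the prompt + model combination under test."""
    answers = {"2+2": "4", "capital of France": "Paris", "opposite of hot": "cold"}
    return answers.get(text, "unknown")


print(pass_rate(candidate_v2, GOLDEN_SET), gate(candidate_v2, GOLDEN_SET))
# prints: 0.75 False
```

In a CI job, the script would exit non-zero when `gate` returns False so the pipeline blocks the deploy. Real harnesses typically swap exact match for task-specific metrics or graded rubrics.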

Related AI Capabilities That Compose With LLMs

Enterprise AI solutions

Broader AI program LLMs sit inside.

Generative AI development

Application layer powered by LLM engineering.

RAG development services

Retrieval grounding for LLM reasoning.

AI agent development

Autonomous-agent systems built on LLM foundations.

AI integration services

Wiring LLMs into enterprise systems.

AI automation services

Business-process automation powered by LLM reasoning.

AI & ML development services

Classical ML layer in hybrid architectures.

NLP development services

For classical NLP pipelines that complement LLMs.

Sentiment analysis solutions

LLM-powered sentiment alongside classical classifiers.

AI chatbot development services

Conversational lane built on fine-tuned LLMs.

AI app development services

Application surfaces that render LLM outputs.

AI recommendation engine

Personalisation use cases combining LLMs + classical ML.

AI consulting & strategy

Executive roadmaps for LLM programs.

Hire Our LLM Engineering Team

Need LLM expertise embedded in your own team? We staff senior LLM engineers with 3+ years of production fine-tuning, serving, and evaluation experience.

Hire AI developers

Full-stack AI engineers with LLM specialisation.

Hire OpenAI developers

For GPT-5, OpenAI fine-tuning API, and Assistants API specialists.

How We Engineer Production LLM Systems

Workload Profile & Constraints

Before model selection, we map the workload: request volume, peak QPS, latency budget, sensitivity classification, geography, regulatory posture, accuracy requirements, cost ceiling. These constraints, not the latest benchmark scores, determine model selection.

Model Selection & Benchmarking

We benchmark candidate models on YOUR data, not generic public benchmarks. Production-realistic test sets, your actual prompt patterns, your output validation criteria. Public benchmarks like MMLU, HellaSwag, GPQA correlate poorly with production performance for most enterprise workloads.

Fine-Tuning Decision & Approach

Fine-tune only when it’s the right answer. For most workloads, prompt engineering + RAG + few-shot examples gets you 80% of the lift at 5% of the engineering cost. When fine-tuning IS right, we choose the lightest-weight technique that delivers the lift: LoRA / QLoRA before full fine-tune, DPO before RLHF, supervised fine-tuning before reinforcement learning.

Data Curation & Annotation

Fine-tuning quality is bounded by data quality. We design annotation guidelines, train labellers, run inter-annotator agreement analysis (Cohen’s kappa, Krippendorff’s alpha), and build golden test sets before any model training begins. The first month of most fine-tuning programs is data work, not model work.
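For two annotators, Cohen's kappa compares observed agreement with the agreement expected by chance. The sketch below implements the standard formula from scratch; the label lists are made-up example data.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(a)
    # Observed agreement: share of items where both annotators agree.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each annotator's label frequencies.
    freq_a, freq_b = Counter(a), Counter(b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up labels from two hypothetical annotators.
ann_1 = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg", "pos", "pos"]
ann_2 = ["pos", "neg", "neg", "neg", "pos", "neg", "pos", "pos", "pos", "pos"]

kappa = cohens_kappa(ann_1, ann_2)
print(f"kappa = {kappa:.2f}")
```

Here the annotators agree on 8 of 10 items, but chance agreement is 0.52, so kappa comes out near 0.58. Raw agreement alone would overstate label quality, which is why the guidelines are revised before training when kappa is low.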

Training, Hyperparameter Tuning, Eval

We run training on AWS Trainium / SageMaker, GCP Vertex AI Training, Azure ML, or customer-owned GPU clusters (H100 / H200 / B200 where available). Hyperparameter search via Weights & Biases sweeps. Eval after every training run on the golden set; no model gets to production without passing the eval.

Serving Architecture

Pick the right serving stack for the model and workload. vLLM with continuous batching for OSS LLMs at high throughput. Triton + TensorRT-LLM for latency-critical workloads. NVIDIA NIM for enterprise teams that want a microservice abstraction. Bedrock / Vertex AI / Azure OpenAI for managed paths with no infrastructure burden.

Routing & Hybrid Architectures

Most production LLM stacks are hybrid: frontier model for nuanced reasoning, fine-tuned open-source for deterministic high-volume tasks, classical ML for the rest. Routing logic picks per request. Designed for cost economics that scale with volume.

Safety, Guardrails, Prompt-Injection Defence

Every production LLM ships with input guardrails (prompt-injection detection, PII redaction), output guardrails (toxicity, hallucination, regulated-content filtering), and audit trails. Built on Llama Guard, OpenAI Moderation, NVIDIA NeMo Guardrails, custom classifiers, and policy engines.

Observability, Cost Telemetry, Drift Monitoring

Token usage broken down per tenant / workflow / model. Latency / quality / cost trended over time. Drift alerts when model behaviour shifts outside historical bounds. Without observability, LLM costs balloon silently and quality regresses unnoticed.

Phased Rollout & Canary Deployments

New models or fine-tunes ship behind feature flags (LaunchDarkly, Unleash, Statsig). Canary on small traffic percentages. Compare against baseline on quality + cost + latency. Promote to full deployment only when canary metrics meet acceptance criteria.
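The promotion decision reduces to comparing canary metrics against the baseline with explicit tolerances. The sketch below is one way to encode that; the `Metrics` fields and the default tolerances are assumptions for illustration, not fixed acceptance criteria.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    quality: float          # eval score, higher is better
    cost_per_call: float    # USD, lower is better
    p95_latency_ms: float   # lower is better


def should_promote(baseline: Metrics, canary: Metrics,
                   max_quality_drop: float = 1.0,
                   max_cost_increase: float = 0.0,
                   max_latency_increase: float = 0.10) -> bool:
    """Promote only if quality, cost, and latency all stay within tolerance."""
    quality_ok = canary.quality >= baseline.quality - max_quality_drop
    cost_ok = canary.cost_per_call <= baseline.cost_per_call * (1 + max_cost_increase)
    latency_ok = canary.p95_latency_ms <= baseline.p95_latency_ms * (1 + max_latency_increase)
    return quality_ok and cost_ok and latency_ok


baseline = Metrics(quality=90.0, cost_per_call=0.020, p95_latency_ms=800)
canary = Metrics(quality=89.5, cost_per_call=0.006, p95_latency_ms=850)
print(should_promote(baseline, canary))
# prints: True
```

In this example the canary gives up half a quality point, costs less, and adds about 6% latency, so it clears all three checks. Tightening `max_quality_drop` to 0.0 would block it.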

Why Enterprises Choose ScalaCode for LLM Engineering

Production LLM Engineering Discipline

We treat LLMs as production systems with eval harnesses, observability, drift monitoring, and rollback paths, not as research projects. Most LLM programs we’re called in to fix were demos that got rushed into production without these foundations.

Vendor-Neutral by Design

We work with whichever model fits the workload, frontier or open-source, and decouple business logic from model specifics so swapping is a config change, not a rewrite. When a better model lands, you benefit without re-engineering.

Cost Engineering as a First-Class Concern

Production LLM stacks routinely burn 5 to 10× more than they should because no one designed for cost economics. Smart routing, structured outputs, multi-LoRA serving, speculative decoding, and disciplined prompt design typically cut LLM bills 60 to 80% with no quality regression.

Domain Adaptation Expertise

We’ve fine-tuned LLMs for healthcare, legal, financial services, and regulated industries, including the data-curation, annotation-quality, and evaluation rigour that off-the-shelf consultants skip. Our domain-adapted models routinely outperform off-the-shelf APIs by 15 to 35 points on domain tasks.

On-Premises & Sovereign Deployments

We’ve shipped LLM stacks to customer datacenters, AWS GovCloud, Azure Government, India MeitY-empanelled regions, and air-gapped environments. Open-source serving infrastructure (vLLM / Triton / NIM) is part of our default toolkit, not a bolt-on.

End-to-End Delivery

Model selection, fine-tuning, serving infrastructure, evaluation, observability, governance, and ongoing operations under one roof. No handoffs to a system integrator that loses context.

Industries Where We've Shipped LLM Systems

Financial Services & Banking

Fine-tuned LLMs for KYC reasoning, fraud investigation summarisation, regulatory-document analysis, internal-policy-Q&A copilots. Aligned with SR 11-7 model risk management. Often deployed with on-premises or sovereign-cloud serving for data-residency.

Healthcare & Life Sciences

Domain-pretrained models (BioGPT, Med-PaLM 2 fine-tunes, custom Llama 3.3 fine-tunes on PubMed) for clinical documentation, prior-authorization reasoning, pharmacovigilance case extraction. HIPAA-aligned with PHI isolation.

Legal

Custom LLMs for contract review, clause extraction, regulatory change summarisation, case-law research. Often paired with GraphRAG architectures for precedent and clause-relationship reasoning.

Retail & E-Commerce

LLMs for product description generation at scale, conversational shopping assistants, customer review summarisation, sentiment-tagged catalog enrichment. Often paired with our sentiment analysis solutions for review-driven product insights.

SaaS & Enterprise Software

Multi-tenant LLM platforms with per-customer LoRA fine-tuning. Custom in-app copilots. AI-powered admin assistants. Embedded analytical reasoning. We've shipped LLM features for SaaS clients with single-day fine-tune-to-deploy pipelines.

Media & Publishing

Custom LLMs for content generation, editorial assistance, content moderation, archive summarisation. Often fine-tuned on the publisher's own back catalogue for voice consistency.

Engineering & Developer Tools

Code-completion models (custom Code Llama, DeepSeek Coder fine-tunes), code-review assistants, internal-documentation generators. Integrated with GitHub Copilot extensions, JetBrains plugins, or custom IDE integrations.

Engagement Models for LLM Development

LLM Discovery & Architecture Sprint (2 to 4 weeks)

Workload audit, model selection benchmarking on your data, fine-tuning vs prompt-engineering decision, serving architecture proposal, cost projection. Typical price $25k-$50k.

Custom Fine-Tuning Engagement (4 to 10 weeks)

Data curation, annotation, training, evaluation, deployment to production serving infrastructure. Includes eval harness setup. Typical price $80k-$300k depending on model size and data complexity.

Full LLM Platform Build (3 to 6 months)

Custom LLM stack with multi-model serving, smart routing, fine-tuning pipeline, eval harnesses, observability, governance, multi-tenant isolation. Typical for SaaS clients building LLM features as a platform capability.

Domain-Specific Pre-Training

For organisations with massive proprietary corpora. Continued pre-training of an open-source base model on your domain data. 6 to 12 week engagement; price scales with corpus size and target model.

Dedicated LLM Engineering Team

Embedded squad (LLM engineer, ML engineer, MLOps engineer, eval specialist, infrastructure engineer) running with your team for 6+ months.

Managed LLM Operations

Post-launch operations: eval re-runs on model updates, prompt drift management, cost optimisation, incident response, new model evaluation as the frontier evolves. SLA-backed.

Our Clients’ Success Stories

AI-based Reputation Management Platform for Tour Operators

Python, OpenAI, AWS, PostgreSQL, MongoDB, EC2

Travel
Italy Market

ScalaCode developed TourReview, an AI-based platform designed to aggregate and analyze customer testimonials from various online sources. This solution provides…

Empowering Vehicle Owners with an AI-Driven Mobile App for Enhanced Security, Connectivity, and Control

OpenAI, Python, Swift, Kotlin, AWS, Stripe

Automotive
US Market

In the face of growing vehicle management challenges for households with multiple vehicles, CarKenny Inc., in collaboration with ScalaCode, sought…

Leveraging AI for Proactive Maintenance in Logistics Warehouses

Python, scikit-learn, IoT sensors, Node.js, Vue.js, MongoDB

Logistics
US Market

A global logistics provider sought a solution to minimize equipment downtime and enhance operational efficiency in their warehouses using predictive…

Planwise: AI-Powered Electrical Takeoff & Material Estimation Platform

React, Tailwind, Node.js, Google Vision API, PostgreSQL, Amazon S3

Real Estate
US Market

ScalaCode partnered with an emerging construction technology company to build an AI-powered web-based SaaS platform that automates electrical takeoff and…

TryStyle: AI-Powered Virtual Try-On for Fashion

Python, Flutter, PyTorch

eCommerce
US Market

TryStyle was launched to solve a fundamental challenge in fashion eCommerce: helping users confidently explore and visualize outfits before purchasing.…


LLM Development Technology Stack

Foundation Models

OpenAI GPT-5, GPT-4.1, o-series, Claude Sonnet 4.6, Opus 4.6, Haiku 4.5, Google Gemini 2.5 Pro / Flash, Llama 3.3 / 4, Mistral Large / Small / Pixtral, Qwen 3, DeepSeek-V3 / R1, Phi-4, Gemma 3, BioGPT, FinBERT, LegalBERT, custom-pretrained models

Fine-Tuning Frameworks

Hugging Face Transformers + PEFT, Axolotl, LLaMA-Factory, Unsloth, Together AI fine-tuning API, OpenAI fine-tuning API, Anthropic fine-tuning API, Custom training pipelines

Preference Optimisation

TRL, OpenRLHF, Custom RLAIF pipelines

Serving

vLLM, NVIDIA Triton + TensorRT-LLM, NVIDIA NIM, SGLang, Together AI inference, Anyscale Endpoints, Anthropic API, OpenAI API, Azure OpenAI Service, AWS Bedrock, Google Vertex AI

Evaluation

OpenAI Evals, Anthropic eval tooling, LangSmith, Langfuse, Braintrust, Promptfoo, RAGAS, HELM, Custom golden-set frameworks

Observability & Cost

LangSmith, Langfuse, Helicone, Arize Phoenix, Weights & Biases, OpenTelemetry, Custom cost-attribution dashboards

Safety & Guardrails

Llama Guard, OpenAI Moderation API, NVIDIA NeMo Guardrails, Lakera Guard, custom classifier guardrails

Training Infrastructure

AWS Trainium / SageMaker, GCP Vertex AI Training, Azure ML, CoreWeave, Lambda Labs, Customer-owned GPU clusters, DeepSpeed, FSDP, Megatron-LM

LLM Outcomes We've Delivered

Top-5 European bank

Llama 3.3 70B fine-tuned on internal regulatory corpus for KYC reasoning. Domain accuracy 71% → 94%. Replaced GPT-4 calls for KYC pipeline; cost per case dropped 89%.

US healthcare network

Domain-pretrained Llama 3.3 8B for prior-authorization reasoning (HIPAA-bounded, on-prem). Auth turnaround time 5.1 days → 11 hours. Denial rate dropped 27%.

Enterprise SaaS

Multi-LoRA serving for 240 customer-specific fine-tunes on a single base model. GPU footprint cut 95% vs separate-model baseline. Per-customer customisation latency <200ms.

Tier-1 retailer

Smart routing across GPT-5 / Claude Sonnet 4.6 / Gemini Flash / Llama fine-tune for product description generation. LLM bill cut 73% with quality scores held within 1 point of always-GPT-5 baseline.

Legal tech vendor

DPO-aligned Mistral Large for contract clause extraction. Output passed senior-paralegal review on 92% of test cases vs 67% for prompt-engineered Claude 4.6 baseline.

Customer support platform

Eval harness rollout caught 14 prompt regressions in the first month that would have shipped to production. Quality stability +18 points on user-facing CSAT scores.


LLM Development & Custom Fine-Tuning Built for Production