Applied AI: RAG and Fine-tuning

Fine-tuning open models, mostly Gemma 4, on a single A100, and RAG over real databases. The unglamorous layer that decides whether an AI product is trustworthy or just a good demo.

Two jobs people keep confusing

RAG and fine-tuning solve different problems, and mixing them up is the classic rookie move. RAG adds knowledge. Fine-tuning changes behaviour. If the complaint is that the model does not know something, you retrieve. If the complaint is that the model does not act the way you need, you fine-tune. Bake fresh facts into weights and they go stale, and every refresh costs another training run.

RAG: grounding the model in real data

Most RAG failures are retrieval failures, not model failures. If the right context never gets fetched, no model on earth saves you. So retrieval is scored on its own, separate from generation, and the system is built to say it does not know instead of confidently inventing. Vectorization, chunking and freshness are product decisions, not afterthoughts.
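A minimal sketch of what scoring retrieval on its own can look like, assuming a small hand-labelled set of queries with known relevant chunk IDs. The function names, the chunk IDs and the 0.35 abstain threshold are illustrative assumptions, not part of any particular stack.

```python
from typing import Dict, List, Set


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant chunk IDs that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def reciprocal_rank(retrieved: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def score_retriever(
    runs: Dict[str, List[str]], gold: Dict[str, Set[str]], k: int = 5
) -> Dict[str, float]:
    """Average recall@k and MRR over a labelled query set. No generation involved."""
    n = len(gold)
    recall = sum(recall_at_k(runs.get(q, []), rel, k) for q, rel in gold.items()) / n
    mrr = sum(reciprocal_rank(runs.get(q, []), rel) for q, rel in gold.items()) / n
    return {"recall_at_k": recall, "mrr": mrr}


def should_abstain(top_scores: List[float], min_score: float = 0.35) -> bool:
    """Abstain when nothing came back or the best similarity is below the threshold.

    The threshold is a hypothetical value; it would be tuned on the labelled set.
    """
    return not top_scores or max(top_scores) < min_score


if __name__ == "__main__":
    gold = {"q1": {"c1", "c2"}, "q2": {"c9"}}                  # hypothetical labels
    runs = {"q1": ["c1", "c7", "c2"], "q2": ["c3", "c4", "c5"]}  # retriever output
    print(score_retriever(runs, gold, k=3))
    print(should_abstain([0.12, 0.08]))
```

Because these numbers never touch the generator, a drop in recall points straight at chunking, embeddings or index freshness. The abstain check is the hook for "I do not know": if it fires, the system answers that way instead of passing weak context to the model.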

Fine-tuning: mostly Gemma 4

Gemma 4 is open weight, Apache 2.0, and ships in four sizes to pick between depending on the job. E2B and E4B are the small effective-2B and 4B models that run on the edge or even a phone. The 26B is a mixture-of-experts that only activates around 4B parameters per token, so you get close to big-model quality at small-model cost. The 31B dense is the one to reach for when maximum quality matters and footprint does not. Right-sizing the model to the task is half the work.

How it works: one A100 80GB, LoRA and QLoRA

Full fine-tuning is off the table. It needs a rack of GPUs and it is the wrong tool for almost every real job. On a single A100 80GB the approach is LoRA, and QLoRA when more headroom is needed. LoRA freezes the base model and trains tiny adapters, well under one percent of the parameters. QLoRA does the same with the frozen base quantised to 4-bit. That comfortably handles the 26B MoE with LoRA and the 31B dense with QLoRA on one card.
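The back-of-envelope arithmetic behind that choice, as a rough sketch. It counts only the frozen weights and the adapter parameters; activations, optimizer state for the adapters and quantisation overhead come on top. The 4096 hidden size and rank 16 are illustrative assumptions, not Gemma's actual dimensions.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory for the frozen weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    LoRA learns two small matrices, A (rank x d_in) and B (d_out x rank),
    so the count is rank * (d_in + d_out).
    """
    return rank * (d_in + d_out)


def lora_fraction(d_in: int, d_out: int, rank: int) -> float:
    """Adapter parameters as a fraction of the frozen matrix they attach to."""
    return lora_params(d_in, d_out, rank) / (d_in * d_out)


if __name__ == "__main__":
    # A MoE activates few parameters per token, but every expert still has to
    # sit in memory, so the full 26B is what counts here.
    print(f"26B at 16-bit: {weight_memory_gb(26e9, 16):.1f} GB")
    print(f"31B at 16-bit: {weight_memory_gb(31e9, 16):.1f} GB")
    print(f"31B at 4-bit:  {weight_memory_gb(31e9, 4):.1f} GB")

    # Hypothetical 4096 x 4096 projection with a rank-16 adapter.
    print(f"adapter params:   {lora_params(4096, 4096, 16)}")
    print(f"adapter fraction: {lora_fraction(4096, 4096, 16):.4%}")
```

At 16-bit the 31B weights alone take most of an 80 GB card, which is why the 4-bit base is what buys the headroom. The rank-16 adapter on the example matrix comes to under one percent of that matrix's parameters.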

The dataset is the actual work. A clean, consistent, well-labelled instruction set beats any clever hyperparameter every single time. A frozen holdout that the model never trains on is set aside, every run is compared against both the base model and the previous adapter, and the model is watched for getting dumber outside the target task. Specialise it too hard and it forgets how to do everything else. That trade is the job.
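One way that comparison could be wired up, as a sketch under stated assumptions: each model has already been scored on the frozen holdout and on a few general benchmarks, higher is better, and a two-point drop on any general benchmark is the tolerated limit. The metric names, scores and threshold are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class GateResult:
    passed: bool
    reasons: List[str]


def gate_adapter(
    candidate: Dict[str, float],
    base: Dict[str, float],
    previous: Dict[str, float],
    target_metric: str = "target_task",
    max_general_drop: float = 0.02,
) -> GateResult:
    """Accept a new adapter only if it beats both the base model and the previous
    adapter on the target task and does not regress elsewhere beyond the limit."""
    reasons: List[str] = []

    if candidate[target_metric] <= base[target_metric]:
        reasons.append("no gain over base model on target task")
    if candidate[target_metric] <= previous[target_metric]:
        reasons.append("no gain over previous adapter on target task")

    # Forgetting check: every non-target metric is compared against the base model.
    # A metric missing from the candidate counts as a score of 0.0, so it fails.
    for name, base_score in base.items():
        if name == target_metric:
            continue
        drop = base_score - candidate.get(name, 0.0)
        if drop > max_general_drop:
            reasons.append(f"{name} dropped by {drop:.2f} versus base model")

    return GateResult(passed=not reasons, reasons=reasons)


if __name__ == "__main__":
    base = {"target_task": 0.61, "general_qa": 0.70, "math": 0.55}
    previous = {"target_task": 0.78, "general_qa": 0.70, "math": 0.54}
    candidate = {"target_task": 0.84, "general_qa": 0.69, "math": 0.41}
    result = gate_adapter(candidate, base, previous)
    print(result.passed)
    for reason in result.reasons:
        print("-", reason)
```

In this made-up run the candidate wins on the target task but loses fourteen points on math, so the gate rejects it. That is the forgetting trade made explicit: the threshold is where the line gets drawn.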

Why bother when an API exists

Three reasons. Cost and latency: a fine-tuned 4B can match a frontier model on a narrow task at a fraction of the price and answer in milliseconds. Control: you own the weights, there is no vendor pulling the model out from under you, and the Apache 2.0 licence permits commercial use and modification with very few strings attached. Privacy and sovereignty: it runs on your own A100 and the data never leaves the building, which for regulated or sensitive crypto data is the entire ballgame.

When to do it, and when not

Reach for fine-tuning only after prompting and RAG have run out of road. It earns its keep when you need to lock a style or output format, teach the model the language of a narrow domain, or make a small cheap model punch above its weight. Do not fine-tune to add knowledge, do not fine-tune to fix a bug you could fix with a better prompt, and do not fine-tune before you have an eval suite, because then you cannot even tell if it helped.

What it is used for

Crypto is full of narrow, repetitive, structured tasks where a specialised small model beats a giant generalist. Turning messy protocol docs and governance posts into clean structured data. Classifying on-chain activity. Powering the cheap inner steps of an agent so you are not paying frontier prices on every call. And running the whole thing on-prem when the data is not allowed to leave. The eval suite still gates all of it. A fine-tune that wins on the target task but quietly breaks everything else is a regression, not a win.