Applied AI: RAG and Fine-tuning
I fine-tune open models, mostly Gemma 4, on a single A100, and build RAG over real databases. The unglamorous layer that decides whether an AI product is trustworthy or just a good demo.
Two jobs people keep confusing
RAG and fine-tuning solve different problems, and mixing them up is the classic rookie move. RAG adds knowledge. Fine-tuning changes behaviour. If the complaint is that the model does not know something, you retrieve. If the complaint is that the model does not act the way you need, you fine-tune. Bake fresh facts into weights and they go stale, and every refresh costs another training run.
RAG: grounding the model in real data
Most RAG failures are retrieval failures, not model failures. If the right context never gets fetched, no model on earth saves you. So I score retrieval on its own, separate from generation, and I make the system say "I do not know" instead of confidently inventing an answer. Vectorization, chunking and freshness are product decisions, not afterthoughts.
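Scoring retrieval separately from generation can be sketched roughly as below. This is a minimal illustration, not the pipeline described here: the chunk ids, the gold labels, the function names and the 0.35 similarity floor are all hypothetical, and recall@k with mean reciprocal rank are simply two common choices of metric.

```python
from typing import Dict, List, Set


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant chunk ids that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def reciprocal_rank(retrieved: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def score_retrieval(
    runs: Dict[str, List[str]], gold: Dict[str, Set[str]], k: int = 5
) -> Dict[str, float]:
    """Average recall@k and MRR over every query that has gold labels."""
    if not gold:
        raise ValueError("gold set is empty")
    recalls = [recall_at_k(runs.get(q, []), rel, k) for q, rel in gold.items()]
    rranks = [reciprocal_rank(runs.get(q, []), rel) for q, rel in gold.items()]
    n = len(gold)
    return {"recall@k": sum(recalls) / n, "mrr": sum(rranks) / n}


def should_abstain(similarities: List[float], min_similarity: float = 0.35) -> bool:
    """True when no retrieved chunk clears the (hypothetical) similarity floor."""
    return not similarities or max(similarities) < min_similarity


# Hypothetical retrieval output and gold labels for two queries.
runs = {"q1": ["c3", "c1", "c9"], "q2": ["c7", "c8", "c2"]}
gold = {"q1": {"c1"}, "q2": {"c4"}}

metrics = score_retrieval(runs, gold, k=3)
print(metrics)
print("abstain:", should_abstain([0.21, 0.18]))
```

Because the generator never runs here, a drop in these numbers points at chunking, embedding or indexing rather than at the model. The abstain check is the hook for answering "I do not know" when nothing relevant came back.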
Fine-tuning: I mostly fine-tune Gemma 4
Gemma 4 is open-weight, Apache 2.0, and ships in four sizes I pick between depending on the job. E2B and E4B are the small effective-2B and effective-4B models that run on the edge or even a phone. The 26B is a mixture-of-experts that only activates around 4B parameters per token, so you get close to big-model quality at small-model compute cost. The 31B dense is the one I reach for when I want maximum quality and do not care about footprint. Right-sizing the model to the task is half the work.
How I do it: one A100 80GB, LoRA and QLoRA
I do not do full fine-tuning. That needs a rack of GPUs and it is the wrong tool for almost every real job. On a single A100 80GB I use LoRA: freeze the base model and train tiny adapters, well under one percent of the parameters. When I want headroom I switch to QLoRA, which does the same thing with the frozen base quantised to 4-bit. That comfortably handles the 26B MoE with LoRA and the 31B dense with QLoRA on one card.
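The single-card claim can be sanity-checked with back-of-the-envelope arithmetic. The sketch below counts the frozen weights only. It assumes nominal parameter counts of exactly 26B and 31B, a 16-bit base for plain LoRA, decimal gigabytes, and no quantisation overhead. The 4096 x 4096 layer and rank 16 are hypothetical values chosen only to show how small an adapter is. Activations, optimizer state and framework overhead come on top.

```python
A100_GB = 80.0  # card capacity, treated here as decimal gigabytes


def weight_gb(params_billion: float, bits_per_param: float) -> float:
    """Memory for the frozen weights alone, in decimal GB."""
    return params_billion * bits_per_param / 8.0


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_in x d_out linear layer.

    The adapter is two matrices: A (rank x d_in) and B (d_out x rank).
    """
    return rank * (d_in + d_out)


configs = [
    ("26B MoE, 16-bit base (LoRA)", 26, 16),
    ("31B dense, 16-bit base (LoRA)", 31, 16),
    ("31B dense, 4-bit base (QLoRA)", 31, 4),
]
for name, size_b, bits in configs:
    gb = weight_gb(size_b, bits)
    print(f"{name}: {gb:.1f} GB of weights, {A100_GB - gb:.1f} GB left over")

# Hypothetical layer shape and rank, for illustration only.
full = 4096 * 4096
extra = lora_params(4096, 4096, rank=16)
print(f"adapter share of that one layer: {extra / full:.2%}")
```

Under these assumptions the 26B base takes about 52 GB in 16-bit, since every expert has to be resident in memory even though only a few are active per token. The 31B base takes about 62 GB in 16-bit, which leaves little room for training, and about 15.5 GB in 4-bit.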
The dataset is the actual work. A clean, consistent, well-labelled instruction set beats any clever hyperparameter every single time. I keep a frozen holdout the model never trains on, compare every run against both the base model and the previous adapter, and watch for the model getting dumber outside the target task. Specialise it too hard and it forgets how to do everything else. That trade is the job.
Why bother when an API exists
Three reasons. Cost and latency: a fine-tuned 4B can match a frontier model on a narrow task at a fraction of the price and answer in milliseconds. Control: you own the weights, there is no vendor pulling the model out from under you, and the Apache 2.0 licence permits commercial use and modification with few strings attached. Privacy and sovereignty: it runs on your own A100, the data never leaves the building, which for regulated or sensitive crypto data is the entire ballgame.
When to do it, and when not
Reach for fine-tuning only after prompting and RAG have run out of road. It earns its keep when you need to lock a style or output format, teach the model the language of a narrow domain, or make a small cheap model punch above its weight. Do not fine-tune to add knowledge, do not fine-tune to fix a bug you could fix with a better prompt, and do not fine-tune before you have an eval suite, because then you cannot even tell if it helped.
What I use it for
Crypto is full of narrow, repetitive, structured tasks where a specialised small model beats a giant generalist. Turning messy protocol docs and governance posts into clean structured data. Classifying on-chain activity. Powering the cheap inner steps of an agent so you are not paying frontier prices on every call. And running the whole thing on-prem when the data is not allowed to leave. The eval suite still gates all of it. A fine-tune that wins on the target task but quietly breaks everything else is a regression, not a win.