AI LAB · MODEL ENGINEERING
Models, engineered to ship.
We right-size, fine-tune, compress and optimize models until they run fast, lean and offline: on mobile, edge devices and consumer GPUs, not just in the cloud.
Most teams treat a large model like a magic wand. We treat the model as something to be engineered — chosen for the task, fine-tuned on the right data, and compressed to fit the device it runs on.
A model is probabilistic: it gets better with work, but it never reaches 100% accuracy. So where correctness matters, we keep it out of the model's hands. The model interprets, deterministic code executes, and a human stays in the loop.
The discipline behind a shipped model.
Getting a model into production is rarely about a bigger model — it's about the right one, engineered to run.
Right-sizing & model selection
Choosing the smallest architecture that actually solves the task — encoder vs generative, on-device SLM vs hosted API — and killing over-engineering before it ships.
Fine-tuning & custom models
Task-specific models trained on the right data — including custom NER and synthetic datasets generated when no public data exists.
Compression
Distillation, pruning and 4/8-bit quantization to shrink footprints by orders of magnitude without losing accuracy on the task.
Edge & on-device deployment
ONNX and TensorRT runtimes, CPU-only and mobile inference, offline-first and zero-API where privacy and latency demand it.
Inference optimization
Squeezing latency and cost out of generative models so they run in real time on consumer-grade hardware.
Hybrid architectures
ML for interpretation, deterministic code for execution — with humans in the loop where correctness can't be left to a probabilistic model.
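For the compression work listed above, the storage arithmetic behind 4/8-bit quantization can be sketched in a few lines. This is a rough estimate under stated assumptions, not a measurement: the function name `weight_footprint_mb` is hypothetical, the 25M-parameter encoder is an illustrative size rather than a figure from our builds, and the estimate counts weights only (no tokenizer files, graph metadata, or layers kept at higher precision).

```python
def weight_footprint_mb(num_params: int, bits_per_weight: int) -> float:
    """Approximate weight storage in megabytes (1 MB = 10**6 bytes).

    Weights only: real exported files also carry metadata and may keep
    some layers at higher precision, so treat this as a lower bound.
    """
    if num_params < 0 or bits_per_weight <= 0:
        raise ValueError("num_params must be >= 0 and bits_per_weight > 0")
    return num_params * bits_per_weight / 8 / 1_000_000


# Illustrative model sizes (assumptions, not measurements).
models = [
    ("3B-parameter LLM", 3_000_000_000),
    ("25M-parameter encoder", 25_000_000),
]

for name, params in models:
    for bits in (32, 16, 8, 4):
        size = weight_footprint_mb(params, bits)
        print(f"{name:>22} @ {bits:>2}-bit: {size:>10,.1f} MB")
```

The arithmetic shows why quantization alone rarely gets a multi-billion-parameter model under a mobile budget: halving the bit width only halves the file, while choosing a smaller architecture changes the parameter count itself.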
Engineering that earned its footprint.
Two builds where the model itself was the hard problem.
From a 3B-parameter LLM to a 26.6 MB on-device model
A voice-first, on-device expense app needed 100% local inference: zero API calls at runtime, full privacy, and a sub-50 MB footprint. After 3B LLMs, model distillation and 4-bit quantization all hit a wall around 92 MB, we reframed the problem from text generation to entity extraction, fine-tuned an encoder (MobileBERT) for custom NER, and exported it to ONNX with 8-bit quantization.
- 3B-parameter LLMs → a 26.6 MB production model
- Custom NER on a synthetic dataset; single-pass inference, no autoregressive generation loop
- Hybrid design — the model extracts, deterministic code does the math: 0% calculation error
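The hybrid split in the last bullet can be sketched as follows. This is a minimal illustration, not the production code: the `Entity` type, the `AMOUNT` and `CATEGORY` labels, and both helper functions are hypothetical, and the model output is a hard-coded stand-in. The point it shows is that the model only labels spans, while parsing and arithmetic happen in exact, deterministic code.

```python
from dataclasses import dataclass
from decimal import Decimal, InvalidOperation


@dataclass(frozen=True)
class Entity:
    """One span a NER model might emit (labels are hypothetical)."""
    label: str  # e.g. "AMOUNT" or "CATEGORY"
    text: str   # raw span text, exactly as transcribed


def parse_amount(text: str) -> Decimal:
    """Deterministically convert an extracted amount span to a Decimal."""
    cleaned = text.replace(",", "").replace("$", "").strip()
    try:
        return Decimal(cleaned)
    except InvalidOperation as exc:
        # Refuse to guess: surface the span so a human can review it.
        raise ValueError(f"unparseable amount: {text!r}") from exc


def total_by_category(expenses: list[list[Entity]]) -> dict[str, Decimal]:
    """Sum amounts per category with exact decimal arithmetic."""
    totals: dict[str, Decimal] = {}
    for entities in expenses:
        amount = next(e for e in entities if e.label == "AMOUNT")
        category = next(e for e in entities if e.label == "CATEGORY")
        key = category.text.lower()
        totals[key] = totals.get(key, Decimal("0")) + parse_amount(amount.text)
    return totals


# Stand-in for model output; a real system would get these from the NER model.
extracted = [
    [Entity("AMOUNT", "$12.10"), Entity("CATEGORY", "Coffee")],
    [Entity("AMOUNT", "1,200.20"), Entity("CATEGORY", "Rent")],
    [Entity("AMOUNT", "0.20"), Entity("CATEGORY", "coffee")],
]

totals = total_by_category(extracted)
print(totals)
```

Using `Decimal` instead of floats keeps currency sums exact, and an unparseable span raises an error rather than producing a silently wrong total, which is where a human-in-the-loop review step would attach.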
Real-time talking-head video, optimized for consumer GPUs
For an AI outreach product that generates lifelike talking-head video from a single image and an audio clip, we optimized the model and inference pipeline so long-duration, lip-synced video runs in real time on lower-cost, consumer-grade GPUs, without sacrificing the subject's identity or visual quality.
- Real-time lip-synced avatar video from image + audio
- Inference & model optimization for consumer-grade GPUs
- Lightweight deployment with quality and identity preserved
What we work with.
The models, training methods, compression and runtimes behind every shipped model.