AI LAB · MODEL ENGINEERING

Models, engineered to ship.

We right-size, fine-tune, compress and optimise models until they run fast, lean and offline — on mobile, edge and consumer GPUs, not just in the cloud.

Most teams treat a large model like a magic wand. We treat the model as something to be engineered — chosen for the task, fine-tuned on the right data, and compressed to fit the device it runs on.

A model is probabilistic — it gets better with work, but it never reaches 100%. So where correctness matters, we keep it out of the model's hands: the model interprets, deterministic code executes, and a human stays in the loop.

·WHAT WE DO

The discipline behind a shipped model.

Getting a model into production is rarely about a bigger model — it's about the right one, engineered to run.

01

Right-sizing & model selection

Choosing the smallest architecture that actually solves the task — encoder vs generative, on-device SLM vs hosted API — and killing over-engineering before it ships.

02

Fine-tuning & custom models

Task-specific models trained on the right data — including custom NER and synthetic datasets generated when no public data exists.
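
As an illustration of template-based synthetic data for NER, here is a minimal sketch. The templates, slot values and label names (AMOUNT, ITEM, PERSON) are hypothetical, and a real dataset would need far more variety plus a manual error-analysis pass.

```python
import random

# Hypothetical sentence templates; {SLOT} markers are replaced by sampled values.
TEMPLATES = [
    "paid {AMOUNT} for {ITEM} with {PERSON}",
    "{PERSON} owes me {AMOUNT} for {ITEM}",
]

# Hypothetical slot values; multi-word values exercise the I- continuation tags.
SLOT_VALUES = {
    "AMOUNT": ["12.50", "40", "7.99"],
    "ITEM": ["lunch", "taxi fare", "movie tickets"],
    "PERSON": ["Sam", "Priya", "Alex"],
}


def generate_example(rng: random.Random) -> tuple[list[str], list[str]]:
    """Fill one random template and return aligned (tokens, BIO tags)."""
    tokens, tags = [], []
    for word in rng.choice(TEMPLATES).split():
        if word.startswith("{") and word.endswith("}"):
            label = word[1:-1]
            value_tokens = rng.choice(SLOT_VALUES[label]).split()
            tokens.extend(value_tokens)
            # First token of an entity gets B-, the rest get I-.
            tags.extend([f"B-{label}"] + [f"I-{label}"] * (len(value_tokens) - 1))
        else:
            tokens.append(word)
            tags.append("O")
    return tokens, tags


rng = random.Random(0)  # seeded so the sample is reproducible
for _ in range(3):
    print(generate_example(rng))
```

Every example comes out labelled by construction, so no annotation step is needed for the generated portion of the data.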

03

Compression

Distillation, pruning and 4/8-bit quantization to shrink footprints by orders of magnitude without giving up accuracy on the task.
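
The arithmetic behind quantization can be sketched in a few lines. This is a rough estimate of weight storage only, assuming every weight is stored at the same bit width. The 25-million-parameter figure is a hypothetical encoder size chosen for illustration, and real model files carry extra overhead (embeddings kept at higher precision, graph metadata, tokenizer assets).

```python
def weight_footprint_mb(num_params: int, bits_per_weight: int) -> float:
    """Approximate size of the weights alone, in megabytes (1 MB = 10**6 bytes)."""
    return num_params * bits_per_weight / 8 / 1_000_000


# Hypothetical 25-million-parameter encoder, for illustration only.
params = 25_000_000
for bits in (32, 8, 4):
    print(f"{bits:>2}-bit: {weight_footprint_mb(params, bits):.1f} MB")
```

For this hypothetical model the weights drop from 100 MB at 32-bit to 25 MB at 8-bit and 12.5 MB at 4-bit. By the same arithmetic, 3 billion parameters at 4 bits still come to about 1,500 MB, which is why a large footprint gap usually calls for a smaller architecture as well as lower precision.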

04

Edge & on-device deployment

ONNX and TensorRT runtimes, CPU-only and mobile inference, offline-first and zero-API where privacy and latency demand it.

05

Inference optimization

Squeezing latency and cost out of generative models so they run in real time on consumer-grade hardware.

06

Hybrid architectures

ML for interpretation, deterministic code for execution — with humans in the loop where correctness can't be left to a probabilistic model.
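
One way to picture this split is a small routing layer. The sketch below is an assumption-laden illustration, not a description of any shipped system: the `Interpretation` type, the `REVIEW_THRESHOLD` value and the handler table are all hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

REVIEW_THRESHOLD = 0.90  # hypothetical cut-off; tune per task


@dataclass
class Interpretation:
    """What the model produces: a guess about intent, never an action."""
    intent: str
    confidence: float  # model score in [0, 1]
    slots: dict = field(default_factory=dict)


def execute(interp: Interpretation,
            handlers: dict[str, Callable[[dict], Any]]) -> tuple[str, Any]:
    """Run deterministic code for confident, known intents; otherwise escalate."""
    handler = handlers.get(interp.intent)
    if handler is None or interp.confidence < REVIEW_THRESHOLD:
        # Unknown intent or low confidence: hand the case to a person.
        return ("review", interp)
    return ("done", handler(interp.slots))


# Deterministic handlers: plain code with no model involved.
handlers = {"add": lambda slots: slots["a"] + slots["b"]}

print(execute(Interpretation("add", 0.97, {"a": 2, "b": 3}), handlers))
print(execute(Interpretation("add", 0.55, {"a": 2, "b": 3}), handlers))
```

The model only ever returns an `Interpretation`. Anything that changes state or computes a result lives in the handlers, so it can be unit-tested like ordinary code.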

·DEEP WORK

Engineering that earned its footprint.

Two builds where the model itself was the hard problem.

On-device NLP · mobile

SaySplit

From a 3B-parameter LLM to a 26.6 MB on-device model

A voice-first, on-device expense app needed 100% local inference: no API calls at runtime, full privacy and a sub-50 MB footprint. After 3B LLMs, model distillation and 4-bit quantization all hit a wall around 92 MB, we re-framed the problem from text generation to entity extraction, fine-tuned an encoder (MobileBERT) for custom NER, and exported it to ONNX at 8-bit.

  • 3B-parameter LLMs → a 26.6 MB production model
  • Custom NER on a synthetic dataset; single-pass inference, no autoregressive generation loop
  • Hybrid design — the model extracts, deterministic code does the math: 0% calculation error
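
The extract-then-compute pattern in the bullets above can be sketched as follows. This is an illustrative example, not the app's code: the tag names (ITEM, AMOUNT, PERSON), the sample utterance and the even-split rule are assumptions, and the tags are hard-coded where a real pipeline would take them from a single forward pass of the NER model.

```python
from decimal import Decimal


def decode_bio(tokens: list[str], tags: list[str]) -> list[tuple[str, str]]:
    """Group per-token BIO tags into (label, text) entities."""
    entities, label, words = [], None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if label:
                entities.append((label, " ".join(words)))
            label, words = tag[2:], [token]
        elif tag.startswith("I-") and label == tag[2:]:
            words.append(token)
        else:
            # "O" or a stray I- tag closes any open entity.
            if label:
                entities.append((label, " ".join(words)))
            label, words = None, []
    if label:
        entities.append((label, " ".join(words)))
    return entities


def split_evenly(total: Decimal, people: int) -> list[Decimal]:
    """Split in whole cents so the shares always sum back to the total."""
    cents = int((total * 100).to_integral_value())
    base, remainder = divmod(cents, people)
    # The first `remainder` people carry one extra cent each.
    return [Decimal(base + (1 if i < remainder else 0)) / 100
            for i in range(people)]


# Hypothetical model output for "dinner 45.00 split with Sam and Priya".
tokens = ["dinner", "45.00", "split", "with", "Sam", "and", "Priya"]
tags = ["B-ITEM", "B-AMOUNT", "O", "O", "B-PERSON", "O", "B-PERSON"]

entities = decode_bio(tokens, tags)
amount = Decimal(next(text for label, text in entities if label == "AMOUNT"))
people = 1 + sum(1 for label, _ in entities if label == "PERSON")  # speaker + named
shares = split_evenly(amount, people)
print(entities)
print(shares, sum(shares) == amount)
```

The model's only job is tagging tokens. The split uses integer cents and `Decimal`, so an awkward case such as 10.00 among three people gives 3.34, 3.33 and 3.33, which still sum to exactly 10.00.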

Generative AI · avatars

SalesSpark

Real-time talking-head video, optimised for consumer GPUs

For an AI outreach product that generates lifelike talking-head video from a single image and an audio clip, we optimised the model and inference pipeline so long-duration, lip-synced video runs in real time on lower-cost, consumer-grade GPUs — without sacrificing identity or quality.

  • Real-time lip-synced avatar video from image + audio
  • Inference & model optimization for consumer-grade GPUs
  • Lightweight deployment with quality and identity preserved
·THE STACK

What we work with.

The models, training methods, compression and runtimes behind every shipped model.

Models & architectures: Encoder transformers (MobileBERT) · custom NER · small language models · generative & talking-head models
Training & data: Fine-tuning · distillation · synthetic dataset generation · error-analysis loops
Compression: 4/8-bit quantization · post-training quantization · pruning
Runtimes & deployment: ONNX Runtime · TensorRT · onnxruntime-react-native · NNAPI · CPU / mobile / edge / consumer GPU
Architecture: Hybrid ML + deterministic logic · human-in-the-loop · offline-first, zero-API