Why train tiny models at all?
The core question in this experiment is sharp: how much useful behavior can you squeeze into a model so small it runs comfortably on CPU? In this case, the answer starts around 64 MB and points toward a different way to think about capability density.
Instead of asking one giant model to do everything, this line of work explores many tiny specialists. Each model can own a narrow domain and be loaded only when needed. That shifts the conversation from brute force scale to composable intelligence.
From prompt to game engine behavior
The model is trained on a compact, template-driven format representing turn-based RPG state. Inputs encode turn counters, status effects, cooldown slots, and actions like poison_strike, ignite, heal, and guard. Outputs resolve toward a canonical next-state block.
That makes the model behave like a fuzzy state-transition engine. Not deterministic code, but learned transitions with enough structure to produce coherent combat outcomes under normal conditions.
- Status channels: shield, poison, burn, regen, thorns, stun, fragile, cooldowns.
- Action grammar: attack, heal, guard, cleanse, poison strike, ignite, brace, wait.
- Goal: infer and emit consistent post-turn world state.
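To make the “fuzzy state-transition engine” framing concrete, here is a minimal sketch of what one turn resolution looks like as a pure function. This is a hypothetical reference resolver, not the author’s training format: the field names, damage numbers, and tick order are all illustrative assumptions, and only a few actions from the grammar are shown.

```python
def resolve_turn(state, action):
    """Apply one action, then tick status effects; return the next state.

    Hypothetical oracle for the kind of transition the model learns to
    approximate. All fields are assumed to be ints; numbers are made up.
    """
    s = dict(state)  # shallow copy so the input state is untouched
    if action == "attack":
        dmg = 5
        if s.get("shield", 0) > 0:
            absorbed = min(dmg, s["shield"])  # shield soaks damage first
            s["shield"] -= absorbed
            dmg -= absorbed
        s["hp"] -= dmg
    elif action == "heal":
        s["hp"] = min(s["hp"] + 4, s["max_hp"])
    elif action == "guard":
        s["shield"] = s.get("shield", 0) + 3
    elif action == "poison_strike":
        s["hp"] -= 2
        s["poison"] = s.get("poison", 0) + 2  # stacks tick down each turn
    # ignite, cleanse, brace, wait elided for brevity
    # End-of-turn status ticks.
    if s.get("poison", 0) > 0:
        s["hp"] -= 1
        s["poison"] -= 1
    s["turn"] = s.get("turn", 0) + 1
    return s
```

A deterministic function like this doubles as an eval oracle: generate random (state, action) pairs, render them into the template format, and check the model’s emitted next-state block against the oracle’s answer.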
The useful failures
The best part of the note is not “look, it worked.” It is where behavior degrades: label collapse, token drift, and boundary confusion when malformed or overloaded labels are introduced. For example, swapping semantically messy variants in for clean symbolic values can destabilize output shape quickly.
These failures are not just bugs. They map the model’s internal compression limits and expose where representation quality breaks down. In practical training terms, they tell you what to fix next in data format, token conventions, and eval coverage.
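One way to run this kind of failure mapping systematically is to perturb clean labels into messy variants and watch for the point where output shape collapses. The helpers below are a sketch under assumptions (the perturbation types and rate are invented, not taken from the experiment):

```python
import random

def perturb_label(label, rng):
    """Return a semantically messy variant of one status/action label."""
    choice = rng.randrange(3)
    if choice == 0:
        return label.upper()           # case noise
    if choice == 1:
        return label.replace("_", "")  # merged tokens, e.g. poisonstrike
    return label + "_x"                # near-miss label

def perturb_line(line, rng, rate=0.3):
    """Corrupt roughly `rate` of the tokens in a state line."""
    tokens = line.split()
    out = [perturb_label(t, rng) if rng.random() < rate else t
           for t in tokens]
    return " ".join(out)
```

Running the model over a sweep of perturbation rates, then scoring how often the output still parses as a valid next-state block, turns “boundary confusion” from an anecdote into a curve.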
“Why not just write normal code?”
That objection is fair and still misses the point. Deterministic engines are often the right answer. The experiment here is different: can you grow a tiny model that internalizes narrow behavioral rules and stays useful under constrained compute?
For edge contexts (old GPUs, Raspberry Pi style deployments, CPU-first inference), this matters. A tiny trained model can become a flexible component where strict code paths are brittle or expensive to maintain across evolving behaviors.
The practical takeaway is not “replace code with models.” It is “learn where tiny learned systems beat hard-coded complexity.”
What this means for custom model training
- Define a constrained domain: tiny models win when the task envelope is explicit.
- Treat formats as architecture: token names and templates are part of model design, not just data wrappers.
- Build failure-first evals: perturb labels, action order, and edge states early to map brittleness.
- Measure capability-per-megabyte: this can be more actionable than raw benchmark scores.
- Design for composition: multiple tiny experts may beat one heavyweight generalist for real product loops.
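Capability-per-megabyte is easy to operationalize. A minimal sketch, assuming a simple accuracy-style eval harness and on-disk model size (both are illustrative, not the author’s tooling):

```python
def capability_per_mb(correct, total, model_size_bytes):
    """Eval accuracy divided by on-disk size in MB.

    Hypothetical metric: higher means more useful behavior per megabyte.
    """
    accuracy = correct / total
    size_mb = model_size_bytes / (1024 * 1024)
    return accuracy / size_mb

# A 64 MB specialist at 92% vs. a 7 GB generalist at 97% on the same
# narrow eval: the specialist wins decisively on this axis.
tiny = capability_per_mb(92, 100, 64 * 1024 * 1024)
big = capability_per_mb(97, 100, 7000 * 1024 * 1024)
```

The metric only makes sense within a fixed task envelope; it is a way to compare candidate tiny specialists for one slot, not a general leaderboard score.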