Build a Diffusion Model From Scratch to Generate Game Sprites

Most of what I understood about diffusion models came from reading papers and nodding along like I got it. To actually get it, I needed to build one end to end — no Hugging Face diffusers, no pretrained weights, no library magic quietly doing the hard parts for me. So I wrote pose-sprite-diffusion: a small conditional diffusion model, trained entirely from scratch, that turns a stick-figure skeleton into a 64×64 character sprite. Feed it a different pose and the same kind of character shows up in that new pose.

It is unapologetically a learning project — the goal was understanding, not shipping a product. But the thing genuinely works, and the core idea turned out to be simpler than the literature makes it sound.

The idea: pose in, sprite out

A diffusion model learns to turn pure noise into an image by reversing a gradual noising process. On its own, that gives you unconditional samples — random characters in random poses. I wanted control. Specifically, I wanted to hand the model a pose and have it respect that pose every time.

Conceptually this is a conditional diffusion U-Net. People reach for the ControlNet comparison, but it is not the same thing. ControlNet bolts a trainable adapter onto a frozen, pretrained model. Here there is nothing pretrained to freeze — the whole network is small and trained from scratch, and the pose is baked directly into the model's input. The closest references are PG2 (pose-guided person generation) and VITON, which both condition on pose information.

Conditioning is just extra channels

Here is the part that surprised me with how simple it is. The skeleton is not drawn as a stick figure and handed over as an RGB image. Instead, each of the 14 joints becomes its own heatmap — a single channel with a Gaussian blob centered on that joint. Stack those 14 heatmaps and you get a 14-channel tensor that says, precisely, where every joint sits.
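To make the encoding concrete, here is a minimal NumPy sketch of turning joint coordinates into stacked heatmaps. It is an illustration, not the code from the repo: the function name joints_to_heatmaps, the sigma of 2 pixels, and the random joint positions are all assumptions.

```python
import numpy as np

def joints_to_heatmaps(joints_xy, size=64, sigma=2.0):
    """Render one Gaussian heatmap per joint.

    joints_xy: array-like of shape (J, 2) holding (x, y) pixel coordinates.
    Returns a float32 array of shape (J, size, size) that peaks at 1.0 on each joint.
    sigma (blob width in pixels) is an assumed value, not taken from the project.
    """
    joints_xy = np.asarray(joints_xy, dtype=np.float32)
    ys, xs = np.mgrid[0:size, 0:size].astype(np.float32)  # pixel grids, each (size, size)

    # Broadcast (1, size, size) grids against (J, 1, 1) joint coordinates.
    dx = xs[None, :, :] - joints_xy[:, 0, None, None]
    dy = ys[None, :, :] - joints_xy[:, 1, None, None]
    return np.exp(-(dx**2 + dy**2) / (2.0 * sigma**2)).astype(np.float32)

# Hypothetical pose: 14 random joints kept away from the image border.
rng = np.random.default_rng(0)
joints = rng.uniform(8, 56, size=(14, 2))
heatmaps = joints_to_heatmaps(joints)
print(heatmaps.shape)  # (14, 64, 64)
```

A wider sigma gives the network a softer, easier-to-locate target at the cost of spatial precision; the right value for 64×64 sprites is something to tune rather than assume.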

Then you concatenate those channels onto the noisy image the model is denoising. The U-Net stops seeing 3 input channels and starts seeing 3 + 14 = 17. That is the entire conditioning mechanism:

# noisy_image: (B, 3, 64, 64)  — what we are denoising
# heatmaps:    (B, 14, 64, 64) — one Gaussian blob per joint
model_input = torch.cat([noisy_image, heatmaps], dim=1)  # (B, 17, 64, 64)
noise_pred  = unet(model_input, timestep)                # predicts (B, 3, 64, 64)

No cross-attention, no adapter, no separate control branch. The pose rides along in the input, and because it is present at every denoising step and tells the network exactly where the limbs should be, the training loss gives the model every incentive to use it. Encoding joints as per-channel heatmaps instead of a rendered stick figure matters too: at 64 pixels a thin drawn line is fragile, but a dedicated channel per joint is a clean, unambiguous signal.

The training objective

The model is trained to do one boring, reliable thing: predict the noise. I take a real sprite, add a known amount of Gaussian noise following a cosine schedule, and ask the U-Net — given the noisy sprite, the timestep, and the pose channels — to predict the noise that was added. The loss is plain mean squared error between the predicted and actual noise. No GAN, no adversary, no perceptual loss. Just MSE, which is a big part of why diffusion training feels so stable compared to the GANs I have fought with before.
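The objective fits in a few lines. Below is a hedged NumPy sketch of the cosine schedule, the forward noising step, and the MSE loss. The schedule follows the commonly used cosine formulation with offset s = 0.008, which I am assuming here rather than quoting from the repo, and the function names and the all-zeros stand-in for the U-Net are purely illustrative.

```python
import numpy as np

def cosine_alpha_bar(num_steps=1000, s=0.008):
    """Cumulative signal level alpha_bar[t] for t = 0..num_steps (cosine schedule)."""
    t = np.arange(num_steps + 1) / num_steps
    f = np.cos((t + s) / (1 + s) * np.pi / 2) ** 2
    return f / f[0]  # alpha_bar[0] == 1 and decays towards 0

def add_noise(x0, t, alpha_bar, rng):
    """Forward process: x_t = sqrt(ab_t) * x0 + sqrt(1 - ab_t) * eps."""
    eps = rng.standard_normal(x0.shape).astype(x0.dtype)
    ab = alpha_bar[t].reshape(-1, 1, 1, 1).astype(x0.dtype)  # one value per batch item
    x_t = np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps
    return x_t, eps

def noise_mse(eps_pred, eps):
    """Plain mean squared error between predicted and actual noise."""
    return float(np.mean((eps_pred - eps) ** 2))

rng = np.random.default_rng(0)
alpha_bar = cosine_alpha_bar()

# Stand-in batch of "sprites" scaled to [-1, 1], with one random timestep each.
x0 = rng.uniform(-1, 1, size=(4, 3, 64, 64)).astype(np.float32)
t = rng.integers(1, 1001, size=4)
x_t, eps = add_noise(x0, t, alpha_bar, rng)

# A real training step would use the U-Net's prediction here; zeros are a placeholder.
loss = noise_mse(np.zeros_like(eps), eps)
print(x_t.shape, loss)
```

In an actual training loop the placeholder prediction is replaced by the network's output on the noisy sprite concatenated with the pose channels, and the same MSE is backpropagated.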

Free, infinite, perfectly labelled data

Training from scratch needs data, and good pose-labelled sprite data is annoying to find. So I sidestepped the problem entirely by generating the training data procedurally. A small renderer builds crude colored stick-figure characters from a randomized skeleton rig, and because I place the joints myself, every training image comes with perfect ground-truth joint positions for free.

This is a quietly powerful trick for a learning project. The dataset is effectively infinite, costs nothing, and carries zero labelling noise. Before touching real sprite art, I can prove the core claim — that skeleton conditioning actually controls the output — on data where I control every single variable.
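As a rough illustration of where those free labels come from, here is a small standard-library sketch of a randomized rig. Everything in it is hypothetical: the joint names, bone lengths, rest angles, and jitter range are my assumptions for a plausible 14-joint layout, not the rig used in the project.

```python
import math
import random

# Hypothetical 14-joint rig: (joint, parent, bone length in px, rest angle in degrees).
# Angles are measured from the +x axis with y pointing down (image coordinates).
# Parents are always listed before their children.
RIG = [
    ("neck",       None,         0,   0),
    ("head",       "neck",       8,  -90),
    ("l_shoulder", "neck",       6,  180),
    ("r_shoulder", "neck",       6,   0),
    ("l_elbow",    "l_shoulder", 8,   90),
    ("r_elbow",    "r_shoulder", 8,   90),
    ("l_wrist",    "l_elbow",    8,   90),
    ("r_wrist",    "r_elbow",    8,   90),
    ("l_hip",      "neck",       16, 100),
    ("r_hip",      "neck",       16,  80),
    ("l_knee",     "l_hip",      10,  90),
    ("r_knee",     "r_hip",      10,  90),
    ("l_ankle",    "l_knee",     10,  90),
    ("r_ankle",    "r_knee",     10,  90),
]

def random_pose(rng, root=(32.0, 16.0), jitter_deg=30.0):
    """Return {joint_name: (x, y)} with every bone rotated by a random offset."""
    joints = {}
    for name, parent, length, rest_deg in RIG:
        if parent is None:
            joints[name] = root
            continue
        angle = math.radians(rest_deg + rng.uniform(-jitter_deg, jitter_deg))
        px, py = joints[parent]
        joints[name] = (px + length * math.cos(angle), py + length * math.sin(angle))
    return joints

rng = random.Random(0)
pose = random_pose(rng)
for name, (x, y) in pose.items():
    print(f"{name:>10}: ({x:5.1f}, {y:5.1f})")
```

The same dictionary serves two purposes: a renderer can draw limbs between parent and child joints to produce the training image, and the coordinates themselves are the exact labels for the pose channels.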

Running it on a budget

All the real training happens on a free Colab T4, and that constraint shaped the engineering more than I expected: mixed-precision training to fit inside roughly 15 GB of usable VRAM, an EMA copy of the weights for cleaner samples, and frequent checkpointing because a free Colab session can disconnect at any moment. Sampling uses a fast DDIM loop, roughly 50 steps instead of the full 1000, so I can actually look at results without waiting around.
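For readers who have not seen DDIM written out, here is a minimal NumPy sketch of the deterministic (eta = 0) update over 50 evenly spaced timesteps. It is a sanity check of the update rule, not the project's sampler: the function names are assumptions, the small floor on alpha_bar is my addition for numerical safety, and an oracle that already knows the true noise stands in for the trained U-Net.

```python
import numpy as np

def cosine_alpha_bar(num_steps=1000, s=0.008, floor=1e-5):
    """Cosine schedule for alpha_bar[0..num_steps], floored to avoid dividing by ~0."""
    t = np.arange(num_steps + 1) / num_steps
    f = np.cos((t + s) / (1 + s) * np.pi / 2) ** 2
    return np.clip(f / f[0], floor, 1.0)

def ddim_timesteps(num_train_steps=1000, num_sample_steps=50):
    """Evenly spaced timesteps from num_train_steps down to 0 (inclusive)."""
    return np.linspace(num_train_steps, 0, num_sample_steps + 1).round().astype(int)

def ddim_step(x_t, eps_pred, ab_t, ab_prev):
    """One deterministic DDIM update: estimate x0, then re-noise it to the earlier level."""
    x0_pred = (x_t - np.sqrt(1.0 - ab_t) * eps_pred) / np.sqrt(ab_t)
    return np.sqrt(ab_prev) * x0_pred + np.sqrt(1.0 - ab_prev) * eps_pred

rng = np.random.default_rng(0)
alpha_bar = cosine_alpha_bar()
steps = ddim_timesteps()  # 1000, 980, ..., 20, 0

# Build a noisy starting point from a known image so the oracle has a correct answer.
x0 = rng.uniform(-1, 1, size=(3, 64, 64))
eps = rng.standard_normal(x0.shape)
x = np.sqrt(alpha_bar[steps[0]]) * x0 + np.sqrt(1.0 - alpha_bar[steps[0]]) * eps

for t, t_prev in zip(steps[:-1], steps[1:]):
    eps_pred = eps  # oracle stand-in; a real sampler would call the U-Net on (x, pose, t)
    x = ddim_step(x, eps_pred, alpha_bar[t], alpha_bar[t_prev])

print(np.abs(x - x0).max())  # tiny: with a perfect noise prediction the loop recovers x0
```

A real sampling run starts from pure Gaussian noise instead of a noised known image, and the quality of the result depends entirely on how good the network's noise predictions are at each of those 50 steps.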

What I took away from it

• Conditioning can be almost embarrassingly simple. Concatenating channels is enough to get real control; you do not need attention or a second network to get started.
• Representation beats cleverness. Per-joint heatmaps worked far better than a drawn skeleton, and that one choice mattered more than any hyperparameter I tuned.
• Predicting noise is what makes diffusion pleasant. The training loop is stable and forgiving in a way GANs never were for me.
• Synthetic data is a superpower for learning. Perfect labels and infinite samples let you debug the idea instead of the dataset.

Try it yourself

The whole thing is open source and deliberately readable — the U-Net, the diffusion schedule, the data renderer, and the training loop are each a small, self-contained module. If you are trying to understand diffusion by building rather than reading, it is a friendly place to start. The code lives on GitHub: github.com/bilaltahseen/pose-sprite-diffusion.

Where it goes next

The procedural stick figures were always meant to be a proving ground. Now that skeleton-to-sprite converges reliably, the obvious next step is swapping the crude renderer for real sprite art and seeing how far the same dead-simple conditioning carries. If you build something with it, or you have ideas to push it further, I would genuinely love to hear them.
