Introducing AI Voxel Generation: From Prompt to 3D in Seconds

We built an AI pipeline that turns a single text prompt into a fully formed, game-ready voxel model. Here's how it works under the hood.

Voxel AI generation has been in the works for a while, and today it’s live. Type a prompt like “a medieval tower with a wooden gate” and the system spins up a 3D voxel model ready to drop into your scene or export to .vox.

How it works

The pipeline runs in three stages:

1  Image generation   Prompt → voxel-tuned image model → multi-angle sprite sheet
2  Voxelization       Depth estimation + voxel projection → raw 3D grid
3  Post-processing    Flood-fill interior + CIEDE2000 palette reduction → clean model

The full round trip takes 8–15 seconds, depending on model resolution.

  1. Image generation: your prompt is sent to an image model tuned on voxel-style references, producing a consistent multi-angle sprite sheet.
  2. Voxelization: a custom depth-estimation and voxel-projection step converts the 2D sprite sheet into a 3D grid at your target resolution.
  3. Post-processing: the raw voxel grid is cleaned, palette-reduced to your active color set, and surfaced as a live preview in the editor.
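
To make the voxel-projection idea in step 2 concrete, here is a minimal sketch of projecting one depth map into a grid. It is not the production code: the function name `project_depth_map`, the depth convention (0.0 = nearest, 1.0 = farthest, `None` = background), and the front/back handling are all assumptions made for illustration.

```python
def project_depth_map(depth_map, grid_depth, from_back=False):
    """Project a per-pixel depth map into a set of (x, y, z) voxel coordinates.

    Assumed conventions (hypothetical, not taken from the real pipeline):
    - depth_map is a list of rows; each entry is a float in [0, 1]
      (0 = nearest to the camera) or None for background pixels.
    - The front camera looks along +z; the back camera looks along -z,
      so its image is mirrored horizontally.
    """
    voxels = set()
    for y, row in enumerate(depth_map):
        width = len(row)
        for x, d in enumerate(row):
            if d is None:
                continue  # background pixel: nothing to place
            # Scale normalized depth to a grid index and clamp it into range.
            z = min(grid_depth - 1, max(0, round(d * (grid_depth - 1))))
            if from_back:
                # Seen from behind: flip left/right and measure from the far side.
                voxels.add((width - 1 - x, y, grid_depth - 1 - z))
            else:
                voxels.add((x, y, z))
    return voxels


front = [[0.0, 1.0],
         [None, 0.5]]
print(sorted(project_depth_map(front, grid_depth=5)))
# [(0, 0, 0), (1, 0, 4), (1, 1, 2)]
```

Each view places only the surface it can see, so merging the views produces a shell rather than a solid, which is why the interior has to be filled afterwards.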

What you can generate

Almost any object works well: characters, props, architecture, and environmental pieces. Abstract concepts and very fine text don’t voxelize cleanly — use the manual editor to refine those.

[Figure, image to be added: generation examples for a character, a prop, and architecture]
Sample outputs from a single prompt across different model categories.

Generation tokens

Each generation consumes tokens from your account balance:

Action            Token cost
Image generation  1 token
Voxelization      3 tokens
Save to cloud     0 tokens

Free accounts start with 50 tokens. Builder and Pro plans refill tokens monthly.
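
As a quick worked example of the costs above, the sketch below adds up one full generation and checks how far the free balance goes. It assumes a full generation runs image generation and voxelization exactly once each; whether retries or re-voxelizing cost extra is not covered here. The names `TOKEN_COSTS`, `generation_cost`, and `generations_affordable` are illustrative only.

```python
# Token costs from the table above.
TOKEN_COSTS = {
    "image_generation": 1,
    "voxelization": 3,
    "save_to_cloud": 0,
}


def generation_cost(actions=("image_generation", "voxelization", "save_to_cloud")):
    """Total tokens for one run of the given actions (assumed: each runs once)."""
    return sum(TOKEN_COSTS[action] for action in actions)


def generations_affordable(balance):
    """Return (full generations, leftover tokens) for a given balance."""
    return divmod(balance, generation_cost())


print(generation_cost())           # 4
print(generations_affordable(50))  # (12, 2)
```

Under that assumption, the 50 starting tokens on a free account cover 12 full generations, with 2 tokens left over.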

Under the hood

The voxelization step was the hardest part. Early versions produced hollow shells: the outer faces looked right but the interior was empty, which caused problems in game engines that rely on solid geometry for collision. A flood-fill pass in post-processing now fills the interior so the model is solid.

[Figure, image to be added: hollow shell vs solid fill, before and after flood-fill]
Left: hollow voxel output (broken collision). Right: flood-filled solid model.
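
One common way to fill a hollow shell is to flood-fill the empty space from the outside of the grid and treat every cell the flood cannot reach as interior. The sketch below shows that approach under assumed conventions (voxels as a set of integer coordinates, 6-connected neighbors); the name `fill_interior` is hypothetical and the production pass may differ.

```python
from collections import deque

# 6-connected neighborhood: cells that share a face.
NEIGHBORS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def fill_interior(filled, size):
    """Return the filled voxels plus every empty cell enclosed by them.

    filled: set of (x, y, z) occupied cells; size: (sx, sy, sz) grid extent.
    """
    sx, sy, sz = size
    exterior = set()
    queue = deque()

    # Seed the flood with every empty cell on the boundary of the grid.
    for x in range(sx):
        for y in range(sy):
            for z in range(sz):
                on_boundary = x in (0, sx - 1) or y in (0, sy - 1) or z in (0, sz - 1)
                if on_boundary and (x, y, z) not in filled:
                    exterior.add((x, y, z))
                    queue.append((x, y, z))

    # Breadth-first flood through empty cells reachable from outside.
    while queue:
        x, y, z = queue.popleft()
        for dx, dy, dz in NEIGHBORS:
            n = (x + dx, y + dy, z + dz)
            in_bounds = 0 <= n[0] < sx and 0 <= n[1] < sy and 0 <= n[2] < sz
            if in_bounds and n not in filled and n not in exterior:
                exterior.add(n)
                queue.append(n)

    # Anything that is not exterior is either shell or enclosed interior.
    return {
        (x, y, z)
        for x in range(sx) for y in range(sy) for z in range(sz)
        if (x, y, z) not in exterior
    }


# A 3x3x3 cube with its center missing, inside a 5x5x5 grid.
shell = {(x, y, z) for x in range(1, 4) for y in range(1, 4) for z in range(1, 4)
         if (x, y, z) != (2, 2, 2)}
solid = fill_interior(shell, (5, 5, 5))
print(len(shell), len(solid))  # 26 27
```

Flooding from the outside means a shell with a gap in it is left unfilled, because the flood leaks in through the gap; a shell with holes would need to be closed first.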

The palette reduction step maps generated colors to your active palette using perceptual distance (CIEDE2000) rather than Euclidean distance in RGB. Matching on perceived difference gives noticeably better results, especially for skin tones and wood textures.
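
For readers who want to see what CIEDE2000 matching involves, here is a self-contained sketch. It converts 8-bit sRGB colors to CIELAB (D65 white point) and picks the palette entry with the smallest CIEDE2000 difference. The color-difference formula is the standard published one with all weighting factors set to 1; the function names and the brute-force nearest-color search are assumptions for illustration, not the editor's actual implementation.

```python
import math


def srgb_to_lab(rgb):
    """Convert an 8-bit sRGB triple to CIELAB using the D65 white point."""
    def linearize(c):
        c = c / 255.0
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

    r, g, b = (linearize(c) for c in rgb)
    x = (0.4124564 * r + 0.3575761 * g + 0.1804375 * b) / 0.95047
    y = 0.2126729 * r + 0.7151522 * g + 0.0721750 * b
    z = (0.0193339 * r + 0.1191920 * g + 0.9503041 * b) / 1.08883

    def f(t):
        return t ** (1 / 3) if t > 216 / 24389 else (24389 / 27 * t + 16) / 116

    fx, fy, fz = f(x), f(y), f(z)
    return (116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz))


def ciede2000(lab1, lab2):
    """CIEDE2000 color difference between two CIELAB colors (kL = kC = kH = 1)."""
    L1, a1, b1 = lab1
    L2, a2, b2 = lab2
    c_bar = (math.hypot(a1, b1) + math.hypot(a2, b2)) / 2
    g = 0.5 * (1 - math.sqrt(c_bar ** 7 / (c_bar ** 7 + 25 ** 7)))
    a1p, a2p = (1 + g) * a1, (1 + g) * a2
    c1p, c2p = math.hypot(a1p, b1), math.hypot(a2p, b2)
    h1p = math.degrees(math.atan2(b1, a1p)) % 360
    h2p = math.degrees(math.atan2(b2, a2p)) % 360

    d_l = L2 - L1
    d_c = c2p - c1p
    if c1p * c2p == 0:
        d_h_deg = 0.0
        h_bar = h1p + h2p
    else:
        d_h_deg = h2p - h1p
        if d_h_deg > 180:
            d_h_deg -= 360
        elif d_h_deg < -180:
            d_h_deg += 360
        if abs(h1p - h2p) <= 180:
            h_bar = (h1p + h2p) / 2
        elif h1p + h2p < 360:
            h_bar = (h1p + h2p + 360) / 2
        else:
            h_bar = (h1p + h2p - 360) / 2
    d_h = 2 * math.sqrt(c1p * c2p) * math.sin(math.radians(d_h_deg / 2))

    l_bar = (L1 + L2) / 2
    c_bar_p = (c1p + c2p) / 2
    t = (1 - 0.17 * math.cos(math.radians(h_bar - 30))
         + 0.24 * math.cos(math.radians(2 * h_bar))
         + 0.32 * math.cos(math.radians(3 * h_bar + 6))
         - 0.20 * math.cos(math.radians(4 * h_bar - 63)))
    d_theta = 30 * math.exp(-((h_bar - 275) / 25) ** 2)
    r_c = 2 * math.sqrt(c_bar_p ** 7 / (c_bar_p ** 7 + 25 ** 7))
    s_l = 1 + 0.015 * (l_bar - 50) ** 2 / math.sqrt(20 + (l_bar - 50) ** 2)
    s_c = 1 + 0.045 * c_bar_p
    s_h = 1 + 0.015 * c_bar_p * t
    r_t = -math.sin(math.radians(2 * d_theta)) * r_c

    return math.sqrt((d_l / s_l) ** 2 + (d_c / s_c) ** 2 + (d_h / s_h) ** 2
                     + r_t * (d_c / s_c) * (d_h / s_h))


def reduce_to_palette(colors, palette):
    """Map each sRGB color to the perceptually nearest palette entry."""
    palette_lab = [(p, srgb_to_lab(p)) for p in palette]
    result = []
    for color in colors:
        lab = srgb_to_lab(color)
        nearest = min(palette_lab, key=lambda entry: ciede2000(lab, entry[1]))
        result.append(nearest[0])
    return result


palette = [(255, 0, 0), (0, 0, 255)]
print(reduce_to_palette([(250, 5, 5)], palette))  # [(255, 0, 0)]
```

The search compares every color against every palette entry, which is fine for small palettes; caching the Lab conversion of repeated colors would be a natural next step for large models.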

What’s next

We’re working on:

  • Scene generation — generate entire environments from a single description
  • Style transfer — match the palette and style of an existing model in your project
  • Animation hints — tag bone attachment points at generation time for rigging

Try it now in the editor. Feedback welcome on Discord.