Introducing AI Voxel Generation: From Prompt to 3D in Seconds

We built an AI pipeline that turns a single text prompt into a fully-formed, game-ready voxel model. Here's how it works under the hood.

Voxel Team | May 15, 2026 | 3 min read

Voxel AI generation has been in the works for a while, and today it’s live. Type a prompt like “a medieval tower with a wooden gate” and the system spins up a 3D voxel model ready to drop into your scene or export to .vox.

How it works

The pipeline runs in three stages:

1. Image generation: prompt → voxel-tuned image model → multi-angle sprite sheet

2. Voxelization: depth estimation + voxel projection → raw 3D grid

3. Post-processing: flood-fill interior + CIEDE2000 palette reduction → clean model

The full round trip takes 8–15 seconds, depending on model resolution.

Image generation: your prompt is sent to an image model tuned on voxel-style references, which produces a consistent multi-angle sprite sheet.
Voxelization: a custom depth-estimation and voxel-projection step converts the 2D sprite sheet into a 3D grid at your target resolution.
Post-processing: the raw voxel grid is flood-filled and cleaned, palette-reduced to your active color set, and surfaced as a live preview in the editor.
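
To make the voxelization stage more concrete, here is a minimal Python sketch of projecting a single front-view depth map into a voxel grid. It is an illustration under stated assumptions, not the production implementation: the function name `project_depth_to_voxels`, the normalized depth range of 0 to 1, and the use of `None` for background pixels are all hypothetical, and a real pipeline would combine several views from the sprite sheet.

```python
def project_depth_to_voxels(depth, size):
    """Project a front-view depth map into a size x size x size voxel grid.

    depth: 2D list indexed [row][col]. Each entry is a float in [0, 1]
           (0 = nearest, 1 = farthest) or None for background pixels.
           This encoding is an assumption made for the sketch.
    Returns a set of occupied (x, y, z) voxel coordinates.
    """
    rows, cols = len(depth), len(depth[0])
    voxels = set()
    for r, row in enumerate(depth):
        for c, d in enumerate(row):
            if d is None:
                continue  # background pixel: nothing to project
            x = min(size - 1, c * size // cols)
            # Row 0 is the top of the image, so flip it to make y point up.
            y = min(size - 1, (rows - 1 - r) * size // rows)
            z = min(size - 1, int(d * size))
            voxels.add((x, y, z))
    return voxels


depth_map = [
    [None, 0.0],
    [0.5, 1.0],
]
print(sorted(project_depth_to_voxels(depth_map, size=2)))
```

A projection like this only marks the visible surface, one voxel per pixel, so the result is a shell rather than a solid. That is why the post-processing stage has to fill the interior.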

What you can generate

Almost any object works well: characters, props, architecture, and environmental pieces. Abstract concepts and very fine text don’t voxelize cleanly — use the manual editor to refine those.

[Image to be added: generation examples (character, prop, architecture)]

Sample outputs from a single prompt across different model categories.

Generation tokens

Each generation consumes tokens from your account balance:

Action	Token cost
Image generation	1 token
Voxelization	3 tokens
Save to cloud	0 tokens

Free accounts start with 50 tokens. Builder and Pro plans refill tokens monthly.
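
As a quick back-of-the-envelope check, the snippet below works out how many complete models a balance covers. It assumes one full prompt-to-model run triggers image generation once and voxelization once. That reading of the table, and the names `TOKEN_COSTS`, `cost_per_model`, and `models_affordable`, are assumptions for illustration, not part of any published API.

```python
# Token costs copied from the table above.
TOKEN_COSTS = {"image_generation": 1, "voxelization": 3, "save_to_cloud": 0}


def cost_per_model(save_to_cloud=True):
    """Tokens for one full run, assuming each stage runs exactly once."""
    cost = TOKEN_COSTS["image_generation"] + TOKEN_COSTS["voxelization"]
    if save_to_cloud:
        cost += TOKEN_COSTS["save_to_cloud"]
    return cost


def models_affordable(balance, save_to_cloud=True):
    """Return (complete models, leftover tokens) for a given balance."""
    return divmod(balance, cost_per_model(save_to_cloud))


print(cost_per_model())       # 4
print(models_affordable(50))  # (12, 2)
```

Under that assumption, the 50 starting tokens on a free account cover 12 complete models with 2 tokens left over.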

Under the hood

The voxelization step was the hardest part. Early versions produced hollow shells — the outer faces looked right but the interior was empty, which caused problems in game engines that rely on solid geometry for collision.

[Image to be added: hollow shell vs. solid fill, before and after flood-fill]

Left: hollow voxel output (broken collision). Right: flood-filled solid model.
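
One common way to fill a hollow shell, sketched here under assumptions, is to flood-fill the empty space starting from the grid boundary and then treat every empty cell the flood could not reach as interior. The function name `fill_interior`, the set-of-coordinates representation, and the 6-connected neighborhood are choices made for this example. The post does not specify how the production fill is implemented.

```python
from collections import deque

# 6-connected neighborhood: cells that share a face.
NEIGHBORS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def fill_interior(shell, size):
    """Return a solid voxel set for a shell inside a size^3 grid.

    shell: set of occupied (x, y, z) coordinates.
    Empty cells reachable from the grid boundary are 'outside'.
    Everything else, shell or enclosed empty space, becomes solid.
    """
    edge = (0, size - 1)
    outside = set()
    queue = deque()
    # Seed the flood with every empty cell on the boundary of the grid.
    for x in range(size):
        for y in range(size):
            for z in range(size):
                on_boundary = x in edge or y in edge or z in edge
                if on_boundary and (x, y, z) not in shell:
                    outside.add((x, y, z))
                    queue.append((x, y, z))
    # Breadth-first flood through empty space.
    while queue:
        x, y, z = queue.popleft()
        for dx, dy, dz in NEIGHBORS:
            n = (x + dx, y + dy, z + dz)
            if (all(0 <= v < size for v in n)
                    and n not in shell and n not in outside):
                outside.add(n)
                queue.append(n)
    return {
        (x, y, z)
        for x in range(size) for y in range(size) for z in range(size)
        if (x, y, z) not in outside
    }


# A 3x3x3 cube with its center voxel missing, placed inside a 5^3 grid.
shell = {(x, y, z) for x in range(1, 4) for y in range(1, 4) for z in range(1, 4)}
shell.discard((2, 2, 2))
solid = fill_interior(shell, 5)
print(len(shell), len(solid))  # 26 27
```

Flooding from the outside, rather than from a guessed interior seed, means no seed point has to be chosen. The trade-off is that any gap in the shell lets the flood leak in, so holes in the surface need to be closed first.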

The palette reduction step maps each generated color to the nearest color in your active palette using a perceptual distance metric (CIEDE2000) rather than Euclidean distance in RGB. CIEDE2000 is designed to track how different two colors actually look, and the results are significantly better, especially for skin tones and wood textures.
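
For readers who want to experiment with perceptual palette matching, the following is a self-contained Python sketch: it converts sRGB to CIELAB, computes the standard CIEDE2000 difference, and picks the nearest palette entry. The function names (`srgb_to_lab`, `ciede2000`, `nearest_palette_color`) and the D65 white point are assumptions for this example. It shows the general technique rather than the editor's actual implementation.

```python
import math


def srgb_to_lab(rgb):
    """Convert an 8-bit sRGB triple to CIELAB (D65 white point assumed)."""
    def linearize(c):
        c = c / 255.0
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

    r, g, b = (linearize(c) for c in rgb)
    # Linear sRGB to XYZ, each channel divided by the D65 reference white.
    x = (0.4124564 * r + 0.3575761 * g + 0.1804375 * b) / 0.95047
    y = (0.2126729 * r + 0.7151522 * g + 0.0721750 * b) / 1.00000
    z = (0.0193339 * r + 0.1191920 * g + 0.9503041 * b) / 1.08883

    def f(t):
        return t ** (1 / 3) if t > 216 / 24389 else (24389 / 27 * t + 16) / 116

    fx, fy, fz = f(x), f(y), f(z)
    return (116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz))


def cos_deg(angle):
    return math.cos(math.radians(angle))


def ciede2000(lab1, lab2):
    """CIEDE2000 difference between two CIELAB triples (kL = kC = kH = 1)."""
    L1, a1, b1 = lab1
    L2, a2, b2 = lab2
    c_bar = (math.hypot(a1, b1) + math.hypot(a2, b2)) / 2
    g = 0.5 * (1 - math.sqrt(c_bar**7 / (c_bar**7 + 25.0**7)))
    a1p, a2p = (1 + g) * a1, (1 + g) * a2
    c1p, c2p = math.hypot(a1p, b1), math.hypot(a2p, b2)
    h1p = math.degrees(math.atan2(b1, a1p)) % 360 if c1p else 0.0
    h2p = math.degrees(math.atan2(b2, a2p)) % 360 if c2p else 0.0

    # Lightness, chroma, and hue differences.
    d_l = L2 - L1
    d_c = c2p - c1p
    d_h_deg = 0.0 if c1p * c2p == 0 else h2p - h1p
    if d_h_deg > 180:
        d_h_deg -= 360
    elif d_h_deg < -180:
        d_h_deg += 360
    d_h = 2 * math.sqrt(c1p * c2p) * math.sin(math.radians(d_h_deg / 2))

    # Mean lightness, chroma, and hue.
    l_bar = (L1 + L2) / 2
    c_bar_p = (c1p + c2p) / 2
    if c1p * c2p == 0:
        h_bar = h1p + h2p
    elif abs(h1p - h2p) <= 180:
        h_bar = (h1p + h2p) / 2
    elif h1p + h2p < 360:
        h_bar = (h1p + h2p + 360) / 2
    else:
        h_bar = (h1p + h2p - 360) / 2

    # Weighting functions and the rotation term for the blue region.
    t = (1 - 0.17 * cos_deg(h_bar - 30) + 0.24 * cos_deg(2 * h_bar)
         + 0.32 * cos_deg(3 * h_bar + 6) - 0.20 * cos_deg(4 * h_bar - 63))
    s_l = 1 + 0.015 * (l_bar - 50) ** 2 / math.sqrt(20 + (l_bar - 50) ** 2)
    s_c = 1 + 0.045 * c_bar_p
    s_h = 1 + 0.015 * c_bar_p * t
    d_theta = 30 * math.exp(-(((h_bar - 275) / 25) ** 2))
    r_c = 2 * math.sqrt(c_bar_p**7 / (c_bar_p**7 + 25.0**7))
    r_t = -math.sin(math.radians(2 * d_theta)) * r_c

    return math.sqrt((d_l / s_l) ** 2 + (d_c / s_c) ** 2 + (d_h / s_h) ** 2
                     + r_t * (d_c / s_c) * (d_h / s_h))


def nearest_palette_color(rgb, palette):
    """Return the palette entry with the smallest CIEDE2000 distance to rgb."""
    lab = srgb_to_lab(rgb)
    return min(palette, key=lambda p: ciede2000(lab, srgb_to_lab(p)))


palette = [(255, 0, 0), (0, 255, 0), (0, 0, 255)]
print(nearest_palette_color((250, 10, 10), palette))
```

In practice you would convert the palette to CIELAB once and reuse it for every voxel, since the conversion is repeated here only to keep the sketch short.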

What’s next

We’re working on:

Scene generation — generate entire environments from a single description
Style transfer — match the palette and style of an existing model in your project
Animation hints — tag bone attachment points at generation time for rigging

Try it now in the editor. Feedback welcome on Discord.