Running NVIDIA SANA-WM on RTX 5090 and H100

NVIDIA SANA-WM is a 2.6B-parameter open-source world model that generates minute-scale 720p videos with precise 6-DoF camera control. Unlike standard text-to-video models, SANA-WM takes a single image and a camera trajectory as input (plus a text prompt in the runs below) and produces first-person walkthrough videos.

We tested SANA-WM on two GPU configurations:

RTX 5090 (32GB VRAM) — RunPod community cloud
H100 SXM (80GB VRAM) — RunPod secure cloud

Our orchestration framework sana_WM handles the full setup, inference, and result download over SSH from a local WSL environment.

Input Images

We used four diverse scenes to test SANA-WM’s capabilities:

Mansion — Grand interior entrance

Hiking Trail — Mountain forest path

Mars Outpost — Futuristic research station

Modern House — Minimalist interior

RTX 5090 Results (32GB VRAM)

With 32GB of VRAM, the RTX 5090 can run only stage-1 generation. The second-stage refiner, which needs ~75GB of VRAM, does not fit, so we ran with --no_refiner --offload_vae.
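The choice between the full pipeline and stage-1 only can be sketched as a small helper. This is an illustration, not part of SANA-WM or sana_WM: `inference_flags` and `REFINER_VRAM_GB` are hypothetical names, the ~75GB threshold is the figure quoted above, and the two flags are simply the ones we passed on the RTX 5090.

```python
# Hypothetical helper: choose extra inference flags from available VRAM.
# The 75 GB threshold is the approximate refiner requirement quoted in this post.
REFINER_VRAM_GB = 75


def inference_flags(vram_gb: float) -> list[str]:
    """Return the extra CLI flags to pass for a GPU with `vram_gb` of memory."""
    if vram_gb >= REFINER_VRAM_GB:
        return []  # enough memory for the full two-stage pipeline
    # Not enough for the refiner: stage-1 only, same flags as our RTX 5090 runs.
    return ["--no_refiner", "--offload_vae"]


print(inference_flags(32))  # RTX 5090
print(inference_flags(80))  # H100 SXM
```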

Mansion Walkthrough — Progressive Duration Test

5 Seconds (81 frames)

Prompt: “Cinematic first-person walk through grand mansion entrance” | Action: w-40 | Gen time: ~35s

10 Seconds (161 frames)

Prompt: “Cinematic first-person walk through grand mansion entrance” | Action: w-80,d-40,w-40 | Gen time: ~1m 15s

20 Seconds (321 frames)

Prompt: “Cinematic first-person walk through grand mansion entrance” | Action: w-100,d-30,w-60 | Gen time: ~3m 30s

Why RTX 5090 Fails at 30+ Seconds

Attempting a 30-second generation (481 frames) on the RTX 5090 ends in a CUDA out-of-memory error:

torch.OutOfMemoryError: CUDA out of memory. 
Tried to allocate 1.2 GiB. GPU has 31.74 GiB total, 
28.6 GiB used by model + latents.

Key limitations:

Stage-1 only (no refiner) → visible artifacts and temporal drift
32GB VRAM exhausted at ~400 latent frames
Longer videos show increasing blur and scene morphing
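The frame counts in the headings above follow one pattern: 16 frames per second plus one initial frame. That rate is inferred from the numbers in this post (81, 161, 321, 481), not taken from SANA-WM documentation, and the helper names are hypothetical. A minimal sketch that also checks each duration against the rough ~400-frame ceiling:

```python
FPS = 16                    # inferred from the frame counts in this post (assumption)
RTX5090_FRAME_LIMIT = 400   # approximate ceiling from the list above


def frames_for_duration(seconds: int, fps: int = FPS) -> int:
    """Frames for a clip of `seconds`: fps * seconds plus the initial frame."""
    return fps * seconds + 1


def fits_rtx5090(seconds: int) -> bool:
    """True if the clip stays under the approximate RTX 5090 frame ceiling."""
    return frames_for_duration(seconds) <= RTX5090_FRAME_LIMIT


for seconds in (5, 10, 20, 30):
    print(seconds, frames_for_duration(seconds), fits_rtx5090(seconds))
```

Under these assumptions the 20-second clip (321 frames) fits and the 30-second clip (481 frames) does not, which matches what we observed.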

H100 Results (80GB VRAM)

The H100 SXM runs the full two-stage pipeline, including the 17B-parameter refiner, and produces noticeably sharper and more stable output.

Mansion Walkthrough — Progressive Duration Test

5 Seconds (81 frames) — With Refiner

Prompt: “Cinematic first-person walk through grand mansion entrance” | Action: w-40 | Gen time: ~1m (stage-1: 31s, refiner: 1s)

10 Seconds (161 frames) — With Refiner

Prompt: “Cinematic first-person walk through grand mansion entrance” | Action: w-80,d-40,w-40 | Gen time: ~2m (stage-1: 55s, refiner: 5s)

30 Seconds (481 frames) — With Refiner ✓

Prompt: “Cinematic first-person walk through grand mansion entrance” | Action: w-100,d-20,w-100,a-20,w-100 | Gen time: ~4m (stage-1: 1m 45s, refiner: 14s)

Key improvements over RTX 5090:

✅ Sharper textures with refiner
✅ Reduced temporal drift
✅ 30-second generation possible
✅ Better scene consistency

H100 — Other Scenes (20 Seconds)

Hiking Trail

Prompt: “First-person hike along mountain trail with forest views” | Action: w-100,a-10,w-60 | Gen time: ~2m

Mars Outpost

Prompt: “Approach to futuristic Mars research station on red planet surface” | Action: w-120,d-20,w-40 | Gen time: ~2m

Modern House

Prompt: “Smooth forward walk through modern minimalist house interior” | Action: w-80,d-15,w-80 | Gen time: ~2m

Experimental: Long-Duration Video (40s)

We attempted a 60-second Mars interior walkthrough, but hit OOM during the refiner stage (~80GB required for 961 latent frames). Reducing to 40 seconds (641 frames) succeeded.

Mars Interior — 40 Seconds

Prompt: “First-person camera approach to Mars outpost entrance, door slides open revealing modern interior lounge with panoramic windows, glide through living area toward open-plan kitchen, pan right to wall-mounted board with sticky notes” | Action: w-150,d-30,w-100,d-60,w-100 | Gen time: ~5m

60s OOM Analysis:

VRAM breakdown for 60s (961 frames):
- Stage-1 model: ~15GB
- Refiner (17B LTX-2): ~25GB
- 961 latent frames: ~40GB
- Total: ~80GB (at limit!)
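The same breakdown can be turned into a rough budget for other durations. The sketch below assumes latent memory scales linearly with frame count, which is a simplification (activations, allocator fragmentation, and decode buffers are ignored), and `estimated_vram_gb` is a hypothetical helper built only from the approximate figures above.

```python
# Figures from the 60 s breakdown above, in GB; all approximate.
STAGE1_GB = 15
REFINER_GB = 25
LATENTS_GB_AT_961 = 40
H100_VRAM_GB = 80


def estimated_vram_gb(frames: int) -> float:
    """Estimate peak VRAM, assuming latent memory grows linearly with frames."""
    latents_gb = LATENTS_GB_AT_961 * frames / 961
    return STAGE1_GB + REFINER_GB + latents_gb


for frames in (641, 961):
    total = estimated_vram_gb(frames)
    headroom = H100_VRAM_GB - total
    print(f"{frames} frames: ~{total:.1f} GB, headroom ~{headroom:.1f} GB")
```

On this estimate the 40-second run (641 frames) needs roughly 67GB and leaves some headroom, while the 60-second run sits at the 80GB limit with none.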

Experimental: 360° Panoramic Rotation

We tested whether SANA-WM could generate a stationary 360° rotation — the camera fixed in place, rotating to reveal the full environment.

Hiking Trail — 360° Attempt

Prompt: “Stationary camera fixed at one point on mountain hiking trail, smooth 360-degree panoramic rotation revealing complete surrounding environment, seamless circular sweep returning to exact starting viewpoint with no forward or backward movement” | Action: d-90,d-90,d-90,d-90

Modern House — 360° Attempt

Prompt: “Stationary camera positioned in center of modern minimalist house interior, smooth 360-degree panoramic rotation from fixed viewpoint, revealing complete open-plan layout, seamless circular sweep returning to exact starting position with no camera movement” | Action: d-90,d-90,d-90,d-90

Why 360° Rotation Fails

Both attempts show the camera drifting horizontally rather than rotating in place. We see three likely causes:

Training data bias: SANA-WM was trained primarily on egocentric walkthrough videos (forward movement + turns), not tripod-style panoramas
Action string semantics: The d-N action encodes “turn right N degrees while walking” — there’s no pure “rotate in place” primitive
Temporal model limitations: The model hallucinates forward motion to maintain temporal coherence with its training distribution
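To make the action-string point concrete, here is a minimal parser sketch. It is not SANA-WM's own parser: `parse_actions` and `total_turn` are hypothetical helpers, the "N is degrees" reading of d-N is the one described above, and treating a-N as the mirror-image left turn is an assumption.

```python
def parse_actions(actions: str) -> list[tuple[str, int]]:
    """Split a string like 'w-80,d-40,w-40' into (key, amount) pairs."""
    steps = []
    for token in actions.split(","):
        key, _, amount = token.strip().partition("-")
        steps.append((key, int(amount)))
    return steps


def total_turn(actions: str) -> int:
    """Net turn: 'd' amounts (right) minus 'a' amounts (left, assumed)."""
    turn = 0
    for key, amount in parse_actions(actions):
        if key == "d":
            turn += amount
        elif key == "a":
            turn -= amount
    return turn


print(parse_actions("w-80,d-40,w-40"))
print(total_turn("d-90,d-90,d-90,d-90"))
```

The 360° string sums to a full turn, but every step in it is still a d action, so under this reading there is nothing in the input that asks the camera to stay put.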

Workaround: For true 360° panoramas, consider NVIDIA Lyra 2.0, which explicitly supports orbital trajectories.

Summary: RTX 5090 vs H100

Metric	RTX 5090 (32GB)	H100 SXM (80GB)
Max video duration	~20s	~40s
Refiner enabled	❌ No	✅ Yes
Typical quality	Blurry, drift	Sharp, stable
Cost (RunPod)	~$1.50/hr	~$3.00/hr
20s video cost	~$0.10	~$0.10
Best use case	Quick tests	Production

Recommendation: Use the H100 for any video over 10 seconds or where quality matters. The RTX 5090 remains viable for rapid prototyping up to about 20 seconds, its practical ceiling in our tests.
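The per-video cost figures in the table come from prorating the hourly rate over the generation time. A minimal sketch of that arithmetic, using the approximate rates and the 20-second generation times reported in this post (`generation_cost` is a hypothetical helper, and pod startup, model loading, and idle time are ignored):

```python
def generation_cost(hourly_rate_usd: float, gen_seconds: float) -> float:
    """Cost of one generation: hourly rate prorated over the generation time."""
    return hourly_rate_usd * gen_seconds / 3600


# 20 s video: ~3m 30s on the RTX 5090, ~2m on the H100 (times from this post).
rtx5090 = generation_cost(1.50, 3 * 60 + 30)
h100 = generation_cost(3.00, 2 * 60)
print(f"RTX 5090: ${rtx5090:.2f}  H100: ${h100:.2f}")
```

The H100 costs twice as much per hour but finishes the 20-second clip in a little over half the time, which is why the two per-video costs land within a cent or so of each other.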

Getting Started

Full setup instructions and config examples: github.com/tech-microcosm/sana_WM

# Clone and configure
git clone https://github.com/tech-microcosm/sana_WM.git
cd sana_WM
cp .env.example .env
# Edit .env with your RunPod pod IP and SSH port

# Run inference
python main.py infer -c config/examples/mansion_wm_simple.yaml \
    --host <pod-ip> --port <ssh-port>