Track2View: 4D-Consistent Camera-Controlled Video Generation via Paired 3D Point Tracks

1Washington University in St. Louis    2University of Copenhagen
Paper | arXiv | Code (coming soon) | Weights (coming soon)

Track2View re-renders a source video along an arbitrary target camera trajectory, using paired 3D point tracks as an explicit, temporally continuous link between the source and the novel view.

Abstract

TL;DR — Sparse 3D point tracks give video diffusion models an explicit, temporally continuous geometric link between source and target views.

Re-rendering an existing video from a novel camera viewpoint requires the output to follow the prescribed camera trajectory while preserving the appearance and dynamics of the original scene across every frame. Existing methods rely on per-frame pose embeddings, noisy point-cloud renderings, or implicit learned correspondences, none of which provides an explicit, temporally continuous link between source and target pixels. We propose Track2View, which conditions a video diffusion transformer on paired 3D point tracks: sparse trajectories of scene points projected into both the source and target camera views. These tracks provide explicit spatiotemporal correspondences that are temporally continuous by construction, encoding what content should appear where and when. At the core of Track2View is a dual-view track conditioner that transfers visual context from source to target view through parameter-free geometric operations and learned temporal aggregation, ensuring generalization to arbitrary camera trajectories without memorizing specific motions. We further introduce a data curation pipeline that extracts one-to-one track correspondences by running a 3D point tracker on temporally concatenated multi-camera view pairs. On a 400-video benchmark spanning static and dynamic scenes, Track2View achieves state-of-the-art results across visual quality, view synchronization, and camera accuracy, reducing rotation error by 30–65% and translation error by 61–72% relative to leading baselines.

Comparison with State-of-the-Art

Qualitative

Every method re-renders the same source video under the same target camera trajectory.

Videos shown: Source, Ours (Track2View), ReCamMaster, Gen3C.

Quantitative

Results on the RealCam-Vid benchmark (400 videos: 200 static from RealEstate10K, 200 dynamic from MiraData). Track2View is compared with each baseline at that baseline's native frame length. The percentages next to Track2View's camera errors give the reduction relative to the baseline at the same frame length.

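Column groups: Visual Quality (FID, CLIP-T, CLIP-F); View Sync. (CLIP-V, Mat.Pix.); Camera Accuracy (RotErr, TransErr).

| Method | Frames | FID↓ | CLIP-T↑ | CLIP-F↑ | CLIP-V↑ | Mat.Pix.↑ | RotErr°↓ | TransErr↓ |
|---|---|---|---|---|---|---|---|---|
| Trajectory Attention* | 25 | 38.25 | 28.57 | 98.49 | 94.87 | 0.826 | 3.54 | 0.309 |
| Track2View (Ours) | 25 | 33.95 | 29.06 | 99.28 | 95.81 | 1.070 | 1.24 (−65%) | 0.085 (−72%) |
| TrajectoryCrafter* | 49 | 29.34 | 28.87 | 98.70 | 94.70 | 0.737 | 2.12 | 0.721 |
| Track2View (Ours) | 49 | 29.30 | 29.07 | 99.35 | 94.71 | 0.876 | 1.31 (−38%) | 0.280 (−61%) |
| Gen3C | 81 | 30.32 | 28.55 | 99.04 | 92.47 | 0.644 | 4.21 | 3.715 |
| ReCamMaster | 81 | 33.85 | 28.48 | 99.33 | 92.93 | 0.579 | 2.20 | 2.096 |
| Track2View (Ours) | 81 | 26.82 | 28.89 | 99.38 | 93.22 | 0.695 | 1.55 (−30%) | 0.818 (−61%) |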

* Camera accuracy evaluated in I2V mode (trajectory relative to the source frame).
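
The percentages beside Track2View's camera errors can be reproduced from the reported values as a relative change against the baseline at the same frame length. The short check below is only a sketch: the function name `relative_change` is illustrative, and it assumes that the 81-frame percentages are measured against ReCamMaster (the 81-frame baseline with the lower camera errors), which is what the numbers suggest but the page does not state.

```python
def relative_change(ours: float, baseline: float) -> int:
    """Signed percentage change of `ours` relative to `baseline`, rounded to an integer."""
    return round(100.0 * (ours - baseline) / baseline)


# (frames, baseline, baseline RotErr, baseline TransErr, ours RotErr, ours TransErr)
# Values copied from the quantitative table; the 81-frame baseline is an assumption.
ROWS = [
    (25, "Trajectory Attention", 3.54, 0.309, 1.24, 0.085),
    (49, "TrajectoryCrafter", 2.12, 0.721, 1.31, 0.280),
    (81, "ReCamMaster", 2.20, 2.096, 1.55, 0.818),
]

for frames, name, rot_b, trans_b, rot_o, trans_o in ROWS:
    rot = relative_change(rot_o, rot_b)
    trans = relative_change(trans_o, trans_b)
    print(f"{frames} frames vs. {name}: RotErr {rot}%, TransErr {trans}%")
```

Under that assumption the three rows give rotation changes of −65%, −38%, and −30% and translation changes of −72%, −61%, and −61%, matching the ranges quoted in the abstract.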

How It Works

Track2View conditions a video diffusion transformer on paired 3D point tracks — sparse scene-point trajectories projected into both the source and target camera views. These tracks give the model an explicit, temporally continuous link between what content appears where and when, replacing the noisy renderings and implicit correspondences used by prior methods.

Figure: Overview of Track2View. ⊕ denotes element-wise addition; PE denotes Fourier positional encoding; N is the number of point tracks.

1. Paired 3D Track Estimation

A 3D point tracker (SpatialTrackerV2) jointly recovers the source camera poses and sparse 3D point tracks from the input video. Given a user-specified target trajectory, the same 3D points are reprojected into the target view — producing one-to-one paired tracks, split into screen-space coordinates (xy) and per-view depth (z).
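
The reprojection step can be illustrated with a standard pinhole camera. The sketch below is not the authors' code: it assumes world-to-camera extrinsics `(rotation, translation)`, shared intrinsics `(fx, fy, cx, cy)`, and the hypothetical helper names `project` and `paired_track`. It shows how one 3D track yields screen-space coordinates (xy) and per-view depth (z) in both views.

```python
def project(point_world, rotation, translation, intrinsics):
    """Project one world-space point with a pinhole camera.

    rotation: 3x3 world-to-camera rotation (nested sequences, row-major).
    translation: world-to-camera translation (3 values).
    intrinsics: (fx, fy, cx, cy) in pixels.
    Returns ((x, y), z): screen coordinates and depth in that view.
    """
    cam = [
        sum(rotation[r][c] * point_world[c] for c in range(3)) + translation[r]
        for r in range(3)
    ]
    fx, fy, cx, cy = intrinsics
    z = cam[2]  # a real pipeline would also mask points with z <= 0
    return (fx * cam[0] / z + cx, fy * cam[1] / z + cy), z


def paired_track(points_world, src_cams, tgt_cams, intrinsics):
    """Project one 3D track (one point per frame) into the source and target views."""
    src = [project(p, R, t, intrinsics) for p, (R, t) in zip(points_world, src_cams)]
    tgt = [project(p, R, t, intrinsics) for p, (R, t) in zip(points_world, tgt_cams)]
    return src, tgt


# Toy example: a static point seen by a fixed source camera and a target
# camera whose centre is shifted 0.5 units along +x (so t = -R @ C = (-0.5, 0, 0)).
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
K = (500.0, 500.0, 320.0, 240.0)
track = [(0.5, 0.0, 2.0), (0.5, 0.0, 2.0)]
src_cams = [(IDENTITY, (0.0, 0.0, 0.0))] * 2
tgt_cams = [(IDENTITY, (-0.5, 0.0, 0.0))] * 2
src, tgt = paired_track(track, src_cams, tgt_cams, K)
print(src[0], tgt[0])
```

Because both projections start from the same 3D point at each frame, the source and target entries are paired by index, with no matching step.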

2. Dual-View Track Conditioner

The core module turns sparse tracks into dense conditioning tokens. It bilinearly samples source video features along each track, aggregates them across time with an 8-layer transformer (filling in frames where a point is occluded), then scatters them back into both views. Every geometric operation is parameter-free, so camera geometry is encoded directly rather than memorized.
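
The sample, aggregate, and scatter sequence can be sketched on toy single-channel feature maps. Everything here is illustrative: the function names are hypothetical, and the mean over visible frames is only a stand-in for the learned 8-layer temporal transformer, which this sketch does not reproduce. The sampling and scattering functions contain no trainable parameters, in line with the description above.

```python
import math


def bilinear_sample(feat, x, y):
    """Sample a 2D feature map feat[row][col] at continuous pixel coordinates (x, y)."""
    h, w = len(feat), len(feat[0])
    # Clamp so the 2x2 neighbourhood stays inside the map.
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    top = (1 - ax) * feat[y0][x0] + ax * feat[y0][x1]
    bottom = (1 - ax) * feat[y1][x0] + ax * feat[y1][x1]
    return (1 - ay) * top + ay * bottom


def aggregate_visible(samples, visible):
    """Stand-in for learned temporal aggregation: mean over frames where the point is visible."""
    kept = [s for s, v in zip(samples, visible) if v]
    return sum(kept) / len(kept) if kept else 0.0


def scatter(height, width, track_xy, value):
    """For each frame, write `value` into the nearest cell of an otherwise empty grid."""
    grids = []
    for x, y in track_xy:
        grid = [[0.0] * width for _ in range(height)]
        col, row = int(round(x)), int(round(y))
        if 0 <= row < height and 0 <= col < width:
            grid[row][col] = value
        grids.append(grid)
    return grids


# One track over two frames; the point is occluded in the second source frame.
frames = [[[0.0, 1.0], [2.0, 3.0]], [[4.0, 5.0], [6.0, 7.0]]]
src_xy = [(0.5, 0.5), (1.0, 0.0)]
visible = [True, False]
samples = [bilinear_sample(f, x, y) for f, (x, y) in zip(frames, src_xy)]
token = aggregate_visible(samples, visible)

# Scatter the aggregated feature along the paired target-view track.
tgt_xy = [(1.0, 1.0), (0.0, 1.0)]
tgt_grids = scatter(2, 2, tgt_xy, token)
print(samples, token, tgt_grids)
```

The occluded frame still receives a value when scattered, because the aggregated feature is written to every frame of the track. Calling `scatter` with the source-view coordinates would produce the source-view grids in the same way.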

3. Track-Conditioned Diffusion

The dual-view track tokens are added to the source and noised-target video tokens and fed to a pretrained DiT backbone (WAN-2.1). Only lightweight LoRA adapters and the conditioner are trained — the base video prior stays frozen. The model denoises the target view and decodes it into the final re-rendered video.
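
Two ingredients of this step are easy to show in isolation: element-wise addition of track tokens to video tokens, and a LoRA-adapted linear layer whose base weight stays frozen. The sketch below uses the generic LoRA formulation, y = W x + (alpha / r) · B (A x); the rank, scaling, layer placement, and all names are assumptions for illustration, not the configuration used by Track2View.

```python
def add_tokens(video_tokens, track_tokens):
    """Element-wise addition of conditioning tokens to video tokens (same shapes)."""
    return [
        [v + t for v, t in zip(video_tok, track_tok)]
        for video_tok, track_tok in zip(video_tokens, track_tokens)
    ]


def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_linear(x, W, A, B, alpha):
    """y = W x + (alpha / r) * B (A x).

    W (d_out x d_in) is the frozen pretrained weight; only the low-rank
    factors A (r x d_in) and B (d_out x r) would receive gradients.
    """
    rank = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


# Toy numbers: two 2-dimensional tokens, identity base weight, rank-1 adapter.
tokens = add_tokens([[1.0, 2.0], [0.0, 0.0]], [[0.0, 0.0], [1.0, 1.0]])
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]
B_zero = [[0.0], [0.0]]   # common LoRA initialisation: the adapter starts as a no-op
B_trained = [[1.0], [0.0]]
print(lora_linear(tokens[0], W, A, B_zero, alpha=2.0))
print(lora_linear(tokens[0], W, A, B_trained, alpha=2.0))
```

With `B` at zero the layer reproduces the frozen base output exactly, which is why adapters of this kind can be trained without disturbing the pretrained video prior at initialisation.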

Training Data Curation

Training needs paired videos of the same scene from two viewpoints, with one-to-one track correspondences. We temporally reverse the source-camera clip and concatenate it with the target-camera clip; because both clips share an identical first frame, the result is a seamless 161-frame sequence. Running the tracker once over this sequence yields tracks that span both segments, so one-to-one source ↔ target correspondences hold by construction, with no explicit matching and no learned parameters in the geometry path.

Figure: Paired track extraction. The source clip (camera A) is reversed and concatenated with the target clip (camera B), sharing a first frame. Matching colors across the source (top) and target (bottom) rows mark one-to-one 3D point correspondences.
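
The index bookkeeping behind the reverse-and-concatenate trick can be sketched as follows. The clip length of 81 frames is an inference from 161 = 2 × 81 − 1 (the shared first frame counted once), not a figure stated on this page, and the function names are hypothetical. The tracker itself is replaced by a dummy track that simply records sequence indices.

```python
def build_tracking_sequence(source_frames, target_frames):
    """Reverse the source clip and append the target clip, keeping the shared first frame once."""
    if source_frames[0] != target_frames[0]:
        raise ValueError("clips must share their first frame")
    return list(reversed(source_frames)) + list(target_frames[1:])


def split_track(track, clip_len):
    """Split a track over the concatenated sequence into (source, target) tracks.

    Both halves are returned in forward time order and both start at the
    shared frame, so entry i of each half refers to the same tracked point.
    """
    source = list(reversed(track[:clip_len]))
    target = list(track[clip_len - 1:])
    return source, target


CLIP_LEN = 81  # assumed per-clip length
source = ["shared"] + [f"A{i}" for i in range(1, CLIP_LEN)]
target = ["shared"] + [f"B{i}" for i in range(1, CLIP_LEN)]
sequence = build_tracking_sequence(source, target)

# Dummy "track": the value at each step is just its index in the sequence.
dummy_track = list(range(len(sequence)))
src_track, tgt_track = split_track(dummy_track, CLIP_LEN)
print(len(sequence), sequence[0], sequence[80], sequence[-1])
```

Since a single tracker pass follows each point through the reversed source segment and on into the target segment, splitting one track at the shared frame gives a source half and a target half that belong to the same 3D point.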

BibTeX

@article{track2view2026,
  title   = {Track2View: 4D-Consistent Camera-Controlled Video Generation
             via Paired 3D Point Tracks},
  author  = {Feng Qiao and Zhaochong An and Zhexiao Xiong and Serge Belongie and Nathan Jacobs},
  journal = {arXiv preprint arXiv:2606.15534},
  year    = {2026}
}