
AI Face Tracking for Vertical Video: How It Works

Shortzly Team

3 days ago

When you convert a 16:9 landscape video to 9:16 vertical format, you lose about two-thirds of the horizontal frame. If the speaker is slightly off-center — which they usually are — a simple center crop might cut off their face entirely.

AI face tracking solves this by detecting where the speaker is in each frame and dynamically adjusting the crop position. Here's how the two main approaches work, and when to use each one.

The Problem: 16:9 to 9:16 Conversion

A standard 16:9 frame is 1920 pixels wide by 1080 pixels tall. A 9:16 frame is 1080 pixels wide by 1920 pixels tall. When you crop a 16:9 frame to fill a 9:16 container, you can only use about 607 pixels of the original 1920-pixel width (at the same height). That's roughly one-third of the frame.
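The arithmetic is straightforward — to fill a 9:16 container while keeping the full 1080-pixel height, the crop width is fixed by the aspect ratio:

```python
# Crop width available when filling a 9:16 frame from a 16:9 source,
# keeping the full 1080 px height.
src_w, src_h = 1920, 1080          # 16:9 source
target_ratio = 9 / 16              # width / height of the vertical frame

crop_w = int(src_h * target_ratio) # 1080 * 9/16 = 607 px
fraction = crop_w / src_w          # ~0.316, roughly one-third of the width
print(crop_w, round(fraction, 3))
```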

If the speaker is centered, a static center crop works fine. But speakers move, gesture, and lean, and in multi-camera setups they might be on the left or right side of the frame. Without face tracking, you'd need to manually keyframe crop positions throughout the video — a tedious process that can take longer than the video itself.

OpenCV Face Tracking (Fast Mode)

OpenCV's Haar Cascade classifier is the simpler and faster approach to face detection. Here's how it works:

How It Works

  1. Frame-by-frame detection: Each video frame is analyzed using a pre-trained Haar Cascade model that detects face-like patterns (eye spacing, nose position, facial proportions)
  2. Bounding box: The detector returns a bounding box around each detected face with coordinates and size
  3. Position stabilization: Raw frame-by-frame detection is jittery, so the position is smoothed over time using an exponential moving average. This prevents the crop from jumping between frames
  4. Crop calculation: The crop window is centered on the stabilized face position, clamped to stay within frame boundaries
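Steps 3 and 4 can be sketched in a few lines of Python. The jittery x-coordinates below are hypothetical stand-ins for what a detector like `cv2.CascadeClassifier.detectMultiScale` would return; the smoothing and clamping logic is the part shown:

```python
def smooth_positions(raw_centers, alpha=0.2):
    """Exponential moving average over per-frame face-center x positions."""
    smoothed = []
    current = raw_centers[0]
    for x in raw_centers:
        current = alpha * x + (1 - alpha) * current
        smoothed.append(current)
    return smoothed

def crop_left_edge(center_x, crop_w=607, frame_w=1920):
    """Center the crop window on the face, clamped to the frame boundaries."""
    left = int(center_x - crop_w / 2)
    return max(0, min(left, frame_w - crop_w))

# Hypothetical raw detections for six consecutive frames:
jittery = [950, 972, 941, 968, 955, 990]
stable = smooth_positions(jittery)
edges = [crop_left_edge(x) for x in stable]
```

A lower `alpha` gives a steadier but slower-to-react crop; a higher `alpha` follows the face more closely at the cost of visible jitter.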

Strengths

  • Speed: Haar Cascade is very fast — it can process frames in real-time even on modest hardware
  • Reliability: Works well for single-speaker, front-facing content
  • Lightweight: No GPU required, minimal memory usage

Weaknesses

  • Profile faces: Haar Cascade works best with frontal faces. Side profiles can cause detection drops
  • Multiple speakers: When there are two faces in frame, it doesn't know which one to follow
  • Occlusion: If the speaker turns away or covers their face, detection drops until they face forward again

Best for: Single-speaker talking-head content, tutorials, vlogs, and any content where the speaker is consistently facing the camera.

MediaPipe Face Tracking (Accurate Mode)

MediaPipe's Face Mesh provides much more detailed face analysis, including landmark detection for 468 facial points. Shortzly uses this for a more sophisticated tracking approach.

How It Works

  1. Face Mesh detection: MediaPipe identifies 468 facial landmarks per face, including precise lip positions
  2. Lip-activity scoring: By measuring the distance between upper and lower lip landmarks over time, the system detects who is actively speaking. A configurable threshold determines what counts as "speaking"
  3. Active speaker tracking: In multi-speaker scenarios, the crop follows the currently active speaker. A switch threshold and minimum shot duration prevent rapid back-and-forth between speakers
  4. Center weight: A configurable center weight balances between following the speaker and maintaining a stable center position, reducing unnecessary camera movement
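The lip-activity and speaker-switching logic (steps 2 and 3) can be sketched roughly as follows. This is an illustrative simplification, not Shortzly's actual code — in a real pipeline the per-frame lip gaps would be measured from Face Mesh landmarks (the inner-lip points, commonly indices 13 and 14), and the function and threshold names here are hypothetical:

```python
def lip_activity(lip_gaps, window=5):
    """Score speaking activity as the variance of lip opening
    over the most recent frames: talking mouths open and close."""
    recent = lip_gaps[-window:]
    mean = sum(recent) / len(recent)
    return sum((g - mean) ** 2 for g in recent) / len(recent)

def active_speaker(scores, current, switch_threshold=1.5):
    """Switch to another face only when its activity score clearly
    exceeds the current speaker's, to avoid rapid back-and-forth."""
    best = max(scores, key=scores.get)
    if best != current and scores[best] > switch_threshold * max(scores[current], 1e-6):
        return best
    return current
```

A minimum shot duration would sit on top of this: even when `active_speaker` returns a new face, the switch is deferred until the current shot has been on screen long enough.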

Strengths

  • Multi-speaker handling: Lip-activity scoring intelligently tracks the active speaker in podcasts, interviews, and panel discussions
  • Profile support: Face Mesh works with partial profiles, not just frontal faces
  • Smooth transitions: Speaker switches are smoothed with configurable timing parameters
  • Fine-grained control: Adjustable thresholds for lip activity, switch timing, and center weight

Weaknesses

  • Speed: 2-3x slower than OpenCV due to the complexity of 468-point landmark detection
  • Resource usage: Higher CPU and memory requirements
  • Complexity: More parameters to tune for optimal results

Best for: Podcasts, interviews, panel discussions, and any content with multiple speakers where you want the crop to follow the active speaker.

Center Crop (No Face Detection)

The simplest option — no AI at all. The video is cropped from the center of the frame with no dynamic adjustment.
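Because the offset never changes, the whole mode reduces to one calculation — equivalent to a fixed FFmpeg `crop` filter such as `crop=607:1080:656:0`:

```python
# Fixed center-crop offset for a 1920x1080 source filling a 9:16 frame.
src_w, src_h = 1920, 1080
crop_w = int(src_h * 9 / 16)       # 607 px
x_offset = (src_w - crop_w) // 2   # 656 px from the left edge
```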

Best for: Content where the speaker is always centered (like teleprompter recordings), B-roll footage, or when you want the fastest possible rendering.

When to Use Each Mode

  • Solo talking-head video: OpenCV (fast, reliable)
  • Podcast or interview: MediaPipe (active speaker tracking)
  • Conference panel: MediaPipe (multi-face handling)
  • Screen recording with camera overlay: OpenCV or center crop
  • B-roll or landscape footage: Center crop

Implementation in the Pipeline

Face tracking runs as part of the portrait conversion step in the clip rendering pipeline. After the AI detects and scores highlights, and after you've selected which clips to render, face tracking processes each clip frame by frame.

The system uses rawvideo stdin pipes to FFmpeg — each processed frame is piped directly to the encoder rather than writing temporary files. This keeps rendering fast and storage-efficient.
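A minimal sketch of that piping pattern, assuming BGR frames as produced by OpenCV (the function names and encoder settings here are illustrative, not Shortzly's actual configuration):

```python
import subprocess

def encoder_args(out_path, width=1080, height=1920, fps=30):
    """Build an FFmpeg command that reads raw frames from stdin."""
    return [
        "ffmpeg", "-y",
        "-f", "rawvideo",
        "-pix_fmt", "bgr24",          # OpenCV's native frame layout
        "-s", f"{width}x{height}",
        "-r", str(fps),
        "-i", "-",                    # read frames from stdin
        "-c:v", "libx264",
        "-pix_fmt", "yuv420p",
        out_path,
    ]

def start_encoder(out_path, **kwargs):
    return subprocess.Popen(encoder_args(out_path, **kwargs),
                            stdin=subprocess.PIPE)

# Usage: for each processed frame (a numpy array), write its bytes:
#   proc.stdin.write(frame.tobytes())
# then close stdin and wait for the encoder to finish:
#   proc.stdin.close(); proc.wait()
```

Because no intermediate frames hit disk, rendering a clip needs only the output file's worth of storage regardless of clip length.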

Try It Yourself

Face tracking mode is selectable from Shortzly's editor when rendering clips. Try both OpenCV and MediaPipe on the same clip to see the difference. For single-speaker content, you likely won't notice a difference. For multi-speaker content, MediaPipe's active speaker detection is a game-changer.

Start free with Shortzly and test face tracking on your own content. The AI video clipper supports all three tracking modes.
