
AI Face Tracking for Vertical Video: How It Works

Shortzly Team

3 days ago

When you convert a 16:9 landscape video to 9:16 vertical format, you lose about two-thirds of the horizontal frame. If the speaker is slightly off-center — which they usually are — a simple center crop might cut off their face entirely.

AI face tracking solves this by detecting where the speaker is in each frame and dynamically adjusting the crop position. Here's how the two main approaches work, and when to use each one.

The Problem: 16:9 to 9:16 Conversion

A standard 16:9 frame is 1920 pixels wide by 1080 pixels tall. A 9:16 frame is 1080 pixels wide by 1920 pixels tall. When you crop a 16:9 frame to fill a 9:16 container, you can only use about 607 pixels of the original 1920-pixel width (at the same height). That's roughly one-third of the frame.
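The arithmetic is straightforward — to fill a 9:16 container while keeping the full 1080-pixel height, the crop width is fixed by the aspect ratio:

```python
# Crop width available when filling a 9:16 frame from a 16:9 source,
# keeping the full 1080 px height.
src_w, src_h = 1920, 1080          # 16:9 source
target_ratio = 9 / 16              # width / height of the vertical frame

crop_w = int(src_h * target_ratio) # 1080 * 9/16 = 607 px
fraction = crop_w / src_w          # ~0.316, roughly one-third of the width
print(crop_w, round(fraction, 3))
```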

If the speaker is centered, a static center crop works fine. But speakers move, gesture, and lean, and in multi-camera setups they might be on the left or right side of the frame. Without face tracking, you'd need to manually keyframe crop positions throughout the video — a tedious process that can take longer than the video itself.

OpenCV Face Tracking (Fast Mode)

OpenCV's Haar Cascade classifier is the simpler and faster approach to face detection. Here's how it works:

How It Works

  1. Frame-by-frame detection: Each video frame is analyzed using a pre-trained Haar Cascade model that detects face-like patterns (eye spacing, nose position, facial proportions)
  2. Bounding box: The detector returns a bounding box around each detected face with coordinates and size
  3. Position stabilization: Raw frame-by-frame detection is jittery, so the position is smoothed over time using an exponential moving average. This prevents the crop from jumping between frames
  4. Crop calculation: The crop window is centered on the stabilized face position, clamped to stay within frame boundaries
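Steps 3 and 4 can be sketched in a few lines of Python. The jittery x-coordinates below are hypothetical stand-ins for what a detector like `cv2.CascadeClassifier.detectMultiScale` would return; the smoothing and clamping logic is the part shown:

```python
def smooth_positions(raw_centers, alpha=0.2):
    """Exponential moving average over per-frame face-center x positions."""
    smoothed = []
    current = raw_centers[0]
    for x in raw_centers:
        current = alpha * x + (1 - alpha) * current
        smoothed.append(current)
    return smoothed

def crop_left_edge(center_x, crop_w=607, frame_w=1920):
    """Center the crop window on the face, clamped to the frame boundaries."""
    left = int(center_x - crop_w / 2)
    return max(0, min(left, frame_w - crop_w))

# Hypothetical raw detections for six consecutive frames:
jittery = [950, 972, 941, 968, 955, 990]
stable = smooth_positions(jittery)
edges = [crop_left_edge(x) for x in stable]
```

A lower `alpha` gives a steadier but slower-to-react crop; a higher `alpha` follows the face more closely at the cost of visible jitter.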

Strengths

  • Speed: Haar Cascade is very fast — it can process frames in real-time even on modest hardware
  • Reliability: Works well for single-speaker, front-facing content
  • Lightweight: No GPU required, minimal memory usage

Weaknesses

  • Profile faces: Haar Cascade works best with frontal faces. Side profiles can cause detection drops
  • Multiple speakers: When there are two faces in frame, it doesn't know which one to follow
  • Occlusion: If the speaker turns away or covers their face, detection drops until they face forward again

Best for: Single-speaker talking-head content, tutorials, vlogs, and any content where the speaker is consistently facing the camera.

MediaPipe Face Tracking (Accurate Mode)

MediaPipe's Face Mesh provides much more detailed face analysis, including landmark detection for 468 facial points. Shortzly uses this for a more sophisticated tracking approach.

How It Works

  1. Face Mesh detection: MediaPipe identifies 468 facial landmarks per face, including precise lip positions
  2. Lip-activity scoring: By measuring the distance between upper and lower lip landmarks over time, the system detects who is actively speaking. A configurable threshold determines what counts as "speaking"
  3. Active speaker tracking: In multi-speaker scenarios, the crop follows the currently active speaker. A switch threshold and minimum shot duration prevent rapid back-and-forth between speakers
  4. Center weight: A configurable center weight balances between following the speaker and maintaining a stable center position, reducing unnecessary camera movement
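The lip-activity and speaker-switching logic (steps 2 and 3) can be sketched roughly as follows. This is an illustrative simplification, not Shortzly's actual code — in a real pipeline the per-frame lip gaps would be measured from Face Mesh landmarks (the inner-lip points, commonly indices 13 and 14), and the function and threshold names here are hypothetical:

```python
def lip_activity(lip_gaps, window=5):
    """Score speaking activity as the variance of lip opening
    over the most recent frames: talking mouths open and close."""
    recent = lip_gaps[-window:]
    mean = sum(recent) / len(recent)
    return sum((g - mean) ** 2 for g in recent) / len(recent)

def active_speaker(scores, current, switch_threshold=1.5):
    """Switch to another face only when its activity score clearly
    exceeds the current speaker's, to avoid rapid back-and-forth."""
    best = max(scores, key=scores.get)
    if best != current and scores[best] > switch_threshold * max(scores[current], 1e-6):
        return best
    return current
```

A minimum shot duration would sit on top of this: even when `active_speaker` returns a new face, the switch is deferred until the current shot has been on screen long enough.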

Strengths

  • Multi-speaker handling: Lip-activity scoring intelligently tracks the active speaker in podcasts, interviews, and panel discussions
  • Profile support: Face Mesh works with partial profiles, not just frontal faces
  • Smooth transitions: Speaker switches are smoothed with configurable timing parameters
  • Fine-grained control: Adjustable thresholds for lip activity, switch timing, and center weight

Weaknesses

  • Speed: 2-3x slower than OpenCV due to the complexity of 468-point landmark detection
  • Resource usage: Higher CPU and memory requirements
  • Complexity: More parameters to tune for optimal results

Best for: Podcasts, interviews, panel discussions, and any content with multiple speakers where you want the crop to follow the active speaker.

Center Crop (No Face Detection)

The simplest option — no AI at all. The video is cropped from the center of the frame with no dynamic adjustment.
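Because the offset never changes, the whole mode reduces to one calculation — equivalent to a fixed FFmpeg `crop` filter such as `crop=607:1080:656:0`:

```python
# Fixed center-crop offset for a 1920x1080 source filling a 9:16 frame.
src_w, src_h = 1920, 1080
crop_w = int(src_h * 9 / 16)       # 607 px
x_offset = (src_w - crop_w) // 2   # 656 px from the left edge
```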

Best for: Content where the speaker is always centered (like teleprompter recordings), B-roll footage, or when you want the fastest possible rendering.

When to Use Each Mode

  • Solo talking-head video: OpenCV (fast, reliable)
  • Podcast or interview: MediaPipe (active speaker tracking)
  • Conference panel: MediaPipe (multi-face handling)
  • Screen recording with camera overlay: OpenCV or center crop
  • B-roll or landscape footage: Center crop

Implementation in the Pipeline

Face tracking runs as part of the portrait conversion step in the clip rendering pipeline. After the AI detects and scores highlights, and after you've selected which clips to render, face tracking processes each clip frame by frame.

The system uses rawvideo stdin pipes to FFmpeg — each processed frame is piped directly to the encoder rather than writing temporary files. This keeps rendering fast and storage-efficient.
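A minimal sketch of that piping pattern, assuming BGR frames as produced by OpenCV (the function names and encoder settings here are illustrative, not Shortzly's actual configuration):

```python
import subprocess

def encoder_args(out_path, width=1080, height=1920, fps=30):
    """Build an FFmpeg command that reads raw frames from stdin."""
    return [
        "ffmpeg", "-y",
        "-f", "rawvideo",
        "-pix_fmt", "bgr24",          # OpenCV's native frame layout
        "-s", f"{width}x{height}",
        "-r", str(fps),
        "-i", "-",                    # read frames from stdin
        "-c:v", "libx264",
        "-pix_fmt", "yuv420p",
        out_path,
    ]

def start_encoder(out_path, **kwargs):
    return subprocess.Popen(encoder_args(out_path, **kwargs),
                            stdin=subprocess.PIPE)

# Usage: for each processed frame (a numpy array), write its bytes:
#   proc.stdin.write(frame.tobytes())
# then close stdin and wait for the encoder to finish:
#   proc.stdin.close(); proc.wait()
```

Because no intermediate frames hit disk, rendering a clip needs only the output file's worth of storage regardless of clip length.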

Try It Yourself

Face tracking mode is selectable from Shortzly's editor when rendering clips. Try both OpenCV and MediaPipe on the same clip to see the difference. For single-speaker content, you likely won't notice a difference. For multi-speaker content, MediaPipe's active speaker detection is a game-changer.

Start free with Shortzly and test face tracking on your own content. The AI video clipper supports all three tracking modes.
