TrackNet

Deep Learning for High-Speed Tiny Object Tracking

TrackNet is a specialized deep learning architecture designed for tracking high-speed and tiny objects in broadcast sports videos. These objects (like tennis balls or shuttlecocks) are often small, blurry, and occasionally invisible due to high shutter speeds and motion.

Problem Statement

| Feature | TrackNet Strategy |
| --- | --- |
| Small object | Heatmap-based detection instead of direct pixel-level coordinate regression |
| Motion blur | Learning patterns from consecutive frames (temporal information) |
| Visibility | Regression of $(x, y)$ even when the ball is partially occluded |
Architecture Core

The network is trained not only to recognize the ball from a single frame but also to learn flying patterns from consecutive frames, utilizing spatiotemporal features.
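The stacked-frame input described above can be sketched as follows (shapes and resolution are illustrative, not the papers' exact values):

```python
import numpy as np

# Illustrative resolution and window size (hypothetical values).
H, W, N_FRAMES = 288, 512, 3

# Three consecutive RGB frames from a broadcast clip (random stand-ins here).
frames = [np.random.rand(H, W, 3).astype(np.float32) for _ in range(N_FRAMES)]

# Stack the frames along the channel axis: the network sees an
# (H, W, 9) tensor, so motion blur and flight direction across
# frames become learnable spatiotemporal features.
x = np.concatenate(frames, axis=-1)
assert x.shape == (H, W, 3 * N_FRAMES)
```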

TrackNet vs. TSM (Temporal Shift Module)

The design philosophy differs significantly based on the task goal:

  • TrackNet: Goal is to estimate instantaneous velocity, $\max(\partial(x, y, t)/\partial t)$. This is a regression task requiring precise spatial coordinates.
  • TSM: Goal is to classify actions, $P(\text{Action} \mid f(t-k, t+k))$. This is a classification task over a temporal window where local pixel precision is less critical.

Comparison Matrix

| Feature | TrackNet | TSM |
| --- | --- | --- |
| Task aim | $\max(\partial(x, y, t)/\partial t)$ | $P(\text{Action} \mid f(t-k, t+k))$ |
| Loss function | $\lVert \text{Heatmap}_{\text{pred}} - \text{Heatmap}_{\text{gt}} \rVert$ | $\text{CE}(\text{Action}_{\text{pred}}, \text{Action}_{\text{gt}})$ |
| Input focus | Local spatial features + temporal | Global temporal context |
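The two objectives in the matrix can be contrasted in a minimal NumPy sketch (plain MSE stands in for the heatmap distance; both functions are illustrative, not the papers' exact losses):

```python
import numpy as np

def heatmap_loss(pred, gt):
    # TrackNet-style regression objective: distance between the
    # predicted and ground-truth heatmaps (mean squared error here).
    return float(np.mean((pred - gt) ** 2))

def action_ce_loss(logits, label):
    # TSM-style classification objective: cross-entropy over a
    # softmax of per-action logits for the whole temporal window.
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return float(-np.log(p[label]))
```

Note the asymmetry: the heatmap loss penalizes every pixel, forcing spatial precision, while the cross-entropy only cares about the class decision.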

Evolution

TrackNetV1 (AVSS 2019)

📄 Paper: AVSS 2019

  • Input: $W \times H \times (3\ \text{frames} \times \text{RGB})$, i.e. three consecutive RGB frames stacked along the channel axis.
  • Output: $W \times H \times 1$ heatmap.
  • Method: VGG-based encoder-decoder; the binary heatmap is post-processed with the circle Hough transform to locate the ball.
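A simplified stand-in for V1's post-processing (the actual pipeline binarizes the heatmap and fits circles with the Hough transform; a thresholded centroid is used here for brevity):

```python
import numpy as np

def locate_ball(heatmap, thresh=0.5):
    """Simplified substitute for the binarize-then-Hough step:
    threshold the predicted heatmap and return the centroid of the
    hot pixels as (x, y), or None if the ball is predicted invisible.
    `thresh` is a hypothetical value, not taken from the paper."""
    ys, xs = np.nonzero(heatmap >= thresh)
    if xs.size == 0:
        return None
    return float(xs.mean()), float(ys.mean())
```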

TrackNetV2 (ICPAI 2020)

📄 Paper: ICPAI 2020

Key improvements over V1:

  • U-Net Skip Connections: Replaced VGG encoder-decoder to reduce False Positives and trajectory jitter.
  • Multi-Frame Output: Output changed from $W \times H \times 1$ to $W \times H \times \text{InputFrames}$ (one heatmap per input frame) for smoother trajectory prediction.
  • Soft Gaussian Heatmap: Replaced hard binary labels with smoother Gaussian heatmaps (Soft labels) to handle motion blur.
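The soft Gaussian labels can be generated as in this sketch (the `sigma` value is a hypothetical choice; the papers tune the spread per dataset):

```python
import numpy as np

def gaussian_heatmap(h, w, cx, cy, sigma=2.5):
    """Soft ground-truth label used from V2 onward: a 2-D Gaussian
    centred on the annotated ball position (cx, cy), replacing the
    hard binary disk. Peak value is 1 at the centre and decays
    smoothly, which tolerates the fuzzy extent of a motion-blurred ball."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
```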

TrackNetV3

📄 Paper: TrackNetV3

  • Background Integration: Added background image as input for better differentiation.
  • Mixup Training: Applied Mixup data augmentation.
  • Rectification Module: Introduced a module to rectify track misalignment during occlusions or overlapping.
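The Mixup augmentation listed above blends pairs of training samples; a minimal sketch (`alpha` is a hypothetical hyperparameter, not the paper's setting):

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.5):
    """Mixup data augmentation: convexly combine two frame stacks
    and their heatmap labels with a Beta-sampled weight, producing
    soft intermediate training samples."""
    lam = np.random.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2
```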

Performance Assumption

TrackNetV3 significantly outperforms V2 in scenarios with heavy occlusions, but requires background frames for optimal initialization.
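The background frames required here could be estimated, for example, as a per-pixel median over sampled frames (an assumed approach for illustration; the paper's exact procedure may differ):

```python
import numpy as np

def estimate_background(frames):
    """Hypothetical background estimate for V3's extra input:
    a per-pixel median over a sample of frames suppresses the moving
    ball and players, leaving the static court behind."""
    return np.median(np.stack(frames, axis=0), axis=0)
```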