Paper - DeepSort Paper Review

I wrote a paper review.

DeepSort Paper Review

  • This review summarizes the main points of the paper.


  • Simple Online and Realtime Tracking (SORT) is a pragmatic approach to multiple object tracking with a focus on simple, effective algorithms.
  • In this paper, we integrate appearance information to improve the performance of SORT.
  • Due to this extension we are able to track objects through longer periods of occlusions, effectively reducing the number of identity switches.
  • During online application, we establish measurement-to-track associations using nearest neighbor queries in visual appearance space.
  • Experimental evaluation shows that our extensions reduce the number of identity switches by 45%, achieving overall competitive performance at high frame rates.


  • While achieving overall good performance in terms of tracking precision and accuracy, SORT returns a relatively high number of identity switches. This is because the employed association metric is only accurate when state estimation uncertainty is low.

  • The paper overcomes this issue by replacing the association metric with a more informed metric that combines motion and appearance information.


  • The paper adopts a conventional single hypothesis tracking methodology with recursive Kalman filtering and frame-by-frame data association.

2.1 Track Handling and State Estimation

  • The track handling and Kalman filtering framework is mostly identical to the original formulation in SORT.

2.2 Assignment Problem

  • Into this problem formulation we integrate motion and appearance information through combination of two appropriate metrics.

Mahalanobis distance

  • To incorporate motion information we use the (squared) Mahalanobis distance between predicted Kalman states and newly arrived measurements
\[d^{(1)}(i, j) = (d_j - y_i)^T S_i^{-1} (d_j - y_i)\]
  • Here $(y_i, S_i)$ denotes the projection of the $i$-th track distribution into measurement space, and $d_j$ denotes the $j$-th bounding box detection.

  • The Mahalanobis distance takes state estimation uncertainty into account by measuring how many standard deviations the detection is away from the mean track location.

  • Using this metric it is possible to exclude unlikely associations by thresholding the Mahalanobis distance at a 95% confidence interval computed from the inverse $χ^{2}$ distribution.

\[b^{(1)}_{i, j} = 1[d^{(1)}(i, j) \le t^{(1)}]\]
  • This decision is denoted by an indicator that evaluates to 1 if the association between the $i$-th track and $j$-th detection is admissible.
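
The motion gate above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the measurement space is assumed to be 4-dimensional, and the threshold is the 0.95 quantile of the $χ^{2}$ distribution with 4 degrees of freedom (about 9.4877). The toy values are made up.

```python
import numpy as np

# Assumption: 4-dimensional measurement space, so the gate uses the 0.95
# quantile of the chi-square distribution with 4 degrees of freedom.
CHI2_95_4DOF = 9.4877


def squared_mahalanobis(y_i, S_i, d_j):
    """Squared Mahalanobis distance between the projected track (y_i, S_i)
    and the detection d_j, i.e. (d_j - y_i)^T S_i^{-1} (d_j - y_i)."""
    diff = np.asarray(d_j, dtype=float) - np.asarray(y_i, dtype=float)
    # Solve S_i x = diff rather than forming the explicit inverse.
    return float(diff @ np.linalg.solve(np.asarray(S_i, dtype=float), diff))


def motion_gate(y_i, S_i, d_j, threshold=CHI2_95_4DOF):
    """Indicator b1(i, j): True if the association is admissible."""
    return squared_mahalanobis(y_i, S_i, d_j) <= threshold


# Toy values (made up for illustration only).
y = np.array([100.0, 50.0, 0.5, 80.0])   # projected track mean
S = np.diag([4.0, 4.0, 0.01, 9.0])       # projected track covariance
near = np.array([102.0, 51.0, 0.5, 83.0])
far = np.array([120.0, 50.0, 0.5, 80.0])

print(squared_mahalanobis(y, S, near), motion_gate(y, S, near))  # about 2.25, admissible
print(squared_mahalanobis(y, S, far), motion_gate(y, S, far))    # about 100, rejected
```

With a larger covariance $S_i$ the same offset yields a smaller distance, which is how the gate accounts for state estimation uncertainty.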

Smallest cosine distance

  • The second metric measures the smallest cosine distance between the $i$-th track and $j$-th detection in appearance space.
\[d^{(2)}(i, j) = \min\{1 - r_{j}^{T} r_{k}^{(i)} \mid r_{k}^{(i)} \in R_{i}\}\]
\[b_{i, j}^{(2)} = 1[d^{(2)}(i, j) \le t^{(2)}]\]
  • A pre-trained CNN is applied to compute the bounding box appearance descriptors.
  • To build the association problem we combine both metrics using a weighted sum.
\[c_{i, j} = \lambda d^{(1)}(i, j) + (1 - \lambda ) d^{(2)} (i, j)\]
  • An association is called admissible if it is within the gating region of both metrics.
\[b_{i, j} = \prod_{m = 1}^{2} b_{i, j}^{(m)}\]

  • During our experiments we found that setting $\lambda = 0$ is a reasonable choice when there is substantial camera motion.
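
The appearance metric, the weighted cost, and the combined gate can be sketched as follows. This is a rough illustration under stated assumptions: the function names are hypothetical, the descriptors are assumed to be unit-normalized so that a dot product is the cosine similarity, and the thresholds `t1` and `t2` are placeholder values rather than the ones used in the paper.

```python
import numpy as np


def smallest_cosine_distance(r_j, gallery_i):
    """d2(i, j): minimum of 1 - r_j^T r_k over the stored descriptors r_k
    of track i. Assumes every descriptor has unit norm."""
    r_j = np.asarray(r_j, dtype=float)
    gallery_i = np.asarray(gallery_i, dtype=float)  # shape (K, D)
    return float(np.min(1.0 - gallery_i @ r_j))


def combined_cost(d1, d2, lam):
    """c(i, j) = lam * d1 + (1 - lam) * d2."""
    return lam * d1 + (1.0 - lam) * d2


def admissible(d1, d2, t1, t2):
    """b(i, j): product of both gates, True only if both metrics pass."""
    return (d1 <= t1) and (d2 <= t2)


# Toy values (made up for illustration only).
r_det = np.array([1.0, 0.0])                    # detection descriptor
gallery = np.array([[0.0, 1.0], [0.6, 0.8]])    # descriptors stored for one track
d2 = smallest_cosine_distance(r_det, gallery)   # min(1 - 0.0, 1 - 0.6)
d1 = 2.25                                       # a squared Mahalanobis distance

t1, t2 = 9.4877, 0.5                            # placeholder thresholds
print(d2)
print(combined_cost(d1, d2, lam=0.0))           # only appearance enters the cost
print(admissible(d1, d2, t1, t2))
print(admissible(100.0, d2, t1, t2))            # motion gate still rejects
```

Note that with $\lambda = 0$ the cost depends only on appearance, while the product of the two indicators still applies the Mahalanobis gate.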

2.3 Matching Cascade

  • When an object is occluded for a longer period of time, subsequent Kalman filter predictions increase the uncertainty associated with the object location.

  • Therefore, we introduce a matching cascade that gives priority to more frequently seen objects to encode our notion of probability spread in the association likelihood.
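
One way to picture the cascade is the sketch below. It is a simplified illustration, not the paper's listing: it assumes each track carries a hypothetical counter `ages[i]` of frames since its last match, and it uses a greedy cheapest-pair matcher as a stand-in for a proper min-cost assignment solver. All names and values are made up.

```python
def greedy_match(cost, gate, track_ids, det_ids):
    """Stand-in for a min-cost assignment solver: repeatedly take the
    cheapest admissible (track, detection) pair that is still free."""
    pairs = sorted(
        (cost[i][j], i, j)
        for i in track_ids
        for j in det_ids
        if gate[i][j]
    )
    used_tracks, used_dets, matches = set(), set(), []
    for _, i, j in pairs:
        if i not in used_tracks and j not in used_dets:
            matches.append((i, j))
            used_tracks.add(i)
            used_dets.add(j)
    return matches


def matching_cascade(cost, gate, ages, max_age):
    """Match tracks to detections in order of increasing age.

    ages[i] is the number of frames since track i was last matched
    (1 means it was matched in the previous frame). Tracks seen more
    recently get first pick of the remaining detections.
    """
    n_dets = len(cost[0]) if cost else 0
    unmatched = list(range(n_dets))
    matches = []
    for n in range(1, max_age + 1):
        if not unmatched:
            break
        tracks_n = [i for i, age in enumerate(ages) if age == n]
        if not tracks_n:
            continue
        found = greedy_match(cost, gate, tracks_n, unmatched)
        matches.extend(found)
        taken = {j for _, j in found}
        unmatched = [j for j in unmatched if j not in taken]
    return matches, unmatched


# Toy example: track 0 has been occluded for 3 frames, track 1 was seen last frame.
cost = [[0.1, 0.9],
        [0.2, 0.8]]
gate = [[True, True],
        [True, True]]
matches, unmatched = matching_cascade(cost, gate, ages=[3, 1], max_age=3)
print(matches, unmatched)
```

In this toy example track 1 claims detection 0 first because it was seen more recently, even though track 0 has the lower cost for that detection; track 0 is then left with detection 1.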


  • Compared to SORT, the number of identity switches decreases by approximately 45%.

  • Overall, due to integration of appearance information we successfully maintain identities through longer occlusions.