This section describes in detail the three modules of the OABB-based tracking method for construction equipment: enclosing contour detection, enclosing contour update, and tracking ID managing. First, the enclosing contour is parameterized using the five variables of the OABB, a CNN-based contour detection model with multi-level features is built, and the loss function is defined. Second, the video frames are input to the model to obtain detected contours, a motion model of the construction equipment is built to obtain predicted contours, and the tracked contours are updated from the predicted contours using the detected contours. Finally, the intersection over union (IOU) of the OABBs is used to add, keep, or delete construction equipment IDs and thereby obtain the tracking status of each piece of equipment.

### Enclosing contour detection

The CNN-based detection module describes the construction equipment in images by OABB enclosing contours. Figure 1 shows the difference between HBB and OABB. An HBB is defined by four parameters: the centre point coordinates (*x*, *y*), width (*w*), and height (*h*). An OABB is defined by five parameters: *x*, *y*, *w*, *h*, and the rotating angle (*r*). Figure 1(b) compares how the two kinds of bounding box represent the equipment. The enclosing contour detection model, which aims to generate and regress OABBs, is adapted from CenterNet [30]. The model consists of two parts, a backbone and a detection head, as shown in Fig. 2.
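To make the five-parameter representation concrete, the following minimal sketch converts an OABB (*x*, *y*, *w*, *h*, *r*) into its four corner points. The function name `oabb_corners`, the use of radians, and the rotation convention (standard 2-D rotation about the centre) are assumptions made for illustration; the angle convention of the detector is not specified here.

```python
import math

def oabb_corners(x, y, w, h, r):
    """Return the four corners of an OABB centred at (x, y).

    Assumption: r is in radians and rotates the box about its centre
    with the standard 2-D rotation matrix.
    """
    c, s = math.cos(r), math.sin(r)
    # Corner offsets of the unrotated box, listed in a fixed cyclic order.
    half = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    return [(x + dx * c - dy * s, y + dx * s + dy * c) for dx, dy in half]
```

With *r* = 0 the result reduces to the corners of an HBB with the same centre, width, and height.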

The backbone provides multi-level features of the construction equipment. A modified ResNet-18 base network (mResNet-18) is employed, with four residual blocks, each comprising four convolutional layers and two shortcut connections. The residual structure has a better fitting ability for extracting accurate features, and it also eases the optimisation difficulty that arises in training as the number of layers increases. Four deconvolution layers are added to recover the spatial information. To speed up detection, the output size of mResNet-18 is M / 4 × N / 4, where M × N is the size of the input image.

The detection head, based on the OABB, has four regression parts: centre point regression (*x*, *y*), offset regression (*off*_{x}, *off*_{y}), width and height regression (*w*, *h*), and angle regression (*r*). These parts learn, respectively, the integer part of the centre point coordinates, the fractional part of the centre point coordinates, the width and height, and the rotating angle of the OABBs, from feature maps processed by (3 × 3 × 64, 1 × 1 × 2), (3 × 3 × 64, 1 × 1 × 2), (3 × 3 × 64, 1 × 1 × 2), and (3 × 3 × 64, 1 × 1 × 1) convolutional kernels. In the network inference stage, the heat maps from the centre point regression are processed by 3 × 3 max-pooling, which functions as non-maximum suppression.
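The role of 3 × 3 max-pooling as non-maximum suppression can be illustrated with a short NumPy sketch: a heat map cell is kept as a centre only if it equals the maximum of its 3 × 3 neighbourhood. The function name `extract_peaks`, the score threshold, and the array layouts are assumptions for illustration; the stride of 4 follows from the M / 4 × N / 4 output size.

```python
import numpy as np

def extract_peaks(heat, offsets, threshold=0.3, stride=4):
    """Find local maxima of a centre heat map and decode sub-pixel centres.

    heat:    (H, W) float array of centre scores in [0, 1]
    offsets: (2, H, W) float array holding (off_x, off_y) per cell
    threshold is an illustrative assumption, not a value from the source.
    """
    H, W = heat.shape
    padded = np.pad(heat, 1, mode="constant", constant_values=-np.inf)
    # 3 x 3 max-pooling with stride 1: maximum over the nine shifted views.
    pooled = np.max(
        [padded[dy:dy + H, dx:dx + W] for dy in range(3) for dx in range(3)],
        axis=0,
    )
    # A cell survives only if it is the maximum of its neighbourhood.
    keep = (heat == pooled) & (heat >= threshold)
    ys, xs = np.nonzero(keep)
    centres = []
    for yi, xi in zip(ys, xs):
        # Integer cell index plus fractional offset, scaled to image pixels.
        cx = (xi + offsets[0, yi, xi]) * stride
        cy = (yi + offsets[1, yi, xi]) * stride
        centres.append((float(cx), float(cy), float(heat[yi, xi])))
    return centres
```

The width, height, and angle maps would be read at the same surviving cells to complete each OABB.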

To decrease the difficulty of training and increase the efficiency of inference, a Gaussian heat map generated from the ground truth centre point coordinates (*x*_{0}, *y*_{0}) is employed in this research. *w*_{xy} and \(\hat{w}_{xy}\) are the actual and predicted weights in the Gaussian heat map, respectively. The Gaussian heat map weight at coordinate (*x*, *y*) is calculated based on a Gaussian kernel with six parameters: the Gaussian means (*μ*_{1}, *μ*_{2}), the Gaussian standard deviations (σ_{1}, σ_{2}), and the window sizes (*r*_{1}, *r*_{2}), using Eq. (1), as follows:

$$\begin{gathered} w_{x,y} = \left\{ {\begin{array}{*{20}c} {\exp \left\{ { - \frac{1}{2}\left[ {\frac{{(x - \mu_{1} )^{2} }}{{\sigma_{{1}}^{{2}} }} + \frac{{(y - \mu_{2} )^{2} }}{{\sigma_{{2}}^{{2}} }}} \right]} \right\},x_{0} - \frac{{r_{1} }}{2} < x < x_{0} + \frac{{r_{1} }}{2},y_{0} - \frac{{r_{2} }}{2} < y < y_{0} + \frac{{r_{2} }}{2}} \\ {0,others} \\ \end{array} } \right. \\ \mu_{1} = x_{0} ,\mu_{2} = y_{0} ,\sigma_{1} = \lambda w,\sigma_{2} = \lambda h,r_{1} = 2\sigma_{1} + 1,r_{2} = 2\sigma_{2} + 1 \\ \end{gathered}$$

(1)
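As a minimal sketch of Eq. (1), the heat map can be evaluated on a pixel grid as follows. The function name `gaussian_heat_map` and the default value of the scale factor `lam` (*λ*) are assumptions for illustration.

```python
import numpy as np

def gaussian_heat_map(shape, x0, y0, w, h, lam=0.1):
    """Axis-aligned Gaussian heat map following Eq. (1).

    shape: (rows, cols) of the heat map. lam is the scale factor lambda;
    its default value here is an illustrative assumption.
    """
    rows, cols = shape
    sigma1, sigma2 = lam * w, lam * h
    r1, r2 = 2 * sigma1 + 1, 2 * sigma2 + 1          # window sizes
    ys, xs = np.mgrid[0:rows, 0:cols].astype(float)  # pixel coordinates
    g = np.exp(-0.5 * (((xs - x0) / sigma1) ** 2 + ((ys - y0) / sigma2) ** 2))
    # Weights are non-zero only strictly inside the r1 x r2 window.
    inside = (np.abs(xs - x0) < r1 / 2) & (np.abs(ys - y0) < r2 / 2)
    return np.where(inside, g, 0.0)
```

For example, with *w* = 20, *h* = 10, and *λ* = 0.1, the standard deviations are σ_{1} = 2 and σ_{2} = 1, so the window covers 5 × 3 pixels around the centre.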

The final Gaussian heat map weights at the coordinates (*x*_{g}, *y*_{g}) are obtained by rotating the weights of Eq. (1) about the centre point (*x*_{0}, *y*_{0}) by the rotating angle of the construction equipment (written *ag*), as shown in Eq. (2).

$$w_{{x_{g} ,y_{g} }} = \left\{ {\begin{array}{*{20}c} {w_{x,y} ,if\left[ {\begin{array}{*{20}c} {x_{g} = (x - x_{0} )\cos ag - (y - y_{0} )\sin ag + x_{0} } \\ {y_{g} = (x - x_{0} )\sin ag + (y - y_{0} )\cos ag + y_{0} } \\ \end{array} } \right]} \\ {0,others} \\ \end{array} } \right.$$

(2)
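Eq. (2) maps each point (*x*, *y*) to (*x*_{g}, *y*_{g}) by a rotation about the centre. On a pixel grid it is convenient to apply the inverse rotation to every output pixel and evaluate Eq. (1) there, which avoids holes in the rotated map. This inverse-mapping strategy, the function name `rotated_heat_map`, and the default `lam` are assumptions of the sketch rather than details given in the text.

```python
import numpy as np

def rotated_heat_map(shape, x0, y0, w, h, ag, lam=0.1):
    """Rotated Gaussian heat map following Eqs. (1) and (2).

    ag is the rotating angle in radians (assumed unit). Each output pixel
    (x_g, y_g) is rotated back to (x, y) and Eq. (1) is evaluated there.
    """
    rows, cols = shape
    sigma1, sigma2 = lam * w, lam * h
    r1, r2 = 2 * sigma1 + 1, 2 * sigma2 + 1
    yg, xg = np.mgrid[0:rows, 0:cols].astype(float)
    c, s = np.cos(ag), np.sin(ag)
    # Inverse of Eq. (2): offsets of the unrotated point from the centre.
    dx = (xg - x0) * c + (yg - y0) * s
    dy = -(xg - x0) * s + (yg - y0) * c
    g = np.exp(-0.5 * ((dx / sigma1) ** 2 + (dy / sigma2) ** 2))
    inside = (np.abs(dx) < r1 / 2) & (np.abs(dy) < r2 / 2)
    return np.where(inside, g, 0.0)
```

With *ag* = 0 the result coincides with the axis-aligned map of Eq. (1); with *ag* = π/2 the long axis of the Gaussian lies along *y* instead of *x*.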

The training loss of the enclosing contour detector (*L*_{det}, defined by Eq. (3)) has four components that mirror the detection head: the centre loss (*L*_{c}), offset loss (*L*_{o}), width and height loss (*L*_{wh}), and angle loss (*L*_{ag}). *λ*_{c}, *λ*_{o}, *λ*_{wh}, and *λ*_{ag} are the corresponding weights. The centre loss employs the focal loss for better training convergence, as defined in Eq. (4), where *α* and *β* are adjustment parameters and *N* is the number of heat map points. The other three components employ the L1 loss to regress the corresponding parameters.

The enclosing contour detection model is pretrained on the construction equipment images of the MOCS dataset proposed by An et al. [31]. For better generalization, the pretrained network is then fine-tuned on a collected overhead-view construction equipment dataset. The images of this dataset were captured by drone-borne cameras at different heights and angles; the dataset contains 600 images and 1570 equipment instances.

$$L_{\det } = \lambda_{c} L_{c} + \lambda_{o} L_{o} + \lambda_{wh} L_{wh} + \lambda_{ag} L_{ag}$$

(3)

$$L_{c} = - \frac{1}{N}\sum\limits_{xy} {\left\{ {\begin{array}{*{20}c} {(1 - \hat{w}_{xy} )^{\alpha } \log (\hat{w}_{xy} ){\text{, if }}w_{xy} = 1} \\ {(1 - w_{xy} )^{\beta } (\hat{w}_{xy} )^{\alpha } \log (1 - \hat{w}_{xy} ){\text{, otherwise}}} \\ \end{array} } \right.}$$

(4)
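The loss terms can be sketched in NumPy as follows. The centre loss is written with a leading minus sign, as in the standard focal loss, so that it is non-negative and decreases as the prediction improves. Several choices here are assumptions made for illustration: the function names, the values *α* = 2 and *β* = 4 (the CenterNet defaults), unit weights *λ*, and the reading of *N* as the number of centre points with *w*_{xy} = 1.

```python
import numpy as np

def centre_focal_loss(w_true, w_pred, alpha=2.0, beta=4.0, eps=1e-12):
    """Focal loss over a centre heat map, following Eq. (4).

    Assumptions: alpha and beta take the CenterNet defaults, and N is the
    number of points with w_true == 1 (at least 1 to avoid division by 0).
    """
    w_pred = np.clip(w_pred, eps, 1.0 - eps)   # keep the logarithms finite
    pos = w_true == 1.0
    pos_term = (1.0 - w_pred) ** alpha * np.log(w_pred)
    neg_term = (1.0 - w_true) ** beta * w_pred ** alpha * np.log(1.0 - w_pred)
    n = max(int(pos.sum()), 1)
    return float(-np.where(pos, pos_term, neg_term).sum() / n)

def l1_loss(pred, target):
    """Mean absolute error, used for offset, size, and angle regression."""
    return float(np.abs(np.asarray(pred) - np.asarray(target)).mean())

def detection_loss(l_c, l_o, l_wh, l_ag,
                   lam_c=1.0, lam_o=1.0, lam_wh=1.0, lam_ag=1.0):
    """Weighted sum of Eq. (3); the unit weights are placeholders."""
    return lam_c * l_c + lam_o * l_o + lam_wh * l_wh + lam_ag * l_ag
```

The factor (1 − *w*_{xy})^{β} reduces the penalty for points close to a ground truth centre, where the Gaussian heat map weight is already high.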

### Enclosing contour update

The detection module generates a high-confidence enclosing contour of the construction equipment in each frame but does not consider temporal context, so it cannot match construction equipment between frames. Inspired by Bewley et al. [32], this module employs a Kalman filter to model the frame-by-frame enclosing contours from the detection module in the time domain. The Kalman filter predicts the enclosing contours from the previous contours and weights the predicted contours against the detected contours to improve accuracy. The state variables of the OABB-based construction equipment motion (translation, size change and rotation) are given in Eq. (5):

$${\mathbf{x}} = \left[ {c_{x} ,c_{y} ,w,h,r,c^{\prime}_{x} ,c^{\prime}_{y} ,w^{\prime},h^{\prime},r^{\prime}} \right]^{{\text{T}}}$$

(5)

where \(c^{\prime}_{x}\),\(c^{\prime}_{y}\),\(w^{\prime}\),\(h^{\prime}\) and \(r^{\prime}\) are the first derivatives of the corresponding OABB parameters. Assuming that the construction equipment is moving at a relatively low speed (reasonable for equipment at construction sites), the position, size and orientation of the equipment change uniformly over a short time Δ*t*. The state function describing the OABB-based construction equipment motion can be expressed as Eq. (6):

$${\hat{\mathbf{x}}}_{k|k - 1} = \left[ {\begin{array}{*{20}c} 1 & 0 & 0 & 0 & 0 & {\Delta t} & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 & {\Delta t} & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 & 0 & {\Delta t} & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & {\Delta t} & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & {\Delta t} \\ 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\ \end{array} } \right]\left\{ \begin{gathered} c_{x,k - 1} \hfill \\ c_{y,k - 1} \hfill \\ w_{k - 1} \hfill \\ h_{k - 1} \hfill \\ r_{k - 1} \hfill \\ c^{\prime}_{x,k - 1} \hfill \\ c^{\prime}_{y,k - 1} \hfill \\ w^{\prime}_{k - 1} \hfill \\ h^{\prime}_{k - 1} \hfill \\ r^{\prime}_{k - 1} \hfill \\ \end{gathered} \right\} = {\mathbf{Fx}}_{k - 1} { + }{\mathbf{w}}_{k - 1}$$

(6)

where \({\mathbf{x}}_{k - 1}\) represents the construction equipment state at the (*k*-1)^{th} frame and \({\hat{\mathbf{x}}}_{k|k - 1}\) is the state estimate at the *k*^{th} frame calculated from \({\mathbf{x}}_{k - 1}\) and the state function; \(\Delta t\) is the time interval between frames, and **F** is the state transition matrix; \({\mathbf{w}}_{k - 1}\) is the process noise of the equipment motion model, assumed to be white noise with zero mean and covariance \({\mathbf{Q}}_{k - 1} = E\left( {{\mathbf{w}}_{k - 1} {\mathbf{w}}_{k - 1}^{{\text{T}}} } \right)\). The covariance of the state variables, described by the state covariance matrix **P**, is propagated through the linear equipment motion model using Eq. (7):

$${\hat{\mathbf{P}}}_{k|k - 1} = {\mathbf{FP}}_{k - 1} {\mathbf{F}}^{{\text{T}}} + {\mathbf{Q}}_{k - 1}$$

(7)

where \({\hat{\mathbf{P}}}_{k|k - 1}\) is the predicted state covariance matrix, obtained from the optimal estimate \({\mathbf{P}}_{k - 1}\) and the equipment motion model.
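The prediction stage of Eqs. (6) and (7) can be sketched in NumPy as follows. The function names and the numerical values in the usage note are assumptions for illustration; the block structure of **F** follows Eq. (6).

```python
import numpy as np

def make_transition(dt, n=5):
    """State transition matrix F of Eq. (6).

    The state holds n OABB parameters followed by their n first derivatives,
    so F is the identity with dt on the upper-right diagonal block.
    """
    F = np.eye(2 * n)
    F[:n, n:] = dt * np.eye(n)
    return F

def kalman_predict(x, P, F, Q):
    """Propagate the state mean (Eq. (6)) and covariance (Eq. (7))."""
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    return x_pred, P_pred
```

For instance, a state with centre (100, 50) and centre velocity (2, −1) per frame is predicted at (102, 49) in the next frame when Δ*t* = 1, while its velocity terms are unchanged.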

In the Kalman prediction stage, the predicted contours differ to some extent from the actual contours. Therefore, the contour information of the detected construction equipment is used as the measurement for the Kalman update. The mapping from the state vector to the measurements is shown in Eq. (8), where **z**_{k} is the measurement at the *k*^{th} frame, and **H** is the measurement matrix. Only the first five parameters can be acquired from the detected contours; thus, the size of **H** is 5 × 10.

$${\mathbf{z}}_{k} = \left[ {\begin{array}{*{20}c} 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\ \end{array} }\,\,\right]\left[ {\begin{array}{*{20}c} {c_{x,k} } \\ {c_{y,k} } \\ {w_{k} } \\ {h_{k} } \\ {r_{k} } \\ {c^{\prime}_{x,k} } \\ {c^{\prime}_{y,k} } \\ {w^{\prime}_{k} } \\ {h^{\prime}_{k} } \\ {r^{\prime}_{k} } \\ \end{array} } \right] = {\mathbf{Hx}}_{k} + {\mathbf{v}}_{k}$$

(8)

where \({\mathbf{v}}_{k}\) represents the measurement noise, assumed to be white noise with zero mean and covariance \({\mathbf{R}}_{k} = E\left( {{\mathbf{v}}_{k} {\mathbf{v}}_{k}^{{\text{T}}} } \right)\). The Kalman gain (\({{\varvec{\upkappa}}}\)), calculated using Eq. (9), is the core matrix in the Kalman filter; it takes both the prediction and the measurements into account when updating the state.

$${{\varvec{\upkappa}}}_{k} {\mathbf{ = \hat{P}}}_{k|k - 1} {\mathbf{H}}^{{\text{T}}} \left( {{\mathbf{H\hat{P}}}_{k|k - 1} {\mathbf{H}}^{{\text{T}}} {\mathbf{ + R}}_{k} } \right)^{ - 1}$$

(9)

Using the Kalman gain, the state vector and state covariance of the construction equipment from the Kalman prediction can be updated using Eqs. (10) and (11). The updated OABB information of the construction equipment, which takes the temporal detection information into account, is set as the final tracked enclosing contour of the *k*^{th} frame.

$${\mathbf{x}}_{k} = {\hat{\mathbf{x}}}_{k|k - 1} + {{\varvec{\upkappa}}}_{k} \left( {{\mathbf{z}}_{k} - {\mathbf{H}}{\hat{\mathbf{x}}}_{k|k - 1} } \right)$$

(10)

$${\mathbf{P}}_{k} = \left( {{\mathbf{I}} - {{\varvec{\upkappa}}}_{k} {\mathbf{H}}} \right){\hat{\mathbf{P}}}_{k|k - 1}$$

(11)
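The update stage of Eqs. (8) to (11) can be sketched in NumPy as follows, with the innovation computed from the predicted state. The function name `kalman_update` and the values in the usage note are assumptions for illustration.

```python
import numpy as np

def kalman_update(x_pred, P_pred, z, R, n=5):
    """Kalman update with a detected OABB z = [c_x, c_y, w, h, r].

    x_pred, P_pred: predicted state (2n,) and covariance (2n, 2n)
    R:              measurement noise covariance (n, n)
    """
    # Measurement matrix of Eq. (8): selects the first n state variables.
    H = np.hstack([np.eye(n), np.zeros((n, n))])
    S = H @ P_pred @ H.T + R                 # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)      # Kalman gain, Eq. (9)
    x = x_pred + K @ (z - H @ x_pred)        # state update, Eq. (10)
    P = (np.eye(2 * n) - K @ H) @ P_pred     # covariance update, Eq. (11)
    return x, P
```

When the predicted covariance and the measurement noise are both identity matrices, the gain on the five measured parameters is 0.5, so the updated contour lies halfway between the predicted and detected contours.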

### Tracking ID managing

The allocation of construction equipment IDs is a core issue in multiple construction equipment tracking. In most HBB-based tracking methods, the boxes of multiple objects overlap, which makes the data association between frames highly complex. For construction equipment represented by OABBs, there is hardly any overlap between the OABBs. Therefore, this research employs the IOU of the OABBs as the indicator for the ID managing part, calculated by Eq. (12).

$$IOU\left(a,b\right)=\frac{OABB_a\bigcap OABB_b}{OABB_a\bigcup OABB_b}$$

(12)
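Because OABBs are rotated rectangles, their intersection is a convex polygon rather than a rectangle. One possible way to evaluate Eq. (12) is to clip one box against the other with the Sutherland-Hodgman algorithm and take polygon areas with the shoelace formula. This choice of algorithm, the function names, and the (x, y, w, h, r) tuple layout with *r* in radians are assumptions of the sketch, not details given in the text.

```python
import math

def corners(box):
    """Corners of an OABB (x, y, w, h, r), counter-clockwise, r in radians."""
    x, y, w, h, r = box
    c, s = math.cos(r), math.sin(r)
    half = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    return [(x + dx * c - dy * s, y + dx * s + dy * c) for dx, dy in half]

def area(poly):
    """Polygon area by the shoelace formula."""
    pairs = zip(poly, poly[1:] + poly[:1])
    return 0.5 * abs(sum(x1 * y2 - x2 * y1 for (x1, y1), (x2, y2) in pairs))

def clip(subject, clipper):
    """Sutherland-Hodgman: clip a convex polygon by a convex CCW polygon."""
    out = subject
    for (ax, ay), (bx, by) in zip(clipper, clipper[1:] + clipper[:1]):
        if not out:
            break
        inp, out = out, []

        def side(p):
            # Positive when p lies to the left of the edge a -> b (inside).
            return (bx - ax) * (p[1] - ay) - (by - ay) * (p[0] - ax)

        for p, q in zip(inp, inp[1:] + inp[:1]):
            sp, sq = side(p), side(q)
            if sp >= 0:
                out.append(p)
            if sp * sq < 0:
                # Edge p -> q crosses the clip line: add the crossing point.
                t = sp / (sp - sq)
                out.append((p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1])))
    return out

def oabb_iou(a, b):
    """IOU of two OABBs following Eq. (12)."""
    poly = clip(corners(a), corners(b))
    inter = area(poly) if len(poly) >= 3 else 0.0
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0
```

As a check, two 2 × 2 squares with the same centre, one rotated by 45°, intersect in a regular octagon, and their IOU is 1/√2 ≈ 0.707.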

The ID allocation of construction equipment can be divided into three states: add, keep, and delete. The IOU is calculated between the detected contours and the predicted contours. When the IOU is greater than the preset threshold (*IOU*_{t}), the pair is denoted as 'matched'; otherwise, it is denoted as 'unmatched'. When a detected contour remains unmatched for three consecutive frames, a new equipment ID is added. When a predicted contour remains unmatched for three consecutive frames, the corresponding equipment ID is deleted. A matched detected OABB is used as the measurement in the Kalman update to generate the final tracked contour, and the corresponding equipment ID is kept.
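The add, keep, and delete rules can be sketched as a small state machine. Several parts are assumptions made for illustration: the class and function names, greedy one-to-one matching by descending IOU (the assignment algorithm is not stated in the text), a list of tentative candidates used to count consecutive unmatched detections, and replacing the stored box with the matched detection in place of the Kalman prediction and update. The IOU function is supplied by the caller.

```python
def greedy_match(iou_matrix, iou_t):
    """Greedy one-to-one matching; rows are predictions, columns detections."""
    pairs = sorted(
        ((v, i, j) for i, row in enumerate(iou_matrix) for j, v in enumerate(row)),
        reverse=True,
    )
    used_i, used_j, matches = set(), set(), []
    for v, i, j in pairs:
        if v > iou_t and i not in used_i and j not in used_j:
            matches.append((i, j))
            used_i.add(i)
            used_j.add(j)
    return matches

class IdManager:
    def __init__(self, iou_fn, iou_t=0.3, patience=3):
        self.iou_fn, self.iou_t, self.patience = iou_fn, iou_t, patience
        self.tracks = {}       # ID -> {"box": OABB, "misses": int}
        self.candidates = []   # tentative detections: {"box": OABB, "hits": int}
        self.next_id = 1

    def _match(self, boxes_a, boxes_b):
        matrix = [[self.iou_fn(a, b) for b in boxes_b] for a in boxes_a]
        return greedy_match(matrix, self.iou_t)

    def step(self, detections):
        """Process one frame of detected OABBs and return the active IDs."""
        ids = list(self.tracks)
        matches = self._match([self.tracks[i]["box"] for i in ids], detections)
        matched_ids, matched_det = set(), set()
        for ti, dj in matches:                       # keep
            self.tracks[ids[ti]] = {"box": detections[dj], "misses": 0}
            matched_ids.add(ids[ti])
            matched_det.add(dj)
        for tid in ids:                              # delete
            if tid not in matched_ids:
                self.tracks[tid]["misses"] += 1
                if self.tracks[tid]["misses"] >= self.patience:
                    del self.tracks[tid]
        leftover = [d for j, d in enumerate(detections) if j not in matched_det]
        prev = {dj: ci for ci, dj in
                self._match([c["box"] for c in self.candidates], leftover)}
        candidates = []
        for dj, det in enumerate(leftover):          # add
            hits = self.candidates[prev[dj]]["hits"] + 1 if dj in prev else 1
            if hits >= self.patience:
                self.tracks[self.next_id] = {"box": det, "misses": 0}
                self.next_id += 1
            else:
                candidates.append({"box": det, "hits": hits})
        self.candidates = candidates                 # unmatched candidates lapse
        return sorted(self.tracks)
```

In a full tracker, the stored box of each ID would be the Kalman-predicted contour, and each matched detection would be passed to the Kalman update instead of overwriting the box directly.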