Multiclass anomaly detection in imbalanced structural health monitoring data using convolutional neural network

Structural health monitoring (SHM) system aims to monitor the in-service condition of civil infrastructures, incorporate proactive maintenance, and avoid potential safety risks. An SHM system involves the collection of large amounts of data and data transmission. However, due to the normal aging of sensors, exposure to outdoor weather conditions, accidental incidences, and various operational factors, sensors installed on civil infrastructures can get malfunctioned. A malfunctioned sensor induces significant multiclass anomalies in measured SHM data, requiring robust anomaly detection techniques as an essential data cleaning process. Moreover, civil infrastructure often has imbalanced anomaly data where most of the SHM data remain biased to a certain type of anomalies. This imbalanced time-series data causes significant challenges to the existing anomaly detection methods. Without proper data cleaning processes, the SHM technology does not provide useful insights even if advanced damage diagnostic techniques are applied. This paper proposes a hyperparameter-tuned convolutional neural network (CNN) for multiclass imbalanced anomaly detection (CNN-MIAD) modelling. The hyperparameters of the proposed model are tuned through a random search algorithm to optimize the performance. The effect of balancing the database is considered by augmenting the dataset. The proposed CNN-MIAD model is demonstrated with a multiclass time-series of anomaly data obtained from a real-life cable-stayed bridge under various cases of data imbalances. The study concludes that balancing the database with a time shift window to increase the database has generated the optimum results, with an overall accuracy of 97.74%.


Introduction
Civil infrastructure is rapidly aging worldwide due to increasing natural calamities, population, and operational loads. Structural health monitoring (SHM) system [7,13] is a valuable data-driven technology that involves the integration of sensors, data transmission, data analysis, and decision-making to protect these structures. With the advent of next-generation sensors, SHM has been an emergent and powerful diagnostic tool for damage detection and disaster mitigation of largescale structures [24,34,45]. In the past several decades, SHM systems have been widely applied to different civil infrastructures, including bridges, tunnels, railways, and buildings. Without an accurate and timely SHM, it may result in a substantial and costly repair of the infrastructure when there is significant damage identified in the structure [5,30,41]. Therefore, it is critical to detect any damages at the earliest possible stage to avoid substantial service interruption or potential safety risks. The lifespan of a civil infrastructure could also be lengthened with proactive detection and proper maintenance [42].
However, being exposed to outdoor environments, SHM systems of civil infrastructure face significant implementation challenges from environmental and measurement noises, sensor faults, data acquisition malfunctions, and operational conditions during the data collection stage [1]. Sensors are also affected by poor installation, inadequate maintenance, lack of protective coverings, and delayed replacement [23]. This often leads to the collection of anomalous data, requiring anomaly detection (AD) techniques such that only good quality data is used for structural diagnostics and decision-making.
AD is an essential step for data cleaning and preparation purposes of SHM before any diagnostic algorithms are implemented. The lifetime of SHM sensors is typically much shorter than that of the infrastructure [25], resulting in the normal aging and the need for the replacement of sensors. Installed sensors are often exposed to various weather conditions [10] and accidental incidences, which cause the sensors to lose connectivity, resulting in a hindrance to the measurement for key feature extraction. Sensor malfunctioning and faulty signals may potentially cause destabilization of the entire system and provide misleading information about the current condition of the structures [48,50]. Without proper data cleaning processes, the SHM technology is unable to provide valuable insights even if advanced damage diagnostic techniques have been applied. With the rapid growth of computational power, algorithm improvements, and data collection, various machine learning (ML) and deep learning (DL) methods [44] have been applied to the realm of SHM and AD.
AD techniques and algorithms are often proposed for binary classification and studied on in-lab or simulated data, aiming to separate anomalies from normal data. Steiner et al. [46] focused on detecting anomalous data due to sensor faults in wireless sensor networks. They proposed support vector regression for preserving the constrained resources of wireless sensor nodes. The proposed method was tested on a prototype wireless SHM system on a four-story shear frame structure for automated and decentralized sensor fault detection and isolation. Zhu et al. [56] explored a temperaturedriven moving principal component analysis method for AD. The proposed method calculated the covariance matrix within a pre-selected window size instead of the whole time series. Using the simulated case studies, it was concluded that the proposed method was more sensitive than the traditional moving principal component analysis, and it was able to detect anomalies during the expected period without any delay. Yang et al. [53] proposed a model to test a dataset obtained from a threestory laboratory structure. The proposed model was an end-to-end-trainable deep scenario based on a deep support vector data description. The proposed model was to map most of the data network representation into a hypersphere and define an anomaly score based on the distance of the point to the center of the hypersphere. Data that lie far away from the center or outside the hypersphere was identified as anomaly data.
Maes et al. [27] studied the effect of linear regression and linear principal component analysis models for AD in tunnel monitoring. The techniques were applied to a dataset obtained from an in-situ monitoring campaign in a tunnel. It was concluded that linear regression analysis was not suited as an AD tool for the tunnel-soil system due to a strong temporal dependence. The simulated numerical analysis showed the need for a sufficiently long training period, as well as the need for visual inspections and analysis coupled with linear principal component analysis. Wedel and Marx [52] studied the transient relationship between the air temperature and the bridge temperature with regression methods. The regression model was tested with long-term monitoring data of bridges in Germany. The research concluded that the ML methods could be used to detect sensor faults by comparing real measured values and predicted behavior. Sensor compensation via predicted sensor measurement was possible, assuming isolated and local sensor fault.
AD can be viewed as a data cleaning process for data compression and data reconstruction. Ni et al. [31] focused on finding an efficient unsupervised method for SHM data compression. They proposed a one-dimensional convolutional neural network (1D-CNN) with Adam Algorithm and mini-batch gradient descent for AD. Data were defined as abnormal when the measurements showed unacceptable deviations from the true values of the measured variables in either time or frequency domain. However, the proposed model was only useful for binary classification, namely the normal and anomalous categories. Jeong et al. [21] proposed a bidirectional recurrent neural network for sensor data reconstruction based on spatiotemporal correlation. For a system with N sensors, one was treated as the targeted output data, while sensors with high spatial correlations with the output sensor were treated as input sensors, resulting in an N combination. The reconstruction error was used as an indicator to detect potential anomalies of the output sensor. The faulty sensor was isolated based on the idea that the faulty sensor only causes local effects, and the proposed algorithm resulted in a higher computational cost for binary classification. Recently, Mao et al. [28] argued that supervised learning methods of AD had two unresolved challenges: imbalanced dataset and incompleteness of anomalous patterns for the training dataset. They proposed to use deep convolutional generative adversarial networks with autoencoders. The input for the network was transformed Gramian Angular Field images from time-series data, while the output was classified as either normal or anomalous data. Sarmadi and Karamodin [37] also explored unsupervised learning since it only required information on a single known structural state. They introduced an adaptive Mahalanobis-squared distance and one-class kNN rule to formulate a new multivariate distance. The proposed method improved the conventional Mahalanobis-squared distance technique for non-Gaussian or heavy-tailed distribution. The algorithm was tested in a lab truss structure. Similar to machine learning techniques, researchers also explored various deep learning techniques for AD. Bao et al. [4] explored data visualization by converting the time-series signals into images. This was achieved by splitting the data into sections and plotting it in grayscale images. They trained a deep neural network with greedy layer-wise pre-training and a fine-tuning stage. The proposed model was tested on acceleration data that were divided into six patterns: missing, minor, outliers, squares, trends, and drifts. It resulted in a total accuracy of 87% for one-year test data using a real-life cable-stayed bridge. Tang et al. [47] presented a two-dimensional CNN (2D CNN) with a combination of time-domain response and frequency-domain response as input images. The frequency-domain response was obtained through Fast Fourier Transform. Using a balanced dataset, the proposed model was verified on a long-span cable-stayed bridge, which achieved an overall accuracy of 93%. Arul and Kareem [2] presented to use a relatively new time-series representation named "Shapelet Transform" in combination with a Random Forest classifier to autonomously identify anomalies in SHM data. Jana et al. [20] proposed CNN for detecting the presence of a sensor fault and convolutional autoencoder to reconstruct sensor data based on its identified type. The proposed model was tested using simulated and experimental datasets to demonstrate its performance. However, the model was designed to train data with single fault types using a simulated balanced dataset. Liu et al. [26] proposed to use a generative adversarial network and CNN-based AD technique. The model contains a three-channel input by combining time-series data with its fast Fourier transform and Gramian angular field output. The proposed model used a generative adversarial network for addressing class-imbalance issues, followed by CNN for classification tasks.
Despite a large amount of work on AD using ML and DL, SHM data still has several implementation challenges associated with the availability of balanced training data. Success in implementing DL depends predominately on access to large amounts of data [12]. If the dataset is too small, the lack of sufficient image samples makes it difficult to converge in end-to-end learning [33]. Recently, AD techniques were also explored to imbalance datasets in different disciplines [8,19,51]. However, these studies were mostly focused on a single type of anomaly, which often results in poor accuracy for multiple anomalies [3]. To the authors' knowledge, there exist limited studies on multiclass AD of SHM data using limited imbalanced time-series datasets. Besides the size and imbalance issues of the sensor data, each ML algorithm has a hyperparameter setting. The optimal hyperparameter helps in building a better ML model [32], while the tuning process to find the optimal combination of hyperparameters in the DL models improves the overall accuracy and minimizes the loss function effectively. This paper proposes a hyperparameter-tuned CNN for multiclass imbalanced anomaly detection (CNN-MIAD) modelling of time-series SHM data.
The proposed CNN-MIAD model is a novel approach to solving multiclass AD with a limited imbalanced timeseries dataset. In the real world, it is impractical to label a large amount of SHM data. Thus, there may exist a limited input sample where the proposed CNN-MIAD model aims to include an overlapping sliding window to increase the size of the dataset. The performance of this model is compared and examined using the acceleration data from a real-life cable-stayed bridge in three cases with both balanced and imbalanced datasets of varying input sizes. The inclusion of the random search hyperparameter tuning strategy has further optimized the model performance.
The paper is structured as follows: Section 2 presents the proposed CNN-MIAD model with an overview of the CNN methodology and hyperparameter tuning processes; Section 3 illustrates the proposed CNN-MIAD model on imbalanced anomaly data of a real-life cablestayed bridge using data augmentation; Section 4 discusses the analysis and results of the proposed study. Section 5 presents the key conclusions and outcomes of this research.

Convolutional neural network
Deep learning (DL) can automatically learn abstract features of the original data and classify them effectively, avoiding the shortcoming of requiring handcrafted features designed by engineers [55]. CNN is a type of DL method that has seen rapid progress in recent decades due to the developments in computing power, the advent of large amounts of labelled data, and supplemented by improved algorithms in many disciplines [35,44]. It is suited for structured data such as images and mainly comprises the convolutional layers, pooling layers and fully connected layers. CNNs have been highlighted in computer vision, image recognition, and classification tasks, which are inspired by the visual cortex of animals [9]. During the training of a convolutional model, the model learns the spatial relationships between features within the target dataset, which may then be applied to test data for classification or recognition tasks. The 2D convolution function is used for the input image in each convolutional layer to produce a tensor of outputs. The convolution is the sum of the products between each image pixel and kernel pixel. Each convolutional layer is followed by a pooling layer to reduce the spatial size of an input array as a form of down-sampling. Adding a pooling layer to the model does not only save the spatial information of pixels but also reduces computational costs [36]. Scherer et al. [38] empirically proved that max-pooling operation was vastly superior for capturing invariance in the data and could lead to improved generalization and faster convergence when compared to a subsampling operation. The last few layers in a CNN model are the fully-connected layers, which require a flattened pooling layer or convolutional layer as the input. The neurons in the fully-connected layers provide a global connection to every neuron in the preceding layer, whereas the neurons in the convolutional layer are only connected to the neighbouring neurons based on the kernel size. The fully connected layers are used to get the probabilities of the input being in a particular class for classification tasks.

Hyperparameter tuning
Every machine learning (ML) system has hyperparameters, and it is critical to set these hyperparameters to optimize performance [17]. Examples of hyperparameters include the learning rate, batch size, and the number of neurons for the hidden layers. The learning rate of the model has a strong impact on the stability and efficiency of training. Choosing a learning rate that is too large results in the instability and divergence of the objective function, whereas choosing a learning rate that is too small results in slow learning and inefficiency [54]. Batch size defines the number of inputs that will be propagated through the network each time. Batch normalization, as one of the common regularization strategies, aims to deal with noise data, the limited size of the training data, and the complexity of classifiers to avoid overfitting [49]. Using a smaller batch size requires less memory and results in faster training; however, setting the batch size too small will result in less accuracy for the estimate of the gradient. Deciding the number of neurons in the hidden layers is important as too few neurons will result in the underfitting of the model, whereas too many neurons may result in overfitting and increase the time for training [16].
However, these hyperparameters cannot be directly estimated from data, and there exist no analytical formulas to calculate their appropriate values [22]. This leads to the need for hyperparameter optimization. To tune the hyperparameters, grid search is the most popular method, which allows the user to specify a finite set of hyperparameter combinations. As the number of tuning hyperparameters increases, the required number of function evaluations grows exponentially due to the full factorial design [29]. This results in wasted computational resources and inefficiencies as every combination of hyperparameter values has to be examined. An alternative to grid search is the random search algorithm, where a user only specifies the search space as a boundary of hyperparameter values. As the name suggests, the combinations of hyperparameters are randomly sampled within the domain. Bergstra and Bengio [6] proved that randomly chosen trials are more efficient for hyperparameter optimization than grid search. There also exists Bayesian optimization algorithm, which contains two key components: the probabilistic surrogate model consisting of a prior distribution that models the unknown objective function and an acquisition function that is optimized for deciding where to sample next [39]. This global optimization strategy requires a time-consuming procedure and results in a high computational cost to perform more computation on the next iterations. It is also usually unclear for a given practical problem what an appropriate choice is for the covariance function and its associated hyperparameters [43]. Therefore, due to the proven efficiency of the random search algorithm, the proposed method incorporates random search as the hyperparameter tuning method.

Data imbalance and augmentation
2D CNN models use images or image-like matrix input. To convert time-series data into image representation, the sensor readings are plotted against time. This conversion of time-series data into image representation serves the purpose of changing dimensions for data preprocessing. The time frame for each image should be kept constant to maintain the consistency of the input. Each image in the whole database is converted to grayscale with a matching dimension to the required CNN model input. The dataset is randomly split into a training set, validation set, and testing set based on a 70-20-10 ratio. Both the training set and the validation set are used to train the CNN model with hyperparameter tuning processes. The best combination of hyperparameters of a model is determined by the highest training accuracy and lowest validation loss. After generating the best-trained model with the appropriate hyperparameters, the testing set is used to evaluate the accuracy of the fully trained CNN model. However, a dataset is described as imbalanced when at least one category has relatively fewer samples than the other categories in classification problems. For classification tasks, the DL algorithm favours a balanced dataset as it provides equal information for each class, otherwise, the minority observations are likely treated as noise and ignored in the process. In some cases, most test data samples are classified into the majority group, resulting in the classification accuracy of the minority class tending to be much lower than that of the majority class [18]. This causes the imbalanced dataset to have higher false negatives on minor classes. The predictive output of classifiers trained with an imbalanced dataset is also biased because classifiers are less sensitive to the minority classes [14]. Furthermore, the class with the lowest number of samples is usually the class of interest from a learning task perspective [15]. Augmentation techniques can be used to address the imbalanced data issue, especially when the original dataset is too small [11,40]. In time-series data, one way to address the data imbalance issue is through data augmentation using a sliding window. As illustrated in Fig. 1, time-series data with a non-overlapping sliding window requires more raw data points to generate the same amount of input samples. The number of raw data points required would further decrease with a shortened interval between each time shift. Using this shortened time shift strategy, the number of input samples for the minor classes will increase with the same amount of data points in time series data. Figure 2 shows the architecture and parameters of the proposed CNN-MIAD model. The required input for the proposed model is grayscale images with an input size of 100 × 100 pixels obtained from the time-series of anomaly data. This is the first layer feeding into the CNN-MIAD model. The input then passes through three sets of convolution and max-pooling layers. The channel size increases with each convolution layer, whereas the dimension decreases due to the kernel size of 5 × 5. The kernel size for the max-pooling layers is 2 × 2, which further reduces the size of the convolved feature map. The default for the convolutional layer is a stride of one with no padding, whereas for the max-pooling layer is a stride of 2 with no padding. The tensor is then fed into a fully connected layer with a rectified linear unit (ReLU), where the number of neurons is one of the tuning hyperparameters. The output is then fed into the final layer, which uses a softmax classifier to generate the final output. The tuned hyperparameters in this model include the number of neurons in the second fully connected layer, the batch size, and the learning rate.

Performance metrics
Confusion matrices are a way of visualizing the performance of a classification model, where the predicted labels are plotted against the actual labels of a dataset. By doing this, it can be seen how accurately a model is able to classify a particular class, especially for a multiclass classification problem. For this study, the confusion matrix with detailed analysis on the accuracy, precision, recall and F1 score is used to analyze the performance of the model for various input datasets. The definitions of these metrics are defined in Eq. 1.
Where, true positive (TP) means a sample is correctly classified into its own category; true negative (TN) means a sample does not belong to a particular class and has been correctly identified in the other category; false positive (FP) means a sample has been misclassified from one of the other categories to the target category; false negative (FN) means a sample has been misclassified to one of the other categories from the target category. Accuracy is a basic metric representing the ratio of correct predictions and the total number of predictions. Precision returns the ratio between the TP and all the predicted positives, measuring a classifier's exactness. On the other hand, recall returns the ratio (1) between the TP and all the labelled positives, providing a measure of how accurately the model is able to identify the relevant data. F1 score is a harmonic mean of the precision and recall scores. These metrics are used to report the performance of the proposed CNN-MIAD model as the last step. Figure 3 provides a summary of the workflow of the proposed CNN-MIAD method to perform multiclass AD using imbalanced data.

Description of the dataset
The proposed CNN-MIAD model is analyzed with a dataset that consists of one month of acceleration data for a long-span cable-stayed bridge in China (IPC-SHM 2020). The dataset contains 38 acceleration sensors, and their locations are illustrated in Fig. 4. Each sensor collects data at a sampling frequency of 20 Hz. The data collected by the sensors are labelled into seven different classes: normal, missing, minor, outlier, square, trend, and drift. A detailed description of each of the seven classes is listed in Table 1. As part of the data preprocessing, the time series data is converted into images. Figure 5 shows an example of the input image for each class, the x-axis for each image represents the time, and the y-axis represents the acceleration. Each image has a one-hour time duration in the original database. The dimension for each of the original images is 875 × 656 pixels.   Table 2 Quantity and percentage for each data class of the selected benchmark data

Data preparation and augmentation
The original dataset is imbalanced, with nearly 50% of the data in the Normal class and only ~ 2% data belonging to the Outlier class. Although this is always the case in practice, data augmentation techniques can be implemented as a preprocessing step before feeding the input to the proposed CNN-MIAD model. As part of the data preparation process after the collection of raw data, three cases have been created, as shown in Table 2, to verify the effects of data imbalance. Case 1 (C1) uses the  Table 3 Ratio and quantity for the training, validation, and testing sets   Fig. 6, in which the accelerometer response for a single sensor is plotted against time. The hourly timeframe can only be joined together if it is consecutive, and the pattern label remains the same. The augmentation for C3 uses a sliding window of five minutes instead of one hour for the classes with less than 3000 images originally. Random selection is implemented to choose 3000 images from each class in the newly generated database, aiming to reduce the possibility of correlation and selection bias. Before the original images are processed into the CNN-MIAD model, each image is resized from 875 × 656 pixels to 100 × 100 pixels and converted from RGB to greyscale, as shown in Fig. 7. This resizing of the image maintains the consistency of the entire database. Each database is split into training, validation, and testing sets based on a 70-20-10 split ratio, respectively. The ratio and quantity of images for each of the subsets in the database have been included in Table 3. In summary, the data preparation and augmentation steps include the conversion of time-series data into image representations, generating three cases using overlapping sliding windows and a random selection, changing dimensions, and splitting into training, validation, and testing sets.

Hyperparameter tuning
The proposed CNN-MIAD model is trained with the aforementioned dataset in three cases, respectively. The random search algorithm is applied for the hyperparameter tuning process. The tuning parameters include the number of neurons in the second fully connected layer, the batch size, and the learning rate. The domain for each of the hyperparameters is included in Table 4. For each of the three cases, ten different hyperparameters are randomly generated. The trial with the highest training accuracy and lowest validation loss is defined as the best hyperparameter combination. The training process of the proposed CNN-MIAD model with the best hyperparameter combination for each of the three cases is shown in Fig. 8. The detailed results of the tuning process for C1, C2, and C3 are included in Tables 5, 6, and 7, respectively.

Performance evaluation
To evaluate the performance of the proposed CNN-MIAD model, five repetitions are generated using the best-tuned parameters for each of the three cases. As shown in Table 8, the overall accuracies and losses are calculated using the mean value for all five trials in each case. The training and testing accuracies are greater than 90% in all fifteen trials, meaning that the proposed model is free of underfitting. On the other hand, the testing accuracies and losses are close to the training results for each of the respective trials, meaning that the proposed model is not overfitting the training set.
To further understand the effects of balancing the database and the number of input image samples on the model output, the confusion matrix against the fifth trial of the testing set of each case has been included in Fig. 9. Using the confusion matrix for each case, the precision, recall and F1 score for C1, C2, and C3 are included in Table 9.
All three cases are trained on a Linux server with 24 Intel(R) Xeon(R) E5-2630 v2 processor. The training time for each random search model is around 130 minutes, 65 minutes, and 110 minutes for C1, C2, and C3, respectively. As observed by comparing C1 and C2 results, a database with more data points will generate a higher accuracy. The number of data samples in C1 and C2 are 28,234 and 3682, respectively. The accuracy of the model decreased from 97.70% to 93.59% as the number of image samples decreased. The F1 score for classes Normal and Minor has decreased drastically due to the smaller sample size for the training set. Although both C2 and C3 are balanced databases, the overall accuracy for C3 has increased from 93.59% in C2 to 97.74% due to the larger sample size. As seen through C1 and C3 results, although the overall accuracy is similar, the F1 scores for the Outlier and Drift classes have been remarkably improved Table 7 Hyperparameter tuning for C3 Table 8 Overall accuracy and loss of three cases with the optimally-tuned hyperparameter

Conclusions
This paper proposes a deep learning algorithm for the AD tasks within the realm of SHM. The proposed CNN-MIAD model requires the conversion of time series data into greyscale images, which then pass through several layers with hyperparameter tuning techniques to generate the best-performing model. The proposed CNN-MIAD model is tested with three cases using a database generated by a long-span cable-stayed bridge in China. The three cases are the original unbalanced database, balancing the database using the smallest number of images in a class, and balancing the database with augmented samples. The results reveal that balancing the database allows to improve the F1 score of the classes with fewer image samples originally, and a database with more image samples will increase the overall accuracy of the model performance. It concludes that augmenting the balanced database leads to the best results, with a  mean overall accuracy of 97.74% across five trials on the testing set. The application for the proposed CNN-MIAD model is not restricted to the realm of SHM but could be applied to classification problems containing timeseries data as raw data points. The complete flow of the model requires standardizing the input into a grayscale image database before passing it into the CNN-MIAD model. With sufficient raw data preprocessing, the CNN-MIAD model does not need modifications as the images are passing through. Incorporating a hyperparameter tuning process to categorize different classes satisfies the needs of a deep learning algorithm to achieve optimal results. Using the random search algorithm to tune the hyperparameters is more efficient by setting up the domain for each tuning parameter. The proven success of the CNN-MIAD model in SHM demonstrates the potential to expand its use in other imbalance dataset classification applications.