Two water supply networks (WSNs) were analyzed to illustrate the proposed data generation framework and to develop new leak detection methods. The first network is a relatively small WSN that has been widely used as a standard testbed; it was chosen to illustrate the data generation process and the development and validation of the proposed leak detection methods. The second network is a large WSN containing 5 district meter areas (DMAs), multiple water sources (7 tanks and 1 reservoir), and 6 different control rules (such as valve controls and pump controls). The second WSN was used to demonstrate the performance of the developed leak detection method under more complex conditions.
Case study – I: Rancho Solano Zone III water distribution system
The Rancho Solano Water Network is located in the city of Fairfield, California. This network was published by the ASCE task committee on a research database for water distribution systems [17]. The graph of this water supply network is shown in Fig. 3. There are 112 nodes in total, including one reservoir as the source of water, and 126 pipes. The elevations of the nodes in this pipe network range from 90 m to 120 m, and the lengths of the pipes range from 90 m to 130 m.
The water distribution network and the base water demand at each service node are shown in Fig. 3. The base demand of each node is drawn from a uniform distribution between 0.008 and 0.012 L/s. The actual demand at each node is generated by adding Gaussian noise N(0, σ) with σ = 0.01 L/s. Eleven demand ratios from 0.3 to 1.3 are considered during data generation with the hydraulic model of the WSN. The monitoring sensors are assumed to be deployed in the area marked by the red circle, i.e., the water pressure data of the nodes located inside the red circle in Fig. 3 are used for leak detection. Figure 3 also shows some key nodes and pipes that are analyzed in the study.
The pressure-driven demand model, which relates the water discharge to the water pressure head at the node (i.e., Eq. 4), is used in the hydraulic analysis of the WSN. The lower bound of the pressure head at the node is set to 5 m and the upper bound to 30 m. An example of the relationship between demand/discharge and pressure head at a node with a base demand of 0.02 m3/s is shown in Fig. 4.
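As a rough illustration of this relationship, the sketch below assumes that Eq. 4 takes the commonly used Wagner form: no delivery below the lower pressure bound, full base demand above the upper bound, and a square-root transition in between. The function name `pdd_demand` and its argument names are hypothetical; only the bounds (5 m, 30 m) and the base demand (0.02 m3/s) come from the text.

```python
import math

def pdd_demand(pressure_head, base_demand=0.02, p_min=5.0, p_full=30.0):
    """Delivered demand (m3/s) under an ASSUMED Wagner-type pressure-driven model."""
    if pressure_head <= p_min:
        return 0.0                      # no delivery below the lower bound
    if pressure_head >= p_full:
        return base_demand              # full demand above the upper bound
    # square-root transition between the two bounds
    return base_demand * math.sqrt((pressure_head - p_min) / (p_full - p_min))

for head in (0.0, 5.0, 11.25, 17.5, 30.0, 40.0):
    print(f"head = {head:5.2f} m -> demand = {pdd_demand(head):.4f} m3/s")
```

If Eq. 4 uses a different exponent or transition shape, only the last line of the function would need to change.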
The overall procedure for generating a balanced dataset with the hydraulic model of the WSN is briefly summarized below. A similar amount of data is produced for the monitoring area in the WSN under both leaking and non-leaking conditions.
Dataset for non-leaking conditions
1. Define Water Pipe Network: Construct the pipe network according to the Rancho Solano Water Network (Fig. 3) using the EPANET data input format.
2. Assign Water Pipe Conditions: Assign each pipe a roughness drawn at random from the uniform distribution U(100, 300). The length of each pipe is already defined in the original water pipe network.
3. Set the Baseline Demand and Actual Demand at User Nodes: The baseline demand of each node is randomly selected from a uniform distribution U(0.008, 0.012) L/s, following Funk et al. [15]. The actual demand at each node is set by considering both the base demand and the demand uncertainty, i.e.,
$$ \mathrm{Demand}=\mathrm{demand}\ \mathrm{ratio}\times {D}_{\mathrm{base}}+\left|N\left(0,\sigma \right)\right| $$
(11)
where \( {D}_{\mathrm{base}} \) is the predefined base demand. The demand ratio is set from 0.3 to 1.3 to account for the fluctuation in water usage demand during a day or between different days. The Gaussian noise N(0, σ) accounts for the uncertainty due to water usage fluctuation.
4. Data Generation: Solve the hydraulic model of the WSN with WNTR using its built-in EPANET module and record the water pressure at the selected monitoring nodes (i.e., the nodes inside the red circle in Fig. 3).
5. Data Generation for Different Water Demand Situations: Steps 2 to 4 are repeated for each water demand scenario. For each scenario, 200 rounds of simulation were conducted to generate a sufficient amount of data under different water demand situations.
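Steps 2 and 3 above can be sketched with the Python standard library alone. This is a minimal illustration, not the published generation code: the function name `sample_scenario` is hypothetical, σ is assumed to be a standard deviation in L/s, the node count passed in the usage lines is only illustrative, and the hydraulic solve of step 4 (done with WNTR in the study) is deliberately left out.

```python
import random

def sample_scenario(n_nodes, n_pipes, demand_ratio, sigma=0.01, rng=None):
    """Draw one random non-leaking scenario: pipe roughness and nodal demands."""
    rng = rng or random.Random()
    # Step 2: pipe roughness from U(100, 300)
    roughness = [rng.uniform(100.0, 300.0) for _ in range(n_pipes)]
    # Step 3: base demand from U(0.008, 0.012) L/s
    base = [rng.uniform(0.008, 0.012) for _ in range(n_nodes)]
    # Eq. 11: demand = demand ratio * D_base + |N(0, sigma)|
    demand = [demand_ratio * d + abs(rng.gauss(0.0, sigma)) for d in base]
    return roughness, base, demand

rng = random.Random(0)  # fixed seed only to make the sketch repeatable
roughness, base, demand = sample_scenario(111, 126, demand_ratio=0.3, rng=rng)
print(len(roughness), len(demand), min(demand) >= 0.0)
```

Because the noise term enters through its absolute value, the sampled demand never falls below the scaled base demand.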
Dataset for leaking conditions
1. Procedures similar to those for the non-leaking conditions are followed to build the water network and to assign pipe roughness and water use demands (Steps 1–3 for non-leaking conditions).
2. Leak Scenario: Set pipe i as the leaking pipe. By default, the leak is located at the middle of the pipe; this can easily be changed for more complex scenarios.
3. Data Generation: Solve the hydraulic model with WNTR using its built-in EPANET module and record the water pressure at the selected monitoring nodes (inside the red circle in Fig. 3).
4. Data Generation for Different Water Demand Situations: Repeat steps 2–4 200 times for each pipe at each demand level, as is done for the non-leaking conditions.
5. Repeat the above steps for each pipe leaking scenario.
The water pressure data under the different scenarios were generated via the processes described above. The non-leaking situation and the leaking situation at each pipe each contain 2200 cases (11 water demand levels with 200 rounds of simulation each). Note that the model-generated data can easily be replaced with real-world data when measurement data are available.
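The case count can be checked with a few lines of Python. The sketch assumes the eleven demand ratios are evenly spaced at 0.1 between 0.3 and 1.3, which the text implies but does not state explicitly.

```python
# Assumed spacing: 0.3, 0.4, ..., 1.3 (eleven demand ratios)
demand_ratios = [i / 10 for i in range(3, 14)]
rounds_per_ratio = 200

# One (demand ratio, round index) pair per simulated case
cases = [(ratio, r) for ratio in demand_ratios for r in range(rounds_per_ratio)]
print(len(demand_ratios), len(cases))
```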
The code for data generation is published as open source (see Footnote 1). Overall, the water pressure is affected by the average demand at the node, by fluctuations in water demand, and by whether a leak occurs. To illustrate the characteristics of the water pressure data, the water pressure at node ‘168’ under a few demand ratios is shown in Fig. 5. As can be seen, the water pressure is strongly influenced by the demand level. There are significant differences in the water pressure between demand ratios of 0.3 and 1.2, with a higher water pressure corresponding to a lower water demand. The water pressure is also affected by leaks. For example, Fig. 6 shows that the water pressure at node ‘168’ is similar for an intact pipe at a high average water demand ratio of 1.0 and for a leaking pipe (pipe 198) at a low average water demand ratio of 0.3. This overlap between the influence of water demand and that of leaks makes it difficult to detect a leak from the data of a single node. A machine learning model, however, can extract features from the spatial pattern in the pressure data at multiple nodes and can therefore differentiate leaking from non-leaking conditions.
Artificial neural network (ANN) model for leak detection
An ANN model is developed to detect leaks. The water pressure data at a group of monitoring nodes are used for this purpose. Unlike the existing approach of using time series analyses, the ANN model uses the water pressure data to find the spatial relationship among the monitoring nodes at a given time. The ANN model was built and trained with TensorFlow in a Python environment. The ANN architecture for this study was determined by an optimization process, which led to a model with one input layer, three hidden layers, and one output layer. The input layer contains 11 neurons corresponding to the 11 monitored nodes. The first hidden layer contains 128 neurons, and the remaining two hidden layers contain 258 neurons each. The output layer contains 1 neuron, whose categorical output indicates a leaking or non-leaking condition.
Overall, the ANN model is used to classify leaking and non-leaking conditions of the monitoring area as a binary classification problem. As a supervised learning model, the ANN requires the dataset to be labeled prior to training. A dataset containing 2400 samples is prepared for the ANN training process, which includes 1200 non-leaking samples and 1200 leaking samples. The non-leaking samples are randomly selected from the simulated non-leaking dataset and labeled as 0. The leaking samples are randomly selected from the leaking dataset that contains simulated data for leaks inside the monitoring area and are labeled as 1. The leaking and non-leaking datasets are generated following the procedures described previously. The dataset is standardized to reduce the computing time and avoid potential overfitting: each row of the dataset is rescaled to zero mean and unit variance. The benefits of data standardization are described in [40]. The 2400 labeled samples are then randomly split into independent training and testing data with a ratio of 7:3 (i.e., 1680 training samples and 720 testing samples). The training dataset is used to train the ANN model, and the independent testing data are used to validate the model results. The loss values of the training and validation processes are shown in Fig. 7. The loss value is the mean square error between the predicted and actual results. As can be seen, the loss values of both the training dataset and the validation dataset decrease to small values during the learning process, which means the ANN model is able to uncover the relationship among the data that separates leak from non-leak conditions.
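The standardization and the 7:3 split can be sketched as follows. This is a minimal illustration on synthetic placeholder pressures, not the study's pipeline: the helper names `standardize_row` and `split_70_30` are hypothetical, and the row-wise scaling follows the description in the text.

```python
import random
import statistics

def standardize_row(row):
    """Rescale one sample to zero mean and unit (population) variance."""
    mean = statistics.fmean(row)
    std = statistics.pstdev(row)
    return [(x - mean) / std for x in row]

def split_70_30(samples, labels, seed=0):
    """Randomly split samples and labels into 70% training and 30% testing."""
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    cut = len(idx) * 7 // 10  # integer arithmetic avoids rounding surprises
    train, test = idx[:cut], idx[cut:]
    return (([samples[i] for i in train], [labels[i] for i in train]),
            ([samples[i] for i in test], [labels[i] for i in test]))

# Synthetic placeholder data: 2400 samples of 11 nodal pressures (values are made up)
rng = random.Random(0)
samples = [[rng.uniform(20.0, 60.0) for _ in range(11)] for _ in range(2400)]
labels = [0] * 1200 + [1] * 1200  # 0 = non-leaking, 1 = leaking

scaled = [standardize_row(s) for s in samples]
(train_x, train_y), (test_x, test_y) = split_70_30(scaled, labels)
print(len(train_x), len(test_x))
```

Because the split is random, the class counts in the test set need not be exactly equal, which is consistent with the 370/350 split reported for the testing data.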
The final classification result of the trained ANN model on the testing data is shown as a confusion matrix in Fig. 8. There were 370 non-leaking cases and 350 leaking cases in the testing dataset, and all of them were classified correctly (100% accuracy). The results imply that a) the relationship of water pressure among a group of nodes differs between leaking and non-leaking scenarios; and b) the ANN model learned to extract this relationship and to accurately classify leaking versus non-leaking conditions.
Autoencoder neural (AE) network model for leak detection
The ANN model achieved excellent performance by utilizing the water pressure data at multiple nodes. However, as a supervised ML model, the ANN requires balanced data, i.e., a similar amount of data under both normal and leaking conditions. In reality, the available data are typically unbalanced, i.e., there may be only a limited amount of data under leaking conditions compared with data under non-leaking conditions. In addition, labeling the dataset as leak or non-leak conditions, as in the method used by [50], may be extremely difficult in real situations, since leaks might not be detected until their effects surface.
A variation of the ANN model, the autoencoder (AE) neural network, is developed for leak detection to resolve the challenge of unbalanced data. As an unsupervised ML model, the AE model is well suited to working with unbalanced data. In this study, an AE model with 5 layers was built. As shown in Fig. 2, the first and last layers each contain 11 neurons, corresponding to the 11 monitored nodes. The second and third layers encode the input data from the 11 nodes into a lower-dimensional space, while the fourth and fifth layers decode the data from this lower-dimensional space back to the 11 nodes. The hidden layer of the AE model contains 3 neurons. Since the AE model can detect abnormal samples after being trained on normal samples, only non-leaking samples are used for training. A total of 1200 samples are randomly selected from the non-leaking dataset to train the AE model. Each sample includes the water pressure at the 11 selected monitoring nodes. Of these 1200 non-leaking samples, 70% are standardized and used for training, and the remaining 30% are used for validation.
Figure 9 shows the loss values, defined as the reconstruction error of the AE model, during the training process. Loss values close to zero are achieved for both the training and testing data, meaning that the reconstructed output of the AE model is close to its input for data under normal non-leaking conditions. This also implies that the AE model is well trained on the normal non-leaking dataset. When leaking data are input to the trained AE model, the model generates a large reconstruction error, which can be used to detect leaking conditions.
Independent datasets are used to evaluate the performance of the trained AE model in detecting leaking conditions. The datasets cover three scenarios: 550 cases under non-leaking conditions, i.e., there are no leaks anywhere in the water distribution system; 550 cases in which a leak occurs at a randomly chosen pipe inside the monitoring area (pipe ‘163’); and 550 cases in which a leak occurs at a randomly chosen pipe outside the monitoring area (pipe ‘198’). The locations of the example pipes can be found in Fig. 3. All data samples are normalized based on the mean and variance of the non-leaking dataset (by subtracting the mean value and dividing by the variance value).
Figure 10 shows the histogram of the reconstruction errors of the AE model for data under the three scenarios. The rectangles with different colored lines indicate the 97.5% range of the reconstruction error for the normal situation and the two leaking situations. As can be seen in Fig. 10, the reconstruction error under the normal non-leaking situation is small, with 97.5% of the reconstruction errors below 0.00015, which is much smaller than under the other two situations. The reconstruction errors are largest when the leak occurs inside the monitoring area, while those for leaks outside the monitoring area lie in between. Overall, data corresponding to a pipe leaking within the monitoring area lead to large reconstruction errors in the trained AE model, and these errors are clearly distinguishable from those of the normal non-leaking cases. This can be used to define the threshold for leak detection. Leaks occurring outside the monitoring area, however, still have a low probability of being identified as a leaking situation. Since the sensors monitor a specific area, the ideal outcome is that only leaks inside or very close to this area are detected, and leaks far away from the monitoring area do not trigger the detection. A method to mitigate detection errors caused by leaks outside the monitoring area is discussed in later sections.
Based on this observation, a threshold on the reconstruction error of the trained AE neural network can be defined to differentiate leak from non-leak situations. This threshold can be set based on the training process and further tuned when data under leaking conditions become available. A threshold of 0.000402 is set for this case based on the reconstruction error at the end of the AE neural network training. The monitored water pressure data are fed into the trained AE model to obtain the reconstruction error. If the reconstruction error is larger than the threshold, a leak alert is triggered to prompt actions such as inspection and replacement.
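The decision rule can be sketched in a few lines. The threshold value is the one reported in the text; the use of mean squared error as the reconstruction error and the names `reconstruction_error` and `leak_alert` are assumptions made for illustration, and the reconstruction `x_hat` stands in for the output of the trained AE model.

```python
THRESHOLD = 0.000402  # value reported in the text

def reconstruction_error(x, x_hat):
    """Mean squared error between a sample and its reconstruction (ASSUMED metric)."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

def leak_alert(x, x_hat, threshold=THRESHOLD):
    """Return True when the reconstruction error exceeds the threshold."""
    return reconstruction_error(x, x_hat) > threshold

# Made-up standardized pressures at 11 nodes and two hypothetical reconstructions
x = [0.0] * 11
print(leak_alert(x, [0.01] * 11))  # error 1e-4, below the threshold
print(leak_alert(x, [0.03] * 11))  # error 9e-4, above the threshold
```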
The performance of the AE model is evaluated for leaks occurring at each pipe. For each pipe, independent data of 2200 non-leaking cases and 2200 leaking cases are generated. The water pressure data inside the monitoring area are then fed into the trained AE model. Fig. 11 summarizes the probability that a leak alert is triggered, i.e., the percentage of cases with a reconstruction error larger than the threshold, for each pipe of the WSN. As can be seen from Fig. 11 a), the probability of triggering an alert under the non-leaking situation (i.e., a false alert) is very low, about 3% at most. Fig. 11 b) shows the probability that a leak alert is triggered when a leak occurs at each pipe. For leaks inside the monitoring area, the alert is triggered with a probability of 68% to 100%. For leaks outside the monitoring area, the chance of triggering the alert is lower (less than 40% for most pipes). These observations imply that the AE model can detect leaks from the monitored water pressure data. For network-wide monitoring, the sensors need to be strategically deployed in the WSN to achieve high reliability in leak detection.
Sensitivity study on factors affecting the accuracy of the AE model in leak detection
A sensitivity study is conducted to evaluate the effects of contributing factors on the performance of the AE model for leak detection. Three independent factors are considered: the compression ratio of the AE, the size of the leak, and the fluctuation/uncertainty of water demand.
The compression ratio is the size of the uncompressed data divided by the size of the compressed data, as calculated in Eq. 12. It is an important hyperparameter of the AE neural network. A large compression ratio not only saves physical data storage space but also forces the AE model to learn the internal pattern of the input data. However, too much compression may lead to excessive information loss and decrease the detection accuracy. The compression ratio is varied from 1 to 6 in the sensitivity study.
$$ \mathrm{Compression}\ \mathrm{Ratio}=\frac{\mathrm{Uncompressed}\ \mathrm{Size}}{\mathrm{Compressed}\ \mathrm{Size}} $$
(12)
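As a small worked example of Eq. 12, the sketch below assumes that the uncompressed size is the number of input neurons (11) and the compressed size is the number of neurons in the bottleneck layer. This interpretation and the function name `compression_ratio` are assumptions; the bottleneck sizes listed are illustrative only.

```python
def compression_ratio(input_size, bottleneck_size):
    """Eq. 12 with sizes counted in neurons (assumed interpretation)."""
    return input_size / bottleneck_size

for k in (11, 8, 6, 4, 3, 2):
    print(f"bottleneck = {k:2d} -> compression ratio = {compression_ratio(11, k):.2f}")
```

Under this reading, the 3-neuron hidden layer described earlier corresponds to a compression ratio of about 3.67.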
The uncertainty of water usage describes the fluctuation of water demand during a day. A higher fluctuation of water demand increases the difficulty of leak detection, since water demand and leaks both affect the water pressure in the WSN. To describe its sensitivity, the uncertainties of water usage are assumed to follow a normal distribution and are described with different water usage uncertainty levels, i.e., N(0, 0.001) L/s, N(0, 0.005) L/s, N(0, 0.01) L/s, and N(0, 0.05) L/s.
The leak size is another important factor that influences the performance of the detection system. Conceptually, a small leak is much more difficult to detect than a large leak, since a smaller leak has less influence on the state of the WSN and can be masked by water demand fluctuations. For the sensitivity study, the leak size is varied from 0.01 m to 0.12 m.
For each combination of these three factors, the performance is evaluated with a dataset generated by assuming a leak in a pipe (pipe ‘163’) inside the monitoring area and a dataset under non-leaking conditions. The data are randomly split for independent validation. The final accuracy is calculated as the average accuracy over 3 rounds of cross-validation. Fig. 12 shows how the leak detection accuracy of the AE models is affected by the compression ratio, water usage uncertainty, and leak size.
As shown in Fig. 12, the AE model achieved close to 100% accuracy when the water usage uncertainty is small. At a given leak size, the accuracy of leak detection by the AE model decreases with increasing water usage uncertainty. As the water usage uncertainty level increases from 0.001 L/s to 0.015 L/s, the accuracy of the model decreases from 100% to 89.93%. However, even with a high variance in water usage (compared to the baseline water usage of 0.012 L/s), the AE model achieved decent accuracy in leak detection.
The performance of the AE model is significantly influenced by the leak size. Small leaks tend not to be detected, so the WSN is classified as being under normal non-leaking conditions (i.e., 0% correct detection of leaking cases), while normal non-leaking cases are all classified correctly (i.e., 100% correct detection). This gives an accuracy of around 50% for a balanced dataset with an equal number of samples under leaking and non-leaking conditions. With increasing leak size, the AE model achieved higher leak detection accuracy. This is reasonable, since a larger leak disturbs the pressure distribution in the WSN more strongly and is therefore easier to detect. A similar conclusion was reported by Zhou et al. [50].
The compression ratio has a negative influence on the overall detection accuracy. For example, as shown in Fig. 12 a), at a leak size of 0.06 m, the accuracy decreases from 85.24% to 67.02% when the compression ratio increases from 1 to 6. For a leak size of 0.11 m, the accuracy decreases from 100% to 80.75% when the compression ratio increases from 1 to 6. This is reasonable, since a higher compression ratio loses more information from the original dataset. However, the influence of the compression ratio is small for compression ratios below 2, and a compression ratio of around 1.5 appeared to achieve the best results. Compared with the other two factors (leak size and water demand uncertainty), the compression ratio has a relatively small impact on the detection accuracy. Nevertheless, for a given leak size and water demand uncertainty, fine-tuning the compression ratio of the AE model helps to achieve a higher detection accuracy.
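The figure of around 50% follows from simple arithmetic. Assuming a balanced test set with N leaking and N non-leaking samples (N is introduced here only for illustration), with no small leak flagged (TP = 0) and every non-leaking sample classified correctly (TN = N):

$$ \mathrm{Accuracy}=\frac{\mathrm{TP}+\mathrm{TN}}{2N}=\frac{0+N}{2N}=0.5 $$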
Improving classification accuracy by incorporating multiple independent detection
The previous results show that, with a properly set detection threshold, the AE model achieved good leak detection accuracy using unbalanced data. This is a major advantage over the conventional ANN model, which requires balanced data under both non-leaking and leaking conditions. However, the AE model did not achieve as high an accuracy as the ANN model. It is desirable to further improve the accuracy of the AE model in order to reduce the number of false detections (i.e., false leak alarms or missed leak events). A method is proposed to further increase the leak detection accuracy by utilizing the probability theory for multiple independent trials. Intuitively, since a leak in the physical world lasts for a while before it is repaired, the chance of detecting it is higher if detection is attempted multiple times. The leak status is determined by a voting strategy. In other words, for n attempts at leak detection, the detection outcome is defined as the outcome given by the majority (more than 50%) of these attempts. Since each detection attempt uses an independent dataset, each represents an independent trial. Mathematically, if the probability of correctly detecting a leak in a single attempt is p, then the probability that more than half of the attempts correctly detect the leak is
$$ P=\sum \limits_{i=\operatorname{int}\left(\frac{n}{2}\right)+1}^n{C}_n^i{p}^i{\left(1-p\right)}^{n-i} $$
(13)
where P is the probability that more than half of the attempts correctly detect the leak, n is the total number of detection attempts, \( {C}_n^i \) is the number of combinations of i items chosen from n, and p is the probability of correct detection in a single attempt.
According to Eq. (13), the probability of correct detection approaches 1 as n → ∞, provided that p is larger than 0.5.
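This limiting behavior can be checked numerically. The sketch below computes the probability that a strict majority of n independent attempts is correct, using the binomial sum; the function name `majority_correct_probability` and the single-attempt probability of 0.7 are illustrative choices, not values from the study.

```python
from math import comb

def majority_correct_probability(p, n):
    """Probability that more than half of n independent attempts are correct."""
    k_min = n // 2 + 1  # smallest count that is a strict majority
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k_min, n + 1))

for n in (1, 11, 51, 101):
    print(n, round(majority_correct_probability(0.7, n), 4))
```

With p above 0.5 the majority probability grows toward 1 as n increases, whereas with p below 0.5 it shrinks toward 0, which is why the single-attempt accuracy must exceed 0.5 for the voting strategy to help.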
Following the principle described by Eq. 13, multiple attempts were made for leak detection, i.e., multiple datasets under a given leaking or non-leaking condition are fed into the AE leak detection model. The final designation of a leaking or non-leaking condition is the result given by more than half of the detection attempts.
To evaluate the performance of multiple attempts, three scenarios are considered: a non-leaking situation, a leak in pipe 163 located inside the monitoring area, and a leak in pipe 214 located outside the monitoring area. A total of 2200 sets of pressure data are generated under each scenario. For each scenario, the accuracy with n attempts is calculated by the following procedure (also illustrated in Fig. 13):
1) n sets of data are randomly selected from the 2200 cases.
2) Each dataset is fed into the AE model to generate an output of Leak or Non-leak condition based on the set threshold.
3) The final designation of Leaking versus Non-leaking condition is the condition given by more than half of the n attempts.
4) Whether the detection is correct or wrong is determined by comparing the condition detected in 3) with the actual condition of the pipe.
Steps 1) to 4) are repeated 1000 times. From this, the overall accuracy in correctly detecting the pipe condition is calculated.
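The procedure above can be sketched as a small Monte Carlo routine. This is an illustration under stated assumptions rather than the study's code: the function name `multi_attempt_accuracy` is hypothetical, sampling is assumed to be without replacement, and the reconstruction errors are synthetic placeholders constructed so that 70% of the leaking cases exceed the threshold.

```python
import random

def multi_attempt_accuracy(errors, is_leak, n_attempts, threshold=0.000402,
                           n_repeats=1000, seed=0):
    """Estimate the accuracy of majority voting over n_attempts detections."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_repeats):
        picked = rng.sample(errors, n_attempts)          # 1) draw n cases
        votes = sum(e > threshold for e in picked)       # 2) per-case leak decision
        detected_leak = votes > n_attempts / 2           # 3) majority vote
        correct += (detected_leak == is_leak)            # 4) compare with truth
    return correct / n_repeats

# Synthetic reconstruction errors for 2200 leaking cases (made-up values)
leak_errors = [0.0006] * 1540 + [0.0002] * 660
acc_1 = multi_attempt_accuracy(leak_errors, is_leak=True, n_attempts=1)
acc_51 = multi_attempt_accuracy(leak_errors, is_leak=True, n_attempts=51)
print(acc_1, acc_51)
```

With a single attempt the estimated accuracy stays near the single-attempt detection rate, while with 51 attempts the majority vote is almost always correct.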
Effects of multiple attempts
The detection accuracy under multiple attempts using the pre-trained AE model is shown in Fig. 14. The detection threshold is set to 0.000402. The vertical axis is the detection accuracy of the pipe condition, and the horizontal axis indicates the number of detection attempts. As seen from this figure, the detection accuracy improved with multiple attempts and reached close to 100%, regardless of where the leak occurs. This is consistent with what is predicted by Eq. 13.
Effects of detection threshold by the AE model
The threshold is a critical parameter for the AE leak detection model. Sensitivity analyses are conducted on the influence of the threshold on the detection accuracy of the model. The results for the three scenarios defined previously are shown in Fig. 15. A total of 2200 cases were generated for each scenario and fed into the AE model to determine the reconstruction errors. From this, the percentage of cases in which the reconstruction error of the AE model is larger than the threshold is determined. The horizontal axis shows the threshold, and the vertical axis shows the percentage of cases with a reconstruction error larger than the threshold (i.e., cases identified as leaking by the AE model). The 50% line is also indicated in the figure. As can be seen from the figure, for all three pipe condition scenarios, a smaller threshold corresponds to a larger chance of the condition being identified as leaking. For the non-leaking condition, this amounts to a false alarm. A larger threshold reduces false alarms but may miss leaking cases. According to Eq. 13, leaks can be properly identified with multiple attempts as long as the single-attempt detection accuracy is larger than 0.5. Based on this criterion, a leak within the monitoring area can be accurately detected with any threshold between 0.00029 and 0.00057 by use of the multiple attempts strategy.
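The selection of a usable threshold range can be sketched as a simple sweep. The helper names `alert_fraction` and `usable_thresholds` are hypothetical, and the reconstruction errors and candidate thresholds below are tiny made-up lists chosen only to make the logic easy to follow.

```python
def alert_fraction(errors, threshold):
    """Fraction of cases whose reconstruction error exceeds the threshold."""
    return sum(e > threshold for e in errors) / len(errors)

def usable_thresholds(normal_errors, leak_errors, candidates):
    """Thresholds where non-leak alerts stay below 50% and leak alerts exceed 50%."""
    return [t for t in candidates
            if alert_fraction(normal_errors, t) < 0.5 < alert_fraction(leak_errors, t)]

# Made-up reconstruction errors for illustration only
normal_errors = [0.0001, 0.0002, 0.0003]
leak_errors = [0.0005, 0.0006, 0.0007]
print(usable_thresholds(normal_errors, leak_errors, [0.0001, 0.0004, 0.0008]))
```

A threshold passes the check only when both conditions hold, so majority voting over repeated attempts would push the non-leaking case toward "no alert" and the leaking case toward "alert".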
The final detection result obtained with the proposed post-processing method using data from 50 time steps is shown in Fig. 16. The result clearly demonstrates that when a leak occurs inside or near the monitoring area, the AE model is able to detect it correctly. For leaks far away from the monitoring area, the model can differentiate them from leaks inside the area, which mitigates false alerts. Two pipe leaking situations inside the monitoring area were not detected. The main reason is inappropriate threshold selection, since the same threshold is used for all pipes. However, as more data become available during the operation stage, this threshold can be tuned for each pipe, which will eventually increase the detection ability.
Case study – II: C-town water distribution network
The C-Town water distribution network is a virtual network that was used for the calibration competition in the Battle of the Water Calibration Networks (BWCN) [33]. The topology and mode of operation of the network are described in detail, and the true network data were made public after the competition. This well-characterized water distribution network makes it possible to test the proposed leak detection method under a more complex scenario. From this, the performance of the proposed autoencoder-based leak detection model is evaluated.
The topology of the C-Town water distribution network is shown in Fig. 17. There are 1 reservoir and 7 water supply tanks. The network includes 388 user nodes, 432 pipes, 11 pumps, and 4 valves, which are divided into 5 district meter areas (DMAs). The water demand at each node is provided. In this study, 4 predefined monitoring areas are chosen, as shown in Fig. 17.
The hydraulic model of the C-Town WSN is built with WNTR. Since the water demand at each node is already defined, the actual water demands are used in the hydraulic model rather than the water demand ratio and uncertainty. The WDN under the non-leaking situation and under pipe failure (leak size of 0.05 m) is simulated. The leaking and non-leaking datasets for the C-Town WSN are generated following the same data generation procedures as described in Case study I (Rancho Solano Zone III water distribution system).
Figure 18 shows the water pressure at node ‘J9’ with and without a leak. The leaking situation corresponds to a leak at pipe ‘P251’. The exact positions of the node and the pipe can be found in Fig. 17.
Performance of AE model in leak detection
The AE water pipe leak detector for each area is trained using samples from the non-leaking dataset. The performance of the AE leak detection model is evaluated by calculating the probability that the AE model triggers an alarm when a leak occurs at each pipe, using the sensors installed in the different DMAs as shown in Fig. 17. For each pipe in the network, 169 sets of leaking and non-leaking water pressure data at nodes within the monitored area are generated. The probability of successfully detecting a leak at each pipe, obtained by feeding the water pressure data into the AE model, is shown in Fig. 19.
A few observations can be made from Fig. 19: 1) The probability of successfully detecting a leak in a pipe is affected by the location of the pipe and the distribution of the monitoring sensors. 2) The AE leak detection model achieved a high detection accuracy for leaks in pipes inside the monitoring area. 3) The probability of detecting a leak in pipes outside the monitoring area is typically smaller and is affected by the topological structure of the WSN and the settings of the AE model.
The proposed multiple detection attempts strategy is used to further improve the leak detection accuracy and mitigate false alarms. One hundred independent attempts are used. A leak alert is triggered if more than 50 attempts indicate a leak (i.e., the reconstruction error of the AE model is larger than the threshold). Fig. 20 shows the updated probability of detecting leaks in the pipes of the WSN using this strategy. The results indicate that leaks in pipes located in the monitoring area are all detected with 100% accuracy. Compared with the results shown in Fig. 18, false alarms are significantly mitigated. Meanwhile, leaks outside the monitoring area are not detected, except in Fig. 19 b). This is due to a nearby conditional valve, because of which leaks in these pipes cause a larger disturbance to the WSN. The implication is that sensors need to be deployed strategically to ensure full coverage of the complete WSN.