Classification of pavement climatic regions through unsupervised and supervised machine learnings

This study extracted 16 climatic variables, including annual temperature, freeze-thaw, precipitation, and snowfall conditions, from the Long-term Pavement Performance (LTPP) program database to evaluate climatic regionalization for pavement infrastructure. The effect and significance of climate change were first evaluated using time as the only predictor and t-tests. Both temperature and humidity were found to increase in most States. Around one third of the 800 weather stations recorded variation in their freeze and precipitation classifications, and a few of them showed a significant change of classification over time according to logistic regression analyses. Three unsupervised machine learning methods, Principal Component Analysis (PCA), factor analysis and cluster analysis, were used to identify the main components and common factors of the climatic variables and then to classify the data into groups. Two supervised machine learning methods, Fisher's discriminant analysis and Artificial Neural Networks (ANN), were then adopted to predict the climatic regions from climatic data. Results of PCA and factor analysis show that temperature and humidity are the first two principal components and common factors, accounting for 71.6% of the variance. The 4-means clusters are wet no freeze, dry no freeze, dry freeze and snow freeze. The best k-means clustering suggested 9 clusters, with more temperature clusters. Both Fisher's linear discriminant analysis and ANN can effectively predict climatic regions from multiple climatic variables. ANN performs better, with higher R square and lower misclassification rate, especially for networks with more layers and nodes.


Background
Climatic factors such as temperature and moisture have a significant influence on the deterioration of both pavement structural capacity and pavement materials [1], and are key factors for pavement preservation and resilience analysis. Many countries have developed their own climatic region classifications for determining asphalt binder grade, including the USA [2], China [3], Jordan [4], Italy [5], Thailand [6], Iran [7] and Yemen [8]. Asphalt binder grades are not the only item selected on the basis of climatic regions; climatic factors are also critical for both flexible and rigid pavement design [9]. Yang et al. conducted a sensitivity analysis of the influence of climatic inputs on pavement distress development using the Mechanistic-Empirical Pavement Design Guide (MEPDG) software [10]. The MEPDG includes the Enhanced Integrated Climate Model (EICM), which uses historical hourly data from around 800 weather stations to model future climatic conditions for pavement performance prediction [11]. Basma et al. found that the structural number of a flexible pavement needs to be adjusted to offset a reduction in the subgrade resilient modulus caused by an increase in moisture content [12]. For pavement maintenance and preservation, Wang et al. found, based on the LTPP SPS-3 data, that the same pavement preservation treatments performed significantly differently in different climatic regions [13]. Different pavement preservation strategies or techniques should therefore be considered for different climatic regions.

Development of climatic regions
Climatic regions are usually determined based on the maximum and minimum temperature and rainfall. Geographic and environmental factors may also be considered. In the Long-term Pavement Performance (LTPP) program, the annual total precipitation and the freezing index [14][15][16] are used to divide the United States into four climatic regions: dry no freeze, dry freeze, wet no freeze and wet freeze. However, several challenges remain for the climatic regionalization of pavement infrastructure. Firstly, the number of climatic regions may need to be increased to better quantify the effects of climatic factors. For example, Wang et al. used 6 climatic regions by adding dry mild and wet mild to the original four climatic regions [13]. Bandara et al. classified the I-94 corridor in Michigan into four climatic regions [11]. Many States have identified microclimates by considering different geographical or environmental conditions, adding more weather stations, or expanding the number of months of available data [11,17].
Secondly, climatic factors include temperature, humidity, rainfall, snowfall, etc., and how to balance and weigh all of these factors is important. Bandara et al. used the average low temperature in January, the average high temperature in July, and the average precipitation in January and in July to classify the I-94 corridor in Michigan into 4 climatic regions [11]. Wang et al. used the number of days below 0 °C, the number of wet days, and the freeze-thaw cycles to classify climatic regions [13]. In agricultural studies, the Köppen climate classification, which considers only rainfall and temperature, is usually used to determine climatic regions including tropical, arid, temperate, continental, and polar [18]. However, more detailed information is required to further improve the accuracy and effectiveness of climatic regionalization. For example, the potential water balance of the soil over the growing cycle, the heliothermal conditions over the growing cycle and the night temperature during maturation have been used to build a multiple-criteria climatic classification system for grape-growing regions [19].
Moreover, climate change poses threats to transportation infrastructure and may shift the climatic regions, especially for locations at the margins of regions. Mills et al. evaluated the impact of climate change on flexible pavement design and performance based on 17 weather sites and found that low temperature cracking will become less problematic while rutting may cause earlier rehabilitation and reconstruction in Southern Canada [20]. Gudipudi et al. used 19 climate models to project the future climate and analyzed the impact of climate change on pavement performance. It was found that the projected climate changes are likely to cause greater distresses and/or earlier failure of the pavement, including 2-9% more fatigue cracking and 9-40% more rutting at the end of 20 years [21,22]. A rational procedure is therefore necessary to adjust the current climatic regionalization based on observed or predicted climatic datasets.

Machine learning in climatic regionalization
Weather stations collect detailed long-term climatic data that can be used to improve the climatic regionalization for pavement infrastructure. There have been several studies on the climatic regionalization of pavements based on the data collected from weather stations. Recently, machine learning has been applied in many fields to identify relationships between variables and to provide highly accurate prediction or classification based on massive data samples, and it therefore has potential for pavement climatic regionalization analysis. Yang et al. [23] adopted Principal Component Analysis (PCA) to identify three major factors, namely temperature, precipitation, and radiation, for the climatic regionalization of pavements and then used k-means cluster analysis to classify pavement climatic regions. The probabilistic neural network and Support Vector Machine (SVM) were also used to predict pavement climatic regions, and fairly high accuracy was obtained. There are mainly three types of machine learning algorithms: unsupervised learning, supervised learning and reinforcement learning. Unsupervised learning identifies key components or factors and classifies unlabeled data based on their correlations, while supervised learning classifies data by minimizing the misclassification or error of a model trained on labeled data. Therefore, unsupervised learning can be used to find the optimal classification while supervised learning can be used to predict the classification. Reinforcement learning determines the actions to take in an environment so as to maximize the cumulative reward and is usually used for optimization in operations research.
In climatic regionalization studies, k-means and hierarchical clustering have been used to redefine the climate zones of Turkey based on temperature and precipitation data collected from 113 climate stations [24], to determine the climate regions in Argentina [25], to determine rainfall regions in India [26], to identify regional climate change patterns [27], and to divide the European domain into regions of similar projected climate change using predicted total temperature and precipitation [28]. PCA is usually used to identify key climatic components that can serve as criteria to determine climatic regions [29,30]. One study reported that the annual variation in mean and minimum temperature, the annual maximum temperature, and the spring, summer, and fall precipitation are the five principal components for the climate regionalization of Puerto Rico [31]. Factor analysis can be used to identify the common factors of climatic data. The temperature, winter moisture and moisture factors, explaining 46%, 32% and 12% of the total variance, were identified by factor analysis for the climatic regionalization of Tibet, China [32]. ANN has been adopted to predict climatic regions in South America [33] and Puerto Rico [31]. A supervised classification with the Mahalanobis distance was used to classify climate regions in China based on data collected from 172 stations between 1984 and 2013 [34]. One study reported the delineation of high-resolution climate regions in the Korean peninsula using ANN, random forest, k-nearest neighbor, logistic regression, and SVM supervised learning methods [35]. Both unsupervised and supervised learning were adopted to delineate homogeneous climatic regions in Pakistan.

Objectives and scope
The LTPP recommends temperature and freeze data to classify climatic regions while neglecting other information such as rainy days, sub-zero days, etc. Further, long-term climatic data are needed to determine the climatic regionalization because temperature and freeze conditions vary strongly in areas near the borders of the climatic regions. As more detailed climatic data are collected and accumulated, it is worthwhile to include these detailed long-term data in the determination of climatic regions. The objectives of this study are to determine the main contributing factors of the climatic data collected from the LTPP weather stations, to classify climatic regions through unsupervised machine learning methods, and to predict climatic regions through supervised machine learning methods. The general trend of climate change over time was evaluated and its significance was tested through linear regression and parameter t-tests. The unsupervised learning methods include PCA, factor analysis and k-means cluster analysis. The supervised learning methods include Fisher's linear discriminant analysis and the Artificial Neural Network (ANN).

Data collection
Established in the 1980s, the LTPP has been collecting large quantities of pavement data from more than 2400 pavement sections in the USA and Canada. The LTPP collects climate data from the weather stations located near the test sections. Detailed hourly data are available, and daily, monthly and annual statistics such as the maximum, minimum and average are calculated and stored in the LTPP. In this study, 21,666 annual climate records from 1948 to 2012 were collected from 800 weather stations in 62 States in the US, Canada, etc. Table 1 summarizes the definitions and statistical descriptions of those data. Sixteen variables were collected, covering temperature, humidity, precipitation, snowfall and freezing conditions. In the LTPP program, the wet/dry threshold is an average annual precipitation of 508 mm and the freeze/no-freeze threshold is an average annual freezing index of 83.3 degree-Celsius days [36].
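The two LTPP thresholds quoted above can be applied to one annual record as in the following minimal sketch. The function name `ltpp_region` and the treatment of values exactly at a threshold (counted here as dry and no freeze) are assumptions made for illustration and are not taken from the LTPP documentation.

```python
# Thresholds quoted in the text; names and boundary handling are illustrative.
WET_DRY_THRESHOLD_MM = 508.0      # average annual precipitation, mm
FREEZE_THRESHOLD_C_DAYS = 83.3    # average annual freezing index, degree-Celsius days


def ltpp_region(annual_precip_mm: float, freezing_index_c_days: float) -> str:
    """Return one of the four LTPP-style labels for a single record.

    Assumption: a value strictly above a threshold counts as wet / freeze.
    """
    moisture = "wet" if annual_precip_mm > WET_DRY_THRESHOLD_MM else "dry"
    freeze = "freeze" if freezing_index_c_days > FREEZE_THRESHOLD_C_DAYS else "no freeze"
    return f"{moisture} {freeze}"


if __name__ == "__main__":
    # Hypothetical records: (precipitation in mm, freezing index in C-days).
    for record in [(900.0, 10.0), (300.0, 500.0)]:
        print(record, "->", ltpp_region(*record))
```

Applying such a function to every station-year gives the yearly freeze and precipitation classes whose variation over time is examined later.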

Climate change
Figure 1 shows the effects and significance of time on the climatic variables, using time as the only predictor in a simple linear regression. Positive and negative parameter estimates indicate increasing and decreasing trends, respectively. P-values lower than 0.05 are regarded as significant [37]. The horizontal axis is the proportion of States out of the total 62 States. It can be seen that the mean, maximum and minimum of the average temperature increase significantly in most States, while the freeze index and freeze-thaw cycles decrease significantly over the last 60 years, with p-values less than 0.05. The minimum temperature increases significantly with time in nearly 80% of the States. The warming trend is therefore significant, consistent with previous studies using historical data or climate models [21]. It is also noted that around 50% of the States show significantly increasing precipitation and humidity while only 10-20% of the States show a significantly decreasing trend. Rising temperatures intensify the Earth's water cycle and increase evaporation, causing increased precipitation and flooding in some areas close to the storm tracks. Satellite observations have found increased precipitation and total atmospheric water due to the increase in surface warming [38][39][40].
Figure 2 (a) shows the map of the climatic regions defined in the LTPP. A general classification of each State is recorded in the LTPP database. However, weather stations within one State may have different freeze or precipitation classifications, and even for the same weather station the freeze and precipitation classifications may change from year to year. The freeze and precipitation classifications of each weather station in each year were calculated from the threshold criteria, so that the variation of the classifications could be determined. Of the 800 weather stations, 523 (65%) maintain the same freeze index classification and 564 (70%) maintain the same precipitation classification. To investigate whether time is a significant factor for the probability that a weather station is freeze/no freeze or wet/dry, logistic regression with the freeze or precipitation classification as the target and time as the predictor was used to build a model for each weather station whose classification varied. Five hundred and thirteen logistic regression models were built, and the parameter estimate of time and its p-value were obtained for each. Among the 277 (35%) weather stations recording variations of the freeze classification, 196 show increasing temperature, 16 of them significantly, while 81 show decreasing temperature, none of them significantly. Among the 236 (30%) weather stations recording variations of the precipitation classification, 140 show increasing precipitation, 7 of them significantly, while 96 show decreasing precipitation, none of them significantly.
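The trend test described above can be sketched for a single variable as follows. This is an illustrative implementation under stated assumptions, not the authors' code: the function name `time_trend` and the example series are hypothetical, and the two-sided p-value would be read from a Student t distribution with n − 2 degrees of freedom (for instance with `scipy.stats.t.sf`, if SciPy is available).

```python
import math


def time_trend(years, values):
    """Ordinary least squares fit of values on years.

    Returns (slope, t_statistic, degrees_of_freedom) for the slope.
    """
    n = len(years)
    if n < 3:
        raise ValueError("need at least three observations")
    mean_x = sum(years) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares and standard error of the slope.
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(years, values))
    dof = n - 2
    std_err = math.sqrt(sse / dof / sxx)
    t_stat = slope / std_err if std_err > 0 else math.copysign(math.inf, slope)
    return slope, t_stat, dof


if __name__ == "__main__":
    # Hypothetical annual mean temperatures with a small upward drift.
    years = list(range(1950, 1960))
    temps = [10 + 0.02 * i + (0.01 if i % 2 == 0 else -0.01) for i in range(10)]
    slope, t_stat, dof = time_trend(years, temps)
    print(f"slope = {slope:.4f} per year, t = {t_stat:.2f}, dof = {dof}")
```

A positive slope with a p-value below 0.05 would be counted as a significant increasing trend for that State and variable.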
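A per-station logistic regression of the yearly class on time can be sketched as below. The fit uses Newton-Raphson iterations, and significance is judged with a Wald z statistic and a normal approximation. Both choices are assumptions for illustration, since the text does not state which estimation or test procedure was used, and the function name `logistic_time_trend` is hypothetical.

```python
import math


def logistic_time_trend(years, classes, n_iter=25, tol=1e-8):
    """Fit P(class = 1) = 1 / (1 + exp(-(b0 + b1 * (year - mean_year)))).

    `classes` holds 0/1 labels (e.g. 1 = no freeze) and must contain both
    values. Returns (b1, wald_z, two_sided_p).
    """
    mean_year = sum(years) / len(years)
    x = [yr - mean_year for yr in years]  # centring improves conditioning

    def hessian(b0, b1):
        p = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
        w = [pi * (1.0 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        return p, h00, h01, h11

    b0 = b1 = 0.0
    for _ in range(n_iter):
        p, h00, h01, h11 = hessian(b0, b1)
        g0 = sum(yi - pi for yi, pi in zip(classes, p))
        g1 = sum((yi - pi) * xi for yi, pi, xi in zip(classes, p, x))
        det = h00 * h11 - h01 * h01
        if det <= 0.0:
            raise ValueError("degenerate data: no variation in year or class")
        # Newton step: solve the 2x2 system H * step = gradient.
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + step0, b1 + step1
        if abs(step0) < tol and abs(step1) < tol:
            break

    _, h00, h01, h11 = hessian(b0, b1)
    std_err = math.sqrt(h00 / (h00 * h11 - h01 * h01))
    z = b1 / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided, normal approximation
    return b1, z, p_value


if __name__ == "__main__":
    # Hypothetical station that drifts from "freeze" (0) towards "no freeze" (1).
    years = list(range(2000, 2011))
    classes = [0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1]
    print(logistic_time_trend(years, classes))
```

Stations whose class never changes have no finite estimate for the time coefficient, which matches the text's restriction of the models to the stations that recorded variation.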

Methodology for classification
PCA
PCA converts a set of potentially correlated variables into a set of linearly uncorrelated variables. Each new variable is called a principal component and is a linear combination of the original variables. As shown in Eq. (1), the first principal component F_1 is the linear combination of x_1, x_2, …, x_p that has maximum variance among all linear combinations and therefore accounts for as much variation in the data as possible. The second principal component F_2 is the linear combination of x_1, x_2, …, x_p that accounts for as much of the remaining variation as possible, with the constraint that the correlation between F_1 and F_2 is 0. The third principal component F_3 is the linear combination of x_1, x_2, …, x_p that accounts for as much of the remaining variation as possible, with the constraint that the correlations between F_3 and both F_1 and F_2 are 0, and so on. a_ij is the loading coefficient of x_i on F_j, indicating the correlation between x_i and F_j. Either the covariance matrix or the correlation matrix of the variables can be used to calculate the components from their respective eigenvectors.
The first several principal components can explain most of the variation in the original dataset and can therefore be used in place of the original variables to reduce the dimensionality of the data. In pavement engineering, PCA has been used for this purpose. Ghasemi used 5 principal components, explaining 89.72% of the total variance of the original 17 asphalt mixture property variables, as the inputs of an ANN model predicting pavement permanent deformation [41]. Yao et al. reduced 21 traffic variables to 3 principal components for pavement performance prediction [42]. In this study, PCA based on the correlation matrix was first used to investigate the main components of the 16 climatic variables.
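Eq. (1) is not reproduced in this text. A standard form consistent with the definitions in the preceding paragraph, offered as a reconstruction under that assumption rather than as the original typesetting, is:

```latex
F_j = a_{1j} x_1 + a_{2j} x_2 + \cdots + a_{pj} x_p, \qquad j = 1, 2, \ldots, p
```

Here a_{ij} is the coefficient of variable x_i in component F_j, and the components are ordered so that Var(F_1) ≥ Var(F_2) ≥ … ≥ Var(F_p) with zero correlation between any two components.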

Factor analysis
Factor analysis has been widely used in psychology, sociology and economics to find a smaller number of unobserved factors that can explain the variability among correlated variables. As shown in Eq. (2), each variable is a linear combination of common factors and an error term. μ is the average or intercept. a_ij are the factor loadings, indicating the contribution of the common factors to the variance of the variable. f_1, f_2, …, f_m (m ≤ p) are uncorrelated common factors. Factor analysis can be performed based on the orthogonal rotation technique of PCA or on the maximum likelihood method.
Factor analysis can be used to identify the common factors and to quantify the relationship between observed variables and unobserved indicators. In pavement engineering, factor analysis has been adopted to evaluate the key factors of mixture properties and pavement performance. Tian et al. analyzed 27 properties of asphalt mixture and found three common factors: the permanent deformation factor, the shear resistance factor, and the moisture susceptibility factor [43]. Chen et al. used both single factor and multiple factor analysis to analyze the contributions of pavement performance measurements to the latent pavement performance factors, including the roughness factor, the early age cracking factor and the aged severe damage factor [44]. In this study, factor analysis based on the principal method was used to identify the major common factors of the 16 climatic variables.
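Eq. (2) is likewise not reproduced in this text. A standard common factor model that matches the symbols defined above, given here as an assumed reconstruction, is:

```latex
x_i = \mu_i + a_{i1} f_1 + a_{i2} f_2 + \cdots + a_{im} f_m + \varepsilon_i, \qquad i = 1, 2, \ldots, p
```

In this form μ_i is the mean of variable x_i, a_{ij} is the loading of variable i on common factor f_j, and ε_i is the error term specific to variable i.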

Cluster analysis
Cluster analysis is a widely used unsupervised machine learning method that classifies data samples or variables into groups based on their similarity. Distance metrics such as the Minkowski distance, Block distance and Euclidean distance are usually used to measure the similarity between samples. K-means clustering is the most frequently used clustering algorithm and classifies n samples into k clusters based on distance. As shown in Eq. (3), it uses k selected centroids as the starting points and then iteratively optimizes the positions of the centroids by minimizing the distances within each cluster. The Cubic Clustering Criterion (CCC) can be used to estimate the number of clusters in k-means based on minimizing the within-cluster sum of squares through Monte Carlo methods. A high CCC indicates good clustering. K-means clustering has been used for pavement performance evaluation and for processing automated pavement evaluation data. Wang et al. used normalized cuts clustering to classify 35 pavement sections with 8 performance indicators into 5 clusters with different performance levels [45]. Li et al. used k-means clustering to identify potential dipping in groove measurements from laser profiling data [46]. In this study, k-means clustering was used to classify the 21,666 samples into different climatic regions.
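The iterative procedure described above corresponds to Lloyd's algorithm, which alternates between assigning each sample to its nearest centroid and moving each centroid to the mean of its samples, thereby reducing the within-cluster sum of squared Euclidean distances. The sketch below is a minimal NumPy version under stated assumptions: random initial centroids, no variable standardization, and illustrative function and parameter names. It is not the software used in the study.

```python
import numpy as np


def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm; returns (centroids, labels, within_cluster_ss)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Assumption: start from k distinct samples chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every sample to every centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Move each centroid to the mean of its samples (keep it if empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    wcss = d2[np.arange(len(X)), labels].sum()
    return centroids, labels, wcss


if __name__ == "__main__":
    # Hypothetical two-variable samples forming two well-separated groups.
    X = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
    centroids, labels, wcss = kmeans(X, k=2)
    print(labels, round(float(wcss), 3))
```

With variables on very different scales, such as temperature in °C and precipitation in mm, standardizing the columns before clustering is a common choice, although the text does not say whether this was done.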

Discriminant analysis
Discriminant analysis classifies samples into groups based on multiple characteristics of each sample. Unlike cluster analysis, which is an unsupervised learning method, discriminant analysis is a supervised machine learning method and needs labeled classifications. Frequently adopted discriminant algorithms include Bayesian discriminant, linear discriminant, etc.

PCA
Figure 3 shows the PCA results for the 16 variables based on correlations. Figure 3 (a) is the scree plot showing the eigenvalue of each principal component in order from largest to smallest. The eigenvalues of the first two components are 7.7 and 3.8, respectively. Figure 3 (b) shows the proportion of the total variation explained by each component; the eigenvalues are scaled to sum to the number of variables. The first two components account for 47.5% and 24.1% of the total variance, respectively. It is therefore reasonable to use the first two components to represent all 16 variables, accounting for 71.6% of the total variance. Table 2 shows the loading matrix for the first five components. The i-th column of loadings is the i-th eigenvector multiplied by the square root of the i-th eigenvalue. Each component is the weighted sum of the 16 variables with the loadings as the weighting coefficients. A high loading value indicates a high correlation between the variable and the component. Loading values higher than 0.5 are shown in bold for better illustration. It can be seen that the first component is mainly related to the five temperature factors and four freeze condition factors. The second component is mainly related to the five precipitation and humidity factors. The third component is mainly related to two humidity factors, and the fourth is mainly related to snow covered days. The remaining components have much weaker correlations with all the variables. Therefore, the temperature and humidity components can be used to represent the 16 climatic factors.
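The quantities reported above (eigenvalues, proportions of variance, and loadings formed as eigenvector times the square root of the eigenvalue) can be computed from a data matrix as in the following sketch. It assumes a NumPy array with samples in rows and variables in columns. The function name `pca_correlation` and the synthetic data are illustrative only and do not reproduce the LTPP results.

```python
import numpy as np


def pca_correlation(X):
    """PCA of the correlation matrix of X (rows = samples, columns = variables).

    Returns (eigenvalues, proportion_of_variance, loadings), sorted from the
    largest to the smallest eigenvalue.
    """
    X = np.asarray(X, dtype=float)
    corr = np.corrcoef(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)   # ascending order for symmetric input
    order = np.argsort(eigvals)[::-1]         # re-sort from largest to smallest
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Eigenvalues of a correlation matrix sum to the number of variables.
    proportion = eigvals / eigvals.sum()
    # Column i of the loadings = eigenvector i * sqrt(eigenvalue i).
    loadings = eigvecs * np.sqrt(np.clip(eigvals, 0.0, None))
    return eigvals, proportion, loadings


if __name__ == "__main__":
    # Synthetic example: two strongly correlated variables and one independent.
    rng = np.random.default_rng(0)
    t = rng.normal(size=200)
    X = np.column_stack([t, t + 0.1 * rng.normal(size=200), rng.normal(size=200)])
    eigvals, proportion, loadings = pca_correlation(X)
    print(np.round(eigvals, 3), np.round(proportion.cumsum(), 3))
```

The cumulative sum of the proportions is the quantity behind the reported 71.6%, which is the sum of 47.5% and 24.1% for the first two components.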

Factor analysis
Table 3 shows the rotated loading matrix of the first two factors based on the orthogonal rotation technique of PCA. The scree plot and the proportion of eigenvalues are the same as in Fig. 1. Each of the 16 variables can be expressed as the weighted sum of the two common factors, which explain 71.6% of the total variance. It can be seen that the first common factor is the temperature factor, on which the 10 temperature related variables have large loading values. The second common factor is the humidity factor, on which the six humidity related variables have large loading values. It is noted that the snowfall and snow covered days are more related to the temperature factor. Further, the maximum annual temperature is related to both the temperature and the humidity factors, and its loading on the humidity factor is negative (−0.61), indicating that a higher maximum annual temperature is usually associated with a lower precipitation level.

Cluster analysis
Firstly, k-means clustering with four clusters was performed. Figure 6 shows the average of each of the 16 variables for the four clusters. When all the samples are classified into four groups, these averages are the center points that achieve the minimum within-cluster sum of squares. Based on temperature, precipitation and snowfall, we can estimate from Fig. 6 that cluster 1 is wet no freeze, cluster 2 is dry no freeze, cluster 3 is dry freeze and cluster 4 is snow freeze. The major difference between clusters 3 and 4 is not precipitation but snowfall. The clusters are therefore not exactly the same as the original four climatic regions defined by the LTPP. Different numbers of clusters were also tested, and the CCC values are shown in Fig. 4. It can be seen that the highest CCC is achieved with 9-means clustering. Figure 5 shows the distribution of all 21,666 samples for both 4-means and 9-means clustering, and for the original four climatic regions defined by the LTPP, in the coordinates of the first two principal components. The horizontal and vertical axes are the first and second principal components, representing temperature and humidity, respectively. For the 4-means clustering, it can be clearly seen from Fig. 5 (a) that clusters 1 and 2 are in the high temperature region with high and low humidity, respectively. Clusters 3 and 4 are in the low temperature region, and cluster 4 has even lower temperature and higher humidity, causing the high snowfall shown in Table 4. Compared with Fig. 5 (c), the cluster borders are more distinct.
For the optimal 9-means clustering with the lowest within-cluster sum of squares, we can see from Fig. 5 (b) that the data points are more centralized, especially along the temperature principal component. In addition to the freeze and no freeze clustering, the model suggests four to five temperature clusters. We could use cold, cool, mild, warm, and hot instead of the original freeze and no freeze temperature classification and obtain nine climatic regions: wet hot, wet warm, wet mild, wet cool, wet cold, dry hot, dry warm, dry mild and dry cool. This finding agrees with Wang's recommendation to add wet mild and dry mild regions to the original four climatic regions [13]. In summary, PCA and factor analysis can identify the main components and common factors of the 16 climatic variables, and cluster analysis can be used to classify data samples or weather stations to help determine climatic regions.

Discriminant analysis
With known climatic regions, supervised machine learning algorithms can be used to determine the region of a new weather station, or of an existing station under a changing climate, from the collected climatic data. There are 477 samples in the original dataset with no climatic region labeled, and therefore the remaining 21,189 samples were used for the following supervised learning analyses. A randomly selected 66% of the samples were used as the training set and the rest were used as the testing dataset. The model parameters are first trained with the training dataset and the model is then tested with the testing dataset. Fisher's linear discriminant analysis was conducted first. Table 5 shows the classification matrix for the training and validation datasets. Each row of the matrix sums to 100% in both datasets. In the training dataset, for the wet no freeze climatic region, 85% of the classifications are correct and the majority of the misclassifications (12%) are classified as wet freeze. In the testing dataset, for the wet no freeze climatic region, 81% of the classifications are correct and the majority of the misclassifications (15%) are classified as wet freeze. The classification matrix of the testing dataset is close to that of the training dataset, indicating that there is no overfitting, that is, the model is not valid only for the training dataset while failing on the testing dataset. The robustness of the discriminant model is therefore validated.
Figure 6 shows the distribution of the classified samples in the coordinates of the first two principal components. It can be seen from Fig. 6 that the results of Fisher's discriminant analysis are very close to the original four climatic regions shown in Fig. 5 (c). The misclassification rates for the training and testing datasets are 13.6% and 14.4%, respectively, which is fairly good given that the classification is based only on the value of a linear combination of the predictors. Therefore, the discriminant model can be used to classify climatic regions based on the 16 climatic variables, while the classification accuracy could be further improved with other suitable supervised learning algorithms.
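A classification matrix whose rows sum to 100%, together with the overall misclassification rate, can be computed from actual and predicted labels as in the sketch below. The function name `classification_matrix` and the example labels are hypothetical and do not reproduce the values of Table 5.

```python
from collections import Counter


def classification_matrix(actual, predicted, labels):
    """Row-normalised confusion matrix (percent) and misclassification rate.

    matrix[a][p] is the percentage of samples with actual label `a` that
    were predicted as `p`, so every non-empty row sums to 100.
    """
    counts = Counter(zip(actual, predicted))
    matrix = {}
    for a in labels:
        row_total = sum(counts[(a, p)] for p in labels)
        matrix[a] = {
            p: (100.0 * counts[(a, p)] / row_total if row_total else 0.0)
            for p in labels
        }
    wrong = sum(1 for a, p in zip(actual, predicted) if a != p)
    return matrix, wrong / len(actual)


if __name__ == "__main__":
    labels = ["wet no freeze", "wet freeze"]
    actual = ["wet no freeze"] * 4 + ["wet freeze"] * 4
    predicted = ["wet no freeze"] * 3 + ["wet freeze"] * 5
    matrix, rate = classification_matrix(actual, predicted, labels)
    print(matrix["wet no freeze"], rate)
```

Computing the matrix separately on the training and testing subsets and comparing the two is the overfitting check described above.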

ANN
Owing to its large number of model parameters and its capability for nonlinear transformation, ANN has been shown to achieve high prediction accuracy in machine learning. As shown in Fig. 7, two ANN models were established. One has one hidden layer with five nodes, and the other one has two hidden layers with 10 nodes in

Conclusions and future research
In this study, climatic data were used to investigate the climatic regionalization for pavement infrastructure. First, 16 historical climatic variables, including annual temperature, freeze-thaw, precipitation, and snowfall conditions, were extracted from the LTPP database and the effect and significance of climate change were evaluated. Unsupervised machine learning methods, including PCA, factor analysis and cluster analysis, were then used to identify the main components and common factors of the climatic variables and to classify the data into groups. Finally, Fisher's discriminant analysis and ANN models were built to predict the climatic regions from climatic data. The benefit of unsupervised machine learning is to identify the key factors and find the optimal clustering of climatic conditions based on the similarities among data samples, while supervised machine learning can provide a more accurate classification based on the data. Investigation of the LTPP annual climatic data shows that the mean, maximum and minimum of the average temperature increase significantly in most States while the freeze index and freeze-thaw cycles decrease significantly. In addition, around 50% of the States show significantly increasing precipitation and humidity while only 10-20% of the States show a significantly decreasing trend. Rising temperatures increase evaporation, causing increased precipitation. Around one third of the 800 weather stations record variation of the freeze and precipitation classifications, and a few of them show a significant change of classification over time based on the results of logistic regression analyses. Results of PCA show that the first two components, which are highly correlated with temperature and humidity respectively, account for a total of 71.6% of the variance and can be used to reduce the dimensionality of the original climatic variables. Results of factor analysis show that temperature and humidity are the two common factors, and that the snowfall and snow covered days are more related to the temperature factor. The 4-means clusters are wet no freeze, dry no freeze, dry freeze and snow freeze. The 9-means cluster model with the highest CCC suggests 4 and 5 temperature clusters for dry or wet conditions. Both Fisher's linear discriminant analysis and ANN can effectively predict climatic regions from multiple climatic variables. ANN performs better, with higher R square and lower misclassification rate, especially for networks with more layers and nodes. This study focused on using multiple climatic data. In future studies, geological and solar radiation data could be included to potentially improve the clustering and prediction.

Fig. 1
Fig. 1 Results of the PCA

Fig. 2
Fig. 2 Climatic regions and their variations. a Original four climatic regions in the LTPP. b Variation of freeze classifications. c Variation of precipitation classifications

Fig. 3
Fig. 3 Results of the PCA. a Scree plot. b Proportion of eigenvalues

Fig. 4
Fig. 4 CCC for different clusters

Table 1
Statistical description of the LTPP climatic data

Table 2
Loading matrix of the first five components

Table 3
Rotated loading matrix of the first two factors

Table 4
Average of climatic variables for each cluster

Table 5
Classification matrix for training and validation datasets of the discriminant analysis

Table 6
Classification matrix for training and validation datasets of the two ANNs
Columns: climatic region; training (wet no freeze, dry no freeze, wet freeze, dry freeze); validation (wet no freeze, dry no freeze, wet freeze, dry freeze). Row groups begin with the 1 layer ANN.