Abstract: Constructing a predictive model for urban water supply pipeline failure events is crucial for assessing the likelihood of pipeline failures and provides an important basis for renovating and upgrading water supply networks. Pipeline failure can be modeled by either classification or regression. Current research typically applies only one of these approaches in a case study and does not compare the applicability and accuracy of the two. To address this gap, this paper uses data from a specific water supply network to build classification and regression failure models with three machine learning algorithms: Random Forest (RF), Backpropagation Neural Network (BPNN), and Support Vector Machine (SVM). The concordance index (C-index) is used to compare the accuracy of the classification and regression models. In addition, classification and regression metrics are used to analyze how the method of dividing the modeling dataset and the composition ratio of the dataset affect the failure models. The results show that the failure models built with RF perform best, and that the C-index of the classification models is 5.4% to 32.8% higher than that of the corresponding regression models. Compared with dividing the modeling dataset by year, dividing it randomly improves the predictive accuracy of both types of models. The composition ratio of the modeling dataset affects the two types of models differently: as the proportion of non-failure pipeline data increases, the accuracy of the classification model in predicting pipeline failure events decreases, while the error of the regression model in predicting pipeline failure times is reduced.
Therefore, when constructing water supply pipeline failure models in practice, it is necessary to choose the modeling method appropriately based on the characteristics of the target dataset and pay attention to the impact of dataset division methods and composition ratios on the model results.
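
To illustrate how a single C-index can score both model types, the following minimal sketch computes a pairwise concordance index for a classifier's predicted failure probabilities and for a regressor's predicted failure times. The abstract does not state the exact formulation used, so the definition below (no censoring, ties in the observed value skipped, tied predictions counted as 0.5) is an assumption, and the function name and toy data are hypothetical. For a binary outcome this definition coincides with the area under the ROC curve.

```python
from itertools import combinations


def concordance_index(observed, predicted):
    """Fraction of comparable pairs whose predicted order matches the observed order.

    Assumed definition: pairs with equal observed values are not comparable
    and are skipped; pairs with tied predictions count as half concordant.
    """
    concordant = 0.0
    comparable = 0
    for (o_i, p_i), (o_j, p_j) in combinations(zip(observed, predicted), 2):
        if o_i == o_j:
            continue  # not comparable: no observed ordering to reproduce
        comparable += 1
        if p_i == p_j:
            concordant += 0.5
        elif (o_i < o_j) == (p_i < p_j):
            concordant += 1.0
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical classification output: 1 = pipe failed, 0 = no failure,
# scored against the predicted failure probability.
failed = [1, 0, 1, 0, 0]
failure_prob = [0.9, 0.2, 0.4, 0.5, 0.1]
c_cls = concordance_index(failed, failure_prob)

# Hypothetical regression output: observed vs. predicted failure time (years).
observed_years = [3.0, 8.0, 5.0, 12.0]
predicted_years = [4.0, 7.5, 9.0, 8.0]
c_reg = concordance_index(observed_years, predicted_years)

print(f"classification C-index: {c_cls:.3f}")
print(f"regression C-index:     {c_reg:.3f}")
```

Because the index depends only on how the predictions rank the pipes, it places a probability output and a time output on the same 0 to 1 scale, which is what makes a direct comparison between the two model types possible.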