Comparative Analysis of Gradient-Boosting Ensembles for Estimation of Compressive Strength of Quaternary Blend Concrete

Concrete compressive strength is usually determined 28 days after casting by crushing samples. However, the design strength may not be achieved after this time-consuming and tedious process. While the use of machine learning (ML) and other computational intelligence methods has become increasingly common in recent years, findings from the pertinent literature show that gradient-boosting ensemble models mostly outperform comparative methods while also yielding interpretable models. In contrast to the comparison with other model types that has dominated existing studies, this study centres on a comprehensive comparative analysis of the performance of four widely used gradient-boosting ensemble implementations [namely, gradient-boosting regressor, light gradient-boosting model (LightGBM), extreme gradient boosting (XGBoost), and CatBoost] for estimation of the compressive strength of quaternary blend concrete. Given the contents of cement, ground granulated blast furnace slag (GGBS), fly ash, water, superplasticizer, coarse aggregate, and fine aggregate, in addition to the age of each concrete mixture, as input features, the performance of each model based on R², RMSE, MAPE and MAE across varying training-test ratios generally shows a decreasing trend as the test partition increases. Overall, the test results showed that CatBoost outperformed the other models with R², RMSE, MAE and MAPE values of 0.9838, 2.0709, 1.5966 and 0.0629, respectively, with further statistical analysis showing the significance of these results. Although the age of each concrete mixture was found to be the most important input feature for all four boosting models, sensitivity analysis of each model shows that the compressive strength of the mixtures does increase significantly after 100 days. Finally, a comparison of the performance with results from different ML-based methods in the pertinent literature further shows the superiority of CatBoost over the reported methods.


Introduction
Climate change and global warming have accelerated due to increasing emissions of greenhouse gases (GHG). This has led to serious environmental problems, such as drought, flood and heat waves (Pandey & Kumar, 2022). The production of concrete used in the construction industry remains one of the largest sources of GHG, and accounts for about 50% of global emissions (Allujami et al., 2022a, 2022b; Di Filippo et al., 2019). GHG from concrete production is expected to increase as demand for concrete keeps surging due to human development. The production of Portland cement (PC) produces vast amounts of CO₂ through a process called calcination of calcium oxide (CaO). This calcination accounts for around 7% of the global CO₂ emissions to the atmosphere (Benhelal et al., 2019). This emission is expected to increase as the annual consumption of cement rises from its present 4000 million tonnes to about 6000 million tonnes by the year 2060 (Moreira & Arrieta, 2019). These figures show the need for sustainable and more environmentally friendly materials to replace cement partially or fully, not only to meet the growing demand, but also to reduce emissions of CO₂ (Ebid et al., 2022; Mikulčić et al., 2016).
In view of the abovementioned problems, industrial wastes have been used in the production of concrete. This approach results in a drastic decrease in the PC used in construction and also prevents the environmental degradation caused by disposal of these hazardous industrial wastes (Agrawal et al., 2021; Hashim & Tantray, 2021). The use of industrial wastes can reduce the GHG emissions of normal concrete by about 80%. The commonly used industrial wastes that act as supplementary cementitious materials in concrete include fly ash (FA), ground granulated blast furnace slag (GGBS) and silica fume (SF) (Hammad et al., 2021; Hashmi et al., 2021; Okashah et al., 2020). They have been used as partial replacements for cement when producing improved and more sustainable concrete. This practice is favoured by the availability of large quantities of these industrial wastes: about 300 million tonnes of FA are produced annually, with only 25% of this production being used for concrete production (Dan et al., 2021). Similarly, annual global production of GGBS is around 280 million tonnes, with less than 10% of this production being utilised in concrete production (Kamath et al., 2021).
In the production of concrete for structural usage, in-depth and accurate knowledge of its properties is required (Ebid & Deifalla, 2022; Salem & Deifalla, 2022; Song et al., 2021). Compressive strength, being the most important property, can be improved by partial replacement of cement with these cementitious industrial wastes in the right proportions. The compressive strength is generally ascertained by testing (crushing) concrete specimens (cubes or cylinders), usually 28 days after casting (Allujami et al., 2022a, 2022b; Ebid & Deifalla, 2021). However, this method of obtaining the compressive strength of concrete is time consuming, tedious and expensive (Badra et al., 2022; Silva et al., 2020). In addition, the desired strengths are often not attained, which makes the method less effective (Deifalla & Salem, 2022; Salami et al., 2022). This has led researchers to use machine learning (ML) and artificial intelligence (AI) algorithms to obtain the mechanical properties of concrete. The use of AI and ML techniques, such as decision tree (DT), artificial neural network (ANN), support vector machine (SVM), and extreme learning machine (ELM), in estimating (predicting) concrete properties takes into account certain parameters of the concrete (such as concrete mix proportions and concrete age) and its constituents to achieve reliable estimations (Gupta et al., 2006; Mustapha et al., 2022).
Several ML approaches have been proposed over the years for accurate estimation of the compressive strength of concrete. For example, Cook et al. (2019) presented a hybrid ML model that combined the firefly algorithm (FFA) with random forests (RF) to predict the compressive strength of concrete. A correlation between the input variables and output was developed by training the hybrid (RF-FFA) model with two different categories of data sets. They concluded that the hybrid RF-FFA model performed better than standalone ML models, such as SVM, RF, the M5Prime model-tree algorithm and multilayer perceptron-ANN (MLP-ANN). Shariati et al. (2020) presented a novel hybrid ML approach using the grey wolf optimizer to predict the compressive strength of concrete with partial replacement of cement. The results were compared to those obtained via an adaptive neuro-fuzzy inference system (ANFIS), extreme learning machine (ELM), ANN, support vector regression (SVR) with a radial basis function (RBF) kernel (SVR-RBF), and another SVR with a polynomial function kernel (SVR-Poly). Dao et al. (2020a, 2020b) applied an optimized conventional ANN to predict the compressive strength of foamed concrete. Dry density was included as an input parameter, while the volume of foam was ignored in their study. The results showed a high correlation R² of 0.97 for the models. The authors referred to ANN as a black-box model, since it provides no practical information about the predicted model, citing the vast number of hidden neurons as a major impediment to developing an empirical relation between input and output parameters. Abellán-García (2020) presented an ANN model with four layers to predict the compressive strength of ultra-high-performance concrete (UHPC). A total of 927 data samples and 18 mixture design variables were used as input. While impressive results were similarly reported, the proposed approach shares a common shortcoming with the other aforementioned approaches in that knowledge of the contribution of each input feature to the model predictions for the concrete mixtures is lacking. Besides, the results reported in most of these studies are still open to further improvement.
The quest for more accurate estimation of the compressive strength of high-performance concrete (HPC) has inspired the use of nature-inspired algorithms, such as gene expression programming (GEP). For instance, Ullah et al. (2022) applied a database of 191 data points to develop a relationship between the mix design parameters and compressive strength of foamed concrete using GEP. The input variables were cement content, sand content, water-to-cement ratio and foam volume, while the output parameters were the dry density and compressive strength. The results showed that 95% of the predicted compressive strengths had error values of less than 2%. Recently, Shah et al. (2022) presented a comparative analysis using different ML techniques to predict the compressive strength of sugarcane bagasse ash (SCBA) concrete. The ML techniques included random forest regression (RFR), GEP and SVM. The results were compared to experimental testing. The input variables were water-cement ratio, cement content, SCBA dosage (SCBA%), and the quantities of fine aggregate and coarse aggregate. The results showed that the R² values of all the ML techniques were above 0.85, and the RRMSE and performance index (PI) were less than 10% and 0.2%, respectively, with GEP producing the most accurate results across the compared methods. While GEP allows generation of simple mathematical equations for built models, it can be computationally expensive. Besides, its performance has long been shown to be similar to or lower than that of other existing genetic programming methods (Oltean & Grosan, 2003). In fact, recent studies on compressive strength estimation (Fakharian et al., 2023; Salami et al., 2022; Song et al., 2021) have shown via empirical results that ML methods such as ANN and classifier ensembles outperform GEP across several evaluation metrics.
Boosting methods are a class of ensemble machine learning methods that have found wide application in many real-life domains with impressive results (Babajide Mustapha & Saeed, 2016). They generally enhance learning by merging the predictions of several simple base learners into a composite whole (Tanha et al., 2020). Different implementations of boosting ensembles have also been employed by several researchers for compressive strength estimation. For example, Kaloop et al. (2020) investigated the use of a multivariate adaptive regression splines (MARS) model to extract the optimum inputs to use for compressive strength design of HPC. The extracted features were fed to a gradient-tree-boosting machine (GBM). While improved results over comparative methods were reported, the authors also found concrete age to be the most influential input parameter. Feng et al. (2020) applied an adaptive boosting algorithm (Adaboost) to predict the compressive strength of concrete given curing time and mixture contents as input variables. Using tenfold cross validation for model validation, the authors reported notable improvement in performance over classical methods, such as ANN and SVM. Nguyen-Sy et al. (2020) demonstrated an accurate prediction of the compressive strength of concrete using an extreme gradient-boosting (XGBoost) model. Sensitivity analysis was carried out to optimize the number of estimators by varying it from 100 to 1000 while keeping the other hyperparameters at their default values. An increase in the number of estimators was found to generally lead to increased model accuracy.
In another related study, Cui et al. (2021) proposed a novel XGBoost prediction model based on grey relation analysis (GRA) for the estimation of the compressive strength of concrete containing slag and metakaolin. Empirical findings showed that XGBoost outperformed ANN and its genetic algorithm hybridized variant (GA-ANN). A similar study by Nguyen et al. (2021) concluded that XGBoost and gradient-boosting regressor (GBR) models outperformed the likes of SVM and MLP for prediction of the compressive strength and tensile strength of HPC.
Apart from XGBoost, there are other gradient-boosting implementations that have found application in concrete property estimation. For instance, Alabdullah et al. (2022) investigated the use of LightGBM in the estimation of the compressive strength of UHPC with similarly high prediction accuracy. In another pertinent study, de-Prado-Gil et al. (2022) applied a CatBoost (CBT) model to predict the compressive strength of self-compacting concrete. The study was conducted using 381 data samples. Experimental findings show that the cement content had the highest influence on model output.
There has also been a notable growth in the application of deep learning methods for compressive strength estimation in recent years. Jang et al. (2019) proposed image-based compressive strength estimation of concrete using three deep neural network (DNN) architectures, namely, ResNet, GoogLeNet, and AlexNet. Images of the surfaces of specially produced specimens were captured with a portable digital microscope and used to train each model for compressive strength estimation. Empirical results show that the DNN models outperformed the fully connected ANNs, with ResNet showing the best performance. In addition, a deep learning-based estimation of the compressive strength of fiber-reinforced concrete at elevated temperatures was proposed in (Chen et al., 2021). Using the concrete mix, heating profile, and fiber properties as model inputs, three variations of convolutional neural network (CNN) models were shown to outperform several models that include SVR, ANN and Adaboost. Deep learning models such as CNN have also been hybridized with evolutionary algorithms, such as GA, for improved performance (Ranjbar et al., 2022). More recently, Hoang (2023) proposed a deep learning-based estimation of the compressive strength of rice husk ash-blended concrete using an asymmetric loss function. Results from this study showed better performance than ANN and multivariate adaptive regression splines.
The pursuit of accurate estimation of the compressive strength of concrete has inspired a myriad of research studies over the years, each seeking to achieve this goal via some machine learning method. However, the foregoing findings show that gradient-boosting ensembles and DNN-based approaches stand out, mostly performing better than popular methods, such as SVR, classical ANN, GEP, KNN and their hybrid variants, amongst others. The gradient-boosting ensemble methods are the particular focus of this study, given their high accuracy and interpretability. Besides, a comprehensive comparative study on gradient-boosting algorithms for prediction of the compressive strength of quaternary blend concrete remains lacking. Such a study has the potential to guide field engineers on the choice of computational tools for accurate and reliable estimation of properties when designing concrete.
Thus, this study aims to compare the performance of four gradient-boosting algorithms in estimating the compressive strength of quaternary blend concrete. The algorithms are gradient-boosting regressor (GBR), light gradient-boosting model (LGBM), eXtreme gradient boosting (XGB), and CatBoost (CBT). In the training phase, hyperparameter optimization of each algorithm is first carried out using fivefold cross validation to ensure optimal model performance. Twenty optimal models were built, five for each gradient-boosting algorithm, using different training-test splits to obtain the best performing model in terms of mean squared error. The input variables are the proportions of cement, ground granulated blast furnace slag (GGBS), fly ash (FA), water, superplasticizer, coarse aggregate and fine aggregate, together with concrete age. The performance of each of the final models is evaluated using four popular statistical measures, namely, root mean squared error (RMSE), mean absolute error (MAE), mean absolute percentage error (MAPE) and coefficient of determination (R²). A sensitivity analysis is carried out to understand the importance and contribution of the input/predictor variables. Finally, the obtained results are compared with results reported for other methods in the previous literature.
The key contributions of this study are highlighted as follows:
• Prediction of the compressive strength of quaternary blend concrete using CBT.
• A comprehensive comparative analysis of gradient-boosting algorithms (GBR, CBT, XGB and LGBM) for the estimation of the compressive strength of quaternary blend concrete.
• An intuitive insight into the importance and contribution of input features for the estimation of the compressive strength of quaternary blend concrete.
• Comparison of the performance of gradient-boosting algorithms with results from previous studies.

Computational Methods
The gradient-boosting ensembles considered in this research are gradient-boosting regressor (GBR), light gradient-boosting model (LGBM), eXtreme gradient boosting (XGB), and CatBoost (CBT). These models have been selected based on their performance in pertinent studies relating to estimation of the mechanical properties of concrete. Their interpretability makes them especially useful for field engineers, who can understand the impact of input parameters without undergoing tedious and time-consuming laboratory experiments. Each of the selected methods is detailed in what follows.

Gradient-Boosting Regressor
Gradient-boosted decision trees (GBDT) have been widely used in machine learning. However, gradient-boosting regressor (GBR) (Friedman, 2002) is arguably the earliest well-known implementation of the idea of gradient descent boosting of decision trees, which optimizes an arbitrary differentiable loss function via a stagewise additive approach to model building. Every iteration of the model building process involves fitting a classification and regression tree (CART) on the negative gradient (i.e., the residual error between the estimated and the target output) of an arbitrary loss function (Friedman, 2002). Gradient boosting of decision trees has been shown to be robust to overfitting while producing highly competitive results, especially when modelling noisy data. In addition, it is interpretable, as it offers the relative importance of the input features used in model building. The two main hyperparameters for optimal gradient boosting are the number of boosting stages and the shrinkage parameter, also known as the learning rate (Friedman, 2001).
In general, in GBR, the model is initialized with a constant value γ (a tree with just one leaf node) that minimizes the loss over all the samples, as in the following equation:
$$F_0(x) = \arg\min_{\gamma} \sum_{m=1}^{n} L(y_m, \gamma)$$
This is followed by several iterations in which the negative gradient of the loss function L is computed and used to fit a decision tree $h_t(x)$, which is then added to the ensemble as in the following equation:
$$F_t(x) = F_{t-1}(x) + \nu \, h_t(x)$$
where $F_t$ is the ensemble after iteration t and ν is the shrinkage parameter used to control overfitting. Although GBR is used for a regression problem in the present study, it is also suitable for classification problems. Extensive details of the theoretical foundation of gradient-boosting regressor can be found in (Friedman, 2001, 2002).
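The stagewise procedure described above can be illustrated with a minimal sketch in plain Python. It assumes a squared-error loss (for which the negative gradient is simply the residual), a single input feature, and one-split trees (stumps) as base learners; the function names and the small age/strength sample are hypothetical and are not taken from the study's data set or from any library.

```python
from statistics import mean

def fit_stump(x, residuals):
    """Least-squares regression stump (a one-split tree) on a single feature."""
    best = None
    # The largest value is excluded so that the right-hand side is never empty.
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_value, right_value = mean(left), mean(right)
        sse = (sum((r - left_value) ** 2 for r in left)
               + sum((r - right_value) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_value, right_value)
    _, threshold, left_value, right_value = best
    return lambda xi: left_value if xi <= threshold else right_value

def gradient_boost(x, y, n_stages=50, learning_rate=0.1):
    """Stagewise additive boosting for squared loss; returns a predict function."""
    initial = mean(y)  # the constant that minimizes squared loss
    stumps = []
    current = [initial] * len(y)
    for _ in range(n_stages):
        # For L = 0.5 * (y - F)^2 the negative gradient is the residual y - F.
        residuals = [yi - fi for yi, fi in zip(y, current)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        # Shrunken update of the ensemble prediction.
        current = [fi + learning_rate * stump(xi) for xi, fi in zip(x, current)]
    return lambda xi: initial + learning_rate * sum(s(xi) for s in stumps)

def mse(y_true, y_pred):
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

# Made-up illustrative values (age in days, strength in MPa), NOT study data.
ages = [1, 3, 7, 14, 28, 56, 90, 180]
strengths = [8.0, 15.0, 23.0, 30.0, 37.0, 42.0, 45.0, 47.0]

baseline_mse = mse(strengths, [mean(strengths)] * len(strengths))
short_model = gradient_boost(ages, strengths, n_stages=5)
long_model = gradient_boost(ages, strengths, n_stages=50)
short_mse = mse(strengths, [short_model(a) for a in ages])
long_mse = mse(strengths, [long_model(a) for a in ages])
print(baseline_mse, short_mse, long_mse)
```

In this toy setting the training error falls as boosting stages are added, which mirrors the roles of the two main hyperparameters (number of stages and learning rate); it says nothing about test performance, which is what the hyperparameter search is meant to protect.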

XGBoost
Another gradient-boosting implementation that is considered in this study is the extreme gradient-boosting (XGBoost) algorithm. XGBoost is an optimized variant of gradient boosting that combines the predictions of several "weak" classification and regression tree (CART) learners to develop a "strong" learner using additive training strategies (Chen et al., 2015). XGBoost is especially known for preventing overfitting efficiently through a simplified objective function that combines the loss and regularization terms. The regularized optimization objective is as in the following equation:
$$\mathcal{L} = \sum_{m=1}^{n} l(y_m, \hat{y}_m) + \sum_{k} \Omega(f_k)$$
where l is the loss function that measures the difference between the experimental output $y_m$ and the estimated output $\hat{y}_m$, $f_k$ is the k-th tree in the ensemble, and Ω is the regularization term given as the following equation:
$$\Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^{2}$$
where T and w are the number of leaves and the scores on the leaves, respectively; γ and λ are constants for controlling the degree of regularization. Although used for a regression problem in this study, XGBoost is suitable for all types of supervised learning problems. See Chen et al. (2015) for detailed background on this algorithm.

LightGBM
Another novel implementation of gradient-boosted decision trees (GBDT) that has been proposed to address the scalability and efficiency problems of its traditional counterpart is LightGBM (LGBM) (Ke et al., 2017). Unlike the traditional GBDT, which entails the time-consuming process of scanning all data samples to estimate the information gain of all possible split points for each tree node, LGBM proposes two novel techniques called gradient-based one-side sampling (GOSS) and exclusive feature bundling (EFB). In GOSS, samples with large gradients are considered more important and are favoured in the estimation of information gain for split point selection. Thus, a significant proportion of data samples is excluded when estimating the information gain, with little or no impact on the accuracy of the estimated gain. On the other hand, the EFB technique tackles the NP-hard problem of bundling mutually exclusive features (i.e., features that rarely take nonzero values simultaneously) to reduce the number of features with negligible impact on the accuracy of split point determination. Although used for a regression problem in this study, LGBM is suitable for all supervised learning problems. Further details on LGBM can be found in (Ke et al., 2017).
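The GOSS idea can be sketched in a few lines of plain Python. The sketch follows the sampling scheme described by Ke et al. (2017): keep the fraction of samples with the largest absolute gradients, draw a random subset of the remainder, and up-weight that subset so the gain estimate stays approximately unbiased. The function name, parameter names and default rates below are illustrative assumptions and are not the LightGBM API.

```python
import random

def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return (indices, weights) chosen by gradient-based one-side sampling."""
    n = len(gradients)
    # Sort sample indices by absolute gradient, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    n_top = round(top_rate * n)
    n_other = round(other_rate * n)
    top, rest = order[:n_top], order[n_top:]
    # Randomly keep a small part of the small-gradient samples.
    sampled = random.Random(seed).sample(rest, n_other)
    # Small-gradient samples are amplified to compensate for the subsampling.
    amplify = (1.0 - top_rate) / other_rate
    return top + sampled, [1.0] * n_top + [amplify] * n_other

rng = random.Random(1)
gradients = [rng.gauss(0.0, 1.0) for _ in range(100)]  # synthetic gradients
indices, weights = goss_sample(gradients)
print(len(indices), "of", len(gradients), "samples used for gain estimation")
```

With these assumed rates, only 30 of the 100 synthetic samples enter the gain computation, while every large-gradient sample is retained.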

CatBoost
Similar to the GBDT algorithms described above, CatBoost (CBT) is a machine learning algorithm that leverages gradient boosting on decision trees. CBT is a unique GBDT implementation that is known for its categorical feature handling capability (Dorogush et al., 2018). The two main algorithmic advances introduced in CBT are the implementation of ordered boosting, which is a permutation-driven alternative to the classic algorithm, and an innovative algorithm for processing categorical features. Both techniques were created to fight a prediction shift caused by a special kind of target leakage present in all currently existing implementations of gradient-boosting algorithms. Likewise, CBT has the advantage of using a new schema for leaf value calculation when selecting tree structures, which greatly alleviates the problem of overfitting. Although used for a regression problem in this study, CBT is suitable for all supervised learning problems. Extensive details on CBT can be found in (Dorogush et al., 2018).
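To give a flavour of the permutation-driven treatment of categorical features, the following plain-Python sketch computes an ordered target statistic: each sample's category is encoded using only the targets of samples that precede it in a permutation, smoothed towards a prior. This is a simplified illustration of the idea, not the CatBoost implementation; the function name, the `prior_weight` parameter, the choice of the overall target mean as prior, and the toy values are all assumptions. The data set used in this study contains only numerical features, so this mechanism is shown purely for explanation.

```python
import random

def ordered_target_statistic(categories, targets, prior_weight=1.0,
                             permutation=None, seed=0):
    """Encode a categorical feature from the targets of earlier samples only."""
    n = len(targets)
    prior = sum(targets) / n  # assumed prior: overall mean of the target
    if permutation is None:
        permutation = list(range(n))
        random.Random(seed).shuffle(permutation)
    sums, counts = {}, {}
    encoded = [0.0] * n
    for i in permutation:
        category = categories[i]
        seen_sum = sums.get(category, 0.0)
        seen_count = counts.get(category, 0)
        # The sample's own target is not part of its category sum.
        encoded[i] = (seen_sum + prior_weight * prior) / (seen_count + prior_weight)
        sums[category] = seen_sum + targets[i]
        counts[category] = seen_count + 1
    return encoded

# Hypothetical categorical feature and target values for illustration.
codes = ordered_target_statistic(["a", "a", "b", "a"], [10.0, 20.0, 30.0, 40.0],
                                 permutation=[0, 1, 2, 3])
print(codes)
```

Because a sample's encoding never includes its own target in the category sum, the direct target leakage that a plain per-category mean would introduce is avoided.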

Data Description
The quaternary concrete data applied in this study are experimental results obtained from (Lichman, 2013). The compressive strength, which is the most important property of concrete, should be accurately and reliably modelled for a quaternary concrete. Thus, the data have been carefully selected to cover the compressive strength over a wide range of ages, from 1 to 365 days. To the best of our knowledge, this is the largest and most widely used data set for compressive strength estimation. Hence, its popularity makes the results of this experiment comparable to a wide range of previous studies. The variables used as input in the modelling are age (days) and the contents of cement (kg/m³), GGBS (kg/m³), FA (kg/m³), water (kg/m³), superplasticizer (kg/m³), fine aggregate (kg/m³) and coarse aggregate (kg/m³). Fig. 1 presents a visual distribution of each feature. The numerical values of the basic statistics of the features of the 1030 data samples are also presented in Table 1. The statistics of the data set show the mean, standard deviation, minimum value, lower quartile, middle quartile (median), upper quartile, and maximum value to indicate consistency and suitability for use in this study.
In addition, a correlation analysis of all the input variables against the output, the compressive strength, is also presented to understand how changes in each input variable bring about corresponding changes in the output. The correlation coefficient (CC) was used to assess the sensitivity of the compressive strength (MPa) to each component (feature) of the concrete mixture (Mustapha et al., 2022; Salami et al., 2021). From Fig. 2, it can be observed that the input variables (cement, GGBS, fly ash, water, superplasticizer, coarse aggregate, fine aggregate and age) have varying degrees of correlation with the output. Four of the input variables (cement, GGBS, superplasticizer and age) are positively correlated with the output, whereas the remaining four (fly ash, water, coarse aggregate and fine aggregate) are inversely correlated. Positive correlation here implies that an increase or decrease in these input variables results in a corresponding increment or decrement in the compressive strength, respectively. On the other hand, an increase in the inversely correlated variables leads to a decrease in the compressive strength of concrete and vice versa.
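As a hedged illustration of how such a coefficient is obtained, the sketch below computes the Pearson correlation coefficient in plain Python. It assumes that the CC used here is the Pearson coefficient, which the text does not state explicitly, and the sample values are made up for illustration rather than taken from the data set.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)

# Made-up illustrative values, NOT study data.
cement = [150.0, 200.0, 250.0, 300.0, 350.0]    # kg/m^3
water = [220.0, 205.0, 190.0, 180.0, 165.0]     # kg/m^3
strength = [18.0, 25.0, 31.0, 40.0, 46.0]       # MPa

print("cement vs strength:", pearson(cement, strength))  # positive sign
print("water vs strength:", pearson(water, strength))    # negative sign
```

A positive value indicates that the input and the strength tend to rise together, and a negative value indicates an inverse relationship; the coefficient captures only linear association.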

Experimental Setup
The steps involved in the experimental setup of this research are depicted in Fig. 3. Following the statistical description of each variable of the data set (see Sect. […]) […] Cross validation is often used to assess the generalization capability of models in ML by splitting a given data set into two parts, where one portion is used for model training and the other is used to test how well the trained model is likely to generalise to unseen data. However, due to the varying ratios of training-test splits that have been reported in the literature, the performance of GBR, XGB, LGBM and CBT with optimized hyperparameters is initially examined across five training-test ratios, namely 90:10, 85:15, 80:20, 75:25 and 70:30. The experimental results of this process are presented and discussed in Sect. 5.2. The hyperparameter optimization for each model is carried out using only the training split of the data set to ensure that no model has access to the test partition prior to testing, as in real-life applications of machine learning. Each of the gradient-boosting algorithms considered in this study has a wide range of tuneable hyperparameters for optimal model performance; however, only a few have been selected for optimization. An exhaustive search of every possible combination of values within a specified range for each selected hyperparameter is used to train each model using fivefold cross validation. In other words, the training data are further divided into five equal partitions, each of which is used in turn to test the performance of a model trained on the remaining four partitions with one combination of hyperparameters at a time. The combination of hyperparameters that produces the best (lowest) average mean squared error over this process is deemed the optimal set of model parameters, which is then used to train the model on the entire training set before testing with the test partition that was initially set aside. As reported by Nguyen-Sy et al. (2020), increasing the number of estimators is similarly found to generally result in improved model performance. Hence, a search space of 10 to 1000 estimators is considered in this study.
Table 2 shows the hyperparameter search space and the optimal combination of hyperparameters for the 90:10 training-test split for all models. The model trained with the optimal parameters is then evaluated using the evaluation metrics described in Sect. 3.3. All experiments were performed using the Python programming language. The scikit-learn (Pedregosa et al., 2011) implementation of gradient-boosting regressor was used for the GBR model, whereas the official Python implementations of XGB, LGBM and CBT were similarly used for the respective model implementations.
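The tuning protocol described in this section (hold out a test partition, run an exhaustive grid search with fivefold cross validation on the training partition only, refit on the whole training partition, then test once) can be sketched in plain Python as follows. To keep the sketch self-contained, a one-feature ridge regression stands in for the boosting models, and the data are synthetic; the function names, the `penalty` hyperparameter and the grid values are assumptions for illustration and do not reproduce the search space in Table 2.

```python
import itertools
import random

def mse(y_true, y_pred):
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

def fit_ridge_line(x, y, penalty):
    """Stand-in model: one-feature ridge regression (NOT a boosting model)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / (sxx + penalty)
    return lambda value: mean_y + slope * (value - mean_x)

def train_test_split(n, test_fraction, seed=0):
    """Shuffle sample indices and set aside a test partition."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = round(n * test_fraction)
    return indices[n_test:], indices[:n_test]

def grid_search_cv(x, y, grid, k=5):
    """Exhaustive search; the score is the mean validation MSE over k folds."""
    best_params, best_score = None, float("inf")
    names = list(grid)
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        fold_scores = []
        for fold in range(k):
            val = [i for i in range(len(x)) if i % k == fold]
            trn = [i for i in range(len(x)) if i % k != fold]
            model = fit_ridge_line([x[i] for i in trn], [y[i] for i in trn], **params)
            fold_scores.append(mse([y[i] for i in val], [model(x[i]) for i in val]))
        score = sum(fold_scores) / k
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Synthetic data for illustration only.
rng = random.Random(42)
x_all = [rng.uniform(0, 10) for _ in range(100)]
y_all = [3.0 * v + rng.gauss(0, 1) for v in x_all]

train_idx, test_idx = train_test_split(len(x_all), test_fraction=0.10)  # 90:10
x_train, y_train = [x_all[i] for i in train_idx], [y_all[i] for i in train_idx]
x_test, y_test = [x_all[i] for i in test_idx], [y_all[i] for i in test_idx]

grid = {"penalty": [0.0, 1.0, 10.0, 100.0]}
best_params, cv_mse = grid_search_cv(x_train, y_train, grid)  # training data only
final_model = fit_ridge_line(x_train, y_train, **best_params)  # refit on full training set
test_mse = mse(y_test, [final_model(v) for v in x_test])
print(best_params, cv_mse, test_mse)
```

The point of the design is that the test partition is touched exactly once, after the hyperparameters have been fixed. With the actual libraries, the same protocol would typically be expressed with scikit-learn's `GridSearchCV` wrapped around each estimator.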

Evaluation Metrics
To evaluate the performance of the developed machine learning models in this study, widely accepted statistical metrics are used, namely the coefficient of determination (R²), root mean squared error (RMSE), mean absolute error (MAE) and mean absolute percentage error (MAPE). In these metrics, $y_m$ is the experimental output, $\hat{y}_m$ is the model-estimated output, $\bar{y}$ is the mean of the experimental output, $\bar{\hat{y}}$ is the mean of the estimated output, and n is the number of samples. MAPE also includes ε, an arbitrarily small positive constant that avoids division by zero when $y_m$ is zero. For each of MAPE, RMSE and MAE, the lower the value, the better the model. In contrast, an R² value close to 1 is the goal of the learning algorithm, i.e., the closer the R² value is to 1, the better. A baseline model that always predicts the mean of the experimental output $\bar{y}$ will have an R² value of 0, whereas a model worse than the baseline will produce a negative R² value.
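As a minimal sketch, the four metrics can be written in plain Python as below. The R² form used here (one minus the ratio of the residual sum of squares to the total sum of squares) is an assumption, chosen because it is consistent with the baseline behaviour described above; the MAPE form divides each absolute error by max(ε, |y_m|) and is reported as a fraction rather than a percentage. The function names and sample values are illustrative only.

```python
from math import sqrt

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)

def mape(y_true, y_pred, eps=1e-12):
    """Mean absolute percentage error as a fraction; eps guards against y = 0."""
    return sum(abs(a - b) / max(eps, abs(a))
               for a, b in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination, assumed form: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    ss_tot = sum((a - mean_true) ** 2 for a in y_true)
    return 1.0 - ss_res / ss_tot

# Made-up illustrative values (MPa), NOT study data.
measured = [10.0, 20.0, 30.0, 40.0]
estimated = [12.0, 18.0, 33.0, 37.0]

print(rmse(measured, estimated), mae(measured, estimated),
      mape(measured, estimated), r2(measured, estimated))
```

Under this definition, a constant prediction equal to the mean of the measured values gives an R² of exactly 0, matching the baseline described in the text.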
In addition to the results based on these evaluation metrics, a ranking test using Friedman's test (Friedman, 1940) is also carried out to test the null hypothesis that the means of the results of the gradient-boosting ensemble methods are the same at a significance level of 0.05. If this null hypothesis is rejected, Holm's test (Holm, 1979) is performed as a post-hoc analysis, in which the performances of these methods are compared pairwise to establish whether one is significantly better. The null hypothesis of Holm's test is that the means of the results of a pair of groups are equal. All statistical analyses were carried out on the STAC web platform for statistical analysis (Rodríguez-Fdez et al., 2015).
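For intuition, the Friedman statistic can be computed by hand as in the plain-Python sketch below, which uses the standard rank-sum form of the chi-square statistic with average ranks for ties. The score table is hypothetical (rows are evaluation settings, columns are four models, lower is better) and is not the study's result; the analysis reported here was run on the STAC platform, whose exact variant of the test may differ from this sketch.

```python
def average_ranks(row):
    """Rank values from 1 (smallest) upward, sharing the average rank on ties."""
    order = sorted(range(len(row)), key=lambda j: row[j])
    ranks = [0.0] * len(row)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and row[order[j + 1]] == row[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for position in range(i, j + 1):
            ranks[order[position]] = shared
        i = j + 1
    return ranks

def friedman_statistic(scores):
    """Friedman chi-square statistic for n settings (rows) and k methods (columns)."""
    n, k = len(scores), len(scores[0])
    rank_sums = [0.0] * k
    for row in scores:
        for j, rank in enumerate(average_ranks(row)):
            rank_sums[j] += rank
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)

# Hypothetical error scores for four models over five settings (lower is better).
scores = [
    [2.9, 2.7, 2.8, 2.1],
    [3.0, 2.8, 2.9, 2.3],
    [3.2, 2.9, 3.0, 2.4],
    [3.3, 3.1, 3.0, 2.6],
    [3.5, 3.2, 3.3, 2.8],
]
statistic = friedman_statistic(scores)
CRITICAL_CHI2_DF3_005 = 7.815  # chi-square critical value, 3 degrees of freedom, alpha = 0.05
print(statistic, statistic > CRITICAL_CHI2_DF3_005)
```

For this made-up table the statistic exceeds the critical value, so the null hypothesis of equal performance would be rejected and a post-hoc procedure such as Holm's test would follow. With so few rows the chi-square approximation is rough, which is one reason to rely on a dedicated statistics package in practice.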

Model Performance Across Varying Training-Test Splits
As hinted in Sect. 3.2, the lack of a globally accepted training-test split ratio inspired a preliminary study on five popular training-test ratios, namely 90:10, 85:15, 80:20, 75:25 and 70:30 (e.g., 75:25 implies that 75% of the data set is used for training, while the remaining 25% is used for testing). For each training-test ratio and learning algorithm, hyperparameter optimization is first carried out as described in Sect. 4.2 before model training and testing for estimation of compressive strength. The training and test performance of GBR, XGB, LGBM and CBT for the different training-test splits in terms of RMSE, R², MAPE and MAE is presented in Fig. 4. As expected, the training performance of the models for the different training-test splits is generally better than their respective test performance across the evaluation metrics. However, the test performances, being the true measure of the performance of the models, are relatively impressive given the marginal difference between the training and test scores. The general trend from Fig. 4 shows that as the test fraction of the training-test ratio increases, the models' respective performance tends to decrease across the evaluation metrics. Moreover, unlike the remaining training-test ratios, 90:10 consistently produced the best performance across the evaluation metrics for each learning algorithm, corroborating what was reported in (Salami et al., 2021). Hence, the result of the 90:10 training-test ratio for each of GBR, XGB, LGBM and CBT is selected and discussed in detail in the next section. The mean of the training and test scores for each model (with standard deviation) over the different ratios is also presented for each evaluation metric in Table A of the supplementary material.

Performance Comparison of Best Performing Model for Each Algorithm
Table 3 presents the training and test scores based on the evaluation metrics for compressive strength estimation using the ML methods under study. The best result for each metric is highlighted in bold. In terms of R², which measures how well the models approximate the experimental compressive strengths of each concrete mixture, the training and test performances of GBR (0.9950 and 0.9731), XGB (0.9909 and 0.9764), LGBM (0.989 and 0.9745) and CBT (0.993 and 0.9838) are very impressive given the small generalization gaps of 0.0219, 0.0145, 0.0145 and 0.0092, respectively, between the training and test performances. This implies that despite fitting the training data to near perfection, the models still generalize well. Comparatively, however, the test R² score of 0.9838 achieved by CBT is better than the 0.9731, 0.9764 and 0.9745 produced by GBR, XGB and LGBM, respectively. This indicates a relative performance improvement of 1.1%, 0.75% and 0.95% over the trio, respectively. A comparison of the experimental and estimated compressive strengths for the gradient-boosted ML models is presented in Fig. 5.
Fig. 5 shows the scatter plots of the estimated compressive strengths plotted against the experimental ones, with the respective line of best fit, for the training and test phases of each of the GBR, CBT, XGB and LGBM models. The plots intuitively illustrate how correlated the model estimations are with the experimental values. The corresponding R² (i.e., coefficient of determination) value on each plot summarises its performance with a single score. In general, the plots show that despite producing the most correlated training estimations of the compressive strength, GBR produced the least correlated estimates in the test phase. The test compressive strength estimations of CBT are most correlated with the experimental values, followed by XGB and then LGBM. Amongst these models, the GBR model produced the largest differences between the training and test scores, and hence generalizes least despite fitting the training data best. On the other hand, the CBT model generalizes best while also producing the best test performance across the different metrics. Although the GBR model fits the training data best, in terms of test performance, which is the true measure of model performance, CBT outperformed GBR, XGB and LGBM across all the error-based evaluation metrics, with performance improvements ranging from 17% to 22%, 16% to 20.4% and 12% to 20% in terms of RMSE, MAE and MAPE, respectively.
Figs. 6, 7, 8 and 9 present the superimposed line plots of experimental and estimated compressive strengths for the training and test phases (a and b) alongside the corresponding error plots (c and d) for each of the considered gradient-boosting models. The errors for the training and test phases of each model are obtained by subtracting the estimated value of compressive strength for each data sample from its corresponding experimental value in the data sets. Since the aim of each model is to estimate the actual compressive strength as closely as possible, the smaller the deviation of the error plot from zero, the better.
It can be observed from the test error plots that the CBT model shows the least deviation: its absolute error exceeds 5 MPa on only two occasions (sample indexes 66 and 87, Fig. 6d), compared to seven, five and three cases for the GBR (sample indexes 35, 64, 66, 69, 71, 75 and 86, Fig. 9d), XGB (sample indexes 17, 35, 58, 66 and 69, Fig. 7d) and LGBM (sample indexes 35, 66 and 75, Fig. 8d) models, respectively.
It is noteworthy that while all the models exceeded the 5 MPa absolute error mark on sample index 66, the GBR model deviated by an absolute error of 11 MPa on this sample, making it the worst-performing model in this regard.
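The error plots and the count of large deviations discussed above follow from a simple computation: error = experimental value minus estimated value, flagged when its magnitude exceeds 5. The sketch below illustrates this on a handful of invented strength values; the function names and all numbers in it are hypothetical and are not taken from the study's data.

```python
def residuals(experimental, estimated):
    """Error per sample: experimental minus estimated compressive strength."""
    return [y - y_hat for y, y_hat in zip(experimental, estimated)]


def exceedances(errors, threshold=5.0):
    """Indexes of samples whose absolute error is larger than the threshold."""
    return [i for i, e in enumerate(errors) if abs(e) > threshold]


# Hypothetical values in MPa, for illustration only.
experimental = [30.0, 45.2, 52.1, 18.7, 61.3]
estimated = [31.1, 44.0, 46.0, 19.0, 67.5]

errors = residuals(experimental, estimated)
flagged = exceedances(errors)
print("errors:", [round(e, 2) for e in errors])
print("sample indexes with |error| > 5:", flagged)
```

A model with fewer flagged indexes on the test set tracks the experimental values more closely, which is the criterion used in the comparison above.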

Average Performance of Models
To further ensure that the performance of the gradient-boosted machine learning algorithms compared in this study is not due to chance, the same experiment was repeated 100 times for each of the models using the same set of optimal hyperparameters presented in Table 2. The original data was repeatedly split into training-test partitions using a different random seed for each repetition, so that different sets of training and test samples were used each time. The mean and standard deviation of the training and test performances of each of GBR, XGB, LGBM and CBT over the 100 repetitions are presented in Fig. 10 for each statistical evaluation measure. As expected, and as hinted earlier, the average training performance of each model is generally better than the corresponding average test performance across the evaluation metrics, with GBR mostly performing best in this regard, followed by CBT.
Similarly, the training performances show minimal deviation from their respective means compared to the test performances. In terms of test performance, CBT (R² = 0.9506, RMSE = 3.6051, MAE = 2.2462, MAPE = 0.0774) produced the best average performance on all evaluation metrics, whereas GBR (R² = 0.9444, RMSE = 3.8406, MAE = 2.4247, MAPE = 0.0836) ranks lowest in all but MAE and MAPE, where it performs comparably to or slightly better than LGBM (R² = 0.9467, RMSE = 3.7644, MAE = 2.4386, MAPE = 0.0862) and XGB (R² = 0.9468, RMSE = 3.7638, MAE = 2.4371, MAPE = 0.0854) on average. Although XGB marginally outperforms LGBM on the specific result presented in Table 3, the average performances of XGB and LGBM are mostly similar, with XGB performing slightly better over the hundred repetitions. Overall, CBT ranks best on average across all the evaluation measures, followed by XGB, LGBM and then GBR in terms of R² and RMSE.
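The repeated-evaluation protocol described above (fixed hyperparameters, a new random split per repetition, mean and standard deviation of each metric) can be sketched as follows. This is an illustration under stated assumptions: a least-squares line on synthetic data stands in for the tuned boosting models, and the names `fit_line` and `repeated_holdout` are hypothetical. The metric formulas are the standard definitions of R², RMSE, MAE and MAPE.

```python
import math
import random
import statistics


def r2_score(y, y_hat):
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot


def rmse(y, y_hat):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))


def mae(y, y_hat):
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)


def mape(y, y_hat):
    # Expressed as a fraction, matching the values quoted in the text.
    return sum(abs((a - b) / a) for a, b in zip(y, y_hat)) / len(y)


def fit_line(xs, ys):
    """Least-squares line; a stand-in for a tuned boosting model."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    intercept = my - slope * mx
    return lambda x: slope * x + intercept


def repeated_holdout(xs, ys, n_repeats=100, test_fraction=0.10):
    """Mean and standard deviation of test metrics over random splits."""
    metrics = {"R2": r2_score, "RMSE": rmse, "MAE": mae, "MAPE": mape}
    scores = {name: [] for name in metrics}
    n_test = round(len(xs) * test_fraction)
    for seed in range(n_repeats):  # one random seed per repetition
        idx = list(range(len(xs)))
        random.Random(seed).shuffle(idx)
        test_idx, train_idx = idx[:n_test], idx[n_test:]
        model = fit_line([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        y_true = [ys[i] for i in test_idx]
        y_pred = [model(xs[i]) for i in test_idx]
        for name, metric in metrics.items():
            scores[name].append(metric(y_true, y_pred))
    return {name: (statistics.mean(v), statistics.stdev(v)) for name, v in scores.items()}


# Synthetic data for illustration only (not the concrete data set).
rng = random.Random(0)
xs = [rng.uniform(100.0, 500.0) for _ in range(200)]
ys = [0.1 * x + 10.0 + rng.gauss(0.0, 2.0) for x in xs]

summary = repeated_holdout(xs, ys)
for name, (mean, std) in summary.items():
    print(f"{name}: {mean:.4f} ± {std:.4f}")
```

Reporting the standard deviation alongside the mean is what allows the spread of the test scores to be compared with that of the training scores, as in Fig. 10.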

Statistical Analysis of Results
In addition, a statistical analysis of the obtained results in terms of R² and RMSE is presented here. Using the test results from the 100 repetitions of the experiments in the preceding section, the null hypothesis of Friedman's test is rejected given p values of 0.00000 (less than the significance level of 0.05) for both the R² and RMSE results. Friedman's ranking tests for both R² and RMSE rank the gradient-boosting ensemble algorithms identically, in descending order, as follows: CBT > XGB > LGBM > GBR. While this ranking signifies that CBT

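It can be observed that the null hypothesis at a significance level of 0.05 is rejected for all pairwise combinations except LGBM vs XGB for both evaluation metrics. For LGBM vs XGB, the corresponding Table 4 entry (test statistic 0.14617, p value 0.88378) leads to acceptance of the null hypothesis. This shows that, although XGB ranks higher than LGBM, the difference between them is not statistically significant. Conversely, CBT is significantly better than each of the other methods (Table 4).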
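The two statistical steps used in this section, Friedman's test followed by Holm's step-down correction of pairwise p values, can be sketched in plain Python as below. This is an illustration under assumptions rather than the study's implementation: the function names are hypothetical, no tie correction is applied to the Friedman statistic, the closed-form p value is valid only for four models (three degrees of freedom), and all scores and p values in the example are invented.

```python
import math


def average_ranks(row, higher_is_better=True):
    """Rank one repetition's scores (1 = best); ties share the average rank."""
    order = sorted(range(len(row)), key=lambda j: row[j], reverse=higher_is_better)
    ranks = [0.0] * len(row)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and row[order[j + 1]] == row[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i..j
        for pos in range(i, j + 1):
            ranks[order[pos]] = shared
        i = j + 1
    return ranks


def friedman_test(scores, higher_is_better=True):
    """scores[r][m] is the score of model m in repetition r (4 models only)."""
    n, k = len(scores), len(scores[0])
    if k != 4:
        raise ValueError("closed-form p value below assumes exactly 4 models")
    rank_sums = [0.0] * k
    for row in scores:
        for m, rank in enumerate(average_ranks(row, higher_is_better)):
            rank_sums[m] += rank
    chi2 = 12.0 / (n * k * (k + 1)) * sum(s * s for s in rank_sums) - 3.0 * n * (k + 1)
    chi2 = max(chi2, 0.0)
    # Survival function of the chi-square distribution with 3 degrees of freedom.
    p = math.erfc(math.sqrt(chi2 / 2)) + math.sqrt(2 * chi2 / math.pi) * math.exp(-chi2 / 2)
    return chi2, p, [s / n for s in rank_sums]


def holm(p_values, alpha=0.05):
    """Map each comparison to True if its null hypothesis is rejected."""
    ordered = sorted(p_values, key=p_values.get)
    m = len(ordered)
    decisions, rejecting = {}, True
    for i, name in enumerate(ordered):
        # Once one comparison is retained, all larger p values are retained too.
        rejecting = rejecting and p_values[name] <= alpha / (m - i)
        decisions[name] = rejecting
    return decisions


# Invented test R² scores for four models over five repetitions.
toy_scores = [[0.95, 0.94, 0.93, 0.92]] * 5
chi2, p_value, mean_ranks = friedman_test(toy_scores, higher_is_better=True)
print(f"chi2 = {chi2:.3f}, p = {p_value:.5f}, mean ranks = {mean_ranks}")

# Invented pairwise p values, to show the step-down rule only.
toy_pairs = {"A vs B": 0.001, "A vs C": 0.03, "B vs C": 0.04}
print(holm(toy_pairs))
```

For error metrics such as RMSE, `higher_is_better=False` would be passed so that the smallest error receives rank 1.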

Feature Importance
Being able to understand or interpret a decision made by a machine learning model, or the cause of that decision, is integral to improved human understanding of the data, the model and the relationship between them. The quest for this has paved the way for a whole new active area of research known as interpretable machine learning (Murdoch et al., 2019). Accordingly, this section seeks to provide insight into the decisions of each of the machine learning models considered in this study relative to the data set. While earlier works on compressive strength estimation rarely explored this line of research, there has been a notable increase in studies that do so. Some of these have investigated the importance of input features in the prediction of mechanical properties of pervious concrete using extreme gradient boosting and support vector regression as well as Adaboost (Feng et al., 2020; Güçlüer et al., 2021; Mustapha et al., 2022). In this study, the feature importance function, which can be called on each fitted model in the Python implementations of CatBoost, LightGBM, XGBoost and gradient-boosting regressor, is used to obtain the contribution of each input feature to the respective models. Figs. 11, 12, 13 and 14 present a ranking of the input features for CBT, LGBM, XGB and GBR, respectively, in descending order of importance. There is consensus amongst all the models that the three most important features for the estimation of compressive strength are the age (in days) of each concrete mixture, followed by the quantity of cement (in kg/m³) and then water (in kg/m³). This confirms what has been reported in earlier studies, namely that the compressive strength of concrete increases with time (Abdulkareem et al., 2019; Sharmila & Dhinakaran, 2016). At the bottom end of the feature importance ranking, coarse aggregate (in kg/m³) has the least relevance to the predictive performance of XGB and LGBM, whereas the fly ash (in kg/m³) component of each mixture contributes least to the predictive decisions of the GBR and CBT models. These findings further corroborate what has been reported in pertinent works regarding the importance of age, cement and water quantity in the estimation of the compressive strength of concrete (Cakiroglu et al., 2023; Feng et al., 2020; Güçlüer et al., 2021).

Sensitivity Analysis
A sensitivity analysis of all the input variables employed in estimating the compressive strength is presented here to understand how changes in each input variable bring about corresponding changes in the estimated model outputs. It is noteworthy that while the correlation analysis presented in Fig. 2 can be viewed as a form of sensitivity analysis, it only represents the static relationship between each input variable and the output irrespective of the model. Here, the relationship between the input variables and the estimated output from the perspective of each model is presented. This is achieved by showing the marginal effect each feature has on the predicted outcome of the GBR, CBT, LGBM and XGB models with the aid of partial dependence plots (PDP) (Hastie et al., 2009).
The PDP is a global method that considers all instances and gives a statement about the global relationship of a feature with the predicted outcome. In the current study, each gradient-boosting ensemble model has been fitted to estimate the compressive strength of concrete mixtures, and PDP is used to visualize the relationships each model has learnt, as presented in Fig. 15a-d for CBT, GBR, LGBM and XGB, respectively. It is interesting to note that the relationship between each input feature and the estimated output (compressive strength) exhibits a similar trend across the gradient-boosting models. For instance, the relationship between cement quantity and the estimated compressive strength is linear for all models, with increasing cement quantity yielding a corresponding increase in compressive strength. A similar pattern can be observed in relation to the age of the concrete mixtures, although the compressive strength plateaus after about 100 days, indicating no significant increase in the compressive strength of the mixtures after this period. While the range of training compressive strength values used for model building (2.33 to 82.6 MPa in this study) is highly influential on model estimations, representative works such as Abdulkareem et al. (2019) and Sharmila and Dhinakaran (2016) alluded to a slower increase in the compressive strength of concrete mixtures after the first 3 months. On the other hand, an inverse relationship exists between the model estimations and water quantity across the models, with an increase in water quantity from 150 to 200 kg/m³ resulting in a decrease in compressive strength. Interestingly, the estimated compressive strength does not decrease across the models when water quantity increases beyond 200 kg/m³. For other input features, such as fine aggregate and blast furnace slag, the estimated compressive strength slowly and marginally decreases as the former increases, while a marginally decreasing trend can be observed as the latter increases. The intuitive nature of the input-output relationships shown by the models reflects how well the models learn from the given data.
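The partial dependence computation described above can be sketched without any library: for each grid value of the feature of interest, that feature is overwritten in every sample and the model predictions are averaged. In the sketch below, `toy_model` is a hypothetical stand-in for a fitted boosting model; its coefficients and the saturation at 100 days are invented to mimic the reported plateau and are not fitted to the study's data.

```python
def partial_dependence(predict, rows, feature_index, grid):
    """Average prediction when one feature is forced to each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = list(row)             # leave the original sample intact
            modified[feature_index] = value  # overwrite the feature of interest
            total += predict(modified)
        curve.append(total / len(rows))
    return curve


def toy_model(row):
    """Hypothetical model over [cement, water, age]; not fitted to real data."""
    cement, water, age = row
    return 0.1 * cement - 0.15 * water + 0.2 * min(age, 100.0)


# Invented mixtures: cement (kg/m^3), water (kg/m^3), age (days).
rows = [
    [300.0, 180.0, 7.0],
    [400.0, 160.0, 28.0],
    [250.0, 200.0, 90.0],
]
age_grid = [1.0, 28.0, 100.0, 365.0]
age_curve = partial_dependence(toy_model, rows, feature_index=2, grid=age_grid)

for age, strength in zip(age_grid, age_curve):
    print(f"age = {age:5.0f} days -> mean prediction = {strength:.3f}")
```

Because every sample contributes to each point of the curve, a flat segment (here between 100 and 365 days) indicates that the model as a whole no longer responds to that feature in that range.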

Comparison with Previous Works
Given that compressive strength is one of the most important structural material properties in concrete research and design, several studies have developed intelligent approaches for its accurate estimation over the past years. A considerable number of these studies have used either part or the whole of the Lichman (2013) data set used in this research. Hence, it is considered worthwhile to compare the results obtained herein with the best results reported in pertinent studies. Admittedly, ensuring an objective comparison of performance with previous studies can be challenging, given the differences in statistical evaluation metrics, training-test split ratios (e.g., some may use a 90:10 ratio, while others may use 70:30), sample size (e.g., some may use a subset of the data set, while others use the complete 1030 samples) and the general experimental setup. Notwithstanding, the comprehensive nature of the experiments carried out in this study naturally addresses some of these concerns. Table 5 presents details of the representative studies grouped by experimental design and algorithm. The average performance category contains the best results from studies in which experiments were conducted using k-fold cross validation and the average performance reported, whereas the best results from studies that evaluate their models based on training-test cross validation are grouped under the cross validation category and compared with the results presented in Table 3. Table 5 presents the comparison of the obtained results with the best from previous studies. A general observation from the table is the extensive use of ensemble models and the paucity of gradient-boosted models in compressive strength estimation of quaternary blend concrete. In terms of average performance, the best performance found in relevant studies was reported by Feng et al. (2020), where the proposed Adaboost model yielded R² = 0.952, RMSE = 4.856 MPa, MAE = 3.205 MPa and MAPE = 0.114. Compared to this, the best average performance obtained in this study, that of the CBT model, is better on all the evaluation metrics (25.76% RMSE, 29.92% MAE and 32.46% MAPE improvements, respectively) except R², where the reported score of 0.952 is marginally better than the average R² of 0.951 obtained over 100 repetitions (about 0.1% improvement). It should also be noted that the average performances of GBR, XGB and LGBM in terms of RMSE, MAE and MAPE are also better than those reported by Feng et al. (2020). Likewise, the best cross validation performance found in the literature is R² = 0.982, RMSE = 2.20 MPa, MAE = 1.64 MPa and MAPE = 0.0678, reported by Feng et al. (2020). In comparison, the best results obtained in this study, with R², RMSE, MAE and MAPE values of 0.984, 2.071 MPa, 1.597 MPa and 0.063, are better, with relative improvements of 0.2%, 5.86% and 2.62% in R², RMSE and MAE, respectively, and a MAPE reduction of 0.48 percentage points (0.0678 versus 0.063).
The impressive performance of the gradient-boosting models presented in this study generally reflects the robustness of each model to different evaluation approaches for compressive strength estimation of quaternary blend concrete. However, it should be noted that the performance reported in this study is limited to 1030 concrete mixes with ages ranging from 1 to 365 days.
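The percentage improvements quoted in this comparison are consistent with a relative reduction in error with respect to the reference study. The formula below is inferred from the reported numbers rather than stated in the text, so it should be read as an assumption; the symbols m_ref (literature value) and m_new (value from this study) are introduced here for illustration. The two worked cases use the average-performance RMSE and MAE values given above.

```latex
\[
\Delta_{\%} = \frac{m_{\mathrm{ref}} - m_{\mathrm{new}}}{m_{\mathrm{ref}}} \times 100
\]
\[
\Delta_{\%}^{\mathrm{RMSE}} = \frac{4.856 - 3.6051}{4.856} \times 100 \approx 25.76,
\qquad
\Delta_{\%}^{\mathrm{MAE}} = \frac{3.205 - 2.2462}{3.205} \times 100 \approx 29.92
\]
```

For R², where higher is better, the sign of the numerator is reversed so that a positive value still denotes an improvement.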

Conclusion
A comparative analysis of the prediction of compressive strength of quaternary blend concrete with gradient-boosted ensembles is presented in this study. Four popular gradient-boosting implementations, namely, gradient-boosting regressor (GBR), light gradient-boosting model (LGBM), extreme gradient boosting (XGB) and CatBoost (CBT), were used to build models for compressive strength estimation, and results based on an out-of-sample test set as well as average cross validation are presented. Four popular evaluation metrics were used for performance evaluation, with results showing that CBT outperformed the other methods across all the metrics, with R², RMSE, MAE and MAPE values of 0.9838, 2.0709, 1.5966 and 0.0629, respectively. An analysis of the most important features for model performance also shows that the age, quantity of cement and quantity of water in the concrete mixture have the highest contributions to the compressive strength estimation of each model. In addition, a sensitivity analysis of the model predictions with varying values of input features confirms the importance of these features, notably showing no significant increase in compressive strength estimations after the first 100 days. Moreover, a comparison of results with findings from previous studies also shows the superiority of CBT and the other gradient-boosting models in estimating compressive strength. CBT outperforms the other models not only on a single evaluation with an out-of-sample test set but also in terms of average performance. It is hoped that these findings will further increase awareness of the predictive capabilities of CBT and thus increase its use alongside the growing set of available computational tools.
This study, though comprehensive, is not without limitations. Although the data set is a fairly large and representative one for concrete properties estimation, we acknowledge that machine learning models are only as good as their training data. Hence, the findings reported are based on the range of values reported in Sect. 3.1. Besides, the data set is not representative of all types of concrete mixtures, such as rubberized recycled aggregate concretes and heat-treated concretes (Cakiroglu et al., 2023; Chen et al., 2021). These are viable areas for future investigation.
In addition, the relentless quest for improved accuracy in the estimation of concrete properties, and specifically compressive strength, has led to innovative learning methods, such as advanced deep learning algorithms with specialised loss functions (Hoang, 2023), metaheuristic-optimized DNN (Ranjbar et al., 2022) and ensembles of ensemble models (Lee et al., 2023). While these methods have potential shortcomings that relate to computational cost and overfitting, future works will explore feature selection, using only the top-ranking features that contribute most to each model's performance, as shown in the feature importance and sensitivity analyses.

Fig. 1 Boxplots of distribution of compressive strength and input features of data sets

3.1) is data normalization. This is a common pre-processing stage in most machine learning pipelines to avoid numerical overflow while keeping the input variables within a uniform range. Due care has been taken to split the data into training and test partitions before data normalization to avoid data leakage (O'Neil & Schutt, 2013). All input variables were normalized, such that the values are within the range of -1 and 1.
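The order of operations described above (split first, then normalize) can be illustrated with a short sketch. The text states only that inputs are scaled to the range -1 to 1; the use of min-max scaling here is an assumption, and the function names and sample ages are hypothetical. The key point is that the minimum and maximum are taken from the training partition alone and then reused for the test partition.

```python
def fit_min_max(train_column):
    """Learn scaling bounds from the training partition only."""
    return min(train_column), max(train_column)


def scale_to_unit_range(column, lo, hi):
    """Map values linearly so that lo -> -1 and hi -> +1."""
    if hi == lo:
        return [0.0 for _ in column]  # constant feature: no spread to scale
    return [2.0 * (x - lo) / (hi - lo) - 1.0 for x in column]


# Hypothetical ages in days, already split into training and test partitions.
train_age = [1.0, 7.0, 28.0, 90.0, 365.0]
test_age = [3.0, 180.0]

lo, hi = fit_min_max(train_age)                       # bounds from training data only
train_scaled = scale_to_unit_range(train_age, lo, hi)
test_scaled = scale_to_unit_range(test_age, lo, hi)   # reuse the training bounds

print([round(v, 3) for v in train_scaled])
print([round(v, 3) for v in test_scaled])
```

With this scheme, a test value outside the training range would fall slightly outside -1 to 1, which is the price of keeping test information out of the fitted bounds.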

Fig. 4 Training and test performance of ML models with different training-test splits

Fig. 5 Comparison of experimental and estimated compressive strength for the training and test phases of each model

Fig. 6 Superimposed line plots of experimental and estimated compressive strength for a training and b test phases and corresponding error plots over the c training and d test data for CatBoost

Fig. 7 Superimposed line plots of experimental and estimated compressive strength for a training and b test phases and corresponding error plots over the c training and d test data for LightGBM

Fig. 8 Superimposed line plots of experimental and estimated compressive strength for a training and b test phases and corresponding error plots over the c training and d test data for XGBoost

Fig. 10 Mean (± Standard Deviation) performance of gradient-boosted models over 100 repetitions of experiments

Fig. 15 Partial dependence plots for the a CBT, b GBR, c LGBM and d XGB compressive strength estimation models

Table 1
Descriptive statistics of variables used in modelling

Table 2
Optimal hyperparameters for gradient-boosted models

Table 3
Training and testing performance of the models (↑ higher is better, ↓ lower is better)

Table 3 presents, for each model, the respective training and test performances in terms of RMSE, MAE and MAPE. It is worth noting that, unlike R², these statistical evaluation measures quantify the errors between the experimental values and model estimations, as described in Sect. 4.3. Based on these metrics, the respective training and test performances of GBR (RMSE = 1.1826 MPa and 2.6642 MPa; MAE = 0.4259 MPa and 1.9013 MPa; MAPE = 0.0148 and 0.0717), XGB (RMSE = 1.6016 MPa and 2.4972 MPa; MAE = 0.9246 MPa and 1.9032 MPa; MAPE = 0.033 and 0.0744), LGBM (RMSE = 1.7578 MPa and 2.5963 MPa; MAE = 1.0599 MPa and 2.0067 MPa; MAPE = 0.0392 and 0.0788) and CBT (RMSE = 1.4045 MPa and 2.0709 MPa; MAE = 0.7218 MPa and 1.5966 MPa; MAPE = 0.0256 and 0.0629) are very impressive given the respective generalization gaps of 1.4816 MPa, 0.8956 MPa, 0.8385 MPa and 0.6664 MPa in terms of RMSE, 1.4754 MPa, 0.9786 MPa, 0.9468 MPa and 0.8748 MPa in terms of MAE, as well as 0.0569, 0.0414, 0.0396 and 0.0373 in terms of MAPE for the respective models.
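The generalization gaps above, and the 17% to 22% (RMSE), 16% to 20.4% (MAE) and 12% to 20% (MAPE) improvements of CBT quoted in the discussion of Table 3, can be reproduced from the reported scores. The sketch below assumes that the gap is the test score minus the training score and that improvement is the relative reduction in test error; both definitions are inferred from the numbers, and the function names are illustrative.

```python
# (training, test) scores as quoted in the text for Table 3.
TABLE_3 = {
    "GBR": {"RMSE": (1.1826, 2.6642), "MAE": (0.4259, 1.9013), "MAPE": (0.0148, 0.0717)},
    "XGB": {"RMSE": (1.6016, 2.4972), "MAE": (0.9246, 1.9032), "MAPE": (0.033, 0.0744)},
    "LGBM": {"RMSE": (1.7578, 2.5963), "MAE": (1.0599, 2.0067), "MAPE": (0.0392, 0.0788)},
    "CBT": {"RMSE": (1.4045, 2.0709), "MAE": (0.7218, 1.5966), "MAPE": (0.0256, 0.0629)},
}
METRICS = ("RMSE", "MAE", "MAPE")


def generalization_gap(train, test):
    """Difference between test and training error (assumed definition)."""
    return test - train


def relative_improvement(reference, new):
    """Percentage reduction of an error metric relative to a reference."""
    return 100.0 * (reference - new) / reference


gaps = {
    model: {m: round(generalization_gap(*TABLE_3[model][m]), 4) for m in METRICS}
    for model in TABLE_3
}
improvements = {
    m: {
        model: round(relative_improvement(TABLE_3[model][m][1], TABLE_3["CBT"][m][1]), 2)
        for model in ("GBR", "XGB", "LGBM")
    }
    for m in METRICS
}

for model, values in gaps.items():
    print(model, values)
for metric, values in improvements.items():
    print(f"CBT improvement in test {metric} (%):", values)
```

Computing both quantities from the same table makes it easy to see that CBT has the smallest gap on every error metric as well as the lowest test error.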

Table 3 .
It can be observed that the null hypothesis at significance level of 0.05 is rejected

Table 4
Results of pairwise post-hoc analysis using Holm's test

Table 5
Comparison with previous studies