3.1 MAML Model Algorithm
Training sample size has long been an unavoidable issue when neural networks are used for prediction: a network trained on a database with a small sample size rarely reaches the desired accuracy. In the big data era, however, few prediction tasks are so unique that no similar task can supply prior knowledge. Taking the bond–slip model prediction of SRRC reinforced concrete as an example, only a few tests have been performed, whereas the bond–slip performance between other types of rebar and concrete has been studied extensively. The functional form of the bond–slip curve and the range of its parameters do vary with the influencing factors. Nevertheless, some properties are common to all cases. For example, provided that neither rebar yield nor rib failure occurs, the higher the tensile strength of the concrete, the larger the maximum bond stress. Likewise, stirrups improve the ductility of the concrete and increase the residual bond stress. These common characteristics provide a good knowledge basis for neural network learning. Therefore, when the sample size of the target prediction task is small, it is recommended to first learn prior knowledge from a large dataset with similar task objectives and then transfer it to the target task. In the field of deep learning, this problem is called few-shot learning, and the MAML algorithm (Finn et al., 2017) was proposed to solve it.
The traditional DNN model is shown in Fig. 5, which can be described as follows and can be realized by Box 1.
For a certain distribution \(P(X,Y,\theta )\), X is the input sample, and Y is the output result. The dimensions of X are m × nx, where m is the number of samples and nx is the number of features of each sample. The dimensions of Y are m × ny, where ny is the number of features of each output, and \(\theta\) denotes the parameters of the neural network (weights, biases and other learnable parameters). For each different task, \(\theta\) starts from random initialization, and the predicted value \(\hat{y}\left( {X,\theta } \right)\) of each iteration step is obtained by forward propagation. Generally, the loss function of the regression problem is taken as \(L = \frac{1}{m}\sum\nolimits_{i = 1}^{m} {(y_{i} - \hat{y}_{i} \left( {X,\theta } \right))^{2} }\), which measures the error between the predicted value \(\hat{y}_{i}\) and the labeled value \(y_{i}\) of the i-th sample. Finally, the parameters are updated by calculating \(\theta^{\prime} = \theta - \alpha \nabla_{\theta } L\left( {X,Y,\theta } \right)\) until the most suitable \(\theta\) is found to describe the relationship between X and Y, where \(\nabla_{\theta }\) represents the gradient with respect to \(\theta\), and \(\alpha\) represents the learning rate. However, in many engineering problems, the loss minimized by gradient descent is not a smooth convex function. If the number of samples is small and the distributions of the new and old tasks differ, the network may struggle to escape saddle points or may converge prematurely to a local optimum, which lowers the prediction accuracy.
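The update rule \(\theta^{\prime} = \theta - \alpha \nabla_{\theta } L\) described above can be illustrated with a minimal sketch. It is not the network of Box 1: it assumes a hypothetical one-feature linear model \(\hat{y} = wx + b\), made-up data, and a fixed starting point instead of random initialization so that the run is reproducible.

```python
def mse_loss(w, b, xs, ys):
    """Mean squared error L of the linear model y_hat = w * x + b."""
    return sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def mse_grad(w, b, xs, ys):
    """Gradient of L with respect to the parameters theta = (w, b)."""
    m = len(xs)
    dw = sum(-2.0 * x * (y - (w * x + b)) for x, y in zip(xs, ys)) / m
    db = sum(-2.0 * (y - (w * x + b)) for x, y in zip(xs, ys)) / m
    return dw, db


def train(xs, ys, alpha=0.05, steps=2000):
    """Repeat theta' = theta - alpha * grad L for a fixed number of steps."""
    w, b = 0.0, 0.0  # assumption: fixed start (the text uses random initialization)
    for _ in range(steps):
        dw, db = mse_grad(w, b, xs, ys)
        w, b = w - alpha * dw, b - alpha * db
    return w, b


# Hypothetical data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w_fit, b_fit = train(xs, ys)
```

Because this toy loss is a convex quadratic, gradient descent recovers the generating parameters; the difficulties with saddle points and local optima discussed above arise only for the non-convex losses of deeper networks.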
In traditional DNN training, the training objects are the samples, namely, X itself. The goal of the MAML algorithm is to ‘learn how to learn’, and its training objects are tasks, that is, \(\theta\). In other words, the DNN framework trains on each sample point in BondSlipNet, whereas the MAML framework trains first on the tasks divided from BondSlipNet and then on the samples of the specific prediction task. Therefore, the MAML algorithm can be divided into two parts. The first part is meta-learning, which trains on different tasks drawn from large databases and finally obtains \(\theta^{ * }\) with good generalization performance. In the second part, called the fine-tuning process, \(\theta^{ * }\) is applied to a small sample database, and the final model is obtained after a small number of gradient updates.
The MAML algorithm is shown in Fig. 6, which can be described as follows and can be realized as Box 2.
The global task is divided into B batches, and one batch is extracted for each update. Assume that a batch contains \(m_{B}\) tasks, that the global model parameter is initialized to \(\theta\), and that the training set R and test set R' are extracted from the \(m_{t}\)-th task. First, the model is trained on the training set R, and the model parameter is updated by the gradient. Then, the loss is calculated on the R' set, and the model parameter is updated. The loss function of the \(m_{t}\)-th task is denoted \(l_{{m_{t} }}\), and the parameters of the \(m_{B}\) training tasks in this batch are initialized to \(\theta\). After i updates using the iterative Eq. (1), the parameters of the \(m_{t}\)-th task become \(\theta_{{m_{t} }}^{i}\):
$$\theta_{{m_{t} }}^{i} = \theta_{{m_{t} }}^{i - 1} - \alpha \nabla_{{\theta_{{m_{t} }}^{i - 1} }} l_{{m_{t} }} (\theta_{{m_{t} }}^{i - 1} ),$$
(1)
\(\alpha\) is the learning rate for each task.
After the training tasks in each batch are established, the loss function of the global model is set to \(L\left( \theta \right)\), then:
$$L\left( \theta \right) = \sum\limits_{{m_{t} = 1}}^{{m_{B} }} {l_{{m_{t} }} \left( {\theta_{{m_{t} }}^{i} } \right)} = \sum\limits_{{m_{t} = 1}}^{{m_{B} }} {l_{{m_{t} }} \left( {\theta_{{m_{t} }}^{i - 1} - \alpha \nabla_{{\theta_{{m_{t} }}^{i - 1} }} l_{{m_{t} }} (\theta_{{m_{t} }}^{i - 1} )} \right)} .$$
(2)
\(L\left( \theta \right)\) is a function of \(\theta\), and \(m_{B}\) represents the number of tasks processed in each batch. The global parameter \(\theta\) can be updated to \(\theta^{\prime}\) using Eq. (3), where \(\beta\) is the learning rate of the global update:
$$\theta^{\prime} = \theta - \beta \nabla_{\theta } L\left( \theta \right) = \theta - \beta \nabla_{\theta } \sum\limits_{{m_{t} = 1}}^{{m_{B} }} {l_{{m_{t} }} \left( {\theta_{{m_{t} }}^{i} } \right)} = \theta - \beta \sum\limits_{{m_{t} = 1}}^{{m_{B} }} {\nabla_{\theta } l_{{m_{t} }} \left( {\theta_{{m_{t} }}^{i} } \right)} .$$
(3)
Note that both \(\theta \left( {w_{j} } \right),j = 1,2,3,...,k\) and \(\theta_{{m_{t} }}^{i} \left( {w_{{l \cdot m_{t} }}^{i} } \right),l = 1,2,3, \ldots ,k\) contain k parameters, so each summand in the last expression of Eq. (3) can be written as Eq. (4):
$$\nabla_{\theta } l_{{m_{t} }} \left( {\theta_{{m_{t} }}^{i} } \right) = \left[ {\begin{array}{*{20}c} {\partial l_{{m_{t} }} \left( {\theta_{{m_{t} }}^{i} } \right)/\partial w_{1} } \\ \vdots \\ {\partial l_{{m_{t} }} \left( {\theta_{{m_{t} }}^{i} } \right)/\partial w_{j} } \\ \vdots \\ \end{array} } \right] = \left[ {\begin{array}{*{20}l} {\sum\limits_{l = 1}^{k} {\frac{{\partial l_{{m_{t} }} \left( {\theta_{{m_{t} }}^{i} } \right)}}{{\partial w_{{l \cdot m_{t} }}^{i} }}\frac{{\partial w_{{l \cdot m_{t} }}^{i} }}{{\partial w_{1} }}} } \\ \vdots \\ {\sum\limits_{l = 1}^{k} {\frac{{\partial l_{{m_{t} }} \left( {\theta_{{m_{t} }}^{i} } \right)}}{{\partial w_{{l \cdot m_{t} }}^{i} }}\frac{{\partial w_{{l \cdot m_{t} }}^{i} }}{{\partial w_{j} }}} } \\ \vdots \\ \end{array} } \right],$$
(4)
Here, \(w\) denotes a component of the array \(\theta\), that is, a single weight or bias value. Similarly, Eq. (1) can be rewritten as Eq. (5):
$$\begin{array}{*{20}l} {w_{{l \cdot m_{t} }}^{1} = w_{l} - \alpha \frac{{\partial \left( {l_{{m_{t} }} (\theta )} \right)}}{{\partial w_{l} }}} \\ \vdots \\ {w_{{l \cdot m_{t} }}^{i} = w_{{l \cdot m_{t} }}^{i - 1} - \alpha \frac{{\partial \left( {l_{{m_{t} }} (\theta_{{m_{t} }}^{i - 1} )} \right)}}{{\partial w_{{l \cdot m_{t} }}^{i - 1} }}} \\ \vdots \\ \end{array} .$$
(5)
Combining Eqs. (4) and (5) gives \(\frac{{\partial w_{{l \cdot m_{t} }}^{i} }}{{\partial w_{j} }} = \frac{{\partial w_{l} }}{{\partial w_{j} }} - \alpha \left( {\frac{{\partial^{2} l_{{m_{t} }} (\theta )}}{{\partial w_{l} \partial w_{j} }} + \sum\limits_{s = 1}^{i - 1} {\frac{{\partial^{2} l_{{m_{t} }} (\theta_{{m_{t} }}^{s} )}}{{\partial w_{{l \cdot m_{t} }}^{s} \partial w_{j} }}} } \right)\). Then, each component \(w^{\prime}_{j}\) of \(\theta^{\prime}\left( {w^{\prime}_{j} } \right),j = 1,2,3,...,k\) in Eq. (3) can be written as Eq. (6):
$$w^{\prime}_{j} = w_{j} - \beta \sum\limits_{{m_{t} = 1}}^{{m_{B} }} {\sum\limits_{l = 1}^{k} {\frac{{\partial l_{{m_{t} }} \left( {\theta_{{m_{t} }}^{i} } \right)}}{{\partial w_{{l \cdot m_{t} }}^{i} }}\left( {\frac{{\partial w_{l} }}{{\partial w_{j} }} - \alpha \left( {\frac{{\partial^{2} l_{{m_{t} }} (\theta )}}{{\partial w_{l} \partial w_{j} }} + \sum\limits_{s = 1}^{i - 1} {\frac{{\partial^{2} l_{{m_{t} }} (\theta_{{m_{t} }}^{s} )}}{{\partial w_{{l \cdot m_{t} }}^{s} \partial w_{j} }}} } \right)} \right)} } .$$
(6)
It should be noted that the MAML makes two simplifications relative to the general form of Eq. (6): only one gradient update is conducted in meta-learning, and the role of the Hessian is ignored. Mathematical derivation and computational experiments indicate that, for implicit learning tasks, these two simplifications may reduce the accuracy of the trained network; the specific explanations are given in Sect. 3.3.2 and Sect. 4.3.
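The effect of dropping the Hessian term can be seen in a minimal sketch of Eqs. (1)–(6) for a single scalar parameter. This is an illustration under stated assumptions, not the implementation of Box 2: the model \(\hat{y} = wx\), the two toy tasks, and all function names are hypothetical. With one parameter, the bracket in Eq. (6) reduces to a product of factors \(1 - \alpha\, l^{\prime\prime}\), one per inner update, and the result can be checked against a finite-difference derivative of Eq. (2).

```python
def loss(w, data):
    """Task loss l(w): mean squared error of the scalar model y_hat = w * x."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


def grad(w, data):
    """First derivative dl/dw."""
    return sum(2.0 * x * (w * x - y) for x, y in data) / len(data)


def hess(w, data):
    """Second derivative d2l/dw2 (a 1 x 1 Hessian, constant for this loss)."""
    return sum(2.0 * x * x for x, _ in data) / len(data)


def adapt(w, train_set, alpha, steps):
    """Eq. (1): inner updates on R; also returns d(w_i)/d(w) by the chain rule."""
    jac = 1.0
    for _ in range(steps):
        jac *= 1.0 - alpha * hess(w, train_set)  # derivative of one update step
        w = w - alpha * grad(w, train_set)
    return w, jac


def meta_loss(w, tasks, alpha, steps):
    """Eq. (2): sum over tasks of the loss on R' at the adapted parameter."""
    return sum(loss(adapt(w, tr, alpha, steps)[0], te) for tr, te in tasks)


def meta_grad(w, tasks, alpha, steps, first_order=False):
    """Gradient of Eq. (2); first_order=True drops the Hessian factor."""
    total = 0.0
    for tr, te in tasks:
        w_i, jac = adapt(w, tr, alpha, steps)
        total += grad(w_i, te) * (1.0 if first_order else jac)
    return total


# Two hypothetical tasks (slopes 1 and 3), each given as a pair (R, R').
tasks = [
    ([(1.0, 1.0), (2.0, 2.0)], [(3.0, 3.0)]),
    ([(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0)]),
]
alpha, beta, steps, w0 = 0.05, 0.01, 2, 0.0

g_full = meta_grad(w0, tasks, alpha, steps)
g_first = meta_grad(w0, tasks, alpha, steps, first_order=True)

# Central finite difference of Eq. (2) as an independent check of g_full.
h = 1e-4
g_fd = (meta_loss(w0 + h, tasks, alpha, steps)
        - meta_loss(w0 - h, tasks, alpha, steps)) / (2 * h)

w_new = w0 - beta * g_full  # Eq. (3): one global update
```

In this toy setting the full gradient agrees with the finite-difference value, whereas the first-order gradient differs from it by the omitted factor, so the two variants take global steps of different lengths.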
At this point, the training of one batch in the MAML algorithm is complete. The process is then repeated for the remaining batches until all batches have been input into the network, which completes one epoch. It should be noted that the task sets are selected randomly in each epoch, which has an effect similar to cross-validation. After several epochs, the meta-learning process is complete. Starting from the obtained optimal initialization parameters, traditional DNN training is performed for the specific task using a database with a small sample size. Then, the parameters of the shallow layers in the DNN are frozen, and only the parameters of the subsequent layers are trained, which completes the fine-tuning process. The shallow layers are the first several layers, and the subsequent layers are the last several layers. In this study, the shallow-layer parameters are those between the Input layer, Hidden layer 1 and Hidden layer 2, and the subsequent-layer parameters are those between Hidden layer 2, Hidden layer 3 and the Output layer.
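The freezing step of the fine-tuning process can be sketched as follows. The sketch is deliberately reduced and makes several assumptions that are not in the paper: each layer is collapsed to a single scalar weight with a tanh activation, the meta-learned initialization `theta_star` and the small dataset are made-up numbers, and only the stage in which the shallow parameters are already frozen is shown.

```python
import math

# Parameter groups following the layer split described above.
SHALLOW = ("in_h1", "h1_h2")        # Input -> Hidden 1 -> Hidden 2 (frozen)
SUBSEQUENT = ("h2_h3", "h3_out")    # Hidden 2 -> Hidden 3 -> Output (trained)


def forward(p, x):
    """Scalar stand-in for the DNN: three tanh hidden units and a linear output."""
    h1 = math.tanh(p["in_h1"] * x)
    h2 = math.tanh(p["h1_h2"] * h1)
    h3 = math.tanh(p["h2_h3"] * h2)
    return p["h3_out"] * h3, (h1, h2, h3)


def loss(p, data):
    """Mean squared error over the small-sample dataset."""
    return sum((forward(p, x)[0] - y) ** 2 for x, y in data) / len(data)


def grads(p, data):
    """Backpropagation for the scalar chain above."""
    g = {name: 0.0 for name in p}
    for x, y in data:
        y_hat, (h1, h2, h3) = forward(p, x)
        d_out = 2.0 * (y_hat - y) / len(data)           # dL/dy_hat
        g["h3_out"] += d_out * h3
        d_z3 = d_out * p["h3_out"] * (1.0 - h3 ** 2)    # through tanh of Hidden 3
        g["h2_h3"] += d_z3 * h2
        d_z2 = d_z3 * p["h2_h3"] * (1.0 - h2 ** 2)      # through tanh of Hidden 2
        g["h1_h2"] += d_z2 * h1
        d_z1 = d_z2 * p["h1_h2"] * (1.0 - h1 ** 2)      # through tanh of Hidden 1
        g["in_h1"] += d_z1 * x
    return g


def fine_tune(p, data, frozen, alpha=0.05, steps=200):
    """Gradient descent that skips every parameter listed in `frozen`."""
    p = dict(p)  # copy so the meta-learned initialization is left untouched
    for _ in range(steps):
        g = grads(p, data)
        for name in p:
            if name not in frozen:
                p[name] -= alpha * g[name]
    return p


# Hypothetical meta-learned initialization and small-sample dataset.
theta_star = {"in_h1": 1.0, "h1_h2": 1.0, "h2_h3": 1.0, "h3_out": 1.0}
small_data = [(0.5, 0.20), (1.0, 0.35), (2.0, 0.45)]
theta_ft = fine_tune(theta_star, small_data, frozen=SHALLOW)
```

Freezing is implemented here simply by skipping the update for the named parameters; the gradients of the frozen layers are still computed but never applied.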
3.2 Modification of the MAML–MMN
The original MAML algorithm was demonstrated on a regression problem with the sine function, where it obtained good results. However, for the vast majority of engineering problems, an explicit relationship between the input and output of the data is often unavailable. Training the network then aims not only to regress several parameters but also to determine the number of parameters and even the function itself, or the implicit relationship between the data. At the same time, compared with image classification problems, many regression problems in engineering have larger input and output feature dimensions and smaller sample sizes. Therefore, many problems arise when the MAML algorithm is used directly to predict the bond–slip model of reinforced concrete (more details are provided in Sect. 4).
In this paper, the MMN algorithm is created according to the particularity of the implicit regression problem of the bond–slip model, as shown in Fig. 7.
Compared with the MAML algorithm, the following improvements are incorporated in the MMN algorithm.
(1) For the expression of the loss function, the single-layer perceptron is used to modify the Mahalanobis distance loss, replacing the mean square error (MSE) loss.
(2) Multiple gradient updating is considered in meta-learning.
(3) The overall framework is changed into a multitask learning framework. The output task is divided into two tasks, namely, the prediction of the slip stage curve and the prediction of the failure stage curve, which are learned jointly (Sun et al., 2020). The multitask learning framework reduces the dimension of each learning task. Furthermore, during joint training, the shared parameters act as constraints that link the minima of the different tasks. This parameter sharing mechanism improves the generalization of the network and reduces the risk of overfitting. Dropout (Srivastava et al., 2014) and L2 regularization (Rahaman et al., 2018) were added to improve the generalization of the network, and gradient clipping (Zhang et al., 2020a, 2020b) was added to avoid gradient explosion. Batch normalization (BN) is replaced with filter response normalization (FRN), which also helps to prevent the model from overfitting (Singh & Krishnan, 2020).
The MMN algorithm can be realized using Box 3. Its improvements are analyzed mathematically in Sect. 3.3. Improvement (3) has already been explained in many studies. Therefore, the next section focuses on improvements (1) and (2).
3.3 Mathematical Explanation of the MMN
3.3.1 Modified Mahalanobis Distance Loss
Let the output of the MMN be \(\hat{y}_{a \times b}\), where a represents the sample size and b represents the output feature dimension. Assuming that the number of predicted feature points in this task is m, then b = 2 × m. \(\hat{y}_{a \times b}\) is rewritten as the matrix \(\hat{y}_{{2 \times \frac{ab}{2}}}\) with 2 rows and ab/2 columns. The first row of \(\hat{y}_{{2 \times \frac{ab}{2}}}\) is then the abscissa (slip \(S_{i}\)) of the predicted points, and the second row is the ordinate (bond stress \(\tau_{i}\)). Similarly, the labels in the dataset are set as \(y_{{2 \times \frac{ab}{2}}}\).
The Mahalanobis distance loss between the output and the label item can be calculated by Eq. (7):
$$l_{{m_{t} }} \left( {y_{{2 \times \frac{ab}{2}}} ,\hat{y}_{{2 \times \frac{ab}{2}}} } \right) = \overline{tr} \left[ {\left( {\hat{y}_{{2 \times \frac{ab}{2}}} - y_{{2 \times \frac{ab}{2}}} } \right)^{T} C_{{}}^{{ - 1}} \left( {\hat{y}_{{2 \times \frac{ab}{2}}} - y_{{2 \times \frac{ab}{2}}} } \right)} \right],$$
(7)
\(\overline{tr}\) is defined as an operator that takes the mean of the diagonal elements, and \(C^{ - 1}\) is the inverse of the covariance matrix of the matrix \(Y_{2 \times ab} = \left[ {\hat{y}_{{2 \times \frac{ab}{2}}} ,y_{{2 \times \frac{ab}{2}}} } \right]\). Equation (7) shows that the Mahalanobis distance is equivalent to the following process. First, principal component analysis (PCA) is performed on the data points in the sample space. Then, the sample space is rotated according to the principal components so that the dimensions are independent of each other. Finally, the dimensions are standardized, and the distance between the sample points is computed.
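The three steps above can be made explicit with the eigendecomposition of the covariance matrix. The following is a standard identity rather than a result of this paper; the symbols \(U\), \(\Lambda\), \(\lambda_{k}\), \(\Delta\) and \(z\) are introduced here only for illustration, with \(\Delta\) denoting one column of \(\hat{y}_{{2 \times \frac{ab}{2}}} - y_{{2 \times \frac{ab}{2}}}\):

$$C = U\Lambda U^{T} ,\quad \Delta^{T} C^{ - 1} \Delta = \Delta^{T} U\Lambda^{ - 1} U^{T} \Delta = \left\| {\Lambda^{ - 1/2} U^{T} \Delta } \right\|_{2}^{2} = \sum\limits_{k = 1}^{2} {\frac{{z_{k}^{2} }}{{\lambda_{k} }}} ,\quad z = U^{T} \Delta .$$

Here \(U^{T}\) rotates the deviation onto the principal axes, \(\Lambda^{ - 1/2}\) standardizes each axis by its standard deviation \(\sqrt {\lambda_{k} }\), and the remaining squared norm is an ordinary Euclidean distance in the rotated and standardized space.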
The Mahalanobis distance degenerates to the Euclidean distance only when the covariance matrix is an identity matrix, that is, when each dimension is independent and identically distributed. Thus, the Euclidean distance treats the dimensions of the sample ‘fairly’ and ignores the different distribution characteristics of the various dimensions. For the implicit regression problem of the bond–slip model, the mapping between the input features and the output coordinate points has multiple expressions or can be regarded as an implicit relationship. The horizontal and vertical coordinates of an output coordinate point have different practical significance and exhibit different distribution characteristics. For example, assume that the set of data points shown in Fig. 8 satisfies the distribution P (x, y) of the predicted bond–slip model. The coordinate of point A is (2.5 mm, 5.8 MPa), the initial output for that point is A1 (5.5 mm, 6.8 MPa), and the second output is A2 (3.5 mm, 2.8 MPa); both outputs have the same Euclidean distance from point A. However, it is obvious that A2 is more likely to meet the distribution P (x, y).
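The comparison of A1 and A2 can be reproduced numerically with a minimal sketch of Eq. (7). The data of Fig. 8 are not available here, so the covariance matrix `cov_assumed` is a hypothetical choice in which the bond stress scatters more than the slip; the sample covariance with n − 1 normalization and all function names are likewise assumptions.

```python
import math


def covariance_2d(points):
    """Sample covariance (n - 1 normalization, an assumption) of (S, tau) points."""
    n = len(points)
    mean_s = sum(s for s, _ in points) / n
    mean_t = sum(t for _, t in points) / n
    c_ss = sum((s - mean_s) ** 2 for s, _ in points) / (n - 1)
    c_tt = sum((t - mean_t) ** 2 for _, t in points) / (n - 1)
    c_st = sum((s - mean_s) * (t - mean_t) for s, t in points) / (n - 1)
    return [[c_ss, c_st], [c_st, c_tt]]


def inverse_2x2(c):
    """Closed-form inverse of a 2 x 2 matrix."""
    (a, b), (c21, d) = c
    det = a * d - b * c21
    if det == 0.0:
        raise ValueError("covariance matrix is singular")
    return [[d / det, -b / det], [-c21 / det, a / det]]


def mahalanobis_sq(p, q, c_inv):
    """Squared Mahalanobis distance between two (S, tau) points."""
    ds, dt = p[0] - q[0], p[1] - q[1]
    return (c_inv[0][0] * ds * ds
            + (c_inv[0][1] + c_inv[1][0]) * ds * dt
            + c_inv[1][1] * dt * dt)


def mahalanobis_loss(pred, label, c_inv=None):
    """Eq. (7): mean of the diagonal of (pred - label)^T C^-1 (pred - label)."""
    if c_inv is None:
        # C is estimated from predictions and labels stacked together.
        c_inv = inverse_2x2(covariance_2d(pred + label))
    return sum(mahalanobis_sq(p, q, c_inv) for p, q in zip(pred, label)) / len(pred)


# Points from the text: (slip in mm, bond stress in MPa).
A, A1, A2 = (2.5, 5.8), (5.5, 6.8), (3.5, 2.8)

# Hypothetical covariance: variance 1.0 mm^2 in slip, 4.0 MPa^2 in bond stress.
cov_assumed = [[1.0, 0.5], [0.5, 4.0]]
c_inv = inverse_2x2(cov_assumed)

d_euclid_1 = math.hypot(A1[0] - A[0], A1[1] - A[1])
d_euclid_2 = math.hypot(A2[0] - A[0], A2[1] - A[1])
d_maha_1 = mahalanobis_sq(A1, A, c_inv)
d_maha_2 = mahalanobis_sq(A2, A, c_inv)

# Eq. (7) on a small made-up batch, with C estimated from the data.
pred = [(1.0, 2.0), (2.0, 5.0), (3.0, 4.0)]
label = [(1.2, 2.5), (2.1, 4.0), (2.8, 4.4)]
batch_loss = mahalanobis_loss(pred, label)
```

Under the assumed covariance, the two outputs are equally far from A in the Euclidean sense, but A2 has the smaller Mahalanobis distance because its deviation lies mainly along the more scattered bond stress axis.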
Equation (7) shows that the Mahalanobis distance between two points is independent of the measurement unit, which can eliminate the interference of correlation between variables. However, the Mahalanobis distance can be easily affected by outlier samples, thus sacrificing the overall accuracy.
Therefore, a single-layer perceptron, that is, a neural network with only one layer, was introduced to modify the matrix \(C^{ - 1}\). In essence, to weaken the effect of the outliers, the modified Mahalanobis distance lets the element values of the covariance matrix of the original Mahalanobis distance participate in the training process. Since the number of outliers is generally small and the covariance matrix of the Mahalanobis distance is a 2 × 2 matrix, the learning ability of a single-layer perceptron is sufficient for this task. Therefore, the modified Mahalanobis distance obtained with the single-layer perceptron has a strong anti-noise ability. At the same time, because the perceptron introduces only a small number of parameters, the complexity of the model does not increase substantially, which helps to preserve the training speed and generalization ability. Let \(C^{ - 1} = \left[ {\begin{array}{*{20}c} {c_{11} } & {c_{12} } \\ {c_{21} } & {c_{22} } \\ \end{array} } \right]\) and \(C^{\prime - 1} = \left[ {\begin{array}{*{20}c} {\eta \tanh \left( {\omega_{11} c_{11} + \delta_{11} } \right)} & {\eta \tanh \left( {\omega_{12} c_{12} + \delta_{12} } \right)} \\ {\eta \tanh \left( {\omega_{21} c_{21} + \delta_{21} } \right)} & {\eta \tanh \left( {\omega_{22} c_{22} + \delta_{22} } \right)} \\ \end{array} } \right] = \left[ {\begin{array}{*{20}c} {c^{\prime}_{11} } & {c^{\prime}_{12} } \\ {c^{\prime}_{21} } & {c^{\prime}_{22} } \\ \end{array} } \right]\), where the coefficient \(\eta\) limits the parameter interval, which constrains the effect of the outliers on \(C^{ - 1}\).
The perceptron is initialized by \(C_{{}}^{{ - 1}} = \left[ {\begin{array}{*{20}c} {c_{11} } & {c_{12} } \\ {c_{21} } & {c_{22} } \\ \end{array} } \right]\), the learning parameters are \(\omega_{ij}\), \(\delta_{ij}\) (i, j = 1,2) and \(\eta\), and then \(C_{{}}^{{ - 1}}\) is updated to further learn the distribution of data.
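The element-wise mapping from \(C^{ - 1}\) to \(C^{\prime - 1}\) can be sketched as below. The numerical values of \(\omega_{ij}\), \(\delta_{ij}\) and \(\eta\), the two example matrices, and the function name are hypothetical; in the MMN these quantities are learning parameters updated during training, which the sketch does not show.

```python
import math


def modified_inverse_cov(c_inv, omega, delta, eta):
    """Element-wise perceptron: c'_ij = eta * tanh(omega_ij * c_ij + delta_ij)."""
    return [
        [eta * math.tanh(omega[i][j] * c_inv[i][j] + delta[i][j]) for j in range(2)]
        for i in range(2)
    ]


# Hypothetical inverse covariance, and a copy with one entry inflated by an outlier.
c_inv_clean = [[1.2, -0.15], [-0.15, 0.3]]
c_inv_outlier = [[40.0, -0.15], [-0.15, 0.3]]

# Hypothetical starting values of the learning parameters.
omega = [[1.0, 1.0], [1.0, 1.0]]
delta = [[0.0, 0.0], [0.0, 0.0]]
eta = 2.0

c_mod_clean = modified_inverse_cov(c_inv_clean, omega, delta, eta)
c_mod_outlier = modified_inverse_cov(c_inv_outlier, omega, delta, eta)
```

Because tanh saturates, every element of the modified matrix stays within the interval set by \(\eta\), so an entry inflated by an outlier cannot dominate the loss.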
The corrected \(C^{\prime - 1}\) is substituted into Eq. (7). Equation (8) can be obtained as follows:
$$l_{{m_{t} }} \left( {y_{{2 \times \frac{ab}{2}}} ,\hat{y}_{{2 \times \frac{ab}{2}}} } \right) = \overline{tr} \left[ {\left( {\begin{array}{*{20}c} \vdots & \vdots \\ {\Delta S_{i} } & {\Delta \tau_{i} } \\ \vdots & \vdots \\ \end{array} } \right)\left[ {\begin{array}{*{20}c} {c^{\prime}_{11} } & {c^{\prime}_{12} } \\ {c^{\prime}_{21} } & {c^{\prime}_{22} } \\ \end{array} } \right]\left( {\begin{array}{*{20}c} \cdots & {\Delta S_{i} } & \cdots \\ \cdots & {\Delta \tau_{i} } & \cdots \\ \end{array} } \right)} \right].$$
(8)
Then, let \(W = c^{\prime}_{11} \Delta S_{i}^{2} + \left( {c^{\prime}_{12} + c^{\prime}_{21} } \right)\Delta S_{i} \Delta \tau_{i} + c^{\prime}_{22} \Delta \tau_{i}^{2}\) denote the i-th diagonal element of the matrix product in Eq. (8). Thus, in addition to weighting the slip deviation and the bond stress deviation, as in a weighted Euclidean distance, W also accounts for the product of the bond stress deviation and the slip deviation. Let \(S\sim \left( {\overline{S} \pm \sigma (S)} \right)\) and \(\tau \sim \left( {\overline{\tau } \pm \sigma (\tau )} \right)\). If \(\sigma (S) \approx \sigma (\tau )\), the dispersion of the two distributions is similar, and the role of \(\Delta S_{i} \Delta \tau_{i}\) can be ignored (and vice versa), which is consistent with the above analysis.
Finally, \(W = c^{\prime}_{11} \Delta S_{i}^{2} + \left( {c^{\prime}_{12} + c^{\prime}_{21} } \right)\Delta S_{i} \Delta \tau_{i} + c^{\prime}_{22} \Delta \tau_{i}^{2}\) can be regarded as a conic (quadratic curve) in \(\Delta S_{i}\) and \(\Delta \tau_{i}\), so it must satisfy W > 0 to have practical significance as a distance. Therefore, the modified Mahalanobis distance loss function proposed in this paper can add the following strengthened constraint condition: \(- 2 < \frac{{c^{\prime}_{12} + c^{\prime}_{21} }}{{\sqrt {c^{\prime}_{11} c^{\prime}_{22} } }} < 2\) and \(c^{\prime}_{11} > 0\).
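The constraint condition can be checked numerically with a short sketch. The two example matrices and the function names are hypothetical. The sketch also requires \(c^{\prime}_{22} > 0\), which is read here as implied by the square root in the constraint having to be real.

```python
import math


def quadratic_form_w(c_mod, d_s, d_tau):
    """W = c'11 dS^2 + (c'12 + c'21) dS dtau + c'22 dtau^2."""
    (c11, c12), (c21, c22) = c_mod
    return c11 * d_s ** 2 + (c12 + c21) * d_s * d_tau + c22 * d_tau ** 2


def satisfies_constraint(c_mod):
    """-2 < (c'12 + c'21) / sqrt(c'11 c'22) < 2 and c'11 > 0."""
    (c11, c12), (c21, c22) = c_mod
    if c11 <= 0.0 or c22 <= 0.0:  # c'22 > 0 keeps the square root real
        return False
    return abs(c12 + c21) < 2.0 * math.sqrt(c11 * c22)


# Hypothetical modified matrices: one admissible, one violating the constraint.
c_ok = [[1.07, -0.13], [-0.13, 0.27]]
c_bad = [[1.0, 1.5], [1.5, 1.0]]

# W evaluated on a grid of non-zero deviations (dS, dtau).
grid = [(s, t) for s in range(-3, 4) for t in range(-3, 4) if (s, t) != (0, 0)]
w_ok_min = min(quadratic_form_w(c_ok, s, t) for s, t in grid)
w_bad_min = min(quadratic_form_w(c_bad, s, t) for s, t in grid)
```

For the admissible matrix, W stays positive for every non-zero deviation on the grid, whereas the violating matrix produces negative values of W, for example when the slip and bond stress deviations have opposite signs.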
3.3.2 Hessian Matrix
In Sect. 3.1, the parameter updating Eq. (6) in meta-learning is derived. The Hessian matrix term essentially measures the second-order sensitivity of the loss function \(l_{{m_{t} }} (\theta )\) of each task in the meta-learning stage to the network parameters. The Hessian matrix term reflects the curvature of the global loss function and can help escape saddle points and local minima during gradient descent. In addition, in Sect. 3.3.1, the following conclusion is derived: for the implicit regression problem of the bond–slip model, the loss function \(l_{{m_{t} }} (\theta )\) is related to \(\Delta S_{i} \Delta \tau_{i}\), which means that there is a certain internal connection between the neurons in the output layer. In addition to sharing certain parameters between the layers, a second-order effect is observed between the parameters of some neurons. Therefore, the Hessian matrix term cannot be omitted.
For the explicit regression problem, the number of parameters to be regressed is known, so the direction for updating the global parameters can be found in the network parameter space from the gradient directions produced by the inner task updates of meta-learning, as shown in Fig. 9. For implicit regression problems, however, the Hessian matrix correction term is needed: the parameter space of the mapping mode is first searched according to the curvature variation of the loss function, and the direction for updating the global parameters is then searched within this parameter space.