Common use of Model Validation Clause in Contracts

Model Validation. We examined the validation approach for each of the 34 outcomes (the clinical endpoints of the studies). A single random split was used 17 times (50.0%), with the data split once into train-test or train-validation-test parts. When the data are split into train-test parts, the best model fitted on the training data is chosen based on its performance on the test data, whereas when the data are split into train-validation-test parts, the best model is selected based on its performance on the validation data; the test data are then used to internally validate the performance of the model on new patients. Resampling (cross-validation or nested cross-validation) was used 9 times (26.5%). External validation (testing the original prediction model on a set of new patients from a different year, location, country, etc.) was used 4 times (11.8%): 3 times it involved a chronological split of the data into training and test parts (temporal validation), and 1 time validation on a new dataset. A multiple random split was used 2 times (5.9%), with the data split into train-test or train-validation-test parts multiple times. Validation was not performed for 2 of the outcomes (5.9%).

We recommend reporting the steps of the validation approach in detail to avoid misconceptions. For complex procedures, a schematic representation of the validation steps can be insightful. Researchers should aim to perform both internal and external validation, where possible, to maximize the reliability of the prediction models.

Table 5.3 shows the performance measures used for model validation in the 24 studies. A popular measure in the survival field, the C-index, was employed in 8 studies (33.3%; as the C-index or the time-dependent C-index), and the AUC in 5 studies (20.8%). Notably, during the screening process we identified several manuscripts in which the AUC and the C-statistic were used interchangeably.
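The single-split workflow described above (select the best model on validation data, then touch the held-out test data exactly once) can be sketched in a few lines of Python. The toy dataset, the threshold "models", and the accuracy score below are illustrative stand-ins only, not taken from any of the reviewed studies; a real survival study would use a censoring-aware measure rather than accuracy.

```python
import random

random.seed(0)

# Toy dataset: (feature, outcome) pairs standing in for patient records.
data = [(random.gauss(0, 1), random.random() < 0.5) for _ in range(300)]

# Single random split into train (60%) / validation (20%) / test (20%).
random.shuffle(data)
n = len(data)
train = data[: int(0.6 * n)]
validation = data[int(0.6 * n): int(0.8 * n)]
test = data[int(0.8 * n):]

# The "models" are simple threshold rules; the best one is chosen on the
# validation set, never on the test set.
def accuracy(threshold, subset):
    return sum((x > threshold) == y for x, y in subset) / len(subset)

candidates = [-0.5, 0.0, 0.5]
best = max(candidates, key=lambda t: accuracy(t, validation))

# The held-out test set is used exactly once, for the final estimate of
# performance on "new patients" (internal validation).
final_score = accuracy(best, test)
```

With a train-test split only, the selection and the final assessment would both use the test data, which is why the three-way split gives a less optimistic estimate.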
While there is a link between the dynamic time-dependent AUC and the C-index (the AUC can be interpreted as a concordance index used to assess model discrimination) [55], the two are not identical, and some caution is required. Apart from the C-index, no other measure was established across the 24 studies (large variability). This issue is of paramount importance, as validation (and development) of SNNs depends on a suitable performance measure, and any candidate measure should take the censoring mechanism into account. By employing performance measures common in traditional classification ANNs, such as accuracy, some SNNs were suboptimally validated. Consistency in the use of performance measures should also be considered: in the simulation study of ▇▇▇▇▇▇▇▇▇ et al. in 2013 [43], hyperparameter values for PLANN were chosen based on the Bayesian Information Criterion (BIC), validation of the SNN performance on the test data was performed using the Mean Squared Error (MSE), and the comparison with the Cox model was based on the C-index. Proper measures should be employed for model development and validation with time-to-event data (see the book of van Houwelingen and Putter [5]).

Reporting of confidence intervals for the predictive measures was also examined; 13 studies (54.2%) did not provide confidence intervals. Repeated data resampling was used in 6 studies (25.0%). The remaining approaches observed were: repeating the simulations 500 times; rerunning the SNN 10 times for each covariate; and using a non-parametric confidence interval based on a Gaussian approximation (4.2% each). The method of choice was unclear in 2 studies (8.3%). There is a strong need for methods that quantify the uncertainty of an evaluation criterion; this would provide additional insight into the predictive accuracy of the model. Another important aspect of a prediction model is calibration.
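As a concrete illustration of a censoring-aware measure and of resampling-based uncertainty, the sketch below computes Harrell's C-index from scratch and attaches a bootstrap percentile confidence interval. The simulated times, event indicators, and risk scores are hypothetical, and the implementation ignores tied survival times for brevity.

```python
import random

def c_index(times, events, risks):
    """Harrell's concordance index: among comparable pairs (the patient
    with the shorter time had an observed event), count pairs where the
    higher predicted risk goes with the shorter survival; risk ties 0.5."""
    conc, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if times[i] < times[j] and events[i]:  # comparable pair
                comparable += 1
                if risks[i] > risks[j]:
                    conc += 1.0
                elif risks[i] == risks[j]:
                    conc += 0.5
    return conc / comparable

# Toy data: survival times, event indicators (False = censored), and
# model risk scores anti-correlated with survival time.
random.seed(1)
times = [random.expovariate(1.0) for _ in range(100)]
events = [random.random() < 0.7 for _ in range(100)]
risks = [-t + random.gauss(0, 0.5) for t in times]

point = c_index(times, events, risks)

# Bootstrap percentile confidence interval by resampling patients.
boots = []
for _ in range(200):
    idx = [random.randrange(100) for _ in range(100)]
    boots.append(c_index([times[k] for k in idx],
                         [events[k] for k in idx],
                         [risks[k] for k in idx]))
boots.sort()
ci = (boots[4], boots[194])  # approximate 95% interval
```

Because censored patients only enter comparable pairs through the other member, the measure does not penalize the model for outcomes that were never observed, unlike plain accuracy.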
Calibration refers to the agreement between observed survival probabilities, estimated with the Kaplan-Meier methodology, and the predicted outcomes. Typically, a plot is produced in which the subjects are divided into 10 groups based on the deciles of the predicted probabilities, and the observed survival probabilities are plotted against the predicted ones. In this review, calibration plots were available for only 11 studies (45.8%). Calibration of the SNNs was not assessed in most studies, and as such a neutral comparison with the Cox proportional hazards model could not be established. This is in accordance with the findings of ▇▇▇▇▇▇▇▇▇▇▇▇▇ et al. (2019) [56], which pinpoint an urgent need for more attention to the calibration of modern ML techniques versus traditional regression methods to achieve a fair model comparison in the classification setting.

Performance criterion                                              N (%)
C-index                                                            7 (29.2%)
AUC                                                                5 (20.8%)
Log-likelihood                                                     3 (12.5%)
Accuracy                                                           2 (8.3%)
Global chi-squared statistic of Cox regression                     2 (8.3%)
Brier Score                                                        1 (4.2%)
Comparison of predicted probabilities with Kaplan-Meier            1 (4.2%)
Integrated Brier Score (IBS)                                       1 (4.2%)
Mean Absolute Error (MAE)                                          1 (4.2%)
▇▇▇▇▇▇▇'▇ test                                                     1 (4.2%)
Mean Squared Error (MSE)                                           1 (4.2%)
Prognostic risk group discrimination                               1 (4.2%)
Sensitivity                                                        1 (4.2%)
Separation of cases into good and bad prognosis                    1 (4.2%)
Specificity                                                        1 (4.2%)
Survival curves comparison with log-rank test                      1 (4.2%)
Time-dependent C-index (Ctd)                                       1 (4.2%)
Wilcoxon test (separation of cases into good and bad prognosis)    1 (4.2%)

Table 5.3: The performance measures used for model validation across the 24 studies.
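The decile-based calibration check can be sketched as follows. The cohort is simulated (the predictions, event times, and horizon are all hypothetical), and the minimal Kaplan-Meier estimator below omits the tie-handling refinements used in practice.

```python
import random

def km_survival(times, events, horizon):
    """Kaplan-Meier estimate of S(horizon) via the product-limit formula."""
    order = sorted(range(len(times)), key=lambda k: times[k])
    at_risk, surv = len(times), 1.0
    for k in order:
        if times[k] > horizon:
            break
        if events[k]:                 # observed event: step the curve down
            surv *= 1.0 - 1.0 / at_risk
        at_risk -= 1                  # censored subjects just leave the risk set
    return surv

# Simulated cohort: predicted survival probabilities at a fixed horizon,
# with event rates tied to the predictions so calibration is plausible.
random.seed(2)
n = 500
predicted = [random.random() for _ in range(n)]
times = [random.expovariate(0.2 + 2.0 * (1.0 - p)) for p in predicted]
events = [random.random() < 0.8 for _ in range(n)]
horizon = 1.0

# Divide subjects into 10 groups by deciles of predicted probability and
# pair each group's mean prediction with its Kaplan-Meier observed survival.
order = sorted(range(n), key=lambda k: predicted[k])
points = []
for g in range(10):
    idx = order[g * n // 10:(g + 1) * n // 10]
    mean_pred = sum(predicted[k] for k in idx) / len(idx)
    observed = km_survival([times[k] for k in idx],
                           [events[k] for k in idx], horizon)
    points.append((mean_pred, observed))
```

Plotting observed against predicted for these 10 points gives the calibration plot; points near the diagonal indicate good calibration.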


Sources: Analysis of Sarcoma and Non Sarcoma Clinical Data With Statistical Methods and Machine Learning Techniques
