Model Uncertainty Quantification: A Post Hoc Calibration Approach for Heart Disease Prediction
Journal of Engineering Research and Sciences, vol. 4, no. 12, pp. 25–54, 2025; DOI: 10.55708/js0412003
Keywords: Heart disease prediction, Machine learning, Probability calibration, Isotonic regression, Platt scaling, Temperature scaling, Uncertainty quantification, Expected calibration error (ECE), Brier score, Log loss, Spiegelhalter’s test, Reliability diagram, Post hoc calibration.
(This article belongs to the Section Artificial Intelligence – Computer Science (AIC))
Cite this article as:
Odesola, P. A., Adegoke, A. A. and Babalola, I. (2025). Model Uncertainty Quantification: A Post Hoc Calibration Approach for Heart Disease Prediction. Journal of Engineering Research and Sciences, 4(12), 25–54. https://doi.org/10.55708/js0412003
Peter Adebayo Odesola, Adewale Alex Adegoke and Idris Babalola. "Model Uncertainty Quantification: A Post Hoc Calibration Approach for Heart Disease Prediction." Journal of Engineering Research and Sciences 4, no. 12 (December 2025): 25–54. https://doi.org/10.55708/js0412003
P.A. Odesola, A.A. Adegoke and I. Babalola, "Model Uncertainty Quantification: A Post Hoc Calibration Approach for Heart Disease Prediction," Journal of Engineering Research and Sciences, vol. 4, no. 12, pp. 25–54, Dec. 2025, doi: 10.55708/js0412003.
We investigated whether post-hoc calibration improves the trustworthiness of heart-disease risk predictions beyond discrimination metrics. Using a Kaggle heart-disease dataset (n = 1,025), we created a stratified 70/30 train-test split and evaluated six classifiers: Logistic Regression, Support Vector Machine, k-Nearest Neighbors, Naive Bayes, Random Forest, and XGBoost. Discrimination was quantified by stratified 5-fold cross-validation, with decision thresholds chosen by Youden's J inside the training folds. We assessed probability quality before and after Platt scaling, isotonic regression, and temperature scaling using the Brier score, Expected Calibration Error (ECE) with equal-width and equal-frequency binning, Log Loss, reliability diagrams with Wilson intervals, and Spiegelhalter's Z statistic and p-value. Uncertainty was reported with bootstrap 95% confidence intervals, and calibrated versus uncalibrated states were compared with paired permutation tests on fold-matched deltas. Isotonic regression delivered the most consistent improvements in probability quality for Random Forest, XGBoost, Logistic Regression, and Naive Bayes, lowering Brier score, ECE, and Log Loss while preserving ROC AUC in cross-validation. Support Vector Machine and k-Nearest Neighbors were best left uncalibrated on these metrics. Temperature scaling altered discrimination and often increased Log Loss on this structured dataset. Sensitivity analysis showed that equal-frequency ECE was systematically smaller than equal-width ECE across model-calibration pairs while preserving the qualitative ranking of methods. Reliability diagrams built from out-of-fold predictions aligned with the numeric metrics, and Spiegelhalter's statistics moved toward values consistent with better absolute calibration for the models that benefited from isotonic regression. The study provides a reproducible, leakage-controlled workflow for evaluating and selecting calibration strategies for structured clinical feature data.
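As an illustration of the evaluation workflow summarized above, the sketch below builds out-of-fold probabilities from a stratified 5-fold split, fits isotonic calibration inside each training fold via scikit-learn's CalibratedClassifierCV, and reports the Brier score, Log Loss, ECE under both binning schemes, and Spiegelhalter's Z. This is a minimal sketch under stated assumptions, not the authors' code: the synthetic stand-in data, the Random Forest configuration, the inner fold count, and the 10-bin ECE are illustrative choices.

```python
# Minimal sketch of the calibration-evaluation loop described in the abstract.
# Assumptions (not from the paper): synthetic stand-in data via make_classification,
# a Random Forest base model, 3 inner calibration folds, and 10 ECE bins.
import numpy as np
from scipy.stats import norm
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss, log_loss
from sklearn.model_selection import StratifiedKFold


def ece(y_true, y_prob, n_bins=10, strategy="uniform"):
    """Expected Calibration Error with equal-width ('uniform') or
    equal-frequency ('quantile') binning."""
    if strategy == "uniform":
        edges = np.linspace(0.0, 1.0, n_bins + 1)
    else:
        edges = np.quantile(y_prob, np.linspace(0.0, 1.0, n_bins + 1))
    bin_idx = np.searchsorted(edges[1:-1], y_prob, side="right")
    err = 0.0
    for b in range(n_bins):
        mask = bin_idx == b
        if mask.any():
            # weight each bin's |mean prediction - observed frequency| gap by its size
            err += mask.mean() * abs(y_prob[mask].mean() - y_true[mask].mean())
    return err


def spiegelhalter_z(y_true, y_prob):
    """Spiegelhalter's Z: standardized departure of the Brier score from its
    expectation under perfect calibration, with a two-sided p-value."""
    num = np.sum((y_true - y_prob) * (1.0 - 2.0 * y_prob))
    den = np.sqrt(np.sum((1.0 - 2.0 * y_prob) ** 2 * y_prob * (1.0 - y_prob)))
    z = num / den
    return z, 2.0 * (1.0 - norm.cdf(abs(z)))


# Stand-in for the Kaggle heart-disease table (n = 1,025, 13 clinical features).
X, y = make_classification(n_samples=1025, n_features=13, random_state=42)

# Out-of-fold probabilities: the isotonic calibrator is fitted inside each
# training fold (via CalibratedClassifierCV's internal CV), so the evaluation
# fold never leaks into model fitting or calibration.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
oof = np.zeros(len(y))
for train_idx, test_idx in cv.split(X, y):
    base = RandomForestClassifier(n_estimators=300, random_state=42)
    calibrated = CalibratedClassifierCV(base, method="isotonic", cv=3)
    calibrated.fit(X[train_idx], y[train_idx])
    oof[test_idx] = calibrated.predict_proba(X[test_idx])[:, 1]

print("Brier score:          ", brier_score_loss(y, oof))
print("Log loss:             ", log_loss(y, oof))
print("ECE (equal-width):    ", ece(y, oof, strategy="uniform"))
print("ECE (equal-frequency):", ece(y, oof, strategy="quantile"))
print("Spiegelhalter Z, p:   ", spiegelhalter_z(y, oof))
```

Swapping method="isotonic" for "sigmoid" gives the Platt-scaling variant compared in the study; temperature scaling has no built-in scikit-learn implementation and would need a custom calibrator fitted on held-out fold scores.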
