Imputation and Hyperparameter Optimization in Cancer Diagnosis
Journal of Engineering Research and Sciences, Volume 2, Issue 8, Page # 1-18, 2023; DOI: 10.55708/js0208001
Keywords: Machine Learning, Cervical Cancer, Imputation, Hyperparameter Optimization
(This article belongs to the Special Issue on SP3 (Special Issue on Multidisciplinary Sciences and Advanced Technology 2023) and the Section Medical Informatics (MDI))
Cancer is one of the leading causes of death worldwide. Accurate and timely detection of cancer can save lives. As more machine learning algorithms and approaches have been applied in cancer diagnosis, there is a need to analyze their performance. This study compares the detection accuracy and speed of nineteen machine learning algorithms on a cervical cancer dataset. To keep the approach general enough to detect various types of cancers, this study intentionally excludes feature selection, a step commonly applied in studies targeting a specific dataset or a certain type of cancer. In addition, imputation and hyperparameter optimization are employed to improve the algorithms' performance. The results suggest that when both imputation and hyperparameter optimization are applied, the algorithms tend to perform better than when either is employed individually or when both are absent. The majority of the algorithms show improved diagnostic accuracy, although at the cost of increased execution time. The findings demonstrate the potential of machine learning in cancer diagnosis, especially the possibility of developing versatile systems that can detect various types of cancers with satisfactory performance.
1. Introduction
Cancer is a complex disease with numerous genetic and epigenetic variations. Depending on the part of the body where it develops, cancer is classified into different types, each with unique characteristics. According to the World Health Organization (WHO), cervical cancer ranks as the fourth most prevalent gynecologic malignancy among women worldwide [1]. Since cervical cancer is generally slow-growing, early detection through routine human papillomavirus (HPV) examination and Pap smear checkups is crucial for timely treatment and for maximizing patients' chances of survival. HPV and Pap smear tests are effective methods for early cervical cancer screening that examine cells collected from the cervix. The HPV test can detect the human papillomavirus that causes cell changes, and the Pap smear test can find precancerous cells that might develop into cancer if not treated in time. However, periodic examination and results assessment are not always feasible due to a shortage of medical professionals, especially in developing countries where cervical cancer is most prevalent [2].
Because of the increasing availability of cancer datasets and the exceptional ability of machine learning to identify patterns within complex datasets, more supervised, unsupervised, and semi-supervised machine learning techniques have been applied to the diagnosis of various types of cancers [3]–[5], including cervical cancer [6]–[8]. Several studies have applied machine learning to Pap smear test results. A recent review has indicated that K-nearest-neighbors (KNN) and support vector machine (SVM) algorithms achieve the highest accuracy, exceeding 98.5%. However, the reviewed algorithms show weaknesses, such as low classification accuracy for some classes of cells.
Most classifiers evaluated on segmented Pap-smear images are commercial software. There is a need to verify their clinical effectiveness in developing countries, where the majority (80%) of cases occur and where there is a shortage of well-trained doctors and of funding to purchase commercially available software [9]. In [10], researchers took an ensemble approach, applying five popular machine learning methods, logistic regression (LR), decision tree classifier (DT), SVM, multilayer perceptron (MLP), and KNN, to mine the relationships among different risk factors. The average of the results was used as the benchmark, with the algorithm that outperformed the benchmark serving as the predictor. In that study, a gene auxiliary module was used to enhance the prediction result. However, such an approach has an inherent limitation since patients' gene information is often unknown.
Cervicography refers to images capturing the cervical area, commonly used to determine the presence of cervical cancer. Since accurate reading of cervicography requires the experience of well-trained medical professionals, which is not always available, [11] presented a fully automated convolutional neural network (CNN) based process to detect the cervical area and classify cervical cancer from cervicography. However, the accuracy of cervical area detection is as low as 68%, and the area under the curve (AUC) score of the cancer detection rate is 82%. A related study compared the prediction accuracy of three machine learning algorithms (SVM, KNN, and decision tree) using the UCI database [11]. Nithya and Ilango [12] applied five machine learning algorithms along with various feature selection techniques to explore risk factors of cervical cancer. Although the reported accuracy is as high as 99% to 100%, evidence is needed to show that the system is not overly tailored to a specific dataset. To compare the performance of deep learning and machine learning, researchers evaluated three machine learning algorithms, eXtreme Gradient Boosting (XGB), SVM, and Random Forest (RF), and one deep learning algorithm (ResNet-50) to identify signs of cervical cancer in cervicography images [13]. The evaluation results suggest that deep learning performs better than the machine learning approaches, showing a 0.15-point improvement over the average of the three machine learning algorithms.
Building upon the existing literature on applying machine learning to cervical cancer detection, this study extends the scope of examined algorithms to gain deeper insights. Instead of evaluating a limited number of algorithms, as done in most studies, this research assesses a comprehensive set of 19 supervised, semi-supervised, and unsupervised algorithms. This extensive approach offers a broader understanding of the topic. Feature selection has the advantage of improving predictions but may introduce overfitting, wherein the model is overly tailored to a specific kind of data. Thus, this study excludes feature selection so that the approach can generalize to various types of cancers. However, this choice comes with a potential trade-off of reduced prediction accuracy. To address this concern, we explore whether the application of imputation and hyperparameter optimization can enhance diagnostic accuracy.
2. Method
2.1. Imputation
Imputation is a method of substituting missing data with alternative values so that the majority of the information in the dataset can be preserved. Missing values are common in the medical field: since much information is provided by patients voluntarily, patients often skip certain questions out of privacy concerns or a lack of knowledge, which makes discarding instances with missing values impractical and inadvisable. Imputation keeps all instances by replacing missing data with estimated values. This can be achieved using various techniques, among which statistical and machine learning models are two popular methods. Statistical models use the mean, median, or mode for numerical features and the most frequent value for both numerical and categorical features.
Machine learning models, on the other hand, often use regression and random forest techniques. Statistical models are often better suited for large-scale datasets with missing values because of their computational efficiency, whereas machine learning models can handle both large and small-scale datasets. Both strategies are sketched below.
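As a concrete illustration, the following sketch contrasts the two imputation families using scikit-learn. The toy matrix and parameter choices are illustrative assumptions, not the study's actual configuration.

```python
# Minimal sketch: statistical vs. machine-learning imputation.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, IterativeImputer
from sklearn.ensemble import RandomForestRegressor

# Toy feature matrix with missing entries (NaN).
X = np.array([[25.0, 1.0, np.nan],
              [31.0, np.nan, 0.0],
              [np.nan, 3.0, 1.0],
              [40.0, 2.0, 0.0]])

# Statistical model: replace each missing entry with the column's
# most frequent value (usable for numerical and categorical features).
X_stat = SimpleImputer(strategy="most_frequent").fit_transform(X)

# Machine learning model: iteratively regress each feature that has
# missing values on the remaining features using a random forest.
rf_imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=100, random_state=0),
    max_iter=10, random_state=0)
X_rf = rf_imputer.fit_transform(X)
```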
The theoretical foundations of both statistical and machine learning models are based on the sample and population distributions of missing values within datasets [14]. The mathematical explanation is as follows [14]: to estimate the missing values in a given dataset, let X represent the background information in a population and Y represent the outcome information in the sample. Then an estimate of the missing data in one run is denoted as Q = Q(X, Y).
For the repeated imputation, given the complete data \(Y = (Y_{\text{obs}}, Y_{\text{mis}})\), where \(Y_{\text{obs}}\) denotes the observed part and \(Y_{\text{mis}}\) the missing part, and the estimand Q, we have the following equation:
$$P(Q \mid Y_{\text{obs}}) = \int P(Q \mid Y_{\text{obs}}, Y_{\text{mis}}) P(Y_{\text{mis}} \mid Y_{\text{obs}}) \, dY_{\text{mis}} \tag{1}$$
Equation (1) implies that the actual posterior distribution of Q is obtained by averaging the complete-data posterior distribution of Q over the predictive distribution of the missing values.
As a result, the final estimate of Q and the final variance of Q are presented in Equations (2) and (3), respectively.
$$E(Q \mid Y_{\text{obs}}) = E[E(Q \mid Y_{\text{obs}}, Y_{\text{mis}}) \mid Y_{\text{obs}}] \tag{2}$$
$$V(Q \mid Y_{\text{obs}}) = E[V(Q \mid Y_{\text{obs}}, Y_{\text{mis}}) \mid Y_{\text{obs}}] + V[E(Q \mid Y_{\text{obs}}, Y_{\text{mis}}) \mid Y_{\text{obs}}] \tag{3}$$
Equation (2) indicates that the posterior mean of Q is calculated as the average of repeated complete-data posterior means of Q. Equation (3) indicates that the posterior variance of Q is the sum of the average of repeated complete-data variances of Q and the variance of repeated complete-data posterior means of Q.
For the imputation to be proper, let \(\hat{Q}\) represent the complete-data estimate and U its associated variance-covariance matrix, and let \(\hat{Q}_{*l}\) and \(U_{*l}\) denote their values computed from the l-th of m imputations. Their averages across imputations should be approximately unbiased for \(\hat{Q}\) and U:

$$E(\overline{Q}_{\infty} \mid X, Y, I) \doteq \hat{Q}$$

and

$$E(\overline{U}_{\infty} \mid X, Y, I) \doteq U$$

In addition, \(B_{\infty}\), the variance-covariance of \(\hat{Q}_{*l}\) across the m imputations, must be approximately unbiased for the randomization variance of \(\overline{Q}_{\infty}\), as shown by:

$$E(B_{\infty} \mid X, Y, I) \doteq \operatorname{var}(\overline{Q}_{\infty} \mid X, Y, I)$$
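Equations (2) and (3) are stated for an infinite number of imputations; for a finite number m, Rubin's combining rules [14] add a (1 + 1/m) correction to the between-imputation component. The sketch below works through the rules on illustrative numbers, not study data.

```python
# Rubin's combining rules for m repeated imputations (illustrative values).
import numpy as np

q_hat = np.array([0.52, 0.48, 0.55, 0.50, 0.51])       # complete-data estimates
u_hat = np.array([0.010, 0.012, 0.009, 0.011, 0.010])  # complete-data variances
m = len(q_hat)

q_bar = q_hat.mean()            # Eq. (2): average of the posterior means
u_bar = u_hat.mean()            # within-imputation variance
b = q_hat.var(ddof=1)           # between-imputation variance
t = u_bar + (1 + 1 / m) * b     # Eq. (3) with the finite-m correction
print(f"estimate {q_bar:.3f}, total variance {t:.5f}")
```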
2.2. Hyperparameter Optimization
In machine learning, a hyperparameter is a parameter which can be set by the user to control the learning process. The purpose of hyperparameter optimization is to choose a set of hyperparameters for a learning algorithm to optimally solve machine learning problems [15].
Given some model M of f : X → R^N, where X is the configuration space, the Expected Improvement (EI) [16], described in Equation (4), is the expectation that f(x) will fall below some threshold y∗:
$$EI_{y^*}(x) := \int_{-\infty}^{\infty} \max(y^* - y, 0) \, p_M(y \mid x) \, dy \tag{4}$$
For learning algorithms such as Linear Discriminant Analysis (LDA) and Logistic Regression (LR) that apply a Gaussian process, the procedure to optimize EI involves setting y∗ to the best value found in the observation history H: y∗ = min f(x_i), 1 ≤ i ≤ n. In Equation (4), p_M represents the posterior Gaussian process distribution given the observation history H.
The optimization for tree-based learning algorithms, such as Random Forest [17] and Regression Trees [18], models p(x | y) and p(y) instead of p(y | x) as done in the Gaussian process-based approach. The definition of p(x | y) can be found in Equation (5) [16].
$$p(x \mid y) = \begin{cases} l(x) & \text{if } y < y^* \\ g(x) & \text{if } y \geq y^* \end{cases} \tag{5}$$
Now, Equation (4) can be rewritten as Equation (6) as shown below.
$$EI_{y^*}(x) := \int_{-\infty}^{y^*} (y^* - y) \frac{p(x \mid y) p(y)}{p(x)} \, dy \tag{6}$$
Let γ = p(y < y∗); then p(x) = γl(x) + (1 − γ)g(x). Substituting these into Equation (6), the optimization of EI becomes:

$$EI_{y^*}(x) = \frac{\gamma y^* l(x) - l(x) \int_{-\infty}^{y^*} p(y) \, dy}{\gamma l(x) + (1 - \gamma) g(x)} \propto \left( \gamma + \frac{g(x)}{l(x)} (1 - \gamma) \right)^{-1},$$

that is, the procedure seeks points x with high probability under l(x) and low probability under g(x), as sketched below.
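To make Equation (6) concrete, the following sketch implements this selection rule on a one-dimensional toy problem: split the observation history at the quantile y∗, model l(x) and g(x) with kernel density estimates, and pick the candidate maximizing l(x)/g(x). The objective, sample sizes, and γ are illustrative assumptions, not the paper's setup.

```python
# Toy tree-structured Parzen estimator (TPE) style acquisition.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
x_hist = rng.uniform(0.0, 10.0, 100)                    # tried hyperparameter values
y_hist = (x_hist - 3.0) ** 2 + rng.normal(0, 1.0, 100)  # observed losses f(x)

gamma = 0.25
y_star = np.quantile(y_hist, gamma)           # threshold y* with p(y < y*) = gamma
l = gaussian_kde(x_hist[y_hist < y_star])     # density of "good" configurations
g = gaussian_kde(x_hist[y_hist >= y_star])    # density of "bad" configurations

candidates = rng.uniform(0.0, 10.0, 1000)
x_next = candidates[np.argmax(l(candidates) / g(candidates))]  # max l/g maximizes EI
print(f"next hyperparameter to try: {x_next:.2f}")  # should land near the optimum, 3
```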
2.3. Dataset
The cervical cancer dataset used in this study is from the openly accessible UCI Machine Learning Repository [6]. This text-based dataset consists of 858 instances and 36 attributes, covering demographic information, habits, medical records, and more, presented as integer or real values. Collected from a hospital in Venezuela, the dataset contains missing values and is highly imbalanced, with 55 positive diagnosis results to 803 negative diagnosis results.
The cervical cancer dataset includes results from four detection techniques: Hinselmann, Schiller, cytology, and biopsy. This study selects the biopsy result as the target variable since biopsy results provide more detailed diagnostic information and offer deterministic outcomes in terms of cell malignancy, cancer type, and cancer stage [19]. In this study, all instances are kept, including those with missing values: removing instances with missing values would exclude approximately 10% of the true positive (malignant) cases, significantly impacting the detection model's performance. However, two attributes, namely STDs: Time since first diagnosis and STDs: Time since last diagnosis, are removed because the majority of their values are missing, with less than 9% of the values available. This preparation is sketched below.
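A hedged sketch of the dataset preparation follows. The file path and column spellings below follow the UCI repository and are assumptions that may need adjustment; the paper's exact preprocessing code is not reproduced here.

```python
# Load the UCI cervical cancer (risk factors) dataset and prepare features/target.
import pandas as pd

url = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "00383/risk_factors_cervical_cancer.csv")
df = pd.read_csv(url, na_values="?")   # missing values are encoded as "?"

# Drop the two mostly-missing attributes.
df = df.drop(columns=["STDs: Time since first diagnosis",
                      "STDs: Time since last diagnosis"])

# Keep Biopsy as the target; the other three screening results are excluded
# from the feature matrix ("Citology" is the UCI file's spelling).
X = df.drop(columns=["Hinselmann", "Schiller", "Citology", "Biopsy"])
y = df["Biopsy"]
print(X.shape, y.value_counts().to_dict())  # 858 instances, highly imbalanced
```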
2.4. Design of experiments
Four tests (see Table 1) are designed to empirically investigate the impact of imputation and hyperparameter optimization on the detection accuracy and processing time of machine learning algorithms on a small cancer dataset with missing values. In all tests, the dataset is split into 67% for training and 33% for testing. No feature selection is applied prior to training.
Since unsupervised learning algorithms do not use labeled data for training, they are not good candidates for hyperparameter optimization. Although some studies have proposed strategies to optimize hyperparameters for unsupervised learning, evaluating the optimization outcomes can be challenging. Therefore, in this study, we apply hyperparameter optimization only to 14 supervised learning algorithms and one semi-supervised learning algorithm (Label Propagation (LP)).
Table 1: Test Design
| Test | Algorithms | Imputation | Hyperparameter Optimization |
|------|------------|------------|-----------------------------|
| 1 | 19 | N | N |
| 2 | 15 | N | Y |
| 3 | 19 | Y | N |
| 4 | 15 | Y | Y |
Table 1 (Test Design) summarizes how the four tests are conducted. The first column shows the test number, the second the number of algorithms included in each test, the third whether imputation is applied (Y) or not (N), and the fourth whether hyperparameter optimization is applied (Y) or not (N). The four tests are described below.
- In tests 1 and 3, all 19 algorithms are evaluated without the application of hyperparameter optimization. Feature imputation is used in test 3 but not in test 1.
- In tests 2 and 4, 15 algorithms are used: one semi-supervised learning algorithm (Label Propagation) and all 14 supervised learning algorithms. Hyperparameter optimization is applied in both tests 2 and 4, with imputation employed in test 4 but not in test 2.
A total of 19 machine learning algorithms are used for prediction on the cervical cancer dataset. Table 2 lists 17 of these algorithms, excluding Logistic Regression – Balanced (LR-B) and Random Forest – Balanced (RF-B). LR-B and RF-B are identical to Logistic Regression and Random Forest, respectively, except for the setting of the class_weight parameter in the Scikit-learn library (sklearn) [20]: both LR-B and RF-B set class_weight to "balanced", as illustrated in the sketch below. Thus, we do not include LR-B and RF-B in Table 2. All listed algorithms are categorized as unsupervised, semi-supervised, or supervised.
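For clarity, the only change that turns LR and RF into LR-B and RF-B is shown here; all hyperparameters other than class_weight are scikit-learn defaults.

```python
# LR-B and RF-B: the same estimators as LR and RF, with class_weight="balanced",
# which reweights classes inversely to their frequencies in the training data.
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

lr_b = LogisticRegression(class_weight="balanced", max_iter=1000)
rf_b = RandomForestClassifier(class_weight="balanced", random_state=0)
```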

Figure 1 illustrates the workflow of prediction with the application of imputation and hyperparameter optimization. False positives refer to negative instances incorrectly classified as positive, while false negatives refer to positive instances incorrectly classified as negative. In cancer diagnostics, the cost of false positives outweighs that of false negatives since the former put "patients at risk with invasive diagnostic procedures" [21].
In this study, we choose both Overall Accuracy and the Area Under the Receiver Operating Characteristic curve (AUROC) [22] as metrics for evaluating detection accuracy. The ROC curve plots sensitivity against one minus specificity: sensitivity is the true positive rate, defined as \(TPR = \frac{TP}{TP + FN}\), and the false positive rate is \(FPR = \frac{FP}{TN + FP}\), where TP stands for True Positive, TN True Negative, FP False Positive, and FN False Negative. A higher AUROC value indicates better accuracy with a smaller false positive rate; an AUROC of 1.0 indicates a perfect classifier. The Overall Accuracy is calculated as the ratio of correctly classified instances to the total number of instances, that is, \(\frac{TP + TN}{TP + TN + FP + FN}\). Both metrics can be computed as sketched below.
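The sketch below shows how the two metrics can be computed with scikit-learn; the synthetic imbalanced dataset stands in for the real one, and the classifier choice is an illustrative assumption.

```python
# Compute Overall Accuracy and AUROC on a 33% held-out test set.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 858 instances, roughly 94% negative class.
X, y = make_classification(n_samples=858, weights=[0.94], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)
y_score = clf.predict_proba(X_te)[:, 1]   # probability of the positive class

print("Overall Accuracy:", accuracy_score(y_te, y_pred))  # (TP+TN)/all instances
print("AUROC:", roc_auc_score(y_te, y_score))             # area under the ROC curve
```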
The tests are implemented in Python 3.11 using the Scikit-learn [20] and Imbalanced-learn [23] libraries. They are executed on a Dell Precision 5820 workstation equipped with an Intel Xeon processor (4 cores, 4.1 GHz) and an NVIDIA Quadro P2000 graphics card (5 GB VRAM, 4 DisplayPort connectors).
3. Results
This section presents the results of the four tests; accuracy and execution time details are given in the tables in the Appendix, along with the AUROC curves. A summary of findings for each test is provided in Table 3.
The following three subsections provide details on how imputation, hyperparameter optimization, and the combination of these two methods impact machine learning algorithms’ prediction accuracy and execution time for cervical cancer diagnosis.
3.1. Impact of Imputation
We compare two sets of results to determine the impact of imputation on the cervical cancer dataset. The first set comprises the filtered dataset and the imputed dataset without hyperparameter optimization, from test 1 and test 3 (section 3.1.2). The second set comprises the filtered dataset and the imputed dataset with hyperparameter optimization, from test 2 and test 4 (section 3.1.3).
3.1.1. Imputation Strategies
As mentioned in section 2.1, statistical and machine learning models are two popular imputation methods. In this study, both have been explored for imputing the missing values in the cervical cancer dataset. Specifically, we apply the statistical model using the most-frequent-value approach and the machine learning model using Random Forest, in both test 3 and test 4. Table 4 compares the Overall Accuracy and AUROC obtained through these two approaches in test 3 with 19 learning algorithms. The results indicate that imputation with Random Forest outperforms imputation with the most frequent value, with 15 out of 19 algorithms demonstrating higher Overall Accuracy and 12 out of 19 showing better AUROC scores. In test 4 with 15 algorithms, the comparison consistently indicates the better performance of the Random Forest approach: 13 out of 15 algorithms achieve better AUROC scores and 14 out of 15 obtain better Overall Accuracy. Thus, for the remainder of the paper, we report findings based on imputation with Random Forest whenever imputation is applied.
3.1.2. Comparison of Results from Test 1 and Test 3
Table 5 and Table 7 report the results of test 1 and test 3, respectively. The comparisons of the accuracy and speed of the 19 machine learning algorithms on the filtered dataset and the imputed dataset without hyperparameter optimization are described below.
- All algorithms achieve higher Overall Accuracy with the imputed dataset than with the filtered dataset, with improvements ranging from 0.40% to 3.42%. Six algorithms, namely AdaBoost, LightGBM, RF, RF-B, XGBoost, and LP, experience a decrease in AUROC scores ranging from 0.01 to 0.08 when using the imputed dataset. The remaining 13 algorithms show higher AUROC scores with the imputed dataset. Among them, LR, NB, NuSVM, and COPOD achieve a significant improvement of at least 0.14, with NuSVM achieving the highest improvement of 0.20. The other 9 algorithms show more modest improvements, limited to 0.09 or lower.
Table 2: List of Machine Learning Algorithms Examined
| Type | Machine learning algorithms |
|------|------------------------------|
| Unsupervised | Copula-Based Outlier Detection (COPOD) [24]; K-Nearest Neighbor (KNN) [25]; Subspace Outlier Detection (SUOD) [26] |
| Semi-supervised | Self Training (ST) [27]; Label Propagation (LP) [28] |
| Supervised | Balanced Bagging (BB) [29]; Adaptive Boosting (AdaBoost) [30]; Balanced Random Forest (B-RF) [23]; Light Gradient Boosting Machine (LightGBM) [31]; Linear Discriminant Analysis (LDA) [32]; Logistic Regression (LR) [33]; Complement Naive Bayes (NB) [34]; Neural Networks (NN) – Multi-Layer Perceptron [35]; ν-Support Vector Machine (NuSVM) [36]; Random Forest (RF) [17]; Support Vector Machine (SVM) [37]; eXtreme Gradient Boosting (XGBoost) [38] |
- NN, RF-B, SVM, and ST execute faster with the imputed dataset than with the filtered dataset, saving more than 15% of execution time. BB, B-RF, and LR take slightly less execution time (2% to 7% shorter) with the imputed dataset. NB and COPOD show similar execution times on both datasets. XGBoost, RF, SUOD, and LR-B take slightly longer (3% to 9%) with the imputed dataset. The execution times of AdaBoost, LightGBM, LDA, NuSVM, LP, and KNN are much longer on the imputed dataset, by 16% to 55%, compared to the filtered one.
3.1.3. Comparison of Results from Test 2 and Test 4
The comparisons of the 15 algorithms' accuracy and execution time on the filtered dataset and the imputed dataset with hyperparameter optimization (Table 6 and Table 8) indicate the following:
- All algorithms require longer execution time, ranging from 0.2 to 1.8 times longer, with the imputed dataset compared to the filtered one. Among them, SVM requires the longest additional time.
- All tested algorithms demonstrate an improvement in Overall Accuracy with the imputed dataset, with the highest improvement being 3.27% (LR). As for AUROC scores, RF and LightGBM experience slight decreases of 0.03 and 0.04, respectively, when using the imputed dataset. The remaining algorithms show an improvement, with the highest being 0.23 (SVM).
3.2. Impact of hyperparameter optimization
To determine the impact of hyperparameter optimization on the cervical cancer dataset, we compare two sets of results: the first set from test 1 and test 2, and the second from test 3 and test 4. This study applies the grid search strategy to identify the optimal hyperparameter settings for each algorithm; a minimal sketch of this strategy is given below.
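The sketch below illustrates grid search with scikit-learn's GridSearchCV on a small hypothetical grid for LR; the grid, scoring choice, and data are assumptions, not the study's exact search space.

```python
# Exhaustive grid search over a small LR hyperparameter grid.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, random_state=0)

param_grid = {
    "C": [0.1, 1.0, 10.0, 100.0],
    "penalty": ["l2"],
    "solver": ["liblinear", "newton-cg"],
}
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      param_grid, scoring="roc_auc", cv=5)
search.fit(X, y)   # evaluates all 4 x 1 x 2 = 8 combinations with 5-fold CV
print(search.best_params_, round(search.best_score_, 3))
```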
3.2.1. Comparison of Results from Test 1 and Test 2
Table 9 details the hyperparameter settings applied to the machine learning algorithms used in test 2, comprising 14 supervised learning algorithms and 1 semi-supervised learning algorithm. Tables 5 and 6 report the results of tests 1 and 2, respectively. Both tests remove the missing values, while test 2 also applies hyperparameter optimization. The following findings on prediction accuracy and speed are observed:
- The application of hyperparameter optimization does not significantly impact the Overall Accuracy of six algorithms: BB, AdaBoost, LightGBM, LDA, NN, and NuSVM. Slight decreases in Overall Accuracy (ranging from 0.45% to 1.36%) are observed for B-RF, LR, RF, and XGBoost when hyperparameter optimization is applied. The remaining five algorithms demonstrate improved Overall Accuracy with hyperparameter optimization: LR-B and LP show slight improvements (0.45% and 0.90%, respectively), while NB, RF-B, and SVM exhibit larger improvements (ranging from 1.81% to 5.43%), with NB showing the highest improvement of 5.43%. Regarding AUROC scores, the performance of BB, B-RF, LDA, LR-B, NN, and NuSVM is the same with and without hyperparameter optimization on this dataset. LP and LR experience a slight decrease in AUROC scores (0.02 to 0.03) when hyperparameter optimization is applied. The most significant decreases in AUROC scores are observed in NB and RF, with reductions of around 0.11.
Table 3: Summary of Results
| Test No. | Description | Results |
|----------|-------------|---------|
| 1 | 3 unsupervised; 2 semi-supervised; 14 supervised; no imputation; no hyperparameter optimization | Table 5 presents the details of the results of Test 1. Overall, supervised learning algorithms perform best in AUROC values, followed by semi-supervised, then unsupervised. One supervised algorithm (BB) has the highest AUROC score of 0.90, followed by 3 supervised algorithms (B-RF, LDA, and LR-B) with a score of 0.89. The highest Overall Accuracy of 95.93% is achieved by two supervised algorithms (BB and LightGBM), followed by one semi-supervised algorithm (ST) with accuracy of 95.48%. All 19 algorithms have execution times of less than 0.77 second. |
| 2 | 0 unsupervised; 1 semi-supervised; 14 supervised; no imputation; hyperparameter optimization | Table 6 presents the details of the results of Test 2. Five supervised algorithms (BB, B-RF, LDA, LR-B, and RF) have the best AUROC values, between 0.89 and 0.90, among which LDA also has one of the shortest processing times of all 15 algorithms, only 0.37 second. 13 of 14 supervised algorithms have Overall Accuracy over 92.76%. Three supervised algorithms (BB, LightGBM, and LR-B) and one semi-supervised algorithm (LP) have the highest Overall Accuracy of 95.93%. LP also has one of the shortest execution times, taking 0.86 second. Only 4 algorithms have execution times of less than 1 second, 5 algorithms take more than 30 seconds, and XGBoost has the longest execution time of 97.89 seconds. |
| 3 | 3 unsupervised; 2 semi-supervised; 14 supervised; imputation; no hyperparameter optimization | Table 7 presents the details of the results of Test 3. The top performers in AUROC are 4 supervised algorithms, i.e., BB, B-RF, LDA, and LR-B, with values of 0.96, 0.95, 0.95, and 0.95, respectively, among which LDA also has one of the shortest processing times (0.01 second). All but two algorithms (KNN and NB) have Overall Accuracy higher than 90%. |
| 4 | 0 unsupervised; 1 semi-supervised; 14 supervised; imputation; hyperparameter optimization | Table 8 presents the details of the results of Test 4. 8 of 14 supervised algorithms (BB, AdaBoost, B-RF, LDA, LR-B, RF-B, SVM, and XGBoost) have AUROC values between 0.95 and 0.96, among which LDA has the shortest execution time in test 4 (0.43 second). All 15 algorithms have Overall Accuracy over 91.5%; 11 of them have processing times of more than 1 second, among which XGBoost has the longest execution time of 105.23 seconds. |
- With the application of hyperparameter optimization, all tested algorithms experience considerably longer execution times than their counterparts without optimization. NB, LDA, LR-B, NN, and NuSVM take less than 50 times longer to execute, with NB taking 9 times longer, the smallest increase. LR takes more than 50 times but less than 68 times longer to execute. The remaining algorithms take much longer, from 161 times (LP) to 1240 times (XGBoost), when hyperparameter optimization is applied.
3.2.2. Comparison of Results from Test 3 and Test 4
Table 7 and Table 8 present the results of test 3 and test 4, respectively. In both tests, imputation is applied to the dataset, with hyperparameter optimization applied in test 4 but not in test 3. Table 10 describes the hyperparameter settings applied to the machine learning algorithms used in test 4, comprising 14 supervised learning algorithms and 1 semi-supervised learning algorithm.
- The application of hyperparameter optimization does not impact the Overall Accuracy of LDA, NN, and RF; their results remain the same regardless of whether it is applied. BB, LR, and NuSVM experience slight decreases in Overall Accuracy, ranging from 0.35% (BB) to 1.06% (LR). The remaining algorithms show improvements in Overall Accuracy at different levels: B-RF, LightGBM, and XGBoost have slight increases (≤0.7%); AdaBoost and LP increase by more than 1% but less than 2%; RF-B, SVM, and NB show larger improvements, in the range of 2.11% to 4.23%, with NB receiving the highest improvement.
The AUROC scores of BB, B-RF, LDA, and LR-B are not affected by whether the optimization is applied. The other algorithms, however, are affected. LR, NB, NuSVM, and RF experience a reduction in scores, ranging from 0.03 to 0.12, when optimization is applied. The other seven algorithms demonstrate an improvement in AUROC scores with the optimization: LP, NN, and LightGBM show slight improvements of 0.01, 0.03, and 0.06, respectively, while AdaBoost, XGBoost, RF-B, and SVM show increases ranging from 0.21 to 0.46, with SVM achieving the highest improvement of 0.46.
Table 4: Imputation with Most Frequent Value vs. Random Forest
| Algorithms | Most Frequent: Overall (%) | Most Frequent: AUROC | Random Forest: Overall (%) | Random Forest: AUROC |
|---|---|---|---|---|
BB | 95.07 | 0.92 | 97.18 | 0.96 |
AdaBoost | 93.66 | 0.68 | 95.77 | 0.74 |
B-RF | 95.77 | 0.93 | 96.13 | 0.95 |
LightGBM | 95.77 | 0.74 | 96.83 | 0.72 |
LDA | 96.48 | 0.93 | 96.83 | 0.95 |
LR | 96.13 | 0.75 | 97.54 | 0.87 |
LR-B | 95.42 | 0.87 | 96.48 | 0.95 |
NB | 89.44 | 0.74 | 87.32 | 0.90 |
NN | 94.72 | 0.66 | 95.42 | 0.74 |
NuSVM | 95.77 | 0.74 | 96.83 | 0.92 |
RF | 96.48 | 0.77 | 96.48 | 0.75 |
RF-B | 95.42 | 0.66 | 94.72 | 0.56 |
SVM | 93.66 | 0.50 | 94.37 | 0.50 |
XGBoost | 95.42 | 0.77 | 96.83 | 0.75 |
ST | 96.82 | 0.79 | 98.94 | 0.91 |
LP | 96.11 | 0.73 | 95.76 | 0.74 |
COPOD | 90.81 | 0.76 | 89.25 | 0.71 |
KNN | 87.99 | 0.63 | 87.10 | 0.65 |
SUOD | 90.11 | 0.73 | 89.25 | 0.71 |
- Similar to the observation in section 3.2.1 about the significant increase in execution time when hyperparameter optimization is applied, all algorithms in test 4 take significantly longer to execute, from 23 to 1450 times longer, with LDA requiring the smallest increase (23 times) and SVM the largest (1450 times) compared to without hyperparameter optimization.
3.3. Impact of hyperparameter optimization and imputation
The impact of applying both hyperparameter optimization and imputation is evaluated based on the findings from test 1 (Table 5) and test 4 (Table 8) in appendix section.
The results indicate that the application of both hyperparameter optimization and imputation has a positive impact on the detection accuracy, including both Overall Accuracy and AUROC scores.
Overall Accuracy improves for all algorithms when both techniques are applied. BB and RF show a slight improvement of around 0.6%. B-RF, LightGBM, LDA, LR-B, NN, NuSVM, and XGBoost demonstrate increases ranging from 1% to 1.91%. AdaBoost, LR, NB, LP, RF-B, and SVM show accuracy improvements of over 2%, with SVM and NB achieving significant increases of 4.88% and 7.39%, respectively.
AUROC scores improve for most algorithms except LightGBM, LP, and RF, which experience a slight decrease ranging from 0.01 to 0.14. BB, B-RF, LDA, LR-B, NB, and NN demonstrate a modest increase, with values ranging from 0.05 to 0.08. The remaining algorithms, including AdaBoost, RF-B, SVM, and XGBoost, receive increases ranging from 0.14 to a maximum of 0.46 (SVM).
The application of both approaches significantly increases the execution time of all algorithms. Consistent with the findings in sections 3.2.1 and 3.2.2, the execution time of each algorithm increases by 10 to 1342 times compared to when neither approach is applied, with NB showing the smallest increase (10 times) and XGBoost the largest (1342 times).
3.4. Discussion
In this study, we are interested in comparing the performance of machine learning applications in cancer diagnosis, especially approaches that are flexible enough to detect different kinds of cancers without compromising accuracy. To achieve this goal, we have examined the impact of imputation and hyperparameter optimization on the performance of algorithms using a cervical cancer dataset. Three criteria are used to evaluate performance: execution speed, overall diagnosis accuracy, and AUROC score. Since the cost of a false positive is high in cancer diagnosis, and a good AUROC score indicates a low false positive rate, AUROC is the more important criterion in this study. Based on whether imputation and hyperparameter optimization are included, we have designed four tests.
Missing values are common in medical data, and statistical and machine learning models are two popular methods for handling them. By comparing the performance of Random Forest, a common machine learning method, with the most-frequent-value method in statistics, we find that the machine learning approach outperforms the statistical approach as an imputation method in this study. This finding echoes the discussion in section 2.1 that machine learning is well suited to processing datasets with missing values, especially when the dataset is small.
The results of this study show that the four top performers are all supervised learning algorithms: BB, B-RF, LDA, and LR-B, among which LDA is especially noticeable for consistently delivering the most accurate diagnoses within the shortest time across all four tests. These four supervised learning algorithms merit more attention in future studies.
Examination of findings from the four tests shows the following:
- When both imputation and hyperparameter optimization are absent (test 1), supervised and semi-supervised learning algorithms perform well in Overall Accuracy: 15 out of 16 score over 91%. However, in terms of AUROC, the top performers (BB, B-RF, LDA, and LR-B) only reach values of 0.89 to 0.90, with none exceeding 0.90. SVM, a popular supervised learning algorithm, performs the worst: its AUROC value is only 0.5. The unsupervised learning algorithms do not perform well; all three have AUROC values below 0.7. Although the execution times in test 1 are fast (none is over 0.77 second), the overall low AUROC values are not satisfactory.
- To examine the impact of imputation, we have conducted two sets of comparisons. One set compares the results of test 1 and test 3. It shows that when only imputation is applied, the change in execution time is marginal, and yet the improvement in Overall Accuracy and AUROC is impressive compared to when both methods are absent: all algorithms increase their Overall Accuracy, and 68% of the algorithms improve their AUROC scores. The other set compares the results of test 2 and test 4. When both imputation and hyperparameter optimization are applied, compared to when only hyperparameter optimization is applied, all 15 algorithms increase their execution times. The good news is that the Overall Accuracy of all algorithms improves, and most AUROC values increase. These two sets of comparisons suggest that employing imputation can improve prediction accuracy without substantially extending execution time.
- To evaluate the impact of hyperparameter optimization, we have also carried out two sets of comparisons. One set compares test 1 and test 2. When only hyperparameter optimization is applied (test 2), the execution times of all algorithms increase significantly compared to when both methods are absent (test 1): 11 out of 15 algorithms have execution times of more than 1 second, whereas no algorithm runs over 1 second in test 1, and XGBoost in test 2 runs for as long as 97 seconds. Seven algorithms increase their Overall Accuracy, and 8 out of 15 improve their AUROC scores. The other set compares test 3 and test 4. The results show that when both methods are present (test 4), the execution time of the majority of the algorithms is significantly longer than when only imputation is present (test 3): only 4 algorithms have execution times of less than 1 second in test 4, whereas all but 2 algorithms run in less than 1 second in test 3. Nine algorithms improve their Overall Accuracy and AUROC scores, among which the AUROC improvements are especially noticeable for three supervised learning algorithms, RF-B, SVM, and XGBoost, with improvements of 28%, 92%, and 71%, respectively. These two sets of comparisons show that including hyperparameter optimization can lengthen execution time while improving prediction accuracy, especially for algorithms such as RF-B, SVM, and XGBoost.
- Comparison of all four tests shows that, overall, the inclusion of both imputation and hyperparameter optimization delivers the best AUROC values, with 53% of the algorithms scoring 0.95 or 0.96, whereas none from test 1, 20% from test 2, and 21% from test 3 reach such values.

To summarize, our study shows how applying both imputation and hyperparameter optimization in machine learning can improve detection accuracy. Also, by taking all features of the data into consideration, although the diagnosis accuracy of the classifier is not as high as the 99%-100% observed in systems that apply various feature selection techniques [12], the performance is satisfactory when both imputation and hyperparameter optimization are included. In addition, by forgoing feature selection, the classifier can avoid the overfitting problems that are common in many studies and can potentially be applied to diagnosing various types of cancers.

This study does come with limitations. Firstly, although we have examined as many as 19 different machine learning algorithms, most are supervised, with only two semi-supervised and three unsupervised. To gain a comprehensive understanding of the performance differences among different types of algorithms, it would be better to consider more semi-supervised and unsupervised algorithms in future studies. Secondly, a single cervical cancer dataset is used to evaluate the diagnosis performance of the algorithms. To improve the generalizability of the findings, we will use cervical cancer datasets from different sources, subject to data availability. Also, to investigate how algorithms differ from each other, it would be advisable to include datasets with various types of cancers for evaluation. Thirdly, the dataset used in this study is text-based. Since images are extensively used in cancer diagnosis, it would be preferable to include image-based datasets in further studies.
4. Conclusion
With the increasing popularity of machine learning applications in cancer diagnosis, there is a need to evaluate the performance of these algorithms and to identify approaches that can improve it. This study contributes to the literature by examining the cancer detection performance of 19 supervised, semi-supervised, and unsupervised machine learning algorithms. To investigate ways of expanding the types of cancers that the algorithms can accurately detect, this study examines how the inclusion and exclusion of imputation and hyperparameter optimization impact performance on a cervical cancer dataset. The results suggest that applying both hyperparameter optimization and imputation improves detection performance more than employing either of them independently or neither of them. This study provides insights for creating versatile classifiers that can deliver solid results.
5. Availability of data and materials
The cervical cancer dataset analyzed in the current study is openly available in the UCI Machine Learning Repository at https://archive.ics.uci.edu/ml/datasets/Cervical+cancer+%28Risk+Factors%29.
Conflict of Interest: The authors declare no conflict of interest.
Appendix
Table 5: Results – Test 1
| Algorithm | Time (sec) | False Neg. | False Pos. | Correct /221 | Overall (%) | AUROC |
|---|---|---|---|---|---|---|
BB | 0.03 | 3 | 6 | 212 | 95.93% | 0.90 |
AdaBoost | 0.06 | 8 | 3 | 210 | 95.02% | 0.76 |
B-RF | 0.22 | 3 | 7 | 211 | 95.48% | 0.89 |
LightGBM | 0.06 | 7 | 2 | 212 | 95.93% | 0.79 |
LDA | 0.02 | 3 | 7 | 211 | 95.48% | 0.89 |
LR | 0.06 | 10 | 3 | 208 | 94.12% | 0.70 |
LR-B | 0.10 | 3 | 7 | 211 | 95.48% | 0.89 |
NB | 0.02 | 7 | 28 | 186 | 84.16% | 0.73 |
NN | 0.93 | 9 | 3 | 209 | 94.57% | 0.73 |
NuSVM | 0.02 | 9 | 3 | 209 | 94.57% | 0.73 |
RF | 0.14 | 11 | 2 | 208 | 94.12% | 0.67 |
RF-B | 0.16 | 11 | 1 | 209 | 94.57% | 0.67 |
SVM | 0.02 | 17 | 0 | 204 | 92.31% | 0.50 |
XGBoost | 0.08 | 6 | 4 | 211 | 95.48% | 0.81 |
LP | 0.01 | 8 | 3 | 210 | 95.02% | 0.76 |
ST | 0.01 | 7 | 3 | 211 | 95.48% | 0.79 |
COPOD | 0.01 | 16 | 11 | 194 | 87.78% | 0.64 |
KNN | 0.01 | 20 | 15 | 186 | 84.16% | 0.51 |
SUOD | 1.03 | 15 | 10 | 196 | 88.69% | 0.67 |
Note: imputation (no), hyperparameter optimization (no).
Table 6: Results – Test 2
| Algorithm | Time (sec) | False Neg. | False Pos. | Correct /221 | Overall (%) | AUROC |
|---|---|---|---|---|---|---|
BB | 6.94 | 3 | 6 | 212 | 95.93% | 0.90 |
AdaBoost | 22.09 | 5 | 6 | 210 | 95.02% | 0.84 |
B-RF | 59.91 | 3 | 8 | 210 | 95.02% | 0.89 |
LightGBM | 21.50 | 6 | 3 | 212 | 95.93% | 0.82 |
LDA | 0.37 | 3 | 7 | 211 | 95.48% | 0.89 |
LR | 4.40 | 11 | 4 | 206 | 93.21% | 0.67 |
LR-B | 4.09 | 3 | 6 | 212 | 95.93% | 0.90 |
NB | 0.21 | 12 | 11 | 198 | 89.59% | 0.62 |
NN | 46.09 | 9 | 3 | 209 | 94.57% | 0.73 |
NuSVM | 0.79 | 9 | 3 | 209 | 94.57% | 0.73 |
RF | 34.65 | 15 | 1 | 205 | 92.76% | 0.56 |
RF-B | 34.14 | 3 | 5 | 213 | 96.38% | 0.90 |
SVM | 5.22 | 9 | 4 | 208 | 94.12% | 0.73 |
XGBoost | 96.89 | 5 | 6 | 210 | 95.02% | 0.84 |
LP | 0.86 | 9 | 0 | 212 | 95.93% | 0.74 |
Note: imputation (no), hyperparameter optimization (yes).
Table 7: Results – Test 3
| Algorithm | Time (sec) | False Neg. | False Pos. | Correct /284 | Overall (%) | AUROC |
|---|---|---|---|---|---|---|
BB | 0.03 | 1 | 8 | 275 | 96.83% | 0.95 |
AdaBoost | 0.08 | 9 | 4 | 271 | 95.42% | 0.71 |
B-RF | 0.20 | 1 | 9 | 274 | 96.48% | 0.95 |
LightGBM | 0.08 | 9 | 1 | 274 | 96.48% | 0.72 |
LDA | 0.02 | 1 | 8 | 275 | 96.83% | 0.95 |
LR | 0.06 | 4 | 3 | 277 | 97.54% | 0.87 |
LR-B | 0.11 | 1 | 9 | 274 | 96.48% | 0.95 |
NB | 0.02 | 1 | 35 | 248 | 87.32% | 0.90 |
NN | 0.80 | 7 | 5 | 272 | 95.77% | 0.77 |
NuSVM | 0.02 | 2 | 6 | 276 | 97.18% | 0.93 |
RF | 0.15 | 13 | 2 | 269 | 94.72% | 0.59 |
RF-B | 0.11 | 12 | 1 | 271 | 95.42% | 0.62 |
SVM | 0.01 | 16 | 0 | 268 | 94.37% | 0.50 |
XGBoost | 0.08 | 8 | 1 | 275 | 96.83% | 0.75 |
LP | 0.01 | 8 | 4 | 272 | 95.77% | 0.74 |
ST | 0.01 | 5 | 1 | 278 | 97.89% | 0.84 |
COPOD | 0.01 | 19 | 6 | 259 | 91.20% | 0.78 |
KNN | 0.02 | 26 | 13 | 245 | 86.27% | 0.55 |
SUOD | 1.12 | 20 | 7 | 257 | 90.49% | 0.74 |
Note: imputation (yes), hyperparameter optimization (no).
Table 8: Results – Test 4
| Algorithm | Time (sec) | False Neg. | False Pos. | Correct /284 | Overall (%) | AUROC |
|---|---|---|---|---|---|---|
BB | 7.08 | 1 | 9 | 274 | 96.48% | 0.95 |
AdaBoost | 26.88 | 1 | 7 | 276 | 97.18% | 0.96 |
B-RF | 63.50 | 1 | 7 | 276 | 97.18% | 0.96 |
LightGBM | 26.69 | 7 | 1 | 276 | 97.18% | 0.78 |
LDA | 0.43 | 1 | 8 | 275 | 96.83% | 0.95 |
LR | 4.80 | 7 | 3 | 274 | 96.48% | 0.78 |
LR-B | 5.24 | 1 | 7 | 276 | 97.18% | 0.96 |
NB | 0.22 | 6 | 18 | 260 | 91.55% | 0.78 |
NN | 68.49 | 6 | 6 | 272 | 95.77% | 0.80 |
NuSVM | 0.86 | 3 | 7 | 274 | 96.48% | 0.89 |
RF | 36.07 | 15 | 0 | 269 | 94.72% | 0.53 |
RF-B | 37.11 | 1 | 6 | 277 | 97.54% | 0.96 |
SVM | 14.64 | 1 | 7 | 276 | 97.18% | 0.96 |
XGBoost | 104.79 | 1 | 7 | 276 | 97.18% | 0.96 |
LP | 0.93 | 8 | 0 | 276 | 97.18% | 0.75 |
Note: imputation (yes), hyperparameter optimization (yes).
Table 9: Hyperparameter Settings for Test 2

| Algorithm | Hyperparameter setting |
|---|---|
| BB | 'n_estimators': 100 |
| AdaBoost | 'learning_rate': 0.01, 'n_estimators': 300 |
| B-RF | 'criterion': 'gini', 'max_depth': 1, 'n_estimators': 500 |
| LightGBM | 'learning_rate': 0.1, 'max_depth': 10, 'n_estimators': 100, 'scale_pos_weight': 6 |
| LDA | 'solver': 'svd', 'tol': 0.0001 |
| LR | 'C': 100, 'penalty': 'l2', 'solver': 'newton-cg' |
| LR-B | 'C': 1.0, 'penalty': 'l2', 'solver': 'liblinear' |
| NB | 'alpha': 9 |
| NN | 'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (10, 30, 10), 'learning_rate': 'adaptive', 'solver': 'adam' |
| NuSVM | 'gamma': 0.001, 'nu': 0.1 |
| RF | 'criterion': 'gini', 'max_depth': 5, 'n_estimators': 500 |
| RF-B | 'criterion': 'entropy', 'max_depth': 3, 'n_estimators': 200 |
| SVM | 'C': 1.0, 'gamma': 0.001, 'kernel': 'linear' |
| XGBoost | 'colsample_bytree': 1.0, 'gamma': 2, 'max_depth': 3, 'min_child_weight': 1, 'subsample': 1.0 |
| LP | 'gamma': 0.1, 'kernel': 'knn', 'n_neighbors': 3 |
Table 10: Hyperparameter Settings for Test 4

| Algorithm | Hyperparameter setting |
|---|---|
| BB | 'n_estimators': 100 |
| AdaBoost | 'learning_rate': 0.001, 'n_estimators': 100 |
| B-RF | 'criterion': 'gini', 'max_depth': 1, 'n_estimators': 200 |
| LightGBM | 'learning_rate': 0.1, 'max_depth': 10, 'n_estimators': 100, 'scale_pos_weight': 6 |
| LDA | 'solver': 'svd', 'tol': 0.1 |
| LR | 'C': 1.0, 'penalty': 'l2', 'solver': 'liblinear' |
| LR-B | 'C': 0.1, 'penalty': 'l2', 'solver': 'newton-cg' |
| NB | 'alpha': 9 |
| NN | 'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (20,), 'learning_rate': 'constant', 'solver': 'adam' |
| NuSVM | 'gamma': 0.01, 'nu': 0.1 |
| RF | 'criterion': 'entropy', 'max_depth': 3, 'n_estimators': 200 |
| RF-B | 'criterion': 'gini', 'max_depth': 3, 'n_estimators': 500 |
| SVM | 'C': 1.0, 'gamma': 0.001, 'kernel': 'linear' |
| XGBoost | 'colsample_bytree': 0.8, 'gamma': 1.5, 'max_depth': 3, 'min_child_weight': 10, 'subsample': 1.0 |
| LP | 'gamma': 0.1, 'kernel': 'knn', 'n_neighbors': 3 |










References

[1] World Health Organization, "Cancer: Key Facts," https://www.who.int/news-room/fact-sheets/detail/cancer, 2022.
[2] World Health Organization, "Global Strategy on Human Resources for Health: Workforce 2030: Reporting at Seventy-fifth World Health Assembly," https://www.who.int/news/item/02-06-2022-global-strategy-on-human-resources-for-health--workforce-2030, 2022.
[3] J. A. Cruz and D. S. Wishart, "Applications of Machine Learning in Cancer Prediction and Prognosis," Cancer Informatics, vol. 2, pp. 59–77, 2006. DOI: 10.1177/117693510600200030
[4] K. Wan, C. H. Wong, H. F. Ip, D. Fan, P. L. Yuen, H. Y. Fong, and M. Ying, "Evaluation of the Performance of Traditional Machine Learning Algorithms, Convolutional Neural Network and AutoML Vision in Ultrasound Breast Lesions Classification: a Comparative Study," Quantitative Imaging in Medicine and Surgery, vol. 11, no. 4, pp. 1381–1393, 2021. DOI: 10.21037/qims-20-922
[5] S. Hussein, P. Kandel, C. W. Bolan, M. B. Wallace, and U. Bagci, "Lung and Pancreatic Tumor Characterization in the Deep Learning Era: Novel Supervised and Unsupervised Learning Approaches," IEEE Transactions on Medical Imaging, vol. 38, pp. 1777–1787, 2019.
[6] K. Fernandes, J. S. Cardoso, and J. C. Fernandes, "Transfer Learning with Partial Observability Applied to Cervical Cancer Screening," in Iberian Conference on Pattern Recognition and Image Analysis, 2017. DOI: 10.1007/978-3-319-58838-4_27
[7] Intel & MobileODT, "Intel & MobileODT Cervical Cancer Screening," https://kaggle.com/competitions/intel-mobileodt-cervical-cancer-screening, 2017.
[8] M. M. Ali, K. Ahmed, F. M. Bui, B. K. Paul, S. M. Ibrahim, J. M. W. Quinn, and M. A. Moni, "Machine Learning-based Statistical Analysis for Early Stage Detection of Cervical Cancer," Computers in Biology and Medicine, vol. 139, no. 104985, 2021. DOI: 10.1016/j.compbiomed.2021.104985
[9] W. William, J. A. Ware, A. H. Basaza-Ejiri, and J. Obungoloch, "A Review of Image Analysis and Machine Learning Techniques for Automated Cervical Cancer Screening from Pap-smear Images," Computer Methods and Programs in Biomedicine, vol. 164, pp. 15–22, 2018. DOI: 10.1016/j.cmpb.2018.05.034
[10] J. Lu, E. Song, A. Ghoneim, and M. Alrashoud, "Machine Learning for Assisting Cervical Cancer Diagnosis: An Ensemble Approach," Future Generation Computer Systems, vol. 106, pp. 199–205, 2020. DOI: 10.1016/j.future.2019.12.033
[11] C. Luo, B. Liu, and J. Xia, "Comparison of Several Machine Learning Algorithms in the Diagnosis of Cervical Cancer," in International Conference on Frontiers of Electronics, Information and Computation Technologies, 2021. DOI: 10.1145/3474198.3478165
[12] B. Nithya and V. Ilango, "Evaluation of Machine Learning Based Optimized Feature Selection Approaches and Classification Methods for Cervical Cancer Prediction," SN Applied Sciences, vol. 1, pp. 1–16, 2019. DOI: 10.1007/s42452-019-0645-7
[13] Y. R. Park, Y. J. Kim, W. Ju, K. Nam, S. Kim, and K. G. Kim, "Comparison of Machine and Deep Learning for the Classification of Cervical Cancer Based on Cervicography Images," Scientific Reports, vol. 11, 2021. DOI: 10.1038/s41598-021-95748-3
[14] D. B. Rubin, "Multiple Imputation After 18+ Years," Journal of the American Statistical Association, vol. 91, pp. 473–489, 1996. DOI: 10.1080/01621459.1996.10476908
[15] M. Feurer and F. Hutter, "Hyperparameter Optimization," in F. Hutter, L. Kotthoff, J. Vanschoren (Eds.), Automatic Machine Learning: Methods, Systems, Challenges, Springer, pp. 3–38, 2019. DOI: 10.1007/978-3-030-05318-5_1
[16] J. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl, "Algorithms for Hyper-Parameter Optimization," Advances in Neural Information Processing Systems, vol. 24, 2011.
[17] L. Breiman, "Random Forests," Machine Learning, vol. 45, pp. 5–32, 2001. DOI: 10.1023/A:1010933404324
[18] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, "Classification and Regression Trees," Brooks/Cole Publishing, Monterey, 1984. DOI: 10.1201/9781315139470
[19] Mayo Clinic, "Biopsy: Types of Biopsy Procedures Used to Diagnose Cancer," https://www.mayoclinic.org/diseases-conditions/cancer/in-depth/biopsy/art-20043922, 2021.
[20] O. Kramer, "Scikit-Learn," in Machine Learning for Evolution Strategies, pp. 45–53, Springer, 2016. DOI: 10.1007/978-3-319-33383-0_5
[21] J. Huo, Y. Xu, T. Sheu, R. Volk, and Y. Shih, "Complication Rates and Downstream Medical Costs Associated with Invasive Diagnostic Procedures for Lung Abnormalities in the Community Setting," JAMA Internal Medicine, vol. 179, no. 3, pp. 324–332, 2019. DOI: 10.1001/jamainternmed.2018.6277
[22] J. A. Hanley and B. J. McNeil, "The Meaning and Use of the Area Under a Receiver Operating Characteristic (ROC) Curve," Radiology, vol. 143, no. 1, pp. 29–36, 1982. DOI: 10.1148/radiology.143.1.7063747
[23] Imbalanced-Learn, "Balanced Random Forest Classifier," https://imbalanced-learn.org/stable/references/generated/imblearn.ensemble.BalancedRandomForestClassifier.html, 2022.
[24] Z. Li, Y. Zhao, N. Botta, C. Ionescu, and X. Hu, "COPOD: Copula-Based Outlier Detection," in Proceedings of the IEEE International Conference on Data Mining (ICDM), pp. 1118–1123, 2020. DOI: 10.1109/ICDM50108.2020.00135
[25] N. S. Altman, "An Introduction to Kernel and Nearest-Neighbor Nonparametric Regression," The American Statistician, vol. 46, no. 3, pp. 175–185, 1992. DOI: 10.1080/00031305.1992.10475879
[26] Y. Zhao, X. Hu, C. Cheng, C. Wang, C. Wan, W. Wang, J. Yang, H. Bai, Z. Li, C. Xiao, and Y. Wang, "SUOD: Accelerating Large-Scale Unsupervised Heterogeneous Outlier Detection," in Proceedings of Machine Learning and Systems, vol. 3, pp. 463–478, 2021.
[27] D. Yarowsky, "Unsupervised Word Sense Disambiguation Rivaling Supervised Methods," in 33rd Annual Meeting of the Association for Computational Linguistics, pp. 189–196, 1995. DOI: 10.3115/981658.981684
[28] X. Zhu and Z. Ghahramani, "Learning from Labeled and Unlabeled Data with Label Propagation," Technical Report CMU-CALD-02-107, Carnegie Mellon University, 2002.
[29] A. Lazarevic and V. Kumar, "Feature Bagging for Outlier Detection," in Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, 2005. DOI: 10.1145/1081870.1081891
[30] Y. Freund and R. E. Schapire, "A Decision-Theoretic Generalization of On-line Learning and an Application to Boosting," in P. Vitányi (Ed.), Computational Learning Theory, pp. 23–37, Springer, Berlin, Heidelberg, 1995. DOI: 10.1006/jcss.1997.1504
[31] G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T.-Y. Liu, "LightGBM: A Highly Efficient Gradient Boosting Decision Tree," Advances in Neural Information Processing Systems, vol. 30, 2017.
[32] A. J. Izenman, "Linear Discriminant Analysis," Springer New York, New York, NY, pp. 237–280, 2008. DOI: 10.1007/978-0-387-78189-1_8
[33] W. Chen, Y. Chen, Y. Mao, and B.-L. Guo, "Density-based Logistic Regression," in Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 140–148, 2013. DOI: 10.1145/2487575.2487583
[34] J. D. Rennie, L. Shih, J. Teevan, and D. R. Karger, "Tackling the Poor Assumptions of Naive Bayes Text Classifiers," in Proceedings of the 20th International Conference on Machine Learning (ICML-03), pp. 616–623, 2003.
[35] M. Popescu, V. E. Balas, L. Perescu-Popescu, and N. E. Mastorakis, "Multilayer Perceptron and Neural Networks," WSEAS Transactions on Circuits and Systems, vol. 8, no. 7, pp. 579–588, 2009.
[36] P.-H. Chen, C.-J. Lin, and B. Schölkopf, "A Tutorial on ν-Support Vector Machines," Applied Stochastic Models in Business and Industry, vol. 21, no. 2, pp. 111–136, 2005. DOI: 10.1002/asmb.537
[37] C. Cortes and V. N. Vapnik, "Support-Vector Networks," Machine Learning, vol. 20, pp. 273–297, 1995. DOI: 10.1007/BF00994018
[38] T. Chen and C. Guestrin, "XGBoost: A Scalable Tree Boosting System," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016. DOI: 10.1145/2939672.2939785