AI-Driven Data Lake Optimization: Integrating Quality Monitoring with Intelligent Physical Design Decisions
Journal of Engineering Research and Sciences, Volume 5, Issue 3, Page # 1-13, 2026; DOI: 10.55708/js0503001
Keywords: Data Lake Optimization, Machine Learning, Reinforcement Learning, Data Quality Monitoring, Physical Database Design, Drift Detection
(This article belongs to the Section Artificial Intelligence – Computer Science (AIC))
Deva, S. and Chintacunta, S. N. R. (2026). AI-Driven Data Lake Optimization: Integrating Quality Monitoring with Intelligent Physical Design Decisions. Journal of Engineering Research and Sciences, 5(3), 1–13. https://doi.org/10.55708/js0503001
Cloud data lakes require continuous optimization across multiple dimensions: physical design (partitioning, compression), query execution, and data quality assurance. This paper presents AIDALOS (AI-Driven Autonomous Data Lake Optimization System), a framework that integrates quality monitoring with physical optimization decisions. The system uses reinforcement learning to adapt monitoring intensity and to trigger physical design changes in response to detected anomalies, drift patterns, and workload shifts. Deep Q-networks learn when to repartition tables, ensemble models select compression codecs based on data characteristics and access patterns, and neural cost estimators improve query plan selection. Our evaluation across five machine learning pipelines shows that this integrated approach achieves a 47% storage cost reduction and a 62% query performance improvement compared to static configurations, with an F1-score of 89.9% for quality issue detection. The key insight is that quality signals (drift detection, anomaly patterns, and workload changes) should directly inform physical optimization decisions rather than being treated as separate concerns.
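To make the coupling between quality signals and physical design concrete, the following is a minimal sketch and not the AIDALOS implementation. It swaps the learned reinforcement-learning policy for a fixed rule: a one-sided CUSUM detector watches a quality metric (here, a column's null rate), and an alarm both raises the monitoring level and flags the table for a partitioning review. The names `CusumDetector` and `plan_actions`, the action labels, and all numeric settings are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class CusumDetector:
    """One-sided CUSUM on a stream of a quality metric (illustrative settings)."""
    target: float      # expected metric value when there is no drift
    slack: float       # allowance: deviations up to this size are ignored
    threshold: float   # alarm when the cumulative sum exceeds this value
    cum_sum: float = 0.0

    def update(self, value: float) -> bool:
        """Feed one observation; return True if upward drift is signalled."""
        self.cum_sum = max(0.0, self.cum_sum + (value - self.target - self.slack))
        if self.cum_sum > self.threshold:
            self.cum_sum = 0.0  # restart accumulation after an alarm
            return True
        return False


def plan_actions(null_rates, detector):
    """Map each observation to a hypothetical (monitoring level, physical action) pair.

    A fixed rule stands in for a learned policy: a drift alarm escalates
    monitoring and requests a partitioning review.
    """
    actions = []
    for rate in null_rates:
        if detector.update(rate):
            actions.append(("intensive", "review_partitioning"))
        else:
            actions.append(("baseline", "none"))
    return actions


# Hypothetical null-rate stream: stable at first, then a sustained upward shift.
stream = [0.01, 0.02, 0.01, 0.08, 0.09, 0.10, 0.09]
detector = CusumDetector(target=0.01, slack=0.01, threshold=0.15)
result = plan_actions(stream, detector)
```

In a learned variant, the detector state and workload statistics would form the agent's observation and the fixed rule in `plan_actions` would be replaced by the policy's chosen action. The sketch only shows where a quality signal enters the physical design decision.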
