PDF Malware Detection: Toward Machine Learning Modelling with Explainability Analysis
Keywords:
PDF malware detection, ML, RF, SVM, DNN, explainability, cybersecurity, malicious PDF, classification algorithms, Kaggle dataset
Abstract
In the digital age, PDF files are widely used for document sharing, but their popularity also makes them a target for malware attacks. This project, titled "Detecting Malware in PDFs: Advancing Machine Learning Models with Interpretability Assessment," aims to design and assess machine learning models for identifying malware within PDF files. Using a Kaggle dataset of labelled malicious and benign PDFs, a range of algorithms will be applied: Random Forest (RF), C5.0, J48, Support Vector Machine (SVM), AdaBoost, Deep Neural Network (DNN), Gradient Boosting Machine (GBM), and k-Nearest Neighbours (KNN). The primary focus is on achieving high detection accuracy while also providing explainability, so that it is clear how the models arrive at their decisions. By leveraging machine learning techniques, the project seeks to strengthen cybersecurity measures by identifying and mitigating threats embedded in PDF documents.
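The pairing of a classifier with an explainability step can be illustrated with a minimal sketch. It is not the project's pipeline: the data are synthetic rather than the Kaggle dataset, the feature names `js_count` and `noise` are hypothetical, and a small hand-written KNN stands in for the listed models so that the example needs only the Python standard library. Explainability is approximated here by permutation importance, which measures how much test accuracy falls when one feature's values are shuffled.

```python
import random

def make_samples(n, rng):
    """Build n synthetic (features, label) pairs; label 1 = malicious, 0 = benign."""
    X, y = [], []
    for i in range(n):
        label = i % 2
        # Assumption for illustration: malicious files carry more JavaScript objects.
        js_count = rng.gauss(6.0 if label else 0.0, 1.0)
        noise = rng.gauss(0.0, 1.0)  # unrelated to the label
        X.append([js_count, noise])
        y.append(label)
    return X, y

def knn_predict(train_X, train_y, x, k=5):
    """Majority vote among the k nearest training points (squared Euclidean distance)."""
    nearest = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)), label)
        for row, label in zip(train_X, train_y)
    )[:k]
    return 1 if 2 * sum(label for _, label in nearest) > k else 0

def accuracy(train_X, train_y, test_X, test_y):
    """Fraction of test samples whose predicted label matches the true label."""
    hits = sum(knn_predict(train_X, train_y, x) == t for x, t in zip(test_X, test_y))
    return hits / len(test_y)

def permutation_importance(train_X, train_y, test_X, test_y, rng):
    """Return baseline accuracy and, per feature, the accuracy drop after shuffling it."""
    base = accuracy(train_X, train_y, test_X, test_y)
    drops = []
    for j in range(len(test_X[0])):
        column = [row[j] for row in test_X]
        rng.shuffle(column)
        shuffled = [row[:j] + [v] + row[j + 1:] for row, v in zip(test_X, column)]
        drops.append(base - accuracy(train_X, train_y, shuffled, test_y))
    return base, drops

rng = random.Random(0)
train_X, train_y = make_samples(200, rng)
test_X, test_y = make_samples(100, rng)
base_acc, drops = permutation_importance(train_X, train_y, test_X, test_y, rng)
print(f"baseline accuracy: {base_acc:.2f}")
for name, drop in zip(["js_count", "noise"], drops):
    print(f"{name}: accuracy drop {drop:.2f}")
```

A feature whose shuffling causes a large accuracy drop is one the model relies on, while a drop near zero suggests the feature contributes little. The same procedure can be applied to any of the classifiers listed above, since it only needs predictions on held-out data.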
License
Copyright (c) 2025 International Journal of Scientific Research in Science and Technology

This work is licensed under a Creative Commons Attribution 4.0 International License.