PDF Malware Detection: Toward Machine Learning Modelling with Explainability Analysis

Authors

  • K Naresh, Assistant Professor, Department of MCA, Annamacharya Institute of Technology and Sciences (AITS), Tirupati, Andhra Pradesh, India
  • Thukivakam Dharani, PG Student, Department of MCA, Annamacharya Institute of Technology and Sciences (AITS), Tirupati, Andhra Pradesh, India

Keywords:

PDF malware detection, ML, RF, SVM, DNN, explainability, cybersecurity, malicious PDF, classification algorithms, Kaggle dataset

Abstract

In the digital age, PDF files are widely used for document sharing, but their popularity also makes them a target for malware attacks. This project, titled "Detecting Malware in PDFs: Advancing Machine Learning Models with Interpretability Assessment," aims to design and assess machine learning models that identify malware within PDF files. Using a Kaggle dataset of labelled malicious and benign PDFs, various algorithms will be applied, including random forest (RF), C5.0, J48, support vector machine (SVM), AdaBoost, deep neural network (DNN), gradient boosting machine (GBM), and k-nearest neighbours (KNN). The primary focus is on achieving high detection accuracy while also providing explainability, so that it is clear how the models reach their decisions. By leveraging machine learning techniques, this project seeks to strengthen cybersecurity measures by identifying and mitigating threats embedded in PDF documents.
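The pairing of a classifier with an explainability step can be illustrated with a minimal sketch. It does not reproduce the project's pipeline: the data are synthetic rather than the Kaggle dataset, the feature names (`js_objects`, `page_count`, `embedded_files`) are hypothetical, and a single-threshold decision stump stands in for the algorithms listed above. Permutation importance is used as one possible explainability measure; whether the project uses it is an assumption.

```python
import random

# Hypothetical structural features of a PDF (illustrative names only).
FEATURES = ["js_objects", "page_count", "embedded_files"]


def make_toy_dataset(n=200, seed=0):
    """Build synthetic rows; label 1 (malicious) depends only on js_objects."""
    rng = random.Random(seed)
    rows, labels = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        js = rng.randint(3, 6) if label else rng.randint(0, 2)
        rows.append([js, rng.randint(1, 50), rng.randint(0, 3)])
        labels.append(label)
    return rows, labels


def accuracy(model, rows, labels):
    """Fraction of rows where 'feature > threshold' matches the label."""
    feature, threshold = model
    hits = sum((r[feature] > threshold) == bool(y) for r, y in zip(rows, labels))
    return hits / len(rows)


def fit_stump(rows, labels):
    """Pick the (feature, threshold) pair with the best training accuracy."""
    best_acc, best_model = 0.0, (0, 0)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            acc = accuracy((f, t), rows, labels)
            if acc > best_acc:
                best_acc, best_model = acc, (f, t)
    return best_model


def permutation_importance(model, rows, labels, seed=1):
    """Accuracy drop when one feature column is shuffled across rows."""
    rng = random.Random(seed)
    base = accuracy(model, rows, labels)
    drops = {}
    for f, name in enumerate(FEATURES):
        column = [r[f] for r in rows]
        rng.shuffle(column)
        shuffled = [r[:f] + [v] + r[f + 1:] for r, v in zip(rows, column)]
        drops[name] = base - accuracy(model, shuffled, labels)
    return drops


rows, labels = make_toy_dataset()
model = fit_stump(rows, labels)
importances = permutation_importance(model, rows, labels)
print(importances)
```

In this toy setting only `js_objects` carries signal, so shuffling it lowers accuracy while shuffling the other two columns changes nothing. On real data the same procedure would rank features by how much the trained model relies on them.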

Published

10-05-2025

Section

Research Articles